Re: QSAR - Sparse or Uneven Biological Data - What to do?

From: Curt M. Breneman
Date: Fri, 27 Feb 2004 13:25:20 -0500

Dear Jim and other QSAR folks,

This is not an uncommon tale, and I think it can cause computational
chemists to move away from target-driven ligand-centric model building
and towards the development of more generally applicable filters and
models (ADMET, etc.). On the other hand, there are established
techniques for handling biased datasets (such as the situation you
mention) that involve boosting (similar to your weighting idea) within
machine learning-based approaches. While I understand that it doesn't
help much in establishing corporate funding priorities, I believe that
retrospective QSAR modeling (and the resulting resume-thickening set of
J. Med. Chem. or JCICS papers) will ultimately lead to the development
of better overall ligand-centric modeling techniques. Clearly, more
shared data (in whatever anonymized form necessary) would serve to
accelerate this process.
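
To make the weighting idea concrete, here is a minimal sketch in plain Python. It is not boosting itself, only the simpler class-balanced reweighting that boosting resembles, and it assumes the 95 "dead" / 5 "interesting" split from the example below. The function names, the class labels, and the single-descriptor linear model are hypothetical choices for illustration, not an established QSAR protocol.

```python
from collections import Counter


def balanced_weights(labels):
    """Weight each sample by n_samples / (n_classes * class_count).

    Rare classes get large weights, so each class contributes the
    same total weight to the fit.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return [n_samples / (n_classes * counts[label]) for label in labels]


def weighted_linear_fit(x, y, w):
    """Weighted least squares for y = a + b * x with one descriptor x.

    Returns (a, b). Assumes x is not constant over the weighted samples.
    """
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical assay outcome: 95 inactive and 5 active compounds.
labels = ["inactive"] * 95 + ["active"] * 5
weights = balanced_weights(labels)
# Each active gets weight 10.0 and each inactive about 0.53, so both
# classes carry the same total weight (50 each).
```

The weights can then be passed to `weighted_linear_fit` along with a descriptor and the measured activities. Whether the up-weighted model actually predicts better is a separate question that would need to be checked on held-out compounds.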

Regards,

Curt Breneman
RPI Chemistry

james.metz,abbott.com wrote:

>
> QSAR Society Colleagues,
>
> I have a general question concerning the unfortunate, yet
> common problem of sparse, unevenly distributed
> biological data that one often obtains especially during the early
> phases of discovery research programs in the pharmaceutical
> industry. In the later (or end) phases of the program, typically
> there is a lot of data and one may be fortunate to have nice
> data sets where one can at least find compounds with activities spread
> out over (hopefully) a few orders of magnitude.
> But, if one builds (predictive) models near the end of the program,
> there is very little chance of having a significant impact in
> terms of suggesting/warning against compounds that the chemists
> should/should not make. Of course, the analysis may
> contribute to a nice after-the-fact publication in J. Med. Chem., but
> .... ahem....who cares? (other than adding another publication
> to my CV).
>
> For example, perhaps 100 compounds may be tested in an assay,
> perhaps 95 compounds are "dead" - meaning
> high IC50 values (maybe >100 uM or so) and perhaps only 5 compounds
> have "interesting" activities, perhaps IC50s in the 1-10 uM
> or perhaps < 1 uM range.
>
> To clarify the problem more, let us also assume that one does
> NOT have X-ray structures of the 5 compounds bound
> to a target, so this is NOT simply a matter of figuring out what
> pocket or region of an active site or a receptor that the chemists
> have not exploited very well.
>
> In other words... this is more of a ligand-based
> structure-activity problem.
>
> OK, so now what do you decide to do?
>
> Quit, move on to the next project, or stick your neck out and
> try to build models, or ....
>
> One idea that has been kicked around goes something like
> this: "Since I really only care about the active compounds, why not
> pay MORE attention to them?"
>
> Statistically translated, this might mean: change the
> weighting of my 5 most active compounds instead of weighting all
> compounds evenly. Or maybe throw out some compounds near the mean
> value (high IC50 in this case), since they (seemingly?) are not
> contributing much "information."
>
> I can see pros and cons with butchering or artificially
> modifying the data set, hence I do not see a clear answer.
>
> So.... Does anyone have any thoughts on this approach, or
> perhaps other ideas about dealing with this general problem of poorly
> distributed biological data?
>
>
> Best Regards,
> Jim Metz
>
>
> James T. Metz, Ph.D.
> Research Investigator Chemist
>
> GPRD R46Y AP10-2
> Abbott Laboratories
> 100 Abbott Park Road
> Abbott Park, IL 60064-6100
> U.S.A.
>
> Office (847) 936 - 0441
> FAX (847) 935 - 0548
>
> james.metz(~)abbott.com
Received on 2004-02-27 - 15:26 GMT
