QSAR - Sparse or Uneven Biological Data - What to do?

From: james.metz*|*abbott.com
Date: Fri, 27 Feb 2004 11:29:46 -0600

QSAR Society Colleagues,

        I have a general question concerning the unfortunate, yet common,
problem of sparse, unevenly distributed biological data that one often
obtains, especially during the early phases of discovery research programs
in the pharmaceutical industry. In the later (or end) phases of a program,
there is typically a lot of data, and one may be fortunate enough to have
nice data sets in which compound activities are spread out over (hopefully)
a few orders of magnitude. But if one builds (predictive) models near the
end of the program, there is very little chance of having a significant
impact in terms of suggesting compounds that the chemists should make or
warning against compounds they should not. Of course, the analysis may
contribute to a nice after-the-fact publication in J. Med. Chem., but ....
ahem.... who cares? (other than adding another publication to my CV).

        For example, perhaps 100 compounds may be tested in an assay,
perhaps 95 compounds are "dead" - meaning
high IC50 values (maybe >100 uM or so) and perhaps only 5 compounds have
"interesting" activities, perhaps IC50s in the 1-10 uM
or perhaps < 1 uM range.
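
        To make the scale of the example concrete, here is a minimal
sketch that converts IC50 values to pIC50 (the negative log10 of the molar
IC50) and tallies the distribution. The individual IC50 values are
hypothetical, chosen only to match the 95/5 split described above, and the
"dead" compounds are recorded at the 100 uM assay limit as a placeholder,
since their true IC50 values are only known to be higher.

```python
import math

def pic50_from_micromolar(ic50_um):
    """pIC50 = -log10(IC50 in mol/L); 1 uM = 1e-6 mol/L, so pIC50 = 6 - log10(uM)."""
    return 6.0 - math.log10(ic50_um)

# Hypothetical assay outcome: 95 "dead" compounds reported as >100 uM
# (stored here at the 100 uM limit) and 5 compounds with measurable activity.
ic50_um = [100.0] * 95 + [8.0, 5.0, 2.0, 1.0, 0.5]

pic50 = [pic50_from_micromolar(v) for v in ic50_um]
n_dead = sum(1 for v in ic50_um if v >= 100.0)
spread = max(pic50) - min(pic50)  # activity range in log units

print(f"{n_dead} of {len(ic50_um)} compounds sit at the assay limit")
print(f"pIC50 range covered by the data: {spread:.2f} log units")
```

Even with a sub-micromolar compound in the set, nearly all of the
observations sit at a single value, which is the imbalance the question is
about.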

        To clarify the problem more, let us also assume that one does NOT
have X-ray structures of the 5 compounds bound
to a target, so this is NOT simply a matter of figuring out what pocket or
region of an active site or a receptor that the chemists
have not exploited very well.

        In other words... this is more of a ligand-based
structure-activity problem.

        OK, so now what do you decide to do?

        Quit and move on to the next project, or stick your neck out and
try to build models, or ....

        One idea that has been kicked around goes something like this:
"Since I really only care about the active compounds, why not
pay MORE attention to them?"

        Translated into statistical terms, this might mean changing the
weighting of my 5 most active compounds (giving them more weight) instead
of weighting all compounds evenly. Or it might mean throwing out some
compounds near the mean value (high IC50 in this case), since they
(seemingly?) are not contributing much "information."
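
        As a rough illustration of the weighting idea, the sketch below
fits a one-descriptor linear model by weighted least squares, once with
even weights and once with the actives up-weighted. Everything in it is an
assumption made for illustration: the data are synthetic, the single
"descriptor" is hypothetical, and the choice of weight (each active counts
as much as 19 inactives, so both groups carry equal total weight) is just
one arbitrary option. Setting a compound's weight to 0 is equivalent to
throwing it out of the data set.

```python
import numpy as np

def weighted_least_squares(X, y, weights):
    """Fit y ~ intercept + X @ beta by minimizing sum(w_i * residual_i**2)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(weights, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])  # prepend an intercept column
    sw = np.sqrt(w)
    # Scaling each row by sqrt(w_i) turns the weighted problem into an
    # ordinary least-squares problem.
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef  # [intercept, slope_1, ...]

# Synthetic data only: 95 inactives at pIC50 = 4 (the 100 uM limit) and
# 5 actives with pIC50 between 5 and 6.5. The descriptor is hypothetical.
rng = np.random.default_rng(0)
n_inactive, n_active = 95, 5
descriptor = np.concatenate([rng.normal(0.0, 1.0, n_inactive),
                             rng.normal(2.0, 0.5, n_active)])
pic50 = np.concatenate([np.full(n_inactive, 4.0),
                        rng.uniform(5.0, 6.5, n_active)])

is_active = pic50 > 4.0
even = np.ones_like(pic50)
boosted = np.where(is_active, n_inactive / n_active, 1.0)  # 19x per active

coef_even = weighted_least_squares(descriptor[:, None], pic50, even)
coef_boosted = weighted_least_squares(descriptor[:, None], pic50, boosted)
print("even weights    [intercept, slope]:", coef_even)
print("boosted actives [intercept, slope]:", coef_boosted)
```

Comparing the two sets of coefficients shows how strongly the fitted model
depends on the weighting choice, which is part of the concern about
artificially modifying the data set.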

        I can see pros and cons with butchering or artificially modifying
the data set, hence I do not see a clear answer.

        So.... Does anyone have any thoughts on this approach, or perhaps
other ideas about dealing with this general problem of poorly
distributed biological data?

 
        Best Regards,
        Jim Metz

James T. Metz, Ph.D.
Research Investigator Chemist

GPRD R46Y AP10-2
Abbott Laboratories
100 Abbott Park Road
Abbott Park, IL 60064-6100
U.S.A.

Office (847) 936 - 0441
FAX (847) 935 - 0548

james.metz,,abbott.com
Received on 2004-02-27 - 14:32 GMT