Re: QSAR - How to statistically determine when variables are "working well together"

From: algorithms <alanas::ap-algorithms.com>
Date: Fri, 24 Oct 2003 20:02:34 -0400

Dear Jim and Mark,

The question can be simplified if we consider two types of QSAR predictivity: "statistical induction" and "mechanistic deduction". Statistical induction improves as the number of descriptors increases, assuming that similar compounds act by similar mechanisms. Mechanistic deduction, on the contrary, requires using the minimum number of descriptors to formulate the simplest possible hypotheses.

The "maximum-descriptor" approach is only useful for compounds that are similar to the training set, whereas the "minimum-descriptor" approach should be used when new compounds are dissimilar. In the latter case we can only aim at predicting crude qualitative effects, so QSARs should be replaced with minimum-descriptor Classification SARs. (If C-SARs use too many descriptors, they are no different from maximum-descriptor QSARs).

Validation of QSAR descriptors cannot be based solely on statistics. Statistics only help us to formulate scientific hypotheses (since any combination of descriptors implies a certain hypothesis). These hypotheses must be verified using independent but related data. Otherwise we lock ourselves in the world of "structural similarity", which is full of chance correlations.
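
To make the danger concrete, here is a minimal numerical sketch (my own illustration in Python with random data, not taken from any publication): with many descriptors and few compounds an excellent fit to the training set is almost guaranteed even when the descriptors carry no information at all, and only independent data reveal this.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n_train, n_ext, p = 20, 20, 15                  # few compounds, many random "descriptors"
X_train = rng.normal(size=(n_train, p))
X_ext = rng.normal(size=(n_ext, p))
y_train = rng.normal(size=n_train)              # activity unrelated to the descriptors
y_ext = rng.normal(size=n_ext)

model = LinearRegression().fit(X_train, y_train)
print("training R2:", round(r2_score(y_train, model.predict(X_train)), 2))  # typically > 0.7
print("external R2:", round(r2_score(y_ext, model.predict(X_ext)), 2))      # near zero or negative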

More considerations on this can be found in Japertas et al., Mini Rev Med Chem 2003, 3, 797-808.

Alanas Petrauskas
Pharma Algorithms

----- Original Message -----
  From: Mark Earll
  To: 'qsar_society .. accelrys.com'
  Sent: Friday, October 24, 2003 8:47 AM
  Subject: RE: QSAR - How to statistically determine when variables are "working well together"

  Dear Jim,

  This is a highly interesting and contentious area within data analysis! Coming from the PLS side of things, I personally believe in the 'leave everything in' philosophy, though I know that there are many views on this subject.

  Cutting down the number of descriptors in a QSAR is useful for interpretation, but for prediction with a method such as PLS (Partial Least Squares regression) it is often advantageous to leave all the terms in. Although internal cross-validation may give an apparent improvement in Q2 with fewer variables, when it comes to predicting external data sets the results are hardly ever better than those of the full-variable model. The reason is that PLS pulls the useful correlated information out of the data and leaves the noise behind.
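
  To make this concrete, here is a minimal sketch in Python using scikit-learn's PLSRegression on made-up data (the
  numbers and the crude "keep the 10 most correlated descriptors" selection are purely illustrative): the internal Q2
  is estimated by cross-validation within the training set, and both the full-variable and the reduced model are then
  judged on an external test set.

  import numpy as np
  from sklearn.cross_decomposition import PLSRegression
  from sklearn.metrics import r2_score
  from sklearn.model_selection import cross_val_predict, train_test_split

  rng = np.random.default_rng(0)
  n, p = 80, 30
  X = rng.normal(size=(n, p))
  y = X @ rng.normal(scale=0.3, size=p) + rng.normal(scale=0.5, size=n)   # many small effects plus noise

  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

  def q2_and_external_r2(X_tr, y_tr, X_te, y_te, n_components=2):
      """Internal Q2 (cross-validated R2) and external test-set R2 of a PLS model."""
      pls = PLSRegression(n_components=n_components)
      q2 = r2_score(y_tr, cross_val_predict(pls, X_tr, y_tr, cv=7))
      pls.fit(X_tr, y_tr)
      return q2, r2_score(y_te, pls.predict(X_te))

  # Full-variable model
  q2_full, ext_full = q2_and_external_r2(X_train, y_train, X_test, y_test)

  # Naively "selected" model: keep only the 10 descriptors most correlated with y in the training set
  corr = np.abs([np.corrcoef(X_train[:, j], y_train)[0, 1] for j in range(p)])
  keep = np.argsort(corr)[-10:]
  q2_sel, ext_sel = q2_and_external_r2(X_train[:, keep], y_train, X_test[:, keep], y_test)

  print(f"full model:     Q2 = {q2_full:.2f}, external R2 = {ext_full:.2f}")
  print(f"selected model: Q2 = {q2_sel:.2f}, external R2 = {ext_sel:.2f}")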

  Including more descriptors makes the model more stable, particularly if a given parameter cannot be calculated for a molecule in the prediction set. Thanks to the correlations and redundancy in the data, PLS can still make a prediction for an observation with a small amount of missing data. Variable selection also weakens the power of the distance to model (DModX) to find outliers in new data: if a variable is close to constant in the training set and is therefore removed, a deviating value of that removed variable in a new sample will not be detected.
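
  As a rough illustration of the DModX idea (my own simplified sketch, not SIMCA's exact normalised statistic):
  project a new sample onto the fitted PLS model, reconstruct its descriptor values, and use the size of the X-residual
  as a distance to the model. Only descriptors that were kept in the model can contribute to this check; the data below
  are made up.

  import numpy as np
  from sklearn.cross_decomposition import PLSRegression

  rng = np.random.default_rng(1)
  X = rng.normal(size=(60, 12))
  y = X[:, :4].sum(axis=1) + rng.normal(scale=0.3, size=60)
  pls = PLSRegression(n_components=2).fit(X, y)

  def x_residual_distance(model, X_new):
      """RMS X-residual per sample after projection onto the PLS model plane
      (a simplified, unnormalised analogue of DModX)."""
      scores = model.transform(X_new)           # scores of the new samples
      X_hat = model.inverse_transform(scores)   # reconstruction in the original units
      return np.sqrt(((X_new - X_hat) ** 2).mean(axis=1))

  typical = x_residual_distance(pls, X).mean()
  x_new = X[:1].copy()
  x_new[0, 7] += 8.0                            # one descriptor far outside its training range
  print(f"typical distance: {typical:.2f}, deviating sample: {x_residual_distance(pls, x_new)[0]:.2f}")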

  Useful PLS parameters for showing the importance of a variable are the VIP (Variable Influence on Projection) and the regression coefficients. The VIP summarises the importance of a variable to the whole model, whereas the regression coefficients show the magnitude and sign of the influence of the X variables on Y. Another useful parameter is the jack-knifed confidence interval, which is derived from internal cross-validation and may be used in conjunction with the VIP to assess the importance and significance of a variable.
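
  VIP is not built into scikit-learn, so the following is a small helper of my own using the usual formula: each
  descriptor's normalised weight is summed over the components, weighted by how much of the Y variance each component
  explains, and scaled so that a VIP of about 1 corresponds to average influence. Descriptors with VIP above about 1
  are conventionally regarded as influential; the data below are made up.

  import numpy as np
  from sklearn.cross_decomposition import PLSRegression

  def vip_scores(pls):
      """VIP (Variable Influence on Projection) for each X variable of a fitted PLSRegression."""
      T = pls.x_scores_                  # (n_samples, n_components)
      W = pls.x_weights_                 # (n_features, n_components)
      Q = pls.y_loadings_                # (n_targets, n_components)
      p, _ = W.shape
      ss = np.sum(Q ** 2, axis=0) * np.sum(T ** 2, axis=0)    # Y variance explained per component
      w_norm = W / np.linalg.norm(W, axis=0)                   # normalised weight vectors
      return np.sqrt(p * ((w_norm ** 2) @ ss) / ss.sum())

  rng = np.random.default_rng(2)
  X = rng.normal(size=(50, 8))
  y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=50)
  pls = PLSRegression(n_components=2).fit(X, y)
  for j, (v, b) in enumerate(zip(vip_scores(pls), pls.coef_.ravel())):
      print(f"x{j}: VIP = {v:.2f}, coefficient = {b:+.2f}")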

  I was very surprised when I first started using PLS that leaving all variables in gave as good, if not better, predictions than selecting only the most 'important' ones. There is a danger of getting overly optimistic models by trusting only cross-validation when selecting variables. It all comes back to the trade-off between fit and prediction, and to the fact that many small effects can add up.

  A good idea when building a model on selected variables is to run the full-variable model in parallel as a 'sanity check'. Some authors have taken this a step further and built populations of models, selected variables by genetic algorithms, or used DoE-type approaches to find representative descriptors. Many of these ideas are rather unwieldy in practical situations, and the risk of removing useful information is always present. Unless a particular variable is 'costly' to calculate or measure, I tend to leave it in.

  Those are my gut feelings from practical experience of building models; if you want a more theoretical discussion, the following references may be of interest to you.

  Lindgren, F., Geladi, P., Berglund, A., Sjöström, M. and Wold, S. (Research Group for Chemometrics, Umeå University).
  Interactive Variable Selection (IVS) for PLS. 2. Chemical Applications. Journal of Chemometrics, 1995, 9(5), 331-342.

  Eriksson, L., Johansson, E., Kettaneh-Wold, N. and Wold, S. Multi- and Megavariate Data Analysis: Principles and Applications.
  ISBN 91-973730-1-X (comprehensive text on PCA, PLS and their applications and extensions)

  Naes, T., Isaksson, T., Fearn, T. and Davies, T. Multivariate Calibration and Classification.
  NIR Publications, ISBN 0-9528666-2-5 (a very readable introduction to PCA, PLS etc.)

  Krzanowski, W.J. Selection of variables to preserve multivariate data structure, using principal components.
  Applied Statistics, 1987, 36, 22-33.
  (Monte Carlo simulation, principal component analysis, Procrustes rotation, singular value decomposition, variable selection)

  Baroni, M., Costantino, G., Cruciani, G., et al. (Università di Perugia). Generating Optimal Linear PLS Estimations (GOLPE):
  An Advanced Chemometric Tool for Handling 3D-QSAR Problems. Quantitative Structure-Activity Relationships, 1993, 12, 9-20.
  (PLS, variable selection, CoMFA, 3D-QSAR)

  Baroni, M., Clementi, S., Cruciani, G., Costantino, G., et al. Predictive ability of regression models.
  Part II: Selection of the best predictive PLS model. Journal of Chemometrics, 1992, 6, 347-356. John Wiley & Sons, Ltd.
  (GOLPE, PLS, regression, SDEP, variable selection)

  Rännar, S. Many variables in multivariate projection methods. PhD Thesis, SSV, Umeå universitet, 1996.
  (many variables, multivariate projection methods, sequence models, kernel algorithm, variable selection, PCA, PLS, ACC, IVS)

  Best regards,

  Mark Earll
  -- -----------------------------------------------------------------------------------
  Mark Earll CChem MRSC Umetrics
  Senior Consultant (Scientific Data Analysis)
  Umetrics UK Ltd
  Woodside House, Woodside Road,
  Winkfield, Windsor, SL4 2DX

  Phone: 01344 885615 Mobile: 07765 402673
  Email: mark.earll.:.umetrics.co.uk
  Fax: 01344 885410
  Web: http://www.umetrics.com
  --------------------------------------------------------------------------------------

    -----Original Message-----
    From: james.metz^^abbott.com [mailto:james.metz]^[abbott.com]
    Sent: 23 October 2003 17:34
    To: qsar_society*accelrys.com
    Cc: james.metz^_^abbott.com
    Subject: QSAR - How to statistically determine when variables are "working well together"

    QSAR Society,

            Is anyone aware of any publications, "white" papers, or presentations which discuss the concept of how to
    judge when molecular descriptors are "working well together" in a QSAR equation, to reduce errors and especially
    improve predictive power for external prediction sets?

            For example, I am well aware of the more trivial case of building QSAR equations with, say, 2 terms, then 3
    terms, then 4 terms, then 5 terms, etc., and monitoring R^2, Q^2, and so on. We then say that if R^2, Q^2, or
    perhaps the F statistic has "improved significantly", this justifies the use of the higher-order equation. We may then use
    an Occam's razor argument to prefer the simpler model when a model with fewer terms has about the same predictive power
    as one with more terms.
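
            To be concrete about the statistics I have in mind, here is a minimal sketch in Python (synthetic data and
    ordinary least-squares models, purely for illustration): Q^2 is the leave-one-out cross-validated R^2, and the
    partial F test compares each nested model with the one that has one term fewer.

    import numpy as np
    from scipy import stats
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score
    from sklearn.model_selection import LeaveOneOut, cross_val_predict

    rng = np.random.default_rng(3)
    n = 40
    X = rng.normal(size=(n, 5))                 # five candidate descriptors (made up)
    y = 1.5 * X[:, 0] + 0.8 * X[:, 1] + rng.normal(scale=0.7, size=n)

    def rss(Xk):
        """Residual sum of squares of an ordinary least-squares fit."""
        model = LinearRegression().fit(Xk, y)
        return np.sum((y - model.predict(Xk)) ** 2)

    def q2(Xk):
        """Leave-one-out cross-validated R^2 (Q^2)."""
        y_cv = cross_val_predict(LinearRegression(), Xk, y, cv=LeaveOneOut())
        return r2_score(y, y_cv)

    # Nested models using the first k descriptors, k = 2..5
    for k in range(2, 6):
        rss_small, rss_big = rss(X[:, : k - 1]), rss(X[:, :k])
        df2 = n - k - 1                               # residual d.o.f. of the larger model
        f = (rss_small - rss_big) / (rss_big / df2)   # partial F for the single added term
        p_val = stats.f.sf(f, 1, df2)
        print(f"{k} terms: Q2 = {q2(X[:, :k]):.2f}, partial F = {f:.1f}, p = {p_val:.3f}")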

            However, I am looking for (perhaps) something more sophisticated and thoughtful than this approach! Is there something
    like examining the synergism between variables in QSAR equations that reduces errors in a way that suggests the
    variables "work well together", e.g. some kind of cancellation of errors? Is there a mathematical formalism for this?

            Thoughts, ideas, leads, comments, etc. are much appreciated here.

            Regards,
            Jim Metz

    James T. Metz, Ph.D.
    Research Investigator Chemist

    GPRD R46Y AP10-2
    Abbott Laboratories
    100 Abbott Park Road
    Abbott Park, IL 60064-6100
    U.S.A.

    Office (847) 936 - 0441
    FAX (847) 935 - 0548

    james.metz###abbott.com