Dear Jim,
This is a highly interesting and contentious area within data analysis!
Coming from the PLS side of things, I personally believe in the 'leave
everything in' philosophy, though I know there are many views on this
subject.
Cutting down the number of descriptors in a QSAR is useful for
interpretation, but for prediction with a method such as PLS (Partial Least
Squares regression) it is often advantageous to leave all the terms in.
Although internal cross-validation may show an apparent improvement in Q2
with fewer variables, when it comes to predicting external data sets the
results are hardly ever better than those from the full-variable model. The
reason is that PLS pulls the useful correlated information out of the data
and leaves the noise behind.
Including more descriptors also makes the model more robust, particularly
when a parameter cannot be calculated for a molecule in the prediction set.
Predictions may still be made with PLS for an observation with a small
amount of missing data, thanks to the correlations and redundancy in the data.
Variable selection also weakens the power of the Distance to Model (DmodX)
to find outliers in new data. If a variable is close to constant in the
training set and is therefore removed, a deviating value in that variable
for a new sample will not be detected.
Useful PLS parameters for showing the importance of a variable are the VIP
(Variable Influence on Projection) and the regression coefficients. The VIP
summarises the importance of a variable to the whole model, whereas the
regression coefficients show the magnitude and sign of each X variable's
influence on Y. Another useful parameter is the jack-knifed confidence
interval, derived from internal cross-validation, which may be used in
conjunction with the VIP to assess both the importance and the significance
of a variable.
I was very surprised when I first started using PLS that leaving all the
variables in gave as good, if not better, predictions than selecting only
the most 'important' ones. There is a danger of obtaining overly optimistic
models by trusting cross-validation alone when selecting variables. It all
comes back to the trade-off between fit and prediction, and to the fact that
many small effects can add up.
A good idea when building a model on selected variables is to run the
full-variable model in parallel as a 'sanity check'. Some authors have taken
this a step further and built populations of models, selected variables by
genetic algorithms, or used DoE-type approaches to find representative
descriptors. Many of these ideas may be rather unwieldy in practical
situations, and the risk of removing useful information is always present.
Unless a particular variable is 'costly' to calculate or measure, I tend to
leave it in.
Those are my gut feelings from practical experience of building models; if
you want a more theoretical discussion, the following references may be of
interest to you:
Lindgren, F., Geladi, P., Berglund, A., Sjöström, M. and Wold, S.
Interactive Variable Selection (IVS) for PLS. Part II: Chemical Applications.
Journal of Chemometrics, 1995, 9(5), 331-342. (Research Group for
Chemometrics, Umeå University)

Eriksson, L., Johansson, E., Kettaneh-Wold, N. and Wold, S.
Multi- and Megavariate Data Analysis: Principles and Applications.
ISBN 91-973730-1-X. (Comprehensive text on PCA, PLS and its applications and
extensions)

Naes, T., Isaksson, T., Fearn, T. and Davies, T.
Multivariate Calibration and Classification. NIR Publications,
ISBN 0-9528666-2-5. (A very readable introduction to PCA, PLS etc.)

Krzanowski, W.J. Selection of variables to preserve multivariate data
structure, using principal components. Applied Statistics, 1987, 36, 22-33.
(Monte Carlo simulation, principal component analysis, Procrustes rotation,
singular value decomposition, variable selection)

Baroni, M., Costantino, G. and Cruciani, G. Generating Optimal Linear PLS
Estimations (GOLPE): An Advanced Chemometric Tool for Handling 3D-QSAR
Problems. 1993, 12, 9-20. (Università di Perugia; GOLPE, PLS, variable
selection, CoMFA, 3D-QSAR)

Baroni, M., Clementi, S., Cruciani, G. and Costantino, G. Predictive ability
of regression models. Part II: Selection of the best predictive PLS model.
Journal of Chemometrics, 1992, 6, 347-356. (GOLPE, PLS, regression, SDEP,
variable selection; John Wiley & Sons, Ltd.)

Rännar, S. Many variables in multivariate projection methods. PhD Thesis,
Umeå universitet, 1996. (many variables, multivariate projection methods,
sequence models, kernel algorithm, variable selection, PCA, PLS, ACC, IVS)
Best regards,
Mark Earll
--
Mark Earll CChem MRSC
Senior Consultant (Scientific Data Analysis), Umetrics UK Ltd
Woodside House, Woodside Road, Winkfield, Windsor, SL4 2DX
Phone: 01344 885615  Mobile: 07765 402673  Fax: 01344 885410
Email: mark.earll##umetrics.co.uk
Web: http://www.umetrics.com

-----Original Message-----
From: james.metz.:.abbott.com [mailto:james.metz,abbott.com]
Sent: 23 October 2003 17:34
To: qsar_society.@.accelrys.com
Cc: james.metz---abbott.com
Subject: QSAR - How to statistically determine when variables are "working well together"

QSAR Society,

Is anyone aware of any publications, "white" papers, or presentations which
discuss the concept of how to judge when molecular descriptors are "working
well together" in a QSAR equation, to reduce errors and especially improve
predictive power for external prediction sets?

For example, I am well aware of the more trivial case of building QSAR
equations with, say, 2 terms, then 3 terms, then 4 terms, then 5 terms,
etc., and then monitoring the R^2, Q^2, etc. Then we say that if the R^2 or
Q^2 or perhaps the F statistic has "improved significantly", this justifies
the use of a higher-order equation. We may then use an Occam's Razor
argument to prefer simpler models when a model with fewer terms has about
the same predictive power as a set of equations with more terms.

However, I am looking for (perhaps) something more sophisticated and
thoughtful than this approach! Is there something like examining the
synergism between variables in QSAR equations that reduces errors in a way
that suggests the variables "work well together", e.g. some kind of
cancellation of errors? Is there a mathematical formalism for this?

Thoughts, ideas, leads, comments, etc. are much appreciated here.

Regards,
Jim Metz

James T. Metz, Ph.D.
Research Investigator Chemist, GPRD R46Y AP10-2
Abbott Laboratories
100 Abbott Park Road, Abbott Park, IL 60064-6100, U.S.A.
Office (847) 936-0441  Fax (847) 935-0548
james.metz%a%abbott.com

Received on 2003-10-24 - 09:50 GMT