Dear Jim,
This is a highly interesting and contentious area within data analysis!
Coming from the PLS side of things, I personally believe in the 'leave
everything in' philosophy, though I know there are many views on this
subject.
Cutting down the number of descriptors in a QSAR is useful for
interpretation, but for prediction with a method such as PLS (Partial Least
Squares regression) it is often advantageous to leave all the terms in.
Although internal cross-validation may show an apparent improvement in Q2
with fewer variables, when it comes to predicting external data sets the
results are hardly ever better than those from the full-variable model. The
reason is that PLS pulls the useful correlated information out of the data
and leaves the noise behind.
Including more descriptors also makes the model more robust, particularly
when a parameter cannot be calculated for a molecule in the prediction set.
Predictions may still be made with PLS for an observation with a small
amount of missing data, thanks to the correlations and redundancy in the data.
Variable selection also weakens the power of the Distance to Model (DmodX)
to find outliers in new data. If a variable is close to constant in the
training set and is therefore removed, a deviating value in that variable
for a new sample will not be detected.
Useful PLS parameters for showing the importance of a variable are the VIP
(Variable Influence on Projection) and the regression coefficients. The VIP
summarises the importance of a variable to the whole model, whereas the
regression coefficients show the magnitude and sign of each X variable's
influence on Y. Another useful parameter is the jack-knifed confidence
interval, derived from internal cross-validation, which may be used in
conjunction with the VIP to assess both the importance and the significance
of a variable.
I was very surprised when I first started using PLS that leaving all the
variables in gave as good, if not better, predictions than selecting only
the most 'important' ones. There is a danger of obtaining overly optimistic
models by trusting cross-validation alone when selecting variables. It all
comes back to the trade-off between fit and prediction, and to the fact that
many small effects can add up.
A good idea when building a model on selected variables is to run the
full-variable model in parallel as a 'sanity check'. Some authors have taken
this a step further and built populations of models, selected variables by
genetic algorithms, or used DoE-type approaches to find representative
descriptors. Many of these ideas may be rather unwieldy in practical
situations, and the risk of removing useful information is always present.
Unless a particular variable is 'costly' to calculate or measure, I tend to
leave it in.
Those are my gut feelings from practical experience of building models; if
you want a more theoretical discussion, the following references may be of
interest to you:
Lindgren, F., Geladi, P., Berglund, A., Sjöström, M. and Wold, S.
Interactive Variable Selection (IVS) for PLS. Part II: Chemical Applications.
Journal of Chemometrics, 1995, 9(5), 331-342. (Research Group for
Chemometrics, Umeå University)

Eriksson, L., Johansson, E., Kettaneh-Wold, N. and Wold, S.
Multi- and Megavariate Data Analysis: Principles and Applications.
ISBN 91-973730-1-X. (Comprehensive text on PCA, PLS and its applications and
extensions)

Naes, T., Isaksson, T., Fearn, T. and Davies, T.
Multivariate Calibration and Classification. NIR Publications,
ISBN 0-9528666-2-5. (A very readable introduction to PCA, PLS etc.)

Krzanowski, W.J. Selection of variables to preserve multivariate data
structure, using principal components. Applied Statistics, 1987, 36, 22-33.
(Monte Carlo simulation, principal component analysis, Procrustes rotation,
singular value decomposition, variable selection)

Baroni, M., Costantino, G. and Cruciani, G. Generating Optimal Linear PLS
Estimations (GOLPE): An Advanced Chemometric Tool for Handling 3D-QSAR
Problems. 1993, 12, 9-20. (Università di Perugia; GOLPE, PLS, variable
selection, CoMFA, 3D-QSAR)

Baroni, M., Clementi, S., Cruciani, G. and Costantino, G. Predictive ability
of regression models. Part II: Selection of the best predictive PLS model.
Journal of Chemometrics, 1992, 6, 347-356. (GOLPE, PLS, regression, SDEP,
variable selection; John Wiley & Sons, Ltd.)

Rännar, S. Many variables in multivariate projection methods. PhD Thesis,
Umeå universitet, 1996. (many variables, multivariate projection methods,
sequence models, kernel algorithm, variable selection, PCA, PLS, ACC, IVS)
Best regards,
Mark Earll
--
Mark Earll CChem MRSC
Senior Consultant (Scientific Data Analysis), Umetrics UK Ltd
Woodside House, Woodside Road, Winkfield, Windsor, SL4 2DX
Phone: 01344 885615  Mobile: 07765 402673  Fax: 01344 885410
Email: mark.earll##umetrics.co.uk
Web: http://www.umetrics.com

-----Original Message-----
From: james.metz.:.abbott.com [mailto:james.metz,abbott.com]
Sent: 23 October 2003 17:34
To: qsar_society.@.accelrys.com
Cc: james.metz---abbott.com
Subject: QSAR - How to statistically determine when variables are "working well together"

QSAR Society,

Is anyone aware of any publications, "white" papers, or presentations which
discuss the concept of how to judge when molecular descriptors are "working
well together" in a QSAR equation, to reduce errors and especially improve
predictive power for external prediction sets?

For example, I am well aware of the more trivial case of building QSAR
equations with, say, 2 terms, then 3 terms, then 4 terms, then 5 terms,
etc., and then monitoring the R^2, Q^2, etc. Then we say that if the R^2 or
Q^2 or perhaps the F statistic has "improved significantly", this justifies
the use of a higher-order equation. We may then use an Occam's Razor
argument to prefer simpler models when a model with fewer terms has about
the same predictive power as a set of equations with more terms.

However, I am looking for (perhaps) something more sophisticated and
thoughtful than this approach! Is there something like examining the
synergism between variables in QSAR equations that reduces errors in a way
that suggests the variables "work well together", e.g. some kind of
cancellation of errors? Is there a mathematical formalism for this?

Thoughts, ideas, leads, comments, etc. are much appreciated here.

Regards,
Jim Metz

James T. Metz, Ph.D.
Research Investigator Chemist, GPRD R46Y AP10-2
Abbott Laboratories
100 Abbott Park Road, Abbott Park, IL 60064-6100, U.S.A.
Office (847) 936-0441  Fax (847) 935-0548
james.metz%a%abbott.com

Received on 2003-10-24 - 09:50 GMT