RE: QSAR - How to statistically determine when variables are "working well together"

From: Hugo Kubinyi <kubinyi-*-t-online.de>
Date: 26 Oct 2003 09:17 GMT

Dear Mark, dear Lennart, dear all,

regarding the comments on variable selection, I fully agree
with Lennart Eriksson's comment that the question of
variable selection is "far from trivial", but not so much
with Mark Earll's comment that "PLS leaving all variables in
gives as good, if not better predictions than selecting only
the most important ones".

Let me argue:
First, the sometimes-applied selection of variables that,
as single variables (!), explain at least some of the
variance in the data is nonsense. One can construct
examples where the single variables do not explain anything,
but a proper combination of them gives a perfect fit and
prediction (see e.g. H. Kubinyi and U. Abraham, Practical
Problems in PLS Analyses, in: H. Kubinyi, Ed., 3D QSAR in
Drug Design. Theory, Methods and Applications, ESCOM,
Leiden, 1993, pp. 717-728; sorry, no reprints or PDF
available).
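Here is a minimal Python/numpy sketch (my own construction, not
the example from the book chapter) of how two such variables can
look:

import numpy as np

rng = np.random.default_rng(0)
n = 50
u = rng.normal(size=n)            # the "signal"
v = rng.normal(size=n)            # a large shared nuisance term
x1 = u + 10.0 * v                 # descriptor 1: signal buried in nuisance
x2 = 10.0 * v                     # descriptor 2: pure nuisance
y = u                             # activity; note that y = x1 - x2 exactly

print(np.corrcoef(x1, y)[0, 1])   # ~0.1 -> "useless" as a single variable
print(np.corrcoef(x2, y)[0, 1])   # ~0.0 -> "useless" as a single variable

X = np.column_stack([x1, x2, np.ones(n)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)                       # ~[1, -1, 0]: perfect fit in combination
print(np.allclose(X @ coef, y))   # True

Each variable alone is worthless by any single-variable criterion,
but the pair reproduces y exactly.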

A proper variable selection procedure, which avoids this
problem, is to generate all three-variable combinations (if
the number of variables to select from is not too large,
say, smaller than 100) or to use a GOLPE-like procedure (if
the number of variables is larger) to find those variables
that contribute in combination with others; certain
variables show up more often than others, and these should
be selected, also for PLS analysis. The procedure is
described in detail in:
H. Kubinyi, Variable Selection in QSAR Studies. II. A Highly
Efficient Combination of Systematic Search and Evolution,
Quant. Struct.-Act. Relat. 13, 393-401 (1994),
but also in:
H. Kubinyi, Variable Selection in QSAR Studies, in: Trends
in QSAR and Molecular Modelling '94 (Proceedings of the 10th
European Symposium on Structure-Activity Relationships: QSAR
and Molecular Modelling), F. Sanz, Ed., Prous Publishers,
Barcelona, Spain, 1995, pp. 27-29, and
H. Kubinyi, Evolutionary Variable Selection in Regression
and PLS Analyses, J. Chemometrics 10, 119-133 (1996).
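
For readers who prefer code to prose, a rough sketch of the
systematic-search step in Python/scikit-learn follows (the function
names and the n_best cutoff are my own choices, not taken from the
papers):

from itertools import combinations
from collections import Counter

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def q2(X, y):
    # leave-one-out Q2 = 1 - PRESS / SS
    y_pred = cross_val_predict(LinearRegression(), X, y, cv=LeaveOneOut())
    press = np.sum((y - y_pred) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def frequent_variables(X, y, n_best=50):
    # score every three-variable combination, keep the n_best models
    # and count how often each variable shows up in them
    scores = sorted(
        ((q2(X[:, list(triple)], y), triple)
         for triple in combinations(range(X.shape[1]), 3)),
        reverse=True)
    counts = Counter(i for _, triple in scores[:n_best] for i in triple)
    return counts.most_common()

The variables at the top of counts.most_common() are the candidates
to carry into the regression or PLS model.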

To prove this, I show here only the results of a PLS
analysis of the (nasty) Selwood data set including all
variables, of the "best" three-variable regression model,
and of a PLS analysis that included only the "relevant"
variables:

a) PLS, all variables (5 components; = "best" model)
     r = 0.929; s = 0.335; F = 31.58
     Q2 = 0.279; sPRESS = 0.768
b) Regression (best 3-variable model)
     r = 0.849; s = 0.460; F = 23.27
     Q2 = 0.647; sPRESS = 0.518
c) PLS, reduced variable set (5 components)
     r = 0.909; s = 0.376; F = 23.91
     Q2 = 0.671; sPRESS = 0.519

As you can see, even the Q2 of the PLS analysis including
all variables is just lousy, not to speak of a predictive
r2 for an external set. Thus, the recommendation to use PLS
analysis without any variable selection may be useful only
if you have an X block where all, or at least most,
variables have some significance. In the most common
situation, in which many irrelevant variables are included,
some proper variable selection is recommended even in PLS
analysis.
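
As a hedged sketch of how such a comparison can be reproduced with
scikit-learn's PLSRegression (this is not the program that produced
the numbers above, and 'selected' stands for whatever indices your
variable selection returns):

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def pls_q2(X, y, n_components=5):
    # leave-one-out Q2 = 1 - PRESS / SS for a PLS model
    model = PLSRegression(n_components=n_components, scale=True)
    y_pred = cross_val_predict(model, X, y, cv=LeaveOneOut()).ravel()
    press = np.sum((y - y_pred) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

# with the Selwood data loaded as X (31 compounds x 53 descriptors) and y:
# print(pls_q2(X, y))               # a) all variables
# print(pls_q2(X[:, selected], y))  # c) reduced variable set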

Kind regards

Hugo

"Mark Earll" <mark.earll(0)umetrics.co.uk> schrieb:
> Dear Jim,
>
> This is a highly interesting and contentious area within data analysis!
> Coming from the PLS side of things I personally believe in the 'leaving
> everything in' philosophy though I know that there are many ideas on this
> subject.
>
> Cutting down the number of descriptors in a QSAR is useful for
> interpretation purposes, but for prediction using a method such as PLS
> (Partial Least Squares Regression) it is often advantageous to leave all the
> terms in. Although internal cross-validation may give an apparent
> improvement in Q2 with a lower number of variables, when it comes to
> predicting external data sets the results are hardly ever better than those
> of the full-variable model. The reason for this is that PLS pulls out the
> useful correlated information from the data and leaves the noise behind.
>
> Including more descriptors makes the model more stable, particularly if a
> particular parameter cannot be calculated for a molecule in the prediction
> set. Predictions may still be made using PLS for an observation with a small
> amount of missing data, due to the correlations and redundancy in the data.
> Variable selection also weakens the power of the Distance to Model (DModX)
> to find outliers in new data: if a variable is close to constant in the
> training set and is therefore removed, any deviating value in that removed
> variable for a new sample will not be detected.
>
> Useful PLS parameters for showing the importance of a variable are the VIP
> (Variable Influence on Projection) and the regression coefficients. The VIP
> summarises the importance of a variable to the whole model, whereas the
> regression coefficients show the magnitude and sign of the influence of the
> X variables on the Y. Another useful parameter is the jack-knifed confidence
> interval, which is derived from internal cross-validation. This may be used
> in conjunction with the VIP to assess the importance and significance of a
> variable.
>
> I was very surprised when I first started using PLS that leaving all
> variables in gave as good, if not better, predictions than selecting only
> the most 'important' ones. There is a danger of getting overly optimistic
> models by trusting cross-validation alone when selecting variables. It all
> comes back to the trade-off between fit and prediction, and to the fact that
> many small effects can add up.
>
> A good idea when building a model based on selected variables is to run the
> full-variable model in parallel as a 'sanity check'. Some authors have taken
> this a step further and built populations of models, selected variables by
> genetic algorithms or used DoE-type approaches to find representative
> descriptors. Many of these ideas may be rather unwieldy for practical
> situations, and the risk of removing useful information is always present.
> Unless a particular variable is 'costly' to calculate or measure, I tend to
> leave it in.
>
> Those are my gut feelings from my practical experience of building models;
> if you want a more theoretical discussion, then the following references may
> be of interest to you:
>
> Fredrik Lindgren, Paul Geladi, Anders Berglund, Michael Sjöström and Svante
> Wold, Interactive Variable Selection (IVS) for PLS. 2. Chemical
> Applications, Journal of Chemometrics, 1995, 9(5), 331-342. (Research Group
> for Chemometrics, Umeå University)
>
> L. Eriksson, E. Johansson, N. Kettaneh-Wold and S. Wold, Multi- and
> Megavariate Data Analysis: Principles and Applications, ISBN 91-973730-1-X.
> (Comprehensive text on PCA, PLS and their applications and extensions)
>
> Tormod Naes, Tomas Isaksson, Tom Fearn and Tony Davies, Multivariate
> Calibration and Classification, NIR Publications, ISBN 0-9528666-2-5. (A
> very readable introduction to PCA, PLS etc.)
>
> W. J. Krzanowski, Selection of Variables to Preserve Multivariate Data
> Structure, Using Principal Components, Applied Statistics, 1987, 36, 22-33.
> (Monte Carlo simulation, principal component analysis, Procrustes rotation,
> singular value decomposition, variable selection)
>
> Massimo Baroni, Gabriele Costantino, Gabriele Cruciani et al., Generating
> Optimal Linear PLS Estimations (GOLPE): An Advanced Chemometric Tool for
> Handling 3D-QSAR Problems, Quant. Struct.-Act. Relat., 1993, 12, 9-20.
> (PLS, variable selection, CoMFA, 3D-QSAR; Università di Perugia)
>
> M. Baroni, S. Clementi, G. Cruciani, G. Costantino et al., Predictive
> Ability of Regression Models. Part II: Selection of the Best Predictive PLS
> Model, Journal of Chemometrics, 1992, 6, 347-356. (GOLPE, PLS, regression,
> SDEP, variable selection; John Wiley & Sons, Ltd.)
>
> S. Rännar, Many Variables in Multivariate Projection Methods, PhD Thesis,
> Umeå University, 1996. (many variables, multivariate projection methods,
> sequence models, kernel algorithm, variable selection, PCA, PLS, ACC, IVS)
>
> Best regards,
>
> Mark Earll
> --
> -----------------------------------------------------------------------------------
> Mark Earll CChem MRSC Umetrics
> Senior Consultant (Scientific Data Analysis)
> Umetrics UK Ltd
> Woodside House, Woodside Road,
> Winkfield, Windsor, SL4 2DX
>
> Phone: 01344 885615 Mobile: 07765 402673
> Email: mark.earll++umetrics.co.uk
> Fax: 01344 885410
> Web: http://www.umetrics.com <http://www.umetrics.com/>
> --------------------------------------------------------------------------------------
>
>
>
> -----Original Message-----
> From: james.metz]=[abbott.com [mailto:james.metz.:.abbott.com]
> Sent: 23 October 2003 17:34
> To: qsar_society+/-accelrys.com
> Cc: james.metz{=}abbott.com
> Subject: QSAR - How to statistically determine when variables are "working
> well together"
>
>
>
> QSAR Society,
>
> Is anyone aware of any publications, white papers, or
> presentations which discuss how to
> judge when molecular descriptors are "working well together" in a QSAR
> equation, so as to reduce errors and especially
> improve predictive power for external prediction sets?
>
> For example, I am well aware of the more trivial case of building
> QSAR equations with, say, 2 terms, then 3
> terms, then 4 terms, then 5 terms, etc., and then monitoring the R^2, Q^2,
> etc. We then say that if the R^2 or Q^2 or
> perhaps the F statistic has "improved significantly", this justifies the use
> of a higher-order equation. We may then use
> an Occam's razor argument to prefer simpler models when a model with fewer
> terms has about the same predictive power
> as a set of equations with more terms.
>
> However, I am looking for (perhaps) something more sophisticated and
> thoughtful than this approach! Is there something
> like examining the synergism between variables in QSAR equations that
> reduces errors in a way that suggests the
> variables "work well together", e.g. some kind of cancellation of errors?
> Is there a mathematical formalism for this?
>
> Thoughts, ideas, leads, comments, etc. are much appreciated here.
>
> Regards,
> Jim Metz
>
>
>
> James T. Metz, Ph.D.
> Research Investigator Chemist
>
> GPRD R46Y AP10-2
> Abbott Laboratories
> 100 Abbott Park Road
> Abbott Park, IL 60064-6100
> U.S.A.
>
> Office (847) 936 - 0441
> FAX (847) 935 - 0548
>
> james.metz^-^abbott.com
>
>
>

-- 
Prof. Dr. Hugo Kubinyi,   Donnersbergstrasse 9
D-67256 Weisenheim am Sand,   Germany
FAX  +49-6353-508233,   E-mail   kubinyi===t-online.de
HomePage   http://home.t-online.de/home/kubinyi 
Received on 2003-10-26 - 06:23 GMT
