QSAR - The variable selection debate

From: Mark Earll <mark.earll+/-umetrics.co.uk>
Date: Fri, 31 Oct 2003 11:53:28 -0000

Dear Alanas, Hugo, Stefan, Isaac and QSAR members,

I am in total agreement with you all that for good interpretability, a model
with as few descriptors as possible is the best one. This is particularly
true if you are perhaps passing the model on for use by a medicinal
chemistry team where you want the model to relate to chemically meaningful
parameters. This is the beauty of models like Mike Abraham's five-parameter
MLR-based solvation equation.

The main point of my earlier contribution was that PLS does a surprisingly
good initial job of finding a predictive model from huge numbers of
descriptors. A reduced descriptor set may be produced which appears to
increase the cross-validated Q2; however, when it is used to predict new
external data, the predictive ability is often not far from that of the
original full-descriptor model. Therefore one must always validate the
pruned model with new data and not put absolute faith in the
cross-validation. If dramatic increases in predictability do occur with a
pruned model, it is wise to check whether one of the excluded variables
might have included a gross outlier or an incorrect value.
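
To make the comparison concrete, here is a minimal sketch of how a
leave-one-out Q2 and an external-set Q2 might be computed side by side. It
is not PLS: a one-descriptor least-squares line stands in for the model so
that the example needs only the Python standard library. The function names
(q2, fit_line, loo_q2, external_q2) are hypothetical, and measuring the
external Q2 around the training-set mean is an assumed convention (others
are in use).

```python
from statistics import mean


def q2(y_obs, y_pred, y_ref_mean=None):
    """Return 1 - PRESS/SS, with SS measured around y_ref_mean.

    If y_ref_mean is not given, the mean of y_obs is used.
    """
    if y_ref_mean is None:
        y_ref_mean = mean(y_obs)
    press = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss = sum((o - y_ref_mean) ** 2 for o in y_obs)
    return 1.0 - press / ss


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def loo_q2(x, y):
    """Leave-one-out cross-validated Q2 for the one-descriptor model."""
    preds = []
    for i in range(len(x)):
        # Refit without compound i, then predict compound i.
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        preds.append(slope * x[i] + intercept)
    return q2(y, preds)


def external_q2(x_train, y_train, x_test, y_test):
    """Q2 on compounds that played no part in fitting or selection.

    Assumption: SS is taken around the training-set mean of y.
    """
    slope, intercept = fit_line(x_train, y_train)
    preds = [slope * xi + intercept for xi in x_test]
    return q2(y_test, preds, y_ref_mean=mean(y_train))
```

Reporting loo_q2 and external_q2 for both the full and the pruned model
shows whether a gain in the cross-validated figure carries over to
compounds the model has never seen.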

The publication question is an interesting one, with many issues that may
not be so important for industrial research usage. Where fundamental
understanding is needed I agree that the chemical meaning of the variables
must be clear. A relatively new idea with PLS is to use a hierarchical
approach where local models are made on blocks of similar descriptors. The
top-level model is then much more interpretable, as its loadings plots
show a condensed summary of the lower-level models. Variables can be
grouped into, say, electronic, topological and physico-chemical blocks and
the main trends within each group observed. In this way the interpretability
is improved while retaining the stability of a full-variable model. This
approach was described by my colleague Lennart Eriksson at last year's QSAR
meeting in Bournemouth, the proceedings of which are about to be published.

It's good to see a lively debate on this topic and I will certainly be
reading many of the interesting references cited. Maybe one thing we can all
agree on is that we can never have enough Y data to test our models on!

Best regards,

Mark

--
----------------------------------------------------------------------------
Mark Earll CChem MRSC 	       Umetrics 
Senior Consultant	         (Scientific Data Analysis)
Umetrics UK Ltd                    
Woodside House, Woodside Road, 
Winkfield, Windsor, SL4 2DX
Phone:  01344 885615         Mobile: 07765 402673
Email:	 mark.earll.@.umetrics.co.uk  
Fax:       01344 885410     
Web:	 http://www.umetrics.com
----------------------------------------------------------------------------
Received on 2003-10-31 - 08:56 GMT