*From*: Michel Petitjean <ptitjean*at*itodys.jussieu.fr>
*Subject*: CCL: Re: Software for pattern recognition in QSAR studies?
*Date*: Mon, 6 Dec 2004 17:16:00 +0100 (MET)

To: chemistry*at*ccl.net
Subj: CCL: Re: Software for pattern recognition in QSAR studies?

> > Renxiao Wang wrote:
> > > I am looking for a program that can apply standard pattern recognition
> > > techniques. Basically, I want to study a number of samples, each of
> > > which can be characterized by some properties. I would like to classify
> > > these samples into several groups based on these properties, and then
> > > derive a QSAR model for each group.
> > >...
> >
> > When all properties are non-numerical (e.g. property 1 takes values
> > A or B or C, property 2 takes values red or green or blue, property 3
> > takes values alpha or beta or gamma or delta, etc.), there is
> > a classification method able to compute the optimal partition,
> > including the number of classes: very few methods can do this.
> > Freeware with reference and documentation:
> > http://petitjeanmichel.free.fr/itoweb.petitjean.freeware.html#POP
>
"E.L. Willighagen" <e.willighagen*at*science.ru.nl> replied:
> Classification and Regression Trees (CART) can do that... there are two or
> three packages (one is tree) for R available. See http://cran.r-project.org/.

Hierarchical classification methods need to cut the tree. The problem is
that the cut is made with the help of arbitrary parameters, which are NOT
computed from the data alone. E.g., in CART, the decision to split a group
needs a test, which is most of the time based on an arbitrary value set by
the user. This means that the final number of classes depends on an
arbitrary selection of values made by the user. But even an experienced
user cannot be sure of making a suitable selection of parameters.
The POP freeware above works without "external" parameters, and computes
the number of classes from the data only.
This dependence on external parameters occurs in many molecular modeling
problems, and also in many other fields.
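
As a rough illustration of what "computing the number of classes from the
data only" can mean for categorical properties, here is a minimal Python
sketch. It is NOT the POP program and makes several assumptions that are not
in the message: the toy samples are invented, and the score is a
Condorcet-style majority-agreement criterion (a pair of samples placed in the
same class contributes the number of properties on which they agree minus the
number on which they differ). The message does not say which criterion POP
optimizes. The sketch simply enumerates every partition, which is feasible
only for a handful of samples.

```python
from itertools import combinations

# Hypothetical toy data: each sample is a tuple of categorical properties.
samples = [
    ("A", "red", "alpha"),
    ("A", "red", "beta"),
    ("A", "red", "alpha"),
    ("B", "blue", "gamma"),
    ("B", "blue", "delta"),
    ("C", "green", "gamma"),
]


def pair_score(x, y):
    """Agreements minus disagreements between two samples (assumed criterion)."""
    agree = sum(a == b for a, b in zip(x, y))
    return agree - (len(x) - agree)


def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        # Put `first` into each existing block in turn...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ...or into a new block of its own.
        yield [[first]] + part


def partition_score(part, data):
    """Sum of pair scores over all pairs that share a block."""
    return sum(
        pair_score(data[i], data[j])
        for block in part
        for i, j in combinations(block, 2)
    )


def best_partition(data):
    """Exhaustive search; no number of classes or threshold is supplied."""
    indices = list(range(len(data)))
    best = max(set_partitions(indices), key=lambda p: partition_score(p, data))
    return sorted(sorted(block) for block in best)


best = best_partition(samples)
print(best, partition_score(best, samples))
```

On this toy data the search returns three classes, [[0, 1, 2], [3, 4], [5]],
with a score of 6, without the user supplying a class count or a split
threshold. The number of partitions grows very quickly with the number of
samples, so anything beyond a toy case needs a smarter optimization than this
enumeration.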
Actually, a number of descriptive statisticians are working on this problem
in the case of numerical variables. It is far from being solved.
The solution known for categorical variables is due to F. Marcotorchino
in 1981. The scientific community is waiting for an elegant solution in
the numerical case.

Michel Petitjean,                     Email: petitjean*at*itodys.jussieu.fr
ITODYS (CNRS, UMR 7086)                      ptitjean*at*ccr.jussieu.fr
1 rue Guy de la Brosse                Phone: +33 (0)1 44 27 48 57
75005 Paris, France.                  FAX  : +33 (0)1 44 27 68 14
http://petitjeanmichel.free.fr/itoweb.petitjean.html