CCL: Re: Software for pattern recognition in QSAR studies?



To: chemistry*at*ccl.net
 Subj: CCL: Re: Software for pattern recognition in QSAR studies?
 > > Renxiao Wang wrote:
 > > > I am looking for a program that can apply standard pattern recognition
 > > > techniques. Basically, I want to study a number of samples, each of
 > > > which can be characterized by some properties. I would like to classify
 > > > these samples into several groups based on these properties, and then
 > > > derive a QSAR model for each group.
 > > >...
 > >
 > > When all properties are non-numerical (e.g. property 1 takes values
 > > A or B or C, property 2 takes value red or green or blue, property 3
 > > takes values: alpha or beta or gamma or delta, etc...), there is
 > > a classification method able to compute the optimal partition,
 > > including the number of classes: very few methods can do this.
 > > Freeware with reference and documentation:
 > > http://petitjeanmichel.free.fr/itoweb.petitjean.freeware.html#POP
 > "E.L. Willighagen" <e.willighagen*at*science.ru.nl> replied:
 > Classification and Regression Trees (CART) can do that... there are two or
 > three packages (one is tree) for R available. See http://cran.r-project.org/.
 Hierarchical classification methods need to cut the tree. The problem
 is that the cut relies on arbitrary parameters, which are NOT computed
 from the data alone. E.g., in CART, the decision to split a group
 requires a test, which most of the time is based on an arbitrary value
 set by the user. It means that the final number of classes depends on
 values chosen arbitrarily by the user, and even an experienced user
 cannot be sure of choosing suitable ones. The POP freeware above works
 without "external" parameters, and computes the number of classes from
 the data only. This dependence on external parameters occurs in many
 molecular modeling problems, and also in many other fields.
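
 To make the dependence on a user-set cut concrete, here is a minimal
 sketch in Python. It is not CART and not any particular package: it
 cuts a single-linkage tree built on the Hamming distance, and the toy
 samples, the function names and the thresholds are all invented for
 illustration. The same data yield 5, 3, 2 or 1 classes depending only
 on the threshold the user picks.

```python
from itertools import combinations


def hamming(a, b):
    """Number of properties on which two samples differ."""
    return sum(x != y for x, y in zip(a, b))


def cut_single_linkage(samples, threshold):
    """Classes obtained by cutting a single-linkage tree at `threshold`.

    Cutting a single-linkage tree at a given height is equivalent to taking
    the connected components of the graph that links every pair of samples
    whose distance is <= threshold, so a small union-find is enough here.
    """
    parent = list(range(len(samples)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(samples)), 2):
        if hamming(samples[i], samples[j]) <= threshold:
            parent[find(i)] = find(j)

    classes = {}
    for i in range(len(samples)):
        classes.setdefault(find(i), []).append(i)
    return sorted(classes.values())


# Hypothetical toy data: three categorical properties per sample.
samples = [
    ("A", "red", "alpha"),
    ("A", "red", "beta"),
    ("B", "green", "gamma"),
    ("B", "green", "delta"),
    ("C", "blue", "alpha"),
]

# Number of classes as a function of the user-chosen threshold.
class_counts = {t: len(cut_single_linkage(samples, t)) for t in range(4)}
print(class_counts)
```

 Nothing in the data says which of these thresholds is the right one.
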
 Actually, a number of descriptive statisticians are working on this problem
 in the case of numerical variables. It is far from being solved.
 The known solution for categorical variables is due to
 F. Marcotorchino in 1981. The scientific community is still waiting
 for an elegant solution in the numerical case.
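
 For the categorical case, a parameter-free criterion can be sketched
 as follows. I assume here a Condorcet-type criterion of the kind used
 in relational analysis: a pair of samples put in the same class scores
 the number of properties on which they agree, and a pair put in
 different classes scores the number on which they disagree. This is
 only an illustration under that assumption, not the POP code; the toy
 data and function names are invented, and the exhaustive enumeration
 is usable only for a handful of samples.

```python
from itertools import combinations


def agreements(a, b):
    """Number of properties on which two samples take the same value."""
    return sum(x == y for x, y in zip(a, b))


def condorcet_score(samples, labels):
    """Assumed Condorcet-type criterion for the partition given by `labels`."""
    m = len(samples[0])  # number of properties
    score = 0
    for i, j in combinations(range(len(samples)), 2):
        c = agreements(samples[i], samples[j])
        # Same class: reward agreements. Different classes: reward disagreements.
        score += c if labels[i] == labels[j] else m - c
    return score


def partitions(n):
    """Yield every partition of n items, each once, as a list of class labels."""
    def extend(labels, used):
        if len(labels) == n:
            yield list(labels)
            return
        # An item may join an existing class (0..used-1) or open a new one.
        for k in range(used + 1):
            labels.append(k)
            yield from extend(labels, max(used, k + 1))
            labels.pop()
    yield from extend([], 0)


# Hypothetical toy data: three categorical properties per sample.
samples = [
    ("A", "red", "alpha"),
    ("A", "red", "beta"),
    ("B", "green", "gamma"),
    ("B", "green", "delta"),
    ("C", "blue", "alpha"),
]

# No threshold and no preset number of classes: just maximize the criterion.
best = max(partitions(len(samples)), key=lambda lab: condorcet_score(samples, lab))
print(best, len(set(best)), condorcet_score(samples, best))
```

 On this toy set the maximum is reached for three classes, and that
 number comes out of the optimization instead of being supplied by
 the user.
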
 Michel Petitjean,                     Email: petitjean*at*itodys.jussieu.fr
 ITODYS (CNRS, UMR 7086)                      ptitjean*at*ccr.jussieu.fr
 1 rue Guy de la Brosse                Phone: +33 (0)1 44 27 48 57
 75005 Paris, France.                  FAX  : +33 (0)1 44 27 68 14
 http://petitjeanmichel.free.fr/itoweb.petitjean.html