**Peter C. Jurs and Donald V. Eldred**

*Prediction of Chemical or Biological Properties
of Organic Compounds from Molecular Structure*

`Chemistry Department,
Penn State University, University Park, PA 16802, USA.`

The relationships between the molecular structures
of organic compounds and their chemical or biological
properties can be investigated using quantitative
structure-property relationship (QSPR) methods. This
approach uses induction to seek generalities by examining
a large set of compounds (a training set) for which the
property of interest is known. Such QSPR studies involve
three major activities: *representation*, *feature
selection*, and *mapping*. *Representation*
involves calculating molecular structure descriptors to
encode the compounds. General classes of descriptors include
topological, geometric, electronic, and hybrid representations
of the molecules. Topological descriptors are calculated
directly from the connection table representation
of the structure and employ methods drawn from
mathematical graph theory. Geometric descriptors
are calculated from three-dimensional molecular
models which are generated with molecular orbital
methods. Electronic descriptors come from empirical
or molecular orbital calculations. Hybrid descriptors are
calculated using several of these representations, and
they encode the molecule's ability to participate
in polar interactions or hydrogen bonding. Intermediate
between representation and mapping is *feature selection*,
or descriptor selection, which involves finding the most
informative subsets of descriptors from those in the
descriptor pool using statistical methods, simulated
annealing, or the genetic algorithm. *Mapping*
involves analysis of the descriptors using multivariate
statistical methods or computational neural networks (CNNs)
to build mathematical models linking the descriptors directly
to the chemical property under investigation. Once developed
from a training set, these models can be used to predict the
property of interest for compounds whose values are unknown.
Monte Carlo experiments are used to explore the
role of chance in the generation of the models. This general
QSPR methodology can be applied, in principle, to any chemical
property that is determined by the molecular structure,
and it has been applied to a wide variety of chemical properties.
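
As a small illustration of the representation step, the sketch below computes one well-known graph-theoretical descriptor, the Wiener index, directly from a connection table. The abstract does not say which topological descriptors were used, so the choice of the Wiener index, the dictionary layout of the connection table, and the function name `wiener_index` are assumptions made only for this example.

```python
from collections import deque

def wiener_index(connection_table):
    """Sum of shortest-path bond counts over all atom pairs (hydrogens omitted)."""
    total = 0
    for start in connection_table:
        # Breadth-first search gives the topological distance from `start` to every atom.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for neighbor in connection_table[atom]:
                if neighbor not in dist:
                    dist[neighbor] = dist[atom] + 1
                    queue.append(neighbor)
        total += sum(dist.values())
    return total // 2  # every pair was counted once from each end

# Hypothetical connection tables: atom index -> indices of bonded heavy atoms.
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}

print(wiener_index(n_butane))   # 10
print(wiener_index(isobutane))  # 9
```

The two isomers have the same formula but different indices, which is the kind of structural information a topological descriptor is meant to carry.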

Recent investigations involving computational neural networks and genetic algorithms illustrate the application of these QSPR methods. Three-layer, feed-forward neural networks trained with a quasi-Newton method have provided excellent results in several QSPR studies, and the genetic algorithm has been shown to be very effective in performing descriptor selection. The specific applications discussed are an aqueous solubility study and a pesticide toxicity study.
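
Descriptor selection with a genetic algorithm can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the fitness function (training-set rms error of a linear least-squares fit), the crossover and mutation rules, the parameter values, and the synthetic data are all assumptions, and the names `fit_rms` and `ga_select` are hypothetical.

```python
import random

def fit_rms(X, y, subset):
    """RMS error of a least-squares linear fit using the descriptor columns in `subset`."""
    rows = [[1.0] + [row[j] for j in subset] for row in X]  # intercept + chosen columns
    k = len(rows[0])
    # Augmented normal equations [A^T A | A^T y], solved by Gauss-Jordan elimination.
    M = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))  # partial pivoting
        if abs(M[p][c]) < 1e-12:
            return float("inf")  # collinear descriptors: treat as unfit
        M[c], M[p] = M[p], M[c]
        for r in range(k):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    coef = [M[i][k] / M[i][i] for i in range(k)]
    resid = [sum(w * v for w, v in zip(coef, r)) - t for r, t in zip(rows, y)]
    return (sum(e * e for e in resid) / len(y)) ** 0.5

def ga_select(X, y, n_select, pop_size=30, generations=40, p_mut=0.2, seed=0):
    """Evolve fixed-size descriptor subsets; lower rms error means fitter."""
    rng = random.Random(seed)
    n_desc = len(X[0])
    pop = [tuple(sorted(rng.sample(range(n_desc), n_select))) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda s: fit_rms(X, y, s))
        parents = pop[:pop_size // 2]  # the fitter half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Crossover: draw the child's descriptors from the union of both parents.
            child = set(rng.sample(sorted(set(a) | set(b)), n_select))
            if rng.random() < p_mut:
                # Mutation: swap one descriptor for one not currently in the subset.
                child.discard(rng.choice(sorted(child)))
                child.add(rng.choice([j for j in range(n_desc) if j not in child]))
            children.append(tuple(sorted(child)))
        pop = parents + children
    return min(pop, key=lambda s: fit_rms(X, y, s))

# Synthetic pool of 8 descriptors; only columns 1, 4, and 6 influence the property.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(8)] for _ in range(60)]
y = [2.0 * r[1] - 1.5 * r[4] + 0.5 * r[6] + data_rng.gauss(0, 0.05) for r in X]

best = ga_select(X, y, n_select=3)
print(best, round(fit_rms(X, y, best), 3))
```

Scoring each subset on the training set alone keeps the sketch short. In practice the fitness would more plausibly be evaluated on held-out compounds, so that the search does not reward overfitting.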

The aqueous solubility study involves a set of 332 diverse organic compounds taken from the Aquasol database. Solubility was expressed as -log(molarity) and spanned the range from -2 to +12 log units. Successful models were found using the genetic algorithm and computational neural networks. A nonlinear 9-6-1 CNN model using nine calculated structural descriptors (four topological, one geometric, three charged partial surface area, and one electronic) was developed. It had an rms error of 0.39 log units for the 265-compound training set, 0.36 log units for the 30-compound cross-validation set, and 0.34 log units for the 32-compound external prediction set.
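
To make the 9-6-1 notation concrete, the sketch below builds a three-layer, feed-forward network with nine inputs, six hidden neurons, and one output, and counts its adjustable parameters. The sigmoid activation, the random initial weights, and the presence of bias terms are assumptions; the abstract does not specify them. The quasi-Newton training step is not shown.

```python
import math
import random

def init_network(sizes, seed=0):
    """Return one (weights, biases) pair per layer transition, randomly initialized."""
    rng = random.Random(seed)
    return [([[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)],
             [rng.uniform(-0.5, 0.5) for _ in range(n_out)])
            for n_in, n_out in zip(sizes, sizes[1:])]

def forward(network, descriptors):
    """Propagate one compound's descriptor vector through the network."""
    activations = descriptors
    for weights, biases in network:
        activations = [
            1.0 / (1.0 + math.exp(-(sum(w * a for w, a in zip(row, activations)) + b)))
            for row, b in zip(weights, biases)
        ]
    return activations[0]  # single output neuron

def count_parameters(network):
    """Total number of weights and biases."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in network)

net = init_network([9, 6, 1])
print(count_parameters(net))  # 67 = (9*6 + 6) + (6*1 + 1)
print(forward(net, [0.5] * 9))  # untrained output, between 0 and 1
```

Under these assumptions the network has 67 adjustable parameters, so the 265-compound training set supplies roughly four observations per parameter. Because a sigmoid output lies between 0 and 1, the property values would need to be scaled to that range before training.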

The toxicity study involves the successful prediction
of rat acute toxicity LD_{50} values [0 < -log(LD_{50}) < 4]
for 54 pesticides that contain the structural features R_{3}P=O
or R_{3}P=S. The data were taken from three papers by T. B. Gaines.
Successful models were generated using the combination of genetic algorithm
feature selection and computational neural networks. The best model
used seven molecular structure descriptors (four geometric and three
topological descriptors) and had an rms error of 0.20 log units for
the 44-member training set, an rms error of 0.23 log units for
the five-member cross-validation set, and an rms error of 0.25
log units for the five-member external prediction set.
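
The rms errors quoted above follow the usual definition, the square root of the mean squared difference between observed and predicted values. The sketch below shows that calculation together with one way of dividing 54 compounds into 44, 5, and 5. The random split and the toy values are assumptions for illustration; the abstract does not say how the subsets were chosen.

```python
import math
import random

def rms_error(observed, predicted):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))

def split_indices(n_total, n_cv, n_pred, seed=0):
    """Shuffle compound indices and return (training, cross-validation, prediction) lists."""
    idx = list(range(n_total))
    random.Random(seed).shuffle(idx)
    return idx[n_cv + n_pred:], idx[:n_cv], idx[n_cv:n_cv + n_pred]

train, cv, pred = split_indices(54, 5, 5)
print(len(train), len(cv), len(pred))  # 44 5 5

# Toy values in -log(LD50) units, not data from the study.
print(round(rms_error([1.0, 2.0, 3.0], [1.2, 1.8, 3.2]), 2))  # 0.2
```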
