Peter C. Jurs and Donald V. Eldred

Prediction of Chemical or Biological Properties of Organic Compounds from Molecular Structure

Chemistry Department, Penn State University, University Park, PA 16802, USA.

The relationships between the molecular structures of organic compounds and their chemical or biological properties can be investigated using quantitative structure-property relationship (QSPR) methods. This approach uses induction to seek generalities by examining a large set of compounds (a training set) for which the property of interest is known. Such QSPR studies involve three major activities: representation, feature selection, and mapping. Representation involves calculating molecular structure descriptors to encode the compounds. General classes of descriptors include topological, geometric, electronic, and hybrid representations of the molecules. Topological descriptors are calculated directly from the connection table representation of the structure and employ methods drawn from mathematical graph theory. Geometric descriptors are calculated from three-dimensional molecular models, which are generated with molecular orbital methods. Electronic descriptors come from empirical or molecular orbital calculations. Hybrid descriptors are calculated using several of these representations, and they encode the molecule's ability to participate in polar interactions or hydrogen bonding. Intermediate between representation and mapping is feature selection, or descriptor selection, which involves finding the most informative subsets of descriptors in the descriptor pool using statistical methods, simulated annealing, or the genetic algorithm. Mapping involves analysis of the descriptors using multivariate statistical methods or computational neural networks (CNNs) to build mathematical models linking the descriptors directly to the chemical property under investigation. Once developed from a training set, these models can be used to predict the property of interest for unknown compounds. Monte Carlo experiments are used to explore the role of chance in the generation of the models.
This general QSPR methodology can be applied, in principle, to any chemical property that is determined by the molecular structure, and it has been applied to a wide variety of chemical properties.
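
As an illustration of how a topological descriptor can be computed from a connection table alone, the sketch below evaluates the Wiener index (the sum of shortest-path bond counts over all atom pairs) for two hydrogen-suppressed C4 skeletons. The choice of the Wiener index and the adjacency-dictionary encoding are assumptions made for this example; the abstract does not name the specific descriptors used.

```python
from collections import deque

def wiener_index(adjacency):
    """Sum of shortest-path bond counts over all unordered atom pairs."""
    total = 0
    for start in adjacency:
        # Breadth-first search gives bond-count distances from `start`.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for nbr in adjacency[atom]:
                if nbr not in dist:
                    dist[nbr] = dist[atom] + 1
                    queue.append(nbr)
        total += sum(dist.values())
    return total // 2  # each pair was counted once from each end

# Hypothetical connection tables (heavy atoms only), written as adjacency lists.
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}

print(wiener_index(n_butane))   # 10
print(wiener_index(isobutane))  # 9
```

The two isomers have the same formula but different index values, which is the kind of structural information a topological descriptor is meant to encode.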

Recent investigations involving computational neural networks and genetic algorithms serve as examples of the application of the QSPR methods. Three-layer, feed-forward neural networks trained with a quasi-Newton method have provided excellent results in several QSPR studies. The genetic algorithm has been shown to be very effective in performing descriptor selection. Specific applications to be discussed include an aqueous solubility study and a pesticide toxicity study.
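
Descriptor selection with a genetic algorithm can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the fitness function (rms error of a least-squares linear fit), the crossover and mutation operators, the population settings, and the synthetic data are all hypothetical. The quasi-Newton network training is not shown.

```python
import random

def fit_rms(X, y, cols):
    """RMS error of a least-squares linear fit of y on the chosen columns plus an intercept."""
    rows = [[1.0] + [x[c] for c in cols] for x in X]
    p = len(cols) + 1
    # Augmented normal equations [R^T R | R^T y].
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    for k in range(p):  # Gaussian elimination with partial pivoting
        piv = max(range(k, p), key=lambda i: abs(M[i][k]))
        if abs(M[piv][k]) < 1e-12:
            return float("inf")  # singular system: treat the subset as unusable
        M[k], M[piv] = M[piv], M[k]
        for i in range(k + 1, p):
            f = M[i][k] / M[k][k]
            for j in range(k, p + 1):
                M[i][j] -= f * M[k][j]
    b = [0.0] * p
    for k in reversed(range(p)):  # back substitution
        b[k] = (M[k][p] - sum(M[k][j] * b[j] for j in range(k + 1, p))) / M[k][k]
    resid = [t - sum(bi * ri for bi, ri in zip(b, r)) for r, t in zip(rows, y)]
    return (sum(e * e for e in resid) / len(y)) ** 0.5

def ga_select(X, y, k, pop_size=20, generations=30, seed=0):
    """Search for a k-descriptor subset with low fit_rms; returns (subset, best-score history)."""
    rng = random.Random(seed)
    n = len(X[0])

    def score(subset):
        return fit_rms(X, y, subset)

    pop = [tuple(sorted(rng.sample(range(n), k))) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=score)
        history.append(score(pop[0]))
        parents = pop[: pop_size // 2]  # truncation selection; the best subset always survives
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = set(rng.sample(sorted(set(a) | set(b)), k))  # crossover on the parents' union
            if rng.random() < 0.3:  # mutation: swap one descriptor for another
                child.discard(rng.choice(sorted(child)))
                child.add(rng.choice([c for c in range(n) if c not in child]))
            children.append(tuple(sorted(child)))
        pop = parents + children
    pop.sort(key=score)
    history.append(score(pop[0]))
    return pop[0], history

# Synthetic pool of 8 descriptors for 40 "compounds"; only columns 1, 4 and 6 matter.
data_rng = random.Random(1)
X = [[data_rng.uniform(-1, 1) for _ in range(8)] for _ in range(40)]
y = [2.0 * x[1] - 1.5 * x[4] + 0.5 * x[6] + data_rng.gauss(0, 0.05) for x in X]

best, history = ga_select(X, y, k=3)
print(best, round(history[-1], 3))
```

Because the better half of each generation is carried over unchanged, the best score in `history` can never get worse from one generation to the next.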

The aqueous solubility study involves a set of 332 diverse organic compounds taken from the Aquasol database. The solubility was expressed as -log(molarity) and spanned the range from -2 to +12 log units. Successful models were found using the genetic algorithm and computational neural networks. A nonlinear CNN 9-6-1 model using nine calculated structural descriptors (four topological, one geometric, three charged partial surface areas, and one electronic descriptor) was developed that had an rms error of 0.39 log units for the 265-compound training set, an rms error of 0.36 log units for the 30-compound cross-validation set, and an rms error of 0.34 log units for the 32-compound external prediction set.
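
To make the 9-6-1 notation and the rms error concrete, here is a minimal sketch of a three-layer feed-forward pass, the number of adjustable parameters, and the rms error calculation. It assumes sigmoid hidden units, a linear output unit, and one bias per hidden and output unit; the abstract does not state these details, and the function names are hypothetical.

```python
import math

def n_parameters(layers, biases=True):
    """Count weights (and, optionally, one bias per non-input unit) in a fully connected network."""
    extra = 1 if biases else 0
    return sum((n_in + extra) * n_out for n_in, n_out in zip(layers, layers[1:]))

def forward(x, W1, b1, W2, b2):
    """One pass through an input-hidden-output network (sigmoid hidden layer, linear output)."""
    hidden = [1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(row, x)) + b)))
              for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(W2, hidden)) + b2

def rms_error(predicted, observed):
    """Root-mean-square difference between predicted and observed values."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))

# 9 descriptors in, 6 hidden units, 1 output; with biases: (9 + 1) * 6 + (6 + 1) * 1
print(n_parameters([9, 6, 1]))  # 67

# With all weights zero, the output reduces to the output bias (an untrained placeholder).
W1 = [[0.0] * 9 for _ in range(6)]
print(forward([0.1] * 9, W1, [0.0] * 6, [0.0] * 6, 0.25))  # 0.25
```

Under these assumptions the 9-6-1 architecture has 67 adjustable parameters, which is small relative to the 265-compound training set.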

The toxicity study involves the successful prediction of rat acute toxicity LD50 values [0 < -log(LD50) < 4] for 54 pesticides that contain the structural features R3P=O or R3P=S. The data were taken from three papers by T. B. Gaines. Successful models were generated using the combination of genetic algorithm feature selection and computational neural networks. The best model used seven molecular structure descriptors (four geometric and three topological descriptors) and had an rms error of 0.20 log units for the 44-member training set, an rms error of 0.23 log units for the five-member cross-validation set, and an rms error of 0.25 log units for the five-member external prediction set.
