William C. Herndon, Hung-Ta Chen, Gabrielle Rum, and Yumei Zhang
QSAR Case Studies Using a Generalized Model for Molecular Similarity Analysis: (1) Antimalarial Activities of Phenanthrene-(Alkylamino)Carbinols, (2) Carcinogenic Activities of Polycyclic Aromatic Hydrocarbons
Department of Chemistry, The University of Texas at El Paso, El Paso, Texas 79968
Several QSAR methodologies have been developed that use hierarchical sets of molecular descriptors and multilinear regression analysis of physical or biological properties. Our procedures advance through enumerations of types of atoms and bonds (level 1), rings and functional groups (level 2), and larger structural fragments and steric interactions (level 3), and end by testing the addition of level 4 descriptors based on the results of semiempirical or ab initio molecular orbital calculations. Experimental properties (e.g., logP and boiling points) are an additional possible source of descriptors, not tested in the present work. In general, the levels of hierarchical structural descriptors are augmented and tested sequentially to identify the lowest levels of description necessary for a statistically significant correlation of a particular dependent variable property. High-quality structure/property and structure/activity relationships are normally found, and they use significant terms from several descriptor levels [1-5].
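The sequential testing of descriptor levels can be sketched as a comparison of nested least-squares models. The following is only an illustration under stated assumptions: the function names, the synthetic data, and the use of a partial F statistic as the significance measure are not taken from the original work, which does not specify its test.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares and parameter count of an OLS fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid), A.shape[1]

def sequential_level_test(levels, y):
    """Add descriptor levels one at a time (hypothetical helper).

    `levels` is a list of (N, m_k) arrays, lowest level first.
    Returns (level, number of added descriptors, partial F) for each level.
    """
    n = len(y)
    rss_prev = float(((y - y.mean()) ** 2).sum())  # intercept-only model
    p_prev = 1
    X = np.empty((n, 0))
    results = []
    for k, block in enumerate(levels, start=1):
        X = np.column_stack([X, block])
        rss_new, p_new = rss(X, y)
        added = p_new - p_prev
        f_stat = ((rss_prev - rss_new) / added) / (rss_new / (n - p_new))
        results.append((k, added, f_stat))
        rss_prev, p_prev = rss_new, p_new
    return results

# Synthetic data (assumption): the property depends on a level 1 and a level 3 descriptor.
rng = np.random.default_rng(0)
n = 60
level1 = rng.normal(size=(n, 2))
level2 = rng.normal(size=(n, 2))
level3 = rng.normal(size=(n, 1))
y = 5.0 * level1[:, 0] + 1.0 * level3[:, 0] + 0.1 * rng.normal(size=n)

results = sequential_level_test([level1, level2, level3], y)
for level, added, f_stat in results:
    print(f"level {level}: +{added} descriptors, partial F = {f_stat:.1f}")
```

A large partial F for a level suggests that the level contributes beyond the lower ones; a small value suggests that the lower levels already suffice for that property.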
In previous work we have also shown how various types of molecular structure codes or molecular descriptors can be used to calculate measures of molecular similarity [6-9]. In this presentation, we describe a simpler and more general protocol for obtaining molecular similarity measures for an arbitrary set of compounds, either globally or at any chosen level of molecular structure analysis or description. The starting point for the analysis is the usual N-by-M data matrix, where N is the number of rows (compounds) and M is the number of columns (numerical descriptor values). The Pearson correlation matrix of this data table is an M-by-M square matrix that describes the linear correlations of the descriptors with each other over the set of N compounds. In many previous applications, the Pearson correlation matrix has been used to select subsets of descriptors as trial independent variables in QSAR multilinear regression studies.
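As a minimal sketch of the conventional use of the M-by-M correlation matrix, the snippet below computes it with NumPy and drops descriptors that are nearly collinear with one already kept. The data are synthetic, and the helper `select_descriptors` and its 0.95 threshold are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 20, 4                      # 20 compounds, 4 descriptors (synthetic)
X = rng.normal(size=(N, M))
# Make descriptor 3 nearly collinear with descriptor 0.
X[:, 3] = 2.0 * X[:, 0] + 0.01 * rng.normal(size=N)

# Columns are the variables, so the result is M-by-M.
R = np.corrcoef(X, rowvar=False)

def select_descriptors(R, threshold=0.95):
    """Keep a descriptor only if |r| with every kept descriptor is below threshold."""
    keep = []
    for j in range(R.shape[0]):
        if all(abs(R[j, k]) < threshold for k in keep):
            keep.append(j)
    return keep

kept = select_descriptors(R)
print(R.shape, kept)
```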
The Pearson correlation matrix methodology can also be employed to define a (molecular) similarity matrix for the set of N compounds as follows. In the first step, the descriptor data matrix is standardized by subtracting the mean of each descriptor and dividing by its standard deviation. This puts all the descriptors on a common scale, so that descriptors with large numerical values do not exert undue influence. Then, for N compounds, an N-by-N similarity matrix is defined as the Pearson correlation matrix of the transpose of the standardized matrix of the M molecular structure descriptors. Each column of this similarity matrix contains the numerical similarities (+) or dissimilarities (-) of all compounds to a single compound. Multilinear regression analysis is then used to identify statistically significant similarities and dissimilarities to a small set of reference molecules, which provide the independent variables for a QSAR model equation [8,9].
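The two steps just described (standardize the descriptors, then correlate compounds instead of descriptors) can be written in a few lines. This is a sketch with synthetic data; the function name `similarity_matrix` is hypothetical, and it assumes that constant descriptor columns have already been removed so that no standard deviation is zero.

```python
import numpy as np

def similarity_matrix(X):
    """N-by-N Pearson correlation matrix of the standardized descriptor rows.

    X is an (N compounds) x (M descriptors) array. Assumes no descriptor
    column is constant (otherwise the standard deviation is zero).
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardize each descriptor
    # np.corrcoef treats each row as a variable, so passing Z (compounds in rows)
    # is the same as taking the correlation matrix of the transposed table.
    return np.corrcoef(Z)

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 6))       # 5 compounds, 6 descriptors (synthetic)
X[4] = X[0]                       # compound 4 duplicates compound 0
S = similarity_matrix(X)
print(np.round(S, 2))
```

Positive entries indicate similar descriptor profiles and negative entries dissimilar ones; the duplicated compound has a similarity of 1 to its twin.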
These concepts are illustrated with two examples. The first data set comprises antimalarial activities of 208 phenanthrene derivatives containing a variety of substituent groups. The first three levels of hierarchical descriptors are determined directly from molecular structure drawings. The fourth level consists of quantum mechanical descriptors derived from AM1 calculations using the QSAR module of the SPARTAN software. The molecular similarity matrix is generated as outlined above. Similarities to particular molecules are chosen as independent variables in a QSAR equation by stepwise regression analysis. Cross-validated predictions of the antimalarial activities are obtained using the leave-one-out methodology, giving results comparable or superior to those from previous studies. Numerical similarities and dissimilarities to the reference structures defined by this procedure can be used to predict antimalarial activities for new compounds.
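The selection of reference compounds and the leave-one-out validation can be sketched as follows. This is not the authors' procedure: the abstract does not give the entry criterion of its stepwise regression, so this sketch swaps in a greedy forward selection that maximizes the leave-one-out q² directly. All names and the synthetic data are assumptions, and a q² that is also used for selection is optimistic.

```python
import numpy as np

def loo_q2(X, y):
    """Leave-one-out cross-validated q^2 for an OLS model with intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    press = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        coef, *_ = np.linalg.lstsq(A[mask], y[mask], rcond=None)
        press += (y[i] - A[i] @ coef) ** 2
    return 1.0 - press / float(((y - y.mean()) ** 2).sum())

def forward_select(S, y, max_refs=3):
    """Greedily pick similarity columns (reference compounds) that raise q^2."""
    chosen, best = [], -np.inf
    while len(chosen) < max_refs:
        scores = {j: loo_q2(S[:, chosen + [j]], y)
                  for j in range(S.shape[1]) if j not in chosen}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best:
            break                      # no further improvement
        chosen.append(j_best)
        best = scores[j_best]
    return chosen, best

# Synthetic example (assumption): activity depends on similarity to compounds 3 and 7.
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 40))                      # 30 compounds, 40 descriptors
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
S = np.corrcoef(Z)                                 # 30-by-30 similarity matrix
y = 3.0 * S[:, 3] - 2.0 * S[:, 7] + 0.01 * rng.normal(size=30)

chosen, q2 = forward_select(S, y)
print("reference compounds:", chosen, "q2 =", round(q2, 3))
```

To predict a new compound with such a model, one would compute its similarities to the chosen reference compounds and insert them into the fitted equation.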
This protocol is also tested with biological data consisting of animal studies of carcinogenic activities of polycyclic aromatic hydrocarbons (PAHs) containing a large variety of alkyl substituents. A detailed review of the extant animal assay data through 1991 (210 active compounds of 312 tested) was undertaken, and an index of carcinogenicity was assigned to every compound for which the latent periods were measured (90 compounds). The carcinogenicity index is defined analogously to the Iball index, proportional to the percent of animals developing cancer and inversely proportional to the latent period, except that experiments with promoters are weighted with a factor of 0.5. As before, the first three levels of descriptors are derived from molecular structure drawings, and a fourth level consists of quantum descriptors derived from AM1 calculations. The entire set of descriptors is used to calculate a similarity matrix for the 90-compound set, which provides the independent variables (similarities to particular compounds) for the final QSAR model equation. Cross-validated correlations of the carcinogenesis data are very good, especially for the more active compounds. Some very weakly active compounds are predicted to be inactive by this procedure.
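The index described above might be computed as in the sketch below. The text gives only the proportionalities and the promoter weight of 0.5; the scale factor of 100 and the use of days for the latent period follow the usual form of the Iball index and are assumptions here, as are the function and parameter names.

```python
def carcinogenicity_index(percent_with_tumors, latent_period_days,
                          promoter_used=False, scale=100.0):
    """Iball-type index: scale * (percent of animals with tumors) / (latent period).

    Experiments that used a promoter are weighted by 0.5, as stated in the text.
    `scale` and the time unit are assumptions, not taken from the abstract.
    """
    if latent_period_days <= 0:
        raise ValueError("latent period must be positive")
    weight = 0.5 if promoter_used else 1.0
    return weight * scale * percent_with_tumors / latent_period_days

# Hypothetical assay: 80% of animals develop tumors, mean latent period 200 days.
index_plain = carcinogenicity_index(80.0, 200.0)
index_promoted = carcinogenicity_index(80.0, 200.0, promoter_used=True)
print(index_plain, index_promoted)  # 40.0 20.0
```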
[1] M. Garbalena and W. C. Herndon, "Graph Theoretical Models for Enthalpic Properties of Alkanes." J. Chem. Inf. Comp. Sci., 32, 37-42 (1992).
[2] W. C. Herndon and S. L. Knott, "Structure/Enthalpy Relationships for Hydrocarbons Containing Benzene Rings." Polycycl. Arom. Compds., 11, 229-236 (1996).
[3] U. J. Urquidi, "Structure/Property and Structure/Activity Analyses of PCBs, PCDDs, and PCDFs." M. S. Thesis (Univ. of Texas at El Paso, Dec., 1994).
[4] H.-T. Chen, "Structure/Activity Analyses of Antimalarial Compounds." M. S. Thesis (Univ. of Texas at El Paso, Dec., 1995).
[5] Y. Zhang, "Studies of Aromatic Hydrocarbon Carcinogenicity." M. S. Thesis (Univ. of Texas at El Paso, Dec., 1996).
[6] W. C. Herndon and S. H. Bertz, "Linear Notations and Molecular Graph Similarity." J. Comp. Chem., 8, 367-374 (1987).
[7] A. J. Bruce, "Benzenoid Carcinogenicity and Abstract Definitions of Molecular Similarity." B. S. Honors Thesis (Univ. of Texas at El Paso, Aug., 1990).
[8] G. Rum and W. C. Herndon, "Molecular Similarity Concepts 5. Analysis of Steroid-Protein Binding Constants." J. Am. Chem. Soc., 113, 9055-9060 (1991).
[9] W. C. Herndon and G. Rum, "Three-Dimensional Topological Descriptors and Similarity of Molecular Structures: Binding Affinities of Corticosteroids." QSAR and Molecular Modeling, Prous Science Publishers, Madrid, 1996, pp. 380-384.
[10] K. H. Kim, C. Hansch, J. Y. Fukunaga, E. S. Steller, P. Y. C. Jow, P. N. Craig, and J. Page, "Quantitative Structure-Activity Relationships in 1-Aryl-2-(alkylamino)ethanol Antimalarials." J. Med. Chem., 22, 366-391 (1979).
[11] "Survey of Compounds Which Have Been Tested For Carcinogenic Activity." Public Health Service Publication No. 149. 15 volumes and two supplements, 1951-1992.