Alexander Tropsha, Weifan Zheng, and Sung Jin Cho

Alexander Tropsha

The Development and Comparative Analysis of Variable Selection QSAR Methods.

The Laboratory for Molecular Modeling, School of Pharmacy, CB # 7360, Beard Hall, University of North Carolina, Chapel Hill, NC 27599, USA.

The rapid growth of chemical databases comprising compounds with known biological or toxicological activities requires the development of fast and automated QSAR methods. We compare several 3D and 2D QSAR methods, mostly developed in this laboratory, in terms of their robustness, efficiency, and applicability to database mining. q²-Guided Region Selection (q²-GRS) optimizes the performance of 3D Comparative Molecular Field Analysis (CoMFA); however, it does not solve the key problem of CoMFA, i.e., alignment. The 2D QSAR methods include Genetic Algorithms -- Partial Least Squares (GA-PLS) and K-Nearest Neighbors (KNN). Both methods are alignment-free and employ multiple topological descriptors of chemical structures together with stochastic optimization algorithms to develop robust QSAR models, which are characterized by the highest value of cross-validated R² (q²). The GA-PLS method uses a combination of Genetic Algorithms and PLS to evolve an initial library of QSAR equations into a final library with the highest average q². The KNN method formally employs the active analog principle: it predicts the activity of a compound as the average activity of the K most chemically similar compounds, using the optimized subset of descriptors to characterize the similarity. These QSAR methods can be used to search for bioactive compounds in chemical databases on the basis of either (i) their (high) activity predicted from the QSAR model, or (ii) their similarity to a probe (lead molecule) evaluated using only the variables selected by the QSAR model. Using several training datasets, we show that all QSAR methods generate models of comparable quality. However, owing to their relative simplicity and higher degree of automation, the 2D methods are substantially better suited to the mining of very large datasets, e.g., those provided by high-throughput screening.
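The KNN prediction and the q² criterion described above can be illustrated with a minimal sketch. It is not the authors' implementation: the Euclidean distance, the leave-one-out form of cross-validation, the function names (`knn_predict`, `loo_q2`), and the toy descriptor values are all assumptions made for illustration, and the stochastic search over descriptor subsets is omitted. Here q² is computed as 1 - PRESS / SS, where PRESS is the sum of squared leave-one-out prediction errors and SS is the sum of squared deviations of the activities from their mean.

```python
import math


def euclidean(a, b, selected):
    """Distance between two compounds using only the selected descriptor columns."""
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in selected))


def knn_predict(query, descriptors, activities, selected, k, exclude=None):
    """Predict activity as the mean activity of the k most similar compounds.

    `exclude` is the index of a training compound to leave out (used for
    leave-one-out cross-validation).
    """
    ranked = sorted(
        (euclidean(query, row, selected), idx)
        for idx, row in enumerate(descriptors)
        if idx != exclude
    )
    neighbors = ranked[:k]
    return sum(activities[idx] for _, idx in neighbors) / len(neighbors)


def loo_q2(descriptors, activities, selected, k):
    """Leave-one-out cross-validated R^2 (q^2) = 1 - PRESS / SS."""
    mean_y = sum(activities) / len(activities)
    press = 0.0
    for i, row in enumerate(descriptors):
        predicted = knn_predict(row, descriptors, activities, selected, k, exclude=i)
        press += (activities[i] - predicted) ** 2
    ss_total = sum((y - mean_y) ** 2 for y in activities)
    return 1.0 - press / ss_total


# Hypothetical toy data: column 0 tracks activity, column 1 is unrelated noise.
descriptors = [
    [1.0, 9.0], [1.1, 2.0],
    [2.0, 8.5], [2.1, 1.5],
    [3.0, 9.5], [3.1, 0.5],
]
activities = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]

# Compare two candidate descriptor subsets by their q^2.
q2_informative = loo_q2(descriptors, activities, selected=[0], k=1)
q2_all = loo_q2(descriptors, activities, selected=[0, 1], k=1)
print("q2 with descriptor 0 only:", q2_informative)
print("q2 with both descriptors: ", q2_all)
```

In this toy case the subset containing only the informative descriptor gives the higher q², which is the quantity a variable selection procedure would try to maximize. The same `knn_predict` function, restricted to the selected descriptors, could in principle score database compounds against a training set.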
