Application of the SDF_Toolkit:  Combining NCI Open Database structures and biological test data

Introduction

This document describes some of the operations used to generate the downloadable bulk files of NCI Open Database structures and biological test data (cancer and AIDS; see http://cactus.cit.nih.gov/ncidb/download.html for more information). Its aim is to show how to combine the tools of the SDF_Toolkit and to provide tricks and recipes through real examples. All examples should be run on a Unix system.

Input files

All input files are based on the publicly and freely available data from NCI's Developmental Therapeutics Program (DTP,  http://dtp.nci.nih.gov).
We collected the structures and biological data from DTP (cancer data as of August 1999, AIDS data as of October 1999), combined them where applicable, and generated  MDL SD files from this information.

SDF_Toolkit

The SDF_Toolkit can be downloaded at http://cactus.cit.nih.gov/SDF_toolkit/index.html . You'll need version 1.06 or later.

Examples

Merge and remove duplicates from two SD files

Objective:

Merge two SD files and remove the duplicates. Two entries are considered duplicates when they carry the same identifier (a non-chemical data field of the SD file). The identifier (here, the NSC number) must be present in both input files.

Input files:

  • nciopen_LMCH_aug99_0D.sdf : August 1999 SD file without 2D/3D coordinates or stereo information
  • aids_o99_chemical_structs.sdf :  chemical structures  from the DTP site for which AIDS data is available
Commands:

     append_sdf -prop NSC nciopen_LMCH_aug99_0D.sdf aids_o99_chemical_structs.sdf > new.sdf
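
The meaning of "merge and remove duplicates" can be illustrated without the toolkit. The following is a minimal sketch, not the implementation of append_sdf: it assumes that each record ends with a `$$$$` line, that the identifier value sits on the line following the `> <NSC>` header, and that the first record seen for a given identifier is the one kept. The helper names (make_record, dedupe_by_nsc), the file names and the toy records are hypothetical.

```shell
#!/bin/sh
# Toy illustration of "merge two SD files, drop duplicate identifiers".
# Assumption: records end with a "$$$$" line and carry a "> <NSC>" data item.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

make_record() {   # hypothetical helper: $1 = NSC number, molfile block omitted
    printf 'toy molecule %s\n> <NSC>\n%s\n\n$$$$\n' "$1" "$1"
}
{ make_record 1; make_record 2; } > first.sdf
{ make_record 2; make_record 3; } > second.sdf

dedupe_by_nsc() {   # hypothetical helper: keep the first record per NSC value
    awk '
        { record = record $0 "\n" }               # accumulate the current record
        want_value  { id = $0; want_value = 0 }   # line after the "> <NSC>" header
        /^> *<NSC>/ { want_value = 1 }
        /^\$\$\$\$/ {                             # end of record
            if (!(id in seen)) { seen[id] = 1; printf "%s", record }
            record = ""; id = ""
        }
    ' "$@"
}

dedupe_by_nsc first.sdf second.sdf > merged.sdf
merged_count=$(grep -c '^\$\$\$\$' merged.sdf)
echo "records in merged.sdf: $merged_count"
```

In this toy run the record with NSC 2 occurs in both inputs and is written only once, so the merged file holds three records.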

    Merge and remove duplicates from two SD files and make a list of the new entries

    Objective:

    Merge two SD files and remove the duplicates. Two entries are considered duplicates when they carry the same identifier (a non-chemical data field of the SD file). The identifier must be present in both input files. Also make a list of the identifiers of the new entries.

    Input files:

  • nciopen_LMCH_aug99_0D.sdf : August 1999 SD file without 2D/3D coordinates or stereo information
  • aids_o99_chemical_structs.sdf :  chemical structures  from the DTP site for which AIDS data is available
    Commands:

     append_sdf -prop NSC nciopen_LMCH_aug99_0D.sdf aids_o99_chemical_structs.sdf | extract_prop_sdf -prop NSC > temp.list
     42687 entries read and 2212 entries added from aids_o99_chemical_structs.sdf

    tail -2212 temp.list > 2212_oct99.list
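
The tail step works because the entries added from the second file apparently come last in the merged output, so the count reported above (2212) selects exactly the new identifiers. A minimal sketch on a toy list follows; the file names and values are hypothetical, and `tail -n N` is the POSIX spelling of the older `tail -N` form.

```shell
#!/bin/sh
# Toy list: three identifiers from the first file, then two newly added ones.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' 101 102 103 900 901 > toy.list

added=2                                    # count reported for the second file
tail -n "$added" toy.list > toy_new.list   # POSIX form of "tail -2"
cat toy_new.list
```

If the reported count and the argument to tail disagree, the list silently gains or loses identifiers, so it is worth copying the number straight from the message.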
     
     

    Select entries from an SD file using a file containing a list of identifiers

    Objective:

    Select a subset of  an SD file using the NSC number as the identifier.

    Input files:

  • aids_o99_chemical_structs.sdf :  chemical structures  from the DTP site for which AIDS data is available
  • 2212_oct99.list : list of NSC numbers created in the previous example
    Commands:

    select_sdf -labelfile 2212_oct99.list -property_name NSC < aids_o99_chemical_structs.sdf >2212_oct99_3D.sdf
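
What such a selection does can be sketched with standard tools. This is only an illustration under assumptions, not how select_sdf works internally: records are taken to end with `$$$$`, the NSC value to follow the `> <NSC>` header line, and the list to hold one identifier per line. All file names and records are hypothetical.

```shell
#!/bin/sh
# Toy illustration of selecting SD records whose NSC value appears in a list.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

for nsc in 1 2 3; do
    printf 'toy molecule %s\n> <NSC>\n%s\n\n$$$$\n' "$nsc" "$nsc"
done > toy.sdf
printf '%s\n' 1 3 > wanted.list

awk '
    NR == FNR   { wanted[$1] = 1; next }      # first file: the identifier list
    { record = record $0 "\n" }               # second file: accumulate a record
    want_value  { id = $0; want_value = 0 }   # line after the "> <NSC>" header
    /^> *<NSC>/ { want_value = 1 }
    /^\$\$\$\$/ {                             # end of record
        if (id in wanted) printf "%s", record
        record = ""; id = ""
    }
' wanted.list toy.sdf > subset.sdf

selected_count=$(grep -c '^\$\$\$\$' subset.sdf)
echo "records selected: $selected_count"
```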
     
     

    Remove hydrogens, charges, stereo information and 3D coordinates

    Objective:

    See title.

    Input files:

    Commands:

    remove_h_sdf <  2212_oct99_3D.sdf | remove_charge_sdf | tee 2212_oct99_3D_no_H.sdf | zero_sdf  > 2212_oct99_0D.sdf

    cactus_2d_nci 2212_oct99_0D.sdf | remove_stereo_sdf > 2212_oct99_2D.sdf

    #Redo the same thing for the 689 file:

    select_sdf -labelfile 689_aug99.list -property_name NSC < cancer_screened_a99_chemical_structs.sdf > 689_aug99_3D.sdf

    remove_h_sdf <  689_aug99_3D.sdf | remove_charge_sdf | tee 689_aug99_3D_no_H.sdf | zero_sdf  > 689_aug99_0D.sdf

    cactus_2d_nci 689_aug99_0D.sdf | remove_stereo_sdf > 689_aug99_2D.sdf
     

    Notes:

    tee is a standard Unix command that reads from standard input, writes to standard output and also saves a copy to a file. cactus_2d_nci is a Tcl script (not part of the SDF_Toolkit) that calculates 2D coordinates. This script makes use of the CACTVS system.
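
The role of tee in the pipelines above, saving an intermediate result while the data keeps flowing to the next tool, can be seen in a small sketch with standard commands only. The file names are hypothetical.

```shell
#!/bin/sh
# tee saves an intermediate result while the pipeline keeps going.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

printf 'b\na\nb\n' | sort | tee sorted.txt | uniq > unique.txt

cat sorted.txt   # intermediate stage: a, b, b
cat unique.txt   # final stage: a, b
```

In the same way, the commands above keep the hydrogen-free 3D file on disk while zero_sdf produces the 0D file in the same pass.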

     

    Remove entries with a special filter

    Objective:

    Remove entries from the NCI files that have an NSC number greater than or equal to 900,000 (these are combinatorial library entries).

    Input files:

    Source code of remove_900000.pm:

    ##################################
    # Filter used by select_sdf -perlfile: return true to keep a record.
    sub is_sdf_record_kept
    {
            my $sdf_entry = shift ;      # the current SD entry
            my $record_number = shift ;  # record number (unused here)

            defined $sdf_entry || die "Assertion failed" ;
            my $value = $sdf_entry->data_for_field_name("NSC");

            defined $value || die "Assertion failed: undefined property" ;
    #       print STDERR $value, "\n" ;
            return $value < 900000 ; # Keep NSC numbers < 900,000
    }
    1;
    ##################################
     
     

    Command:

    cat open_397.mol 689_aug99_0D.sdf 2212_oct99_0D.sdf |  select_sdf -perlfile  remove_900000.pm  > temp.sdf

    Notes:

    The special filter is loaded and compiled at run time.

    Sort an SD file using a numerical property

    Objective:

    Sort  NCI files by  NSC number

    Input files:

    Commands:

    cat open_397.sdf 689_aug99_3D.sdf 2212_oct99_3D.sdf| sort_sdf -prop NSC  >nciopen_LMCH_oct99_3D.sdf
     

    Notes:

    sort_sdf may require a lot of memory, because the whole input file is held in memory. For example, sorting the entire NCI database (about 250,000 entries with biological data added, an SD file of about 800 MB) by NSC number required 1.5 GB of memory and about 20 minutes of computer time. This was done on galaxy.nih.gov, an SGI computer with 32 x 250 MHz R10000 processors and 8 GB of RAM; only one CPU was used.
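
The word "numerical" matters here: if NSC numbers were compared as strings, 10 would sort before 9. The sketch below uses the standard sort command on a toy list purely to show the difference in ordering; it says nothing about how sort_sdf is implemented.

```shell
#!/bin/sh
# Lexical vs numerical ordering of identifiers (toy values).
lexical=$(printf '%s\n' 10 9 100 | LC_ALL=C sort | paste -sd, -)
numeric=$(printf '%s\n' 10 9 100 | sort -n | paste -sd, -)
echo "lexical: $lexical"
echo "numeric: $numeric"
```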

    Prepare biological data file

    Objective:

    The NCI cancer screen data are comma-separated value files which, unfortunately, cannot be used directly by the add_prop_sdf tool. The problem is that the data for one molecule (NSC number) are split over several lines. The solution is to combine all the data belonging to one entry on a single line. The Perl script nciscreen2csv was written for that purpose.

    Input files:

    Command:

    nciscreen2csv < cancer_screened_gi50_a99 > cancer_screened_gi50_a99.csv
    nciscreen2csv < cancer_screened_lc50_a99 > cancer_screened_lc50_a99.csv
    nciscreen2csv < cancer_screened_tgi_a99 > cancer_screened_tgi_a99.csv
     
     

    Notes:

     See the file  nciscreen2csv in the toolkit.
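
The idea of combining several lines into one line per entry can be sketched as follows. The three-column layout used here (identifier, cell line, value) is purely hypothetical and is not the real DTP format, and the awk program is only an illustration of the principle, not a substitute for nciscreen2csv. It assumes that lines belonging to the same identifier are consecutive.

```shell
#!/bin/sh
# Collapse consecutive lines that share the first field into one line.
# The three-column layout below is hypothetical, not the real DTP format.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
cat > toy_screen.txt <<'EOF'
1,cellA,-5.2
1,cellB,-4.8
7,cellA,-6.0
EOF

awk -F, '
    $1 != current {                     # a new identifier starts
        if (NR > 1) print line          # flush the previous identifier
        current = $1; line = $1
    }
    { line = line "," $2 "," $3 }       # append this measurement
    END { if (NR > 0) print line }      # flush the last identifier
' toy_screen.txt > toy_screen.csv

cat toy_screen.csv
```

After this step every identifier occupies exactly one line, which is the shape a table lookup keyed on the identifier needs.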
     

    Add biological data to an SD file

    Objective:

    Add AIDS and cancer cell data to an SD file in one operation.

    Input files:

    Command:

    add_prop_sdf < nciopen_LMCH_oct99_2D.sdf -match NSC -table cancer_screened_gi50_a99.csv -noskip -perlclass NCI_screen -silent | add_prop_sdf -match NSC -table cancer_screened_lc50_a99.csv -noskip -perlclass NCI_screen -silent | add_prop_sdf -match NSC -table cancer_screened_tgi_a99.csv -noskip -perlclass NCI_screen -silent| add_prop_sdf -match NSC -table aids_ec50_oct99.csv -noskip -perlclass NCI_screen -silent| add_prop_sdf -match NSC -table aids_ic50_oct99.csv -noskip -perlclass NCI_screen -silent| add_prop_sdf -match NSC -table aids_conc_oct99.csv -noskip -silent > nciopen_LMCH_oct99_2D_AIDS_cancer.sdf
     
     
     
     

    Notes:

    The command shown above consists of a single line!
    -perlclass is a special option of the add_prop_sdf tool. Its argument, NCI_screen, is the name of a customized Perl class derived from the class that processes standard CSV (comma-separated value) table files. Its purpose is to reformat the biological data. See the file NCI_screen.pm in the toolkit (this will probably interest only Perl 5 programmers).

    -noskip instructs add_prop_sdf to keep all entries, even those for which no biological data is available.
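
The effect of -noskip resembles the difference between an inner join and a left outer join. The sketch below uses the standard join command on toy files as an analogy only; the file names and values are hypothetical, and nothing here reflects how add_prop_sdf is implemented.

```shell
#!/bin/sh
# Inner join vs left outer join, as an analogy for running without/with -noskip.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' 1 2 3 > ids.txt              # identifiers present in the SD file
printf '%s\n' '1,x' '3,z' > table.csv      # data available for 1 and 3 only

without_noskip=$(join -t, ids.txt table.csv | paste -sd' ' -)
with_noskip=$(join -t, -a 1 ids.txt table.csv | paste -sd' ' -)
echo "kept without the -noskip analogue: $without_noskip"
echo "kept with the -noskip analogue:    $with_noskip"
```

Identifier 2 has no data in the table: it disappears from the inner join and survives, without added data, in the left outer join.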

     

    Add biological data to an SD file and filter out entries for which biological data is not available

    Objective:

    Add AIDS and cancer cell data to an SD file in one operation. Same as before, but now keeping only the structures for which all biological data (AIDS and cancer cells) are available.

    Input files:

    Command:

    add_prop_sdf < nciopen_LMCH_oct99_2D.sdf -match NSC -table cancer_screened_gi50_a99.csv  -perlclass NCI_screen -silent | add_prop_sdf -match NSC -table cancer_screened_lc50_a99.csv  -perlclass NCI_screen -silent | add_prop_sdf -match NSC -table cancer_screened_tgi_a99.csv -perlclass NCI_screen -silent| add_prop_sdf -match NSC -table aids_ec50_oct99.csv  -perlclass NCI_screen -silent| add_prop_sdf -match NSC -table aids_ic50_oct99.csv  -perlclass NCI_screen -silent| add_prop_sdf -match NSC -table aids_conc_oct99.csv  -silent > nciopen_LMCH_oct99_2D_all_have_AIDS_cancer.sdf
     
     

    Notes:

    The command shown above consists of a single line!
    -perlclass is a special option of the add_prop_sdf tool. Its argument, NCI_screen, is the name of a customized Perl class derived from the class that processes standard CSV (comma-separated value) table files. Its purpose is to reformat the biological data. See the file NCI_screen.pm in the toolkit (this will probably interest only Perl 5 programmers).

    The -noskip option is not used, so entries for which biological data is missing are not kept.
     


    Bruno Bienfait    1-11-2000