DCD and software reuse (was Qmol: Molecular viewer for Windows)



Rick Venable <rvenable()at()gandalf.cber.nih.gov> said:
 >As a long time academic CHARMM user, it was *years* before it became
 >apparent to non-Quanta users that a DCD file was the MSI/Polygen name
 >for a CHARMM binary trajectory file.  We always referred to them as
 >"trajectory files", or .trj files, an extension in common usage prior to
 >CHARMm, the commercial variant.  I think it would be clearer if your
 >blurb referred to CHARMM/CHARMm explicitly, in addition to file
 >extension that has meaning mostly to Quanta users.
 I suspect part of the problem is due to "us", meaning the
 people at UIUC who helped write VMD, NAMD and related programs.
 The group was mostly XPLOR based, which shows in the use
 by those programs of XPLOR-specific formats.  I say suspect
 because I see a comment in the qmol source that it bases
 the connectivity search from ideas in VMD so assume that
 part of the trajectory reading code may be similarly influenced.
 Quoting from the X-PLOR manual (version 3.1, chapter 11, p. 143):
    The file format is identical to the CHARMm-DCD format (Brooks
    et. al. 1983), which can be read by QUANTA and a variety of
    other programs.  The only exception is that the number of
    coordinate sets written to the trajectory is explicitly
    written into the header of CHARMm-DCD files, whereas XPLOR
    writes a zero instead.
 and on page 147 gives an example
    write trajectory
         output=trajectory.dcd
         ascii=false
 although on page 146 and elsewhere it uses ".crd" as an
 extension for a trajectory file.
 Since the UIUC programs all write a zero as the number of
 coordinate sets, that places them specifically as using
 the "XPLOR-DCD" format rather than "CHARMm-DCD" or
 "trajectory files".
 However, it has been a long time since I looked at that
 code, so I brought up the VMD ReadDCD.C code and compared it
 to qmol's.  It appears to be a completely different
 implementation of reading and writing DCD files and not
 based on any of the UIUC code.  (The coding and error
 checking styles are quite different from even the first
 version Mark Nelson wrote in 1995.)  This means my assumption
 of an impact by the UIUC developers is wrong.
 Reading it over, I believe that qmol specifically parses a
 subset of the CHARMm trajectory format, in that it requires
 a count of the number of coordinate sets.  In addition, I see
 that the VMD DCD reader has been extended to handle parts
 of the CHARMm trajectory format not mentioned in the XPLOR
 documentation, like the ability to hold an "extra block"
 (whatever that means) and allow 4 dimensional coordinates for
 the atoms.  The qmol reader doesn't handle these cases.
 (Also, it seems there's a place where CHARMm uses a double
 as compared to XPLOR's float - must have been found during
 the conversion of the UIUC tools to 64 bit.  qmol believes
 that number is an integer.)
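
 As a rough illustration of the frame-count difference discussed
 above, here is a minimal Python sketch that pulls the number of
 coordinate sets out of the first record of a DCD file.  The layout
 it assumes (an 84-byte first record holding the tag "CORD" followed
 by 20 four-byte integers, the first of which is the count) comes
 from commonly circulated descriptions of the format, not from
 anything stated in this post, and the function names are made up
 for the example.  It is not the VMD or qmol code.

```python
import struct


def read_dcd_frame_count(data):
    """Return (nset, byteorder) from the start of a DCD file image.

    Assumed layout (not taken from the post): a 4-byte record marker
    equal to 84, the 4-byte tag b"CORD", then 20 32-bit integers, the
    first of which is the number of coordinate sets.
    """
    if len(data) < 12:
        raise ValueError("too short to hold a DCD header")
    # Try both byte orders; the leading marker must read as 84.
    for byteorder in ("<", ">"):
        (length,) = struct.unpack_from(byteorder + "i", data, 0)
        if length == 84:
            break
    else:
        raise ValueError("first record is not an 84-byte header")
    if data[4:8] != b"CORD":
        raise ValueError("missing CORD tag")
    (nset,) = struct.unpack_from(byteorder + "i", data, 8)
    return nset, byteorder


def frame_count_is_recorded(nset):
    """A zero count means the writer (X-PLOR style) did not record it."""
    return nset != 0
```

 A reader that wants to accept both variants would treat a zero
 count as "unknown" and keep reading frames until end of file,
 rather than rejecting the file.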
 This whole post was really done to lead to a pet peeve of
 mine, which is that these formats, which we take for
 granted, are rarely documented so it's often a matter
 of reverse engineering the binary file to figure out how
 things work.  Even those files which are documented are
 not fully documented, as these special cases point out.
 Oh, and FORTRAN code does not make good documentation
 because details of FORTRAN file I/O are left up to the
 implementation, making it harder for C to read those files.
 Luckily, I believe there is an accepted standard of how
 to do that I/O so FORTRAN code from different vendors can
 talk to each other.  This makes it somewhat less complicated
 for other languages to support those formats.
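
 To make the I/O point concrete, here is a minimal Python sketch of
 the usual convention for FORTRAN unformatted sequential files: each
 record is written as a length marker, the payload, and the same
 length marker again.  The 4-byte marker size and little-endian
 default are assumptions that match common compilers but are not
 guaranteed by the language, and the function names are invented
 for the example.

```python
import io
import struct


def read_fortran_records(stream, byteorder="<"):
    """Yield the payload of each record in an unformatted sequential file.

    Assumes a 4-byte length marker before and after every payload.
    """
    while True:
        head = stream.read(4)
        if not head:
            return  # clean end of file
        if len(head) != 4:
            raise ValueError("truncated leading record marker")
        (length,) = struct.unpack(byteorder + "i", head)
        if length < 0:
            raise ValueError("negative record length")
        payload = stream.read(length)
        if len(payload) != length:
            raise ValueError("truncated record payload")
        tail = stream.read(4)
        if len(tail) != 4 or struct.unpack(byteorder + "i", tail)[0] != length:
            raise ValueError("trailing record marker does not match")
        yield payload


def write_fortran_record(payload, byteorder="<"):
    """Frame a payload with matching leading and trailing length markers."""
    marker = struct.pack(byteorder + "i", len(payload))
    return marker + payload + marker
```

 For example, list(read_fortran_records(open(name, "rb"))) would
 return the raw bytes of every record, leaving the interpretation
 of each payload to format-specific code.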
 Another thing which would be nice is for there to be a
 set of libraries for reading and writing those formats -
 even in FORTRAN :)  That would make it easier for someone
 else to come along and start working with it, or at least
 use them as a reference.
 But there are obviously two major problems with that.  Many
 groups don't release any source, or when they do (and UIUC
 is, sadly, one of those), make it difficult to redistribute
 that code elsewhere.
 UIUC actually allows people to use and redistribute VMD
 code freely so long as it is a relatively minor part of
 both projects.  The DCD reader and writer would be a small
 part of VMD and of qmol so would fall under that exemption.
 However, qmol doesn't use it and instead uses one which is
 more error prone, less portable, and handles only one
 variant of the trajectory format.  Yet the API exposed
 by VMD's DCD reader is almost exactly a superset of that
 needed by qmol.
 I don't know what can be done to limit this (in my mind)
 needless duplication of effort.
 CCL of course is one such attempt, being an archive with
 both messages and software. There are requests for
 information about various trajectory formats going back
 at least 6 years, but no detailed description posted
 (although some sources are obvious, like looking at
 gOpenMol because of Leif Laaksonen's requests for format
 descriptions; although his code doesn't explicitly allow
 redistribution.  Hmm, his code and VMD's aren't exactly
 the same either.)
 There is some source code in CCL, though most of it is not
 available as libraries.  In nearly all cases, the code is
 quite out of date.  For example, VMD is available from CCL
 but is at version 1.2, which is several years old.  I
 believe this shows the CCL is not used as a primary resource
 for software for at least structural biology.
 There are other places to look for software besides CCL,
 but none that I know of are specific to this field.  Perhaps
 the most relevant general site is freshmeat.net.  Searching
 for "trajectory format" found three programs - from UIUC.
 For "DCD" found 3 relevant hits from 6 matches - all 3 from
 UIUC.  For "quantum chemistry" found 1 (not from UIUC :).
 Searching for "Babel" found 1 relevant hit in 10 matches -
 to xpovchem, which uses povchem and babel.
 So freshmeat does not serve this community well.
 I must point out that the bioinformatics developers have
 very similar problems - undocumented formats, many diverse
 programs, no centralized, recognized "portal" for
 information, etc.  What they do have that chemistry seems
 to be missing are projects like bioperl, biopython, biojava,
 etc. which try at least to collect and develop freely
 available code for bioinformatics.  The closest would
 probably be Konrad Hinsen's MMTK, which focuses on structural
 biology but isn't all that useful for the chemical informatics
 I do these days.
 BTW, I suspect the difference is a combination of many things.
 First, I believe that fewer people are developing chemistry
 software (MD, QC, chemical informatics, etc.) than bio
 (sequence analysis, gene expression, etc.)
 Second, I actually think chemistry software is more complicated
 (on average!) than bioinformatics both in theory and in
 diversity of projects.  I know that I don't want to write a
 QC program - MD was tricky enough.  Compare also that I know 3
 very distinct ways people represent molecular models (orbitals,
 small molecule (bond types are important) and large molecule
 (only atom types/force fields are important)) while there's
 only one general model for sequences.
 I suspect there are also effects because comp. chemistry is
 more established and has more direct ties with proprietary
 industry and because bioinformatics is currently getting a
 lot of hype and funding.
 There is one final option in reducing duplicative non-science
 work in this field, which is to develop such a centralized
 site myself.  I'm already working on the biopython site as
 well as working for paying clients, and as Jan can no doubt
 point out, working on these things sucks up a lot of time.
 Maybe this message will prod others?
                     Andrew Dalke
                     dalke()at()acm.org
 P.S.
   BTW, on some days when I have to muck through layers of
 code doing who knows what, I get the feeling that we all
 work on top of a house of cards stuck together by voodoo.
 All I'm trying to do is fix the first couple of floors to
 make a taller building.