DCD and software reuse (was Qmol: Molecular viewer for Windows)
- From: "Andrew Dalke" <dalke()at()acm.org>
- Subject: DCD and software reuse (was Qmol: Molecular viewer for Windows)
- Date: Sun, 18 Mar 2001 02:29:43 -0700
Rick Venable <rvenable()at()gandalf.cber.nih.gov> said:
>As a long time academic CHARMM user, it was *years* before it became
>apparent to non-Quanta users that a DCD file was the MSI/Polygen name
>for a CHARMM binary trajectory file. We always referred to them as
>"trajectory files", or .trj files, an extension in common usage
>CHARMm, the commercial variant. I think it would be clearer if your
>blurb referred to CHARMM/CHARMm explicitly, in addition to file
>extension that has meaning mostly to Quanta users.
I suspect part of the problem is due to "us", meaning the
people at UIUC who helped write VMD, NAMD and related programs.
The group was mostly XPLOR based, which shows in the use
by those programs of XPLOR-specific formats. I say suspect
because I see a comment in the qmol source that it bases
the connectivity search on ideas in VMD, so I assume that
part of the trajectory reading code may be similarly influenced.
Quoting from the X-PLOR manual (version 3.1, chapter 11, p. 143):
The file format is identical to the CHARMm-DCD format (Brooks
et. al. 1983), which can be read by QUANTA and a variety of
other programs. The only exception is that the number of
coordinate sets written to the trajectory is explicitly
written into the header of CHARMm-DCD files, whereas XPLOR
writes a zero instead.
and on page 147 gives an example
although on page 146 and elsewhere it uses ".crd" as an
extension for a trajectory file.
Since the UIUC programs all write a zero as the number of
coordinate sets, that places them specifically as using
the "XPLOR-DCD" format rather than "CHARMm-DCD" or
However, it has been a long time since I looked at that
code, so I brought up the VMD ReadDCD.C code and compared it
to qmol's. It appears to be a completely different
implementation of reading and writing DCD files and not
based on any of the UIUC code. (The coding and error
checking styles are quite different from even the first
version Mark Nelson wrote in 1995.) This means my assumption
of an impact by the UIUC developers is wrong.
Reading it over, I believe that qmol specifically parses a
subset of the CHARMm trajectory format, in that it requires
a count of the number of coordinate sets. In addition, I see
that the VMD DCD reader has been extended to handle parts
of the CHARMm trajectory format not mentioned in the XPLOR
documentation, like the ability to hold an "extra block"
(whatever that means) and allow 4 dimensional coordinates for
the atoms. The qmol reader doesn't handle these cases.
(Also, it seems there's a place where CHARMm uses a double
as compared to XPLOR's float - must have been found during
the conversion of the UIUC tools to 64 bit. qmol believes
that number is an integer.)
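To make the zero-count case concrete, here is a minimal Python sketch of how a reader might cope with a header that says "0 coordinate sets". It is not taken from VMD or qmol. The function names are made up for illustration, and it assumes the simplest layout only: an 84-byte first header record ("CORD" plus 20 four-byte slots, the first being the count), and frames made of three float32 coordinate records (X, Y, Z), each wrapped in 4-byte length markers, with no extra block, no fourth dimension and no fixed atoms.

```python
import struct


def header_frame_count(first_record, endian="<"):
    """Return the coordinate-set count stored in the first header record.

    Assumption: the record payload is 84 bytes, a 4-byte tag followed by
    20 four-byte slots, and the first slot is the number of coordinate
    sets. Other slots are not all integers (one holds a float or double
    time step), so only the first slot is interpreted here.
    """
    if len(first_record) != 84:
        raise ValueError("expected an 84-byte header record")
    if first_record[:4] != b"CORD":
        raise ValueError("missing CORD tag")
    (nset,) = struct.unpack(endian + "i", first_record[4:8])
    return nset


def frame_count(nset, file_size, header_size, natoms):
    """Trust a positive header count, otherwise infer it from the file size.

    Assumption: every frame is three records of natoms float32 values,
    each record carrying 8 bytes of length markers.
    """
    if nset > 0:
        return nset
    frame_bytes = 3 * (4 * natoms + 8)
    return (file_size - header_size) // frame_bytes
```

With this, a reader that insists on a nonzero count (as qmol appears to) and one that accepts the X-PLOR style zero differ only in whether they fall back to the file size. Files with the extra block or 4-dimensional coordinates would need a different frame size, which this sketch deliberately ignores.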
This whole post was really done to lead to a pet peeve of
mine, which is that these formats, which we take for
granted, are rarely documented, so it's often a matter
of reverse engineering the binary file to figure out how
things work. Even those files which are documented are
not fully documented, as these special cases point out.
Oh, and FORTRAN code does not make good documentation
because details of FORTRAN file I/O are left up to the
implementation, making it harder for C to read those files.
Luckily, I believe there is an accepted standard of how
to do that I/O so FORTRAN code from different vendors can
talk to each other. This makes it somewhat less complicated
for other languages to support those formats.
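As an illustration of that convention, here is a small Python sketch that reads and writes one unformatted sequential record. It assumes the framing used by many compilers, where each record's payload is preceded and followed by a 4-byte integer giving the payload length in bytes. The function names, the 4-byte marker size and the little-endian default are assumptions for this example; some compilers and options use other marker sizes or byte orders.

```python
import io
import struct


def read_fortran_record(stream, endian="<"):
    """Read one record and return its payload, or None at end of file.

    Assumption: 4-byte length markers before and after the payload.
    """
    head = stream.read(4)
    if len(head) == 0:
        return None  # clean end of file, no more records
    if len(head) != 4:
        raise ValueError("truncated record marker")
    (nbytes,) = struct.unpack(endian + "i", head)
    if nbytes < 0:
        raise ValueError("negative record length, wrong byte order?")
    payload = stream.read(nbytes)
    tail = stream.read(4)
    if len(payload) != nbytes or len(tail) != 4:
        raise ValueError("truncated record")
    if struct.unpack(endian + "i", tail)[0] != nbytes:
        raise ValueError("leading and trailing markers disagree")
    return payload


def write_fortran_record(stream, payload, endian="<"):
    """Write payload framed by matching 4-byte length markers."""
    marker = struct.pack(endian + "i", len(payload))
    stream.write(marker + payload + marker)
```

A round trip through an in-memory buffer shows the idea:

```python
buf = io.BytesIO()
write_fortran_record(buf, struct.pack("<3f", 1.0, 2.0, 3.0))
buf.seek(0)
x = struct.unpack("<3f", read_fortran_record(buf))
```

Checking that the two markers agree is a cheap way to detect a wrong byte order or a file that is not record-structured at all.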
Another thing which would be nice is for there to be a
set of libraries for reading and writing those formats -
even in FORTRAN :) That would make it easier for someone
else to come along and start working with it, or at least
use them as a reference.
But there are obviously two major problems with that. Many
groups don't release any source or, when they do (and UIUC
is, sadly, one of those), make it difficult to redistribute
that code elsewhere.
UIUC actually allows people to use and redistribute VMD
code freely so long as it is a relatively minor part of
both projects. The DCD reader and writer would be a small
part of VMD and of qmol so would fall under that exemption.
However, qmol doesn't use it and instead uses one which is
more error prone, less portable, and handles only one
variant of the trajectory format. Yet the API exposed
by VMD's DCD reader is almost exactly a superset of that
needed by qmol.
I don't know what can be done to limit this (in my mind)
needless duplication of effort.
CCL of course is one such attempt, being an archive with
both messages and software. There are requests for
information about various trajectory formats going back
at least 6 years, but no detailed description posted
(although some sources are obvious, like looking at
gOpenMol because of Leif Laaksonen's requests for format
descriptions; although his code doesn't explicitly allow
redistribution. Hmm, his code and VMD's aren't exactly
the same either.)
There is some source code in CCL, though most of it is not
available as libraries. In nearly all cases, the code is
quite out of date. For example, VMD is available from CCL
but is at version 1.2, which is several years old. I
believe this shows that CCL is not used as a primary resource
for software, at least for structural biology.
There are other places to look for software besides CCL,
but none that I know of are specific to this field. Perhaps
the most relevant general site is freshmeat.net. Searching
for "trajectory format" found three programs - from UIUC.
For "DCD" found 3 relevant hits from 6 matches - all 3 from
UIUC. For "quantum chemistry" found 1 (not from UIUC :).
Searching for "Babel" found 1 relevant hit in 10 matches -
to xpovchem, which uses povchem and babel.
So freshmeat does not well serve this community.
I must point out that the bioinformatics developers have
very similar problems - undocumented formats, many diverse
programs, no centralized, recognized "portal" for
information, etc. What they do have that chemistry seems
to be missing are projects like bioperl, biopython, biojava,
etc. which try at least to collect and develop freely
available code for bioinformatics. The closest would
probably be Konrad Hinsen's MMTK which focuses on structural
biology, but isn't all that useful for the chemical informatics
I do these days.
BTW, I suspect the difference is a combination of many things.
First, I believe that fewer people are developing chemistry
software (MD, QC, chemical informatics, etc.) than bio
(sequence analysis, gene expression, etc.)
Second, I actually think chemistry software is more complicated
(on average!) than bioinformatics both in theory and in
diversity of projects. I know that I don't want to write a
QC program - MD was tricky enough. Compare also that I know 3
very distinct ways people represent molecular models (orbitals,
small molecule (bond types are important) and large molecule
(only atom types/force fields are important)) while there's
only one general model for sequences.
I suspect there are also effects because comp. chemistry is
more established and has more direct ties with proprietary
industry and because bioinformatics is currently getting a
lot of hype and funding.
There is one final option in reducing duplicative non-science
work in this field, which is to develop such a centralized
site myself. I'm already working on the biopython site as
well as being hired by paying clients, and as Jan can no doubt
point out, working on these things sucks up a lot of time.
Maybe this message will prod others?
BTW, on some days when I have to muck through layers of
code doing who knows what, I get the feeling that we all
work on top of a house of cards stuck together by voodoo.
All I'm trying to do is fix the first couple of floors to
make a taller building.