From: "Andrew Dalke"
Subject: DCD and software reuse (was Qmol: Molecular viewer for Windows)
Date: Sun, 18 Mar 2001 02:29:43 -0700

Rick Venable said:
>As a long time academic CHARMM user, it was *years* before it became
>apparent to non-Quanta users that a DCD file was the MSI/Polygen name
>for a CHARMM binary trajectory file. We always referred to them as
>"trajectory files", or .trj files, an extension in common usage prior to
>CHARMm, the commercial variant. I think it would be clearer if your
>blurb referred to CHARMM/CHARMm explicitly, in addition to file
>extension that has meaning mostly to Quanta users.

I suspect part of the problem is due to "us", meaning the people at UIUC who helped write VMD, NAMD and related programs. The group was mostly XPLOR based, which shows in those programs' use of XPLOR-specific formats. I say "suspect" because I see a comment in the qmol source that it bases the connectivity search on ideas from VMD, so I assumed that part of the trajectory reading code might be similarly influenced.

Quoting from the X-PLOR manual (version 3.1, chapter 11, p. 143):

  The file format is identical to the CHARMm-DCD format (Brooks et. al. 1983), which can be read by QUANTA and a variety of other programs.
  The only exception is that the number of coordinate sets written to the trajectory is explicitly written into the header of CHARMm-DCD files, whereas XPLOR writes a zero instead.

and on page 147 gives an example

  write trajectory output=trajectory.dcd ascii=false

although on page 146 and elsewhere it uses ".crd" as an extension for a trajectory file. Since the UIUC programs all write a zero as the number of coordinate sets, that places them specifically as using the "XPLOR-DCD" format rather than "CHARMm-DCD" or "trajectory files".

However, it has been a long time since I looked at that code, so I brought up the VMD ReadDCD.C code and compared it to qmol's. The qmol code appears to be a completely different implementation of reading and writing DCD files and not based on any of the UIUC code. (The coding and error checking styles are quite different from even the first version Mark Nelson wrote in 1995.) This means my assumption of an impact by the UIUC developers is wrong.

Reading it over, I believe that qmol specifically parses a subset of the CHARMm trajectory format, in that it requires a count of the number of coordinate sets. In addition, I see that the VMD DCD reader has been extended to handle parts of the CHARMm trajectory format not mentioned in the XPLOR documentation, like the ability to hold an "extra block" (whatever that means) and to allow 4 dimensional coordinates for the atoms. The qmol reader doesn't handle these cases. (Also, it seems there's a place where CHARMm uses a double as compared to XPLOR's float - must have been found during the conversion of the UIUC tools to 64 bit. qmol believes that number is an integer.)

This whole post was really done to lead to a pet peeve of mine, which is that these formats, which we take for granted, are rarely documented, so it's often a matter of reverse engineering the binary file to figure out how things work. Even those files which are documented are not fully documented, as these special cases point out.
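To make the header difference concrete, here is a minimal sketch of how a reader might cope with both variants: trust a non-zero coordinate-set count, and fall back to counting records when the header holds a zero. Everything about the layout is an assumption made for illustration and is not taken from the CHARMM or X-PLOR documentation: records framed by 4-byte length markers (a common FORTRAN convention, not a guaranteed one), little-endian data, a "CORD" tag followed by the count in the first record, three records before the first frame, and three records (X, Y, Z) per frame with no "extra block" and no fourth dimension. The function names are hypothetical.

```python
import struct


def iter_fortran_records(data, endian="<"):
    """Yield the payload of each unformatted sequential record in `data`.

    Assumption: every record is framed by a 4-byte length marker before
    and after the payload. This is a widespread convention, not a rule.
    """
    pos = 0
    while pos < len(data):
        if pos + 4 > len(data):
            raise ValueError("truncated record marker at offset %d" % pos)
        (length,) = struct.unpack_from(endian + "i", data, pos)
        end = pos + 4 + length
        if length < 0 or end + 4 > len(data):
            raise ValueError("bad record length at offset %d" % pos)
        (trailer,) = struct.unpack_from(endian + "i", data, end)
        if trailer != length:
            raise ValueError("record markers disagree at offset %d" % pos)
        yield data[pos + 4:end]
        pos = end + 4


def count_coordinate_sets(data, endian="<"):
    """Return the number of coordinate sets in a DCD-like byte string.

    Assumed layout (illustrative only): three records precede the frames
    (control block, title, atom count) and each frame is three records.
    """
    records = list(iter_fortran_records(data, endian))
    if len(records) < 3 or records[0][:4] != b"CORD":
        raise ValueError("does not look like a coordinate trajectory")
    # Assumed position of the count: the first integer after the tag.
    (declared,) = struct.unpack_from(endian + "i", records[0], 4)
    frame_records = len(records) - 3
    if frame_records % 3 != 0:
        raise ValueError("frame records are not a multiple of three")
    # A zero count (the XPLOR habit) means we have to infer it.
    return declared if declared != 0 else frame_records // 3
```

With a file read fully into memory, `count_coordinate_sets(open(path, "rb").read())` would give the count under these assumptions. A real reader would also need to detect byte order and the optional blocks mentioned above, which this sketch deliberately ignores.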

Oh, and FORTRAN code does not make good documentation because details of FORTRAN file I/O are left up to the implementation, making it harder for C to read those files. Luckily, I believe there is an accepted standard of how to do that I/O so FORTRAN code from different vendors can talk to each other. This makes it somewhat less complicated for other languages to support those formats.

Another thing which would be nice is for there to be a set of libraries for reading and writing those formats - even in FORTRAN :) That would make it easier for someone else to come along and start working with it, or at least use them as a reference.

But there are obviously two major problems with that. Many groups don't release any source, or when they do (and UIUC is, sadly, one of those) they make it difficult to redistribute that code elsewhere. UIUC actually allows people to use and redistribute VMD code freely so long as it is a relatively minor part of both projects. The DCD reader and writer would be a small part of VMD and of qmol, so would fall under that exemption. However, qmol doesn't use it and instead uses one which is more error prone, less portable, and handles only one variant of the trajectory format. Yet the API exposed by VMD's DCD reader is almost exactly a superset of that needed by qmol.

I don't know what can be done to limit this (in my mind) needless duplication of effort. CCL of course is one such attempt, being an archive with both messages and software. There are requests for information about various trajectory formats going back at least 6 years, but no detailed description has been posted (although some sources are obvious, like looking at gOpenMol because of Leif Laaksonen's requests for format descriptions; although his code doesn't explicitly allow redistribution. Hmm, his code and VMD's aren't exactly the same either.)

There is some source code in CCL, though most is not available as libraries. In nearly all cases, the code is quite out of date.
For example, VMD is available from CCL but is at version 1.2, which is several years old. I believe this shows that CCL is not used as a primary resource for software, at least for structural biology.

There are other places to look for software besides CCL, but none that I know of are specific to this field. Perhaps the most relevant general site is freshmeat.net. Searching for "trajectory format" found three programs - from UIUC. Searching for "DCD" found 3 relevant hits from 6 matches - all 3 from UIUC. Searching for "quantum chemistry" found 1 (not from UIUC :). Searching for "Babel" found 1 relevant hit in 10 matches - to xpovchem, which uses povchem and babel. So freshmeat does not serve this community well.

I must point out that the bioinformatics developers have very similar problems - undocumented formats, many diverse programs, no centralized, recognized "portal" for information, etc. What they do have that chemistry seems to be missing are projects like bioperl, biopython, biojava, etc. which try at least to collect and develop freely available code for bioinformatics. The closest would probably be Konrad Hinsen's MMTK, which focuses on structural biology but isn't all that useful for the chemical informatics I do these days.

BTW, I suspect the difference is a combination of many things. First, I believe that fewer people are developing chemistry software (MD, QC, chemical informatics, etc.) than bio software (sequence analysis, gene expression, etc.). Second, I actually think chemistry software is more complicated (on average!) than bioinformatics software, both in theory and in diversity of projects. I know that I don't want to write a QC program - MD was tricky enough. Compare also that I know 3 very distinct ways people represent molecular models (orbitals, small molecule (bond types are important) and large molecule (only atom types/force fields are important)) while there's only one general model for sequences. I suspect there are also effects because comp.
chemistry is more established and has more direct ties with proprietary industry, and because bioinformatics is currently getting a lot of hype and funding.

There is one final option for reducing duplicative non-science work in this field, which is to develop such a centralized site myself. I'm already working on the biopython site as well as working for paying clients, and as Jan can no doubt point out, working on these things sucks up a lot of time. Maybe this message will prod others?

Andrew Dalke
dalke \\at// acm.org

P.S. BTW, on some days when I have to muck through layers of code doing who knows what, I get the feeling that we all work on top of a house of cards stuck together by voodoo. All I'm trying to do is fix the first couple of floors to make a taller building.