CCL: W:Re:Test driven development/XP etc. in scientific software
- From: "Perry E. Metzger"
- Subject: CCL: W:Re:Test driven development/XP etc. in scientific
- Date: Thu, 13 Oct 2005 11:38:26 -0400
Sent to CCL by: "Perry E. Metzger" [perry ~~ piermont.com]
"Chas Simpson" writes:
> With regards to the non-testing related tools (IDEs/source
> control/code generation etc) I'm interested in which are the most
> popular and how widely these are employed. Personally, I've found
> that some of the IDEs are glorified text editors and hamper
> development more than anything else. Source control, debugger and
> compiler integration is often poor or non-existent (another sweeping
I've already said enough about IDEs, but you bring up source control
systems, and I don't think those have been discussed here recently.
As with some of my other messages, I would like to make it clear that
this message is not intended for people who are mere users of
computational chemistry systems. It is intended for those of you who
A good source control system (sometimes called a revision control
system) is a critical part of modern software development. If you have
just one developer involved in a project, source control allows you to
stop keeping track of versions and concentrate solely on the
programming. With multiple developers, a good source control
infrastructure can become a major communications tool among the
project members as well as a method of assuring a consistent source
This is a quick dump of my brain on this topic.
Many of you are first and foremost chemists and not computer
scientists, so let me begin by explaining what a source control system
is. A source control system is a way of tracking all the changes you
have ever made to a set of files. Those files could be sources for
your computational chemistry system, a bunch of web pages, your lab
notes, a set of PDB files, some paper you are collaborating on with
co-authors, anything at all really.
In a source control system, every time you make a change to a file
(and are reasonably happy with it) you do a "check in" -- you tell the
system "here is a new version of this file, please remember it for
me". Typically every version of a file gets a version number of some
sort, and, here is the nice part, you can ask for a copy of any
previous version of a file. Typically, you also supply a log message
every time you check in a new version of a file, so that later on you
can remember (or, if someone else made the change, learn) why the
change was made. Check-ins are also called "commits" in some systems,
by analogy to a database commit -- i.e. they are a point at which you
are committing a change to the source database.
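To make the check-in idea concrete, here is a minimal sketch in Python of a toy, in-memory history for a single file. It is not how any real source control system is implemented; the names VersionedFile, check_in, check_out and history are made up for illustration, and a real system stores its history on disk far more cleverly.

```python
from dataclasses import dataclass, field


@dataclass
class Revision:
    number: int      # version number assigned at check-in
    content: str     # full text of the file at this version
    log: str         # log message explaining the change


@dataclass
class VersionedFile:
    """Toy, in-memory history of one file (hypothetical, not a real tool)."""
    revisions: list = field(default_factory=list)

    def check_in(self, content, log):
        """Record a new version and return its version number."""
        number = len(self.revisions) + 1
        self.revisions.append(Revision(number, content, log))
        return number

    def check_out(self, number=None):
        """Return the content of a given version (the latest if omitted)."""
        if not self.revisions:
            raise LookupError("nothing has been checked in yet")
        if number is None:
            return self.revisions[-1].content
        if not 1 <= number <= len(self.revisions):
            raise LookupError(f"no such version: {number}")
        return self.revisions[number - 1].content

    def history(self):
        """Return (version number, log message) pairs, oldest first."""
        return [(r.number, r.log) for r in self.revisions]


paper = VersionedFile()
paper.check_in("Intro\nMethods\n", "first draft")
paper.check_in("Intro\n", "drop the methods section")
old_text = paper.check_out(1)   # the removed section is still retrievable
```

The point of the sketch is only that every check-in is kept along with its log message, so any earlier version can be asked for by number.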
Such systems were originally developed to manage source code, but as I
said, they are often useful in other contexts. It is not entirely
obvious to the amateur why you would want such a thing, so let me
explain the motivation a bit.
Let's say you're writing a paper and you decide you don't need some
section and you remove it. Later, maybe even months or years later,
you realize you would like to get that section back, if only to steal
some of the language for another document. If you used a source
control system, you could retrieve that old version. If you did not,
you have to rely on whether you saved a copy of the old version, try
to remember where it might be, etc.
Let's say you are writing a program. You release a version, and some
people note that there are bugs in it, and you fix them, and release a
new version. Then, people say "hey, between version 1.4 and 1.5, you
somehow broke this other feature." You then wonder "gee, what exactly
did I do between version 1.4 and 1.5 that could have done that?" With
a source control system, you can just ask the machine to tell you what
you did between various revisions and look at them. You can even
revert changes that turned out to be bad ideas.
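As a rough illustration of "asking the machine what changed", the following Python sketch compares two versions of a file using the standard library's difflib module. The file name solver.py, its contents, and the helper changes_between are invented for the example; a real source control system would pull both versions out of its own history for you.

```python
import difflib


def changes_between(old_text, new_text, old_label, new_label):
    """Return a unified diff of two versions of a file as a list of lines."""
    return list(difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",
    ))


# Two hypothetical versions of the same source file.
v1_4 = "read_input()\ncompute_energy()\nwrite_output()\n"
v1_5 = "read_input()\ncompute_energy(fast=True)\nwrite_output()\n"

diff_lines = changes_between(v1_4, v1_5, "solver.py (1.4)", "solver.py (1.5)")
print("\n".join(diff_lines))
```

Lines starting with "-" were present only in the older version and lines starting with "+" only in the newer one, which is usually enough to spot the change that broke something.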
Now, for both of these things, you could carefully save a copy of your
work every day, in some huge set of subdirectories, with copious notes
attached to what every version was, but that gets very very tedious
and people will not, in practice, actually do such a thing. Also, if
you do things by hand, the machine can't actually help you do things
like automatically keeping multiple branches synchronized (see below.)
As chemists, most of you are familiar with the notion of a lab
notebook as an object that is never modified -- you keep putting new
things in a lab notebook, but you have the pages numbered and the
notebook exists as an indelible record of all your work. A source
control system provides such a thing, in a sense, in the electronic
world -- dutifully saving all intermediate versions of what you did so
they can later be fully referred to. Changes go in and they remain
in, forever. The system faithfully records the actual history of all
changes ever made to the files under control. It is a tool that must
be used for a while to understand the depth of its utility.
Here is another example. Let's say you have several versions of a
program out, and a user community who have bought
version 1 of your program. Meanwhile you are working on the
not-yet-released version 2, but you still need to sometimes give bug
fixes to the users who only are on version 1. (Think of Microsoft, who
have to send you patches for Windows XP even as they work on Windows
Vista.) You can, in a good source control system, maintain multiple
different branches of development of your program, and, in the
snazziest systems, automatically pull changes between them on
request. When you do a release, you branch your system, and you don't
have to keep track of what is in the old version, the newer version
and the development version of the system -- the source control system
keeps track of it for you. If a bug is found in the released version,
you can make a new revision *on the branch* to fix it, or (also
interestingly) make a revision in the development line, and request
that the system automatically update the branch with that particular
fix, with the system assisting you in remembering which fixes have
been "pulled up" to the branch and which have not.
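The bookkeeping described above can be sketched with a toy model in Python. This is an assumption-laden simplification: a branch is treated as nothing more than an ordered list of named changes, and the class Branch with its methods fork, pull_up and not_pulled_from is hypothetical, not the interface of any real system.

```python
class Branch:
    """Toy branch: an ordered list of named changes (hypothetical model)."""

    def __init__(self, name, changes=()):
        self.name = name
        self.changes = list(changes)

    def commit(self, change_id):
        """Record a new change on this branch."""
        self.changes.append(change_id)

    def fork(self, name):
        """Create a new branch that starts with a copy of this history."""
        return Branch(name, self.changes)

    def pull_up(self, source, change_id):
        """Copy one specific change from another branch onto this one."""
        if change_id not in source.changes:
            raise LookupError(f"{change_id} is not on {source.name}")
        if change_id not in self.changes:
            self.changes.append(change_id)

    def not_pulled_from(self, source):
        """List changes on the source branch that this branch lacks."""
        return [c for c in source.changes if c not in self.changes]


trunk = Branch("trunk", ["initial-import", "add-solver"])
release_1 = trunk.fork("release-1")            # branch at release time

trunk.commit("new-v2-feature")                 # development continues
trunk.commit("fix-overflow-bug")               # fix made on the development line

release_1.pull_up(trunk, "fix-overflow-bug")   # only the fix goes to the branch
pending = release_1.not_pulled_from(trunk)     # what the branch still lacks
```

Here the release branch receives the bug fix but not the unfinished version 2 work, and the system rather than the developer remembers which changes have been pulled up.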
As source code control systems have grown over the years, they've
gained all sorts of features beyond merely storing code. For example,
one source repository I've dealt with for managing a web site had a
script that ran automatically whenever a commit was made and updated the
site from the source files. If you, say, fixed the spelling of
something on 25 pages, and then did a commit, you didn't need to take
further action -- the web site simply updated automatically. A typical
modern source control system provides lots of hooks for scripts
written in languages like Bourne shell, perl, etc., to be run before a
commit is accepted (for instance, to make sure it conforms to
particular coding standards or that the user has permission to make
changes to a particular file), after a commit is accepted, etc. With
such mechanisms, all sorts of very interesting "side effects" to
commits become possible, like running your test suite automatically
every time a commit happens, or sending mail to all the other
developers when commits happen, etc.
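The hook mechanism can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the hook interface of CVS, Subversion or any other real system: the Repository class, the hook functions, and the list standing in for outgoing mail are all invented for illustration.

```python
class Repository:
    """Toy repository with pre-commit and post-commit hooks (hypothetical)."""

    def __init__(self):
        self.commits = []
        self.pre_commit_hooks = []    # each returns an error string or None
        self.post_commit_hooks = []   # each is run only for its side effect

    def commit(self, author, path, content, log):
        change = {"author": author, "path": path, "content": content, "log": log}
        # Any pre-commit hook may veto the change before it is stored.
        for hook in self.pre_commit_hooks:
            error = hook(change)
            if error is not None:
                raise PermissionError(f"commit rejected: {error}")
        self.commits.append(change)
        # Post-commit hooks run only after the change has been accepted.
        for hook in self.post_commit_hooks:
            hook(change)


def require_log_message(change):
    return None if change["log"].strip() else "empty log message"


def no_tabs(change):
    # Stand-in for a coding-standard check.
    return "tab characters are not allowed" if "\t" in change["content"] else None


outbox = []   # stands in for mail sent to the other developers


def notify_developers(change):
    outbox.append(f"{change['author']} changed {change['path']}: {change['log']}")


repo = Repository()
repo.pre_commit_hooks += [require_log_message, no_tabs]
repo.post_commit_hooks.append(notify_developers)

repo.commit("alice", "index.html", "<p>Hello</p>\n", "fix spelling")
```

A post-commit hook that rebuilt a web site or ran a test suite would slot in the same way as notify_developers does here.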
By the way, I'm quite serious in saying that source code control
systems are valuable in managing things like a book you are writing,
web sites, paperwork in a law practice, you name it. Once you have
gotten used to them, they become an amazingly valuable tool for
tracking information over many years.
Okay, enough of an introduction. You probably want to know a bit more
now about what source control systems are out there and which one you
might try using.
There are several different major "styles" of source control system.
First, there are very old school systems like SCCS (the first real
source control system I'm aware of, now available in an open source
clone, though I'm not sure anyone rational would want it any more
except for reading old SCCS repositories), RCS (which is open source),
and similar systems.
Such systems don't really do terribly much. RCS is a typical example
-- when you use RCS, there is a subdirectory in every directory of
source code you manage named (funnily enough) "RCS". Every file being
managed in the directory has a corresponding database file in the RCS
directory -- for example, a file named "code.c" would also have a file
called RCS/code.c,v associated with it that stored all the old
versions of the file. You check in a new version of a file by using
the self-contained "ci" program, retrieve an old version with the "co"
program, etc. The system really just manipulates a couple of files
locally with every command. It is straightforward, easy to understand,
and not very flexible if you have a large project with lots of
developers associated with it. Systems like RCS and SCCS are largely
obsolete for big software development projects, though I do use RCS
for managing a few config files on my systems where CVS or SVN would
be too heavyweight.
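The file layout described above is simple enough to express as a small Python helper. This is only a sketch of the naming convention, assuming the RCS subdirectory arrangement just described; the function name rcs_history_path and the second example path are made up, and the code does not read or write real RCS files.

```python
from pathlib import PurePosixPath


def rcs_history_path(working_file):
    """Return where the history for a working file would live, assuming
    an RCS/ subdirectory next to the file and a ",v" suffix."""
    path = PurePosixPath(working_file)
    return path.parent / "RCS" / (path.name + ",v")


print(rcs_history_path("code.c"))              # RCS/code.c,v
print(rcs_history_path("src/solver/code.c"))   # src/solver/RCS/code.c,v
```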
Second, there is the CVS paradigm, which is also largely followed by
newer systems like Subversion (SVN). In this source control paradigm,
you maintain a "repository" that contains the entire set of sources
associated with a system, and developers check out, modify and check
in sources to the repository, typically from remote systems using
network protocols like ssh as the transport.
I happen to work part of the time on an open source operating system
called NetBSD, and NetBSD is managed in a big CVS repository. The
repository sits physically on a machine in San Francisco, but
developers do work on their local boxes around the world. If I want to
"check out" the latest version of the sources, I run a command that
connects to the CVS server and downloads all the changes that have
been made since I last updated. If I am editing a part of the code and
someone else changes the file I'm working on, the system automatically
merges their changes into my local version of the file. If it can't
merge the changes in because they conflict with my own work, it alerts
me so I can take manual action to merge the changes. Every time
someone checks in a change to the repository, an email message with
what they've changed and their log message gets sent to the full set
of developers, acting as a coordination mechanism within the project.
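The merge-or-alert behavior can be illustrated with a deliberately simplified Python sketch. It assumes that neither developer adds or removes lines, so the three versions (the common starting point, my copy, and their copy) can be compared line by line. Real systems handle insertions and deletions and do not work this way internally; the function name merge_lines is hypothetical.

```python
def merge_lines(base, mine, theirs):
    """Merge two edited copies of a file against their common ancestor.

    Simplifying assumption: neither side adds or removes lines, so the
    three versions can be compared line by line.
    Returns (merged_lines, conflict_line_numbers).
    """
    if not (len(base) == len(mine) == len(theirs)):
        raise ValueError("this sketch only handles in-place line edits")
    merged, conflicts = [], []
    for number, (b, m, t) in enumerate(zip(base, mine, theirs), start=1):
        if m == t:            # both sides agree, or neither changed the line
            merged.append(m)
        elif m == b:          # only they changed this line: take theirs
            merged.append(t)
        elif t == b:          # only I changed this line: keep mine
            merged.append(m)
        else:                 # both changed it differently: needs a human
            merged.append(m)
            conflicts.append(number)
    return merged, conflicts


base = ["a = 1", "b = 2", "c = 3"]
mine = ["a = 10", "b = 2", "c = 3"]     # I edited line 1
theirs = ["a = 1", "b = 2", "c = 30"]   # they edited line 3

merged, conflicts = merge_lines(base, mine, theirs)
```

When the two sets of edits touch different lines, as here, the merge is automatic; when both touch the same line differently, the line number is reported so it can be resolved by hand.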
CVS, which is open source, is probably the world's most popular source
control system right now, though it is getting a bit old, and has
certain design defects that get annoying after a while. Subversion
(also known as SVN) is a newer open source source control system that
is in a lot of ways much better than CVS, and if someone is starting
from scratch, I recommend looking at it.
CVS follows a centralized paradigm for source management in which
there is a single true copy of the canonical sources on the repository
system. A different paradigm was pioneered by Larry McVoy's
"Bitkeeper" (a commercial product) which allows groups of developers
to maintain very disparate sets of source trees that are automatically
merged into each other over the network when the various developers
wish to do merges. I won't get too deeply into this way of working
because I have to admit I don't like it much so I'm not a good
spokesman for it, but many people seem to like the mechanism a lot --
enough so that there are now several open source source control
systems that follow this model, such as arch and monotone. Lots of
people seem to really like them, so it is probably a mistake to pay
terribly much attention to my prejudice against such systems.
You may have noticed that I've mentioned a number of open source
systems and not many commercial ones. That's for a good reason -- the
open source systems are usually better than the commercial ones these
days, or are at worst as good as them. For completeness, though, I
should mention the commercial ones.
Most importantly, what to avoid: Microsoft's Visual SourceSafe is
garbage. I don't know how else to say that. It is buggy, horribly
misbegotten and has not been maintained for years. Even Microsoft does
not use it for their own work (I believe they may use Perforce -- I'm
not entirely sure if I remember that right.) I cannot stress enough
that there is no excuse on earth for using it.
There are some partisans for Rational's "ClearCase" product but I
really don't like it much and it requires heroic efforts to
administer. ClearCase operates by pretending to be a file system and
noticing every write you make to a file, so it "automatically"
versions everything. Unfortunately, it turns out that this is not a
particularly easy trick to accomplish. As I said, I try to avoid
it. It is also pretty expensive considering that it doesn't seem to
work as well as simpler and easier systems.
Perforce and Bitkeeper are both decent products, with Perforce being
more in the RCS/CVS style and Bitkeeper being the origin of the
What do I use? Some projects I work on picked CVS years ago and there
is a lot of inertia about changing, so we use CVS for those. Newer
projects I deal with usually use Subversion, which is a very nice
piece of software. For very small (one or two files) projects of my
own that only I touch, I sometimes use RCS.