CCL: W:Re:Test driven development/XP etc. in scientific software development



 Sent to CCL by: "Perry E. Metzger" [perry ~~ piermont.com]
 "Chas Simpson" writes:
 > With regards to the non-testing related tools (IDEs/source
 > control/code generation etc) I'm interested in which are the most
 > popular and how widely these are employed. Personally, Ive found
 > that some of the IDEs are glorified text editors and hamper
 > development more than anything else. Source control, debugger and
 > compiler integration is often poor or non-existent (another sweeping
 > generalisation?).
 I've already said enough about IDEs, but you bring up source control
 systems, and I don't think those have been discussed here recently.
 As with some of my other messages, I would like to make it clear that
 this message is not intended for people who are mere users of
 computational chemistry systems. It is intended for those of you who
 write them.
 A good source control system (sometimes called a revision control
 system) is a critical part of modern software development. If you have
 just one developer involved in a project, source control allows you to
 stop keeping track of versions and concentrate solely on the
 programming. With multiple developers, a good source control
 infrastructure can become a major communications tool among the
 project members as well as a method of assuring a consistent source
 base.
 This is a quick dump of my brain on this topic.
 Many of you are first and foremost chemists and not computer
 scientists, so let me begin by explaining what a source control system
 is. A source control system is a way of tracking all the changes you
 have ever made to a set of files. Those files could be sources for
 your computational chemistry system, a bunch of web pages, your lab
 notes, a set of PDB files, some paper you are collaborating on with
 co-authors, anything at all really.
 In a source control system, every time you make a change to a file
 (and are reasonably happy with it) you do a "check in" -- you tell the
 system "here is a new version of this file, please remember it for
 me". Typically every version of a file gets a version number of some
 sort, and, here is the nice part, you can ask for a copy of any
 previous version of a file. Typically, you also supply a log message
 every time you check in a new version of a file, so that later on you
 can remember (or, if someone else made the change, learn) why the
 change was made. Check-ins are also called "commits" in some systems,
 by analogy to a database commit -- i.e. they are a point at which you
 are committing a change to the source database.
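 To make the check-in idea concrete, here is a minimal sketch in
 Python of a history for a single file. The names (Revision,
 FileHistory, check_in, check_out) are made up for illustration and
 are not the interface of any real source control system. Real systems
 are also much cleverer about how they store the versions.

```python
from dataclasses import dataclass, field


@dataclass
class Revision:
    number: int   # version number assigned at check-in
    content: str  # full text of the file at this version
    log: str      # log message explaining why the change was made


@dataclass
class FileHistory:
    """Toy history for one file: every check-in is kept forever."""
    revisions: list = field(default_factory=list)

    def check_in(self, content: str, log: str) -> int:
        # Record a new version and return its version number.
        number = len(self.revisions) + 1
        self.revisions.append(Revision(number, content, log))
        return number

    def check_out(self, number: int) -> str:
        # Return the content of any previously checked-in version.
        if not 1 <= number <= len(self.revisions):
            raise KeyError(f"no such version: {number}")
        return self.revisions[number - 1].content

    def log_messages(self) -> list:
        # List (version number, log message) pairs, oldest first.
        return [(r.number, r.log) for r in self.revisions]


history = FileHistory()
history.check_in("energy = 0\n", "initial version")
history.check_in("energy = compute()\n", "call the real routine")
old = history.check_out(1)  # the first version is still available
```

 The point of the sketch is only that nothing is ever overwritten:
 each check-in adds a numbered entry with its log message, and any
 earlier entry can be retrieved later.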
 Such systems were originally developed to manage source code, but as I
 said, they are often useful in other contexts. It is not entirely
 obvious to the amateur why you would want such a thing, so let me
 explain the motivation a bit.
 Let's say you're writing a paper and you decide you don't need some
 section and you remove it. Later, maybe even months or years later,
 you realize you would like to get that section back, if only to steal
 some of the language for another document. If you used a source
 control system, you could retrieve that old version. If you did not,
 you have to rely on whether you saved a copy of the old version, try
 to remember where it might be, etc.
 Let's say you are writing a program. You release a version, and some
 people note that there are bugs in it, and you fix them, and release a
 new version. Then, people say "hey, between version 1.4 and 1.5, you
 somehow broke this other feature." You then wonder "gee, what exactly
 did I do between version 1.4 and 1.5 that could have done that?"  With
 a source control system, you can just ask the machine to tell you what
 you did between various revisions and look at them. You can even
 revert changes that turned out to be bad ideas.
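 As a rough illustration of asking the machine what changed between
 two revisions, the sketch below uses Python's standard difflib module
 on two invented versions of a tiny program. The file contents and the
 version labels are assumptions made up for the example; a real source
 control system would pull the two versions out of its own history.

```python
import difflib

# Two hypothetical versions of the same file, as lists of lines.
v1_4 = ["read_input()\n", "optimize(steps=100)\n", "write_output()\n"]
v1_5 = ["read_input()\n", "optimize(steps=10)\n", "write_output()\n"]


def changes_between(old, new, old_name, new_name):
    # Return the unified diff between two versions as a list of lines.
    return list(difflib.unified_diff(old, new, fromfile=old_name, tofile=new_name))


diff = changes_between(v1_4, v1_5, "v1.4", "v1.5")
print("".join(diff), end="")
```

 Lines starting with "-" were removed and lines starting with "+"
 were added, so the accidental change to the optimizer step count
 stands out immediately.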
 Now, for both of these things, you could carefully save a copy of your
 work every day, in some huge set of subdirectories, with copious notes
 attached to what every version was, but that gets very very tedious
 and people will not, in practice, actually do such a thing. Also, if
 you do things by hand, the machine can't actually help you do things
 like automatically keeping multiple branches synchronized (see below.)
 As chemists, most of you are familiar with the notion of a lab
 notebook as an object that is never modified -- you keep putting new
 things in a lab notebook, but you have the pages numbered and the
 notebook exists as an indelible record of all your work. A source
 control system provides such a thing, in a sense, in the electronic
 world -- dutifully saving all intermediate versions of what you did so
 they can later be fully referred to. Changes go in and they remain
 in, forever. The system faithfully records the actual history of all
 changes ever made to the files under control. It is a tool that must
 be used for a while to understand the depth of its utility.
 Here is another example. Let's say you have several versions of a
 program out. Let's say you have a user community who have bought
 version 1 of your program. Meanwhile you are working on the
 not-yet-released version 2, but you still need to sometimes give bug
 fixes to the users who only are on version 1. (Think of Microsoft, who
 have to send you patches for Windows XP even as they work on Windows
 Vista.) You can, in a good source control system, maintain multiple
 different branches of development of your program, and, in the
 snazziest systems, automatically pull changes between them on
 request. When you do a release, you branch your system, and you don't
 have to keep track of what is in the old version, the newer version
 and the development version of the system -- the source control system
 keeps track of it for you. If a bug is found in the released version,
 you can make a new revision *on the branch* to fix it, or (also
 interestingly) make a revision in the development line, and request
 that the system automatically update the branch with that particular
 fix, with the system assisting you in remembering which fixes have
 been "pulled up" to the branch and which have not.
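 The bookkeeping involved in pulling fixes up to a branch can be
 sketched in a few lines of Python. This is a toy model under heavy
 assumptions: a branch is reduced to a list of fix identifiers, and
 the names Branch, pull_up and not_yet_pulled are hypothetical. A real
 system tracks the actual file changes, not just labels.

```python
class Branch:
    """Toy branch: just a name and the fixes that have been applied to it."""

    def __init__(self, name, fixes=()):
        self.name = name
        self.fixes = list(fixes)


def pull_up(fix_id, source, target):
    # Copy one fix from source to target; return False if already there.
    if fix_id not in source.fixes:
        raise ValueError(f"{fix_id} is not on {source.name}")
    if fix_id in target.fixes:
        return False
    target.fixes.append(fix_id)
    return True


def not_yet_pulled(source, target):
    # List fixes that exist on source but have not reached target.
    return [f for f in source.fixes if f not in target.fixes]


trunk = Branch("trunk", ["fix-101", "fix-102", "fix-103"])
release = Branch("release-1", ["fix-101"])

pulled = pull_up("fix-103", trunk, release)
pending = not_yet_pulled(trunk, release)
```

 After the call, the release branch carries fix-103 as well, and the
 system can still report that fix-102 has not been pulled up.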
 As source code control systems have grown over the years, they've
 gained all sorts of features beyond merely storing code. For example,
 one source repository I've dealt with for managing a web site has a
 script that automatically ran when commits were made to update the
 site from the source files. If you, say, fixed the spelling of
 something on 25 pages, and then did a commit, you didn't need to take
 further action -- the web site simply updated automatically. A typical
 modern source control system provides lots of hooks for scripts
 written in languages like Bourne shell, perl, etc., to be run before a
 commit is accepted (for instance, to make sure it conforms to
 particular coding standards or that the user has permission to make
 changes to a particular file), after a commit is accepted, etc. With
 such mechanisms, all sorts of very interesting "side effects" to
 commits become possible, like running your test suite automatically
 every time a commit happens, or sending mail to all the other
 developers when commits happen, etc.
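 Here is a minimal sketch of the hook idea in Python. It does not use
 the hook interface of any real source control system; the Repository
 class, the CommitRejected exception and the two hook functions are
 all invented for the example. It only shows the shape of the
 mechanism: checks that run before a commit is accepted, and side
 effects that run afterwards.

```python
class CommitRejected(Exception):
    """Raised when a pre-commit hook refuses a change."""


class Repository:
    def __init__(self):
        self.commits = []
        self.pre_commit_hooks = []   # each returns an error string or None
        self.post_commit_hooks = []  # each is run for its side effect

    def commit(self, author, path, log):
        change = {"author": author, "path": path, "log": log}
        # Pre-commit hooks may veto the change before it is recorded.
        for hook in self.pre_commit_hooks:
            problem = hook(change)
            if problem is not None:
                raise CommitRejected(problem)
        self.commits.append(change)
        # Post-commit hooks run only after the change has been accepted.
        for hook in self.post_commit_hooks:
            hook(change)


def require_log_message(change):
    # Example policy: refuse commits with an empty log message.
    if not change["log"].strip():
        return "empty log message"
    return None


outbox = []  # stands in for mail sent to the other developers


def mail_developers(change):
    outbox.append(f"{change['author']} changed {change['path']}: {change['log']}")


repo = Repository()
repo.pre_commit_hooks.append(require_log_message)
repo.post_commit_hooks.append(mail_developers)

repo.commit("perry", "index.html", "fix spelling")
try:
    repo.commit("perry", "index.html", "   ")
    rejected = False
except CommitRejected:
    rejected = True
```

 The same slot that holds mail_developers here could just as well run
 a test suite or regenerate a web site.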
 By the way, I'm quite serious in saying that source code control
 systems are valuable in managing things like a book you are writing,
 web sites, paperwork in a law practice, you name it. Once you have
 gotten used to them, they become an amazingly valuable tool for
 tracking information over many years.
 Okay, enough of an introduction. You probably want to know a bit more
 now about what source control systems are out there and which one you
 might try using.
 There are several different major "styles" of source control system.
 First, there are very old school systems like SCCS (the first real
 source control system I'm aware of, now available in an open source
 clone, though I'm not sure anyone rational would want it any more
 except for reading old SCCS repositories), RCS (which is open source),
 and similar systems.
 Such systems don't really do terribly much. RCS is a typical example
 -- when you use RCS, there is a subdirectory in every directory of
 source code you manage named (funnily enough) "RCS". Every file being
 managed in the directory has a corresponding database file in the RCS
 directory -- for example, a file named "code.c" would also have a file
 called RCS/code.c,v associated with it that stored all the old
 versions of the file. You check in a new version of a file by using
 the self-contained "ci" program, retrieve an old version with the "co"
 program, etc. The system really just manipulates a couple of files
 locally with every command. It is straightforward, easy to understand,
 and not very flexible if you have a large project with lots of
 developers associated with it. Systems like RCS and SCCS are largely
 obsolete for big software development projects, though I do use RCS
 for managing a few config files on my systems where CVS or SVN would
 be too heavyweight.
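 The file layout described above is simple enough to express as a
 one-line rule. The Python sketch below is only an illustration of
 that naming convention; the function name rcs_file_for is made up,
 and it assumes the RCS subdirectory layout described here.

```python
from pathlib import PurePosixPath


def rcs_file_for(working_file: str) -> PurePosixPath:
    # Map a working file to the history file kept in the RCS subdirectory.
    path = PurePosixPath(working_file)
    return path.parent / "RCS" / (path.name + ",v")


history_file = rcs_file_for("src/code.c")
```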
 Second, there is the CVS paradigm, which is also largely followed by
 newer systems like Subversion (SVN). In this source control paradigm,
 you maintain a "repository" that contains the entire set of sources
 associated with a system, and developers check out, modify and check
 in sources to the repository, typically from remote systems using
 network protocols like ssh as the transport.
 I happen to work part of the time on an open source operating system
 called NetBSD, and NetBSD is managed in a big CVS repository. The
 repository sits physically on a machine in San Francisco, but
 developers do work on their local boxes around the world. If I want to
 "check out" the latest version of the sources, I run a command that
 connects to the CVS server and downloads all the changes that have
 been made since I last updated. If I am editing a part of the code and
 someone else changes the file I'm working on, the system automatically
 merges their changes into my local version of the file. If it can't
 merge the changes in because they conflict with my own work, it alerts
 me so I can take manual action to merge the changes. Every time
 someone checks in a change to the repository, an email message with
 what they've changed and their log message gets sent to the full set
 of developers, acting as a coordination mechanism within the project.
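 The merge-or-alert behaviour can be sketched as a three-way merge:
 compare my version and the other developer's version against the
 common version we both started from. The Python below is a heavily
 simplified sketch, not how CVS itself does it. It assumes all three
 versions have the same number of lines (edits change lines in place,
 nothing is inserted or deleted), and the name merge3 is made up for
 the example. Real tools handle insertions and deletions as well.

```python
def merge3(base, mine, theirs):
    """Merge line by line; return (merged lines, indices of conflicts)."""
    if not (len(base) == len(mine) == len(theirs)):
        raise ValueError("this sketch only handles in-place line edits")
    merged, conflicts = [], []
    for i, (b, m, t) in enumerate(zip(base, mine, theirs)):
        if m == t:
            merged.append(m)      # both sides agree (or neither changed)
        elif m == b:
            merged.append(t)      # only they changed this line
        elif t == b:
            merged.append(m)      # only I changed this line
        else:
            conflicts.append(i)   # both changed it differently
            merged.append(m)      # keep mine for now; needs manual action
    return merged, conflicts


base = ["a = 1\n", "b = 2\n", "c = 3\n"]
mine = ["a = 10\n", "b = 2\n", "c = 3\n"]
theirs = ["a = 1\n", "b = 2\n", "c = 30\n"]
clean_merge, clean_conflicts = merge3(base, mine, theirs)

mine_b = ["a = 1\n", "b = 20\n", "c = 3\n"]
theirs_b = ["a = 1\n", "b = 200\n", "c = 3\n"]
_, clashing = merge3(base, mine_b, theirs_b)
```

 In the first case the two edits touch different lines and merge
 without help. In the second both developers changed the same line
 differently, so the line is reported as a conflict for a human to
 sort out.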
 CVS, which is open source, is probably the world's most popular source
 control system right now, though it is getting a bit old, and has
 certain design defects that get annoying after a while. Subversion
 (also known as SVN) is a newer open source source control system that
 is in a lot of ways much better than CVS, and if someone is starting
 from scratch, I recommend looking at it.
 CVS follows a centralized paradigm for source management in which
 there is a single true copy of the canonical sources on the repository
 system. A different paradigm was pioneered by Larry McVoy's
 "Bitkeeper" (a commercial product) which allows groups of developers
 to maintain very disparate sets of source trees that are automatically
 merged into each other over the network when the various developers
 wish to do merges. I won't get too deeply into this way of working
 because I have to admit I don't like it much so I'm not a good
 spokesman for it, but many people seem to like the mechanism a lot --
 enough so that there are now several open source source control
 systems that follow this model, such as arch and monotone. Lots of
 people seem to really like them, so it is probably a mistake to pay
 terribly much attention to my prejudice against such systems.
 You may have noticed that I've mentioned a number of open source
 systems and not many commercial ones. That's for a good reason -- the
 open source systems are usually better than the commercial ones these
 days, or are at worst as good as them. For completeness, though, I
 should mention the commercial ones.
 Most importantly, what to avoid: Microsoft's Visual SourceSafe is
 garbage. I don't know how else to say that. It is buggy, horribly
 misbegotten and has not been maintained for years. Even Microsoft does
 not use it for their own work (I believe they may use Perforce -- I'm
 not entirely sure if I remember that right.) I cannot stress enough
 that there is no excuse on earth for using it.
 There are some partisans for Rational's "ClearCase" product but I
 really don't like it much and it requires heroic efforts to
 administer. ClearCase operates by pretending to be a file system and
 noticing every write you make to a file, so it "automatically"
 versions everything. Unfortunately, it turns out that this is not a
 particularly easy trick to accomplish. As I said, I try to avoid
 it. It is also pretty expensive considering that it doesn't seem to
 work as well as simpler and easier systems.
 Perforce and Bitkeeper are both decent products, with Perforce being
 more in the RCS/CVS style and Bitkeeper being the origin of the
 distributed model.
 What do I use? Some projects I work on picked CVS years ago and there
 is a lot of inertia about changing, so we use CVS for those. Newer
 projects I deal with usually use Subversion, which is a very nice
 piece of software. For very small (one or two files) projects of my
 own that only I touch, I sometimes use RCS.
 Perry