From owner-chemistry |-at-| ccl.net Thu Oct 13 13:16:00 2005
From: "Perry E. Metzger perry-$-piermont.com"
To: CCL
Subject: CCL: W:Re:Test driven development/XP etc. in scientific software development
Message-Id: <-29594-051013113832-12209-rGpBYstNKShNqp4m6kf/3Q---server.ccl.net>
X-Original-From: "Perry E. Metzger"
Content-Type: text/plain; charset=us-ascii
Date: Thu, 13 Oct 2005 11:38:26 -0400
MIME-Version: 1.0

Sent to CCL by: "Perry E. Metzger" [perry ~~ piermont.com]

"Chas Simpson" writes:
> With regards to the non-testing related tools (IDEs/source
> control/code generation etc) I'm interested in which are the most
> popular and how widely these are employed. Personally, I've found
> that some of the IDEs are glorified text editors and hamper
> development more than anything else. Source control, debugger and
> compiler integration is often poor or non-existent (another sweeping
> generalisation?).

I've already said enough about IDEs, but you bring up source control systems, and I don't think those have been discussed here recently.

As with some of my other messages, I would like to make it clear that this message is not intended for people who are mere users of computational chemistry systems. It is intended for those of you who write them.

A good source control system (sometimes called a revision control system) is a critical part of modern software development. If you have just one developer involved in a project, source control allows you to stop keeping track of versions by hand and concentrate solely on the programming. With multiple developers, a good source control infrastructure can become a major communications tool among the project members as well as a method of assuring a consistent source base.

This is a quick dump of my brain on this topic.

Many of you are first and foremost chemists and not computer scientists, so let me begin by explaining what a source control system is.

A source control system is a way of tracking all the changes you have ever made to a set of files. Those files could be sources for your computational chemistry system, a bunch of web pages, your lab notes, a set of PDB files, some paper you are collaborating on with co-authors, anything at all really.

In a source control system, every time you make a change to a file (and are reasonably happy with it) you do a "check in" -- you tell the system "here is a new version of this file, please remember it for me". Typically every version of a file gets a version number of some sort, and, here is the nice part, you can ask for a copy of any previous version of a file. Typically, you also supply a log message every time you check in a new version of a file, so that later on you can remember (or, if someone else made the change, learn) why the change was made. Check-ins are also called "commits" in some systems, by analogy to a database commit -- i.e. they are a point at which you are committing a change to the source database.

Such systems were originally developed to manage source code, but as I said, they are often useful in other contexts.

It is not entirely obvious to the amateur why you would want such a thing, so let me explain the motivation a bit.

Let's say you're writing a paper and you decide you don't need some section and you remove it. Later, maybe even months or years later, you realize you would like to get that section back, if only to steal some of the language for another document. If you used a source control system, you could retrieve that old version. If you did not, you have to rely on whether you saved a copy of the old version, try to remember where it might be, etc.

Let's say you are writing a program. You release a version, and some people note that there are bugs in it, and you fix them, and release a new version. Then, people say "hey, between version 1.4 and 1.5, you somehow broke this other feature."
You then wonder "gee, what exactly did I do between version 1.4 and 1.5 that could have done that?" With a source control system, you can just ask the machine to show you what changed between any two revisions and look at the changes. You can even revert changes that turned out to be bad ideas.

Now, for both of these things, you could carefully save a copy of your work every day, in some huge set of subdirectories, with copious notes about what every version was, but that gets very tedious and, in practice, people will not actually do such a thing. Also, if you do things by hand, the machine can't help you do things like automatically keeping multiple branches synchronized (see below).

As chemists, most of you are familiar with the notion of a lab notebook as an object that is never modified -- you keep putting new things in a lab notebook, but you have the pages numbered and the notebook exists as an indelible record of all your work. A source control system provides such a thing, in a sense, in the electronic world -- dutifully saving all intermediate versions of what you did so they can later be referred to in full. Changes go in and they remain in, forever. The system faithfully records the actual history of all changes ever made to the files under control.

It is a tool that must be used for a while to understand the depth of its utility.

Here is another example. Let's say you have several versions of a program out. Let's say you have a user community who have bought version 1 of your program. Meanwhile you are working on the not-yet-released version 2, but you still sometimes need to give bug fixes to the users who are only on version 1. (Think of Microsoft, who have to send you patches for Windows XP even as they work on Windows Vista.) You can, in a good source control system, maintain multiple different branches of development of your program, and, in the snazziest systems, automatically pull changes between them on request.
When you do a release, you branch your system, and you don't have to keep track of what is in the old version, the newer version and the development version of the system -- the source control system keeps track of it for you. If a bug is found in the released version, you can make a new revision *on the branch* to fix it, or (also interestingly) make a revision in the development line, and request that the system automatically update the branch with that particular fix, with the system assisting you in remembering which fixes have been "pulled up" to the branch and which have not.

As source code control systems have grown over the years, they've gained all sorts of features beyond merely storing code. For example, one source repository I've dealt with for managing a web site had a script that ran automatically when commits were made, to update the site from the source files. If you, say, fixed the spelling of something on 25 pages, and then did a commit, you didn't need to take further action -- the web site simply updated automatically.

A typical modern source control system provides lots of hooks for scripts written in languages like Bourne shell, Perl, etc., to be run before a commit is accepted (for instance, to make sure it conforms to particular coding standards or that the user has permission to make changes to a particular file), after a commit is accepted, etc. With such mechanisms, all sorts of very interesting "side effects" to commits become possible, like running your test suite automatically every time a commit happens, or sending mail to all the other developers when commits happen, etc.

By the way, I'm quite serious in saying that source code control systems are valuable in managing things like a book you are writing, web sites, paperwork in a law practice, you name it. Once you have gotten used to them, they become an amazingly valuable tool for tracking information over many years.

Okay, enough of an introduction.
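
For those who prefer to see ideas as code, here is a minimal sketch of the concepts described so far: check-ins with log messages, numbered versions, retrieving and comparing old versions, and a pre-commit hook. It is a toy that keeps one file's history in memory, and it is not how RCS, CVS or Subversion actually store things. The names ToyRepo, CommitRejected and no_tabs_hook are made up for this illustration.

```python
import difflib


class CommitRejected(Exception):
    """Raised when a pre-commit hook refuses a check-in."""


class ToyRepo:
    """Toy single-file version store (hypothetical, for illustration only)."""

    def __init__(self, pre_commit_hooks=None):
        self._versions = []  # list of (text, log_message), oldest first
        self._hooks = list(pre_commit_hooks or [])

    def commit(self, text, log_message):
        """Check in a new version and return its 1-based version number."""
        for hook in self._hooks:
            problem = hook(text)  # a hook returns None or an error string
            if problem:
                raise CommitRejected(problem)
        self._versions.append((text, log_message))
        return len(self._versions)

    def checkout(self, version):
        """Return the text of any previously committed version."""
        return self._versions[version - 1][0]

    def log(self):
        """Return (version, log message) pairs for the whole history."""
        return [(n, msg) for n, (_, msg) in enumerate(self._versions, start=1)]

    def diff(self, old, new):
        """Return unified-diff lines describing what changed from old to new."""
        a = self.checkout(old).splitlines()
        b = self.checkout(new).splitlines()
        return list(difflib.unified_diff(a, b, f"v{old}", f"v{new}", lineterm=""))


def no_tabs_hook(text):
    """Example coding-standard check: reject any text containing a tab."""
    return "tabs are not allowed" if "\t" in text else None


repo = ToyRepo(pre_commit_hooks=[no_tabs_hook])
v1 = repo.commit("energy = 1.0\nprint(energy)\n", "Initial version")
v2 = repo.commit("energy = 2.0\nprint(energy)\n", "Fix energy constant")
changes = repo.diff(v1, v2)  # what did I do between version 1 and 2?
```

Asking for repo.checkout(1) hands back the first version unchanged, and repo.log() lists every log message in order. Real systems add branching, merging, multiple files and network access on top of these same basic operations.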
You probably want to know a bit more now about what source control systems are out there and which one you might try using.

There are several different major "styles" of source control system.

First, there are very old school systems like SCCS (the first real source control system I'm aware of, now available in an open source clone, though I'm not sure anyone rational would want it any more except for reading old SCCS repositories), RCS (which is open source), and similar systems. Such systems don't really do terribly much. RCS is a typical example -- when you use RCS, there is a subdirectory in every directory of source code you manage named (funnily enough) "RCS". Every file being managed in the directory has a corresponding database file in the RCS directory -- for example, a file named "code.c" would also have a file called RCS/code.c,v associated with it that stores all the old versions of the file. You check in a new version of a file by using the self-contained "ci" program, retrieve an old version with the "co" program, etc. The system really just manipulates a couple of files locally with every command. It is straightforward, easy to understand, and not very flexible if you have a large project with lots of developers associated with it.

Systems like RCS and SCCS are largely obsolete for big software development projects, though I do use RCS for managing a few config files on my systems where CVS or SVN would be too heavyweight.

Second, there is the CVS paradigm, which is also largely followed by newer systems like Subversion (SVN). In this source control paradigm, you maintain a "repository" that contains the entire set of sources associated with a system, and developers check out, modify and check in sources to the repository, typically from remote systems using network protocols like ssh as the transport.

I happen to work part of the time on an open source operating system called NetBSD, and NetBSD is managed in a big CVS repository.
The repository sits physically on a machine in San Francisco, but developers do work on their local boxes around the world. If I want to "check out" the latest version of the sources, I run a command that connects to the CVS server and downloads all the changes that have been made since I last updated. If I am editing a part of the code and someone else changes the file I'm working on, the system automatically merges their changes into my local version of the file. If it can't merge the changes in because they conflict with my own work, it alerts me so I can take manual action to merge the changes. Every time someone checks in a change to the repository, an email message with what they've changed and their log message gets sent to the full set of developers, acting as a coordination mechanism within the project.

CVS, which is open source, is probably the world's most popular source control system right now, though it is getting a bit old, and has certain design defects that get annoying after a while. Subversion (also known as SVN) is a newer open source source control system that is in a lot of ways much better than CVS, and if someone is starting from scratch, I recommend looking at it.

CVS follows a centralized paradigm for source management in which there is a single true copy of the canonical sources on the repository system. A different paradigm was pioneered by Larry McVoy's "BitKeeper" (a commercial product), which allows groups of developers to maintain very disparate sets of source trees that are automatically merged into each other over the network when the various developers wish to do merges. I won't get too deeply into this way of working because I have to admit I don't like it much so I'm not a good spokesman for it, but many people seem to like the mechanism a lot -- enough so that there are now several open source source control systems that follow this model, such as arch and monotone.
Lots of people seem to really like them, so it is probably a mistake to pay much attention to my prejudice against such systems.

You may have noticed that I've mentioned a number of open source systems and not many commercial ones. That's for a good reason -- the open source systems are usually better than the commercial ones these days, or are at worst as good as them.

For completeness, though, I should mention the commercial ones.

Most importantly, what to avoid: Microsoft's Visual SourceSafe is garbage. I don't know how else to say that. It is buggy, horribly misbegotten and has not been maintained for years. Even Microsoft does not use it for their own work (I believe they may use Perforce -- I'm not entirely sure if I remember that right.) I cannot stress enough that there is no excuse on earth for using it.

There are some partisans for Rational's "ClearCase" product but I really don't like it much and it requires heroic efforts to administer. ClearCase operates by pretending to be a file system and noticing every write you make to a file, so it "automatically" versions everything. Unfortunately, it turns out that this is not a particularly easy trick to accomplish. As I said, I try to avoid it. It is also pretty expensive considering that it doesn't seem to work as well as simpler and easier systems.

Perforce and BitKeeper are both decent products, with Perforce being more in the RCS/CVS style and BitKeeper being the origin of the distributed model.

What do I use? Some projects I work on picked CVS years ago and there is a lot of inertia about changing, so we use CVS for those. Newer projects I deal with usually use Subversion, which is a very nice piece of software. For very small (one or two files) projects of my own that only I touch, I sometimes use RCS.

Perry