Re: CCL:PMEMD 3.1 Release - High Scalability Update to PMEMD

Jim -
 It was not my goal to be unfair in the comparisons, but to make the
 comparisons that I could, given the available benchmarks.  I very
 specifically said something to the effect of the 90K atom benchmarks being
 "apples and oranges", but that I was assuming that the NAMD team would have
 picked a 90K atom benchmark representative of 90K atom production runs of
 NAMD.  To my mind, the choice of a 12 A cutoff is unfortunate, in that it
 consumes more time in direct force calcs than is typically necessary for a
 PME simulation; we more typically see folks use 8 to 9 A values, and all the
 previously published PMEMD benchmarks indeed do use 8 A cutoffs.  Note
 however, that the 3-fold difference is not strictly correct; the direct
 force computation typically scales well, whereas this is not so true of the
 reciprocal force computation.  I would like to see a 92K atom benchmark with
 8 A cutoffs in which you all get 10 nsec/day on 1024 processors at EPCC
 (;-}).  I presumed you all were actually using the larger cutoff to
 compensate for errors associated with the use of a multi-timestep approach
 for PME, or simply because using a larger timestep would make the simulation
 scale better, though perhaps have worse low-end performance.  I have never
 seen it to be necessary to use such a large cutoff, and indeed pmemd and
 sander 6/7 would have more trouble with really large cutoffs because they
 use pairlists.  We used a 1.5 fs step for both the PME and direct force
 parts of the calculation, which gives us good results, even with highly
 ionic systems (the FIX benchmark is for a protein with
 gamma-carboxyglutamates and Ca++'s all over the place; the system is from
 published work).
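
As a rough check of the cutoff argument, here is a minimal sketch that assumes the scaling quoted in the original message below (direct nonbonded work roughly proportional to cutoff^3 * numatoms^2 / volume) and compares cutoffs at equal atom count and volume, so that everything except the cutoff cancels. The function name is hypothetical, and the sketch covers the direct-space part only; it says nothing about the reciprocal-space part or about parallel scaling.

```python
def relative_direct_work(cutoff_a, cutoff_b):
    """Ratio of direct-space pair work for two cutoffs (same atom count and volume).

    Assumes work ~ cutoff**3 * numatoms**2 / volume, so only the cutoffs matter.
    """
    return (cutoff_a / cutoff_b) ** 3


# Compare the 12 A cutoff against the 8 A and 9 A cutoffs discussed in the thread.
for smaller in (8.0, 9.0):
    ratio = relative_direct_work(12.0, smaller)
    print(f"12 A vs {smaller:g} A: {ratio:.2f}x direct-space work")
```

Under this assumption the 12 A cutoff costs about 3.4 times the direct-space work of an 8 A cutoff and about 2.4 times that of a 9 A cutoff.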
 Well, let's do apples and apples on your favorite machine, lemieux, which
 NAMD has been optimized for, and which PMEMD has NOT been optimized for
 (stock mpi; no fancy attempts to structure the code to the machine
 architecture or compiler; in fact I do not know the machine well at all and
 have limited access).  Over the weekend I ran a series of benchmarks on
 lemieux on the JAC benchmark which you referenced below.  I deviated from
 the exact benchmark specification in two inconsequential ways: 1) I did 4000
 step simulations to get more accurate times, and 2) I used a "skinnb" of 1.0
 instead of 2.0.  PMEMD and Sander 6/7 by default check for atom movement
 exceeding the skin, so the skinnb parameter is an optimization parameter,
 and PMEMD does better with it set to 1.0 (i.e., this in no way affects the
 results of the simulation; PMEMD just builds smaller lists more frequently).
 I mostly used 4 processors per node, but specify total cpu's and nodes below
 so you can distinguish (for folks unfamiliar with the alphaservers, or sp4's
 for that matter, you can sometimes improve performance by not using all the
 cpu's associated with a node, which is basically a clump of cpu's with some
 shared components).  Results are in seconds for 1000 steps.
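
The skin trade-off described above can be illustrated with a generic Verlet pairlist. This is a minimal sketch, not the PMEMD or Sander code: the function names, the O(N^2) list build, the absence of periodic boundaries, and the skin/2 displacement test are all assumptions made for illustration.

```python
import math


def build_pairlist(coords, cutoff, skin):
    """Return all pairs (i, j), i < j, closer than cutoff + skin (O(N^2) for clarity)."""
    reach = cutoff + skin
    pairs = []
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) < reach:
                pairs.append((i, j))
    return pairs


def needs_rebuild(coords, coords_at_build, skin):
    """True once any atom has moved more than skin/2 since the list was built.

    Two atoms can close their separation by at most the sum of their displacements,
    so while every displacement is at most skin/2, no pair that was outside
    cutoff + skin at build time can have come within the cutoff.
    """
    return any(math.dist(now, then) > 0.5 * skin
               for now, then in zip(coords, coords_at_build))


# Hypothetical three-atom configuration (coordinates in Angstroms), 8 A cutoff.
start = [(0.0, 0.0, 0.0), (8.5, 0.0, 0.0), (9.5, 0.0, 0.0)]
moved = [(0.6, 0.0, 0.0), (8.5, 0.0, 0.0), (9.5, 0.0, 0.0)]
for skin in (1.0, 2.0):
    pairs = build_pairlist(start, cutoff=8.0, skin=skin)
    print(skin, len(pairs), needs_rebuild(moved, start, skin))
```

In this toy case the 1.0 skin gives the shorter list but already needs a rebuild after the 0.6 A move, while the 2.0 skin keeps a longer list that is still valid. Either way the forces within the cutoff are the same, which is why the skin is only a tuning parameter.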
 JAC Benchmark, 23558 atoms; results in wall clock seconds per 1000 steps.
 #cpu  #nodes  PMEMD 3.1 wc sec  NAMD 2.4 wc sec  PMEMD/NAMD speedup
    1       1             828.5             1385
    2       1             431.5              750
    4       1            218.25              390
    8       2             118.5              198
   16       4             67.25              105
   32       8                37               61
   64      16             22.25               40
   72      24                19               nd
   96      32             17.75               nd
  120      40             18.75               nd
  128      32                22               23
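
The table header names a PMEMD/NAMD speedup column, but no values for it appear in the rows. A minimal sketch for deriving one from the two timing columns follows. Reading "speedup" as NAMD wall clock time divided by PMEMD wall clock time is an assumption, and the names are hypothetical; rows marked nd are skipped.

```python
# (cpus, nodes, PMEMD 3.1 seconds, NAMD 2.4 seconds or None where "nd")
JAC_TIMINGS = [
    (1, 1, 828.5, 1385),
    (2, 1, 431.5, 750),
    (4, 1, 218.25, 390),
    (8, 2, 118.5, 198),
    (16, 4, 67.25, 105),
    (32, 8, 37, 61),
    (64, 16, 22.25, 40),
    (72, 24, 19, None),
    (96, 32, 17.75, None),
    (120, 40, 18.75, None),
    (128, 32, 22, 23),
]


def time_ratios(rows):
    """Return {cpus: NAMD time / PMEMD time} for rows that have both timings."""
    return {cpus: namd / pmemd
            for cpus, _nodes, pmemd, namd in rows
            if namd is not None}


for cpus, ratio in time_ratios(JAC_TIMINGS).items():
    print(f"{cpus:>4} cpus: NAMD 2.4 time / PMEMD 3.1 time = {ratio:.2f}")
```

With these numbers the ratio is roughly 1.7 at 1 cpu and about 1.05 at 128 cpus, before any allowance for a newer NAMD version.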
 Now, there is a published NAMD 2.5b2 benchmark on the SGI Origin that shows
 2.5b2 outperforming 2.4 by 24% at 1 proc, tapering down to 10% at 64 procs
 and 0% at 126 procs (and doing worse at 252 procs).  So taking that into
 account, PMEMD 3.1 is probably roughly 50% faster at the low end, and
 exceeds the NAMD max throughput slightly using about half the processors
 required by NAMD.  By the way, I dislike this specific benchmark because 1)
 it is a small system, which means performance is less critical and you will
 also hit top end performance at a lower processor count, 2) it is a constant
 volume simulation; at least for us our longer runs are constant pressure and
 constant pressure is harder to scale, 3) it really does almost no i/o, which
 is unrealistic; reasonable amounts of i/o will impact high end scaling
 slightly, and 4) if I am concerned about performance, I use 8 A cutoffs
 unless it is clear that a larger cutoff is needed.
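
One way to quantify how early a small system like this tops out is parallel efficiency, T(1) / (n * T(n)), applied to the PMEMD column of the table above. This is a sketch under assumptions: the efficiency formula is the usual textbook definition and may not match exactly how "50% scalability" is computed in the announcement quoted below, the function names are hypothetical, and the runs mix 3 and 4 cpus per node.

```python
# PMEMD 3.1 wall clock seconds per 1000 steps, keyed by cpu count (from the table).
PMEMD_SECONDS = {
    1: 828.5, 2: 431.5, 4: 218.25, 8: 118.5, 16: 67.25, 32: 37.0,
    64: 22.25, 72: 19.0, 96: 17.75, 120: 18.75, 128: 22.0,
}


def parallel_efficiency(times):
    """Return {cpus: T(1) / (cpus * T(cpus))} relative to the single-cpu run."""
    t1 = times[1]
    return {n: t1 / (n * t) for n, t in times.items()}


def largest_count_at_or_above(times, threshold=0.5):
    """Largest cpu count whose efficiency is still at or above the threshold."""
    eff = parallel_efficiency(times)
    return max(n for n, e in eff.items() if e >= threshold)


for n, e in parallel_efficiency(PMEMD_SECONDS).items():
    print(f"{n:>4} cpus: efficiency {e:.2f}")
print("fastest run at", min(PMEMD_SECONDS, key=PMEMD_SECONDS.get), "cpus")
print("last count at >= 50% efficiency:", largest_count_at_or_above(PMEMD_SECONDS))
```

On these timings efficiency drops below 50% between 72 and 96 cpus, and the shortest wall clock time is the 96 cpu run; beyond that, adding processors makes the run slower.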
 On the less competitive side, I have looked at what you all have done on
 scaling, and am impressed, and would be even more impressed with decent
 scaling on the larger systems in conjunction with more reasonable cutoffs.
 I would expect the NAMD effort to be difficult to keep up with.  My goal is
 to provide decent scaling to the Amber community, as a one man effort.
 My apologies to the CCL list for the long mail.
 Regards - Bob Duke
 ----- Original Message -----
 From: "Jim Phillips" <jim|at|>
 To: <chemistry|at|>
 Sent: Friday, October 31, 2003 3:29 PM
 Subject: CCL:PMEMD 3.1 Release - High Scalability Update to PMEMD
 > Dear CCL,
 > Since this message and particularly the PMEMD 3.1 update notes (available
 > at spend
 > time on comparisons to NAMD, I feel that a couple of clarifications are in
 > order regarding NAMD performance and benchmarking.
 > My first point is that not all 90K atom simulations are comparable.  The
 > bulk of the computation is in the direct nonbonded interactions, which
 > scale roughly as cutoff^3 * numatoms^2 / volume.  The apoa1 NAMD benchmark
> uses a 12A cutoff, the default specified by the CHARMM force field.  I
> cannot find any indication of the cutoff distance of the Factor IX benchmark
 > used by PMEMD, but the Amber 7 manual uses an 8A cutoff by default.  This
 > difference alone indicates that *NAMD is doing three times more work* than
 > PMEMD, which is hardly fair.
 > An independent comparison of CHARMM, AMBER, and NAMD running the same
 > Joint Amber-Charmm (JAC) 9A cutoff with PME benchmark is available at
> and demonstrates that NAMD 2.4
 > serial performance is similar to other programs, with some differences
 > depending on platform.  Note that NAMD 2.4 is 18 months old and last
 > month's NAMD 2.5 has been observed by users to be up to 40% faster on some
 > platforms, as well as incorporating the scalability advances reported in
 > our SC2002 paper (, which
 > received a Gordon Bell Award.
 > NAMD has historically been tuned for HP Alpha and AMD Athlon, with major
 > improvements in 2.5 for Intel IA64 and IA32.  NAMD currently runs twice as
 > fast on Itanium 2 as it does on POWER4.  Our local users have never used
 > an IBM SP for production, and while NCSA recently acquired several p690s,
 > these were purchased for large shared memory applications and are not cost
 > effective for NAMD simulations (vs our $30K 48-CPU Athlon clusters).
 > Most recent NAMD scalability tuning has been done on PSC's Lemieux, and
 > very little on any IBM SP.  There are also serious performance issues with
 > the interaction between the Charm++ (
 > communication layer used in NAMD and IBM's MPI implementation in which
> MPI_Isend fails to send data until the next MPI call after the receiving
> node has called MPI_Iprobe and posted an MPI_Recv.  Charm++ relies on this
> idiom to implement message driven execution on top of MPI, and it works
 > acceptably on other platforms.  We are working to address this issue.
 > In conclusion, NAMD is available free of charge as source code or binaries
 > for a dozen platforms and reads CHARMM and AMBER input file formats, so
 > it is relatively painless to run your own benchmarks to decide if NAMD
 > provides a performance benefit for your simulations.
 > Sincerely,
 > James Phillips, Ph.D.
 > Senior Research Programmer for NAMD Development
 > On Fri, 31 Oct 2003, Robert Duke wrote:
 > > We are proud to announce the release of version 3.1 (the first major
 > > performance update) of PMEMD (Particle Mesh Ewald Molecular Dynamics).
 > >
 > > PMEMD is a new version of the Amber module "Sander", and has
 > > with the major goal of improving performance in Particle Mesh Ewald
 > > molecular dynamics simulations and minimizations. The code has been
 > > rewritten in Fortran 90, and is capable of running in either an Amber
 > > Amber 7 mode.  Functionality is more complete in Amber 6 mode, with
 > > Amber 7 mode designed mostly to do the same sorts of things that Amber
 > > does, but with output that is comparable to Amber 7 Sander. The
 > > done in PMEMD are intended to replicate either Sander 6 or Sander 7
 > > calculations within the limits of roundoff errors. The calculations
 > > done more rapidly in about half the memory, and runs may be made
 > > on significantly larger numbers of processors.
 > >
 > > The primary site for high scalability work on PMEMD 3.1 has been the
 > > Edinburgh Parallel Computing Centre (IBM P690 Regatta, 1.3 GHz Power4
 > > 1280 total processors), and we would like to thank EPCC for making
 > > facilities available for this work.  At EPCC, we have obtained maximum
 > > throughputs of 3.65 nsec/day (constant volume, 320 processors) and
 > > nsec/day (constant pressure, 320 processors) for a 90906 atom PME
 > > protein simulation.  This compares to 0.41 nsec/day (constant
 > > processors) for Sander 7 on the same simulation problem and 3.43
 > > (1024 processors) for NAMD on a similar simulation problem (92,224
> > More significant is performance at the "50% scalability point", the
 > > where adding more processors will decrease compute efficiency below
 > > PMEMD 3.1 runs the above simulation with at least 50% scalability on
 > > processors, producing 2.85 nsec/day throughput.  For Sander 7, only 16
 > > processors may be used without going below 50% scalability, and
 > > is 0.28 nsec/day.  For NAMD, 256 processors may be used at 50%
 > > but throughput is only 1.3 nsec/day.   Additional benchmark data is
 > > presented in the Update Note available at the Amber website.
 > >
 > > PMEMD was developed by Dr. Robert Duke in Prof. Lee Pedersen's Lab at
 > > UNC-Chapel Hill, starting from the version of Sander in Amber 6.
 > > support was provided by NIH grant HL-06350 (PPG) and NSF grant
 > > (ITR/AP).  When citing PMEMD (Particle Mesh Ewald Molecular Dynamics)
 > > literature, please use both the Amber Version 7 citation given in the
 > > 7 manual, and the following citation:
 > >
 > > Robert E. Duke and Lee G. Pedersen (2003) PMEMD 3.1, University of
 > > Carolina-Chapel Hill
 > >
 > > PMEMD is available without charge to users who have an existing
 > > Amber (version 6 or 7).  For more information, and to download the
 > > please go to:
 > >
 > >       
 > >
 > >
 > > - Robert Duke (UNC-Chapel Hill) and David Case (The Scripps Research
 > > Institute)
 > >
 > > Send your subscription/unsubscription requests to:
 > > HOME Page:   | Jobs Page:
 > -= This is automatically added to each message by the mailing script =-
 > To send e-mail to subscribers of CCL put the string CCL: on your Subject:
 > and send your message to:  CHEMISTRY|at|
 > If your mail is bouncing from CCL.NET domain send it to the maintainer:
 > Jan Labanowski,  jkl|at| (read about it on CCL Home Page)
 > -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+