From: "Robert Duke"
To: "Jim Phillips" ,
Subject: Re: CCL:PMEMD 3.1 Release - High Scalability Update to PMEMD
Date: Mon, 3 Nov 2003 15:57:27 -0500

Jim -

It was not my goal to be unfair in the comparisons, but to make the comparisons that I could, given the available benchmarks. I very specifically said something to the effect of the 90K atom benchmarks being "apples and oranges", but that I was assuming that the NAMD team would have picked a 90K atom benchmark representative of 90K atom production runs of NAMD. To my mind, the choice of a 12 A cutoff is unfortunate, in that it consumes more time in direct force calcs than is typically necessary for a PME simulation; we more typically see folks use 8 to 9 A values, and all the previously published PMEMD benchmarks indeed do use 8 A cutoffs. Note, however, that the 3-fold difference is not strictly correct; the direct force computation typically scales well, whereas this is not so true of the reciprocal force computation. I would like to see a 92K atom benchmark with 8 A cutoffs in which you all get 10 nsec/day on 1024 processors at EPCC (;-}).

I presumed you all were actually using the larger cutoff to compensate for errors associated with the use of a multi-timestep approach for PME, or simply because using a larger timestep would make the simulation scale better, though perhaps with worse low-end performance. I have never seen it to be necessary to use such a large cutoff, and indeed PMEMD and Sander 6/7 would have more trouble with really large cutoffs because they use pairlists. We used a 1.5 fs step for both the PME and direct force parts of the calculation, which gives us good results, even with highly ionic systems (the FIX benchmark is for a protein with gamma-carboxyglutamates and Ca++'s all over the place; the system is from published work).

Well, let's do apples and apples on your favorite machine, lemieux, which NAMD has been optimized for, and which PMEMD has NOT been optimized for (stock MPI; no fancy attempts to structure the code to the machine architecture or compiler; in fact I do not know the machine well at all and have limited access). Over the weekend I ran a series of benchmarks on lemieux on the JAC benchmark which you referenced below. I deviated from the exact benchmark specification in two inconsequential ways: 1) I did 4000 step simulations to get more accurate times, and 2) I used a "skinnb" of 1.0 instead of 2.0. PMEMD and Sander 6/7 by default check for atom movement exceeding the skin, so the skinnb parameter is purely an optimization parameter, and PMEMD does better with it set to 1.0 (i.e., this in no way affects the results of the simulation; PMEMD just builds smaller lists more frequently).
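For anyone not familiar with the skin idea, the bookkeeping I am describing is essentially the standard Verlet-list scheme; a minimal sketch follows (plain Python, purely illustrative -- this is the textbook version with the usual conservative skin/2 rebuild criterion, not PMEMD's actual code):

  import numpy as np

  def needs_rebuild(coords, coords_at_last_build, skin):
      # Textbook conservative criterion: rebuild once any atom has moved
      # more than half the skin since the list was last built, so no pair
      # can slip inside the cutoff without being on the list.
      disp = np.linalg.norm(coords - coords_at_last_build, axis=1)
      return disp.max() > 0.5 * skin

  def build_pairlist(coords, cutoff, skin):
      # Keep every pair within cutoff + skin; the extra shell is what lets
      # the list be reused for several steps.  O(N^2) here for brevity --
      # production codes use cell/grid decomposition.
      n = len(coords)
      pairs = []
      for i in range(n - 1):
          d = np.linalg.norm(coords[i + 1:] - coords[i], axis=1)
          for j in np.nonzero(d < cutoff + skin)[0] + i + 1:
              pairs.append((i, int(j)))
      return pairs

The trade-off is exactly the one above: a smaller skin gives shorter lists but more frequent rebuilds, and either way the computed forces are identical.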
I mostly used 4 processors per node, but specify total CPUs and nodes below so you can distinguish (for folks unfamiliar with the AlphaServers, or SP4's for that matter, you can sometimes improve performance by not using all the CPUs associated with a node, which is basically a clump of CPUs with some shared components).

JAC benchmark, 23558 atoms, on lemieux.psc.edu; results in wall clock seconds per 1000 steps:

  #cpu  #nodes  PMEMD 3.1 wc sec  NAMD 2.4 wc sec  PMEMD/NAMD speedup
     1       1            828.5             1385               1.67x
     2       1            431.5              750               1.74x
     4       1           218.25              390               1.79x
     8       2            118.5              198               1.67x
    16       4            67.25              105               1.56x
    32       8            37                  61               1.65x
    64      16            22.25               40               1.80x
    72      24            19                  nd                  nd
    96      32            17.75               nd                  nd
   120      40            18.75               nd                  nd
   128      32            22                  23               1.05x

Now, there is a published NAMD 2.5b2 benchmark on the SGI Origin that shows 2.5b2 outperforming 2.4 by 24% at 1 proc, tapering down to 10% at 64 procs and 0% at 126 procs (and doing worse at 252 procs). So taking that into account, PMEMD 3.1 is probably roughly 50% faster at the low end, and it slightly exceeds the NAMD maximum throughput using about half the processors required by NAMD.

By the way, I dislike this specific benchmark because 1) it is a small system, which means performance is less critical and you will also hit top-end performance at a lower processor count, 2) it is a constant volume simulation; at least for us the longer runs are constant pressure, and constant pressure is harder to scale, 3) it really does almost no i/o, which is unrealistic; reasonable amounts of i/o will impact high-end scaling slightly, and 4) if I am concerned about performance, I use 8 A cutoffs unless it is clear that a larger cutoff is needed.

On the less competitive side, I have looked at what you all have done on scaling, and am impressed, and would be even more impressed with decent scaling on the larger systems in conjunction with more reasonable cutoffs. I would expect the NAMD effort to be difficult to keep up with. My goal is to provide decent scaling to the Amber community, as a one-man effort.

My apologies to the CCL list for the long mail.

Regards - Bob Duke

----- Original Message -----
From: "Jim Phillips"
To:
Sent: Friday, October 31, 2003 3:29 PM
Subject: CCL:PMEMD 3.1 Release - High Scalability Update to PMEMD

> Dear CCL,
>
> Since this message and particularly the PMEMD 3.1 update notes (available at http://amber.scripps.edu/pmemd.3.1.UpdateNote.html) spend considerable time on comparisons to NAMD, I feel that a couple of clarifications are in order regarding NAMD performance and benchmarking.
>
> My first point is that not all 90K atom simulations are comparable. The bulk of the computation is in the direct nonbonded interactions, which scale roughly as cutoff^3 * numatoms^2 / volume. The apoa1 NAMD benchmark uses a 12A cutoff, the default specified by the CHARMM force field. I cannot find any indication of the cutoff distance of the Factor IX benchmark used by PMEMD, but the Amber 7 manual uses an 8A cutoff by default. This difference alone indicates that *NAMD is doing three times more work* than PMEMD, which is hardly fair.
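To put rough numbers on that cutoff^3 estimate (a back-of-the-envelope sketch only, in plain Python; it counts relative direct-space pair work at fixed atom count and density, and ignores the reciprocal-space PME work, which is the same either way and is why the overall wall-clock difference is less than 3-fold):

  # Relative direct-space cost ~ cutoff^3 * natoms^2 / volume, so at fixed
  # natoms and volume the ratio between two cutoffs is just (c1/c2)^3.
  for c1, c2 in [(12.0, 8.0), (12.0, 9.0), (9.0, 8.0)]:
      ratio = (c1 / c2) ** 3
      print(f"{c1:4.1f} A vs {c2:4.1f} A cutoff: {ratio:.2f}x more direct-space pair work")
  # 12 A vs 8 A -> 3.38x, 12 A vs 9 A -> 2.37x, 9 A vs 8 A -> 1.42x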
>
> An independent comparison of CHARMM, AMBER, and NAMD running the same Joint Amber-Charmm (JAC) 9A cutoff with PME benchmark is available at http://www.scripps.edu/brooks/Benchmarks/ and demonstrates that NAMD 2.4 serial performance is similar to other programs, with some differences depending on platform. Note that NAMD 2.4 is 18 months old and last month's NAMD 2.5 has been observed by users to be up to 40% faster on some platforms, as well as incorporating the scalability advances reported in our SC2002 paper (http://www.sc-2002.org/paperpdfs/pap.pap277.pdf), which received a Gordon Bell Award.
>
> NAMD has historically been tuned for HP Alpha and AMD Athlon, with major improvements in 2.5 for Intel IA64 and IA32. NAMD currently runs twice as fast on Itanium 2 as it does on POWER4. Our local users have never used an IBM SP for production, and while NCSA recently acquired several p690s, these were purchased for large shared memory applications and are not cost effective for NAMD simulations (vs our $30K 48-CPU Athlon clusters).
>
> Most recent NAMD scalability tuning has been done on PSC's Lemieux, and very little on any IBM SP. There are also serious performance issues with the interaction between the Charm++ (http://charm.cs.uiuc.edu) communication layer used in NAMD and IBM's MPI implementation, in which MPI_Isend fails to send data until the next MPI call after the receiving node has called MPI_Iprobe and posted an MPI_Recv. Charm++ relies on this idiom to implement message-driven execution on top of MPI, and it works acceptably on other platforms. We are working to address this issue.
>
> In conclusion, NAMD is available free of charge as source code or binaries for a dozen platforms and reads CHARMM and AMBER input file formats, so it is relatively painless to run your own benchmarks to decide if NAMD provides a performance benefit for your simulations.
>
> Sincerely,
>
> James Phillips, Ph.D.
> Senior Research Programmer for NAMD Development
> http://www.ks.uiuc.edu/Research/namd/
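For readers who have not seen the pattern Jim describes, the message-driven idiom looks roughly like the following (a minimal mpi4py sketch of the generic probe-then-receive pattern, not Charm++ or NAMD code; the reported problem is that the nonblocking send may not actually move data until the sender re-enters the MPI library after the receiver has probed and posted its receive):

  from mpi4py import MPI

  comm = MPI.COMM_WORLD
  rank = comm.Get_rank()
  TAG = 7

  if rank == 0:
      # Sender: post a nonblocking send and go back to useful work.
      req = comm.isend({"work_item": 42}, dest=1, tag=TAG)
      # ... overlap computation here ...
      req.wait()
  elif rank == 1:
      # Receiver: poll for an incoming message rather than pre-posting a
      # blocking receive, so other scheduled work can run between probes.
      while not comm.iprobe(source=MPI.ANY_SOURCE, tag=TAG):
          pass  # ... run other scheduled work here ...
      msg = comm.recv(source=MPI.ANY_SOURCE, tag=TAG)
      print("rank 1 received", msg)

Run with something like "mpiexec -n 2 python probe_sketch.py" (hypothetical file name); the only point is that the receive side discovers messages by probing instead of pre-posting receives.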
>
> On Fri, 31 Oct 2003, Robert Duke wrote:
>
> > We are proud to announce the release of version 3.1 (the first major performance update) of PMEMD (Particle Mesh Ewald Molecular Dynamics).
> >
> > PMEMD is a new version of the Amber module "Sander", and has been written with the major goal of improving performance in Particle Mesh Ewald molecular dynamics simulations and minimizations. The code has been totally rewritten in Fortran 90, and is capable of running in either an Amber 6 or Amber 7 mode. Functionality is more complete in Amber 6 mode, with the Amber 7 mode designed mostly to do the same sorts of things that Amber 6 does, but with output that is comparable to Amber 7 Sander. The calculations done in PMEMD are intended to replicate either Sander 6 or Sander 7 calculations within the limits of roundoff errors. The calculations are just done more rapidly in about half the memory, and runs may be made efficiently on significantly larger numbers of processors.
> >
> > The primary site for high scalability work on PMEMD 3.1 has been the Edinburgh Parallel Computing Centre (IBM P690 Regatta, 1.3 GHz Power4 CPU's, 1280 total processors), and we would like to thank EPCC for making their facilities available for this work. At EPCC, we have obtained maximum throughputs of 3.65 nsec/day (constant volume, 320 processors) and 3.48 nsec/day (constant pressure, 320 processors) for a 90906 atom PME solvated protein simulation. This compares to 0.41 nsec/day (constant pressure, 128 processors) for Sander 7 on the same simulation problem and 3.43 nsec/day (1024 processors) for NAMD on a similar simulation problem (92,224 atoms). More significant is performance at the "50% scalability point", the point where adding more processors will decrease compute efficiency below 50%. PMEMD 3.1 runs the above simulation with at least 50% scalability on 128 processors, producing 2.85 nsec/day throughput. For Sander 7, only 16 processors may be used without going below 50% scalability, and throughput is 0.28 nsec/day. For NAMD, 256 processors may be used at 50% scalability, but throughput is only 1.3 nsec/day. Additional benchmark data is presented in the Update Note available at the Amber website.
> >
> > PMEMD was developed by Dr. Robert Duke in Prof. Lee Pedersen's Lab at UNC-Chapel Hill, starting from the version of Sander in Amber 6. Funding support was provided by NIH grant HL-06350 (PPG) and NSF grant 2001-0759-02 (ITR/AP). When citing PMEMD (Particle Mesh Ewald Molecular Dynamics) in the literature, please use both the Amber Version 7 citation given in the Amber 7 manual, and the following citation:
> >
> > Robert E. Duke and Lee G. Pedersen (2003) PMEMD 3.1, University of North Carolina-Chapel Hill
> >
> > PMEMD is available without charge to users who have an existing license for Amber (version 6 or 7). For more information, and to download the code, please go to:
> >
> > http://amber.scripps.edu/pmemd-get.html
> >
> > - Robert Duke (UNC-Chapel Hill) and David Case (The Scripps Research Institute)
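As a footnote on the "50% scalability point" wording in the announcement quoted above: it is simply the processor count beyond which parallel efficiency drops under one half, and it can be read straight off single-processor and N-processor timings. A tiny worked example in Python, using the PMEMD numbers from the JAC table earlier in this mail (JAC here, not the 90K atom system the announcement refers to):

  # Parallel efficiency relative to one cpu: eff(N) = T(1) / (N * T(N)).
  # Timings are the PMEMD 3.1 wall-clock seconds from the JAC table above.
  t = {1: 828.5, 64: 22.25, 128: 22.0}
  for n in (64, 128):
      eff = t[1] / (n * t[n])
      print(f"{n:4d} cpus: parallel efficiency = {eff:.2f}")
  # ~0.58 at 64 cpus (still above 50%) and ~0.29 at 128 cpus (below it), so
  # for this particular run the 50% point falls between 64 and 128 cpus.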