From: "Robert Duke"
To: "Jim Phillips" ,
Subject: Re: CCL:PMEMD 3.1 Release - High Scalability Update to PMEMD
Date: Mon, 3 Nov 2003 15:57:27 -0500

Jim -

It was not my goal to be unfair in the comparisons, but to make the comparisons that I could, given the available benchmarks. I very specifically said something to the effect of the 90K atom benchmarks being "apples and oranges", but that I was assuming that the NAMD team would have picked a 90K atom benchmark representative of 90K atom production runs of NAMD. To my mind, the choice of a 12 A cutoff is unfortunate, in that it consumes more time in direct force calcs than is typically necessary for a PME simulation; we more typically see folks use 8 to 9 A values, and all the previously published PMEMD benchmarks indeed do use 8 A cutoffs. Note, however, that the 3-fold difference is not strictly correct; the direct force computation typically scales well, whereas this is not so true of the reciprocal force computation. I would like to see a 92K atom benchmark with 8 A cutoffs in which you all get 10 nsec/day on 1024 processors at EPCC (;-}).

I presumed you all were actually using the larger cutoff to compensate for errors associated with the use of a multi-timestep approach for PME, or simply because using a larger timestep would make the simulation scale better, though perhaps with worse low-end performance. I have never seen it to be necessary to use such a large cutoff, and indeed PMEMD and Sander 6/7 would have more trouble with really large cutoffs because they use pairlists. We used a 1.5 fs step for both the PME and direct force parts of the calculation, which gives us good results, even with highly ionic systems (the FIX benchmark is for a protein with gamma-carboxyglutamates and Ca++'s all over the place; the system is from published work).

Well, let's do apples and apples on your favorite machine, lemieux, which NAMD has been optimized for, and which PMEMD has NOT been optimized for (stock MPI; no fancy attempts to structure the code to the machine architecture or compiler; in fact I do not know the machine well at all and have limited access). Over the weekend I ran a series of benchmarks on lemieux on the JAC benchmark which you referenced below. I deviated from the exact benchmark specification in two inconsequential ways: 1) I did 4000 step simulations to get more accurate times, and 2) I used a "skinnb" of 1.0 instead of 2.0. PMEMD and Sander 6/7 by default check for atom movement exceeding the skin, so the skinnb parameter is purely an optimization parameter, and PMEMD does better with it set to 1.0 (i.e., this in no way affects the results of the simulation; PMEMD just builds smaller lists more frequently).
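For anyone not familiar with the skin idea, the bookkeeping I am describing is essentially the standard Verlet-list scheme; a minimal sketch follows (plain Python, purely illustrative -- this is the textbook version with the usual conservative skin/2 rebuild criterion, not PMEMD's actual code):

  import numpy as np

  def needs_rebuild(coords, coords_at_last_build, skin):
      # Textbook conservative criterion: rebuild once any atom has moved
      # more than half the skin since the list was last built, so no pair
      # can slip inside the cutoff without being on the list.
      disp = np.linalg.norm(coords - coords_at_last_build, axis=1)
      return disp.max() > 0.5 * skin

  def build_pairlist(coords, cutoff, skin):
      # Keep every pair within cutoff + skin; the extra shell is what lets
      # the list be reused for several steps.  O(N^2) here for brevity --
      # production codes use cell/grid decomposition.
      n = len(coords)
      pairs = []
      for i in range(n - 1):
          d = np.linalg.norm(coords[i + 1:] - coords[i], axis=1)
          for j in np.nonzero(d < cutoff + skin)[0] + i + 1:
              pairs.append((i, int(j)))
      return pairs

The trade-off is exactly the one above: a smaller skin gives shorter lists but more frequent rebuilds, and either way the computed forces are identical.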
I mostly used 4 processors per node, but specify total CPUs and nodes below so you can distinguish (for folks unfamiliar with the AlphaServers, or SP4's for that matter, you can sometimes improve performance by not using all the CPUs associated with a node, which is basically a clump of CPUs with some shared components).

JAC benchmark, 23558 atoms, on lemieux.psc.edu; results in wall clock seconds per 1000 steps:

  #cpu  #nodes  PMEMD 3.1 wc sec  NAMD 2.4 wc sec  PMEMD/NAMD speedup
     1       1            828.5             1385               1.67x
     2       1            431.5              750               1.74x
     4       1           218.25              390               1.79x
     8       2            118.5              198               1.67x
    16       4            67.25              105               1.56x
    32       8            37                  61               1.65x
    64      16            22.25               40               1.80x
    72      24            19                  nd                  nd
    96      32            17.75               nd                  nd
   120      40            18.75               nd                  nd
   128      32            22                  23               1.05x

Now, there is a published NAMD 2.5b2 benchmark on the SGI Origin that shows 2.5b2 outperforming 2.4 by 24% at 1 proc, tapering down to 10% at 64 procs and 0% at 126 procs (and doing worse at 252 procs). So taking that into account, PMEMD 3.1 is probably roughly 50% faster at the low end, and it slightly exceeds the NAMD maximum throughput using about half the processors required by NAMD.

By the way, I dislike this specific benchmark because 1) it is a small system, which means performance is less critical and you will also hit top-end performance at a lower processor count, 2) it is a constant volume simulation; at least for us the longer runs are constant pressure, and constant pressure is harder to scale, 3) it really does almost no i/o, which is unrealistic; reasonable amounts of i/o will impact high-end scaling slightly, and 4) if I am concerned about performance, I use 8 A cutoffs unless it is clear that a larger cutoff is needed.

On the less competitive side, I have looked at what you all have done on scaling, and am impressed, and would be even more impressed with decent scaling on the larger systems in conjunction with more reasonable cutoffs. I would expect the NAMD effort to be difficult to keep up with. My goal is to provide decent scaling to the Amber community, as a one-man effort.

My apologies to the CCL list for the long mail.

Regards - Bob Duke

----- Original Message -----
From: "Jim Phillips"
To:
Sent: Friday, October 31, 2003 3:29 PM
Subject: CCL:PMEMD 3.1 Release - High Scalability Update to PMEMD

> Dear CCL,
>
> Since this message and particularly the PMEMD 3.1 update notes (available at http://amber.scripps.edu/pmemd.3.1.UpdateNote.html) spend considerable time on comparisons to NAMD, I feel that a couple of clarifications are in order regarding NAMD performance and benchmarking.
>
> My first point is that not all 90K atom simulations are comparable. The bulk of the computation is in the direct nonbonded interactions, which scale roughly as cutoff^3 * numatoms^2 / volume. The apoa1 NAMD benchmark uses a 12A cutoff, the default specified by the CHARMM force field. I cannot find any indication of the cutoff distance of the Factor IX benchmark used by PMEMD, but the Amber 7 manual uses an 8A cutoff by default. This difference alone indicates that *NAMD is doing three times more work* than PMEMD, which is hardly fair.
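To put rough numbers on that cutoff^3 estimate (a back-of-the-envelope sketch only, in plain Python; it counts relative direct-space pair work at fixed atom count and density, and ignores the reciprocal-space PME work, which is the same either way and is why the overall wall-clock difference is less than 3-fold):

  # Relative direct-space cost ~ cutoff^3 * natoms^2 / volume, so at fixed
  # natoms and volume the ratio between two cutoffs is just (c1/c2)^3.
  for c1, c2 in [(12.0, 8.0), (12.0, 9.0), (9.0, 8.0)]:
      ratio = (c1 / c2) ** 3
      print(f"{c1:4.1f} A vs {c2:4.1f} A cutoff: {ratio:.2f}x more direct-space pair work")
  # 12 A vs 8 A -> 3.38x, 12 A vs 9 A -> 2.37x, 9 A vs 8 A -> 1.42x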
>
> An independent comparison of CHARMM, AMBER, and NAMD running the same Joint Amber-Charmm (JAC) 9A cutoff with PME benchmark is available at http://www.scripps.edu/brooks/Benchmarks/ and demonstrates that NAMD 2.4 serial performance is similar to other programs, with some differences depending on platform. Note that NAMD 2.4 is 18 months old and last month's NAMD 2.5 has been observed by users to be up to 40% faster on some platforms, as well as incorporating the scalability advances reported in our SC2002 paper (http://www.sc-2002.org/paperpdfs/pap.pap277.pdf), which received a Gordon Bell Award.
>
> NAMD has historically been tuned for HP Alpha and AMD Athlon, with major improvements in 2.5 for Intel IA64 and IA32. NAMD currently runs twice as fast on Itanium 2 as it does on POWER4. Our local users have never used an IBM SP for production, and while NCSA recently acquired several p690s, these were purchased for large shared memory applications and are not cost effective for NAMD simulations (vs our $30K 48-CPU Athlon clusters).
>
> Most recent NAMD scalability tuning has been done on PSC's Lemieux, and very little on any IBM SP. There are also serious performance issues with the interaction between the Charm++ (http://charm.cs.uiuc.edu) communication layer used in NAMD and IBM's MPI implementation, in which MPI_Isend fails to send data until the next MPI call after the receiving node has called MPI_Iprobe and posted an MPI_Recv. Charm++ relies on this idiom to implement message-driven execution on top of MPI, and it works acceptably on other platforms. We are working to address this issue.
>
> In conclusion, NAMD is available free of charge as source code or binaries for a dozen platforms and reads CHARMM and AMBER input file formats, so it is relatively painless to run your own benchmarks to decide if NAMD provides a performance benefit for your simulations.
>
> Sincerely,
>
> James Phillips, Ph.D.
> Senior Research Programmer for NAMD Development
> http://www.ks.uiuc.edu/Research/namd/
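For readers who have not seen the pattern Jim describes, the message-driven idiom looks roughly like the following (a minimal mpi4py sketch of the generic probe-then-receive pattern, not Charm++ or NAMD code; the reported problem is that the nonblocking send may not actually move data until the sender re-enters the MPI library after the receiver has probed and posted its receive):

  from mpi4py import MPI

  comm = MPI.COMM_WORLD
  rank = comm.Get_rank()
  TAG = 7

  if rank == 0:
      # Sender: post a nonblocking send and go back to useful work.
      req = comm.isend({"work_item": 42}, dest=1, tag=TAG)
      # ... overlap computation here ...
      req.wait()
  elif rank == 1:
      # Receiver: poll for an incoming message rather than pre-posting a
      # blocking receive, so other scheduled work can run between probes.
      while not comm.iprobe(source=MPI.ANY_SOURCE, tag=TAG):
          pass  # ... run other scheduled work here ...
      msg = comm.recv(source=MPI.ANY_SOURCE, tag=TAG)
      print("rank 1 received", msg)

Run with something like "mpiexec -n 2 python probe_sketch.py" (hypothetical file name); the only point is that the receive side discovers messages by probing instead of pre-posting receives.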
>
> On Fri, 31 Oct 2003, Robert Duke wrote:
>
> > We are proud to announce the release of version 3.1 (the first major performance update) of PMEMD (Particle Mesh Ewald Molecular Dynamics).
> >
> > PMEMD is a new version of the Amber module "Sander", and has been written with the major goal of improving performance in Particle Mesh Ewald molecular dynamics simulations and minimizations. The code has been totally rewritten in Fortran 90, and is capable of running in either an Amber 6 or Amber 7 mode. Functionality is more complete in Amber 6 mode, with the Amber 7 mode designed mostly to do the same sorts of things that Amber 6 does, but with output that is comparable to Amber 7 Sander. The calculations done in PMEMD are intended to replicate either Sander 6 or Sander 7 calculations within the limits of roundoff errors. The calculations are just done more rapidly in about half the memory, and runs may be made efficiently on significantly larger numbers of processors.
> >
> > The primary site for high scalability work on PMEMD 3.1 has been the Edinburgh Parallel Computing Centre (IBM P690 Regatta, 1.3 GHz Power4 CPU's, 1280 total processors), and we would like to thank EPCC for making their facilities available for this work. At EPCC, we have obtained maximum throughputs of 3.65 nsec/day (constant volume, 320 processors) and 3.48 nsec/day (constant pressure, 320 processors) for a 90906 atom PME solvated protein simulation. This compares to 0.41 nsec/day (constant pressure, 128 processors) for Sander 7 on the same simulation problem and 3.43 nsec/day (1024 processors) for NAMD on a similar simulation problem (92,224 atoms). More significant is performance at the "50% scalability point", the point where adding more processors will decrease compute efficiency below 50%. PMEMD 3.1 runs the above simulation with at least 50% scalability on 128 processors, producing 2.85 nsec/day throughput. For Sander 7, only 16 processors may be used without going below 50% scalability, and throughput is 0.28 nsec/day. For NAMD, 256 processors may be used at 50% scalability, but throughput is only 1.3 nsec/day. Additional benchmark data is presented in the Update Note available at the Amber website.
> >
> > PMEMD was developed by Dr. Robert Duke in Prof. Lee Pedersen's Lab at UNC-Chapel Hill, starting from the version of Sander in Amber 6. Funding support was provided by NIH grant HL-06350 (PPG) and NSF grant 2001-0759-02 (ITR/AP). When citing PMEMD (Particle Mesh Ewald Molecular Dynamics) in the literature, please use both the Amber Version 7 citation given in the Amber 7 manual, and the following citation:
> >
> > Robert E. Duke and Lee G. Pedersen (2003) PMEMD 3.1, University of North Carolina-Chapel Hill
> >
> > PMEMD is available without charge to users who have an existing license for Amber (version 6 or 7). For more information, and to download the code, please go to:
> >
> > http://amber.scripps.edu/pmemd-get.html
> >
> > - Robert Duke (UNC-Chapel Hill) and David Case (The Scripps Research Institute)
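As a footnote on the "50% scalability point" wording in the announcement quoted above: it is simply the processor count beyond which parallel efficiency drops under one half, and it can be read straight off single-processor and N-processor timings. A tiny worked example in Python, using the PMEMD numbers from the JAC table earlier in this mail (JAC here, not the 90K atom system the announcement refers to):

  # Parallel efficiency relative to one cpu: eff(N) = T(1) / (N * T(N)).
  # Timings are the PMEMD 3.1 wall-clock seconds from the JAC table above.
  t = {1: 828.5, 64: 22.25, 128: 22.0}
  for n in (64, 128):
      eff = t[1] / (n * t[n])
      print(f"{n:4d} cpus: parallel efficiency = {eff:.2f}")
  # ~0.58 at 64 cpus (still above 50%) and ~0.29 at 128 cpus (below it), so
  # for this particular run the 50% point falls between 64 and 128 cpus.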