CCL: W:hardware for computational chemistry calculations



 Sent to CCL by: "Perry E. Metzger" [perry*|*piermont.com]
 "Eric Bennett ericb-,-pobox.com" <owner-chemistry[a]ccl.net>
 writes:
 > Perry Metzger writes:
 >>A strong recommendation though that I'll bring up here because it is
 >>vaguely OS related -- do NOT use more threads than processors in your
 >>app if you know what is good for you. Thread context switching is NOT
 >>instant, and you do not want to burn up good computation cycles on
 >>useless thread switching.
 >
 > Somewhat relevant to this: I have seen about a 25% throughput
 > increase in my MM calculations when using hyperthreading, running
 > four processes on a 2 CPU Xeon machine with hyperthreading on, as
 > compared to two processes with hyperthreading off.  In the special
 > case of hyperthreading sometimes you can benefit.
 Hyperthreading is an entirely different thing -- it is unfortunate
 that the two terms share a word. An Intel processor with
 hyperthreading can do useful work on a second instruction stream
 while the first is stalled waiting on something -- it is somewhat
 like having 1.25 processors instead of one. In that case, for
 selected apps, you want to treat the one processor as though it were
 two and have two threads running. This is still an instance of my
 rule, though -- you just treat a hyperthreaded processor as though it
 were more than one processor.
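 To make that concrete, here is a rough sketch of the rule in C (POSIX
 threads, with a made-up kernel standing in for real work): ask the OS
 how many logical processors it sees -- on a hyperthreaded box that
 count is already doubled -- and never run more compute threads than
 that:

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define N 8                      /* independent work units */
    static double partial[N];

    static void *crunch(void *arg)   /* stand-in compute kernel */
    {
        long i = (long)arg;
        double s = 0.0;
        for (long k = 1; k <= 10000000L; k++)
            s += 1.0 / (double)(k + i);
        partial[i] = s;
        return NULL;
    }

    int main(void)
    {
        long ncpu = sysconf(_SC_NPROCESSORS_ONLN);  /* logical CPUs */
        if (ncpu < 1) ncpu = 1;
        printf("using %ld threads\n", ncpu);

        /* run the work in waves of at most ncpu threads -- never more */
        for (long base = 0; base < N; base += ncpu) {
            pthread_t tid[N];
            long n = (N - base < ncpu) ? N - base : ncpu;
            for (long i = 0; i < n; i++)
                pthread_create(&tid[i], NULL, crunch, (void *)(base + i));
            for (long i = 0; i < n; i++)
                pthread_join(tid[i], NULL);
        }
        double total = 0.0;
        for (int i = 0; i < N; i++)
            total += partial[i];
        printf("result: %g\n", total);
        return 0;
    }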
 In my comment, I'm referring to the more general case -- you don't
 want to incur context switch penalties inside your program if you can
 help it. Dispatching an event costs about as much as a procedure
 call, but a thread switch takes tens to hundreds of times longer. Use
 threads ONLY to exploit the parallelism of the multiple processors in
 your machine, not for things like I/O multiplexing and the like.
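 The cheap alternative to a thread per descriptor is a single-threaded
 event loop. A bare-bones select() sketch (the descriptor list here is
 just illustrative -- a real program would register its own):

    #include <sys/select.h>
    #include <unistd.h>

    int main(void)
    {
        int fds[2] = { STDIN_FILENO, -1 };  /* -1 = unused slot */
        char buf[4096];

        for (;;) {
            fd_set rset;
            int maxfd = -1;
            FD_ZERO(&rset);
            for (int i = 0; i < 2; i++) {
                if (fds[i] < 0) continue;
                FD_SET(fds[i], &rset);
                if (fds[i] > maxfd) maxfd = fds[i];
            }
            if (maxfd < 0) break;   /* nothing left to watch */
            if (select(maxfd + 1, &rset, NULL, NULL, NULL) < 0)
                break;
            /* the "event dispatch" loop: a procedure call's worth each */
            for (int i = 0; i < 2; i++) {
                if (fds[i] < 0 || !FD_ISSET(fds[i], &rset)) continue;
                ssize_t n = read(fds[i], buf, sizeof buf);
                if (n <= 0) fds[i] = -1;            /* EOF or error */
                else write(STDOUT_FILENO, buf, n);  /* handle the data */
            }
        }
        return 0;
    }

 One thread, no context switches, and each ready descriptor costs you
 a couple of function calls to service.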
 > Having enough RAM is always the most important thing.  If you don't
 > have enough memory to hold your software and its working data set in
 > RAM, that will for certain be the limiting factor in your speed.
 >
 > 15,000 RPM drives are only available with SCSI interfaces; the SATA
 > drives, even with their higher data density, don't have performance
 > specs that match up (15K SCSI gets you max sustained transfers of
 > around 90 MB/sec).  So if you are doing something disk-intensive like
 > large QM calculations, there are still people who will buy SCSI.  QM
 > jobs can end up writing over 10 GB of scratch files.  For MM apps like
 > dynamics the disk speed is not critical.
 Let's say you have a computation that is I/O bound on access to a 10G
 file. Right now, an additional 10G of DRAM will cost ~$1200. The
 lowest-priced 15,000 RPM drives you can buy are ~$210, plus you need
 a decent SCSI controller, which can be another $200, so call it $410.
 If you want to stripe a couple of drives, the price goes up more.
 So the question becomes: does having enough memory to hold the whole
 scratch file in the buffer cache speed up your app enough to justify
 the marginal cost -- roughly $800 by the numbers above? That depends
 on how I/O bound you are. If you are only lightly I/O bound, the
 answer is not clear. If you are very I/O bound -- that is, if your
 CPU is idle most of the time because it is waiting for the disk --
 the answer can be a clear "yes": the added RAM (versus a fast disk)
 will essentially eliminate your I/O time, switching you to being
 compute bound. That can increase your throughput enough that you need
 a fraction of the number of computers. Say you're only using 30% of
 the CPU -- eliminating the I/O bottleneck with RAM is worth more than
 two additional computers to you, because you'll suddenly be using
 100% of the machine instead of 30%.
 Some years ago, I remember when it first became obvious that for some
 servers I was dealing with, buying 4G of memory so that the entire
 working set of files would fit in RAM meant that one machine could
 perform five or ten times better than boxes with even very fast
 disks. That was a giant win -- effectively the extra couple of
 gigabytes of RAM meant we didn't need four other computers.
 However, as the degree of I/O bottleneck goes down, the equation
 shifts. If you're only idle, say, 15% of the time, the economics
 become fuzzier. You have to do the calculation pretty carefully, but
 you may find that you're better off without the RAM if you are only
 waiting on the disk occasionally. And if your working set is, say,
 40G, there is no way to fit enough memory into the box, and you just
 have to bite the bullet (or buy a really big honking RAID array).
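 If you want to see the arithmetic, here is a toy version of the
 calculation. The $3000 per machine is a number I'm making up for
 illustration; the ~$800 RAM premium is from the prices above:

    /* Rough break-even estimate: if RAM eliminates the I/O wait
     * entirely, throughput scales by 1/utilization, and the capacity
     * gained can be priced against extra machines. */
    #include <stdio.h>

    int main(void)
    {
        double machine_cost = 3000.0;   /* assumed price per compute node */
        double ram_premium  = 800.0;    /* ~$1200 DRAM minus ~$410 disk rig */
        double util[] = { 0.30, 0.85 }; /* CPU busy: very vs lightly I/O bound */

        for (int i = 0; i < 2; i++) {
            double speedup = 1.0 / util[i];     /* 30% busy -> 3.3x      */
            double extra   = speedup - 1.0;     /* machines RAM replaces */
            printf("%2.0f%% busy: %.2fx speedup, ~$%.0f of machines vs $%.0f of RAM\n",
                   util[i] * 100.0, speedup,
                   extra * machine_cost, ram_premium);
        }
        return 0;
    }

 Run it and the two cases fall right out: at 30% utilization the RAM
 buys you roughly $7000 worth of machines for $800, while at 85% it
 buys you about $500 worth -- less than it costs.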
 All such calculations are economics, in the end.
 By the way, if your scratch file access is not random, your working
 set may be smaller than you think, and you may be able to get most of
 the effect with less memory, which again shifts the economics. On the
 other hand, if your working set even slightly exceeds RAM, you lose
 badly, because you're back to waiting on the disk constantly. Testing
 to determine your true working set size can therefore be very
 important. Knowing how to tune your OS for maximum cache hits is an
 esoteric skill, but one well worth picking up.
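 One crude way to probe your working set: time random reads over
 larger and larger spans of your scratch file and watch for the knee
 where the buffer cache stops absorbing them. A sketch -- the path is
 made up, and you should cap the spans at your real file size:

    #define _XOPEN_SOURCE 500   /* for pread() */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "/scratch/job.tmp";  /* hypothetical file */
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror(path); return 1; }

        char buf[8192];
        srand(1);
        /* spans from 256M to 8G, doubling each pass */
        for (long long span = 1LL << 28; span <= 1LL << 33; span <<= 1) {
            struct timeval t0, t1;
            gettimeofday(&t0, NULL);
            for (int i = 0; i < 2000; i++) {
                long long off = ((long long)rand() << 13) % span;
                pread(fd, buf, sizeof buf, (off_t)off);  /* 8K-aligned */
            }
            gettimeofday(&t1, NULL);
            double secs = (t1.tv_sec - t0.tv_sec)
                        + (t1.tv_usec - t0.tv_usec) / 1e6;
            printf("span %5lld MB: %6.0f reads/sec\n",
                   span >> 20, 2000.0 / (secs > 1e-9 ? secs : 1e-9));
        }
        close(fd);
        return 0;
    }

 Run it a couple of times so the cache is warm. While the span fits in
 cache you'll see memory-speed numbers; the first span that doesn't
 fit drops you to disk-seek rates, and that transition is the size you
 have to plan around.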
 --
 Perry E. Metzger		perry[a]piermont.com