SUMMARY of SGI hardware question



 Dear netters:
 Here is the collection of responses for "SGI hardware question".
 Thanks for your help.
 My original question was:
 We are going to buy a computer to run GAUSSIAN94. We will probably
 choose one of the two following options from Silicon Graphics:
 (1) A two-processor (R8000's) PowerChallenge workstation with 128MB RAM
  (in total) and 6 GB hard disk.
 (2) Two single-processor PowerChallenge workstations, with 128MB RAM
 and 4GB hard disk each.
 Does anybody know which of the two options has better performance
 for GAUSSIAN calculations?
 Thanks in advance.
 Saulo A. Vazquez (qfsaulo #*at*# usc.es)
 Response 1:
   Hi,
      If you have enough jobs to keep both processors constantly busy, it
 doesn't matter so much.  The two processor machine is nice since it makes
 scheduling much easier; you can either queue a third job and it will start
 when either one of the other jobs finishes, or you can use npri to put it
 at a low priority and again, it will finish when either of the others
 is done.
      Alternatively you can just run all 3 together and each will run at 66%
 speed and again, when any one of them finishes, the other 2 will run at 100%
 speed.
      If you only have one job to run (especially a big one) you can set it
 to run in parallel mode and use both processors that way.  Due to extra
 overhead in parallel, however, only run in parallel if you only have 1 job
 to run.  (i.e. the time for 2 jobs running serial to finish will be faster
 than if they are both in parallel.)  If you only have one job, however, use
 the extra CPU!
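
 The serial-versus-parallel trade-off described above can be put into rough
 numbers with the following Python sketch. It is only an illustration: the
 100-minute job length is hypothetical, the 1.76 two-processor speed-up is an
 assumed value borrowed from the Gaussian 92 timings in Response 2, and the
 function names are made up for this example.

```python
def time_two_jobs_serial(t_job):
    """Two equal jobs, one per CPU, each running serially: both end after t_job."""
    return t_job


def time_two_jobs_parallel(t_job, speedup_2cpu):
    """Two equal jobs run back to back, each one using both CPUs in parallel."""
    return 2 * t_job / speedup_2cpu


def share_per_job(n_jobs, n_cpus):
    """Fraction of one CPU each job gets when n_jobs time-share n_cpus."""
    return min(1.0, n_cpus / n_jobs)


t_job = 100.0    # minutes per job on one CPU (hypothetical)
speedup = 1.76   # assumed 2-CPU speed-up, not a measured G94 value

serial = time_two_jobs_serial(t_job)
parallel = time_two_jobs_parallel(t_job, speedup)
print(f"two jobs, one per CPU:        {serial:.1f} min")
print(f"two jobs, each on both CPUs:  {parallel:.1f} min")
print(f"three jobs sharing two CPUs:  {share_per_job(3, 2):.0%} of a CPU each")
```

 With any speed-up below 2.0 the second figure is larger than the first,
 which is the reason for running in parallel only when a single job is
 waiting.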
      Bottom line, the 2 processor machine is a LOT more flexible, as long as
 you can afford the price difference.
      I hope this helps!
         Dan
 --
 Dr. Daniel L. Severance			dan #*at*# sage.syntex.com
 Staff Researcher			Work phone:	(415) 354-7509
 Syntex Discovery Research		Home phone:	(415) 969-5818
 R6W-002          		        Fax (Work):  	(415) 354-7363
 3401 Hillview Ave
 Palo Alto, CA  94303
 Response 2:
 Which option is the best depends on what you are most eager to calculate: many
 smaller jobs or fewer but bigger calculations.
 For the bigger jobs the Power Challenge can use several processors in parallel.
 This works very well in Gaussian 92 as you can see from the enclosed numbers.
 There is also the possibility to install additional processors later on. The
 parallel performance is probably even better in Gaussian 94. In the latter
 there is also network parallelism through the Linda parallel environment, but
 then you need additional software in order to run parallel computations. I
 have no idea about the performance of this option.
 ..............................................................................
 Gaussian 92 Test Job 178: TATB rhf/6-31g**//hf/6-31g**, 300 basis functions:
 System                           Time           Speed-up vs. 1*R8000
 SGI Indigo^2, R4000:            113.4 min
 SGI Challenge, 1*R4400:          74.3 min
 SGI Power Challenge, 1*R8000:    15.7 min       1.0
 Cray Y-MP, 1 processor:          13.2 min
 SGI Power Challenge, 2*R8000:     8.9 min       1.76
 SGI Power Challenge, 4*R8000:     5.5 min       2.85
 -------------------------------------------------------
 Values obtained from other sources:
 Cray C90 8/256, 1 processor:      4.25 min
 Cray C90 8/256 (incore), 1 P.:    1.5 min
 IBM 590/pwr2:                    17   min
 IBM 390/pwr2:                    41   min
 90 MHz Pentium:                 600   min
 ..............................................................................
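
 The speed-up figures in the table can be reproduced from the timings, and
 dividing by the processor count gives the parallel efficiency. The short
 Python sketch below does this; the timings are copied from the Power
 Challenge rows above, while the function names are chosen for illustration.

```python
# Power Challenge timings in minutes for Test Job 178, keyed by CPU count.
timings = {1: 15.7, 2: 8.9, 4: 5.5}


def speedup(t_one_cpu, t_n_cpus):
    """Ratio of the single-CPU time to the multi-CPU time."""
    return t_one_cpu / t_n_cpus


def efficiency(t_one_cpu, t_n_cpus, n_cpus):
    """Speed-up per processor (1.0 would be perfect scaling)."""
    return speedup(t_one_cpu, t_n_cpus) / n_cpus


for n_cpus, minutes in timings.items():
    s = speedup(timings[1], minutes)
    e = efficiency(timings[1], minutes, n_cpus)
    print(f"{n_cpus} CPU: speed-up {s:.2f}, efficiency {e:.0%}")
```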
 Best Regards
             /Johan Landin
 ___________________________________________________________________
 Johan Landin                       Tel:   +46 31 773 3767
 Dept. of Medical Biochemistry      Fax:   +46 31 41 6108
 Medicinaregatan 9                  Home:  +46 31 14 7554
 S-413 90 Goteborg, Sweden          Email: landin #*at*# mednet.gu.se
 Response 3:
 If one ignores cost, then the next question may well be how fast you wish to
 get a given job done.  On a 2 processor machine the effective speed-up is about
 1.85 or so.  You might contact Roberto Gomperts at SGI in Boston, the "keeper"
 of G94 on SGI hardware.  He knows more than most about your question.
 Regards,
 John
 --
 John M. McKelvey			email: mckelvey #*at*# Kodak.COM
 Computational Science Laboratory	phone: (716) 477-3335
 2nd Floor, Bldg 83, RL
 Eastman Kodak Company
 Rochester, NY 14650-2216
 --
 Response 4:
 I am afraid I cannot speak from experience of GAUSSIAN.   However I would
 go for the two separate workstations.  This would give you two screens, more
 core memory, more disk space and two separate processors anyway.  If by
 chance one of the machines has a fault you would most likely have the
 other one working.
 Yours sincerely
 Peter Bladon.
 Response 5:
 The answer will depend on how you use Gaussian. If you are going to only
 run one job at any given time, then the single machine with 2CPU's might
 give you better performance, since you can do some of the work in parallel.
 If you are going to be running lots of separate Gaussian runs, then you can
 run two separate jobs on the two machines at the same time.  Each individual
 run takes longer, but a collection takes about the same time (maybe even
 less since you don't have to pay for the overhead of parallelization).
 Most of the people here tend to run a "family" of Gaussian jobs at a time
 (same molecule, with different basis sets, or different constraints, etc).
 I would tend to favor the second option, especially if the two options are
 about equal in cost.
 a) I seem to recall that Gaussian is another one of those memory hogs, so
    "more memory is better". (More disk is better too)
 b) Two separate machines is better if (when) one dies.
 c) Sharing two machines is easier.
 The things that I can come up with that favor the single machine are related
 to networking and administration.  Essentially, it is easier to only have one
 machine to troubleshoot, upgrade, and administer.  If you already have other
 machines that you plan to network with this (these) new machine(s), then
 the extra work for the new machine(s) sort of blends in, because "it is
 always hardest the first time".
 -------------------------------------------------------------
                                       ("`-/")_.-'"``-._
 Wendy W. Richardson, Ph.D.            (. . `) -._    )-;-,_()
 Sr. Research Investigator             (v_,)'  _  )`-.\  ``-'
 Searle                                _;- _,-_/ / ((,'
 4901 Searle Parkway                 ((,.-'  ((,/
 Skokie, IL 60077                   wwrich #*at*# ddpi7.monsanto.com
 Response 6:
   Of the two hardware platforms you describe I would lean to the
 PowerChallenge with two processors over two single processor machines.
 The memory bandwidth is better than the PowerIndigo2 and G94 automatically
 compiles to run in parallel, if desired, on a PowerChallenge.  Gaussian 94
 defaults to 32MB of memory per process which is sufficient for the majority
 of calculations and so even with 2 processors 128 MB is sufficient.  Also
 G94 can now use your full disk by splitting its scratch files into 2GB
 chunks until SGI upgrades IRIX 6 to support files larger than 2GB.
   Gaussian 94 uses a shared memory parallel model on the PowerChallenge
 and HF and DFT energies, gradients and frequencies run in parallel.  Post-HF
 calculations take some advantage of parallelism, but the gain is less than
 spectacular at this time.  There is no additional cost for this capability.
   Let us know if you have additional questions.
 Doug Fox
 help #*at*# gaussian.com
 Response 7:
     It looks like, for the performance of ONE G94 task, you will get higher
 performance on the 2-processor system than on 2 systems connected in one
 cluster, provided the memory requirements of your task do not exceed what
 the 2-processor system has. This is because parallelization across a
 cluster is worse than on a 2-processor system.
     For a mix of G94 jobs you should get higher throughput on 2 independent
 1-processor systems, because of the larger total memory and the absence of
 bus contention.
     In terms of price/performance, the 2-processor system should be
 more attractive.
 Dr.Mikhail Kuzminsky,
 N.D.Zelinsky Institute of Organic Chemistry,
 Moscow
 Response 8:
 We have a Power Challenge L and 2 Power Indigo2 workstations:
      Power Challenge L   4xR8000, 512 MB, 2x8GB Gaussian scratch
                          directories (we implemented switching in a script)
                          striped across 2 Fast-Wide Differential SCSI-2
                          channels each with 2 4 GB disks (aggregate 40 MBs
                          and we sustain about 36 MBs)
                          Runs 2 Gaussian jobs simultaneously.  Each with
                          MEMORY=30 (240 MB).
      Power Indigo2       96/128 MB, 2/4 GB dedicated Gaussian scratch space
                          on a single Fast-Wide SCSI2 disk (10 MBs and we
                          sustain 6-8 MBs)
                          These systems are constrained to running only 1
                          Gaussian job at a time with "MEMORY=8" (64 MB).
      All Gaussian scratch disks are Seagate Barracuda drives.
 As far as raw CPU speed, the Power Indigo2 is about 0.8-0.9x the speed of a
 single processor on the PowerChallenge because of smaller cache and more
 limited bus bandwidth.  Parallel performance is very good over 2-4
 processors for HF and DFT calculations (degree of parallelization is about
 0.95-0.97).  Parallel performance is also good for MP2 calculations but not
 as good as HF or DFT (degree of parallelization is about 0.90).  However you
 will find I/O to be a bigger issue particularly with Gaussian codes even
 for DIRECT calculations.  Minimizing I/O will be a major concern and will
 dramatically impact your throughput.  If you can, use striped file systems
 for your Gaussian scratch directory;  and it would be best to stripe across
 disks on multiple SCSI channels as we do on the Power Challenge.  On the
 Power Indigo 2, I/O is typically 20-50% of the total job time while on the
 Power Challenge it's 5-15%.  Note that striping across a single SCSI
 channel will only improve I/O by about 20%.
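
 One way to read the "degree of parallelization" figures above is as the
 parallel fraction p in Amdahl's law, speed-up = 1 / ((1 - p) + p / N). That
 reading is an assumption made here for illustration, not something stated
 in the response. Under it, a short Python sketch gives the expected
 speed-ups on 2 and 4 processors, ignoring I/O:

```python
def amdahl_speedup(p, n_cpus):
    """Amdahl's law: speed-up on n_cpus when a fraction p of the work is parallel."""
    return 1.0 / ((1.0 - p) + p / n_cpus)


# Parallel fractions quoted in the text; interpreting them as Amdahl's p
# is an assumption of this sketch.
cases = [("HF/DFT, low end", 0.95), ("HF/DFT, high end", 0.97), ("MP2", 0.90)]

for label, p in cases:
    row = ", ".join(f"{n} CPUs: {amdahl_speedup(p, n):.2f}x" for n in (2, 4))
    print(f"{label} (p = {p:.2f}): {row}")
```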
 If you can...
      Power Indigo 2:
      Stripe across an internal (bus 0) and external (bus 1) disk on the
      Power Indigo2 or use a bus extender to use external devices on both
      channels.  This will effectively double your I/O.
      Power Challenge:
      You can stripe across both internal SCSI channels but one is
      differential (20 MBs) and the other isn't (10 MBs) so you'd be limited
      to aggregate transfer rates of 30 MBs.  Alternatively, you could
      stripe across the internal differential bus and the external bus which
      may be set to differential to get 40 MBs.
      We chose to take another route which offers future expandability.  We
      installed an additional HIO card and bus extenders which provide three
      additional external differential SCSI2 channels; cost about $3,000.
      We then used two of these together with external differential 4 GB
      disks to give us a striped array which sustains 36 MBs.  We plan to
      upgrade to 6 processors and stripe across 4 fast-wide SCSI-2 channels
      as soon as possible.
 Note that as your processors get faster I/O will become a larger fraction
 of the total job time.  Therefore you'll have to look to balance your
 system performance to maximize throughput.
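
 The point about faster processors can be illustrated with a very simple
 model in which job time is compute time plus I/O time, the two do not
 overlap, and only the compute part speeds up. These are simplifying
 assumptions, and the function name is hypothetical; the starting I/O
 fractions are taken from the 5-15% and 20-50% ranges quoted above.

```python
def io_fraction_after_cpu_upgrade(io_fraction, cpu_speedup):
    """I/O share of total job time after the compute part gets faster.

    Assumes I/O time is unchanged and does not overlap with computation.
    """
    io_time = io_fraction
    cpu_time = (1.0 - io_fraction) / cpu_speedup
    return io_time / (io_time + cpu_time)


for start in (0.05, 0.15, 0.20, 0.50):
    after = io_fraction_after_cpu_upgrade(start, 2.0)
    print(f"I/O at {start:.0%} of job time -> {after:.0%} with a 2x faster CPU")
```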
 Good luck!
  _______________________________________________________________________
 /                                                                       \
 | Comments are those of the author and not Unilever Research U. S.      |
 |                                                                       |
 | Karl F. Moschner, Ph. D.                                              |
 |                                                                       |
 | Unilever Research U. S.      e-mail: Karl.F.Moschner #*at*# urlus.sprint.com |
 | 45 River Road                Phone:  (201) 943-7100 x2629             |
 | Edgewater, NJ 07020          FAX:    (201) 943-5653                   |
 \_______________________________________________________________________/
 Response 9:
 It depends a bit on the type of jobs you are going to run. Option 1
 would allow parallelization, which is done reasonably well for a
 number of runtypes (in particular HF and DFT). This would mean one job
 at a time.
 If you want to run one job per processor, then go for option 2: it has
 more RAM available for each calculation and I/O will not interfere.
 Regardless of which configuration you choose: make sure that the scratch
 space is spread over at least two disks, in the form of a striped device.
 This greatly improves the I/O performance. For example, in the case of
 option 2 this would mean two 2GB disks, with 1GB of each combined into a
 striped scratch partition of 2GB total. With one 4GB disk, I/O performance
 is significantly worse.
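
 To show why a striped device helps, here is a minimal Python sketch of
 generic round-robin striping, in which consecutive chunks of the scratch
 file land on alternating disks so that both can transfer at once. It is not
 a description of how IRIX lays out its striped volumes, and the 64 KB
 stripe unit and the function name are arbitrary assumptions.

```python
def stripe_location(offset, stripe_unit, n_disks):
    """Map a byte offset on the striped device to (disk index, offset on that disk)."""
    stripe_index, within = divmod(offset, stripe_unit)
    disk = stripe_index % n_disks    # chunks rotate over the disks
    row = stripe_index // n_disks    # how many full rounds came before
    return disk, row * stripe_unit + within


KB = 1024
unit = 64 * KB  # hypothetical stripe unit

for offset in (0, 64 * KB, 128 * KB, 192 * KB + 100):
    disk, disk_offset = stripe_location(offset, unit, n_disks=2)
    print(f"device offset {offset:>7} -> disk {disk}, offset {disk_offset}")
```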
 Best wishes,
       Nico van Eikema Hommes
 --
   Dr. N.J.R. van Eikema Hommes     Computer-Chemie-Centrum
   hommes #*at*# ccc.uni-erlangen.de       Universitaet Erlangen-Nuernberg
   Phone:    +49-(0)9131-856532     Naegelsbachstr. 25
   FAX:      +49-(0)9131-856566     D-91052 Erlangen, Germany