CCL: help needed



 Sent to CCL by: "Alex. A. Granovsky" [gran]~[classic.chem.msu.su]
 Hi Perry,
 > About a year ago I conducted a demonstration in which a single
 > processor machine running NetBSD (not Linux, but the principle would
 > apply to Linux as well) successfully rebuilt a large software system
 > many times faster than a four processor Windows server. The Windows
 > machine had more memory, and each individual processor was faster than
 > the NetBSD machine's one processor. The reason? The Windows machine
 > was unable to keep enough pages in memory to be able to keep its four
 > processors running at 100%. The NetBSD box more or less put everything
 > it needed into memory once and barely touched the disk again, so its
 > processor hit 100% and stayed pinned there. If the Windows machine had
 > been able to do this, it would have easily outperformed the NetBSD
 > machine, but since it could not, most of its four expensive processors
 > were sitting idle most of the time.
     This is an unfair comparison. In fact, we both know this, but it is much
 more unfair to present such an incorrect example to people who do not.
     First of all, when describing any benchmark you should avoid purely
 qualitative statements. You must fully specify the hardware configurations,
 the OS versions and relevant settings, any nonstandard drivers used, and so
 on. You must also provide enough detail on the particular benchmark itself.
 In your case, you should specify the compilers used on the NetBSD and Windows
 systems, indicate the compiler options that significantly affect compilation
 performance, provide sufficient detail on the software system that was
 rebuilt, explain how the additional processors were used (if they were used
 at all) during compilation on the Windows machine, and so on.
      Second, we all know that there are many situations in which running I/O-
 or memory-limited tests in parallel on an SMP system gives worse performance
 than using a single thread or a single process.
     Next, if you used Cygwin to compile on the Windows system, you must agree
 that it introduces significant overhead in many situations, e.g., process
 initialization, forking, and so on. This overhead is usually insignificant for
 typical computational chemistry programs, but it certainly matters for
 compilers, which create a new process for every source file they build. Also,
 nothing was done to the Cygwin build of gcc to optimize it for the Windows OS.
 On the other hand, if you used a different compiler, it is incorrect to
 compare compilation times at all, since nobody knows how much additional time
 was spent generating and optimizing the code.
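 To give a feeling for this overhead, here is a rough, purely illustrative
 microbenchmark (my own sketch, not part of the original comparison; the
 iteration count and the use of /bin/true are arbitrary) that times plain POSIX
 fork+exec cycles, the operation a compiler driver performs for every source
 file. Under Cygwin, fork() has to be emulated on top of Win32 process
 creation, which is the main source of the overhead I mean.

/* fork_bench.c - rough timing of process creation + exec overhead.
 * Illustrative only; the numbers depend heavily on OS, libc and load.
 * Build: cc -O2 -o fork_bench fork_bench.c
 */
#include <stdio.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    const int n = 1000;             /* number of child processes to spawn */
    struct timeval t0, t1;

    gettimeofday(&t0, NULL);
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == 0) {             /* child: exec a trivial program */
            execl("/bin/true", "true", (char *)NULL);
            _exit(127);             /* reached only if exec fails */
        } else if (pid > 0) {
            waitpid(pid, NULL, 0);  /* parent: reap the child */
        } else {
            perror("fork");
            return 1;
        }
    }
    gettimeofday(&t1, NULL);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%d fork+exec cycles in %.3f s (%.3f ms each)\n",
           n, sec, 1000.0 * sec / n);
    return 0;
}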
     Finally, you must admit that this is not the type of benchmark that
 interests or matters to a typical computational chemist. In computational
 chemistry, building the project is not the most time-consuming step; using
 the program for number-crunching is. This is why I suggested using PC GAMESS
 as the benchmark. It seems you are not very familiar with typical quantum
 chemistry codes; otherwise you would easily recognize that any non-trivial,
 high-quality QC program can be used to impose (or model) almost any desired
 load scenario on the OS and the computer hardware.
 > I've personally conducted extensive benchmarking on this specific
 > topic, and I've read enormous amounts of Microsoft documentation.
 So have I.
 > The page cache policy in Windows is utterly primitive. As a result of
 > this, file pages are evicted from cache long before they need to
 > be.
 > You can, of course, set the registry key to tell the box to behave as
 > a file server, at which point, executable pages are evicted from cache
 > long before they need to be. There is no in between. There are no
 > pluggable policies. You can't even tune the policy that is there.
 This is only partially true. The Windows API provides standard ways for any
 particular application to optimize its own memory and I/O usage. Most of
 these API calls do not require any nonstandard rights; some (those with
 potentially system-wide effects) do, but those rights can always be granted
 on a per-user or per-group basis. What matters is to use this API, and to use
 it properly. This is simply a different philosophy: let the application
 itself optimize its execution strategy. It is evident that only the
 application knows best how it should be executed and how OS resources should
 be managed for the best performance. One can spend years improving OS kernels
 (and it is surely worthwhile), but there will always be applications for
 which the default OS algorithms perform badly.
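 To make this concrete, here is a minimal sketch (my own, purely illustrative;
 the file name, the working-set sizes, and the error handling are
 placeholders) of the kind of per-application hints I have in mind, using only
 documented Win32 calls:

/* win_hints.c - sketch of per-application memory/IO hints on Win32.
 * Illustrative only; the sizes and the file name are placeholders.
 * Build (MinGW): gcc -O2 -o win_hints win_hints.c
 */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Tell the cache manager we will read this file strictly sequentially,
       so that read-ahead is used and pages can be recycled behind us.
       FILE_FLAG_RANDOM_ACCESS or FILE_FLAG_NO_BUFFERING would express
       other access patterns. */
    HANDLE h = CreateFileA("scratch.dat", GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);
    if (h == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "CreateFile failed: %lu\n", GetLastError());
        return 1;
    }

    /* Ask for a larger guaranteed working set so that hot data and code stay
       resident. Depending on the Windows version and the requested sizes,
       this may need additional user rights, which can be granted per user or
       per group, as noted above. */
    if (!SetProcessWorkingSetSize(GetCurrentProcess(),
                                  64u * 1024 * 1024,    /* minimum, bytes */
                                  256u * 1024 * 1024))  /* maximum, bytes */
        fprintf(stderr, "SetProcessWorkingSetSize failed: %lu\n",
                GetLastError());

    /* ... read and process the file here ... */

    CloseHandle(h);
    return 0;
}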
 > This specific issue comes down to this: Windows does not have a true
 > unified VM subsystem architecture in which the buffer cache is
 > properly integrated into the virtual memory subsystem. It also does
 > not have a flexible, self tuning mechanism for managing the tradeoff
 > between using memory for executable pages, data pages and file cache
 > pages. The result of this is that it does lots of I/O when it doesn't
 > need to, drastically hurting performance.
 I do not agree with this. Certainly, Windows' caching algorithms differ from
 those of Linux or the BSD derivatives, and there are situations in which the
 Unix-like strategies perform better (or worse). Nevertheless, this is almost
 irrelevant to real computational chemistry, where we typically have one of
 two scenarios: either huge data files (handled either sequentially or
 randomly), or very compact data files. The intermediate case is exotic.
 Aggressive data caching can (and often does) seriously reduce performance
 when working with really large datasets. Just one example: have you ever
 tried to do any non-trivial work with a file twice as large as system memory
 under Linux (or any other OS)? If not, just try it, and you will most likely
 find that the default system strategy is not perfect.
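 To see why, consider what the application must do explicitly to avoid the
 problem. Here is a minimal sketch (mine; it assumes a Linux system where
 posix_fadvise is available, and the file name and chunk size are arbitrary)
 of streaming through a file much larger than RAM while telling the kernel to
 drop already-processed pages from the page cache:

/* fadvise_stream.c - stream through a file much larger than RAM without
 * polluting the page cache: after each chunk has been consumed, tell the
 * kernel its pages are no longer needed. Illustrative sketch only;
 * "huge.dat" and the 64 MB chunk size are placeholders.
 * Build: cc -O2 -o fadvise_stream fadvise_stream.c
 */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK (64L * 1024 * 1024)   /* drop cached pages in 64 MB steps */

int main(void)
{
    int fd = open("huge.dat", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char *buf = malloc(1 << 20);    /* 1 MB read buffer */
    if (!buf) { close(fd); return 1; }

    off_t done = 0;                 /* bytes fully processed so far */
    ssize_t n;

    while ((n = read(fd, buf, 1 << 20)) > 0) {
        /* ... number-crunch on buf here ... */
        done += n;
        if (done % CHUNK == 0)
            /* The chunk we just finished will not be reused; drop it from
               the cache so it does not evict more useful pages. */
            posix_fadvise(fd, done - CHUNK, CHUNK, POSIX_FADV_DONTNEED);
    }

    free(buf);
    close(fd);
    return 0;
}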
 > You can argue all you like about how much nicer Windows is. Perhaps it
 > is. That is subjective. The OBJECTIVE benchmarks, however, show that
 > it is trivial to drive a Windows box out of page cache and make it
 > stall.
 This is true for some aggressive, badly written applications. On the other
 hand, it is also quite trivial to put any 2.4.x Linux kernel into the same
 state. Did you know that in many cases swapoff -a is the only way to run the
 calculation without damaging your HDDs and to get the results in finite time?
 It is less trivial to do this with 2.6.x, but it is still possible.
 Once again, you simply have to accept that Windows is quite a different world
 with somewhat different rules for writing efficient programs. The point is
 that it is usually a very good idea to let the programmer control how his
 program is executed, how his files are cached, and so on. This is what
 Windows allows.
 > > If one would be really interested in Windows vs. Linux performance,
 > > it is a good idea to use PC GAMESS as the benchmark
 >
 > No, actually, that isn't a particularly good benchmark. Proper
 > benchmarking requires that you use a variety of tasks that exercise a
 > variety of operating conditions. A single program can never be a
 > good benchmark. In a computational chemistry context, a variety of
 > loads are needed to properly assess the differences between the two
 > systems.
 See above.
 >
 > > It is not the problem at all to create an input file which will put
 > > Linux memory & I/O subsystems down.
 >
 > That actually is false. If you have conditions in which the default
 > settings of the Linux algorithms are working incorrectly, then by a
 > simple tuning process you can alter the tradeoff between executable,
 > data and file pages and optimize your performance.
 What you forgot to say here is that (a) you must be root, and (b) changes
 made on behalf of any particular program have a system-wide effect. Don't you
 think that an API usable by any non-privileged user, such as Windows has
 provided for years, would be of great help here?
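 For the record, the tuning you refer to looks something like the sketch below
 (mine; /proc/sys/vm/swappiness is just one example knob on a 2.6 kernel, and
 the value 10 is arbitrary). It fails for an ordinary user, and when it does
 succeed it changes the behaviour of every process on the machine, which is
 exactly my points (a) and (b):

/* vm_tune.c - sketch showing the system-wide nature of Linux VM tuning.
 * Writing /proc/sys/vm/swappiness changes the page-cache vs. swap tradeoff
 * for the WHOLE system and normally fails unless run as root.
 * The value 10 is an arbitrary example.
 */
#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/vm/swappiness", "w");
    if (!f) {
        /* Typical outcome for a non-privileged user: Permission denied. */
        fprintf(stderr, "cannot open /proc/sys/vm/swappiness: %s\n",
                strerror(errno));
        return 1;
    }
    /* Low value: prefer reclaiming file cache over swapping out program
       memory. */
    fprintf(f, "10\n");
    fclose(f);
    puts("swappiness changed for ALL processes on this machine");
    return 0;
}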
 > It may be true that if you don't know what you are doing, you can't
 > make a Linux box perform properly, but at least there is stuff you can
 > do if you know what you're doing. Even if you know what you are doing,
 > there is little you can do to tune Windows properly.
 Well, there is not much need to tune Windows, because there is much I can do
 inside the program itself to make it perform well on Windows. Nevertheless,
 there are some things I can do system-wide on Windows too, although I agree
 that you have more options under Linux.
 > > The same is true for Windows to some degree. I personally do
 > > not aware of any (and doubt if it is possible at all) good
 > > implementation of memory management in any OS for the case of
 > > simultaneous heavy I/O and memory load.
 >
 > Then you aren't paying close attention to the work that people have
 > done in the last 20 years in operating system design.
 Then where are these operating systems that make any QC code fly?
 > > Nevertheless, Windows has much more advanced memory and I/O API than
 > > Linux
 > Oh, really? Can you explain, then, why it is that Unix systems easily
 > beat Windows on high performance network I/O,
 > If Windows is so fast, one might ask why it is that the record for
 > fastest TCP transmission rates is not held by Windows hardware, and
 > why researchers on networking performance rarely do their work under
 > Windows.
 You can look at http://www.tpc.org/tpcc/results/tpcc_perf_results.asp
 >  why there is no ability to tune the Windows page cache
 as there is no need to do this system-wide :-)
 > why Windows is so much worse at context switches, etc?
 Oh, really? Would you like to discuss the threading API, its implementation,
 and its performance under Linux vs. Windows?
 > Sure, Cutler stole a lot of VMS code to build NT. If you assume VMS
 > nearly 20 years ago was the best model of how to do I/O on earth, I'm
 > sure you're a Windows fan. The benchmarks don't agree.
 VMS was a really good OS in many respects; I used to work under VMS...
 > This is also why gcc, which is far larger than ANY computational
 > chemistry package ever built and far more complicated, doesn't exist.
 It seems you are simply not familiar with QC codes. A good QC package is
 _at least_ as complex as gcc, in terms of non-trivial algorithms, size of the
 sources, the know-how involved, and so on.
 > Unfortunately, a lot of this stuff is hard. The problem is that issues
 > like virtual memory subsystem architecture aren't any more easily
 > understood by non-specialist than an ab initio computational chemistry
 > system is easily understood by the non-specialist. Really
 > understanding the issues requires that you have an operating systems
 > class under your belt, or the equivalent.
 Is the statement above ethical? Should we assume you are the only person
 around who knows both computational chemistry and OS kernel programming?
 Should your opinion be the only law for us?
 Best regards,
 Alex Granovsky