CCL: help needed
- From: "Alex. A. Granovsky"
<gran-*-classic.chem.msu.su>
- Subject: CCL: help needed
- Date: Sat, 29 Oct 2005 02:47:54 +0400
Sent to CCL by: "Alex. A. Granovsky" [gran]~[classic.chem.msu.su]
Hi Perry,
> About a year ago I conducted a demonstration in which a single
> processor machine running NetBSD (not Linux, but the principle would
> apply to Linux as well) successfully rebuilt a large software system
> many times faster than a four processor Windows server. The Windows
> machine had more memory, and each individual processor was faster than
> the NetBSD machine's one processor. The reason? The Windows machine
> was unable to keep enough pages in memory to be able to keep its four
> processors running at 100%. The NetBSD box more or less put everything
> it needed into memory once and barely touched the disk again, so its
> processor hit 100% and stayed pinned there. If the Windows machine had
> been able to do this, it would have easily outperformed the NetBSD
> machine, but since it could not, most of its four expensive processors
> were sitting idle most of the time.
This is an unfair comparison. In fact, we both know this, but it is far
more unfair to present such incorrect examples to people who do not.
First of all, when describing any benchmark you must avoid purely
qualitative statements. You must fully specify the hardware configurations,
the OS versions and relevant settings, any nonstandard drivers used, etc.,
etc. You must also provide enough detail on the particular benchmark itself.
In your case, you should specify the compilers used on the NetBSD and
Windows systems, indicate the compiler options that significantly affect
performance, provide sufficient details on the software system that was
rebuilt, explain how the additional processors were used (if they were used
at all) during compilation on the Windows system, etc.
Second, we all know that there are many situations in which running I/O- or
memory-limited tests in parallel on an SMP system results in worse
performance than using only a single thread or a single process.
Next, if you used Cygwin to compile on the Windows system, you must agree
that there are many situations where it introduces significant overhead,
e.g., in process initialization, forking, etc. This overhead is usually not
significant for typical computational chemistry programs, but it certainly
is for compilers. Also, nothing was done to optimize the Cygwin build of gcc
for the Windows OS. On the other hand, if you used a different compiler, it
is incorrect to compare compilation times at all, as nobody knows how much
additional time was spent generating and optimizing the code.
Finally, you must admit that this is not the type of benchmark of interest
or importance to a typical computational chemist. In computational
chemistry, building the project is not the most time-consuming step; using
the program for number-crunching is. This is why I suggested using PC GAMESS
as the benchmark. It seems you are not very familiar with typical quantum
chemistry codes; otherwise, you would easily recognize that any non-trivial,
high-quality QC program can be used to impose (or model) almost any desired
load scenario on the OS and the computer hardware.
> I've personally conducted extensive benchmarking on this specific
> topic, and I've read enormous amounts of Microsoft documentation.
So have I.
> The page cache policy in Windows is utterly primitive. As a result of
> this, file pages are evicted from cache long before they need to
> be.
> You can, of course, set the registry key to tell the box to behave as
> a file server, at which point, executable pages are evicted from cache
> long before they need to be. There is no in between. There are no
> pluggable policies. You can't even tune the policy that is there.
This is only partially true. The Windows API provides standard ways for any
particular application to optimize its own memory and I/O usage. Most of
these API calls do not require any nonstandard rights; some (those with
potentially system-wide effects) do, but such rights can always be granted
on a per-user or per-group basis. What is important is to use this API, and
to use it properly. It is simply a different philosophy - letting the
application itself optimize its execution strategy. It is evident that only
the application knows best how it should be executed and how OS resources
should be managed for the best performance. One can spend many years
improving OS kernels - and that is surely worthwhile - but there will always
be applications for which the default OS algorithms perform badly.
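To illustrate what I mean by letting the application drive the policy, here
is a minimal sketch in C of the kind of per-application hints the Win32 API
offers. It is only an example, not code from any real program: the file name
and the working-set figures are made up.

    /* Minimal sketch: per-application memory/IO hints on Win32.
     * The file name and sizes below are illustrative only. */
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        /* Tell the cache manager the file will be read sequentially, so it
         * can read ahead and recycle the pages behind us early. */
        HANDLE h = CreateFileA("integrals.tmp", GENERIC_READ, FILE_SHARE_READ,
                               NULL, OPEN_EXISTING,
                               FILE_FLAG_SEQUENTIAL_SCAN, NULL);
        if (h == INVALID_HANDLE_VALUE) {
            fprintf(stderr, "CreateFileA failed: %lu\n", GetLastError());
            return 1;
        }

        /* Ask the OS to try to keep 64-256 MB of this process resident;
         * moderate values need no special privilege. */
        SetProcessWorkingSetSize(GetCurrentProcess(),
                                 64u * 1024 * 1024, 256u * 1024 * 1024);

        /* ... read and process the file ... */

        CloseHandle(h);
        return 0;
    }

Flags such as FILE_FLAG_NO_BUFFERING or FILE_FLAG_RANDOM_ACCESS give the
application still finer control over how (or whether) its data is cached.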
> This specific issue comes down to this: Windows does not have a true
> unified VM subsystem architecture in which the buffer cache is
> properly integrated into the virtual memory subsystem. It also does
> not have a flexible, self tuning mechanism for managing the tradeoff
> between using memory for executable pages, data pages and file cache
> pages. The result of this is that it does lots of I/O when it doesn't
> need to, drastically hurting performance.
I do not agree with this. Certainly, Windows' caching algorithms differ
from those of Linux or the BSD derivatives, and sure, there are situations
when the Unix-like strategies perform better (or worse). Nevertheless, this
is almost irrelevant to real computational chemistry, where we typically
have one of two scenarios - either huge data files (handled sequentially or
randomly), or very compact data files. The intermediate case is exotic.
Aggressive data caching can (and often does) seriously reduce performance
when working with really large datasets. Just one example - have you ever
tried to do any non-trivial work with a file twice as large as system memory
under Linux (or any other OS)? If not, just try it and you will most likely
find that the default system strategy is not perfect.
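If you want to try, the sketch below (mine, purely illustrative - the file
name and chunk size are arbitrary) streams through such a file in C. Without
the posix_fadvise() hints, available on 2.6-era Linux, the page cache tends
to fill with data that will never be reused, at the expense of the
application's own pages; with them, the application itself tells the kernel
what it can throw away.

    /* Sketch: stream a scratch file much larger than RAM without letting
     * it flood the page cache. Assumes a 2.6-era Linux/glibc with
     * posix_fadvise(); the file name and chunk size are made up. */
    #define _XOPEN_SOURCE 600
    #include <sys/types.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    int main(void)
    {
        static char buf[1 << 20];   /* 1 MB chunks */
        off_t done = 0;
        ssize_t n;
        int fd;

        fd = open("huge_scratch.dat", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        /* Declare the access pattern so the kernel reads ahead. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

        while ((n = read(fd, buf, sizeof buf)) > 0) {
            /* ... process the chunk ... */
            done += n;
            /* Drop the pages already consumed so they do not push the
             * application's working set out of memory. */
            posix_fadvise(fd, 0, done, POSIX_FADV_DONTNEED);
        }

        close(fd);
        return 0;
    }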
> You can argue all you like about how much nicer Windows is. Perhaps it
> is. That is subjective. The OBJECTIVE benchmarks, however, show that
> it is trivial to drive a Windows box out of page cache and make it
> stall.
This is true for some aggressive, badly-written applications.
On the other hand, it is just as trivial to put any 2.4.x Linux kernel into
the same state. Do you know that there are many cases where swapoff -a is
the only way to run the calculations without thrashing your HDDs and while
keeping the wait for results finite?
It is less trivial to do this with 2.6.x, but it is still possible.
Once again, you should simply accept that Windows is quite a different
world, with somewhat different rules for writing efficient programs. The
point is that it is typically a very good idea to let the programmer control
how the program is executed, how its files are cached, and so on. This is
what Windows allows.
> > If one would be really interested in Windows vs. Linux performance,
> > it is a good idea to use PC GAMESS as the benchmark
>
> No, actually, that isn't a particularly good benchmark. Proper
> benchmarking requires that you use a variety of tasks that exercise a
> variety of operating conditions. A single program can never be a
> good benchmark. In a computational chemistry context, a variety of
> loads are needed to properly assess the differences between the two
> systems.
See above.
>
> > It is not a problem at all to create an input file which will bring
> > the Linux memory & I/O subsystems down.
>
> That actually is false. If you have conditions in which the default
> settings of the Linux algorithms are working incorrectly, then by a
> simple tuning process you can alter the tradeoff between executable,
> data and file pages and optimize your performance.
What you forgot to say here is that a) you must be root, and b) the changes
made for any particular program have a system-wide effect. Don't you believe
that an API for this (such as Windows has provided for years), usable by any
non-privileged user, would be of great help here?
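To make points a) and b) concrete: on a 2.6 kernel the relevant knobs live
under /proc/sys/vm, vm.swappiness being one example. The tiny C sketch below
(purely illustrative) simply fails for a non-privileged user, and when run
as root it changes the behaviour for every process on the machine at once,
not only for the program you actually care about.

    /* Sketch: "tuning" the Linux 2.6 VM is a system-wide, root-only act.
     * /proc/sys/vm/swappiness is just one example of such a knob. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/sys/vm/swappiness", "w");
        if (!f) {
            /* This is what a non-privileged user will normally see. */
            perror("cannot tune vm.swappiness");
            return 1;
        }
        fprintf(f, "10\n");  /* prefer application pages over file cache */
        fclose(f);
        return 0;
    }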
> It may be true that if you don't know what you are doing, you can't
> make a Linux box perform properly, but at least there is stuff you can
> do if you know what you're doing. Even if you know what you are doing,
> there is little you can do to tune Windows properly.
Well, there is not much need to tune Windows, because there is much I can
do inside the program itself to make it a high-performance Windows program.
Nevertheless, there are some things I can do system-wide with Windows,
although I agree that you have more options under Linux.
> > The same is true for Windows to some degree. I personally am not
> > aware of any (and doubt if it is possible at all) good
> > implementation of memory management in any OS for the case of
> > simultaneous heavy I/O and memory load.
>
> Then you aren't paying close attention to the work that people have
> done in the last 20 years in operating system design.
Then where are these operating systems that make QC codes fly?
> > Nevertheless, Windows has much more advanced memory and I/O API than
> > Linux
> Oh, really? Can you explain, then, why it is that Unix systems easily
> beat Windows on high performance network I/O,
> If Windows is so fast, one might ask why it is that the record for
> fastest TCP transmission rates is not held by Windows hardware, and
> why researchers on networking performance rarely do their work under
> Windows.
You can look at http://www.tpc.org/tpcc/results/tpcc_perf_results.asp
> why there is no ability to tune the Windows page cache
as there is no need to do this system-wide :-)
> why Windows is so much worse at context switches, etc?
Oh, really? Would you like to discuss the threading API, its implementation,
and its performance under Linux vs. Windows?
> Sure, Cutler stole a lot of VMS code to build NT. If you assume VMS
> nearly 20 years ago was the best model of how to do I/O on earth, I'm
> sure you're a Windows fan. The benchmarks don't agree.
VMS was a really good OS in many senses - I used to work under VMS...
> This is also why gcc, which is far larger than ANY computational
> chemistry package ever built and far more complicated, doesn't exist.
It seems you are simply not familiar with QC codes. A good QC package is
_at least_ as complex as gcc, in terms of its nontrivial algorithms, the
size of its sources, the know-how involved, etc.
> Unfortunately, a lot of this stuff is hard. The problem is that issues
> like virtual memory subsystem architecture aren't any more easily
> understood by non-specialist than an ab initio computational chemistry
> system is easily understood by the non-specialist. Really
> understanding the issues requires that you have an operating systems
> class under your belt, or the equivalent.
Is the statement above ethical? Should we assume you are the only person
around who knows both CC and OS kernel programming? Should your opinion be
the only law for us?
Best regards,
Alex Granovsky