
From:  "Don H. Phillips" <virtual -8 at 8- quantum.larc.nasa.gov>
Date:  Thu, 1 Dec 1994 09:42:54 -0500
Subject:  Pentium Bug


In considering and discussing the possibilities of serious errors
arising from the Pentium division bug, it is important to be clear
about absolute and relative errors, the nature of the bug itself,
and the nature of the calculation.

About the bug:

1.  It bites infrequently, once every several billion divides if the
    numbers involved in the calculation are chosen by a truly random
    process.

2.  The error (bite) is usually (relatively) small but can be (relatively)
    large.  The largest relative error reported thus far was about
    6.1*10^-5.  (A quick machine check is sketched just after this list.)
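
For anyone who wants to check a particular machine, here is a minimal
C sketch of the widely circulated test division (the constants 4195835
and 3145727 are the commonly quoted reproducer, not something I derived
myself):

    #include <stdio.h>

    int main(void)
    {
        /* volatile keeps a compiler from folding the division at
           compile time on a correct host.  On a flawed Pentium the
           quotient is reportedly wrong in the 5th significant digit,
           so the residual below comes out near 256 rather than at
           rounding level (~1e-9).                                   */
        volatile double x = 4195835.0;
        volatile double y = 3145727.0;
        double residual = x - (x / y) * y;

        printf("residual = %g (rounding-level expected; ~256 on a "
               "flawed FPU)\n", residual);
        return 0;
    }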

About ab initio calculations:

1.  Very large quantities of numbers are calculated and processed, and
    those numbers vary widely in absolute size.

2.  Different types of calculations (CI, MP, analytical or numerical energy
    derivative, SCF, etc.) involve different types of intensive calculations.
    Some calculations approach numerical instability at the precision used
    for the integrals, etc. (approaching *numerical* linear dependence with
    large basis sets, for example; see the small sketch after this list).
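
As a small sketch of what that *numerical* linear dependence means in
practice (the 2x2 overlap matrix and the value s = 0.999999 below are
arbitrary illustrations, not taken from any real basis set):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        /* Two normalized, nearly parallel basis functions give an
           overlap matrix S = [1 s; s 1] with eigenvalues 1+s and 1-s.
           As s -> 1 the smallest eigenvalue -> 0 and the basis
           becomes numerically linearly dependent.                   */
        double s = 0.999999;      /* arbitrary, made-up overlap      */
        double eig_min = 1.0 - s;
        double eig_max = 1.0 + s;

        printf("condition number of S: %g\n", eig_max / eig_min);
        printf("rough error amplification (1/sqrt of smallest "
               "eigenvalue): ~%g\n", 1.0 / sqrt(eig_min));
        return 0;
    }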

Some of the discussion on this list has focused on the large number of
quantities (two electron integrals) which have small absolute size.  It
is true that relative errors, even large enough to leave only 4 significant
digits, are not important if the absolute value of the quantity under
consideration is near the numerical threshold of the calculation.

However, even after we discount the operations involving numbers within
a factor of 10^4 of the threshold we are left with a large number of
operations which might hit the bug and *might* lead to a significant error.
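
A rough back-of-the-envelope C sketch of that point (the threshold and
the two magnitudes are representative values chosen for illustration,
not numbers from any particular program):

    #include <stdio.h>

    int main(void)
    {
        double rel_err   = 6.1e-5;   /* worst reported FDIV error      */
        double threshold = 1.0e-12;  /* representative integral cutoff */

        /* A tiny two-electron integral vs. a much larger quantity.    */
        double small_value = 1.0e-10;
        double large_value = 1.0e+2;

        /* 6.1e-15: far below the cutoff, harmless.                    */
        printf("absolute error on %g: %g (threshold %g)\n",
               small_value, rel_err * small_value, threshold);
        /* 6.1e-3: large enough to matter.                             */
        printf("absolute error on %g: %g (threshold %g)\n",
               large_value, rel_err * large_value, threshold);
        return 0;
    }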

The analogy of the computations to weighing a captain by weighing the
ship and captain together and subtracting the weight of the ship is a
little oversimplified for this discussion, unless you consider that one
of the two weights may be off by a factor of (1 - 1*10^-4), i.e. by a
relative error of up to 10^-4.  (How many *TONS* did you say the
captain weighs?)

To spruce up the analogy you might consider a ship composed of nearly
equal amounts of positive and negative mass with the NET mass of
a supertanker.  Instead of weighing the ship, you must imagine cutting
up the ship two different ways and weighing the parts, including the
filings resulting from the cutting, independently. Include the captain
in one of the weighings.  The (signed) weights are summed twice, once
using the weights including the captain and once using the weights
without the captain.  If some fraction of the weights of the filings
are off by 10^-4 or less, the two sums will still have sufficient
accuracy for the difference to be a good approximation to the captain's
weight, but if the value for just one of the large pieces is significantly
in error, all bets are off.
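
In floating-point terms the refined analogy is just cancellation
between large signed sums.  A toy C sketch with made-up numbers
(pieces of several times 10^8 tons, a 0.1 ton captain, and a 10^-4
error on one piece; none of these values come from a real calculation):

    #include <stdio.h>

    int main(void)
    {
        double net_ship = 5.0e5;   /* net mass of the "supertanker"   */
        double captain  = 0.1;     /* what we are trying to measure   */

        /* First cutting: two big signed pieces, captain included.    */
        double sum_with    = 5.0e8 + (net_ship - 5.0e8) + captain;

        /* Second cutting: different pieces, no captain, but one
           large piece is mis-weighed by a 1e-4 relative error.       */
        double bad_piece   = 7.0e8 * (1.0 - 1.0e-4);
        double sum_without = bad_piece + (net_ship - 7.0e8);

        /* Prints roughly 70000 tons instead of 0.1.                  */
        printf("estimated captain: %g tons (true value %g)\n",
               sum_with - sum_without, captain);
        return 0;
    }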

A few years back, I saw reports that indicated that about 20% of the
cycles at the NSF supercomputer sites were being used for calculations
on molecules and solids.  Today, we most certainly have a much larger
volume of calculations being carried out on small (but powerful)
computers.  Some of the calculations take days, weeks, and even months
of cpu time.  It is inevitable that if enough calculations are being
carried out on Pentium cpus, some of them will contain significant errors.
The fact that one or more particular sets of calculations of particular
types were carried out without detecting an error, or only detecting
inconsequential errors, has limited relevance to the issue.  In
order to consider this error inconsequential, we would have to know
that the probability of an error of a given size is on the order of
that for an error of the same size occurring due to multiple-bit memory
errors.  I don't believe that the available evidence supports such a
conclusion at this time.
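
A crude expected-value estimate along these lines (the divide rate,
job length, and per-divide hit probability below are purely
illustrative assumptions on my part):

    #include <stdio.h>

    int main(void)
    {
        double divides_per_sec = 1.0e6;          /* assumed sustained
                                                    FP divide rate    */
        double run_seconds     = 30.0 * 86400.0; /* a month-long job  */
        double hit_probability = 1.0 / 9.0e9;    /* quoted rate for
                                                    random operands   */

        double total_divides = divides_per_sec * run_seconds;
        double expected_hits = total_divides * hit_probability;

        /* About 2.6e12 divides and roughly 290 erroneous quotients;
           whether any of them matters depends on where they land.   */
        printf("divides in run: %g, expected bad quotients: %g\n",
               total_divides, expected_hits);
        return 0;
    }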

I know of three cases in which powerful computers, operating in production
environments, had broken hardware producing bad values for weeks or months.
These cases involved a hot bit in a disk buffer on a Cray, a broken
vector multiplier on a Cray, and a broken vector adder on a Convex.  In
each case only a couple of users realized anything was wrong. Getting
the machine out of production for repair required that a user *prove*
that the machine had a problem.  In each case, dozens or hundreds of
persons were getting questionable results from the machines
without realizing it.  If Intel continues to require that users prove
that their applications have a problem with the Pentium divide error
in order to obtain a "repair", there are going to be a lot of machines
producing questionable results for a long time.

Don

