From: Quadibloc on
Some time ago, I believe it was on this newsgroup, someone mentioned
the forthcoming ATI FireStream 9170 floating-point coprocessor card.

Although I visit the Tom's Hardware site occasionally, I missed this
article until I saw a link to it on HPCwire:

http://www.tgdaily.com/content/view/39348/135/1/1/

This is the second page of the article; it compares other coprocessor
products available now.

Two items, in the $500 range, are high-end video cards, one from AMD/
ATI and the other from Nvidia. They run the same software from these
companies as the dedicated coprocessor cards do. According to the
article, the Nvidia Tesla coprocessor card uses basically the same
components as the video card, but costs twice as much because it's a
less mass-market item; omitting the video components lets the card
run at a slightly higher speed.

This sounded so peachy keen that I wanted to run out and buy one
*right now*!

But the trouble is that the video cards from both companies have been
tested, and found to have an error rate of around 1%. Since these
errors are cited as being due to a lack of ECC RAM in the devices,
this would seem to mean that they are mostly severe errors, rather
than the normal results of arithmetic that is not fully IEEE 754
compliant. If the only difference in the Tesla is that the graphics
part is left out, that makes me think it is likely to have the same
problem.

Maybe the delay in the FireStream part is due to an attempt to deal
with this. For now, it is the other, much more expensive items listed
that don't have this problem. And, of course, a 1% error rate will
turn any calculation that depends on trillions of floating-point
operations being carried out without error into mush.

It looks like, despite all the improvements that have been made in
microprocessors, the old adage "You get what you pay for" is still in
force - even if the price of quality has come 'way down, there are
still some bargains on offer that _are_ too good to be true.

John Savard
From: Guy Macon on



Quadibloc wrote:

>But the trouble is that the video cards from both companies
>have been tested, and found to have an error rate of around 1%.

"an error rate of around 1%" sounds like 1 out of 100 operations
returns an error, but the article says "...compute failure rates
are around 1%. This is based on a sampling of approximately 500
GPGPU users on Folding@Home. A study carried out showed that
approximately 1% of the computations carried out by Nvidia
resulted in some form of failed processing.", which implies
that 1 out of 100 Folding@Home work units, each running billions
of cycles, fails due to an error that may or may not be
hardware-related.
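
To put rough numbers on the difference between the two readings,
here is a minimal Python sketch. It assumes errors are independent
from one operation to the next, and the figure of 1e12 operations
per work unit is purely a hypothetical stand-in for "billions of
cycles"; neither assumption comes from the article.

```python
import math

def per_op_failure_rate(unit_failure_rate, ops_per_unit):
    """Per-operation error rate p implied by a per-work-unit failure rate.

    Solves 1 - (1 - p)**ops_per_unit == unit_failure_rate for p,
    assuming independent errors (an assumption, not a measured fact).
    log1p/expm1 keep precision when p is tiny.
    """
    return -math.expm1(math.log1p(-unit_failure_rate) / ops_per_unit)

def unit_success_probability(per_op_rate, ops_per_unit):
    """Chance that a whole work unit completes with no error at all."""
    return math.exp(ops_per_unit * math.log1p(-per_op_rate))

OPS_PER_UNIT = 1e12  # hypothetical size of one work unit

# Reading 1: 1% of work units fail -> a very small per-operation rate.
implied_rate = per_op_failure_rate(0.01, OPS_PER_UNIT)

# Reading 2: 1% of operations fail -> essentially no unit ever succeeds.
success_if_per_op = unit_success_probability(0.01, OPS_PER_UNIT)

print(f"implied per-operation rate: {implied_rate:.3e}")
print(f"unit success chance at 1% per operation: {success_if_per_op}")
```

Under these assumptions the per-unit reading corresponds to roughly
one bad operation in 1e14, which is a very different proposition
from one in a hundred.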

The Folding@Home FAQ on NVIDIA hardware says this:

"'The client was working, but now all I'm getting was Early Unit
Ends (EUE's). How can I fix this?' We've seen cases where playing
GPU intensive games can leave the GPU in a weird state, leading
to consistent EUE's (Early Unit End error messages). Restarting
the computer has worked to resolve this problem. We are looking
into a better solution.

"'My client gives an UNSTABLE_MACHINE error and is going to sleep
for 24 hours! What should I do?' This occurs when 5 EUE's occur.
Rapidly EUE-ing machines are a sign that the client needs some
donor intervention to fix it. Please check out the FAQ below
as well as forum (http://foldingforum.org) for details about
how to fix a misconfigured client. This error typically
results from a problem with drivers. Please see the instructions
above for which drivers you should use for your hardware.
Unfortunately, we cannot give more information from the client,
since all the client knows is that it can't run CUDA and there's
lots of reasons why (and there's currently no way for the core
to detect them)."

Source: http://folding.stanford.edu/English/FAQ-NVIDIA

That sounds a lot like the 1% is a count of all Folding@Home
Early Unit Ends, including those caused by driver problems.
I would like to see the results of actual diagnostic tests
of the hardware alone.

>Since these errors are cited as being due to a lack of
>ECC RAM in the devices,

Total speculation on the part of the author. He has zero
evidence that whatever is causing those Early Unit Ends
is even in the hardware, much less narrowing it down to
the memory subsystem.


--
Guy Macon
<http://www.GuyMacon.com/>

From: Nick Maclaren on

In article <LKGdnXp7BYye5E_VRVn_vwA(a)giganews.com>,
Guy Macon <http://www.GuyMacon.com/> writes:
|>
|> >Since these errors are cited as being due to a lack of
|> >ECC RAM in the devices,
|>
|> Total speculation on the part of the author. He has zero
|> evidence that whatever is causing those Early Unit Ends
|> is even in the hardware, much less narrowing it down to
|> the memory subsystem.

Yes. I was talking about precisely this issue to someone from
SiCortex yesterday. They use ECC everywhere, which is good, but
I pointed out that neutron flux is NOT a problem in the UK. Not
merely are we at sea level, we are at 50+ degrees of latitude,
with a notoriously humid atmosphere. Dammit, we have 6 months
when there isn't even enough ultraviolet to maintain adequate
vitamin D levels even in pale skinned people ....

I pointed out that a lot of users had claimed that explanation
for occasional non-repeatable errors, because they had been told
it by the Los Alamos people. Well, it MAY be true there (though I
am not convinced), but assuredly wasn't with us. I told them that
it was almost certainly software, probably race conditions; and,
in one case, I got proof that it was the race condition that I keep
banging on about in the FLIH.

If you have ECC and log all errors, neutron flux will show up as
single-bit errors; and double- and triple-bit ones will bear the
appropriate Poisson relationship. Right? Well, that's precisely
what we WEREN'T seeing.
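
As a rough illustration of the check described above, here is a
Python sketch of what the ECC logs "should" look like if bit flips
were independent random hits. The function names, the notion of a
"word-interval" (one memory word observed over one scrub interval)
and the example counts are all hypothetical; the model also ignores
the possibility of one particle upsetting several adjacent bits.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the mean is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def expected_error_counts(single_bit_errors, word_intervals):
    """Expected 1-, 2- and 3-bit error counts under a Poisson model.

    lam (mean flips per word-interval) is estimated from the logged
    single-bit errors, using P(1) ~= lam, which holds for small lam.
    """
    lam = single_bit_errors / word_intervals
    return {k: word_intervals * poisson_pmf(k, lam) for k in (1, 2, 3)}

# Hypothetical log: 1000 single-bit errors over 1e12 word-intervals.
counts = expected_error_counts(1000, 1e12)
for k, n in counts.items():
    print(f"{k}-bit errors expected: {n:.3g}")
```

With numbers like these the model predicts essentially no double- or
triple-bit errors, so a log with a noticeable share of multi-bit
errors relative to single-bit ones does not fit the independent-hit
explanation.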

We KNOW that the ISA and memory hardware/software boundary is solid
with race conditions, because the interface isn't adequate to do
the job properly. It wouldn't surprise me at all if the same were
true of the GPU interface. Is that the reason? Dunno.


Regards,
Nick Maclaren.
From: Quadibloc on
On Sep 18, 9:51 am, Guy Macon <http://www.GuyMacon.com/> wrote:
> which implies
> that 1 out of 100 Folding@Home work units, each running billions
> of cycles, fails

Ah. I must not have read the article carefully enough then. This is
still somewhat worrisome, because even a small (but not totally
insignificant) rate of error still requires overhead to test for, but
at least now one can think in terms of checking whole computations,
not individual operations.
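
Checking a whole computation could be as simple as running it more
than once and accepting a result only when two runs agree. The
Python sketch below is one way to do that, not anything the
Folding@Home client is known to do; `run_checked` and `compute` are
hypothetical names, and it assumes the computation is deterministic
enough for results to be compared with `==`.

```python
def run_checked(compute, max_attempts=3):
    """Call compute() until two consecutive runs return equal results.

    Raises RuntimeError if no two consecutive runs agree within
    max_attempts calls. This costs at least twice the work of one run.
    """
    previous = compute()
    for _ in range(max_attempts - 1):
        current = compute()
        if current == previous:
            return current
        previous = current
    raise RuntimeError("no two consecutive runs agreed")


# Usage: a stand-in computation whose first run is corrupted.
results = iter([41, 42, 42])
print(run_checked(lambda: next(results)))
```

The overhead is the price mentioned above: every work unit is done
at least twice, but no per-operation checking is needed.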

For most floating-point applications, the weaknesses of floating-point
in the "bad old days" before IEEE 754, on the other hand, were
entirely reasonable - and if Folding@Home had been converted from code
*requiring* IEEE 754 compliance, there's even a chance that this is in
the "no problem" category.

John Savard
From: Nick Maclaren on

In article <0c5dbd55-2884-430e-937c-6041be8f8a82(a)i76g2000hsf.googlegroups.com>,
Quadibloc <jsavard(a)ecn.ab.ca> writes:
|> On Sep 18, 9:51 am, Guy Macon <http://www.GuyMacon.com/> wrote:
|> > which implies
|> > that 1 out of 100 Folding@Home work units, each running billions
|> > of cycles, fails
|>
|> Ah. I must not have read the article carefully enough then. This is
|> still somewhat worrisome, because even a small (but not totally
|> insignificant) rate of error still requires overhead to test for, but
|> at least now one can think in terms of checking whole computations,
|> not individual operations.
|>
|> For most floating-point applications, the weaknesses of floating-point
|> in the "bad old days" before IEEE 754, on the other hand, were
|> entirely reasonable - and if Folding@Home had been converted from code
|> *requiring* IEEE 754 compliance, there's even a chance that this is in
|> the "no problem" category.

Eh? If you mean by that what I think you mean by that, you are quite
seriously wrong. Could you explain?


Regards,
Nick Maclaren.