From: Quadibloc on 18 Sep 2008 07:25

Some time ago, I believe it was on this newsgroup, someone mentioned the forthcoming ATI FireStream 9170 floating-point coprocessor card. Although I visit the Tom's Hardware site occasionally, I missed this article until I saw a link to it on HPCwire:

http://www.tgdaily.com/content/view/39348/135/1/1/

This is the second page of the article; it compares other coprocessor products available now. Two items, in the $500 range, are high-end video cards, one from AMD/ATI and the other from Nvidia. They run the same software from these companies as the dedicated coprocessor cards do.

According to the article, the Nvidia Tesla coprocessor card basically uses the same components as the video card, but costs twice as much because it's a less mass-market item - omitting the video components lets the card run at a slightly higher speed.

This sounded so peachy keen that I wanted to run out and buy one *right now*!

But the trouble is that the video cards from both companies have been tested, and found to have an error rate of around 1%. Since these errors are cited as being due to a lack of ECC RAM in the devices, this would seem to mean that they are mostly severe errors, rather than the normal results of arithmetic that is not fully IEEE 754 compliant.

If the only difference in the Tesla is that the graphics part is left out, that makes me think it is likely to have the same problem. Maybe the delay in the FireStream part is due to an attempt to deal with this.

For now, it is the other, much more expensive items listed that don't have this problem. And, of course, a 1% error rate will turn any calculation that depends on trillions of floating-point calculations being carried out without error into mush.

It looks like, despite all the improvements that have been made in microprocessors, the old adage "You get what you pay for" is still in force - even if the price of quality has come way down, there are still some bargains on offer that _are_ too good to be true.

John Savard
From: Guy Macon on 18 Sep 2008 11:51

Quadibloc wrote:

>But the trouble is that the video cards from both companies
>have been tested, and found to have an error rate of around 1%.

"an error rate of around 1%" sounds like 1 out of 100 operations returns an error, but the article says "...compute failure rates are around 1%. This is based on a sampling of approximately 500 GPGPU users on Folding@Home. A study carried out showed that approximately 1% of the computations carried out by Nvidia resulted in some form of failed processing.", which implies that 1 out of 100 Folding@Home work units, each running billions of cycles, fails due to an error that may or may not be hardware-related.

The Folding@Home FAQ on NVIDIA hardware says this:

"'The client was working, but now all I'm getting was Early Unit Ends (EUE's). How can I fix this?'

We've seen cases where playing GPU intensive games can leave the GPU in a weird state, leading to consistent EUE's (Early Unit End error messages). Restarting the computer has worked to resolve this problem. We are looking into a better solution.

"'My client gives an UNSTABLE_MACHINE error and is going to sleep for 24 hours! What should I do?'

This occurs when 5 EUE's occur. Rapidly EUE-ing machines are a sign that the client needs some donor intervention to fix it. Please check out the FAQ below as well as forum (http://foldingforum.org) for details about how to fix a misconfigured client. This error typically results from a problem with drivers. Please see the instructions above for which drivers you should use for your hardware. Unfortunately, we cannot give more information from the client, since all the client knows is that it can't run CUDA and there's lots of reasons why (and there's currently no way for the core to detect them)."

Source: http://folding.stanford.edu/English/FAQ-NVIDIA

That sounds a lot like the 1% is a count of all cases of Folding@Home Early Unit Ends -- including driver problems. I would like to see the results of actual diagnostic tests of the hardware alone.

>Since these errors are cited as being due to a lack of
>ECC RAM in the devices,

Total speculation on the part of the author. He has zero evidence that whatever is causing those Early Unit Ends is even in the hardware, much less narrowing it down to the memory subsystem.

--
Guy Macon <http://www.GuyMacon.com/>
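
The gap between the two readings of "1%" can be made concrete with a little arithmetic. The following is a minimal sketch in Python, not anything taken from the article or from Folding@Home. It assumes that errors are independent from one operation to the next, and it takes 1e9 operations per work unit as a stand-in for "billions of cycles"; both are assumptions made for illustration, as are the function names.

```python
import math


def p_all_ok(per_op_error: float, n_ops: float) -> float:
    """Probability that n_ops independent operations all succeed."""
    # log1p/exp keeps precision when per_op_error is very small.
    return math.exp(n_ops * math.log1p(-per_op_error))


def per_op_error_from_unit_failure(unit_failure: float, n_ops: float) -> float:
    """Per-operation error rate implied by a per-work-unit failure rate."""
    # Invert (1 - p)^n = 1 - unit_failure for p, again via log1p/expm1.
    return -math.expm1(math.log1p(-unit_failure) / n_ops)


# Assumed figure: the thread only says "billions of cycles" per work unit.
OPS_PER_UNIT = 1e9

# Reading 1: 1% of individual operations fail.
survive_1000 = p_all_ok(0.01, 1000)

# Reading 2: 1% of whole work units fail.
implied_rate = per_op_error_from_unit_failure(0.01, OPS_PER_UNIT)

print(f"P(1000 ops all correct at 1% per-op error) = {survive_1000:.2e}")
print(f"per-op error rate implied by 1% unit failures = {implied_rate:.2e}")
```

Under these assumptions the first reading leaves essentially no chance of finishing even a thousand operations cleanly, while the second corresponds to a per-operation rate of roughly one in a hundred billion. The second figure is an upper bound on hardware faults, since the unit failures being counted may include driver and configuration problems.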
From: Nick Maclaren on 18 Sep 2008 12:34

In article <LKGdnXp7BYye5E_VRVn_vwA@giganews.com>, Guy Macon <http://www.GuyMacon.com/> writes:
|>
|> >Since these errors are cited as being due to a lack of
|> >ECC RAM in the devices,
|>
|> Total speculation on the part of the author. He has zero
|> evidence that whatever is causing those Early Unit Ends
|> is even in the hardware, much less narrowing it down to
|> the memory subsystem.

Yes. I was talking about precisely this issue to someone from SiCortex yesterday. They use ECC everywhere, which is good, but I pointed out that neutron flux is NOT a problem in the UK. Not merely are we at sea level, we are at 50+ degrees of latitude, with a notoriously humid atmosphere. Dammit, we have 6 months when there isn't even enough ultraviolet to maintain adequate vitamin D levels even in pale skinned people ....

I pointed out that a lot of users had claimed that explanation for occasional non-repeatable errors, because they had been told it by the Los Alamos people. Well, it MAY be true there (though I am not convinced), but assuredly wasn't with us. I told them that it was almost certainly software, probably race conditions; and, in one case, I got proof that it was the race condition that I keep banging on about in the FLIH.

If you have ECC and log all errors, neutron flux will show up as single-bit errors; and double- and triple-bit ones will bear the appropriate Poisson relationship. Right? Well, that's precisely what we WEREN'T seeing.

We KNOW that the ISA and memory hardware/software boundary is solid with race conditions, because the interface isn't adequate to do the job properly. It wouldn't surprise me at all if the same were true of the GPU interface.

Is that the reason? Dunno.

Regards,
Nick Maclaren.
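
One way to read the "Poisson relationship" mentioned above is sketched below in Python. It assumes that bit flips within a single ECC word over one logging interval are independent events with a small mean `lam`, so that the number of flips per word is Poisson distributed. That model, the value of `lam`, and the function names are all assumptions for illustration; real particle strikes can upset adjacent bits together, so this is a simplification and not a description of any actual ECC log.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(exactly k events) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def expected_multi_bit(n_single: float, lam: float) -> tuple[float, float]:
    """Expected double- and triple-bit error counts, given the number of
    single-bit errors logged and an assumed per-word mean flip count lam.

    Uses P(2)/P(1) = lam/2 and P(3)/P(1) = lam**2/6.
    """
    double = n_single * lam / 2
    triple = n_single * lam ** 2 / 6
    return double, triple


# Hypothetical numbers: a million logged single-bit errors, lam = 1e-6.
lam = 1e-6
n_single = 1_000_000
double, triple = expected_multi_bit(n_single, lam)
print(f"expected double-bit errors: {double:.3g}")
print(f"expected triple-bit errors: {triple:.3g}")
```

Under this model multi-bit errors should be rarer than single-bit ones by a factor of order `lam` per extra bit. A log in which double- and triple-bit errors turn up far more often than that ratio allows would point away from independent random upsets and toward some other cause.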
From: Quadibloc on 18 Sep 2008 12:52

On Sep 18, 9:51 am, Guy Macon <http://www.GuyMacon.com/> wrote:

> which implies
> that 1 out of 100 Folding@Home work units, each running billions
> of cycles, fails

Ah. I must not have read the article carefully enough then. This is still somewhat worrisome, because even a small (but not totally insignificant) rate of error still requires overhead to test for, but at least now one can think in terms of checking whole computations, not individual operations.

On the other hand, for most floating-point applications, the weaknesses of floating-point in the "bad old days" before IEEE 754 were entirely reasonable - and if Folding@Home had been converted from code *requiring* IEEE 754 compliance, there's even a chance that this is in the "no problem" category.

John Savard
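
Checking whole computations instead of individual operations can be as simple as running each work unit twice and accepting the result only when the two runs agree. The Python sketch below illustrates that idea under stated assumptions: `run_checked` and `flaky_work_unit` are hypothetical names, the fault is injected deliberately on the second call, and nothing here reflects how Folding@Home actually validates results. Duplicate execution only catches errors that do not repeat; a deterministic bug would produce the same wrong answer twice.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_checked(
    compute: Callable[[], T],
    agree: Callable[[T, T], bool],
    max_attempts: int = 3,
) -> T:
    """Run compute() twice per attempt and return the result only when
    both runs agree; give up after max_attempts disagreeing pairs."""
    for _ in range(max_attempts):
        first = compute()
        second = compute()
        if agree(first, second):
            return first
    raise RuntimeError("duplicate runs never agreed")


# Hypothetical work unit with one injected transient fault (on call 2).
state = {"calls": 0}


def flaky_work_unit() -> int:
    state["calls"] += 1
    return 41 if state["calls"] == 2 else 42


result = run_checked(flaky_work_unit, lambda a, b: a == b)
print(result, state["calls"])  # prints "42 4"
```

The price is the overhead mentioned in the post: every work unit costs at least twice as much, plus a full rerun whenever a pair disagrees. For floating-point results, `agree` would normally compare within a tolerance instead of testing exact equality.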
From: Nick Maclaren on 18 Sep 2008 13:27

In article <0c5dbd55-2884-430e-937c-6041be8f8a82@i76g2000hsf.googlegroups.com>, Quadibloc <jsavard@ecn.ab.ca> writes:
|> On Sep 18, 9:51 am, Guy Macon <http://www.GuyMacon.com/> wrote:
|> > which implies
|> > that 1 out of 100 Folding@Home work units, each running billions
|> > of cycles, fails
|>
|> Ah. I must not have read the article carefully enough then. This is
|> still somewhat worrisome, because even a small (but not totally
|> insignificant) rate of error still requires overhead to test for, but
|> at least now one can think in terms of checking whole computations,
|> not individual operations.
|>
|> For most floating-point applications, the weaknesses of floating-point
|> in the "bad old days" before IEEE 754, on the other hand, were
|> entirely reasonable - and if Folding@Home had been converted from code
|> *requiring* IEEE 754 compliance, there's even a chance that this is in
|> the "no problem" category.

Eh? If you mean by that what I think you mean by that, you are quite seriously wrong. Could you explain?

Regards,
Nick Maclaren.