From: Terje Mathisen on 29 Mar 2008 10:30

Nick Maclaren wrote:
> In article <c%eHj.28257$5i5.22175(a)newsfe6-gui.ntli.net>,
> "Wilco Dijkstra" <Wilco_dot_Dijkstra(a)ntlworld.com> writes:
> |> "Terje Mathisen" <terje.mathisen(a)hda.hydro.com> wrote in
> |> message news:mfqdnY6m6PHh7nDanZ2dnUVZ_hadnZ2d(a)giganews.com...
> |>
> |> > I'm afraid that many cpus have been designed with (a) measured on SPECfp only, and as long as none of the SPEC
> |> > benchmarks generate significant number of denormals, they will stay unimportant.
> |>
> |> The question is, are there really important HPC codes that use denormals a
> |> lot? I don't know... If there were you'd expect them to be added to SPEC.
>
> You shouldn't confuse using denormals with generating them - as Greg
> Lindahl's and my posts say, most generated denormals are mere noise.
>
> Also, because of this problem, HPC codes tend to get rewritten to
> hack around the denormal problem. You would need to get a code that
> had not yet been hacked (or remove the hacks and retest it).

A friend of mine told me about an important sw codec which on one particular architecture really trashed when faced with "silence", i.e. close to but not quite zero amplitude.

This was of course due to sw denormal traps. The workaround was ugly but effective: every N iterations, add a small amount to each array element, with "small" chosen so as to be well below the (probably 16-bit) output floor, but large enough to make sure that denorms wouldn't happen.

> I date from the days when HPC codes were also hacked to get around the
> problem of slow integer to floating-point conversion. Exactly the
> same arguments were made about that, as were made then and are still
> made about underflow! But the CPUs fixed it, and you don't see those
> hacks in new code (or very rarely).

Oh but you still see a lot of fp->int hacks, even in brand new code!

When you need scaling as well as fp->int conversion (i.e. fixed-point math) the hacks are usually still the fastest method: add a magic double value, selected so as to leave the properly scaled integer in the lower half of the result.

Terje
--
- <Terje.Mathisen(a)hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
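For readers who have not met the magic-double trick mentioned above, here is a minimal sketch in C. It is an illustration under stated assumptions, not code from the post: it assumes IEEE 754 binary64 doubles, the default round-to-nearest mode, arithmetic that is really carried out in double precision (no x87 extended intermediates, no fast-math options), and a 16.16 fixed-point target. The names `FIX16_MAGIC` and `double_to_fix16` are made up for the example.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical constant for a 16.16 fixed-point target: 1.5 * 2^36.
 * Adding it forces the sum into [2^36, 2^37), where one unit in the last
 * place of a double is exactly 2^-16, so the addition itself performs
 * both the scaling by 2^16 and the rounding to an integer. */
static const double FIX16_MAGIC = 103079215104.0;

/* Convert x to 16.16 fixed point. Only valid for roughly
 * -32768 <= x < 32768; outside that range the low 32 bits wrap. */
static int32_t double_to_fix16(double x)
{
    double biased = x + FIX16_MAGIC;
    uint64_t bits;
    uint32_t low;
    int32_t result;

    memcpy(&bits, &biased, sizeof bits);   /* reinterpret the double's bits */
    low = (uint32_t)bits;                  /* lower half of the result */
    memcpy(&result, &low, sizeof result);  /* two's complement view */
    return result;
}
```

With these assumptions, `double_to_fix16(1.0)` yields 65536 and `double_to_fix16(-1.0)` yields -65536. A different number of fraction bits F would need the constant 1.5 * 2^(52-F) instead.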
From: Nick Maclaren on 29 Mar 2008 10:52

In article <qeidnU5ouuHiz3PanZ2dnUVZ_gydnZ2d(a)giganews.com>,
Terje Mathisen <terje.mathisen(a)hda.hydro.com> writes:
|>
|> A friend of mine told about an important sw codec which on one
|> particular architecture really trashed when faced with "silence", i.e.
|> close to but not quite zero amplitude:
|>
|> This was of course due to sw denormal traps, the workaround was ugly but
|> effective: Every N iterations add a small amount to each array element,
|> "small" chosen so as to be well below the (probably 16-bit) output
|> floor, but making sure that denorms wouldn't happen.

Been there - done that :-)

|> Oh but you still see a lot of fp->int hacks, even on brand new code!
|>
|> When you need scaling as well as fp->int conversion (i.e. fixed-point
|> math) the hacks are usually still the fastest method:
|>
|> Add a magic double value, selected so as to leave the properly scaled
|> integer in the lower half of the result.

Do they just? Sometimes I think I have got caught in a time warp :-)

Regards,
Nick Maclaren.
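The "add a small amount every N iterations" workaround quoted above can be sketched roughly as follows. This is only an illustration: the bias value, the halving loop that stands in for the codec's feedback path, and all names (`ANTI_DENORMAL_BIAS`, `add_bias`, `decay_with_bias`) are assumptions made for the example, not details of the codec in the story.

```c
#include <float.h>
#include <stddef.h>

/* Hypothetical bias: far below a 16-bit output step (about 3e-5 for
 * samples scaled to [-1, 1)) yet far above FLT_MIN (about 1.2e-38). */
static const float ANTI_DENORMAL_BIAS = 1.0e-20f;

/* True if x is nonzero but smaller in magnitude than the smallest normal. */
static int is_subnormal(float x)
{
    return (x > 0.0f && x < FLT_MIN) || (x < 0.0f && x > -FLT_MIN);
}

/* The workaround itself: nudge every element away from the denormal range. */
static void add_bias(float *state, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        state[i] += ANTI_DENORMAL_BIAS;
}

/* Stand-in for a feedback path fed with silence: every element is halved
 * on each iteration. bias_interval == 0 disables the workaround.
 * Returns how many subnormal values were produced along the way. */
static size_t decay_with_bias(float *state, size_t n,
                              size_t iterations, size_t bias_interval)
{
    size_t subnormals = 0;
    for (size_t it = 1; it <= iterations; ++it) {
        for (size_t i = 0; i < n; ++i) {
            state[i] *= 0.5f;
            subnormals += (size_t)is_subnormal(state[i]);
        }
        if (bias_interval != 0 && it % bias_interval == 0)
            add_bias(state, n);
    }
    return subnormals;
}
```

In this toy setup, running with `bias_interval` set to 16 keeps every value in the normal range, while running with 0 lets the state decay through the denormal range on its way to zero. The interval and bias would have to be chosen per codec so that the state cannot decay from the bias level to FLT_MIN between two bias steps.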
From: Niels Jørgen Kruse on 31 Mar 2008 18:47

Terje Mathisen <terje.mathisen(a)hda.hydro.com> wrote:
> Given that denorm doesn't matter for SPEC, it would seem like a very
> reasonable hw approach would be to start by assuming both values are
> normal, then in parallel check for special values, including zero.
>
> If a denorm is found, move the mantissa into the barrel shifter,
> normalize and stop/flush/restart the pipeline.
>
> Cost: Anywhere from 1-20 cycles?

On the POWER4/PPC970 FPU, the first denormal input causes an 8 cycle stall, so there would be time enough in there to drain the FPU pipeline. Additional denormal inputs add 3 cycles to the stall each. A denormal output likely adds a 2 clock "Massive Cancellation Stall" and a 2 clock "Underflow/Overflow Stall".

I have test code that I can plug particular input values into, if there are some that you suspect may be undocumented exceptions.

The POWER6 FPU has no penalty for denormals.

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark
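As a worked example of the cycle counts quoted above, the small C sketch below adds them up for one instruction. It rests on an assumption the post does not state, namely that the individual stalls are simply additive, and the function name `estimated_stall_cycles` is made up for the illustration. It is a back-of-the-envelope model, not a description of the actual POWER4/PPC970 pipeline.

```c
/* Cycle counts as quoted in the post for the POWER4/PPC970 FPU. */
enum {
    FIRST_DENORMAL_INPUT_STALL = 8,
    EXTRA_DENORMAL_INPUT_STALL = 3,
    MASSIVE_CANCELLATION_STALL = 2,
    UNDERFLOW_OVERFLOW_STALL   = 2
};

/* Estimate the stall for one instruction, ASSUMING the stalls add up.
 * denormal_inputs: number of denormal source operands (0..3 for an FMA).
 * denormal_output: nonzero if the result is denormal. */
static int estimated_stall_cycles(int denormal_inputs, int denormal_output)
{
    int cycles = 0;

    if (denormal_inputs > 0)
        cycles += FIRST_DENORMAL_INPUT_STALL
                + EXTRA_DENORMAL_INPUT_STALL * (denormal_inputs - 1);
    if (denormal_output)
        cycles += MASSIVE_CANCELLATION_STALL + UNDERFLOW_OVERFLOW_STALL;
    return cycles;
}
```

Under that additive assumption, a fused multiply-add with three denormal inputs and a denormal result would cost 8 + 3 + 3 + 2 + 2 = 18 cycles, which sits inside the 1-20 cycle range guessed in the quoted text.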
From: Terje Mathisen on 31 Mar 2008 10:11

Nick Maclaren wrote:
> In article <fsqff4$elk$1(a)s1.news.oleane.net>,
> Jan Vorbrüggen <Jan.Vorbrueggen(a)not-thomson.net> writes:
> |> >
> |> >> No, that's NOT true! The proper analysis uses game theory, and the
> |> >> conclusions depend on (a) the probability of denormals occurring,
> |> >> (b) the slowdown due to such denormals and (c) the cost to the system
> |> >> of such a slowdown.
> |> >
> |> > I'm afraid that many cpus have been designed with (a) measured on SPECfp
> |> > only, and as long as none of the SPEC benchmarks generate significant
> |> > number of denormals, they will stay unimportant.
> |>
> |> I can state for a fact that this isn't true for SPEC CFP2000. Hey, I
> |> wrote the code of the counter-example, so I should know!
>
> Congratulations! I spent a little while looking for a counter-example,
> and the only ones I had off-hand were trivial, revolting or constrained
> by copyright.
>
> |> Note that in many cases, the -fast or equivalent compiler option selects
> |> turning off precise denormal handling and activates flush-to-zero, so
> |> the problem just goes away even for a base SPECfp run.
>
> The IEEE 754R people won't love SPEC for allowing that in the base :-)

Ouch! So instead of fixing the hw to make denorm (nearly) as fast as normal numbers, they allow all SPEC submissions to punt the issue.

> |> > Given that denorm doesn't matter for SPEC, it would seem like a very
> |> > reasonable hw approach would be to start by assuming both values are
> |> > normal, then in parallel check for special values, including zero.
> |> >
> |> > If a denorm is found, move the mantissa into the barrel shifter,
> |> > normalize and stop/flush/restart the pipeline.
> |> >
> |> > Cost: Anywhere from 1-20 cycles?
> |>
> |> Application slowdown: a factor of 20. Yes, you read that right.

Reread what I wrote: The 1-20 cycles are for various levels of hw handling, from the really heroic (a barrel shifter followed by a complete (extra) fp pipeline), to handling it like a branch miss, flushing the pipelines and restarting the instruction, while doing the normalization in the background.

> Yes, but that's implementing it by interrupt! The only system I found
> that DID implement it by a restartable instruction (in hardware) was
> the IBM POWER series, and that lost only a small factor.

x87 with 80-bit extended would seem to (mostly) avoid the issue, as long as the load/store pipes could handle it. Did you see the same levels of slowdown here?

Terje
--
- <Terje.Mathisen(a)hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
From: Nick Maclaren on 31 Mar 2008 10:47

In article <atudnc4VINW7bG3anZ2dnUVZ8t6inZ2d(a)giganews.com>,
Terje Mathisen <terje.mathisen(a)hda.hydro.com> writes:
|>
|> > |> Note that in many cases, the -fast or equivalent compiler option selects
|> > |> turning off precise denormal handling and activates flush-to-zero, so
|> > |> the problem just goes away even for a base SPECfp run.
|> >
|> > The IEEE 754R people won't love SPEC for allowing that in the base :-)
|>
|> Ouch! So instead of fixing the hw to make denorm (nearly) as fast as
|> normal numbers, they allow all SPEC submissions to punt the issue.

Now, there we disagree. Even the very best numerical analysts and architecture experts are divided on whether denormals are a good idea, even in theory. The ones I have worked with mostly think that they aren't, and my experience and analyses agree with them. IEEE 754R permits hard underflow, as I understand it.

|> > Yes, but that's implementing it by interrupt! The only system I found
|> > that DID implement it by a restartable instruction (in hardware) was
|> > the IBM POWER series, and that lost only a small factor.
|>
|> x87 with 80-bit extended would seem to (mostly) avoid the issue, as long
|> as the load/store pipes could handle it. Did you see the same levels of
|> slowdown here?

I didn't test that, as I was sticking to more-or-less standard compiler options. There are many good reasons to regard the Intel extended format as a serious mistake - yes, I know it can be useful, but the chaos it has caused over the years is excessive compared to its benefit.

Regards,
Nick Maclaren.
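To make the flush-to-zero ("hard underflow") behaviour discussed in this exchange concrete, here is a small software emulation in C. It is only a sketch of the semantics: real hardware or a compiler's -fast style option applies the flush inside the FPU, not through a helper like this, and the name `flush_to_zero` is made up for the example. It assumes IEEE 754 binary64 doubles.

```c
#include <float.h>

/* Emulate hard underflow on a single value: anything nonzero whose
 * magnitude is below the smallest normal double (DBL_MIN) is replaced
 * by zero. Normal numbers, zeros, infinities and NaNs pass through
 * unchanged (every comparison with a NaN is false). */
static double flush_to_zero(double x)
{
    if (x != 0.0 && x > -DBL_MIN && x < DBL_MIN)
        return 0.0;
    return x;
}
```

For example, `1.5 * DBL_MIN - DBL_MIN` is exactly `0.5 * DBL_MIN`, a denormal, under gradual underflow; passing it through `flush_to_zero` gives 0.0, which is the result a flush-to-zero machine would deliver for that subtraction.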