From: Terje Mathisen on
Nick Maclaren wrote:
> In article <c%eHj.28257$5i5.22175(a)newsfe6-gui.ntli.net>,
> "Wilco Dijkstra" <Wilco_dot_Dijkstra(a)ntlworld.com> writes:
> |> "Terje Mathisen" <terje.mathisen(a)hda.hydro.com> wrote in
> |> message news:mfqdnY6m6PHh7nDanZ2dnUVZ_hadnZ2d(a)giganews.com...
> |>
> |> > I'm afraid that many cpus have been designed with (a) measured on SPECfp
> |> > only, and as long as none of the SPEC benchmarks generate a significant
> |> > number of denormals, they will stay unimportant.
> |>
> |> The question is, are there really important HPC codes that use denormals a
> |> lot? I don't know... If there were you'd expect them to be added to SPEC.
>
> You shouldn't confuse using denormals with generating them - as Greg
> Lindahl's and my posts say, most generated denormals are mere noise.
>
> Also, because of this problem, HPC codes tend to get rewritten to
> hack around the denormal problem. You would need to get a code that
> had not yet been hacked (or remove the hacks and retest it).

A friend of mine told me about an important sw codec which on one
particular architecture really trashed when faced with "silence", i.e.
close to but not quite zero amplitude.

This was of course due to sw denormal traps. The workaround was ugly but
effective: every N iterations, add a small amount to each array element,
with "small" chosen so as to be well below the (probably 16-bit) output
floor, but large enough to make sure that denorms wouldn't happen.
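
A minimal sketch of that workaround in C, not the codec's actual code. The
decay loop, the names and the guard value 1.0e-18f are all assumptions made
for illustration: the guard only has to sit far below the 16-bit output
floor (about 3e-5 of full scale) and far above the smallest normal float
(about 1.2e-38).

```c
#include <stddef.h>

/* Assumed guard value: inaudible after 16-bit quantisation, yet large
   enough that a few more decay steps cannot reach the denormal range. */
#define DENORMAL_GUARD 1.0e-18f

/* Add the guard offset to every element of the state array. */
static void add_denormal_guard(float *state, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        state[i] += DENORMAL_GUARD;
}

/* Hypothetical stand-in for the codec's inner loop: each iteration scales
   the state by `decay`, and every `guard_period` iterations the guard is
   re-applied so that "silence" never decays into denormals. */
static void decay_with_guard(float *state, size_t n, float decay,
                             unsigned iterations, unsigned guard_period)
{
    for (unsigned it = 0; it < iterations; ++it) {
        for (size_t i = 0; i < n; ++i)
            state[i] *= decay;
        if ((it + 1) % guard_period == 0)
            add_denormal_guard(state, n);
    }
}
```

The period N has to be short enough that the value cannot fall from the
guard level to the denormal threshold between two guard passes; with a
decay of 0.5 and N = 16 it drops only by a factor of 2^16 in between.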
>
> I date from the days when HPC codes were also hacked to get around the
> problem of slow integer to floating-point conversion. Exactly the
> same arguments were made about that, as were made then and are still
> made about underflow! But the CPUs fixed it, and you don't see those
> hacks in new code (or very rarely).

Oh, but you still see a lot of fp->int hacks, even in brand new code!

When you need scaling as well as fp->int conversion (i.e. fixed-point
math) the hacks are usually still the fastest method:

Add a magic double value, selected so as to leave the properly scaled
integer in the lower half of the result.
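
A sketch of the magic-number trick in C, under stated assumptions: IEEE 754
binary64 doubles, the default round-to-nearest mode, and arithmetic that is
really carried out in double precision. The function names and the 16.16
fixed-point format are my choices for illustration. Adding 1.5 * 2^52 forces
the sum into a range where one ulp equals 1, so the rounded integer lands in
the low 32 bits of the bit pattern; lowering the magic exponent by 16 makes
one ulp equal 2^-16, which gives the scaling for free.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret the low 32 bits of a double's bit pattern as a signed int. */
static int32_t low_half_of(double biased)
{
    uint64_t bits;
    uint32_t low;
    int32_t result;
    memcpy(&bits, &biased, sizeof bits);
    low = (uint32_t)bits;
    memcpy(&result, &low, sizeof result);
    return result;
}

/* Round x to the nearest integer (ties to even) without a convert
   instruction. */
static int32_t double_to_int_magic(double x)
{
    const double magic = 6755399441055744.0;   /* 1.5 * 2^52 */
    assert(x >= -2147483648.0 && x <= 2147483647.0);
    return low_half_of(x + magic);
}

/* Convert x to 16.16 fixed point, i.e. round(x * 65536), in one add. */
static int32_t double_to_fixed16_magic(double x)
{
    const double magic = 103079215104.0;       /* 1.5 * 2^36 */
    assert(x >= -32768.0 && x <= 32767.0);     /* conservative range check */
    return low_half_of(x + magic);
}
```

The 1.5 factor (rather than 1.0) keeps the sum inside the same binade for
negative inputs too, so the low half comes out as a two's complement value.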

Terje
--
- <Terje.Mathisen(a)hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
From: Nick Maclaren on

In article <qeidnU5ouuHiz3PanZ2dnUVZ_gydnZ2d(a)giganews.com>,
Terje Mathisen <terje.mathisen(a)hda.hydro.com> writes:
|>
|> A friend of mine told me about an important sw codec which on one
|> particular architecture really trashed when faced with "silence", i.e.
|> close to but not quite zero amplitude.
|>
|> This was of course due to sw denormal traps. The workaround was ugly but
|> effective: every N iterations, add a small amount to each array element,
|> with "small" chosen so as to be well below the (probably 16-bit) output
|> floor, but large enough to make sure that denorms wouldn't happen.

Been there - done that :-)

|> Oh, but you still see a lot of fp->int hacks, even in brand new code!
|>
|> When you need scaling as well as fp->int conversion (i.e. fixed-point
|> math) the hacks are usually still the fastest method:
|>
|> Add a magic double value, selected so as to leave the properly scaled
|> integer in the lower half of the result.

Do they just? Sometimes I think I have got caught in a time warp :-)


Regards,
Nick Maclaren.
From: Niels Jørgen Kruse on
Terje Mathisen <terje.mathisen(a)hda.hydro.com> wrote:

> Given that denorm doesn't matter for SPEC, it would seem like a very
> reasonable hw approach would be to start by assuming both values are
> normal, then in parallel check for special values, including zero.
>
> If a denorm is found, move the mantissa into the barrel shifter,
> normalize and stop/flush/restart the pipeline.
>
> Cost: Anywhere from 1-20 cycles?

On the POWER4/PPC970 FPU, the first denormal input causes an 8-cycle
stall, so there would be time enough in there to drain the FPU pipeline.
Additional denormal inputs each add 3 cycles to the stall. A denormal
output likely adds a 2-clock "Massive Cancellation Stall" and a 2-clock
"Underflow/Overflow Stall".

I have test code that I can plug particular input values into, if there
are some that you suspect may be undocumented exceptions.

The POWER6 FPU has no penalty for denormals.

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark
From: Terje Mathisen on
Nick Maclaren wrote:
> In article <fsqff4$elk$1(a)s1.news.oleane.net>,
> Jan Vorbrüggen <Jan.Vorbrueggen(a)not-thomson.net> writes:
> |> >
> |> >> No, that's NOT true! The proper analysis uses game theory, and the
> |> >> conclusions depend on (a) the probability of denormals occurring,
> |> >> (b) the slowdown due to such denormals and (c) the cost to the system
> |> >> of such a slowdown.
> |> >
> |> > I'm afraid that many cpus have been designed with (a) measured on SPECfp
> |> > only, and as long as none of the SPEC benchmarks generate a significant
> |> > number of denormals, they will stay unimportant.
> |>
> |> I can state for a fact that this isn't true for SPEC CFP2000. Hey, I
> |> wrote the code of the counter-example, so I should know!
>
> Congratulations! I spent a little while looking for a counter-example,
> and the only ones I had off-hand were trivial, revolting or constrained
> by copyright.
>
> |> Note that in many cases, the -fast or equivalent compiler option selects
> |> turning off precise denormal handling and activates flush-to-zero, so
> |> the problem just goes away even for a base SPECfp run.
>
> The IEEE 754R people won't love SPEC for allowing that in the base :-)

Ouch! So instead of fixing the hw to make denorm (nearly) as fast as
normal numbers, they allow all SPEC submissions to punt the issue.
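
For reference, this is roughly what such a compiler option arranges at
program start on x86-64. It is a sketch assuming SSE arithmetic and the
_MM_SET_FLUSH_ZERO_MODE macro from <xmmintrin.h>; the wrapper names are
mine, and on other targets the function simply reports that it did nothing.

```c
#include <float.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>
#define HAVE_SSE_FTZ 1
#else
#define HAVE_SSE_FTZ 0
#endif

/* Turn on flush-to-zero for SSE arithmetic in the calling thread.
   Returns 1 if the mode was set, 0 if this target is not handled. */
static int enable_flush_to_zero(void)
{
#if HAVE_SSE_FTZ
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    return 1;
#else
    return 0;
#endif
}

/* Compute DBL_MIN / 2 at run time; volatile keeps the compiler from
   folding the product. The exact result is a denormal, so with
   flush-to-zero active it comes back as 0.0 instead. */
static double halve_smallest_normal(void)
{
    volatile double tiny = DBL_MIN;
    volatile double half = 0.5;
    return tiny * half;
}
```

Note that this only flushes denormal results; treating denormal inputs as
zero is a separate control bit on the same hardware.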
>
> |> > Given that denorm doesn't matter for SPEC, it would seem like a very
> |> > reasonable hw approach would be to start by assuming both values are
> |> > normal, then in parallel check for special values, including zero.
> |> >
> |> > If a denorm is found, move the mantissa into the barrel shifter,
> |> > normalize and stop/flush/restart the pipeline.
> |> >
> |> > Cost: Anywhere from 1-20 cycles?
> |>
> |> Application slowdown: a factor of 20. Yes, you read that right.

Reread what I wrote: The 1-20 cycles are for various levels of hw
handling, from the really heroic (a barrel shifter followed by a
complete (extra) fp pipeline), to handling it like a branch miss,
flushing the pipelines and restarting the instruction, while doing the
normalization in the background.
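
As a software model of the normalization step only (a sketch, not a
description of any real FPU): take the fraction of a binary64 denormal,
shift it left until the implicit-one position (bit 52) is set, and lower
the exponent by one per shift. The struct and function names are invented
for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A normalized value: significand has bit 52 set, and the number equals
   significand * 2^(exponent - 52). */
typedef struct {
    uint64_t significand;
    int exponent;
} normalized_t;

/* Normalize a positive or negative binary64 denormal (sign is ignored). */
static normalized_t normalize_denormal(double x)
{
    const uint64_t implicit_one = UINT64_C(1) << 52;
    uint64_t bits;
    normalized_t r;

    memcpy(&bits, &x, sizeof bits);
    r.significand = bits & (implicit_one - 1);
    r.exponent = -1022;                 /* exponent shared by all denormals */

    /* Precondition: exponent field zero and fraction nonzero. */
    assert(((bits >> 52) & 0x7FF) == 0 && r.significand != 0);

    /* What a barrel shifter does in one step, done here bit by bit. */
    while ((r.significand & implicit_one) == 0) {
        r.significand <<= 1;
        r.exponent -= 1;
    }
    return r;
}
```

In hardware the loop becomes a leading-zero count plus a single shift; the
1-20 cycle range is about where that shift is done and how much of the
pipeline has to be replayed around it.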
>
> Yes, but that's implementing it by interrupt! The only system I found
> that DID implement it by a restartable instruction (in hardware) was
> the IBM POWER series, and that lost only a small factor.

x87 with 80-bit extended would seem to (mostly) avoid the issue, as long
as the load/store pipes could handle it. Did you see the same levels of
slowdown here?

Terje

--
- <Terje.Mathisen(a)hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
From: Nick Maclaren on

In article <atudnc4VINW7bG3anZ2dnUVZ8t6inZ2d(a)giganews.com>,
Terje Mathisen <terje.mathisen(a)hda.hydro.com> writes:
|>
|> > |> Note that in many cases, the -fast or equivalent compiler option selects
|> > |> turning off precise denormal handling and activates flush-to-zero, so
|> > |> the problem just goes away even for a base SPECfp run.
|> >
|> > The IEEE 754R people won't love SPEC for allowing that in the base :-)
|>
|> Ouch! So instead of fixing the hw to make denorm (nearly) as fast as
|> normal numbers, they allow all SPEC submissions to punt the issue.

Now, there we disagree. Even the very best numerical analysts and
architecture experts are divided on whether denormals are a good idea,
even in theory. The ones I have worked with mostly think that they
aren't, and my experience and analyses agree with them.

IEEE 754R permits hard underflow, as I understand it.

|> > Yes, but that's implementing it by interrupt! The only system I found
|> > that DID implement it by a restartable instruction (in hardware) was
|> > the IBM POWER series, and that lost only a small factor.
|>
|> x87 with 80-bit extended would seem to (mostly) avoid the issue, as long
|> as the load/store pipes could handle it. Did you see the same levels of
|> slowdown here?

I didn't test that, as I was sticking to more-or-less standard compiler
options. There are many good reasons to regard the Intel extended format
as a serious mistake - yes, I know it can be useful, but the chaos it
has caused over the years is excessive compared to its benefit.


Regards,
Nick Maclaren.