From: Steven G. Johnson on
On Nov 28, 7:11 pm, robert bristow-johnson <r...(a)audioimagination.com>
wrote:
> okay, Steven, using O&S for illustration, the pointer changes look
> pretty gnarly, even for N=8, but even if it's not, it's not in-place,
> is it? and if it's not in-place, is there not a cache hit? assuming
> you have the (double) memory and that cost is not a problem, you now
> have the top and bottom (i'm gonna assume radix-2 for the moment) of
> the butterfly you're reading (from the previous pass) and the top and
> bottom of the butterfly you are writing (which would be the same
> locations if you were doing this in-place). if it is not in-place,
> then you are ping-ponging and i would expect the number of cache
> misses to double.

First of all, I have no idea what you are talking about with "gnarly
pointer changes" for N=8. An N=8 FFT is 60 floating-point operations
that can be implemented in straight-line code: read into local
variables (= registers), do calculations, write results.

Second, there are in-place algorithms that do not require a separate
bitreversal pass, some of which I pointed out in another message on
this thread. Basically, they interleave the transpositions
(bitreversal = product of transpositions) with the computational
stages.
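
(To make "product of transpositions" concrete: the separate pass being
eliminated is nothing more than a set of disjoint swaps, as in the rough
sketch below; the in-place, in-order algorithms distribute those swaps
among the butterfly stages instead of spending an extra pass over the
array on them.)

#include <complex.h>

/* index i with its logn low-order bits reversed */
static unsigned bitrev(unsigned i, int logn)
{
    unsigned r = 0;
    for (int b = 0; b < logn; ++b)
        r |= ((i >> b) & 1u) << (logn - 1 - b);
    return r;
}

/* the conventional separate bit-reversal pass: each transposition
   (i, bitrev(i)) is performed exactly once */
void bitreverse_pass(double complex *x, int logn)
{
    unsigned n = 1u << logn;
    for (unsigned i = 0; i < n; ++i) {
        unsigned j = bitrev(i, logn);
        if (j > i) {
            double complex t = x[i];
            x[i] = x[j];
            x[j] = t;
        }
    }
}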

Third, an out-of-place algorithm does not require that you "ping-pong"
between two arrays (as in the Stockham algorithm). Actually, only one
stage of the computation needs to work out-of-place (as in the
recursive example code I posted) to write things in the right place;
all other stages of the computation can work in-place. And if you look
at the FFTW benchmarks, you'll see that properly written out-of-place
code can indeed be very competitive.
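
Not verbatim what I posted, but the idea in bare-bones (unoptimized)
radix-2 form looks roughly like this: the leaf calls scatter the input
into the output array, which is the single out-of-place step, and every
butterfly stage after that works in place on the output array, leaving
the result in natural order with no bit-reversal pass anywhere.

#include <complex.h>
#include <math.h>

/* Recursive DIT FFT of length n (a power of 2).  Reads in[0],
   in[stride], in[2*stride], ... and writes out[0..n-1] in natural
   order, with no separate bit-reversal pass. */
void fft_rec(const double complex *in, double complex *out,
             int n, int stride)
{
    if (n == 1) {
        out[0] = in[0];
        return;
    }
    fft_rec(in,          out,       n/2, 2*stride);   /* even samples */
    fft_rec(in + stride, out + n/2, n/2, 2*stride);   /* odd samples  */
    for (int k = 0; k < n/2; ++k) {
        double complex t = cexp(-2 * M_PI * I * k / n) * out[k + n/2];
        double complex e = out[k];
        out[k]       = e + t;
        out[k + n/2] = e - t;
    }
}

Call it as fft_rec(x, y, n, 1) and the DFT of x lands in y in natural
order.  This is the slow, pedagogical form, of course; the point is
only the structure of the data movement.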

Fourth, even if you have to do additional data motion or copying
stages, sometimes this can be advantageous because it makes the data
access during computation more consecutive. e.g. the Stockham ("ping-
pong") algorithm was originally designed for vector machines where
consecutive access was a huge win, and even now it has locality
benefits (although we currently find that other algorithms work better
for us). Or e.g. in the radix-sqrt(n) approach (dubbed the "four-step"
and "six-step" algorithms by Bailey, although variants date back to
Gentleman and Sande) you do explicit transpositions---essentially, a
part of the bit-reversal permutation---interleaved with the
computation, in order to improve locality, and this is a win for very
large N or for very slow memory (e.g. distributed-memory systems).
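
To make the four-step structure concrete, here is a rough sketch, with
the caveat that indexing conventions vary and Bailey's versions place
the explicit transposes differently for locality.  Take n = n1*n2, use
fft_rec() from the sketch above as the sub-transform, and let t1 and t2
be scratch arrays of length n:

#include <complex.h>
#include <math.h>

void fft_four_step(const double complex *x, double complex *out,
                   double complex *t1, double complex *t2,
                   int n1, int n2)
{
    int n = n1 * n2;
    /* 1) length-n1 FFTs down the n2 "columns" (stride n2) */
    for (int b = 0; b < n2; ++b)
        fft_rec(x + b, t1 + b * n1, n1, n2);
    /* 2) twiddle factors */
    for (int b = 0; b < n2; ++b)
        for (int k1 = 0; k1 < n1; ++k1)
            t1[b * n1 + k1] *= cexp(-2 * M_PI * I * ((double)b * k1) / n);
    /* 3) length-n2 FFTs along the n1 "rows" (stride n1 in t1) */
    for (int k1 = 0; k1 < n1; ++k1)
        fft_rec(t1 + k1, t2 + k1 * n2, n2, n1);
    /* 4) transpose so the result comes out in natural order */
    for (int k1 = 0; k1 < n1; ++k1)
        for (int k2 = 0; k2 < n2; ++k2)
            out[k2 * n1 + k1] = t2[k1 * n2 + k2];
}

The strided accesses are confined to the sub-FFTs and the transpose,
and those are exactly the pieces you reorganize (blocked transposes,
copying strided columns into contiguous buffers, and so on) when memory
is the bottleneck.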

Fifth, one is crazy to do radix-2 these days. You need to do the work
in larger chunks to take full advantage of the CPU. I've never seen a
radix-2 FFT that was not several times slower than the fastest code.

Ultimately, however, machines are complicated enough these days that
theorizing about performance is not very persuasive; you have to
actually try the different algorithms and sometimes the results are
unexpected. In the case of FFTs, people have tried quite a few
algorithms and there are many examples of people finding performance
benefits from algorithms that avoid separate bit-reversal passes.

Regards,
Steven G. Johnson
From: robert bristow-johnson on
On Nov 28, 5:30 pm, "Steven G. Johnson" <stev...(a)alum.mit.edu> wrote:

> On Nov 27, 1:46 pm, robert bristow-johnson

> > Steven, what's different (in terms of code) is the pointer
> > arithmetic. doing all sorts of goofy pointer arithmetic might be more
> > costly, in the aggregate throughout all of the FFT passes, than a
> > single bit-reversing pass at the beginning or end.

> As someone who has actually implemented many of the algorithms without
> bit reversal, I can say that (a) the pointer arithmetic is not really
> that complicated and (b) the pointer arithmetic per se is not the
> dominant factor in the time anyway (compared to the memory access
> itself), and (c) it really does seem to be faster on modern
> general-purpose CPUs than the methods involving a separate array pass.

On Nov 28, 9:16 pm, "Steven G. Johnson" <stev...(a)alum.mit.edu> wrote:
> On Nov 28, 7:11 pm, robert bristow-johnson <r...(a)audioimagination.com>
> wrote:
>
> > okay, Steven, using O&S for illustration, the pointer changes look
> > pretty gnarly, even for N=8, but even if it's not, it's not in-place,
> > is it? and if it's not in-place, is there not a cache hit? assuming
> > you have the (double) memory and that cost is not a problem, you now
> > have the top and bottom (i'm gonna assume radix-2 for the moment) of
> > the butterfly you're reading (from the previous pass) and the top and
> > bottom of the butterfly you are writing (which would be the same
> > locations if you were doing this in-place). if it is not in-place,
> > then you are ping-ponging and i would expect the number of cache
> > misses to double.
>
> First of all, I have no idea what you are talking about with "gnarly
> pointer changes" for N=8. An N=8 FFT is 60 floating-point operations
> that can be implemented in straight-line code: read into local
> variables (= registers), do calculations, write results.

well, i said "for illustration". in particular i was referring to
Fig. 9-15 (p. 597) in my 1989 version of O&S. certainly if N=8 you
can do it with straight-line code. that's beside the point. i was
trying to infer, from the illustration, what the rhyme or reason there
is in the particular pointer manipulation from that N=8 illustration.
it still looks gnarly. for the case of a general N (say, a power of
2) and much bigger than 2^3, what is the tight and efficient algorithm
for determining the order of indexing for the node that you are
writing to, and once you've determined that, what are the nodes that
you're reading from? except for the first pass (which is just like
the first pass of an in-place DIT), i cannot see what that rhyme or
reason is. i imagine there may be one, but it doesn't look simple
or elegant. so how could this be coded compactly and efficiently for
a general N = 2^p? i cannot see how it is done, while i do not deny
it can be done. i just don't see it. surely, if you have to look up
those node indices from a table (and a different table for every N?),
that may be quick, but it's not compact. this is what i mean by
"goofy pointer arithmetic". and, it's clear that it is not in-place
after the first pass. there *must* be another array space to write
to.

so, there clearly is a scheme you know about (i never thought
otherwise) that is something i hadn't seen. but using the one
reference to a normal-in, normal-out structure that i have seen, i'll
put it the way O&S did (p. 598):

"In the flow graph of Fig. 9-15, the data are accessed
nonsequentially, the computation is not in place, and a
scheme for indexing the data is considerably more
complicated than in either of the two previous cases [two
in-place DIT structures]. Consequently, this structure
has no apparent advantages."

i was reverberating that sentiment from the little bit that i know of
the radix-(2^n) FFT. that information could very well be stale. it's
also in the Yellow O&S (1975 - p. 300, Fig. 6.13, and they say the
same thing), so it's more than 3 decades old.

sometimes i think that O&S is timeless, other times it's just dated.

r b-j
From: Steven G. Johnson on
On Nov 29, 11:37 pm, robert bristow-johnson
<r...(a)audioimagination.com> wrote:
> well, i said "for illustration". in particular i was referring to
> Fig. 9-15 (p. 597) in my 1989 version of O&S. certainly if N=8 you
> can do it with straight-line code. that's beside the point. i was
> trying to infer, from the illustration, what the rhyme or reason there
> is in the particular pointer manipulation from that N=8 illustration.
> it still looks gnarly. for the case of a general N (say, a power of
> 2) and much bigger than 2^3, what is the tight and efficient algorithm
> for determining the order of indexing for the node that you are
> writing to, and once you've determined that, what are the nodes that
> you're reading from? except for the first pass (which is just like
> the first pass of an in-place DIT), i cannot see what that rhyme or
> reason is. i imagine there may be one, but it doesn't look simple
> or elegant.

No, there are certainly simple patterns to the memory access for in-
place algorithms without bit-reversal. They are not that complicated
to implement (as my 15-line example in another post shows). You
absolutely do not have to use a look-up-table of indices!

(One of the many reasons why I feel that butterfly diagrams are next
to useless in truly understanding FFT algorithms.)

(Of course, if you compare a highly optimized in-order FFT like FFTW
or MKL, it will be way faster than a textbook radix-2 algorithm even
if you skip the bitreversal stage. Highly optimized algorithms are
always going to be way more complicated than textbook radix-2, as long
as CPUs remain so complex.)

> "In the flow graph of Fig. 9-15, the data are accessed
> nonsequentially, the computation is not in place, and a
> scheme for indexing the data is considerably more
> complicated than in either of the two previous cases [two
> in-place DIT structures]. Consequently, this structure
> has no apparent advantages."

One reason to be cautious here is that O&S is comparing apples and
oranges: FFTs with bitreversed input or output, and FFTs with normal-
order output. Of course, if you can get away with permuted output it
will be simpler.

That's not the comparison I'm making: I'm comparing only FFTs with
normal-order output, that differ in whether you have a separate
bitreversal pass or whether you combine the permutations with the
computation.

> i was reverberating that sentiment from the little bit that i know of
> the radix-(2^n) FFT. that information could very well be stale. it's
> also in the Yellow O&S (1975 - p. 300, Fig. 6.13, and they say the
> same thing), so it's more than 3 decades old.
>
> sometimes i think that O&S is timeless, other times it's just dated.

Comments on performance advantages from FFT algorithms that avoid a
separate bitreversal pass date back to the 60's, so in some sense it's
not a matter of being out-of-date. On the other hand, it seems to be
especially in the last 20 years that floating-point performance of
CPUs has totally outstripped memory performance, and that makes non-
arithmetic issues so much more relevant now.

The main problem is that textbooks are trying to teach students the
basic ideas of FFT algorithms rather than performance hacks, and
therefore they almost always limit their discussion to only two
algorithms for power-of-two N: radix-2 decimation in time and radix-2
decimation in frequency. If you limit yourself to these two cases,
you can't really have a useful discussion of performance optimization
on modern CPUs where performance is limited not by arithmetic counts
but rather by memory access and register/pipeline utilization.

Steven