From: Ron N. on
On Nov 27, 9:24 am, "Steven G. Johnson" <stev...(a)alum.mit.edu> wrote:
> On Nov 24, 9:45 am, Richard Owlett <rowl...(a)atlascomm.net> wrote:
>
> > In "Chapter 12 - The Fast Fourier Transform / FFT Programs" of _The
> > Scientist and Engineer's Guide to Digital Signal Processing_
> > Smith says "There are also FFT routines that completely eliminate the
> > bit reversal sorting."
>
> > Could someone point me to a description of one, preferably with BASIC or
> > FORTRAN sample code.
>
> A number of references are given in the Wikipedia article on the
> Cooley-Tukey FFT algorithm.
>
> http://en.wikipedia.org/wiki/Cooley-Tukey_FFT_algorithm
>
> The Cooley-Tukey FFT (the most common algorithm), along with other FFT
> algorithms for that matter, intrinsically requires some sort of re-
> ordering, in the sense that its operations map consecutive inputs to
> non-consecutive outputs and vice-versa.
>
> The most well-known approach is to do the re-ordering all at once,
> with a single bit-reversal pass (or, more generally, a mixed-base
> digit reversal). Another approach that still works in-place is to
> arrange the sequence of radices so that one can do a sequence of small
> transpositions interleaved with the transform (alternating DIT/DIF
> stages work well here). The simplest approaches work out-of-place.
> e.g. the Stockham algorithm goes back and forth between two arrays
> with each pass, transposing one digit at a time.
>
> The easiest approach, in my mind, is simply to do the algorithm
> recursively out-of-place; if you directly transcribe the Cooley-Tukey
> algorithm from the mathematical description, working recursively, all
> of the outputs are "automatically" put in the right place. (This was
> the approach we used in the original FFTW; the current version of FFTW
> implements a variety of algorithms as described in our Proc. IEEE
> paper.)
>
> For example, here is a very simple 15-line radix-2 FFT, working
> recursively and out-of-place, that does not require any separate bit-
> reversal pass. (To make it reasonably fast, one would want to use a
> larger base case than the size-1 base-case used here. It also uses a
> relatively inaccurate trigonometric recurrence. But of course, if you
> were actually interested in such practical issues you should be using
> a pre-existing FFT library.)
>
> #include <complex.h>
> #define TWOPI 6.2831853071795864769252867665590057683943388
> /* call with rec_fft(+1 or -1, N, input, output, 1, 1), where input
>    and output are arrays of length N ...
>    requires C99 complex-number support */
> void rec_fft(int sign, int N, complex double *x, complex double *y,
>              int is, int os)
> {
>     if (N == 1) *y = *x;
>     else {
>         int N2 = N/2;
>         rec_fft(sign, N2, x, y, is*2, os);
>         rec_fft(sign, N2, x+is, y+os*N2, is*2, os);
>         complex double omega = 1.0,
>                        omega1 = cexp(sign * I * TWOPI / N);
>         for (int i = 0; i < N2; ++i) {
>             complex double a = y[os*i], b = omega * y[os*(i + N2)];
>             y[os*i] = a + b;
>             y[os*(i + N2)] = a - b;
>             omega *= omega1;
>         }
>     }
> }
>
> Regards,
> Steven G. Johnson

Richard wanted the FFT routines in BASIC, so I converted
the above routine into Basic for Chipmunk Basic, changing
pointer arithmetic into array access, converting complex
data types back to standard float variables, and removing
the trig recurrence. A Basic implementation that allows
recursive subroutines with locally scoped variables is
required.

I found that the line in the above C code:
> if (N == 1) *y = *x;
does pretty much the same thing as the final sort of a
generic in-place decimation-in-frequency FFT algorithm,
except that it distributes the sort among the butterflies
and calculates the bit-reversed indexes recursively.

rem *** recursive out-of-place radix-2 FFT in Basic ***
rem
rem *** call as: rec_fft(flag, n, xr,xi,0, yr,yi,0, 1,1)
rem   flag = 1 for FFT, -1 for IFFT
rem   n is FFT length
rem   xr,xi are input arrays (must be dimensioned to length >= n)
rem   yr,yi are result arrays (must be dimensioned to length >= n)
rem   (kx,ky,ks,os) are sub-vector parameter state variables
rem   (n2,i,c,s,k1,k2,ar,ai,br,bi) are local variables
rem

sub rec_fft(flag,n, xr,xi,kx, yr,yi,ky, ks,os, n2,i,c,s,k1,k2,ar,ai,br,bi)
  if (n = 1)
    rem ** this does a bit-reversed-index copy and scale **
    rem print ky,kx to see post (DIF) sorting
    s = 1 : if (flag = -1) then s = 1/ks
    yr(ky) = xr(kx) * s
    yi(ky) = xi(kx) * s
  else
    n2 = n/2
    rec_fft(flag,n2, xr,xi,kx,    yr,yi,ky,       ks*2,os)
    rec_fft(flag,n2, xr,xi,kx+ks, yr,yi,ky+os*n2, ks*2,os)
    for i = 0 to n2-1
      c = cos(i * 2*pi/n)
      s = sin(i * flag * 2*pi/n)
      k1 = ky+os*(i)
      k2 = ky+os*(i+n2)
      ar = yr(k1)
      ai = yi(k1)
      br = c * yr(k2) - s * yi(k2)
      bi = c * yi(k2) + s * yr(k2)
      yr(k1) = ar + br
      yi(k1) = ai + bi
      yr(k2) = ar - br
      yi(k2) = ai - bi
    next i
  endif
end sub : rem end rec_fft

If the above code formats wrong (old-fashioned Basic is
line-break sensitive), a copy is here:
http://www.nicholson.com/dsp.fft1.html


IMHO. YMMV.
--
rhn A.T nicholson d.0.t C-o-M
From: Richard Owlett on
Steven G. Johnson wrote:
> On Nov 24, 9:45 am, Richard Owlett <rowl...(a)atlascomm.net> wrote:
>
>>In "Chapter 12 - The Fast Fourier Transform / FFT Programs" of _The
>>Scientist and Engineer's Guide to Digital Signal Processing_
>>Smith says "There are also FFT routines that completely eliminate the
>>bit reversal sorting."
>>
>>Could someone point me to a description of one, preferably with BASIC or
>>FORTRAN sample code.
>
>
> A number of references are given in the Wikipedia article on the
> Cooley-Tukey FFT algorithm.
>
> http://en.wikipedia.org/wiki/Cooley-Tukey_FFT_algorithm


One of its references is to a paper titled
"The FFT - an algorithm the whole family can use"
(http://www.cs.dartmouth.edu/~rockmore/cse-fft.pdf).

I've taken a quick look and I may be in the target audience.
From: Steven G. Johnson on
On Nov 27, 1:46 pm, robert bristow-johnson
> Steven, what's different (in terms of code) is the pointer
> arithmetic. doing all sorts of goofy pointer arithmetic might be more
> costly, in the aggregate throughout all of the FFT passes, than a
> single bit-reversing pass at the beginning or end.

As someone who has actually implemented many of the algorithms without
bit reversal, I can say that (a) the pointer arithmetic is not really
that complicated and (b) the pointer arithmetic per se is not the
dominant factor in the time anyway (compared to the memory access
itself), and (c) it really does seem to be faster on modern
general-purpose CPUs than the methods involving a separate array pass.

> in addition, as i tried to point out earlier, if you're doing
> something like fast-convolution using the FFT, then if your forward
> FFT leaves its results in bit-reversing order and your inverse FFT
> takes its input in bit reversed order (and leaves the results in
> normal order), then no bit reversing will be necessary. (the transfer
> function, besides being pre-computed, will also be pre bit-reversed.)

It's well known that you don't need normal-order output for
convolutions. I was concerned with the case where you do want normal
order (which most users seem to want, even if they could conceivably
do without it). However, in the case of convolutions, I'm
convinced that the optimal algorithm will not be to do two completely
separate FFTs in any case, regardless of the output/input order; I
have some ideas on this that I want to explore for FFTW 4 (which will
have built-in convolution routines).

Regards,
Steven G. Johnson
From: robert bristow-johnson on
On Nov 28, 5:30 pm, "Steven G. Johnson" <stev...(a)alum.mit.edu> wrote:
> On Nov 27, 1:46 pm, robert bristow-johnson
>
> > Steven, what's different (in terms of code) is the pointer
> > arithmetic. doing all sorts of goofy pointer arithmetic might be more
> > costly, in the aggregate throughout all of the FFT passes, than a
> > single bit-reversing pass at the beginning or end.
>
> As someone who has actually implemented many of the algorithms without
> bit reversal, I can say that (a) the pointer arithmetic is not really
> that complicated and (b) the pointer arithmetic per se is not the
> dominant factor in the time anyway (compared to the memory access
> itself), and (c) it really does seem to be faster on modern
> general-purpose CPUs than the methods involving a separate array pass.

okay, Steven, using O&S for illustration, the pointer changes look
pretty gnarly, even for N=8, but even if it's not, it's not in-place,
is it? and if it's not in-place, is there not a cache hit? assuming
you have the (double) memory and that cost is not a problem, you now
have the top and bottom (i'm gonna assume radix-2 for the moment) of
the butterfly you're reading (from the previous pass) and the top and
bottom of the butterfly you are writing (which would be the same
locations if you were doing this in-place). if it is not in-place,
then you are ping-ponging and i would expect the number of cache
misses to double.

> > in addition, as i tried to point out earlier, if you're doing
> > something like fast-convolution using the FFT, then if your forward
> > FFT leaves its results in bit-reversing order and your inverse FFT
> > takes its input in bit reversed order (and leaves the results in
> > normal order), then no bit reversing will be necessary. (the transfer
> > function, besides being pre-computed, will also be pre bit-reversed.)
>
> It's well known that you don't need normal-order output for
> convolutions. I was concerned with the case where you do want normal
> order (which most users seem to want, even if they could conceivably
> do without it). However, in the case of convolutions, I'm
> convinced that the optimal algorithm will not be to do two completely
> separate FFTs in any case,

the FFT and iFFT kernels are pretty small. so you have two short
snippets of code rather than one. i can't imagine that being a cache-
miss problem. what else could be the problem?

> regardless of the output/input order; I
> have some ideas on this that I want to explore for FFTW 4 (which will
> have built-in convolution routines).

well, let us know. even though there are some things you're saying
here that either i just don't understand (in a context where i think i
*should* be understanding) or find a little contentious, everyone here,
including me, recognizes you as the FFT guru. but the maths and other
pertinent facts don't really give a fig who is the guru or what the
guru says. that's the only reason i take some of these issues on.

bestest,

r b-j



P.S. BTW, i think we now work within 15 kilometers of each other. i
wouldn't mind hooking up (maybe at some kind of function, but the
Boston section AES meetings seem to be pretty lame as of late).
anyway, if you know of some signal proc or similar seminar or
something (or some FFTW thing you're doing), please lemme know. i'd
love to see it and shake your hand.
From: Ron N. on
On Nov 28, 4:11 pm, robert bristow-johnson <r...(a)audioimagination.com>
wrote:
> On Nov 28, 5:30 pm, "Steven G. Johnson" <stev...(a)alum.mit.edu> wrote:
>
> > On Nov 27, 1:46 pm, robert bristow-johnson
>
> > > Steven, what's different (in terms of code) is the pointer
> > > arithmetic. doing all sorts of goofy pointer arithmetic might be more
> > > costly, in the aggregate throughout all of the FFT passes, than a
> > > single bit-reversing pass at the beginning or end.
>
> > As someone who has actually implemented many of the algorithms without
> > bit reversal, I can say that (a) the pointer arithmetic is not really
> > that complicated and (b) the pointer arithmetic per se is not the
> > dominant factor in the time anyway (compared to the memory access
> > itself), and (c) it really does seem to be faster on modern
> > general-purpose CPUs than the methods involving a separate array pass.
>
> okay, Steven, using O&S for illustration, the pointer changes look
> pretty gnarly, even for N=8, but even if it's not, it's not in-place,
> is it? and if it's not in-place, is there not a cache hit? assuming
> you have the (double) memory and that cost is not a problem, you now
> have the top and bottom (i'm gonna assume radix-2 for the moment) of
> the butterfly you're reading (from the previous pass) and the top and
> bottom of the butterfly you are writing (which would be the same
> locations if you were doing this in-place). if it is not in-place,
> then you are ping-ponging and i would expect the number of cache
> misses to double.

There are alternatives to either in-place storage or
ping-ponging of all butterfly results. Gnarly software
tagging/caching/swapping indeed; but even gnarly stuff
could look close to free on systems where a single
dcache miss might support the execution of hundred(s)
of address computation instructions underneath it.


IMHO. YMMV.
--
rhn A.T nicholson d.0.t C-o-M