From: mike3 on
Hi.

I made a program to run Bailey's "4-step" (aka "matrix") Fast Fourier
Transform algorithm, but it seemed to go _slower_ than a simple direct
recursive+vectorized-iterative approach. Why is that?
From: Malcolm McLean on

"mike3" <mike4ty4(a)yahoo.com> wrote in message news:
>
> I made a program to run Bailey's "4-step" (aka "matrix") Fast Fourier
> Transform algorithm, but it seemed to go _slower_ than a simple direct
> recursive+vectorized-iterative approach. Why is that?
>
Could be any reason. By using the Fast Fourier transform at all you have
basically reduced the problem to O(N log N) multiplications. Whilst I'm not
familiar with Bailey's tweak, it will almost certainly get rid of a few of
the multiplications in exchange for more shuffling of data about in
memory, but it won't change the asymptotic order of the algorithm.
This could well result in less efficient code, particularly for small N,
because of cache usage, because you are using the integer/float pipelines
in a less parallel way, or because the loops are harder for the compiler
to optimise.


From: Steven G. Johnson on
On Sep 12, 5:50 am, mike3 <mike4...(a)yahoo.com> wrote:
> I made a program to run Bailey's "4-step" (aka "matrix") Fast Fourier
> Transform algorithm, but it seemed to go _slower_ than a simple direct
> recursive+vectorized-iterative approach. Why is that?

With our FFTW code (www.fftw.org), we found that a radix-sqrt(N) step
(what Bailey calls the 4-step algorithm) is beneficial only when N gets
really large (on the order of 2^20 or bigger), and even then you only do
a single step (to reduce the problem to size ~2^10 transforms done by
other methods).

Essentially, while you can prove that radix-sqrt(N) uses hierarchical
memory (caches) very well in an asymptotic sense, the constant factors
are not so good compared to bounded-radix code. The advantage of a
bounded-radix code is that the radix butterfly can essentially be done
entirely in hard-coded straight-line code, including twiddle factors
(which translates into efficient usage of the registers and CPU
pipeline, at least if the radix is large enough), whereas radix-sqrt(N)
requires more overhead for the transpositions, the twiddle factors, and
other things that must now typically be done in separate array passes.
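
To make the structure concrete, here is a bare-bones C sketch of the
radix-sqrt(N) decomposition (illustrative only: the subtransforms are
done by a naive O(n^2) DFT to keep it short, where a real implementation
would use an efficient FFT, fuse the transpositions into other passes,
and so on; all the names are made up for the example):

#include <complex.h>
#include <math.h>
#include <stdlib.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Naive O(n^2) DFT of a strided array; a stand-in for a real sub-FFT. */
static void dft(const double complex *x, double complex *y,
                size_t n, size_t stride)
{
    for (size_t k = 0; k < n; k++) {
        double complex sum = 0;
        for (size_t j = 0; j < n; j++)
            sum += x[j * stride]
                 * cexp(-2.0 * M_PI * I * (double)(j * k) / (double)n);
        y[k] = sum;
    }
}

/* Four-step DFT of length N = n1*n2:
 *   1. n2 subtransforms of length n1 down the strided "columns",
 *   2. multiplication by the twiddle factors exp(-2*pi*i*b*k1/N),
 *   3. n1 subtransforms of length n2 along contiguous rows,
 *   4. transpositions, folded here into the index maps.            */
void fft_four_step(const double complex *in, double complex *out,
                   size_t n1, size_t n2)
{
    size_t N = n1 * n2;
    double complex *tmp = malloc(N * sizeof *tmp);
    double complex *col = malloc(n1 * sizeof *col);
    double complex *row = malloc(n2 * sizeof *row);

    for (size_t b = 0; b < n2; b++) {            /* steps 1 and 2 */
        dft(in + b, col, n1, n2);
        for (size_t k1 = 0; k1 < n1; k1++)       /* store transposed */
            tmp[k1 * n2 + b] = col[k1]
                * cexp(-2.0 * M_PI * I * (double)(b * k1) / (double)N);
    }
    for (size_t k1 = 0; k1 < n1; k1++) {         /* steps 3 and 4 */
        dft(tmp + k1 * n2, row, n2, 1);
        for (size_t k2 = 0; k2 < n2; k2++)
            out[k1 + n1 * k2] = row[k2];         /* transposed output */
    }
    free(row); free(col); free(tmp);
}

Note how the twiddle multiply and the transpositions each want their own
pass over the whole array; that is exactly the overhead that bounded-radix
code avoids.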

See also our chapter "Implementing FFTs in practice"
(http://cnx.org/content/m16336/latest/) in Burrus's FFT book for some
basic background on these questions.

Regards,
Steven G. Johnson

PS. Quite apart from all of these questions, there are general issues
of program optimization that come to bear when comparing "algorithms"
in this way, because you are always really comparing implementations,
not abstract algorithms, and the quality of the implementations
matters a lot. I would urge you to benchmark your code against some
other available routines, or just compare to the numbers posted on
fftw.org/speed, to get some idea of whether you are comparing
reasonably efficient FFT implementations. (If you compare extremely
slow implementations, it's hard to draw any sensible conclusions about
the algorithms.)
From: mike3 on
On Sep 13, 10:26 pm, "Steven G. Johnson" <stev...(a)alum.mit.edu> wrote:
> On Sep 12, 5:50 am, mike3 <mike4...(a)yahoo.com> wrote:
>
> > I made a program to run Bailey's "4-step" (aka "matrix") Fast Fourier
> > Transform algorithm, but it seemed to go _slower_ than a simple direct
> > recursive+vectorized-iterative approach. Why is that?
>
> With our FFTW code (www.fftw.org), we found that a radix-sqrt(N) step
> (what Bailey calls the 4-step algorithm) is beneficial only when N gets
> really large (on the order of 2^20 or bigger), and even then you only do
> a single step (to reduce the problem to size ~2^10 transforms done by
> other methods).
>

Well, I'm running at a very large 2^26 = 67,108,864 elements for the
matrix algorithm, and it still shows up as significantly slower than the
non-matrix one (which is a radix-2 recursion down to a size small enough
to switch to a fast iterative method with a precomputed root table).
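
For reference, here's a stripped-down sketch of the general shape of the
non-matrix routine I mean (illustrative only, not my actual code; the
cutoff and the names are made up for the example):

#include <complex.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define CUTOFF 4096        /* where the recursion hands off, made up here */

static double complex *W;  /* root table: W[k] = exp(-2*pi*i*k/Nmax) */
static size_t Nmax;

void fft_init(size_t nmax)  /* nmax = full (power-of-two) transform size */
{
    Nmax = nmax;
    W = malloc((nmax / 2) * sizeof *W);
    for (size_t k = 0; k < nmax / 2; k++)
        W[k] = cexp(-2.0 * M_PI * I * (double)k / (double)nmax);
}

/* Iterative in-place radix-2 FFT for the small base cases. */
static void fft_iter(double complex *x, size_t n)
{
    for (size_t i = 1, j = 0; i < n; i++) {   /* bit-reversal permutation */
        size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j |= bit;
        if (i < j) { double complex t = x[i]; x[i] = x[j]; x[j] = t; }
    }
    for (size_t len = 2; len <= n; len <<= 1) {  /* butterfly passes */
        size_t step = Nmax / len;    /* stride into the shared root table */
        for (size_t i = 0; i < n; i += len)
            for (size_t k = 0; k < len / 2; k++) {
                double complex u = x[i + k];
                double complex v = x[i + k + len / 2] * W[k * step];
                x[i + k]           = u + v;
                x[i + k + len / 2] = u - v;
            }
    }
}

/* Radix-2 decimation-in-time recursion down to the iterative cutoff;
 * scratch must hold n elements. */
void fft_rec(double complex *x, size_t n, double complex *scratch)
{
    if (n <= CUTOFF) { fft_iter(x, n); return; }
    size_t h = n / 2, step = Nmax / n;
    for (size_t i = 0; i < h; i++) {          /* deinterleave even/odd */
        scratch[i]     = x[2 * i];
        scratch[i + h] = x[2 * i + 1];
    }
    memcpy(x, scratch, n * sizeof *x);
    fft_rec(x, h, scratch);
    fft_rec(x + h, h, scratch);
    for (size_t k = 0; k < h; k++) {          /* combine the two halves */
        double complex u = x[k], v = x[k + h] * W[k * step];
        x[k]     = u + v;
        x[k + h] = u - v;
    }
}

(Called as fft_init(N) once, then fft_rec(x, N, scratch).)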

> Essentially, while you can prove that radix-sqrt(N) uses hierarchical
> memory (caches) very well in an asymptotic sense, the constant factors
> are not so good compared to bounded-radix code. The advantage of a
> bounded-radix code is that the radix butterfly can essentially be done
> entirely in hard-coded straight-line code, including twiddle factors
> (which translates into efficient usage of the registers and CPU
> pipeline, at least if the radix is large enough), whereas radix-sqrt(N)
> requires more overhead for the transpositions, the twiddle factors, and
> other things that must now typically be done in separate array passes.
>

So what could be going on here?

> See also our chapter "Implementing FFTs in practice"
> (http://cnx.org/content/m16336/latest/) in Burrus's FFT book for some
> basic background on these questions.
>
> Regards,
> Steven G. Johnson
>
> PS. Quite apart from all of these questions, there are general issues
> of program optimization that come to bear when comparing "algorithms"
> in this way, because you are always really comparing implementations,
> not abstract algorithms, and the quality of the implementations
> matters a lot.  I would urge you to benchmark your code against some
> other available routines, or just compare to the numbers posted on
> fftw.org/speed, to get some idea of whether you are comparing
> reasonably efficient FFT implementations.  (If you compare extremely
> slow implementations, it's hard to draw any sensible conclusions about
> the algorithms.)

From: Steven G. Johnson on
On Sep 15, 3:03 am, mike3 <mike4...(a)yahoo.com> wrote:
> Well, I'm running at a very large 2^26 = 67,108,864 elements for the
> matrix algorithm, and it still shows up as significantly slower than the
> non-matrix one (which is a radix-2 recursion down to a size small enough
> to switch to a fast iterative method with a precomputed root table).

Note that you probably only want to do one step of the radix-sqrt(N)
algorithm, and then switch to another algorithm for the size-2^13
subtransforms. You don't want to do radix-sqrt(N) recursively.
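
In outline, the top level is just a dispatch, something like this sketch
(the cutoff, the power-of-two split, and the routine names are all
assumptions for the example; the subtransforms inside the four-step pass
would likewise be handed to the bounded-radix code, not done naively):

#include <complex.h>
#include <stddef.h>

/* Prototypes for routines like the sketches earlier in the thread. */
void fft_four_step(const double complex *in, double complex *out,
                   size_t n1, size_t n2);
void fft_small(double complex *x, size_t n);  /* bounded-radix code */

/* One radix-sqrt(N) step at the top, conventional code below: for
 * N = 2^26 this splits into 2^13 x 2^13, and each length-2^13 piece
 * is a job for fft_small(), not for another four-step pass. */
void fft_top(const double complex *in, double complex *out, size_t n)
{
    if (n <= ((size_t)1 << 13)) {
        for (size_t i = 0; i < n; i++)   /* out-of-place for the sketch */
            out[i] = in[i];
        fft_small(out, n);
        return;
    }
    size_t n1 = 1;
    while (n1 * n1 < n) n1 <<= 1;        /* n1 ~ sqrt(n), power of two */
    fft_four_step(in, out, n1, n / n1);
}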

> So what could be going on here?

Well, there is always the possibility that your implementation is
simply very inefficient. (Even a simple square-matrix transposition,
which is one part of the 4-step algorithm, is surprisingly tricky to
code efficiently once you take cache lines into account. Dealing with
memory efficiently is nontrivial these days. Efficient FFTs are also
nontrivial, bearing little resemblance to textbook programs à la
Numerical Recipes.)
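
For instance, the usual trick for the transposition is to work in small
square tiles, so that both the rows being read and the columns being
written stay cache-resident; something like this sketch (the tile size
32 is only a plausible starting point to tune, not a magic number):

#include <complex.h>
#include <stddef.h>

#define TILE 32  /* tune so two TILE x TILE tiles fit comfortably in L1 */

/* In-place transpose of an n x n matrix, walked in TILE x TILE blocks
 * so that each swap touches only two cache-resident tiles. */
void transpose_blocked(double complex *a, size_t n)
{
    for (size_t ib = 0; ib < n; ib += TILE)
        for (size_t jb = ib; jb < n; jb += TILE)       /* upper tiles only */
            for (size_t i = ib; i < ib + TILE && i < n; i++) {
                size_t j0 = (ib == jb) ? i + 1 : jb;   /* skip the diagonal */
                for (size_t j = j0; j < jb + TILE && j < n; j++) {
                    double complex t = a[i * n + j];   /* swap (i,j),(j,i) */
                    a[i * n + j] = a[j * n + i];
                    a[j * n + i] = t;
                }
            }
}

And even this simple version does nothing about TLB misses or conflicts
between rows that map to the same cache sets, which is where it starts
to get genuinely tricky.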

Again, I strongly recommend that you compare with other available FFT
software to see if you are even in the ballpark of an efficient
implementation; otherwise you are comparing one slow code to another
(in which case you may need to rethink your whole approach before
detailed optimization becomes worthwhile). Looking at the graphs on
fftw.org/speed, on a 3GHz Intel Core Duo from a couple years ago
(using only a single processor), both FFTW and the Intel Math Kernel
Library take a little over 4 seconds to do a size 2^26 complex-data
FFT in double precision.

Steven