From: kevin on
On Jul 9, 12:39 pm, "onkars" <onkar.sarode(a)n_o_s_p_a_m.gmail.com>
wrote:

>I am not exactly sure what you mean by serial.

By serial, I mean that all the arithmetic operations (+,-,x) are done
in serially - one bit at a time. This obviously requires more
arithmetic units to meet a given throughput, but the units are
smaller, and the wiring between units is serial. But ignore that for
now because there are other issues I'd like to address, and I hope
they will answer the questions raised in your post.

Consider a conventional 4096, radix 2, sequential in, bit reverse out
FFT. Decimation-in-Time (DIT) (input leg twiddles). Let's say we use
a fully parallel array to process it. We'd have 12 stages of
butterflies, with 4096/2 (=2048) butterflies per stage. Our inputs
will consist of 4096 complex words, which means that we have 2x4096
inputs of so many bits each. The twiddles in the 1st stage are unity
(twiddle 0 = 1), the twiddles in the 2nd stage are 0 (=1) and +/- j
(sign change and swap). The 3rd stage has 0, +/-j and +/-.707 types.
And we could implement the +/-.707 types with special hardware. Later
stages have more difficult twiddles, so perhaps we'd use full
multipliers to do them.

The above is inefficient because, as I mentioned before, you can
modify a lower radix graph to be just as computationally efficient as
a higher radix graph. In doing that for the above example, we'd have
twiddle 0's in the 1st stage, 0 and +/- j in the 2nd, 0,+/- j in the
3rd, 0 in the 4th, 0 and +/- j in the 5th, 0, +/- j in the 6th, etc.
We'd ALSO have full multipliers between the 3rd and 4th stage, the 6th
and 7th stage, and the 9th and 10th stage. Overall, it's very similar
to having 4 stages of a radix 8 graph, or 2 stages off a radix 64
graph (8x8x8x8 = 64x64 = 4096). But our STRUCTURE is still radix 2.
And some of the multiplies in the columns of full multipliers will be
of the 0, +/- j and +/- .707 types. The net result is a substantial
reduction of full multipliers compared to the naive radix 2 approach.

Now consider using a higher radix structure, say, radix 8. You'd have
4 stages of radix 8 butterflies, with 4096/8 (=512) butterflies per
stage. But remember that each radix 8 butterfly consists of 3 stages
of four radix 2 butterflies.

So I have a little bit of difficulty with your statement:

>Hence I say:
>"Using smaller radices has following disadvantages ---
>The number of stages and required number of FFTs (to achieve 64
>samples/clock) increases.

I think you're presuming that a radix 8 computes in the same amount of
time as a radix 2. Shouldn't 3 radix 2 stages (an 8 point FFT)
compute in the same amount of time as1 radix 8 stage (an 8 point
FFT)? I guess it may depend on how tightly packed the elements are
(+, -, x).

Your other statement is:

>Reasonably assuming that the number arithmetic
>units for any approach is same

Yes, with emphasis on ANY APPROACH. With a 4096 graph, the
computational efficiency (number of +, -, x) of a radix 2 graph can be
made identical to a radix 8 or radix 64 graph.

As to whether lower or higher radix implementations have shorter
wiring routes - that's a good question. Remember that with higher
radix, you're basically grouping a larger number of radix 2
butterflies and drawing/calling it a radix 8 butterfly. So the wiring
routes INTERNAL to the radix 8 butterfly may seem to decrease, and the
wiring EXTERNAL to the butterfly may seem to increase.

But maybe it only LOOKS that way because of the way in which FFT
graphs are drawn. The graphs are reality in terms of processing
sequence, but not necessarily in terms of physical layout.

In full custom design, a critical factor is the height/width of the
various elements you're dealing with. In one case I dealt with, the
pads were huge compared to the size of the FFT I was doing, so it
didn't much matter how tall or wide my adds/subtracts or multiplies
were. Very long (slow) input/output wires converged to my tiny
processor. I had spent all my time trying to figure out how to
tightly space my processing elements (let's see, if I use a barrel
shifter instead of multiplexers, I can lay out my commutator design
with Nlog2(N) transistors, versus a lot more for an 8x8 multiplexer,
and the height fits the adders much better). Talk about the forest
and the trees.

You have a different problem because you're using an FPGA (I've used
them too, but I find that I have to think differently to partition
things properly).

>Will the routes when using a single
>64-path radix-64 solution be longer than when using 4 different radix-16
>FFTs each designed as having 16-paths? -- I think they will.

On an FPGA, I guess I'd agree with that. But I'm not certain.

One other thing troubles me. You're presuming 64 eight bit inputs at
a time, and a 64 point array to do the two stages you need (input
shuffling/reordering, 1st 64 point array FFT, second stage shuffling/
reordering, 2nd 64 point array, possibly followed by output
reordering, depending how the 2 shuffling/compute stages are
designed). It just seems to me that, for complex valued inputs and a
64 point array, you'd need 128 inputs at a time. 64 real and 64
imaginary. With 8 bits each, that's 128x8 = 1024 bits. You're
presuming 64 eight point values per 375 Mhz clock tick (512 bits).
Doesn't that mean that you're going to delay 512 bits by 1 clock
cycle, and wait until you have 1024 bits to pump them into your
array? That would cut your processing clock rate by half. Is this
what you intend?

I'd have more to say, but it's really late here.

Kevin McGee
From: kevin on
3rd paragraph, 2nd sentence should be .. 0, +/- j, +/-.707 in the
6th, etc.

Kevin
From: onkars on
>3rd paragraph, 2nd sentence should be .. 0, +/- j, +/-.707 in the
>6th, etc.
>
>Kevin
>

I dont intend that. I intend to have 64 complex values at a time. Maybe
using the SERDES blocks in the FPGA IOs to do the parallelization in to 8
bits of a sample.
From: kevin on
On Jul 10, 3:29 pm, "onkars" <onkar.sarode(a)n_o_s_p_a_m.gmail.com>
wrote:
> >3rd paragraph, 2nd sentence should be  .. 0, +/- j, +/-.707 in the
> >6th, etc.
>
> >Kevin
>
> I dont intend that. I intend to have 64 complex values at a time. Maybe
> using the SERDES blocks in the FPGA IOs to do the parallelization in to 8
> bits of a sample.


OK. So you've got 64 complex values arriving per clock tick at 375
Mhz. That's 2x64x8 = 1024 bits.

Presuming that your inputs are ordered/shuffled/delayed properly for a
4096 FFT, now you've got to do a 64 point fully parallel array
architecture to compute your first stage. If you do it as radix 2,
DIT, you'd have twiddles of 0 (=1) in the 1st stage, twiddles of 0 and
+/-j in the 2nd, twiddles of 0, +/- j, and +/-.707 in the 3rd, 0 in
the 4th, 0, +/-j in the 5th, and 0, +/-j, +/-.707 in the 6th. PLUS
one column of full complex multiplies between the 3rd and 4th stages
(same as a 64 point two stage radix 8 graph). And,since you're doing
things fully parallel, some of those multiplies will be 0, +/-j and
+/-.707 types, which don't require a full multiplier.

You may decide to design it differently using a higher radix, which
will affect your architecture, and quite possibly your input/ouput
ordering, but that's up to you.

For large FFT's, I've often found it to be more difficult to keep
track of where I am in the larger graph, and get the input/output
order correct for the pipeline stages (or, in your case, the 64 point
arrays).

Kevin McGee