From: kevin on

On Jul 3, 11:09 pm, "onkars" <onkar.sarode(a)n_o_s_p_a_m.gmail.com>
wrote:

>As for the inputs to the FPGA -- thats an open aspect too. It can be
>designed to best suite the FFT implementation. But we will be using the
>HighSpeed serial IOs. So it pretty much depends on the how the FFT is
>implemented -- we could even use external fast demuxes if necessary.

>Also, the input is complex.

>As mentioned earlier I should have 64 samples output per clock cycle. As a
>reminder I am doing 4096 pt. fft.

Very useful information.

I'm a little concerned about the rest of your post because I'm
presuming that your input data is sequential. That is, 8 bits each of
points 0, 1, 2, 3 ... 63, followed by points 64, 65, ... 127, then
128 ... 191, then 192 ... 255, all the way up to 4095. So you have 64
eight bit sequential inputs per clock tick at 375 Mhz (512 bits
total).

Here's the problem. With a 4096 point FFT, you've got 2 stages of 64
point butterflies with 1 column of inter-stage twiddles between them.
No matter what the input/output order, geometry type or decimation,
the top most butterfly in the first stage uses the following inputs:
0, 64, 128, 192 ... 4032 (every 64th input - the last one listed is
4096-64 = 4032). The next butterfly down from the top in the first
stage uses the same sequence above PLUS 1 (1, 65, 129, 193 ... 4033).
The next butterfly down uses a PLUS 2 (2, 66, 130, 194, ... 4034).
Once again, samples separated by 64.

If you need any clarification, I can send a 64 point 8x8 bitmap
(or .pdf or whatever) via 'reply to author'. It's not at all
difficult to explain what a 64x64 graph looks like after viewing the
8x8 (and I can send the explanation, too).

Now you can break down the a 4096 many different ways, but you'll
still have the same problem of how to feed the first stage of the
pipeline with the appropriate grouping of inputs. I was concerned
because you post kind of implied that you've already got your 4096
data coming in as if it were in bit-reversed order. Bit-reversing
your 4096 inputs when they're coming in 64 at a time is not trivial
and requires some careful thought. A double buffer would work, but
it's extra hardware.

Not to worry. I'm sure it can be done, but the devil's in the
details.

I've also got a couple of other suggestions, but I'll post them later,
and think about your buffering problem in the meantime.

You've got an interesting problem.

Kevin McGee

From: onkars on
>
>On Jul 3, 11:09 pm, "onkars" <onkar.sarode(a)n_o_s_p_a_m.gmail.com>
>wrote:
>
>>As for the inputs to the FPGA -- thats an open aspect too. It can be
>>designed to best suite the FFT implementation. But we will be using the
>>HighSpeed serial IOs. So it pretty much depends on the how the FFT is
>>implemented -- we could even use external fast demuxes if necessary.
>
>>Also, the input is complex.
>
>>As mentioned earlier I should have 64 samples output per clock cycle. As
a
>>reminder I am doing 4096 pt. fft.
>
>Very useful information.
>
>I'm a little concerned about the rest of your post because I'm
>presuming that your input data is sequential. That is, 8 bits each of
>points 0, 1, 2, 3 ... 63, followed by points 64, 65, ... 127, then
>128 ... 191, then 192 ... 255, all the way up to 4095. So you have 64
>eight bit sequential inputs per clock tick at 375 Mhz (512 bits
>total).
>
>Here's the problem. With a 4096 point FFT, you've got 2 stages of 64
>point butterflies with 1 column of inter-stage twiddles between them.
>No matter what the input/output order, geometry type or decimation,
>the top most butterfly in the first stage uses the following inputs:
>0, 64, 128, 192 ... 4032 (every 64th input - the last one listed is
>4096-64 = 4032). The next butterfly down from the top in the first
>stage uses the same sequence above PLUS 1 (1, 65, 129, 193 ... 4033).
>The next butterfly down uses a PLUS 2 (2, 66, 130, 194, ... 4034).
>Once again, samples separated by 64.
>
>If you need any clarification, I can send a 64 point 8x8 bitmap
>(or .pdf or whatever) via 'reply to author'. It's not at all
>difficult to explain what a 64x64 graph looks like after viewing the
>8x8 (and I can send the explanation, too).
>
>Now you can break down the a 4096 many different ways, but you'll
>still have the same problem of how to feed the first stage of the
>pipeline with the appropriate grouping of inputs. I was concerned
>because you post kind of implied that you've already got your 4096
>data coming in as if it were in bit-reversed order. Bit-reversing
>your 4096 inputs when they're coming in 64 at a time is not trivial
>and requires some careful thought. A double buffer would work, but
>it's extra hardware.
>
>Not to worry. I'm sure it can be done, but the devil's in the
>details.
>
>I've also got a couple of other suggestions, but I'll post them later,
>and think about your buffering problem in the meantime.
>
>You've got an interesting problem.
>
>Kevin McGee
>
>


Actually for the ordering of the input -- I have always thought of using
double buffers. Extra hardware -- which here is memory -- is abundant on
the FPGA (38000 Kb). It will increase latency -- which is not an issue.

Thanks a lot.

Onkar
From: kevin on

I looked at some of the FFT cores being offered by various vendors.
Your requirements are quite steep (and I've always thought it's more
fun to do your own). But you might want to Google "virtex and FFT and
cores" and check them out. You can get an idea of how others are
handling the I/O. And most seem to be doing radix 2, with a few radix
4 or mixed radix (and one NxM - arbitrary numbers).

I have more to add after my previous post, but it's late here, so I'll
post again later today.

Kevin McGee
From: kevin on

OK. I'm convinced that there are at least several ways of
manipulating your inputs such that you can sequence them properly and
process them in the first stage 64 point pipelines. The only question
is which way is most efficient.

As for the structure of the first stage 64, you were wondering if the
higher radix versions would result in longer wiring routes. Well,
you're going to have very long wiring routes even for a radix 2
pipeline. Consider that your input data (as you described) consists
of 64 eight bit words arriving simultaneously (512 bits total). Each
stage of a 64 point radix 2 pipeline contains 32 butterflies.
Presumably, they're done in parallel to meet the required throughput
speed. And since you're doing things with complex inputs, you
actually have 512 + 512 = 1024 bits being processed per clock tick
(or, 64 complex 8 bit words per tick). That's a lot of wires. And
because of the nature of the FFT, some of those wires are going to be
crossing other wires to get to the proper butterfly (it depends on the
exact algorithm, but every single algorithm is going to require it).

As an aside, you might consider taking a look at the correspondence I
mentioned some time before:

K. J. McGee, “Comments on ‘A 64-Point Fourier Transform Chip for Video
Motion Compensation Using Phase Correlation,” IEEE J. Solid-State
Circuits, vol. 33, no. 6, June 1998, pp. 928-932.

Note in the abstract where I state: lower radix algorithms can be
modified to be computationally equivalent to higher radix algorithms.
Then I describe why in section III of the above.

It's a true statement, and I figured it out in 1981 after manipulating
a lot of FFT graphs. For instance, any 64 point radix 2 graph can be
modified such that it contains the EXACT SAME number of adds,
subtracts and multiplies as in a 64 point radix 8 graph. So lower
radix efficiency can be made = to higher radix efficiency.

So, for me at least, the choice of radix has more to do with what
hardware resources are available, and which arrangement is most
efficient.

So you could proceed by picking a particular 64 point graph. Start
with radix 2, if you want, and see just how much in the way of
resources you'll need to implement it in full parallel form.

There's yet other important thing to note. Since you seem to be
fairly flexible as to your inputs, you might seriously consider doing
things in a serial fashion. Suppose your data were coming in as 512
bits at a time. Then, after 8 ticks of your 375 Mhz clock, you'd have
512 eight bit words of data, and you'd process them serially.

Whether your data is coming in parallel or serial, you should be able
to get the same high throughput that you require. Serial processing
may help you decrease the amount of long wire routes. It's something
that you should consider.

Kevin McGee
From: onkars on
Thanks again for your time.

>There's yet other important thing to note. Since you seem to be
>fairly flexible as to your inputs, you might seriously consider doing
>things in a serial fashion. Suppose your data were coming in as 512
>bits at a time. Then, after 8 ticks of your 375 Mhz clock, you'd have
>512 eight bit words of data, and you'd process them serially.
>
>Whether your data is coming in parallel or serial, you should be able
>to get the same high throughput that you require. Serial processing
>may help you decrease the amount of long wire routes. It's something
>that you should consider.
>


I am not exactly sure what you mean by serial. Does it mean having multiple
smaller pipeline FFT engines -- each doing a 4096 point FFT but fed
serially -- instead of 1 big 64 path pipelined fft. If this is what you
meant -- I actually mentioned this in my previous posts -- actually these
were part of the options I am contemplating. See below (extract from
previous post):
"USE **4** FFTs made of 16-path radix-16 using 3 stages (full row and
column
parallel implementation of radix-16)
OR
USE **8** FFTs made of 8-path radix-8 (full row and column parallel
implementation
of radix-8) using 4 stages"

Whenever I speak of smaller radix(r), I mean each FFT pipeline stage of the
4096 radix-r fft will have this 1 radix implemented as full parallel MDC
structure (which will internally be built of radix-2 type structures e.g.
radix-4 will internally have 4 radix-2s making a radix-4 have 8 complex
additions and finally 3 complex multiplications). So, this implies that
whenever I say I will use small radix-r it means I will have to use more
number of radix-r FFT units.

Hence I say:
"Using smaller radices has following disadvantages ---
The number of stages and required number of FFTs (to achieve 64
samples/clock) increases. Reasonably assuming that the number arithmetic
units for any approach is same --- this means that the control overhead
(scheduling logic + commutators using logic not mem. etc.) is increased. I
am not sure how much percentage this is.

Advantage of smaller radices is that the routes will be shorter and
higher frequencies can be achievable. Will the routes when using a single
64-path radix-64 solution be longer than when using 4 different radix-16
FFTs each designed as having 16-paths? -- I think they will.
??