From: Kevin Neilson on
KJ wrote:
> "Kevin Neilson" <kevin_neilson(a)removethiscomcast.net> wrote in message
> news:fvfm29$on81(a)cnn.xsj.xilinx.com...
>> KJ wrote:
>>> "Kevin Neilson" <kevin_neilson(a)removethiscomcast.net> wrote in message
>>> news:fv7i38$69n6(a)cnn.xsj.xilinx.com...
>>>> My question: what is the cleanest way to describe an FSM requiring
>>>> pipelining?
>> ...
>>> The other thing to consider is whether the latency being introduced by
>>> this outsourced logic needs to be 'compensated for' in some fashion or is
>>> it OK to simply wait for the acknowledge. In some instances, it is fine
>>> for the FSM to simply wait in a particular state until the acknowledge
>>> comes back. In others you need to be feeding new data into the
>>> hunk-o-logic on every clock cycle even though you haven't got the results
>>> from the first back. In that situation you still have the req/ack pair
>>> but now the ack is simply saying that the request has been accepted for
>>> processing, the actual results will be coming out later. Now the
>>> hunk-o-logic needs an additional output to flag when the output is
>>> actually valid. This output data valid signal would typically tend to
>>> feed into a separate FSM or some other logic (i.e. 'usually' not the
>>> first FSM). The first FSM controls feeding stuff in, the second FSM or
>>> other processing logic is in charge of taking up the outputs and doing
>>> something with it.
>>>
>> ...
>>> Kevin Jennings
>> In this case I do indeed have to continue to keep the pipe full, so
>> inserting wait states is not an option. And the latency of the "hunk of
>> logic", aka concurrent process, is actually significant because I have to
>> get the result and feed it back into the FSM. This example shows why:
>>
>> STATE2: begin
>> if (condition)
>> begin
>> state <= STATE3;
>> y <= (a*b+c)*d;
>> end
>> end
>>
>> I have to get the result (a*b+c) and then feed it back into the FSM so I
>> can multiply by d. Why not just let the concurrent process handle that?
>> Because I want to limit my resource usage to a single DSP48, so I have to
>> schedule the multiplications inside the FSM. But I'll have to check out
>> the Wishbone thing you're talking about.
>
> Well, just the fact that you're time sharing the DSP48 means that you're not
> processing something new on every clock cycle which just screams out to me
> that you'd want to implement this with a request/acknowledge type of
> framework. ...
>
> Kevin Jennings
>
>
But I *do* have to process something on every cycle. Consider that I
have to process these two equations:

y0 <= (a0*b0+c0)*d0;
y1 <= (a1*b1+c1);

Now, if you look at the structure of the DSP48, you can see that I can't
even process these two sequentially. I can send off (a0*b0+c0)*d0 to
the black box thingy you speak of, but this can't be processed without
dead cycles: I have to get the result of (a0*b0+c0) before I multiply
it with d0, and if the DSP48 is fully pipelined, that means that the
multiplier is unused for three cycles. It's similar to a superscalar
process with dependencies. So I have to reschedule: I put (a0*b0+c0)
into the pipe, then put in (a1*b1+c1) (which has no dependency on what
is in the pipe), and then when the result of (a0*b0+c0) pops out I can
feed it back into the DSP48 and multiply it with d0 to get y0. In the
meantime y1 pops out. Without this intermixed scheduling I end up with
too many dead cycles and then I need to use too many DSP48s.

Maybe what I really want is a way to take the superscalar scheduling
techniques that have been used for a long time and apply them in a
similar way to HDL.
-Kevin
From: Kevin Neilson on
Mike Treseler wrote:
> Kevin Neilson wrote:
>
>> I have to get the result (a*b+c) and then feed it back into the FSM so I
>> can multiply by d. Why not just let the concurrent process handle that?
>> Because I want to limit my resource usage to a single DSP48, so I have
>> to schedule the multiplications inside the FSM.
>
> I still like the idea of a step counter.
>
> On tick one, I do x <= a * b *c;
> On tick two, I do y <= x * d;
> and so on ...
>
> -- Mike Treseler

That is essentially what I'm doing; I'm just trying to find a
syntactically better way to design this pipelined stuff without having a
bunch of interdependent concurrent FSMs (or a single FSM and a bunch of
logic outside). -Kevin
From: KJ on

"Kevin Neilson" <kevin_neilson(a)removethiscomcast.net> wrote in message
news:fvo47b$on52(a)cnn.xsj.xilinx.com...
> KJ wrote:
>> "Kevin Neilson" <kevin_neilson(a)removethiscomcast.net> wrote in message
>> news:fvfm29$on81(a)cnn.xsj.xilinx.com...
>>> KJ wrote:
>>>> "Kevin Neilson" <kevin_neilson(a)removethiscomcast.net> wrote in message
>>>> news:fv7i38$69n6(a)cnn.xsj.xilinx.com...
>>>>> My question: what is the cleanest way to describe an FSM requiring
>>
>> Well, just the fact that you're time sharing the DSP48 means that you're
>> not processing something new on every clock cycle which just screams out
>> to me that you'd want to implement this with a request/acknowledge type
>> of framework. ...
>>
>> Kevin Jennings
> But I *do* have to process something on every cycle.

You're not able to process a new set of 'a', 'b', 'c' and 'd' on every clock
cycle since the DSP48 is time shared (by your choice) and that was my point.
Time multiplexing the DSP48 to keep *it* busy on every clock cycle is not
the same thing.

> Consider that I have to process these two equations:
>
> y0 <= (a0*b0+c0)*d0;
> y1 <= (a1*b1+c1);
>
> Now, if you look at the structure of the DSP48, you can see that I can't
> even process these two sequentially. I can send off (a0*b0+c0)*d0 to the
> black box thingy you speak of, but this can't be processed without dead
> cycles: I have to get the result of (a0*b0+c0) before I multiply it with
> d0, and if the DSP48 is fully pipelined, that means that the multiplier is
> unused for three cycles.

That's only true if the addition can't be done combinatorially. If it can
then the calculation of 'y0' takes two clock cycles and the DSP48 is fully
utilized. The answer pops out after two clock cycles of latency, the DSP48
hums along doing something useful on every tick.

> It's similar to a superscalar process with dependencies. So I have to
> reschedule: I put (a0*b0+c0) into the pipe, then put in (a1*b1+c1) (which
> has no dependency on what is in the pipe), and then when the result of
> (a0*b0+c0) pops out I can feed it back into the DSP48 and multiply it with
> d0 to get y0. In the meantime y1 pops out. Without this intermixed
> scheduling I end up with too many dead cycles and then I need to use too
> many DSP48s.
>

And depending on just what the bottlenecks in the design are, one can do all
kinds of things. But no matter what, you still need to interface *to* that
thing, no matter what it does and no matter how wide of an input vector it
takes (i.e. a0, b0, c0, d0, a1, b1, c1...if that's what it takes). In other
words, a0, b0, c0, d0, a1, b1 and c1 all need to get in somehow; y0 and y1
both need to make it out and you need to flag when they are valid and that
flagging is functionally the same thing as handshaking.

Kevin Jennings