From: Patrick Maupin on
On Feb 12, 10:32 am, rickman <gnu...(a)gmail.com> wrote:

> In the case of using latches in place of registers, the speed gains
> are always usable.  But can't the same sort of gains be made by
> register leveling?  If you have logic that is slower than a clock
> cycle followed by logic that is faster than a clock cycle, why not
> just move some of the slow logic across the register to the faster
> logic section?

That's a similar technique, to be sure, for speed gains. But as I
wrote in an earlier post, I think the primary motivation for latch-
based design was originally cost. For example, since each flop is
really two latches, if you have logic which ANDs together the outputs
of two flops (four latches), you could replace that with ANDing the
outputs of two latches and passing the result through another latch
(three latches), for a net savings of 25% of the latches.
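
As a rough check on the latch count in the two-flop AND example, here
is a minimal Python sketch. The function names are made up for
illustration, and it assumes a plain master-slave flop built from
exactly two latches.

```python
LATCHES_PER_FLOP = 2  # assumption: a master-slave flop is two latches


def flop_based_latches(num_inputs: int) -> int:
    """Latches used when every gate input comes from a full flop."""
    return num_inputs * LATCHES_PER_FLOP


def latch_based_latches(num_inputs: int) -> int:
    """One latch per gate input plus one latch on the gate output."""
    return num_inputs + 1


def savings_fraction(num_inputs: int) -> float:
    """Fraction of latches saved by the latch-based arrangement."""
    before = flop_based_latches(num_inputs)
    after = latch_based_latches(num_inputs)
    return (before - after) / before


if __name__ == "__main__":
    # Two-input AND: 4 latches with flops, 3 with latches.
    print(flop_based_latches(2), latch_based_latches(2), savings_fraction(2))
```

For the two-input case the latch version needs three of the original
four latches; the saving grows as more inputs feed the same gate.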

From: Weng Tianxiang on
On Feb 12, 7:35 pm, Patrick Maupin <pmau...(a)gmail.com> wrote:
> On Feb 12, 10:32 am, rickman <gnu...(a)gmail.com> wrote:
>
> > In the case of using latches in place of registers, the speed gains
> > are always usable.  But can't the same sort of gains be made by
> > register leveling?  If you have logic that is slower than a clock
> > cycle followed by logic that is faster than a clock cycle, why not
> > just move some of the slow logic across the register to the faster
> > logic section?
>
> That's a similar technique, to be sure, for speed-gains.  But as I
> wrote in an earlier post, I think the primary motivation for latch-
> based design was originally cost.  For example, since each flop is
> really two latches, if you are going to have logic which ANDs together
> the output of two flops, you could replace that with ANDing the output
> of two latches, and outputting that result through another latch, for
> a net savings of 25% of the latches.

The goal of your method and the goal of CPU designers who insert
latches in a pipeline are totally different.

They insert latches where a combinational delay is too long to fit
within one clock cycle yet too short to warrant two clock cycles in a
pipeline, not just anywhere they might want to.

Weng
From: John_H on
On Feb 12, 11:32 am, rickman <gnu...(a)gmail.com> wrote:
<snip>
>
> In the case of using latches in place of registers, the speed gains
> are always usable.  But can't the same sort of gains be made by
> register leveling?  If you have logic that is slower than a clock
> cycle followed by logic that is faster than a clock cycle, why not
> just move some of the slow logic across the register to the faster
> logic section?
>
> Rick

I argued with my coworker for a few days about the benefit of latches
versus registers before I finally realized the advantage of latch
based designs. Not only is granularity less of a problem (e.g., only
able to fit 2 logic delays in a level rather than the maximum 2.8
available, losing nearly 30%) but synchronous delays are different.
Rather than accounting for Tco+Tsu for every register in a chain of a
few clock cycles where register leveling is helpful, only the Tito
transparent latch delay (minus the Tilo LUT delay) needs to be added
for each latch in the chain [using Xilinx timing nomenclature].
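
To make the two effects concrete, here is a small Python sketch that
counts how many whole logic levels fit in a chain of a few clock
cycles under each scheme. It is a simplified model, not a timing
analysis: all delay values are hypothetical (not taken from any data
sheet), routing and clock skew are ignored, the registers at the two
ends of the latch chain are ignored, and two latches per clock cycle
are assumed.

```python
def register_levels(period_ps, level_delay_ps, tco_ps, tsu_ps, cycles):
    """Whole logic levels across `cycles` register-to-register stages.

    Each stage loses Tco + Tsu, and any leftover time smaller than one
    logic level is wasted in every stage (the granularity loss).
    """
    usable_per_stage = period_ps - (tco_ps + tsu_ps)
    return cycles * (usable_per_stage // level_delay_ps)


def latch_levels(period_ps, level_delay_ps, tito_ps, tilo_ps, cycles,
                 latches_per_cycle=2):
    """Whole logic levels across the same span using transparent latches.

    Each latch adds only (Tito - Tilo), and leftover time carries over
    from one half cycle to the next instead of being discarded.
    """
    overhead = cycles * latches_per_cycle * (tito_ps - tilo_ps)
    usable_total = cycles * period_ps - overhead
    return usable_total // level_delay_ps


if __name__ == "__main__":
    # Hypothetical numbers chosen so 2.8 logic levels fit per register stage.
    print(register_levels(3400, 1000, tco_ps=400, tsu_ps=200, cycles=4))
    print(latch_levels(3400, 1000, tito_ps=700, tilo_ps=500, cycles=4))
```

With these made-up numbers the registered chain fits 8 levels in four
cycles and the latch chain fits 12; the real gain depends entirely on
the device's actual Tco, Tsu, Tito and Tilo.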

I agree that the register based FPGAs are probably designed (and
tested) to minimize Tsu and Tco without strong consideration for Tito
and that the timing analysis is NOT set up to do a good job with
"latch leveled" timing analysis.

When I do use latches (when transferring data between rising/falling
time domains for a fast clock, for instance) I have to specify false
values around the latch for synchronous analysis rather than the
precise values through the latch because the analysis wants to see
registers at each stage even with the proper analysis flag turned on.
If the analyzer would recognize a chain of rise/fall/rise/fall
controlled latches and automatically increase the timing constraint by
a half period for each stage, we'd potentially have a powerful tool at
our disposal. But they don't so we don't. At least not in FPGAs.

- John_H
From: glen herrmannsfeldt on
In comp.arch.fpga John_H <newsgroup(a)johnhandwork.com> wrote:
(snip)

> I argued with my coworker for a few days about the benefit of latches
> versus registers before I finally realized the advantage of latch
> based designs. Not only is granularity less of a problem (e.g., only
> able to fit 2 logic delays in a level rather than the maximum 2.8
> available, losing nearly 30%) but synchronous delays are different.
> Rather than accounting for Tco+Tsu for every register in a chain of a
> few clock cycles where register leveling is helpful, only the Tito
> transparent latch delay (minus the Tilo LUT delay) needs to be added
> for each latch in the chain [using Xilinx timing nomenclature].

I would have thought that they were fast enough now for that
not to matter so much. My thought would be that clock skew,
even with the fancy clock distribution system, would be the important
factor.

If the granularity is the problem then you might try clocking
some on rising and some on falling edge (if available) or having
two clocks with known phase difference. That would be especially
true if the DLL's could generate the appropriate clocks.

> I agree that the register based FPGAs are probably designed (and
> tested) to minimize Tsu and Tco without strong consideration for Tito
> and that the timing analysis is NOT set up to do a good job with
> "latch leveled" timing analysis.

> When I do use latches (when transferring data between rising/falling
> time domains for a fast clock, for instance) I have to specify false
> values around the latch for synchronous analysis rather than the
> precise values through the latch because the analysis wants to see
> registers at each stage even with the proper analysis flag turned on.
> If the analyzer would recognize a chain of rise/fall/rise/fall
> controlled latches and automatically increase the timing constraint by
> a half period for each stage, we'd potentially have a powerful tool at
> our disposal. But they don't so we don't. At least not in FPGAs.

That sounds useful. If it gets popular enough, maybe they
will add it.

-- glen
From: John_H on
On Feb 13, 3:09 pm, glen herrmannsfeldt <g...(a)ugcs.caltech.edu> wrote:
<snip>
> > Rather than accounting for Tco+Tsu for every register in a chain of a
> > few clock cycles where register leveling is helpful, only the Tito
> > transparent latch delay (minus the Tilo LUT delay) needs to be added
> > for each latch in the chain [using Xilinx timing nomenclature].
>
> I would have thought that they were fast enough now for that
> not to matter so much.  My thought would be that clock skew,
> even with the fancy clock distribution system, would be the important
> factor.

Clock skew becomes entirely unimportant in the latch scheme as I know
it unless CLK and CLK180 are used instead of normal and inverted
versions of the same clock. The latches are explicitly alternated
posedge/negedge/posedge/negedge effectively decomposing a conceptual
register into its two latches and balancing the logic between them.
For clock skew to be an issue, two consecutive latches would have to
be transparent at the same time long enough for the logic path plus
delays to sneak through; that won't happen when using the normal and
inverted versions of the *same* clock net unless things are very, very
wrong in the latch design.

> If the granularity is the problem then you might try clocking
> some on rising and some on falling edge (if available) or having
> two clocks with known phase difference.  That would be especially
> true if the DLL's could generate the appropriate clocks.

Some... registers? Using the posedge and negedge in a registered
arrangement would simply exacerbate the granularity problem: dividing
the logic into two phases fits fewer whole delays into the same clock
period. The latches allow longer delays to move the valid data
further toward the end of the transparent window and shorter delays to
move it back, always with the safeguard that data for the next (half)
cycle isn't allowed to be valid any sooner than the front edge of the
transparent window.
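
A timing diagram is the usual way to see this, but a few lines of
Python can stand in for one. The sketch below is an idealized model
under stated assumptions: latch k is transparent from k to k+1 half
periods, latch delay and setup time are zero, and the function name
and the delay values are hypothetical.

```python
def propagate(stage_delays, half_period):
    """Track when data becomes valid at each latch in an alternating chain.

    stage_delays[k] is the logic delay between latch k and latch k+1.
    Returns the times at which data leaves each latch. Raises ValueError
    if data arrives after the receiving latch has closed.
    """
    depart = 0  # data leaves latch 0 as its transparent window opens
    departures = [depart]
    for k, delay in enumerate(stage_delays):
        opens = (k + 1) * half_period   # front edge of the next window
        closes = (k + 2) * half_period  # back edge of the next window
        arrive = depart + delay
        if arrive > closes:
            raise ValueError(f"stage {k}: data arrives at {arrive}, "
                             f"window closed at {closes}")
        # Early data waits for the window to open; late data passes
        # straight through and eats into the following stage's time.
        depart = max(arrive, opens)
        departures.append(depart)
    return departures


if __name__ == "__main__":
    delays = [14, 6, 13, 5]  # hypothetical, in arbitrary time units
    print(propagate(delays, half_period=10))
    # Registers on alternating edges would need every delay <= 10.
    print(all(d <= 10 for d in delays))
```

In this example the 14-unit and 13-unit stages borrow time from the
short stages that follow them, so the chain works even though two of
the stages would fail with edge-triggered registers at the same
half-period spacing.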

The description comes out a little muddy, which is why it took me a few
days to buy into the whole concept. It's sweet! It just takes some
timing diagrams and head scratching. And it's certainly not set up
for proper analysis, especially in the Xilinx tools where I
experimented with the phase domain changes.

- John_H