Advice on Xilinx Spelunking [FPGA]

From: Rob Gaddi on 25 May 2010 17:32

I've got a Spartan 6 design that I'm working with under ISE 11.5. A
code block that I would expect to take up about 200 LUTs is taking 800
instead. An extra 600 LUTs wouldn't be the end of the world, except I'm planning
to replicate this block 32 times, which puts me well over the top.
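
For scale, the arithmetic behind "well over the top" works out as follows. This is only a quick sketch using the per-block figures quoted above; the variable names are made up for illustration, and no device capacity is assumed.

```python
# Per-block LUT counts quoted in the post.
EXPECTED_LUTS_PER_BLOCK = 200
ACTUAL_LUTS_PER_BLOCK = 800
COPIES = 32

expected_total = COPIES * EXPECTED_LUTS_PER_BLOCK
actual_total = COPIES * ACTUAL_LUTS_PER_BLOCK
excess_total = actual_total - expected_total

print(f"expected: {expected_total} LUTs")
print(f"actual:   {actual_total} LUTs")
print(f"excess:   {excess_total} LUTs")
```

So the gap is 19,200 LUTs across the 32 copies, against 6,400 LUTs budgeted for all of them.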

So the question becomes where are all of the LUTs going? There's
nothing in the XST status report for the section that would imply
anywhere near this much utilization. I've tried looking over the RTL
schematic; it's difficult to read and from what I could make out, there
still wasn't anything to explain all those LUTs. Then I tried looking
through the technology schematic instead. The viewer took forever to
open the schematic, and when I finally got it open it took better than a
minute any time I wanted to refresh the screen. Needless to say, this
got me nowhere.

So, I'm out for advice. Any suggestions on figuring out just where all
of those LUTs are going?

Thanks,
Rob

--
Rob Gaddi, Highland Technology
Email address is currently out of order

From: glen herrmannsfeldt on 25 May 2010 17:54

Rob Gaddi <rgaddi(a)technologyhighland.com> wrote:
> I've got a Spartan 6 design that I'm working with under ISE 11.5. A
> code block that I would expect to take up about 200 LUTs is taking 800
> instead. 600 LUTs wouldn't be the end of the world, except I'm planning
> to replicate this block 32 times, which puts me well over the top.

How full is the FPGA that you are targeting? If it is not very full, I
believe the tools don't try as hard to pack the logic. The LUT count
shouldn't be that far off, though; it is the CLB count that changes,
since the tools don't fill each CLB.

Otherwise, without knowing about the design it is hard to say.

Can you say a little about the logic? How many counters, adders, RAMs?

Maybe it is using CLBs for RAM instead of BRAM?

-- glen

From: John_H on 25 May 2010 18:42

On May 25, 5:32 pm, Rob Gaddi <rga...(a)technologyhighland.com> wrote:
> I've got a Spartan 6 design that I'm working with under ISE 11.5. A
> code block that I would expect to take up about 200 LUTs is taking 800
> instead. 600 LUTs wouldn't be the end of the world, except I'm planning
> to replicate this block 32 times, which puts me well over the top.

A good technology view will make a world of difference. But it
seems Xilinx isn't giving you that. I used the Synplify synthesizer's
HDL Analyst to get a superb technology view that allowed me to
understand the occasional oddity the synthesizer would produce from my
code. I found that technology viewer to be a truly top-notch product
and sincerely helpful in keeping a design on track.

I've only glanced at the Xilinx technology viewer, seeing that it
looked like a last-gen VW beetle compared to a modern day Lexus in the
HDL Analyst. It may do the job but it won't be a comfortable job if
it gets too involved.

From: Symon on 25 May 2010 18:55

On 5/25/2010 10:32 PM, Rob Gaddi wrote:
>
> So the question becomes where are all of the LUTs going?
>
> Thanks,
> Rob
>
Does ISE 11.5 have FPGA Editor?

Syms.

From: Rob Gaddi on 25 May 2010 19:39

On 5/25/2010 2:54 PM, glen herrmannsfeldt wrote:
> Rob Gaddi<rgaddi(a)technologyhighland.com> wrote:
>> I've got a Spartan 6 design that I'm working with under ISE 11.5. A
>> code block that I would expect to take up about 200 LUTs is taking 800
>> instead. 600 LUTs wouldn't be the end of the world, except I'm planning
>> to replicate this block 32 times, which puts me well over the top.
>
> How full is the FPGA that you are targeting? If not so full, I
> believe that the tools don't try so hard. Well, actually the LUT
> count shouldn't be so far off, but the CLB count can change, as
> it doesn't fill each CLB.
>
> Otherwise, without knowing about the design it is hard to say.
>
> Can you say a little about the logic? How many counters, adders, RAMs.
>
> Maybe it is using CLB for RAM, instead of BRAM?
>
> -- glen

Sure. The widget in question does 8 pole IIR filtering of 16 bit data
using 48-bit internal data paths. The actual add/multiply/add math is
taken care of by a subblock that uses a DSP48 slice and 222 LUTs that
I'm not counting towards the 800.

The block I'm looking at is the wrapper that sequences the math
operations and holds the internal states. The logic infers two 48 bit
LUT RAMs, one dual port, and one quad port. There's a 24-bit LUT RAM
and a 24 bit adder that I use to implement an FIR prefilter (the 8 zeros
at z=-1 that you get from the bilinear transform of an 8 pole filter).
There's an FSM with four states, and a couple of 3 bit counters. There
are two 18 bit comparators, but most of the LSBs of them should optimize
out.
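
As a rough cross-check of that inventory, here is a back-of-the-envelope tally of the LUT RAM alone. It is a sketch under stated assumptions, not a prediction of what XST does: it assumes roughly one LUT per bit per port, which is how the 64-deep single, dual and quad port distributed RAM primitives are sized, and it ignores the denser 32-deep packing modes. Which 48-bit RAM is the dual port one and which is the quad port one is inferred from the comments in the code below, and treating the 24-bit RAM as single port is also an assumption.

```python
# RAM inventory as (name, width_bits, ports). Port counts follow the
# description above; the mapping to signal names is an assumption.
RAMS = [
    ("ram_k (coefficients)", 48, 2),    # assumed to be the dual port RAM
    ("ram_dat (filter state)", 48, 4),  # assumed to be the quad port RAM
    ("fir_cascade", 24, 1),             # assumed single port (one index)
]


def lut_ram_estimate(rams):
    """Assumed rule of thumb: one LUT per bit per port for depth <= 64."""
    return {name: width * ports for name, width, ports in rams}


estimate = lut_ram_estimate(RAMS)
total = sum(estimate.values())

for name, luts in estimate.items():
    print(f"{name:24s} {luts:4d} LUTs")
print(f"{'total':24s} {total:4d} LUTs")
```

Under those assumptions the RAMs alone come to 312 LUTs, before any of the adder, comparator, FSM or readback mux logic is counted. The tools may pack narrower or shallower RAMs more densely than this.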

I'll append the code here. I'm not bothering to include pkg_bus as
well, but it just defines a simple WISHBONE bus and a few constants.

--

library IEEE;
use IEEE.STD_LOGIC_1164.all;
use IEEE.NUMERIC_STD.all;
use IEEE.STD_LOGIC_MISC.all;

use work.pkg_bus.all;

-- Xilinx specific macro library
-- library UNISIM;
-- use UNISIM.VComponents.all;

entity filter is
    port (
        -- Data path
        din     : in  signed(15 downto 0);
        nd      : in  boolean;

        dout    : out signed(15 downto 0);
        drdy    : out boolean;

        -- Coefficient path
        WB_IN   : in  t_wb_mosi;
        WB_OUT  : out t_wb_miso;

        WB_SYS  : in  t_wb_sys
    );
end entity filter;

architecture Behavioral of filter is

    alias clk : std_logic is WB_SYS.CLK_I;
    alias rst : std_logic is WB_SYS.RST_I;

    -- Component declaration of the "filter_math" unit defined in
    -- file: "./src/vhdl/filter_math.vhd"
    component filter_math
        port(
            data    : in  SIGNED(47 downto 0);
            pre     : in  SIGNED(47 downto 0);
            post    : in  SIGNED(47 downto 0);
            k       : in  SIGNED(47 downto 0);
            lsd_nd  : in  BOOLEAN;
            ichg    : out BOOLEAN;
            irdy    : out BOOLEAN;
            y       : out SIGNED(47 downto 0);
            lsd_rdy : out BOOLEAN;
            msd_rdy : out BOOLEAN;
            clk     : in  STD_LOGIC);
    end component;
    for all: filter_math use entity work.filter_math(Xilinx_DSP48A1);

    -- We're going to use a whole mess o' RAMs to store various
    -- and sundry.
    subtype t_data is signed(47 downto 0);
    constant POLES   : integer := 8;
    constant MAX_IDX : integer := POLES-1;

    -- Data memory is S3.45.
    subtype t_idx is integer range 0 to MAX_IDX;
    type t_ram is array(t_idx) of t_data;
    signal ram_dat : t_ram := (others => (others => '0'));

    subtype t_uns_idx is unsigned(2 downto 0);
    signal write_idx : t_uns_idx;
    signal read_idx  : t_uns_idx;

    -- Coefficient memory is also S3.45, but since we're
    -- writing it from a 16 bit data bus, we need to be
    -- able to access it a word at a time.
    --
    type t_coefram is array(t_idx) of t_wb_data;
    signal ram_k_hi : t_coefram := (others => (others => '0'));
    signal ram_k_md : t_coefram := (others => (others => '0'));
    signal ram_k_lo : t_coefram := (others => (others => '0'));

    -- As seen from the memory bus, the coefficients are
    -- 64 bits long. The uppermost word of this is shared
    -- between all coefficients, and is the filter control
    -- word.

    signal fcw : t_wb_data;

    -- Bits 2:0 are POLES_USED, which should be an odd number
    -- equal to the number of poles for this filter - 1. Any
    -- even number here, including zero, will code for no filter.
    alias poles_used : std_logic_vector is fcw(2 downto 0);

    -- Hook the data up to the math core
    signal data : t_data;
    signal pre  : t_data;
    signal post : t_data;
    signal k    : t_data;
    signal y    : t_data;

    signal go : boolean;

    signal lsd_nd  : boolean;
    signal ichg    : boolean;
    signal irdy    : boolean;
    signal lsd_rdy : boolean;
    signal msd_rdy : boolean;

    -- Upstream of the math core we'll apply a cascade of 2-tap
    -- boxcar filters in order to place the zeros at z = -1. One bit
    -- of growth per stage brings us to S1.23 when we're done.

    subtype t_fir_data is signed(din'length + POLES - 1 downto 0);
    type t_firram is array(t_idx) of t_fir_data;
    signal fir_cascade : t_firram := (others => (others => '0'));
    signal fir_idx     : t_uns_idx;
    signal fir_din     : t_fir_data;

    -- Internal states of things
    signal fir_drdy     : boolean;
    signal use_fir_data : boolean;

    type t_state is (IDLE, FIR, IIR, RESET);
    signal state : t_state := RESET;

    -- LFSR noise generator. When we first extend the 16 bit data to 24
    -- bits for the FIR filter, adding this noise in below the LSB helps
    -- make sure the IIR filters don't get into long, drawn out settlings.
    signal lfsr : std_logic_vector(22 downto 1) := (others => '0');

begin
    -------------------------------------------------------------------------
    -- Make sure our constants are compiled correctly.
    -------------------------------------------------------------------------

    assert (2**write_idx'length = POLES)
        report "Length of RAM index does not correspond to number of poles."
        severity failure;

    -------------------------------------------------------------------------
    -- Connect up the asynchronous data paths.
    -------------------------------------------------------------------------

    -- FIR data is in S1.23, the math core is expecting S3.45
    data <= SHIFT_LEFT(RESIZE(fir_din, data'length), 45-23) when use_fir_data
            else y;

    lsd_nd <= fir_drdy when use_fir_data else
              lsd_rdy;

    -- Everything else comes out of the RAMs. ram_k has one r/w port and one
    -- read port, ram_dat has one write port and two read ports.
    --

    pre  <= ram_dat(TO_INTEGER(read_idx or "001"));
    post <= ram_dat(TO_INTEGER(read_idx));
    k    <= SIGNED(ram_k_hi(TO_INTEGER(read_idx))) &
            SIGNED(ram_k_md(TO_INTEGER(read_idx))) &
            SIGNED(ram_k_lo(TO_INTEGER(read_idx)));

    -- Instantiate our math core.
    MATH : filter_math
        port map(
            data    => data,
            pre     => pre,
            post    => post,
            k       => k,
            lsd_nd  => lsd_nd,
            ichg    => ichg,
            irdy    => irdy,
            y       => y,
            lsd_rdy => lsd_rdy,
            msd_rdy => msd_rdy,
            clk     => clk
        );

    -------------------------------------------------------------------------
    -- WISHBONE coefficient readback.
    -------------------------------------------------------------------------

    WB_READBACK: process(WB_IN, fcw, ram_k_hi, ram_k_md, ram_k_lo)
        variable read_addr : integer range 0 to MAX_IDX;
        variable word_addr : integer range 0 to 3;

    begin
        read_addr := TO_INTEGER(WB_IN.ADDR(1 + read_idx'length downto 2));
        word_addr := TO_INTEGER(WB_IN.ADDR(1 downto 0));

        WB_OUT <= WB_BADA_SLAVE;

        if read_addr <= MAX_IDX then
            case word_addr is
                when 0 => WB_OUT.DAT <= fcw;
                when 1 => WB_OUT.DAT <= ram_k_hi(read_addr);
                when 2 => WB_OUT.DAT <= ram_k_md(read_addr);
                when 3 => WB_OUT.DAT <= ram_k_lo(read_addr);
            end case;
        end if;

    end process WB_READBACK;

    -------------------------------------------------------------------------
    -- Wrangle the big state machine.
    -------------------------------------------------------------------------

    MACHINE: process
        variable write_addr : integer range 0 to 31;
        variable word_addr  : integer range 0 to 3;
        variable current    : t_data;

        variable unclamped  : signed(17 downto 0); -- S3.15 number

    begin
        wait until rising_edge(clk);
        drdy     <= false;
        fir_drdy <= false;

        if nd then
            assert (state = IDLE)
                report "New data request before IDLE state."
                severity error;
        end if;

        case state is
            when IDLE =>
                -- Hold things in the start state.
                use_fir_data <= true;
                read_idx     <= (others => '0');
                write_idx    <= (others => '0');

                if nd then
                    if (poles_used(0) = '0') then
                        -- Allow for no filter at all
                        dout  <= din;
                        drdy  <= true;
                        state <= IDLE;
                    else
                        -- Start our FIR filter with din at the MSBs.
                        state   <= FIR;
                        fir_idx <= UNSIGNED(poles_used);
                        fir_din <= SHIFT_LEFT(
                            RESIZE(din & lfsr(lfsr'high), fir_din'length),
                            fir_din'length - din'length - 1
                        );
                    end if;

                else
                    state <= IDLE;
                end if;

            when FIR =>
                -- Store the value, push the average forward.
                fir_cascade(TO_INTEGER(fir_idx)) <= fir_din;

                fir_din <= SHIFT_RIGHT(fir_din, 1) +
                           SHIFT_RIGHT(fir_cascade(TO_INTEGER(fir_idx)), 1);

                if (fir_idx = 0) then
                    -- Start the IIR filter. Repurpose the FIR index to count
                    -- down the number of poles to do.
                    state    <= IIR;
                    fir_drdy <= true;
                    fir_idx  <= UNSIGNED(poles_used);
                else
                    fir_idx <= fir_idx - 1;
                end if;

            when IIR =>
                -- The main responsibilities are updating
                -- the pointers and updating the stored data.

                if msd_rdy then

                    -- Update the stored data and advance the
                    -- write pointer. Also decrement the FIR index, which
                    -- we're just using to count IIR stages at this point.

                    ram_dat(TO_INTEGER(write_idx)) <= y;
                    write_idx <= write_idx + 1;
                    fir_idx   <= fir_idx - 1;

                    if (fir_idx = 0) then
                        state     <= IDLE;
                        write_idx <= (others => '0');

                        -- We've treated the data as S3.45 all the
                        -- way through. First, remap it to S3.15
                        unclamped := RESIZE(SHIFT_RIGHT(y, 45-15), 18);

                        -- Now clamp any excess.
                        if TO_INTEGER(unclamped) >= 2**15 then
                            dout <= x"7FFF";

                        elsif TO_INTEGER(unclamped) <= -(2**15) then
                            dout <= x"8001";

                        else
                            dout <= RESIZE(unclamped, 16);

                        end if;
                        drdy <= true;

                    end if;

                elsif ichg and not lsd_nd then
                    -- We can advance the read index ahead of
                    -- time.
                    read_idx <= write_idx + 1;

                    if (fir_idx = 0) then
                        use_fir_data <= true;

                    else
                        use_fir_data <= false;

                    end if;

                end if;

            when RESET =>
                -- Initialize the states for both filters
                ram_dat(TO_INTEGER(write_idx))   <= (others => '0');
                fir_cascade(TO_INTEGER(fir_idx)) <= (others => '0');

                if (fir_idx = 0) then
                    write_idx <= (others => '0');
                    state     <= IDLE;

                else
                    write_idx <= write_idx + 1;
                    fir_idx   <= fir_idx - 1;

                end if;

        end case;

        -- Allow bus writes to the coefficient RAM
        if is_write(WB_IN) then
            write_addr := TO_INTEGER(WB_IN.ADDR(6 downto 2));
            word_addr  := TO_INTEGER(WB_IN.ADDR(1 downto 0));

            if write_addr <= MAX_IDX then
                case word_addr is
                    when 0 => fcw <= WB_IN.DAT;
                    when 1 => ram_k_hi(write_addr) <= WB_IN.DAT;
                    when 2 => ram_k_md(write_addr) <= WB_IN.DAT;
                    when 3 => ram_k_lo(write_addr) <= WB_IN.DAT;
                end case;
            end if;
        end if;

        -- Advance the LFSR
        lfsr <= lfsr(21 downto 1) & (lfsr(22) xnor lfsr(21));

        -- Handle the reset.
        if (rst = '1') then
            write_idx    <= (others => '0');
            read_idx     <= (others => '0');
            fir_idx      <= (others => '1');
            fcw          <= (others => '0');
            use_fir_data <= true;
            state        <= RESET;
        end if;

    end process;

end architecture Behavioral;
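
For anyone who wants to sanity-check the fixed-point handling in the listing above, here is a small Python model of the two format conversions: the S1.23 to S3.45 promotion on the way into the math core, and the S3.45 to S3.15 reduction with the symmetric clamp on the way out. It is only a sketch and not part of the design. The function names are made up for illustration, and it assumes numeric_std semantics, where SHIFT_RIGHT on a signed value is an arithmetic shift.

```python
def fir_to_core(fir_s1_23: int) -> int:
    """S1.23 -> S3.45: sign-extend, then shift left by 45 - 23 = 22 bits."""
    return fir_s1_23 << (45 - 23)


def core_to_s3_15(y_s3_45: int) -> int:
    """S3.45 -> S3.15: arithmetic right shift by 45 - 15 = 30 bits.

    Python's >> on int is arithmetic, matching SHIFT_RIGHT on a signed value.
    """
    return y_s3_45 >> (45 - 15)


def clamp_to_dout(unclamped: int) -> int:
    """Clamp an S3.15 value into the symmetric 16-bit range used for dout."""
    if unclamped >= 2**15:
        return 0x7FFF        # +32767
    if unclamped <= -(2**15):
        return -0x7FFF       # x"8001" read as signed 16 bits is -32767
    return unclamped


if __name__ == "__main__":
    one = 1 << 45            # 1.0 in S3.45
    for y in (one // 2, one, -one, 3 * one):
        print(y, "->", clamp_to_dout(core_to_s3_15(y)))
```

Note that the clamp is symmetric: an input of exactly -1.0 comes out as -32767 rather than -32768, because the comparison is "less than or equal to" and the clamp value is x"8001".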

--
Rob Gaddi, Highland Technology
Email address is currently out of order
