From: daniel.larkin on
Hi all,

In my Cyclone 4 based design I'm getting an embedded multiplier
inferred, as expected from the following VHDL:

C <= A * B;

(where A and B are registered 12 bit values, and the output C is
subsequently registered, with no other logic in the path)

However I'm seeing a timing violation on this path. Looking at the
timing reports, there is nearly a 2ns delay between the output of the
multiplier and the flop. Obviously I'd really like to pull in some of
this 2ns, which would sort out the negative slack problem.

I looked through the documentation for the embedded multipliers, and
as expected there are input and output registers as part of the
embedded multiplier block. But clearly with that 2ns delay the output
register isn't being used. So my question is: how do I write my code
to infer the use of the output registers in the embedded multipliers?
I tried a number of coding styles, including putting the
multiplication directly inside a clocked process, and none of them had
any impact on timing. I definitely don't want to instantiate the
embedded multiplier directly, though. Are there any VHDL attributes
that might help (anything other than MULTSTYLE DSP/LOGIC)?

Any suggestions or pointers to documents would be greatly appreciated!
From: firefox3107 on
On Jul 26, 9:49 pm, "daniel.lar...(a)gmail.com"
<daniel.lar...(a)gmail.com> wrote:
> [...]
> So my question is: how do I write my code
> to infer the use of the output registers in the embedded multipliers?
> [...]

I would try this

Mult: process (iClk, inResetAsync) is
begin
  if inResetAsync = '0' then
    C <= (others => '0');
  elsif rising_edge(iClk) then     -- rising clock edge
    C <= A * B;
  end if;
end process Mult;
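
For anyone who wants to drop the process above into a project and check the fitter report, here is a minimal self-contained sketch. The entity name RegMult, the port types, and the choice of unsigned operands are assumptions made for illustration; the original post only says that A and B are 12 bit registered values.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical wrapper entity: the name RegMult and the unsigned port
-- types are assumptions, not taken from the original design.
entity RegMult is
  port (
    iClk         : in  std_logic;
    inResetAsync : in  std_logic;              -- active-low asynchronous reset
    A            : in  unsigned(11 downto 0);  -- assumed registered upstream
    B            : in  unsigned(11 downto 0);  -- assumed registered upstream
    C            : out unsigned(23 downto 0)   -- 12 + 12 bit product
  );
end entity RegMult;

architecture rtl of RegMult is
begin
  -- Registered multiply with an asynchronous reset on the product register.
  Mult : process (iClk, inResetAsync) is
  begin
    if inResetAsync = '0' then
      C <= (others => '0');
    elsif rising_edge(iClk) then
      C <= A * B;  -- numeric_std "*" returns A'length + B'length bits
    end if;
  end process Mult;
end architecture rtl;
```

Whether the product register actually ends up inside the embedded multiplier block still has to be confirmed in the fitter and timing reports for the target device.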
From: daniel.larkin on
I thought I'd already tried that - but it looks like I forgot to reset
the output (i.e. C in this case), which subsequently gave a result
which didn't use the output register. Problem solved now - Thanks



From: Nial Stewart on
> I thought I'd already tried that - but it looks like I forgot to reset
> the output (i.e. C in this case), which subsequently gave a result
> which didn't use the output register. Problem solved now - Thanks


That's odd, I'd have expected the output to have been registered whether
it was asynchronously reset or not.

Is this a bug in the synthesis tool?


Nial.
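
One way to investigate this is to synthesize the same registered multiply without any reset and compare the two fitter reports. The sketch below is only a comparison case: the entity name RegMultNoReset and the unsigned port types are assumptions, and it makes no claim about what a particular tool version will do with it.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical comparison case: same multiply, product register has no reset.
-- The entity name and the unsigned port types are assumptions.
entity RegMultNoReset is
  port (
    iClk : in  std_logic;
    A    : in  unsigned(11 downto 0);
    B    : in  unsigned(11 downto 0);
    C    : out unsigned(23 downto 0)
  );
end entity RegMultNoReset;

architecture rtl of RegMultNoReset is
begin
  -- Purely synchronous process: C is registered but never reset.
  Mult : process (iClk) is
  begin
    if rising_edge(iClk) then
      C <= A * B;
    end if;
  end process Mult;
end architecture rtl;
```

If the reset and no-reset versions place the output register differently, the resource usage and the path delay in the timing report should show it.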


From: dgreig on
On Jul 26, 8:49 pm, "daniel.lar...(a)gmail.com"
<daniel.lar...(a)gmail.com> wrote:
> [...]
> So my question is: how do I write my code
> to infer the use of the output registers in the embedded multipliers?
> [...]

The following works; I have tens of thousands of instantiations in a
similar number of FPGAs actually in the field.
The multstyle attribute may be what you need. Synthesis might not use
a DSP block if there is no timing need and no power need.
-- ==========================================================================
-- COPYRIGHT (c) 2010 DAVID GREIG. This source file is the property of
-- David Greig. This work must not be copied without permission from
-- David Greig.
-- Any copy or derivative of this source file must include this copyright
-- statement.
-- --------------------------------------------------------------------------
-- File        : SyS_Mult.vhd
-- Author      : David Greig (email :
-- Revision    :
-- Description : signed input data multiplier with clken output reg
-- --------------------------------------------------------------------------
-- Notes       : 2 clock cycle delay
-- ==========================================================================
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;
-- ==========================================================================
entity SyS_Mult is -- 2 clock cycle delay
  generic(
    Gdawidth   : natural;
    Gdbwidth   : natural;
    Gmult_pref : string
  );
  port(
    arstn : in  std_logic;
    clk   : in  std_logic;
    clken : in  std_logic;
    da_i  : in  std_logic_vector(Gdawidth - 1 downto 0);
    db_i  : in  std_logic_vector(Gdbwidth - 1 downto 0);
    q_o   : out std_logic_vector(Gdawidth + Gdbwidth - 1 downto 0)
  );
end entity SyS_Mult;
-- ==========================================================================
-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
architecture rtl of SyS_Mult is
-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  attribute multstyle : string; -- Implementation style, "logic" "dsp"
  -- ------------------------------------------------------------------------
  signal da_r : signed(Gdawidth - 1 downto 0);
  signal db_r : signed(Gdbwidth - 1 downto 0);
  signal p_r  : signed(Gdawidth + Gdbwidth - 1 downto 0);
  attribute multstyle of p_r : signal is Gmult_pref;
  -- ------------------------------------------------------------------------
begin
  -- ------------------------------------------------------------------------
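  prcs_SyS_Mult : process(arstn, clken, clk)
  begin
    if (arstn = '0') then
      -- active-low asynchronous reset of input and output registers
      da_r <= (others => '0');
      db_r <= (others => '0');
      p_r  <= (others => '0');
    elsif (clken = '0') then
      null;  -- clock enable low: all registers hold their values
    elsif rising_edge(clk) then
      da_r <= signed(da_i);    -- input registers
      db_r <= signed(db_i);
      p_r  <= (da_r * db_r);   -- output register, 2 clock cycles after input
    end if;
  end process prcs_SyS_Mult;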
  -- ------------------------------------------------------------------------
  q_o <= std_logic_vector(p_r);
-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
end architecture rtl;
-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-- component SyS_Mult is -- 2 clock cycle delay
-- generic(
-- Gdawidth : natural;
-- Gdbwidth : natural;
-- Gmult_pref : string
-- );
-- port(
-- arstn : in std_logic;
-- clk : in std_logic;
-- clken : in std_logic;
-- da_i : in std_logic_vector(Gdawidth - 1 downto 0);
-- db_i : in std_logic_vector(Gdbwidth - 1 downto 0);
-- q_o : out std_logic_vector(Gdawidth + Gdbwidth -1 downto 0)
-- );
-- end component SyS_Mult;

-- i_ : SyS_Mult -- 2 clock cycle delay
-- generic map(
-- Gdawidth => ,
-- Gdbwidth => ,
-- Gmult_pref =>
-- )
-- port map(
-- arstn => ,
-- clk => ,
-- clken => ,
-- da_i => ,
-- db_i => ,
-- q_o =>
-- );
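
For the 12 bit case from the original question, the commented template above might be filled in roughly as follows. This is only a sketch: the wrapper name Mult12_Top and its port names are made up for illustration, SyS_Mult is assumed to be compiled into library work, and the clock enable is simply tied high. Note that SyS_Mult treats both operands as signed.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical wrapper: Mult12_Top and its port names are illustrative only.
-- Assumes SyS_Mult has already been compiled into library work.
entity Mult12_Top is
  port (
    arstn : in  std_logic;
    clk   : in  std_logic;
    a_i   : in  std_logic_vector(11 downto 0);
    b_i   : in  std_logic_vector(11 downto 0);
    c_o   : out std_logic_vector(23 downto 0)
  );
end entity Mult12_Top;

architecture rtl of Mult12_Top is
begin
  -- Direct entity instantiation, so no component declaration is needed.
  i_mult : entity work.SyS_Mult -- 2 clock cycle delay
    generic map (
      Gdawidth   => 12,
      Gdbwidth   => 12,
      Gmult_pref => "dsp"   -- request the embedded multiplier
    )
    port map (
      arstn => arstn,
      clk   => clk,
      clken => '1',         -- always enabled in this sketch
      da_i  => a_i,
      db_i  => b_i,
      q_o   => c_o
    );
end architecture rtl;
```

Setting Gmult_pref to "logic" instead should steer the same multiply into logic cells, which makes for an easy side-by-side comparison in the resource report.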