From: John_H on
On Dec 29 2009, 4:37 pm, Rob Doyle <radioe...(a)gmail.com> wrote:
>
> I *need* one write port and three read ports - so I'm OK just
> duplicating the RAM.
>
> I could save a clock cycle in the ALU if I could do two writes
> and three reads.  If I have to stall the pipeline to implement
> this, I've gained nothing.
>
> The timing won't permit 2 Register clock cycles per ALU clock cycle
> to double-up the register accesses.
>
> The multi-port "flag" memory is the trick I was looking for.  The ALU
> has 1024 registers so I can envision some tall data selectors,
> multiplexers, and accompanying levels of logic to implement the
> address decoders.
>
> I think I'm going to stay with simple for now and put this in my
> back pocket as "Plan B".
>
> I greatly appreciate the help.
>
> Rob Doyle

If you wanted fewer registers (1024, really?) there's a nice technique
that can use LUT RAMs to provide (combinatorially) the read values you
want with two write ports. Two writes to the same address would
result in an undefined value but avoiding that condition results in
seamless operation. Two write ports with three reads would use 8 dual-
port LUT RAM arrays - (write_ports x (read_ports+1) ).

The reason LUT RAMs are needed is the operation is a read-modify-
write.

One could get around the read-modify-write need by delaying the write
one cycle, but then a method of selecting between the read value and
the delayed write value is still needed.

I can provide more detail if needed. Multiple write ports start to
eat resources but they're doable if the performance gain is worth the
resource loss.
From: Peter Alfke on

From the bowels of my computer I resurrected a file written more than
3 years ago:

Using Virtex-5 CLB as Multi-Port Memory

The four M-LUTs in a half-CLB can be combined to form a quad-port RAM,
ideally suited for register-file applications.
The four LUTs, called A, B, C, and D are configured in such a way that
the write address applied to D is automatically also multiplexed onto
the write addressing of LUTs A, B, and C.
Writing into D thus also writes into the same location in A, B, and C,
but these three LUTs have their address inputs still available as read
addresses. (In this application, LUT D is never read.)
The structure functions as a quad-port RAM with one write port
(address applied to D) and common data written into LUTs A, B, and C.
There are three independent read ports (addresses applied to LUTs A,
B, and C.) Writing is synchronous, reading is combinatorial.
Each LUT can either be a 64 x 1, or a 32 x 2 RAM.
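
A behavioral sketch of the one-write, three-read structure described
above may help. The module name, port names, and parameters are
illustrative assumptions, not taken from the original file, and whether
a given synthesis tool packs this into the four M-LUTs of a half-CLB
depends on the tool and its settings.

```verilog
// Behavioral sketch of a 1-write / 3-read LUT-RAM register file.
// Module, port, and parameter names are illustrative assumptions.
module regfile_1w3r #(
    parameter WIDTH = 32,   // data width in bits
    parameter ABITS = 5     // address bits: 2**ABITS entries
) (
    input  wire             clk,
    input  wire             we,            // write enable
    input  wire [ABITS-1:0] wa,            // write address (applied to LUT D in the text)
    input  wire [WIDTH-1:0] wd,            // common write data
    input  wire [ABITS-1:0] ra1, ra2, ra3, // three independent read addresses
    output wire [WIDTH-1:0] rd1, rd2, rd3
);
    reg [WIDTH-1:0] mem [0:(1<<ABITS)-1];

    // Writing is synchronous.
    always @(posedge clk)
        if (we) mem[wa] <= wd;

    // Reading is combinatorial, one port per read address.
    assign rd1 = mem[ra1];
    assign rd2 = mem[ra2];
    assign rd3 = mem[ra3];
endmodule
```

The asynchronous reads are what steer this toward LUT RAM rather than
block RAM, since block RAM reads are clocked.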

A similar structure, using common read addresses and individual Data
inputs, acts as simple dual-port memory, either 3 bits wide and 64
deep, or 6 bits wide and 32 deep.

In the Virtex-5 MicroBlaze application, the 32 x 32 register file with
one write port and two read ports, using 384 LUTs in Virtex-4, is
reduced to 44 LUTs, a saving of over 88%.

Peter Alfke, 3-21-06

From: whygee on
wow, a great new year's present :-)))

Peter Alfke wrote:
> From the bowels of my computer I resurrected a file written more than
> 3 years ago:
>
> Using Virtex-5 CLB as Multi-Port Memory
<snip>
> In the Virtex-5 MicroBlaze application, the 32 x 32 register file with
> one write port and two read ports, using 384 LUTs in Virtex-4, is
> reduced to 44 LUTs, a saving of over 88%.
>
> Peter Alfke, 3-21-06

Any more information, diagram, schematics, source code,
appnote, or whatever, would be really appreciated :-)

thanks and greetings,
yg
--
http://ygdes.com / http://yasep.org
From: John_H on
On Jan 1, 7:54 pm, Peter Alfke <al...(a)sbcglobal.net> wrote:
> From the bowels of my computer I resurrected a file written more than
> 3 years ago:
>
> Using Virtex-5 CLB as Multi-Port Memory
>
> The four M-LUTs in a half-CLB can be combined to form a quad-port RAM,
> ideally suited for register-file applications.
> The four LUTs, called A, B, C, and D are configured in such a way that
> the write address applied to D is automatically also multiplexed onto
> the write addressing of LUTs A, B, and C.
> Writing into D thus also writes into the same location in A, B, and C,
> but these three LUTs have their address inputs still available as read
> addresses. (In this application, LUT D is never read.)
> The structure functions as a quad-port RAM with one write port
> (address applied to D) and common data written into LUTs A, B, and C .
> There are three independent read ports (addresses applied to LUTs A,
> B, and C.) Writing is synchronous, reading is combinatorial.
> Each LUT can either be a 64 x 1, or a 32 x 2 RAM.
>
> A similar structure, using common read addresses and individual Data
> inputs, acts as simple dual-port memory, either 3 bits wide and 64
> deep, or 6 bits wide and 32 deep.
>
> In the Virtex-5 MicroBlaze application, the 32 x 32 register file with
> one write port and two read ports, using 384 LUTs in Virtex-4, is
> reduced to 44 LUTs, a saving of over 88%.
>
> Peter Alfke,  3-21-06

Greetings Peter, always a pleasure.

You describe a physical implementation which folds beautifully into
the Xilinx fabric showing how few routing resources are needed to
implement the bits of the multi-port (read) memories. But isn't this
precisely what one gets when inferring a single port write, multi-port
read memory through HDL?

For that, I wouldn't think example code would be needed since the
inference can have the same physical implementation you describe.
Without explicit placement constraints, both inferred and instantiated
methods are left to the Place & Route to fold everything for each bit
into single CLBs, aren't they? It's certainly easier to apply those
constraints if the designer defines the names for each instance in the
first place.

The bigger challenge raised in this thread is the multi-port with two
write ports which can be performed in CLBs very nicely but with a
little overhead.

If you have no interest in multi-port writes with CLB memories, you
can ignore the rest of this message.


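// n = data width - 1, m = depth - 1 (set to suit the design)
reg [n:0] m1 [m:0], m2 [m:0];   // one memory bank per write port
wire [n:0] rd1, rd2, rd3;
// each write stores its data XORed with the other bank's current value
always @(posedge clk) if( we1 ) m1[wa1] <= wdata1 ^ m2[wa1];
always @(posedge clk) if( we2 ) m2[wa2] <= wdata2 ^ m1[wa2];
// each read XORs both banks to recover the last value written
assign rd1 = m1[ra1] ^ m2[ra1];
assign rd2 = m1[ra2] ^ m2[ra2];
assign rd3 = m1[ra3] ^ m2[ra3];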

Since m1 is read at 4 unique addresses (wa2, ra1, ra2, and ra3), the
inferred memory will be replicated for 4 total copies.
Since 2 memories are needed for 2 writes, there are 2 sets of these 4
memory copies, 8 LUT RAM arrays in all.

Since writing a value to m1 doesn't affect m2, reading that address
later results in
rdx == m1[was_wa1] ^ m2[was_wa1]
    == (was_wdata1 ^ m2[was_wa1]) ^ m2[was_wa1]
    == was_wdata1

The XOR on the input and the output resurrects the original data
written to that port independent of which read memory accesses it.
The one caveat: two writes to the same address on the same clock
leave that location reading back as the existing data XORed with both
write values (unchanged only if the two values are equal). Priority
could be assigned to one write port by disabling the write on the
other port when a conflict is detected instead.
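
For anyone who wants to try this in simulation, here is a minimal
sketch that wraps the XOR scheme in a module and adds the priority
option, with port 1 winning a same-address conflict. The module name,
parameters, the we2_ok signal, and the testbench values are all
assumptions made for illustration. Both banks are cleared to zero in
an initial block so that reads are defined before the first write;
check how your target and tool handle LUT RAM initialization.

```verilog
`timescale 1ns/1ps
// Sketch: 2-write / 3-read memory built from two XOR-coupled banks.
module xor_2w3r #(
    parameter WIDTH = 8,
    parameter ABITS = 4
) (
    input  wire             clk,
    input  wire             we1, we2,
    input  wire [ABITS-1:0] wa1, wa2,
    input  wire [WIDTH-1:0] wdata1, wdata2,
    input  wire [ABITS-1:0] ra1, ra2, ra3,
    output wire [WIDTH-1:0] rd1, rd2, rd3
);
    localparam DEPTH = 1 << ABITS;
    reg [WIDTH-1:0] m1 [0:DEPTH-1];   // bank written by port 1
    reg [WIDTH-1:0] m2 [0:DEPTH-1];   // bank written by port 2

    // Assumption: both banks start at zero so m1 ^ m2 is defined.
    integer i;
    initial for (i = 0; i < DEPTH; i = i + 1) begin
        m1[i] = 0;
        m2[i] = 0;
    end

    // Port 1 has priority: suppress port 2 on a same-address conflict.
    wire we2_ok = we2 && !(we1 && (wa1 == wa2));

    // Each write stores data XORed with the other bank's current value.
    always @(posedge clk) if (we1)    m1[wa1] <= wdata1 ^ m2[wa1];
    always @(posedge clk) if (we2_ok) m2[wa2] <= wdata2 ^ m1[wa2];

    // Each read XORs both banks to recover the last value written.
    assign rd1 = m1[ra1] ^ m2[ra1];
    assign rd2 = m1[ra2] ^ m2[ra2];
    assign rd3 = m1[ra3] ^ m2[ra3];
endmodule

// Self-checking testbench: counts mismatches and reports the total.
module xor_2w3r_tb;
    reg        clk = 0;
    reg        we1 = 0, we2 = 0;
    reg  [3:0] wa1 = 0, wa2 = 0, ra1 = 0, ra2 = 0, ra3 = 0;
    reg  [7:0] wdata1 = 0, wdata2 = 0;
    wire [7:0] rd1, rd2, rd3;
    integer    errors = 0;

    xor_2w3r dut (
        .clk(clk), .we1(we1), .we2(we2), .wa1(wa1), .wa2(wa2),
        .wdata1(wdata1), .wdata2(wdata2),
        .ra1(ra1), .ra2(ra2), .ra3(ra3),
        .rd1(rd1), .rd2(rd2), .rd3(rd3)
    );

    always #5 clk = ~clk;

    initial begin
        // Two writes to different addresses on the same clock.
        @(negedge clk);
        we1 = 1; wa1 = 4'd3; wdata1 = 8'hA5;
        we2 = 1; wa2 = 4'd7; wdata2 = 8'h3C;
        @(negedge clk);
        we1 = 0; we2 = 0; ra1 = 4'd3; ra2 = 4'd7;
        #1 if (rd1 !== 8'hA5 || rd2 !== 8'h3C) errors = errors + 1;

        // Port 2 overwrites an address last written by port 1.
        @(negedge clk);
        we2 = 1; wa2 = 4'd3; wdata2 = 8'h0F;
        @(negedge clk);
        we2 = 0; ra3 = 4'd3;
        #1 if (rd3 !== 8'h0F) errors = errors + 1;

        // Same-address conflict: port 1's data should be kept.
        @(negedge clk);
        we1 = 1; wa1 = 4'd5; wdata1 = 8'h11;
        we2 = 1; wa2 = 4'd5; wdata2 = 8'h22;
        @(negedge clk);
        we1 = 0; we2 = 0; ra1 = 4'd5;
        #1 if (rd1 !== 8'h11) errors = errors + 1;

        $display("mismatches: %0d", errors);
        $finish;
    end
endmodule
```

The conflict comparator adds an address compare and a gate in the
write-enable path of port 2, so it costs a little timing margin in
exchange for defined behavior.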

This is where CLB SelectRAM design gets interesting and fun!