From: John_H on
On Apr 26, 12:04 am, Weng Tianxiang <wtx...(a)gmail.com> wrote:
> On Apr 25, 7:57 pm, John_H <newsgr...(a)johnhandwork.com> wrote:
>
>
>
> > On Apr 25, 12:03 am, Weng Tianxiang <wtx...(a)gmail.com> wrote:
>
> > > Each write port writes its data into its block RAM and at the same
> > > time writes its column number into the LVT, so that the LVT entry
> > > stores the column number of the latest write to that address.
>
> > > On a read, the read port first reads the LVT to get the column
> > > number that holds the latest write data for the read address, then
> > > reads that column ( block RAM ) into its read output register.
>
> > So what happens when multiple reads all want to read from the same
> > memory, because that happens to be where the latest values for all
> > of their read addresses were written?
>
> > As to the XOR, I don't have code to share; I developed it a while ago
> > for some asynchronous stuff and it applies well to multi-port writes.
>
> > As I try to put together a clearer explanation, I find I may have been
> > wrong about the memory count for the XOR approach such that the LVT
> > would use fewer.  I still believe the LVT approach requires M*N
> > BlockRAMs for an M-write, N-read multi-port memory plus the LVT; I'm
> > having trouble remembering why I thought the "+1" was needed.  The XOR
> > approach appears to need M*(M-1+N) memories.
>
> > If you have 3 write ports and 4 read ports, you'll need 3 sets of *6*
> > memories.  The 6 memories in the "write bank" set all have the same
> > write address and write data corresponding to that write port.  When a
> > write occurs, the *read* value for that write address is accessed from
> > the other two write banks so each set of 6 must provide a read address
> > for the other two write bank addresses.  The four reads have their own
> > memories for their own addresses in each of the write banks.
>
> > When the write occurs, the read values for that write address are
> > retrieved from the other write banks and the XOR of those values along
> > with the new write data are written to all the write ports for its
> > bank of 6 memories.  When a read is performed, the read value within
> > each write bank is retrieved and the values (3 in this case) are XORed
> > to get the original data.
>
> > newDataWr0^oldDataWr1^oldDataWr2 overwrites oldDataWr0 for the write.
>
> > The later reads retrieve oldDataWr0, oldDataWr1, and oldDataWr2, but
> > since oldDataWr0 was updated to newDataWr0^oldDataWr1^oldDataWr2,
> > the XOR of the three read values is
> >   oldDataWr0^oldDataWr1^oldDataWr2
> >   == (newDataWr0^oldDataWr1^oldDataWr2)^oldDataWr1^oldDataWr2
> >   == newDataWr0^(oldDataWr1^oldDataWr1)^(oldDataWr2^oldDataWr2)
> >   == newDataWr0^(0)^(0)
> >   == newDataWr0
>
> > So with a little coordination, the XOR approach requires M*(M-1+N)
> > BlockRAMs for an M write, N read port memory along with the XORs.
>
> > The LVT needs a memory for each write port but requires multiples of
> > them to accommodate every read port in case the multiple reads for any
> > one cycle are all from the same write bank for the most recently
> > updated value.
>
> > Depending on the complexity of the LVT, the number of write ports, and
> > the allowable latencies, the LVT could be a more effective approach.
>
> "The LVT needs a memory for each write port but requires multiples of
> them to accommodate every read port in case the multiple reads for
> any
> one cycle are all from the same write bank for the most recently
> updated value. "
>
> I don't see any problem with reading any of the 9 block RAMs into one
> read register, or into any number of read registers, using a select
> signal.
>
> Your XOR method is absolutely inferior to the LVT method, even though
> I don't have full knowledge of your XOR method.
>
> The LVT method is very clean, very fast, easy to implement, and so
> flexible that even 2 block RAMs, each with 2 independent write ports
> and 2 read ports, can easily be expanded into 4 write ports and 4
> read ports.
>
> Weng

There are engineering tradeoffs in everything. You surprise me by
saying my approach is inferior "absolutely." Bad form.

I reread your post after adding mine. It looks like you believe some
form of scheduling is still needed. That's what the LVT approach was
trying to overcome! Or at least I thought it was.

Have you ever tried to implement a 512-entry distributed CLB
SelectRAM? The latency and resources are pretty extreme. If one has
a design with no clocking concerns and lots of room for logic in the
FPGA, the LVT approach is stellar. If the clock rate is a serious
concern and there are plenty of memories left over, the XOR approach
excels. "Absolutely."

If the LVT approach is not designed to allow next-cycle reads, the
approach has little merit over simpler approaches like "multipumping"
or cache-based write queues. But the LVT *can* handle multiple reads
by having one memory for each read port associated with each write
memory.
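
Roughly, in behavioral pseudocode (Python here only for readability;
this is a sketch of the data flow, not anyone's actual implementation,
and the class and parameter names are made up):

    # Behavioral sketch of an LVT-style multi-port RAM: M write ports,
    # N read ports.  Each write port owns a bank; the bank is replicated
    # once per read port so all N reads can hit the same bank in one
    # cycle.  The LVT remembers, per address, which write port wrote last.
    class LvtRam:
        def __init__(self, depth, m_write, n_read):
            self.m, self.n = m_write, n_read
            # banks[w][r] models one simple-dual-port Block RAM: written
            # only by write port w, read only by read port r.
            self.banks = [[[0] * depth for _ in range(n_read)]
                          for _ in range(m_write)]
            self.lvt = [0] * depth  # narrow and shallow; LUT RAM, not Block RAM

        def cycle(self, writes, read_addrs):
            """writes: list of (write_port, addr, data); read_addrs: one per read port."""
            # Each read consults the LVT to find which write port's bank
            # holds the newest value for its address, then reads its own replica.
            results = [self.banks[self.lvt[addr]][r][addr]
                       for r, addr in enumerate(read_addrs)]
            # Each write updates all N replicas of its own bank plus the LVT entry.
            for w, addr, data in writes:
                for r in range(self.n):
                    self.banks[w][r][addr] = data
                self.lvt[addr] = w
            return results

The reads in this model see the pre-write contents; whether the
hardware returns old or new data on a same-cycle collision depends on
how the Block RAMs themselves are configured.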

It's all tradeoffs. There are absolutely no absolutes.
From: Eric on
On Apr 25, 10:57 pm, John_H <newsgr...(a)johnhandwork.com> wrote:
> As to the XOR, I don't have code to share; I developed it a while ago
> for some asynchronous stuff and it applies well to multi-port writes.

The idea is definitely floating around: multiple people have
independently suggested it to me after seeing the LVT approach. It's
an interesting one.

> As I try to put together a clearer explanation, I find I may have been
> wrong about the memory count for the XOR approach such that the LVT
> would use fewer.  I still believe the LVT approach requires M*N
> BlockRAMs for an M-write, N-read multi-port memory plus the LVT; I'm
> having trouble remembering why I thought the "+1" was needed.  The XOR
> approach appears to need M*(M-1+N) memories.
>
> If you have 3 write ports and 4 read ports, you'll need 3 sets of *6*
> memories. <snip>

Yep, you are right. The number of BlockRAMs required is M*N, plus the
LVT (which uses no Block RAMs).
Thanks for the explanation of the XOR design; you're the first to do
so. It makes a lot of sense, except that I don't see why you need the
extra M-1 memories on the read side.
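
(Plugging your 3-write, 4-read example into the two formulas: 3*4 = 12
Block RAMs plus the LVT, versus 3*(3-1+4) = 18 Block RAMs for the XOR
arrangement, if I'm counting right.)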

> The LVT needs a memory for each write port but requires multiples of
> them to accommodate every read port in case the multiple reads for any
> one cycle are all from the same write bank for the most recently
> updated value.

*Exactly*

> Depending on the complexity of the LVT, the number of write ports, and
> the allowable latencies, the LVT could be a more effective approach.

It tends to be, given the research so far, but if an extra cycle of
latency (to do the reads to get the decoding data) is acceptable for a
design, the XOR approach could be very useful. The LVT does add delay,
but it's still faster than the alternatives I explored for an
arbitrary number of ports (for small numbers, pure multipumping works
better).
From: Eric on
On Apr 25, 12:03 am, Weng Tianxiang <wtx...(a)gmail.com> wrote:
> 5. Your final conclusion about write-and-read scheduling is not right.
> When people use your method, they still face write-and-read
> scheduling.
> For example, suppose there is a waiting pool that receives write and
> read requests, and the pool can hold 200 write requests and 200 read
> requests.
<snip>

There is no read/write scheduling problem *within a cycle*. If you
have more pending reads/writes than there are ports, then of course
there will always be a scheduling problem, but that's a different
issue, more akin to the one solved by reorder buffers in a dynamically
scheduled CPU.

As for a simultaneous read and write to the same address, the
behaviour is a function of how the underlying Block RAMs are
configured: you can specify that the read port returns either the old
value or the one currently being written. This doesn't affect the rest
of the LVT-based design.
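
In pseudocode (a Python sketch only, to show the two choices; the mode
names here are just labels, not the actual Block RAM attribute
strings):

    def same_address_read(mem, addr, wr_data, mode="read_first"):
        """Model a read and a write hitting the same address in the same cycle."""
        if mode == "read_first":       # read port returns the old contents
            rd_data = mem[addr]
            mem[addr] = wr_data
        else:                          # "write_first": read port returns the new data
            mem[addr] = wr_data
            rd_data = mem[addr]
        return rd_data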

In the case of two simultaneous *writes* to the same address, the
default behaviour is like most any other multiported memory:
undefined. However, there is no *physical* conflict in the banks of
Block RAMs, only in the LVT. So you can go ahead and store both
writes and decide what to record in the LVT as part of your
conflict-resolution logic (e.g. lower port number has priority, etc.).
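
For instance, a fixed-priority rule could look like this (a sketch
only; any arbitration policy would do):

    def update_lvt(lvt, writes):
        """writes: list of (port, addr).  Every port still writes its own bank;
        only the LVT entry needs arbitration when two ports hit the same address."""
        for port, addr in sorted(writes, key=lambda w: w[0], reverse=True):
            lvt[addr] = port   # lowest-numbered port wins: it is written last
        return lvt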

From: John_H on
On Apr 26, 10:23 am, Eric <eric.lafor...(a)gmail.com> wrote:
>
> Yep, you are right. The number of BlockRAMs required is M*N, plus the
> LVT (which uses no Block RAMs).
> Thanks for the explanation of the XOR design. You're the first to do
> so. It makes a lot of sense, except I don't see why you need the extra
> M-1 memories on the read side?

Because the write of the new data is the DataIn XORed with the old
data at the new WrAddr, each write address needs a read from the other
write memory sets.

For M write ports, there are M write sets. Each write set has the N
read memories for the N-port read, plus M-1 memories read at the
_other_ write addresses to complete the XORed data values stored into
those other write sets.
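
In rough pseudocode (Python again, just as a sketch of the structure;
it ignores the pipelining the extra read-before-write would need in
real hardware):

    # XOR scheme with M write sets.  In hardware each set is replicated
    # (M-1 + N) times so the extra copies can serve the other write
    # addresses and the N read ports in the same cycle; since every copy
    # in a set holds identical contents, one list per set models it here.
    class XorRam:
        def __init__(self, depth, m_write):
            self.m = m_write
            self.sets = [[0] * depth for _ in range(m_write)]

        def write(self, port, addr, data):
            # Read the old contents at this address from the other M-1 sets...
            others = 0
            for w in range(self.m):
                if w != port:
                    others ^= self.sets[w][addr]
            # ...and store data ^ others into every memory of this write set.
            self.sets[port][addr] = data ^ others

        def read(self, addr):
            # XOR the reads from all M write sets to recover the last write.
            value = 0
            for w in range(self.m):
                value ^= self.sets[w][addr]
            return value

A read is then M lookups at the read address followed by an M-way XOR,
which is what the N read-side copies in each set are for.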
From: Eric on
On Apr 26, 6:52 pm, John_H <newsgr...(a)johnhandwork.com> wrote:
<snip>
> For M write ports, there are M write sets.  Each write set has the N
> read memories for the N-port read and also has M-1 reads
> from_the_other_write_addresses to complete the XORed data value to
> store to those write sets.

Ah. Of course! This way you don't have to wait for the read ports to
be available to get the data you need to do the XORed write. Thank
you!

That's funny. I was pondering the impact of the additional read cycle
this scheme implied, but if I understand correctly, plain replication
is the answer. :)

You mentioned earlier that you had implemented this at some point in
the past. Could you tell me more about where you heard of it (domain,
application, etc.), or did you come up with it yourself? I'm just
trying to suss out the common origin of the multiple XOR suggestions
I've received.