From: John_H on
On Apr 20, 1:00 pm, Eric <eric.lafor...(a)gmail.com> wrote:
> Hi All,
>
> I've recently published a paper exploring how to implement memories
> with multiple read and write ports on existing FPGAs. I figured it
> might be of interest to some.
>
> Summary, paper, slides, and example code are here: http://www.eecg.utoronto.ca/~laforest/multiport/index.html
>
> There are no patents or other additional IP encumbrances on the code.
> If you have any comments or other feedback, I'd like to hear it.
>
> Eric LaForest
> PhD student, ECE Dept.
> University of Toronto
> http://www.eecg.utoronto.ca/~laforest/

Could you mention here or on your page what you mean by
"multipumping?" If you mean time multiplexed access, I can see why
multipumping is bad. [The "pure logic" approach also isn't obvious.]

Do you update the LVT in the same way I might update the RAM value in
a many-write BlockRAM? To implement a many-write BlockRAM, each
"write bank" is maintained separately. To write a new value, only the
write bank for the associated write port is updated, and it is written
with the XOR of the new write data and the reads of all the other
write banks at the same address. By reading the same address from all
write banks, the XOR of all the reads is the most recent write value
(unless there are multiple writes to the same address in the same
clock cycle). Since the LVT only has to be wide enough to identify the
last write port (though as deep as the memory), I can see how this
would be very beneficial for wide data and many write ports.

Aside from wide data, however, I don't see (without going into the
attachments on that page) how updating the LVT is any different than
updating the memory in the first place.
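
To make the XOR scheme above concrete, here is a minimal behavioral
sketch in Python (not HDL; just one reading of the description, with
made-up names and sizes): each write port owns a bank, a write stores
the new data XORed with the other banks' current contents at that
address, and a read XORs all the banks together to recover the most
recently written value.

DEPTH = 16                           # made-up depth for illustration
banks = [[0] * DEPTH, [0] * DEPTH]   # one "write bank" per write port

def write(port, addr, data):
    # Store the new data XORed with the other banks' current contents
    # at the same address; only this port's bank is updated.
    others = 0
    for p, bank in enumerate(banks):
        if p != port:
            others ^= bank[addr]
    banks[port][addr] = data ^ others

def read(addr):
    # XOR the same address across all banks; the result is the most
    # recently written value (barring two writes to the same address
    # in the same cycle).
    value = 0
    for bank in banks:
        value ^= bank[addr]
    return value

write(0, 3, 0xAB)        # write port 0
write(1, 3, 0xCD)        # write port 1 later hits the same address
assert read(3) == 0xCD   # the XOR of the banks yields the latest value
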
From: Weng Tianxiang on
On Apr 22, 3:45 pm, John_H <newsgr...(a)johnhandwork.com> wrote:
> On Apr 22, 1:55 pm, Eric <eric.lafor...(a)gmail.com> wrote:
>
> > On Apr 22, 12:36 pm, rickman <gnu...(a)gmail.com> wrote:
>
> > > I guess I don't understand what you are accomplishing with this.
> > > Block rams in FPGAs are almost always multiported.  Maybe not N way
> > > ported, but you assume they are single ported when they are dual
> > > ported.
>
> > But what if you want more ports, say 2-write/4-read, without wait
> > states?
> > I assume them to be "simply dual-ported", which means one write port
> > and one read port, both operating concurrently. It is also possible to
> > run them in "true dual port" mode, where each port can either read or
> > write in a cycle. Some of the designs in the paper do that.
>
> > > Can you give a general overview of what you are doing without using
> > > jargon?  I took a look and didn't get it at first glance.
>
> > OK. Let me try:
>
> > Assume a big, apparently multiported memory of some given capacity and
> > number of ports. Inside it, I use a small multiported memory
> > implemented using only the fabric of an FPGA, which stores only the
> > number of the write port which wrote last to a given address. Thus
> > this small memory is of the same depth as the whole memory, but much
> > narrower, hence it scales better.
>
> > When you read at a given address from the big memory, internally you
> > use that address to look up which write port wrote there last, and use
> > that information to steer the read to the correct internal memory bank
> > which will hold the data you want. These banks are built-up of
> > multiple Block RAMs so as to have one write port each, and as many
> > read ports as the big memory appears to have.
>
> > The net result is a memory which appears to have multiple read and
> > write ports which can all work simultaneously, but which leaves the
> > bulk of the storage to Block RAMs instead of the FPGA fabric, which
> > makes for better speed and smaller area.
>
> > Does that help?
>
> > Eric
>
> I appreciate the elaboration here in the newsgroup.
>
> The "true dual port" nature of the BlockRAMs allows one independent
> address on each of the two ports with a separate write enable for each
> port.  The behavior of the BlockRAM can be modified to provide read
> data based on the new write data, old data, or no change in the read
> data value from last cycle (particularly helpful for multi-pumping).
>
> For an M write, N read memory, your approach appears to need M x (N+1)
> memories since you can have M writes all happening at the same time N
> accesses are made to the same "most recently written" memory.  Please
> correct me if I'm wrong.  This is the same number of memories required
> with the XOR approach but without the LVT overhead.  The time delay in
> reading the LVT and multiplexing the memories feels like it would be
> cumbersome.  While this might not add "wait states" it appears the
> system would not be able to run terribly quickly.  XORs are pretty
> quick.
>
> There are always more ways to approach a problem than any one group
> can come up with.  Kudos on your effort to bring a better approach to
> a tough system level issue for difficult designs.

John_H,
What is the XOR method in this regard? Can you give an explanation, or
point to a source?

Weng
From: Weng Tianxiang on
On Apr 23, 7:39 pm, Weng Tianxiang <wtx...(a)gmail.com> wrote:
> On Apr 22, 3:45 pm, John_H <newsgr...(a)johnhandwork.com> wrote:
> [snip]

Eric,
Here is my answer to your paper.

1. It is an excellent paper. If I were to give it a mark from 0 to
100, I would give it a 90.

2. It contains a genuinely inventive idea, and it solidly resolves a
real problem: with the FPGA chips available today, how do you build a
block RAM with the required number of read and write ports out of the
available block RAMs, which have a limited, fixed number of ports? In
this regard, your method is the best!!! I will use it if I need to.

3. The references are good, but they are missing the most important
items: the patents Altera and Xilinx have filed in this area. They
must have patented their techniques. No matter what, big or small,
both companies will patent anything they think is important to them.

4. Extending your inventive idea to new block RAMs with multiple
write/read ports built into the FPGA is not a good idea: there is
essentially no difficulty in building a block RAM with multiple
write/read ports. If you add 4 sets of read decoders and 4 sets of
write decoders to a current block RAM, it immediately has 4 write and
4 read ports, with no data racing in the design at all.
The problem is whether there is enough need in the market for FPGA
manufacturers to do so.

5. Your final conclusion about write-and-read scheduling is not right.
People using your method still face write-and-read scheduling.
For example, suppose there is a waiting pool that receives write and
read requests and can hold 200 write requests and 200 read requests.
Deciding which write and read requests to deliver to your block RAM
ports is still a write-and-read scheduling problem.
So your method does not eliminate the write-and-read scheduling
problem. You can only say that your solution does not create a new
scheduling problem, and that it comes with a new write/read rule: if a
write and a read to the same address are issued in the same clock
cycle, the read returns the data that is about to be overwritten by
the write issued in that cycle.


Here is an answer to John_H and Rick:
> > For an M write, N read memory, your approach appears to need M x (N+1)
> > memories since you can have M writes all happening at the same time N
> > accesses are made to the same "most recently written" memory.  Please
> > correct me if I'm wrong.

John_H is wrong. It needs M block RAMs plus another column used as
the LVT.
The LVT column needs M write ports and N read ports, and its data
width is ceil( log2 M ).
Each write port writes its data into its own block RAM and at the same
time writes its column number into the LVT column, so that the LVT
column stores the latest write column number for each written address.

When a read is performed, the read port first reads the LVT column to
get the number of the column that holds the latest write data for the
read address, then reads that column ( block RAM ) into its read
output register.
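
Expressed as a rough behavioral sketch in Python (not HDL; an
illustration of the flow just described, with made-up names, and
without modelling the per-read-port replication of the banks):

DEPTH = 16
M = 2                                    # number of write ports
banks = [[0] * DEPTH for _ in range(M)]  # one block RAM column per write port
lvt = [0] * DEPTH                        # LVT column; ceil(log2 M) bits wide in hardware

def write(port, addr, data):
    # Write the data into this port's column and record the column
    # number in the LVT at the same address.
    banks[port][addr] = data
    lvt[addr] = port

def read(addr):
    # Look up which column wrote this address last, then read it.
    return banks[lvt[addr]][addr]

write(0, 5, 0x11)
write(1, 5, 0x22)        # a later write through the other port
assert read(5) == 0x22   # the LVT steers the read to column 1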

It is really a clever idea !!!

Weng
From: John_H on
On Apr 25, 12:03 am, Weng Tianxiang <wtx...(a)gmail.com> wrote:
> Each write port writes its data into its own block RAM and at the same
> time writes its column number into the LVT column, so that the LVT
> column stores the latest write column number for each written address.
>
> When a read is performed, the read port first reads the LVT column to
> get the number of the column that holds the latest write data for the
> read address, then reads that column ( block RAM ) into its read
> output register.
>

So what happens when multiple reads all want to read from the same
memory, because that happens to be where the last value was written
for each of their addresses?

As to the XOR, I don't have code to share; I developed it a while ago
for some asynchronous stuff and it applies well to multi-port writes.

As I try to put together a clearer explanation, I find I may have been
wrong about the memory count for the XOR approach such that the LVT
would use fewer. I still believe the LVT approach requires M*N
BlockRAMs for an M-write, N-read multi-port memory plus the LVT; I'm
having trouble remembering why I thought the "+1" was needed. The XOR
approach appears to need M*(M-1+N) memories.

If you have 3 write ports and 4 read ports, you'll need 3 sets of *6*
memories. The 6 memories in each "write bank" set all share the same
write address and write data, corresponding to that write port. When a
write occurs, the *read* value for that write address is fetched from
the other two write banks, so each set of 6 must provide read ports at
the other two write banks' addresses. The four read ports then have
their own memories, for their own addresses, in each of the write
banks.

When the write occurs, the read values for that write address are
retrieved from the other write banks, and the XOR of those values with
the new write data is written to all 6 memories of that write port's
bank. When a read is performed, the read value within each write bank
is retrieved and the values (3 in this case) are XORed to get the
original data.

newDataWr0^oldDataWr1^oldDataWr2 overwrites oldDataWr0 for the write.

The later reads retrieve oldDataWr0, oldDataWr1, and oldDataWr2 but
since oldDataWr0 was updated to newDataWr0^oldDataWr1^oldDataWr2,
the XOR of the three read values is
oldDataWr0^oldDataWr1^oldDataWr2 ==
(newDataWr0^oldDataWr1^oldDataWr2)^oldDataWr1^oldDataWr2 ==
newDataWr0^(oldDataWr1^oldDataWr1)^(oldDataWr2^oldDataWr2) ==
newDataWr0^(0)^(0) ==
newDataWr0

So with a little coordination, the XOR approach requires M*(M-1+N)
BlockRAMs for an M write, N read port memory along with the XORs.
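
To make the counts concrete, here is a small Python sketch of the
bookkeeping (my own tally, assuming each replica is a simple-dual-port
BlockRAM with one write and one read port):

def xor_blockram_count(m_write, n_read):
    # Each of the M write banks must be readable at the other (M-1)
    # write addresses and at the N read addresses.
    return m_write * (m_write - 1 + n_read)

def lvt_blockram_count(m_write, n_read):
    # Each of the M write banks is replicated once per read port;
    # the LVT itself lives in the fabric and is not counted here.
    return m_write * n_read

print(xor_blockram_count(3, 4))   # 18 -- the 3 sets of 6 memories above
print(lvt_blockram_count(3, 4))   # 12, plus the LVT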

The LVT needs a memory for each write port but requires multiples of
them to accommodate every read port in case the multiple reads for any
one cycle are all from the same write bank for the most recently
updated value.

Depending on the complexity of the LVT, the number of write ports, and
the allowable latencies, the LVT could be a more effective approach.
From: Weng Tianxiang on
On Apr 25, 7:57 pm, John_H <newsgr...(a)johnhandwork.com> wrote:
> On Apr 25, 12:03 am, Weng Tianxiang <wtx...(a)gmail.com> wrote:
> [snip]

"The LVT needs a memory for each write port but requires multiples of
them to accommodate every read port in case the multiple reads for
any
one cycle are all from the same write bank for the most recently
updated value. "

I don't see any problem in reading any one of the 9 block RAMs into
one read register under the control of a select value, or into any
number of read registers, each with its own select value.

Your XOR method is absolutely inferior to the LVT method, even though
I do not have full knowledge of it.

The LVT method is very clean, very fast, easy to implement, and so
flexible that even 2 block RAMs, each with 2 independent write ports
and 2 read ports, could easily be expanded into 4 write ports and 4
read ports.

Weng