From: Wayne on
HyperTransport offers 41GB/s of bandwidth. It may be the best way to move data between a PC and an FPGA.

Wayne
From: Jeremy Ralph on
Yes, 41GB/s would be a nice rate for moving data around. Am I correct in
assuming this would require a motherboard with two or more AMD Socket 939
sockets? Any idea how much effort would be involved in programming the
host to move data between the two? I expect there are some open libraries
for this sort of thing. Also, how much work would it be to have the FPGA
handshake the HyperTransport protocol? Hopefully the FPGA board vendor
would have this covered.
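
For the host side, the common pattern (whatever the physical link) is a
vendor driver that exposes the board's memory window so the application can
mmap it and then simply read and write it. A rough sketch in C, with an
entirely hypothetical device node and window size:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* "/dev/fpga0" is a placeholder for whatever node the vendor
       driver provides for the board's mapped memory window. */
    int fd = open("/dev/fpga0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    size_t win = 1 << 20;                 /* assumed 1 MiB window */
    void *fpga = mmap(NULL, win, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (fpga == MAP_FAILED) { perror("mmap"); return 1; }

    /* From here on it is ordinary memory traffic: the CPU (or a DMA
       engine on the board) just reads and writes the window. */
    memset(fpga, 0, win);
    ((volatile unsigned int *)fpga)[0] = 0xdeadbeef;  /* poke a register */

    munmap(fpga, win);
    close(fd);
    return 0;
}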

Found this product, which looks interesting.
Anyone know of other HT products of interest?
http://www.drccomputer.com/pages/modules.html

Seems the HT route could get expensive (more costly FPGA board + new
motherboard & processor).

Thanks all for the great discussion!

---
PDTi [ http://www.productive-eda.com ]
SpectaReg -- Spec-down code and doc generation for register maps

From: Piotr Wyderski on
JJ wrote:

> I have fantastic disbelief about that 6 ops /clock except in very
> specific circumstances perhaps in a video codec using MMX/SSE etc where
> those units really do the equiv of many tiny integer codes per cycle on
> 4 or more parallel 8 bit DSP values.

John, of course it is about peak performance, reachable with great effort.
But an accelerator is justified only when even that peak performance is not
enough. Otherwise you could simply write better code at no additional
hardware cost. I know that in most cases the CPU sleeps for lack of load or
stalls on a cache miss, but that is a completely different song...
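
To make the SIMD point concrete, a minimal illustration in C (mine, not
John's): a single SSE2 instruction performs 16 byte-wide adds at once,
which is where such peak ops/clock figures come from.

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Add two arrays of 8-bit samples 16 at a time; each _mm_add_epi8 is
   one instruction doing 16 "tiny integer ops". */
void add_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i;
    for (i = 0; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi8(va, vb));
    }
    for (; i < n; i++)                    /* scalar tail */
        dst[i] = (uint8_t)(a[i] + b[i]);
}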

> Now thats looking pretty much like what FPGA DSP can do pretty trivially
> except for the clock ratio 2GHz v 150MHz.

Yes, in my case a Cyclone @ 65MHz (130MHz internally + SDR interface,
260MHz at the critical path with time-sharing) is enough. But it is a
specialized waveform-generation device, not a general-purpose computer. As
a processor it could reach 180MHz and then stabilize -- not an impressive
value today, not to mention that it contains no cache, as BRAMs are too
precious a resource to be wasted that way.

> A while back, Toms Hardware did a comparison of 3GHz P4s v the P100 1st
> pentium and all the in betweens and the plot was basically linear

Interesting. In fact I don't care about P4, as its architecture is one
big mistake, but linear speedup would be a shame for a Pentium 3...

> benchmark performance, it also used perhaps 100x the transistor count

Northwood has 55 million, the old Pentium had 4.5 million.

> as well and that is all due to the Memory Wall and the necessity to
> avoid at all costs accessing DRAM.

Yes, that is true. The 144 MiB of cache on a POWER5 does help.
A 1.5GHz POWER5 is as fast as a 3.2GHz Pentium 4 (measured
on a large memory-hungry application). But you can buy many P4s
at the price of a single POWER5 MCM.

> Try running a random number generator say R250 which can generate a new
> rand number every 3ns on an XP2400 (9 ops IIRC). Now use that number to
> address a table >> 4MB. All of a sudden my 12Gops Athlon is running at
> 3MHz ie every memory access takes 300ns

Man, what 4MiB... ;-) Our application's working set is 200--600MiB. That's
the PITA! :-/

> So on an FPGA cpu, without OoO, no Branch prediction, and with tiny
> caches, I would expect to see only about .6 to .8 ops/cycle and
> without caches

In a soft DSP processor it would be much less, as there is a lot of vector
processing, which bypasses (or at least should bypass) the funny caches
built out of BRAMs.

> I have no experience with the Opterons yet, I have heard they might be
> 10x faster than my old 1GHz TB but I remain skeptical based on past
> experience.

I like the Cell approach -- no cache => no cache misses => tremendous
performance. But there are only 256KiB of local memory, so it is restricted
to specialized tasks.

Best regards
Piotr Wyderski

From: Piotr Wyderski on
Andreas Ehliar wrote:

> One interesting application for most of the people on this
> newsgroup would be synthesis, place & route and HDL simulation.
> My guess would be that these applications could be heavily
> accelerated by FPGA:s.

A car is not the best tool for making other cars.
It's not a bees & butterflies story. :-) Same with FPGAs.

> My second guess that it is far from trivial to actually do this :)

And who actually would need that?

Best regards
Piotr Wyderski

From: JJ on

Piotr Wyderski wrote:
> JJ wrote:
>
> > I have fantastic disbelief about that 6 ops /clock except in very
> > specific circumstances perhaps in a video codec using MMX/SSE etc where
> > those units really do the equiv of many tiny integer codes per cycle on
> > 4 or more parallel 8 bit DSP values.
>
> John, of course it is about peak performance, reachable with great effort.

Of course; I don't think we differ much in opinion on the matter. But I
prefer to stick to the average throughput available from C code.

In summary, I think any HW acceleration is justified when the hardware is
pretty much busy all the time (embedded), or can at least shrink very
significantly the time spent waiting for a job to complete. But I fear few
of these opportunities will actually get done, since the software experts
are far from having the know-how to do this in HW. For many apps where an
FPGA might barely be considered, one might also look at GPUs or the PhysX
chip, or maybe wait for ClearSpeed to get on board (esp. for flops), so the
FPGA will be the least visible option.

> But an accelerator is justified only when even that peak performance is
> not enough. Otherwise you could simply write better code at no additional
> hardware cost. I know that in most cases the CPU sleeps for lack of load
> or stalls on a cache miss, but that is a completely different song...
>
> > Now thats looking pretty much like what FPGA DSP can do pretty trivially
> > except for the clock ratio 2GHz v 150MHz.
>
> Yes, in my case a Cyclone @ 65MHz (130MHz internally + SDR interface,
> 260MHz at the critical path with time-sharing) is enough. But it is a
> specialized waveform-generation device, not a general-purpose computer.
> As a processor it could reach 180MHz and then stabilize -- not an
> impressive value today, not to mention that it contains no cache, as
> BRAMs are too precious a resource to be wasted that way.

The BRAMs are what define the opportunity: 500-odd BRAMs all whacking data
at, say, 300MHz and dual-ported is orders of magnitude more bandwidth than
any commodity CPU will ever see, so if they can be used independently,
FPGAs win hands down. I suspect a lot of poorly executed software-to-
hardware conversions combine too many BRAMs into a single large and
relatively very expensive SRAM, which gives all the points back to CPUs.
That is also the problem with soft-core CPUs: to be useful you want lots of
cache, but merging BRAMs into useful-sized caches throws all their
individual bandwidth away. That's why I propose using RLDRAM, as it allows
FPGA CPUs to use 1 BRAM each and share the RLDRAM bandwidth over many
threads, with full associativity of memory lines using a hashed MMU
structure, IPT sort of.
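
A rough back-of-envelope, assuming 4-byte-wide ports (the exact width
depends on how the BRAMs are configured):

  500 BRAMs x 2 ports x 4 bytes x 300 MHz ~= 1.2 TB/s aggregate on-chip,

versus the few GB/s of DRAM bandwidth a commodity CPU actually sees.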

>
> > A while back, Toms Hardware did a comparison of 3GHz P4s v the P100 1st
> > pentium and all the in betweens and the plot was basically linear
>
> Interesting. In fact I don't care about P4, as its architecture is one
> big mistake, but linear speedup would be a shame for a Pentium 3...
>

Tom's IIRC didn't have AMD in the lineup; it must have been 1-2 yrs ago.
The P4 end of the curve was still linear, but the tests are IMO bogus as
they push linear memory tests rather than the random test I use. I hate it
when people talk of bandwidth for blasting GBs of contiguous data around
and completely ignore pushing millions of tiny blocks around.

> > benchmark performance, it also used perhaps 100x the transistor count
>
> Northwood has 55 million, the old Pentium had 4.5 million.
>

100x is overstating it a bit, I admit, but the turn to multiple cores puts
CPUs back on the same path as FPGAs: Moore's law for quantity rather than
raw clock speed, which keeps the arguments for and against relatively
constant.

> > as well and that is all due to the Memory Wall and the necessity to
> > avoid at all costs accessing DRAM.
>
> Yes, that is true. The 144 MiB of cache on a POWER5 does help.
> A 1.5GHz POWER5 is as fast as a 3.2GHz Pentium 4 (measured
> on a large memory-hungry application). But you can buy many P4s
> at the price of a single POWER5 MCM.
>
> > Try running a random number generator say R250 which can generate a new
> > rand number every 3ns on an XP2400 (9 ops IIRC). Now use that number to
> > address a table >> 4MB. All of a sudden my 12Gops Athlon is running at
> > 3MHz ie every memory access takes 300ns
>
> Man, what 4MiB... ;-) Our application's working set is 200--600MiB. That's
> the PITA! :-/
>

Actually I ran that test from 32kB, doubling until I got to my RAM limit of
640MB (no swapping) on a 1GB system, and the slowdown is a sort of
staircase on a log scale. At 32kB there is obviously no real slowdown; the
step bumps mark the memory system gradually failing (L1, L2, TLB), and
after 16MB the drop to 300ns can't get any worse, since the L2 and TLBs
have long since failed, having so little associativity. But then again it
all depends on temporal locality: how much work gets done per cache-line
refill, and whether all the effort of the cache transfer is thrown away
every time (trees) or only some of the time (code).
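
For anyone who wants to reproduce the effect, a minimal sketch in C (with a
plain rand()-based index standing in for the R250 generator, and only
rough timing):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Touch one word at a random offset per iteration; as the table grows
   past L1, L2 and TLB reach, the time per access climbs toward raw
   DRAM latency. */
int main(void)
{
    size_t bytes;
    for (bytes = 32 * 1024; bytes <= 512u * 1024 * 1024; bytes *= 2) {
        size_t n = bytes / sizeof(long);
        long *table = malloc(bytes);
        long sum = 0, accesses = 10 * 1000 * 1000, i;
        clock_t t0;
        double secs;

        if (!table) break;
        for (i = 0; (size_t)i < n; i++) table[i] = i;

        t0 = clock();
        for (i = 0; i < accesses; i++)
            sum += table[((size_t)rand() * 65537u) % n];  /* random-ish index */
        secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

        printf("%9lu bytes: %6.1f ns/access (sum=%ld)\n",
               (unsigned long)bytes, secs * 1e9 / accesses, sum);
        free(table);
    }
    return 0;
}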

In the RLDRAM approach I use, the Virtex-II Pro would effectively see 3ns
raw memory issue rates for fully random accesses; the true latency of 20ns
is well hidden, and the issue rate is reduced, probably 2x, to allow for
rehashing and bank collisions. Still, a 6ns issue rate v. 300ns for fully
random access is something to crow about. Of course the technique would
work even better in a full-custom CPU. The OS never really gets involved
fixing up TLBs since there aren't any; the MMU does the rehash work. The
two big penalties are that tagging adds 20% to memory cost (1 tag every 32
bytes), and that with hashing the store should be left <80% full, but
memory is cheap, bandwidth isn't.
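
To make the hashed-MMU idea concrete, here is a minimal software sketch of
the lookup the hardware would do. The hash function, line size and table
size are illustrative assumptions of mine, not the actual design, and
allocation/eviction on a miss is omitted.

#include <stdint.h>

#define LINE_BYTES   32u          /* one tag per 32-byte line */
#define TABLE_LINES  (1u << 16)   /* capacity; keep it < 80% full */
#define TAG_EMPTY    UINT64_MAX

/* Each stored line carries its full virtual line number as the tag,
   so any virtual line can live in any slot: full associativity. */
typedef struct {
    uint64_t tag;
    uint8_t  data[LINE_BYTES];
} line_t;

static line_t store[TABLE_LINES];

/* Mark every slot empty before use. */
static void store_init(void)
{
    uint32_t i;
    for (i = 0; i < TABLE_LINES; i++)
        store[i].tag = TAG_EMPTY;
}

/* Illustrative mixing hash; the real design would pick something
   cheap to compute in logic. */
static uint32_t hash_line(uint64_t vline, uint32_t attempt)
{
    uint64_t h = (vline ^ (vline >> 29)) * 0x9E3779B97F4A7C15ull;
    return (uint32_t)((h >> 32) + attempt * 0x61C88647u) % TABLE_LINES;
}

/* Hash the line number, probe, and rehash on a tag mismatch (the
   "IPT sort of" part).  NULL means a miss: allocate or fault. */
line_t *lookup(uint64_t vaddr)
{
    uint64_t vline = vaddr / LINE_BYTES;
    uint32_t attempt;
    for (attempt = 0; attempt < 4; attempt++) {
        line_t *slot = &store[hash_line(vline, attempt)];
        if (slot->tag == vline)
            return slot;              /* hit */
        if (slot->tag == TAG_EMPTY)
            break;                    /* definitely not present */
    }
    return NULL;
}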


> > So on an FPGA cpu, without OoO, no Branch prediction, and with tiny
> > caches, I would expect to see only about .6 to .8 ops/cycle and
> > without caches
>
> In a soft DSP processor it would be much less, as there is a lot of vector
> processing, which bypasses (or at least should bypass) the funny caches
> built out of BRAMs.
>

DSP has highly predictable data structures and high locality (not much
tree walking), so SDRAM bandwidth can be used directly to better effect;
code should still be cached, though.

> > I have no experience with the Opterons yet, I have heard they might be
> > 10x faster than my old 1GHz TB but I remain skeptical based on past
> > experience.
>
> I like the Cell approach -- no cache => no cache misses => tremendous
> performance. But there are only 256KiB of local memory, so it is
> restricted to specialized tasks.
>

I suspect Cell will get used to accelerate as many apps as FPGAs, or more,
but it is so manually cached. I can't say I like it myself: so much
theoretical peak, but how do you get at it? I much prefer the Niagara
approach to CPU design, if only the memory were done the same way.

> Best regards
> Piotr Wyderski

regards

John Jakson
transputer guy
