From: Morten Reistad on
In article <84ocdrj4ps.fsf(a)harrekilde.dk>,
Kai Harrekilde-Petersen <khp(a)harrekilde.dk> wrote:
>Terje Mathisen <"terje.mathisen at tmsw.no"> writes:
>
>> Kai Harrekilde-Petersen wrote:
>>> Terje Mathisen<"terje.mathisen at tmsw.no"> writes:
>>>
>>>> Benny Amorsen wrote:
>>>>> Andy Glew<"newsgroup at comp-arch.net"> writes:
>>>>>
>>>>>> Network routing (big routers, lots of packets).
>>>>>
>>>>> With IPv4 you can usually get away with routing on the top 24 bits + a
>>>>> bit of special handling of the few local routes on longer than 24-bit.
>>>>> That's a mere 16MB table if you can make do with possible 256
>>>>> "gateways".
>>>>
>>>> Even going to a full /32 table is just 4GB, which is very cheap these
>>>> days. :-)

Or use some red-black trees. Just make them vertical in memory, not
horizontal.
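
For comparison, the flat /24 table quoted above is a single direct
index; a minimal C sketch (one byte per entry, up to 256 gateways,
names illustrative):

#include <stdint.h>

/* Flat next-hop table from the quoted approach: one byte per /24
 * prefix, 2^24 entries = 16 MB, each selecting one of up to 256
 * "gateways".  Routes longer than /24 need a small exception path
 * (not shown). */
static uint8_t next_hop[1u << 24];

static inline uint8_t lookup_gateway(uint32_t dst_ip)  /* host byte order */
{
    return next_hop[dst_ip >> 8];   /* single memory access per packet */
}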

>>>
>>> The problem that I saw when I was designing Ethernet switch/routers 5
>>> years ago, wasn't one particular lookup, but the fact that you need to
>>> do *several* quick lookups for each packet (DMAC, 2*SMAC (rd+wr for
>>> learning), DIP, SIP, VLAN, ACLs, whatnot).
>>
>> I sort of assumed that not all of these would require the same size
>> table. What is the total index size, i.e. sum of all the index bits?
>
>I had to go back to some old documentation to remember it. So take the
>following as a real life example, but not absolute gospel.
>
>The first thing you need to do in a router is to identify which logical

I thought this was about building clusters? There you switch where you
can, route where you must. You should get to the four-nines point
(99.99%) of traffic just doing switching.

You need a route cache first. There you make a hash of some sort, and
get a MAC address back. This should be around 1.2 memory accesses per
packet. Next, you push it out the right interface with the right MAC.
If you link aggregate, you need to update a counter. All of this is
in the end node.

>port you're on (think link aggregation) and whether you need to do
>routing or not (a VID,MAC lookup).
>
>If you're doing routing (L3 forwarding), next you need to do an ingress
>VLAN check and then a CIDR Longest-prefix lookup for IPv4 unicast + an
>ARP lookup, and another lookup for IPv4 multicast (the ARP and multicast
>tables can share storage). Oh, and you need to do an egress VLAN lookup
>as well.

And then you cache the result: IP address -> exit point and MAC. With
a cache hashed into small buckets you should be able to make this
reasonably fast.
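
Roughly, such a route cache might look like this in software (an
illustrative sketch only; the bucket count, associativity, and hash
below are assumptions, not taken from any particular box):

#include <stdint.h>

#define RC_BUCKETS 4096              /* power of two; assumed size */
#define RC_WAYS    4                 /* small bucket, scanned linearly */

struct rc_entry {
    uint32_t dst_ip;                 /* key: destination IPv4 address */
    uint8_t  next_hop_mac[6];        /* MAC to rewrite into the frame */
    uint8_t  exit_port;              /* interface to push the packet out of */
    uint8_t  valid;
};

static struct rc_entry rcache[RC_BUCKETS][RC_WAYS];

static inline uint32_t rc_bucket(uint32_t ip)
{
    ip ^= ip >> 16;                  /* cheap mix; any decent hash works */
    return (ip * 0x9E3779B1u) >> 20; /* 12-bit bucket index */
}

/* Returns 1 and fills *out on a hit.  With lightly loaded buckets the
 * average cost stays close to one memory access per packet; a miss
 * falls back to the full lookup and then installs the result here. */
static int rc_lookup(uint32_t dst_ip, struct rc_entry *out)
{
    struct rc_entry *b = rcache[rc_bucket(dst_ip)];
    for (int i = 0; i < RC_WAYS; i++) {
        if (b[i].valid && b[i].dst_ip == dst_ip) {
            *out = b[i];
            return 1;
        }
    }
    return 0;
}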

>The CIDR lookup has around 55 bits, whereas the ARP was a straight 13bit
>index.
>
>At Layer two, you have a (VID, MAC) table that is around 75-80 bits wide
>(12+48 bit [VID,MAC] plus the data you need to store such as autolearning
>status, aging info & destination, CPU copy/move, mirroring etc), and it has
>as many entries as you want to throw at it. 8K-16K entries is the norm in
>the low-end, door-stopper boxes.

The fast boxes, and a lot of consumer stuff as well, have associative
MAC caches. Enter a MAC, exit one short int you can plug into port
selection.

All that other stuff goes in the CPU. The CPU has a shadow table that
runs 3-5 orders of magnitude slower. The associative index is generated
from the shadow.

Consumer boxes have around 400 entries in the associative index; large
boxes have low tens of thousands. That is _one_ access to get the exit
port. To speed things up you normally do a cut-through lookup so the
next hop is ready when the packet is in memory.
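
As a software analogue of that kind of associative lookup (the ASIC
compares the ways of a set in parallel; the sizes and hash below are
assumptions, not taken from any particular box):

#include <stdint.h>

#define MAC_SETS 1024     /* assumed; big boxes hold tens of thousands of entries */
#define MAC_WAYS 4        /* assumed associativity */

struct mac_entry {
    uint64_t key;                    /* 12-bit VID + 48-bit MAC, packed */
    uint16_t port;                   /* the short int fed to port selection */
    uint8_t  valid;
};

static struct mac_entry mac_cache[MAC_SETS][MAC_WAYS];

/* The hardware reads one set and compares all ways at once; in C we
 * just loop.  Returns the exit port, or -1 to punt to the slow path. */
static int mac_lookup(uint16_t vid, const uint8_t mac[6])
{
    uint64_t key = ((uint64_t)(vid & 0xFFFu) << 48)
                 | ((uint64_t)mac[0] << 40) | ((uint64_t)mac[1] << 32)
                 | ((uint64_t)mac[2] << 24) | ((uint64_t)mac[3] << 16)
                 | ((uint64_t)mac[4] <<  8) |  (uint64_t)mac[5];
    uint32_t set = (uint32_t)(key ^ (key >> 13) ^ (key >> 29)) % MAC_SETS;

    for (int w = 0; w < MAC_WAYS; w++)
        if (mac_cache[set][w].valid && mac_cache[set][w].key == key)
            return mac_cache[set][w].port;
    return -1;
}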

>I've probably forgotten a couple of minor lookups.

>The size of the tables also depends on whether you put real port bitmaps
>in all the tables, or you put an Id in there, and then have secondary
>Id-> port bitmap conversion tables later. We did the latter.

But in the cache you want the direct exit data.

>>> Each Gbit of Ethernet can generate 1.488M packets per second.
>>
>> And unless you can promise wire-speed for any N/2->N/2 full duplex
>> mesh you should not try to sell the product, right?
>
>Correct. The ability to do this has become cheap enough that it's now a
>tickmark requirement (at least at the low-port-count end). For the big
>modular switches used at the campus/corporate/WAN level, this is not
>feasible.

Layer 3 switching is an abomination when you want performance. Switch
where you can, route where you must.

-- mrr


From: Tim McCaffrey on
In article <ggtgp-959AC7.01492426072010(a)news.isp.giganews.com>,
ggtgp(a)yahoo.com says...
>

>AMD would allow you to split the 128 bit memory bus into two
>independent 64 bit busses, for better transaction throughput.
>To the best of my knowledge no one turns this mode on, as one
>bus is faster for everyone?
>

For the Opteron system we used, the BIOS default was unganged (two
64 bit busses). The vendor confirmed that this usually gives
better performance.

- Tim

From: Kai Harrekilde-Petersen on
Morten Reistad <first(a)last.name> writes:

> In article <84ocdrj4ps.fsf(a)harrekilde.dk>,
> Kai Harrekilde-Petersen <khp(a)harrekilde.dk> wrote:
>>Terje Mathisen <"terje.mathisen at tmsw.no"> writes:
>>
>>> Kai Harrekilde-Petersen wrote:
>>>> Terje Mathisen<"terje.mathisen at tmsw.no"> writes:
>>>>
>>>>> Benny Amorsen wrote:
>>>>>> Andy Glew<"newsgroup at comp-arch.net"> writes:
>>>>>>
>>>>>>> Network routing (big routers, lots of packets).
>>>>>>
>>>>>> With IPv4 you can usually get away with routing on the top 24 bits + a
>>>>>> bit of special handling of the few local routes on longer than 24-bit.
>>>>>> That's a mere 16MB table if you can make do with possible 256
>>>>>> "gateways".
>>>>>
>>>>> Even going to a full /32 table is just 4GB, which is very cheap these
>>>>> days. :-)
>
> Or use some red-black trees. Just make them vertical in memory, not
> horizontal.
>
>>>>
>>>> The problem that I saw when I was designing Ethernet switch/routers 5
>>>> years ago, wasn't one particular lookup, but the fact that you need to
>>>> do *several* quick lookups for each packet (DMAC, 2*SMAC (rd+wr for
>>>> learning), DIP, SIP, VLAN, ACLs, whatnot).
>>>
>>> I sort of assumed that not all of these would require the same size
>>> table. What is the total index size, i.e. sum of all the index bits?
>>
>>I had to go back to some old documentation to remember it. So take the
>>following as a real life example, but not absolute gospel.
>>
>>The first thing you need to do in a router is to identify which logical
>
> I thought this was about building clusters? There you switch where you
> can, route where you must. You should get to the four-nines point
> (99.99%) of traffic just doing switching.
>
> You need a route cache first. There you make a hash of some sort, and
> get a MAC address back. This should be around 1.2 memory accesses per
> packet.

Obviously, there are many ways to do these things. But on an L2/L3 fully
integrated switch, one of the simplest and most direct ways of handling
these lookups is to assign separate (internal) physical RAMs for each
lookup.

> Next, you push it out the right interface with the right MAC.
> If you link aggregate, you need to update a counter. All of this is
> in the end node.
>
> And then you cache the result: IP address -> exit point and MAC. With
> a cache hashed into small buckets you should be able to make this
> reasonably fast.

If you do things Right(tm), you don't need caching and you don't get the
problems coming from caching: you simply build a system which is fast
enough to handle the maximum frame/packet rate. With a 24 x 1Gbps + 2 x
10Gbps system, that's just under 66M packets per second.
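
(Worked out from the 1.488 Mpps/Gbit figure quoted above: a minimum-size
frame occupies 64 + 8 + 12 = 84 bytes = 672 bits on the wire including
preamble and inter-frame gap, so 1 Gbit/s gives 10^9 / 672 = ~1.488 Mpps,
and 24 x 1.488M + 2 x 14.88M = ~35.7M + ~29.8M = ~65.5 Mpps.)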

>>The CIDR lookup has around 55 bits, whereas the ARP was a straight 13bit
>>index.
>>
>>At Layer two, you have a (VID, MAC) table that is around 75-80 bits wide
>>(12+48 bit [VID,MAC] plus the data you need to store such as autolearning
>>status, aging info & destination, CPU copy/move, mirroring etc), and it has
>>as many entries as you want to throw at it. 8K-16K entries is the norm in
>>the low-end, door-stopper boxes.
>
> The fast boxes, and a lot of consumer stuff as well, have associative
> MAC caches. Enter a MAC, exit one short int you can plug into port
> selection.

Of course. Typically 4-8 sets. More sets burn more power and get
unwieldy even inside an ASIC due to routing distances.

> All that other stuff goes in the CPU. The CPU has a shadow table that
> runs 3-5 orders of magnitude slower. The associative index is generated
> from the shadow.

Not necessarily. The CPU just needs to know how to search/insert for a
given (VID,MAC) tuple.

> Consumer boxes have around 400 entries in the associative index; large
> boxes have low tens of thousands.

No. The consumer boxes have MAC and ARP tables on the order of 4-8,000
entries. I believe the newer generations go up to around 16K entries
(still far more than is really needed in a consumer network).

> That is _one_ access to get the
> exit port. To speed things up you normally do a cut-through lookup so
> the next hop is ready when the packet is in memory.
>
>>I've probably forgotten a couple of minor lookups.
>
>>The size of the tables also depends on whether you put real port bitmaps
>>in all the tables, or you put an Id in there, and then have secondary
>>Id-> port bitmap conversion tables later. We did the latter.
>
> But in the cache you want the direct exit data.

Not necessarily: think of the update scenario. If you have the direct
port bitmap stored, you need to traverse the entire memory every time
someone plugs a cable in or out. With an Id, you just edit that one
location.
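
A minimal sketch of that indirection (sizes and names below are
assumptions): the wide tables carry only a small destination Id, and a
single Id -> port-bitmap table is the one place edited when link state
changes.

#include <stdint.h>

#define NUM_DEST_IDS 256             /* assumed size of the Id space */

/* The big MAC/route tables store only a small destination Id; this one
 * table turns the Id into a physical port bitmap at the end of the
 * pipeline. */
static uint32_t portmask_of_id[NUM_DEST_IDS];

static inline uint32_t resolve_ports(uint8_t dest_id)
{
    return portmask_of_id[dest_id];
}

/* When a cable is pulled, only the Ids that used the port are touched;
 * the thousands of table entries carrying those Ids stay as they are.
 * With raw bitmaps stored everywhere, all of them would need rewriting. */
static void port_down(uint8_t dest_id, unsigned port)
{
    portmask_of_id[dest_id] &= ~(1u << port);
}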

>>>> Each Gbit of Ethernet can generate 1.488M packets per second.
>>>
>>> And unless you can promise wire-speed for any N/2->N/2 full duplex
>>> mesh you should not try to sell the product, right?
>>
>>Correct. The ability to do this has become cheap enough that it's now a
>>tickmark requirement (at least at the low-port-count end). For the big
>>modular switches used at the campus/corporate/WAN level, this is not
>>feasible.
>
> Layer 3 switching is an abomination when you want performance. Switch
> where you can, route where you must.

Why is L3 switching an abomination when you can do it at the same speed
as L2 switching, i.e. wire-speed on all ports? It's really not that
difficult to do L3 switching if you design it in from the start.


Kai
--
Kai Harrekilde-Petersen <khp(at)harrekilde(dot)dk>
From: MitchAlsup on
On Jul 26, 1:49 am, Brett Davis <gg...(a)yahoo.com> wrote:

> AMD would allow you to split the 128 bit memory bus into two
> independent 64 bit busses, for better transaction throughput.
> To the best of my knowledge no one turns this mode on, as one
> bus is faster for everyone?

The wide memory bus is invariably faster, especially with a small
number of DIMMs.

What the dual bus approach does is allow all stuffings of the DIMMs
to end up with usable memory systems, whatever amount of memory got
plugged in. This is in effect what the aftermarket customer wants,
and not what the performance customer wants. Which leads to:

The memory system is always fastest when all the DIMMs have the same
timing numbers.

Mitch
From: Paul A. Clayton on
On Jul 28, 5:32 pm, MitchAlsup <MitchAl...(a)aol.com> wrote:
[snip]
> The wide memory bus is invariably faster, especially with a small
> number of DIMMs.

Wouldn't having twice as many potentially active DRAM banks (two
independent channels vs. two DIMM channels merged to a single
addressed channel) be a significant benefit for many multithreaded
and some single-threaded applications where bank conflicts might be
more common (especially with a "small number of DIMMs")?



Paul A. Clayton
just a technophile