From: Terje Mathisen "terje.mathisen at tmsw.no" on
Kai Harrekilde-Petersen wrote:
> Terje Mathisen<"terje.mathisen at tmsw.no"> writes:
>
>> Benny Amorsen wrote:
>>> Andy Glew<"newsgroup at comp-arch.net"> writes:
>>>
>>>> Network routing (big routers, lots of packets).
>>>
>>> With IPv4 you can usually get away with routing on the top 24 bits + a
>>> bit of special handling of the few local routes on longer than 24-bit.
>>> That's a mere 16MB table if you can make do with possible 256
>>> "gateways".
>>
>> Even going to a full /32 table is just 4GB, which is very cheap these
>> days. :-)
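
A single byte of next-hop index per /24 prefix is how I read that, i.e.
something along these lines (sketch only; the longer-than-/24 local
routes and table updates are left out):

#include <stdint.h>

static uint8_t  next_hop_idx[1 << 24];  /* 16 MB: one byte per /24 prefix   */
static uint32_t gateways[256];          /* the up-to-256 possible next hops */

static uint32_t route_lookup(uint32_t dest_ip)
{
    return gateways[next_hop_idx[dest_ip >> 8]];
}
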
>
> The problem that I saw when I was designing Ethernet switch/routers 5
> years ago, wasn't one particular lookup, but the fact that you need to
> do *several* quick lookups for each packet (DMAC, 2*SMAC (rd+wr for
> learning), DIP, SIP, VLAN, ACLs, whatnot).

I sort of assumed that not all of these would require the same size
table. What is the total index size, i.e. sum of all the index bits?

> Each Gbit of Ethernet can generate 1.488M packets per second.

And unless you can promise wire-speed for any N/2->N/2 full-duplex mesh,
you should not try to sell the product, right?
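
(If I remember the framing overhead right, that 1.488M figure is simply
minimum-size frames back to back: 64 bytes of frame + 8 bytes of preamble
+ 12 bytes of inter-frame gap = 84 bytes = 672 bit times, and
10^9 / 672 ~= 1.488M frames/s, i.e. a budget of roughly 672 ns per packet
per Gbit port.)
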
>
> The DRAMs may be cheap enough, but the pins and the power to drive
> multiple banks sure ain't cheap.
>
> Remember that you want to do this in the switch/router hardware path (ie
> no CPU should touch the packet), and at wire-speed for all ports at the
> same time.

Obviously.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Bakul Shah on
On 7/28/10 12:04 AM, Terje Mathisen wrote:
> Kai Harrekilde-Petersen wrote:
>> Terje Mathisen<"terje.mathisen at tmsw.no"> writes:
>>
>>> Benny Amorsen wrote:
>>>> Andy Glew<"newsgroup at comp-arch.net"> writes:
>>>>
>>>>> Network routing (big routers, lots of packets).
>>>>
>>>> With IPv4 you can usually get away with routing on the top 24 bits + a
>>>> bit of special handling of the few local routes on longer than 24-bit.
>>>> That's a mere 16MB table if you can make do with possible 256
>>>> "gateways".
>>>
>>> Even going to a full /32 table is just 4GB, which is very cheap these
>>> days. :-)
>>
>> The problem that I saw when I was designing Ethernet switch/routers 5
>> years ago, wasn't one particular lookup, but the fact that you need to
>> do *several* quick lookups for each packet (DMAC, 2*SMAC (rd+wr for
>> learning), DIP, SIP, VLAN, ACLs, whatnot).
>
> I sort of assumed that not all of these would require the same size
> table. What is the total index size, i.e. sum of all the index bits?

The sum is too large to allow a single access in a realistic
system. You use whatever tricks you have to, to achieve wire-
speed switching/IP forwarding while staying within given
constraints. Nowadays ternary CAMs are used to do the heavy
lifting for lookups, but indexing and some sort of tries are also
used. In addition to the things mentioned above, you also
have to deal with QoS (based on some subset of {src, dst}
{ether-addr, ip-addr, port}, ether-type, protocol, vlan, mpls
tags), policing, shaping, scheduling, counter updates, etc., and
everything requires access to memory or a TCAM! Typically a
heavily pipelined network processor or special-purpose ASIC
is used. The nice thing is that packet forwarding is almost
completely parallelizable.
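
(A toy version of the trie idea, for the curious -- a one-bit-per-level
binary trie doing longest-prefix match; real boxes use multibit or
compressed tries, or TCAMs, so treat it purely as illustration:)

#include <stdint.h>

struct trie_node {
    struct trie_node *child[2];   /* next bit of the address: 0 or 1    */
    int      has_route;           /* does some prefix end at this node? */
    uint32_t next_hop;
};

/* Walk the destination address from the top bit down, remembering the
   deepest node that carried a route: that is the longest matching prefix. */
static uint32_t lpm_lookup(const struct trie_node *root, uint32_t dip)
{
    uint32_t best = 0;                    /* default route / no match */
    const struct trie_node *n = root;
    int bit = 31;

    while (n) {
        if (n->has_route)
            best = n->next_hop;
        if (bit < 0)
            break;
        n = n->child[(dip >> bit--) & 1];
    }
    return best;
}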

To bring this back to comp.arch, if you have on-chip L1, L2 &
L3 caches, chances are you have to xfer large chunks of data
on every access of external RAM for efficiency reasons. Seems
to me, you may as well wrap each such request/response in an
ethernet frame by putting multiple ethernet framers onboard!
Similarly, memory modules should provide an ethernet interface.
Note that there is already AOE (ATA over Ethernet) to connect
disks via ethernet. And since external memories are now like
disk (or tape)... :-)
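
(Half seriously, the framing needed is trivial. A made-up read-request
format -- the field layout and the use of an experimental ethertype are
purely illustrative -- could look like this:)

#include <stdint.h>

#define ETHERTYPE_MEMREQ 0x88B5   /* an IEEE "local experimental" ethertype */

struct mem_req_frame {
    uint8_t  dst_mac[6];          /* the memory module                    */
    uint8_t  src_mac[6];          /* the requesting cache/DRAM controller */
    uint16_t ethertype;           /* ETHERTYPE_MEMREQ                     */
    uint8_t  opcode;              /* 0 = read, 1 = write                  */
    uint8_t  tag;                 /* match a response to its request      */
    uint64_t address;             /* physical address                     */
    uint16_t length;              /* e.g. 64 for one cache line           */
    /* write data would follow here; the MAC appends the CRC */
} __attribute__((packed));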
From: Terje Mathisen "terje.mathisen at tmsw.no" on
Bakul Shah wrote:
> On 7/28/10 12:04 AM, Terje Mathisen wrote:
>> Kai Harrekilde-Petersen wrote:
>>> Terje Mathisen<"terje.mathisen at tmsw.no"> writes:
>>>
>>>> Benny Amorsen wrote:
>>>>> Andy Glew<"newsgroup at comp-arch.net"> writes:
>>>>>
>>>>>> Network routing (big routers, lots of packets).
>>>>>
>>>>> With IPv4 you can usually get away with routing on the top 24 bits + a
>>>>> bit of special handling of the few local routes on longer than 24-bit.
>>>>> That's a mere 16MB table if you can make do with possible 256
>>>>> "gateways".
>>>>
>>>> Even going to a full /32 table is just 4GB, which is very cheap these
>>>> days. :-)
>>>
>>> The problem that I saw when I was designing Ethernet switch/routers 5
>>> years ago, wasn't one particular lookup, but the fact that you need to
>>> do *several* quick lookups for each packet (DMAC, 2*SMAC (rd+wr for
>>> learning), DIP, SIP, VLAN, ACLs, whatnot).
>>
>> I sort of assumed that not all of these would require the same size
>> table. What is the total index size, i.e. sum of all the index bits?
>
> The sum is too large to allow a single access in a realistic

Sorry, I was unclear, but the OP did understand what I meant:

I was really asking for the full list of lookups, with individual sizes,
needed to do everything a router has to do these days.

> system. You use whatever tricks you have to, to achieve wire-
> speed switching/IP forwarding while staying within given
> constraints. Nowadays ternary CAMs are used to do the heavy
> lifting for lookups, but indexing and some sort of tries are also
> used. In addition to the things mentioned above, you also
> have to deal with QoS (based on some subset of {src, dst}
> {ether-addr, ip-addr, port}, ether-type, protocol, vlan, mpls
> tags), policing, shaping, scheduling, counter updates, etc., and
> everything requires access to memory or a TCAM! Typically a
> heavily pipelined network processor or special-purpose ASIC
> is used. The nice thing is that packet forwarding is almost
> completely parallelizable.

Yes, but if you want to take a chance and skip the trailing checksum
test, in order to forward packets as soon as you have the header, then
you would have even more severe timing restrictions, right?

(Skipping/delaying the checksum test would mean depending upon the end
node to detect the error.)

BTW, is anyone doing this? Maybe in order to win benchmarketing tests?
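
(The attraction is latency: if I have the arithmetic right, a full-size
1518-byte frame takes about 12 us to receive at 1 Gbit/s, while the
destination address is complete after the first 6 bytes, so pure
store-and-forward adds on the order of 12 us per hop that such a design
would avoid.)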

> To bring this back to comp.arch, if you have on-chip L1, L2 &
> L3 caches, chances are you have to xfer large chunks of data
> on every access of external RAM for efficiency reasons. Seems
> to me, you may as well wrap each such request/response in an
> ethernet frame by putting multiple ethernet framers onboard!
> Similarly, memory modules should provide an ethernet interface.
> Note that there is already AOE (ATA over Ethernet) to connect
> disks via ethernet. And since external memories are now like
> disk (or tape)... :-)

Well, what do the current cross-CPU protocols look like?

AMD or Intel doesn't seem to matter; there's still a nice little HW
network stack in there.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Bakul Shah on
On 7/29/10 12:57 AM, Terje Mathisen wrote:
> Bakul Shah wrote:
>> system. You use whatever tricks you have to, to achieve wire-
>> speed switching/IP forwarding while staying within given
>> constraints. Nowadays ternary CAMs are used to do the heavy
>> lifting for lookups, but indexing and some sort of tries are also
>> used. In addition to the things mentioned above, you also
>> have to deal with QoS (based on some subset of {src, dst}
>> {ether-addr, ip-addr, port}, ether-type, protocol, vlan, mpls
>> tags), policing, shaping, scheduling, counter updates, etc., and
>> everything requires access to memory or a TCAM! Typically a
>> heavily pipelined network processor or special-purpose ASIC
>> is used. The nice thing is that packet forwarding is almost
>> completely parallelizable.
>
> Yes, but if you want to take a chance and skip the trailing checksum
> test, in order to forward packets as soon as you have the header, then
> you would have even more severe timing restrictions, right?

You still have the same time budget: worst case, you still have to send
out 64-byte packets back to back. Most lookups can be done as soon as the
NPU can get at the header.
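
(At 10 Gbit/s that worst case works out to one minimum-size packet
roughly every 67 ns per port, which is exactly why the processing is
pipelined: each stage gets a handful of memory/TCAM accesses per packet
at most.)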

> (Skipping/delaying the checksum test would mean depending upon the end
> node to detect the error.)
>
> BTW, is anyone doing this? Maybe in order to win benchmarketing tests?

You can drop a bad CRC packet at a later point in the pipeline but before
sending it out.

>
>> To bring this back to comp.arch, if you have on-chip L1, L2 &
>> L3 caches, chances are you have to xfer large chunks of data
>> on every access of external RAM for efficiency reasons. Seems
>> to me, you may as well wrap each such request/response in an
>> ethernet frame by putting multiple ethernet framers onboard!
>> Similarly, memory modules should provide an ethernet interface.
>> Note that there is already AOE (ATA over Ethernet) to connect
>> disks via ethernet. And since external memories are now like
>> disk (or tape)... :-)
>
> Well, what do the current cross-CPU protocols look like?
>
> AMD or Intel doesn't seem to matter; there's still a nice little HW
> network stack in there.

Ethernet frames seem to have become the most common denominator in
networks, so I was speculating that maybe that'd be the cheapest way to
shove lots of data around?
From: Terje Mathisen "terje.mathisen at tmsw.no" on
Bakul Shah wrote:
> On 7/29/10 12:57 AM, Terje Mathisen wrote:
>> Yes, but if you want to take a chance and skip the trailing checksum
>> test, in order to forward packets as soon as you have the header, then
>> you would have even more severe timing restrictions, right?
>
> You still have the same time budget: worst case, you still have to send
> out 64-byte packets back to back. Most lookups can be done as soon as the
> NPU can get at the header.
>
>> (Skipping/delaying the checksum test would mean depending upon the end
>> node to detect the error.)
>>
>> BTW, is anyone doing this? Maybe in order to win benchmarketing tests?
>
> You can drop a bad CRC packet at a later point in the pipeline but before
> sending it out.

I meant sending out _before_ you have received it, as soon as you have
the dest address.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"