From: Terje Mathisen on
Morten Reistad wrote:
> In article<hacth1$b5o$1(a)smaug.linux.pwf.cam.ac.uk>,<nmm1(a)cam.ac.uk> wrote:
>> That is why I have posted why I think that modern architecture is
>> seriously outdated and mistaken, and a new generation should move
>> to an interrupt-free design. It could be done.
>
> The interrupt-coalescing code helps bring the interrupt rate
> down by an order of magnitude, so the interrupt rate is not
> a showstopper anymore.
>
> I have a strong gut feeling there is something going on
> regarding the L2 cache hit rate.

On a single CPU it is quite possible to access a small part of each of
multiple blocks spaced at a power-of-two offset, in such a way that you
only get to use a fraction of the available cache space.
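
A contrived sketch of what I mean (the 64 blocks, the 8 KB stride and
the pass count are all invented for illustration, nothing measured):
every load below maps to the same few cache sets, so the effective
cache is a tiny fraction of its nominal size.

#include <stdio.h>
#include <stdlib.h>

#define NBLOCKS 64
#define STRIDE  8192          /* power-of-two spacing between the blocks */

int main(void)
{
    char *buf = calloc(NBLOCKS, STRIDE);
    long sum = 0;

    if (buf == NULL)
        return 1;

    /* Touch one byte at the start of each 8 KB block, over and over.
       For any cache whose way size divides 8192 these all land in the
       same set(s), so only a handful of lines ever get used. */
    for (int pass = 0; pass < 100000; pass++)
        for (int i = 0; i < NBLOCKS; i++)
            sum += buf[(size_t)i * STRIDE];

    printf("%ld\n", sum);     /* keep the loads from being optimized away */
    free(buf);
    return 0;
}

Bump the stride off the power of two (say 8192+64) and the same loop
spreads across many more sets.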

Is it possible for two (or more) cores to have private allocations that
map to the same L2 cache lines (and/or TLB entries), and then cause
problems?

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Morten Reistad on
In article <N_udnSHThtMDz1fXnZ2dnUVZ8q1i4p2d(a)lyse.net>,
Terje Mathisen <Terje.Mathisen(a)tmsw.no> wrote:
>Morten Reistad wrote:
>> In article<hacth1$b5o$1(a)smaug.linux.pwf.cam.ac.uk>,<nmm1(a)cam.ac.uk> wrote:
>>> That is why I have posted why I think that modern architecture is
>>> seriously outdated and mistaken, and a new generation should move
>>> to an interrupt-free design. It could be done.
>>
>> The interrupt-coalescing code helps bring the interrupt rate
>> down by an order of magnitude, so the interrupt rate is not
>> a showstopper anymore.
>>
>> I have a strong gut feeling there is something going on
>> regarding the L2 cache hit rate.
>
>On a single CPU it is quite possible to access a small part of each of
>multiple blocks spaced at a power-of-two offset, in such a way that you
>only get to use a fraction of the available cache space.
>
>Is it possible for two (or more) cores to have private allocations that
>map to the same L2 cache lines (and/or TLB entries), and then cause
>problems?

Possibly, but the allocations are straight malloc buffers filled
with recvfrom(), so I would not think this is a major issue.

However, all of the applications just open general UDP sockets
and read the mass of UDP packets arriving. Which CPU services
any given read should be pretty random, but the packet handling
needs to look up a few things, and will therefore hit cache locations
that were, with odds of 12:1 or some such, last used by another
CPU.

This may explain why we get such tremendous boosts from HyperTransport,
or from keeping all the cores on one socket. On the two-socket machine
without HyperTransport that we tested, it proved essential for
speed to put all the packet processing (RTP and the Linux kernel) on
one CPU socket, and everything else on the other.
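
For the record, each packet worker is essentially the following sketch;
the idea that cores 0-3 make up the first physical socket, and the RTP
port number, are assumptions specific to our test box, nothing general.

#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    cpu_set_t set;
    struct sockaddr_in addr;
    int s, c;
    char *buf;

    /* Pin this worker to cores 0-3; that those four cores are the
       first physical socket is an assumption about our box. */
    CPU_ZERO(&set);
    for (c = 0; c < 4; c++)
        CPU_SET(c, &set);
    sched_setaffinity(0, sizeof(set), &set);

    /* Plain UDP socket; port 5004 is just an illustrative RTP port. */
    s = socket(AF_INET, SOCK_DGRAM, 0);
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(5004);
    bind(s, (struct sockaddr *)&addr, sizeof(addr));

    /* Straight malloc buffer filled by recvfrom(), as described above. */
    buf = malloc(65536);
    for (;;) {
        ssize_t n = recvfrom(s, buf, 65536, 0, NULL, NULL);
        if (n <= 0)
            continue;
        /* ... RTP/packet handling on the n bytes in buf ... */
    }
}

(The NIC interrupts can be steered the same way through
/proc/irq/<n>/smp_affinity, but that is outside this sketch.)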

This is why I want to see some cpu counters for cache misses.

-- mrr


From: Del Cecchi on
nmm1(a)cam.ac.uk wrote:
> In article <ugfmp6-vu.ln1(a)laptop.reistad.name>,
> Morten Reistad <first(a)last.name> wrote:
>> So far, we have tested lots of 16-way machines (mostly Xeon in HP 3x0 packages, or
>> IBM xSeries 34x). The first bottleneck seems to be memory, because the
>> applications (media servers of all kinds) get such a tremendous boost from
>> HyperTransport and larger caches. Cache coherency and Linux seem like
>> a brute-force endeavour as soon as we go past 4 processors or so.
>
> 'Tain't just Linux. That's generally the main bottleneck.
>
>> We mostly tested Linux, but installed FreeBSD for a different view.
>> FreeBSD handles more interrupts than Linux, but the interrupt balancing
>> results look odd. It also seems to lock the kernel a bit more. The
>> memory footprint is also slightly smaller.
>
> Interesting.
>
>> Next on the performance list is I/O, and interrupt/DMA scheduling in particular.
>> The interrupt coalescing fixes in 2.6.24 seem to help a lot.
>
> Gug. Or, more precisely, gug, gug, gug :-(
>
> That is NOT a nice area, not at all. I have no direct experience of
> Linux in that regard, but doubt that it is much different from IRIX
> and Solaris. The details will differ, of course.
>
>> The things I am specifically looking out for are cache coherency and
>> interrupt handling.
>
> Do you want a consultant? :-)
>
> More seriously, those were precisely the areas that caused me such
> problems. At one stage, I locked the Origin up so badly that I
> couldn't power cycle it from the control panel, and had to flip
> the breakers on each rack. One of the keys is to separate the
> interrupt handling from the parallel applications - and I mean on
> separate cores.
>
>
>
> Regards,
> Nick Maclaren.

The Origin didn't have a service processor to handle things like power
on and off? I am shocked and appalled.

del
From: "Andy "Krazy" Glew" on
Morten Reistad wrote:
> This is why I want to see some cpu counters for cache misses.

If you are on an Intel x86 there are EMON counter events for L2 cache
misses - both the number of requests sent to the L2 that miss, and the
number of miss requests sent out (the difference being requests that
get combined).

There are also counter events for lines in and lines out.

Finally, there are counter events for bus (interface) accesses of
different types.

There are plenty of ways of measuring L2 misses, in their several
different flavors.
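
If you are on Linux, one way to read such an event without poking MSRs
by hand is the perf_event_open() syscall (kernel 2.6.31 or later, so
newer than the 2.6.24 mentioned upthread). This is only a sketch; the
generic PERF_COUNT_HW_CACHE_MISSES event maps to a last-level-cache
miss count, which may or may not be exactly the L2 flavor you are after.

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* No glibc wrapper for this syscall, so supply one. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    uint64_t misses = 0;
    int fd;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;   /* generic LLC-miss event */
    attr.disabled = 1;

    fd = perf_event_open(&attr, 0, -1, -1, 0);  /* this thread, any CPU */
    if (fd < 0) {
        perror("perf_event_open");
        return 1;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... the code you want to measure goes here ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &misses, sizeof(misses));
    printf("cache misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}

For per-core rather than per-thread numbers you open one descriptor per
CPU, with pid = -1 and cpu set to that core.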
From: nmm1 on
In article <dm7qp6-0q8.ln1(a)laptop.reistad.name>,
Morten Reistad <first(a)last.name> wrote:
>
>However, all of the applications just open general UDP sockets
>and read the mass of UDP packets arriving. Which CPU services
>any given read should be pretty random, but the packet handling
>needs to look up a few things, and will therefore hit cache locations
>that were, with odds of 12:1 or some such, last used by another
>CPU.
>
>This may explain why we get such tremendous boosts from HyperTransport,
>or from keeping all the cores on one socket. On the two-socket machine
>without HyperTransport that we tested, it proved essential for
>speed to put all the packet processing (RTP and the Linux kernel) on
>one CPU socket, and everything else on the other.

It would. Intel's previous memory system was a crock, and its
multi-socket support was worse, which is why its effective limit
was 2 sockets. The new system is MUCH better but, by all accounts,
still doesn't handle multiple sockets well, and so is still not very
good on 4 sockets.

AMD's is much better, which is why it can get up to 4 sockets with
reasonable performance. The reason that it fails dismally with 8
is that the 7-transaction snoop protocol overloads the very limited
interconnect you can build with only 3 HyperTransport links per
socket and 8 sockets.


Regards,
Nick Maclaren.