From: Morten Reistad on
In article <ha9svv$bs0$1(a)smaug.linux.pwf.cam.ac.uk>, <nmm1(a)cam.ac.uk> wrote:
>In article <5g6mp6-0p1.ln1(a)laptop.reistad.name>,
>Morten Reistad <first(a)last.name> wrote:
>>In article <7iknkoF31ud7sU1(a)mid.individual.net>,
>>
>>Speaking of which, are there any firm figures from putting Linux, the BSDs,
>>IRIX, Solaris etc. through the hoops on a machine with an upper-three-digit
>>processor count, with independent load processes where the OS has to do the
>>MP handling; e.g. socket, pipe, semaphores in monolithic processes?
>
>To the best of my knowledge, no. I am one of the few people with any
>experience of even 64+ core systems, and the maximum I have personal
experience with is 264, but that was Hitachi OSF/1 on an SR2201 (which
>was distributed memory, and VERY unlike those systems you mention).
>Beyond that, it's 72 (Solaris on a SunFire) and 64 (IRIX on an Origin).
>There are very, very few large SMPs anywhere.

So far, we have tested lots of 16-way machines (mostly Xeon in HP3x0
packages, or IBM xSeries 34x). The first bottleneck seems to be memory,
because the applications (media servers of all kinds) get such a tremendous
boost from HyperTransport and larger caches. Cache coherency under Linux
seems like a brute-force endeavour as soon as we go past 4 processors or so.

We mostly tested Linux, but installed FreeBSD for a different view.
FreeBSD handles more interrupts than Linux, but its balancing results
seem odd. It also seems to lock the kernel a bit more. Its memory
footprint is also slightly smaller.

Next on the performance list is I/O, and interrupt/DMA scheduling in
particular. The fixes with interrupt coalescing in 2.6.24 seem to help a lot.
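
For reference, these per-driver coalescing knobs are also reachable from
user space through the ethtool ioctl, so you can experiment without
rebuilding anything. A minimal sketch (the interface name and the values
are illustrative, and the NIC driver has to support the coalescing ops):

    /* Read, adjust and write back a NIC's interrupt coalescing settings. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <linux/ethtool.h>
    #include <linux/sockios.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct ethtool_coalesce ec;
        struct ifreq ifr;

        memset(&ec, 0, sizeof ec);
        ec.cmd = ETHTOOL_GCOALESCE;          /* fetch current settings */
        memset(&ifr, 0, sizeof ifr);
        strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
        ifr.ifr_data = (char *)&ec;
        if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { perror("GCOALESCE"); return 1; }

        ec.cmd = ETHTOOL_SCOALESCE;
        ec.rx_coalesce_usecs = 100;          /* wait up to 100 us ...    */
        ec.rx_max_coalesced_frames = 32;     /* ... or 32 frames per IRQ */
        if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { perror("SCOALESCE"); return 1; }

        close(fd);
        return 0;
    }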

>>I am looking for figures on OS performance itself, not user space.
>>
>>The reason I am asking is that I have been involved in a lot of testing
>>of application performance lately, and it seems to me we are measuring
>>the performance of BSD and Linux, not the application itself.
>
>That fails to surprise me. It took me 3 weeks of 80-hour weeks to
>get the SGI Origin usable, because of an issue that was in the area you
>are talking about. Having resolved that, it was fine.
>
>With that experience, the SunFire was a LOT easier - when we hit a
>performance issue of the form you describe, I knew exactly what to
>do to resolve it. Sun were less convinced, but it was one of the
>things on their 'to try' list.

The things I am specifically looking out for are cache coherency and
interrupt handling.

-- mrr
From: nmm1 on
In article <ugfmp6-vu.ln1(a)laptop.reistad.name>,
Morten Reistad <first(a)last.name> wrote:
>
>So far, we have tested lots of 16-way machines (mostly Xeon in HP3x0
>packages, or IBM xSeries 34x). The first bottleneck seems to be memory,
>because the applications (media servers of all kinds) get such a tremendous
>boost from HyperTransport and larger caches. Cache coherency under Linux
>seems like a brute-force endeavour as soon as we go past 4 processors or so.

'Tain't just Linux. That's generally the main bottleneck.

>We mostly tested Linux, but installed FreeBSD for a different view.
>FreeBSD handles more interrupts than Linux, but its balancing results
>seem odd. It also seems to lock the kernel a bit more. Its memory
>footprint is also slightly smaller.

Interesting.

>Next on the performance list is I/O, and interrupt/DMA scheduling in
>particular. The fixes with interrupt coalescing in 2.6.24 seem to help a lot.

Gug. Or, more precisely, gug, gug, gug :-(

That is NOT a nice area, not at all. I have no direct experience of
Linux in that regard, but doubt that it is much different from IRIX
and Solaris. The details will differ, of course.

>The things I am specifically looking out for are cache coherency and
>interrupt handling.

Do you want a consultant? :-)

More seriously, those were precisely the areas that caused me such
problems. At one stage, I locked the Origin up so badly that I
couldn't power cycle it from the control panel, and had to flip
the breakers on each rack. One of the keys is to separate the
interrupt handling from the parallel applications - and I mean on
separate cores.



Regards,
Nick Maclaren.
From: Morten Reistad on
In article <haa434$ts8$1(a)smaug.linux.pwf.cam.ac.uk>, <nmm1(a)cam.ac.uk> wrote:
>In article <ugfmp6-vu.ln1(a)laptop.reistad.name>,
>Morten Reistad <first(a)last.name> wrote:
>>
>>So far, we have tested lots of 16-way machines (mostly Xeon in HP3x0
>>packages, or IBM xSeries 34x). The first bottleneck seems to be memory,
>>because the applications (media servers of all kinds) get such a tremendous
>>boost from HyperTransport and larger caches. Cache coherency under Linux
>>seems like a brute-force endeavour as soon as we go past 4 processors or so.
>
>'Tain't just Linux. That's generally the main bottleneck.

I kinda knew that. But I was still surprised at HOW big the differences
were. An 8-way Xeon machine with 24 MB of L2 cache handles 2800 streams; a
16-way machine with 64 MB of L2 cache handles 22000 streams. Same processor,
same clock rate, same PCI busses; 4-way hyperchannel instead of 1-way, and
one extra south bridge.

Are the memory and cache access counters on the Xeons accessible from a
Linux environment?
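
From what I can tell, the counters are reachable through oprofile or the
perfctr patches today, and newer kernels are growing a perf_event_open()
syscall for exactly this. A minimal sketch of the syscall route, counting
last-level cache misses around a workload:

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        struct perf_event_attr attr;
        uint64_t count;
        int fd;

        memset(&attr, 0, sizeof attr);
        attr.size = sizeof attr;
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CACHE_MISSES;  /* LLC misses */
        attr.disabled = 1;

        /* this process, any CPU; no group leader, no flags */
        fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        /* ... workload under test goes here ... */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        if (read(fd, &count, sizeof count) == (ssize_t)sizeof count)
            printf("cache misses: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }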

>>We mostly tested Linux, but installed FreeBSD for a different view.
>>FreeBSD handles more interrupts than Linux, but its balancing results
>>seem odd. It also seems to lock the kernel a bit more. Its memory
>>footprint is also slightly smaller.
>
>Interesting.
>
>>Next on the performance list is I/O, and interrupt/DMA scheduling in
>>particular. The fixes with interrupt coalescing in 2.6.24 seem to help a lot.
>
>Gug. Or, more precisely, gug, gug, gug :-(
>
>That is NOT a nice area, not at all. I have no direct experience of
>Linux in that regard, but doubt that it is much different from IRIX
>and Solaris. The details will differ, of course.

Tuning the Linux IRQ/DMA balancer is somewhere between witchcraft and
black magic. You can nudge it so it performs well, but on the next boot
it misperforms. It seems to need a few billion interrupts before it
actually gets a good picture of where the interrupts are likely to be.
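
One workaround is to take the balancer out of the loop entirely and nail
the hot IRQs down by hand through /proc. A minimal sketch (the IRQ number
is illustrative; the real ones come from /proc/interrupts):

    /* Pin IRQ 24 to CPU 0 by writing a hex CPU bitmask. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/irq/24/smp_affinity", "w");
        if (!f) { perror("fopen"); return 1; }
        fputs("1\n", f);       /* bit 0 set == CPU 0 only */
        return fclose(f) ? 1 : 0;
    }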

>>The things I am specifically looking out for are cache coherency and
>>interrupt handling.
>
>Do you want a consultant? :-)
>
>More seriously, those were precisely the areas that caused me such
>problems. At one stage, I locked the Origin up so badly that I
>couldn't power cycle it from the control panel, and had to flip
>the breakers on each rack. One of the keys is to separate the
>interrupt handling from the parallel applications - and I mean on
>separate cores.

This is one thing Linux does very well. But we see that the actual
user mode code uses very little CPU, unless we are transcoding.
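
For the record, the user-space half of that separation is just a
sched_setaffinity() call. A minimal sketch (core numbers are illustrative,
and it assumes the device IRQs have already been steered to core 0):

    /* Keep this process off core 0, leaving core 0 free for interrupts. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(1, &set);      /* application may run on cores 1..3 */
        CPU_SET(2, &set);
        CPU_SET(3, &set);
        if (sched_setaffinity(0, sizeof set, &set) < 0) {
            perror("sched_setaffinity");
            return 1;
        }
        /* ... the application proper ... */
        return 0;
    }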

I have tested media with SER/rtpproxy, Asterisk, and various RTP code
written in-house.

It makes a huge difference to do simple RTP NAT/mixing in a kernel
driver, either as iptables programming or as custom code.

-- mrr
From: Terje Mathisen on
Morten Reistad wrote:
> In article <haa434$ts8$1(a)smaug.linux.pwf.cam.ac.uk>, <nmm1(a)cam.ac.uk> wrote:
>> 'Tain't just Linux. That's generally the main bottleneck.
>
> I kinda knew that. But I was still surprised at HOW big the differences
> were. An 8-way Xeon machine with 24 MB of L2 cache handles 2800 streams; a
> 16-way machine with 64 MB of L2 cache handles 22000 streams. Same processor,
> same clock rate, same PCI busses; 4-way hyperchannel instead of 1-way, and
> one extra south bridge.

That's almost an order of magnitude...

The small machine had 3 MB L2/core, while the big one had 4 MB for each.

Did you stumble over the edge of a 3.5 MB working set cliff?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Morten Reistad on
In article <LaKdnRhs_a_zRVXXnZ2dnUVZ8lydnZ2d(a)lyse.net>,
Terje Mathisen <Terje.Mathisen(a)tmsw.no> wrote:
>Morten Reistad wrote:
>> In article <haa434$ts8$1(a)smaug.linux.pwf.cam.ac.uk>, <nmm1(a)cam.ac.uk> wrote:
>>> 'Tain't just Linux. That's generally the main bottleneck.
>>
>> I kinda knew that. But I was still surprised at HOW big the differences
>> were. An 8-way Xeon machine with 24 MB of L2 cache handles 2800 streams; a
>> 16-way machine with 64 MB of L2 cache handles 22000 streams. Same processor,
>> same clock rate, same PCI busses; 4-way hyperchannel instead of 1-way, and
>> one extra south bridge.
>
>That's almost an order of magnitude...
>
>The small machine had 3 MB L2/core, while the big one had 4 MB for each.
>
>Did you stumble over the edge of a 3.5 MB working set cliff?

I just cannot see what it is that has such effects per CPU.
We have seen similar effects (order of magnitude) with and
without hyperchannel linkage between the caches.

This class of problem consists of running large numbers of identical,
simple tasks, where the multiprogramming is done by the kernel and
common driver software.

There was obviously some edge there, but we are very clearly
measuring Linux, not the application, because we can swap the
application between SER+rtpproxy, Asterisk and the Yate proxy
with very little impact on the observed numbers. The user
mode code uses around 4-6% of the CPU time; about twice that
is used for task switching, and twice that again in interrupt
service mode. Linux 2.6.24 made a huge difference in how much
interrupt load an MP setup can sustain, thanks to the interrupt
coalescing code in the drivers. That brought the interrupt
service overhead down considerably.

The applications are coded using all three main APIs: select(),
poll(), and monster numbers of synchronous threads. They behave
equally well; the differences are too small to be significant.
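
The poll() flavour boils down to a loop like this per bridged pair (a
minimal sketch; socket setup and error handling are omitted, and the RTP
rewriting is marked but not shown):

    /* Shuttle datagrams between two already-connected UDP sockets. */
    #include <poll.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    static void bridge(int a, int b)
    {
        struct pollfd pfd[2] = { { a, POLLIN, 0 }, { b, POLLIN, 0 } };
        char buf[2048];
        int i;

        for (;;) {
            if (poll(pfd, 2, -1) < 0)
                break;                    /* interrupted, or worse */
            for (i = 0; i < 2; i++) {
                if (pfd[i].revents & POLLIN) {
                    ssize_t n = recv(pfd[i].fd, buf, sizeof buf, 0);
                    /* rewrite the RTP header in buf (seq, timestamp,
                       SSRC); the IP/UDP fields are the socket's job */
                    if (n > 0)
                        send(pfd[1 - i].fd, buf, n, 0);
                }
            }
        }
    }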

These tasks are just bridging RTP streams: 160 octets of payload,
24 octets of RTP, 8 octets of UDP, 20 of IP and 16 of Ethernet (2 extra
for alignment); 228-octet frames, 50 per second.

In the IP header the TOS, TTL, source and destination addresses
plus header checksum are changed; UDP sees ports and checksum change;
and RTP sees sequence, timestamp and SSRC change.
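
Concretely, that substitution is only a handful of field stores plus a
checksum. A minimal sketch in the style of the user-mode code (struct
rtphdr and struct session are our own illustrative types; the UDP checksum
is simply cleared here, which IPv4 permits, where a real version would
recompute it over the pseudo-header):

    #include <stdint.h>
    #include <arpa/inet.h>
    #include <netinet/ip.h>        /* struct iphdr (Linux layout) */
    #include <netinet/udp.h>       /* struct udphdr */

    struct rtphdr {                /* fixed part of an RTP header */
        uint8_t  vpxcc;            /* version/padding/ext/CSRC count */
        uint8_t  mpt;              /* marker + payload type */
        uint16_t seq;
        uint32_t timestamp;
        uint32_t ssrc;
    };

    struct session {               /* rewrite rules for one stream */
        uint32_t src, dst;         /* new addresses, network order */
        uint16_t sport, dport;     /* new ports, network order */
        uint32_t ssrc;             /* new SSRC, network order */
        uint16_t seq_off;
        uint32_t ts_off;
        uint8_t  tos;
    };

    static uint16_t ip_checksum(const struct iphdr *ip)
    {
        const uint16_t *p = (const uint16_t *)ip;
        uint32_t sum = 0;
        int i;

        for (i = 0; i < ip->ihl * 2; i++)  /* ihl is in 32-bit words */
            sum += p[i];
        while (sum >> 16)
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }

    void rewrite(uint8_t *frame, const struct session *s)
    {
        /* 14 octets Ethernet + 2 alignment, per the accounting above */
        struct iphdr  *ip  = (struct iphdr *)(frame + 16);
        struct udphdr *udp = (struct udphdr *)((uint8_t *)ip + ip->ihl * 4);
        struct rtphdr *rtp = (struct rtphdr *)(udp + 1);

        ip->tos   = s->tos;
        ip->ttl   = 64;            /* fresh TTL; value illustrative */
        ip->saddr = s->src;
        ip->daddr = s->dst;
        ip->check = 0;
        ip->check = ip_checksum(ip);

        udp->source = s->sport;
        udp->dest   = s->dport;
        udp->check  = 0;           /* 0 == "no checksum" for IPv4 UDP */

        rtp->seq       = htons((uint16_t)(ntohs(rtp->seq) + s->seq_off));
        rtp->timestamp = htonl(ntohl(rtp->timestamp) + s->ts_off);
        rtp->ssrc      = s->ssrc;
    }

The kernel-driver version would do the same stores on the packet buffer
before it re-enters the routing path.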

I am working on a kernel driver for this substitution, so I
can put it directly in the routing code, and avoid all the
excursions into user mode.

But I would like to see what the memory caches are really doing
before I start optimising.

-- mrr