From: Benny Amorsen on
Morten Reistad <first(a)last.name> writes:

> The applications are coded using all three main apis, select(),
> poll() and monster numbers of synchronous threads. They behave
> equally well, differences are too small to be significant.

I am a little bit surprised that they behave equally well. Asterisk (the
only one I have looked at) seems to make an extra system call per packet
according to strace, and I would have expected that to have an impact.

> I am working on a kernel driver for this substitution, so I
> can put it directly in the routing code, and avoid all the
> excursions into user mode.

It seems like the splice system call ought to be able to do this, but I
don't think it works for UDP, and it probably isn't good for small
payloads like this. Conceptually it seems like the right path...
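
Roughly what that would look like, if socket-to-pipe splice() ever
worked for UDP (an untested sketch; the length and names are just
placeholders):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Relay one datagram from in_sock to out_sock via a pipe, without the
 * payload ever being copied into user space. */
static int relay_one(int in_sock, int out_sock, int pipefd[2])
{
    ssize_t n = splice(in_sock, NULL, pipefd[1], NULL,
                       64 * 1024, SPLICE_F_MOVE);
    if (n <= 0)
        return -1;
    /* Note there is no chance to rewrite any headers on this path;
     * that would still have to happen inside the kernel. */
    if (splice(pipefd[0], NULL, out_sock, NULL, n, SPLICE_F_MOVE) != n)
        return -1;
    return 0;
}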


/Benny
From: Morten Reistad on
In article <m3r5ti3oft.fsf(a)ursa.amorsen.dk>,
Benny Amorsen <benny+usenet(a)amorsen.dk> wrote:
>Morten Reistad <first(a)last.name> writes:
>
>> The applications are coded using all three main apis, select(),
>> poll() and monster numbers of synchronous threads. They behave
>> equally well, differences are too small to be significant.
>
>I am a little bit surprised that they behave equally well. Asterisk (the
>only one I have looked at) seems to make an extra system call per packet
>according to strace, and I would have expected that to have an impact.

Asterisk actually performs best. But the user mode code represents
less than 1/20th of the cpu time expended, so user mode optimisations
will not have much impact.

As I said in an earlier posting, 1/20th is used in user mode,
1/10th in task switching, 1/4th in interrupt code (800 megabit
two-way in small-packet mode) and the remaining 2/3 inside the
Linux kernel.

>> I am working on a kernel driver for this substitution, so I
>> can put it directly in the routing code, and avoid all the
>> excursions into user mode.
>
>It seems like the splice system call ought to be able to do this, but I
>don't think it works for UDP, and it probably isn't good for small
>payloads like this. Conceptually it seems like the right path...

The bottleneck here isn't in user mode code at all. That was why
we tried FreeBSD as a test. It was not much different: somewhat
tighter code and somewhat coarser locks, but not that big a
difference.

-- mrr
From: nmm1 on
In article <qrkmp6-pbh.ln1(a)laptop.reistad.name>,
Morten Reistad <first(a)last.name> wrote:
>
>I kinda knew that. But I was still surprised at HOW big the results
>were. An 8-way Xeon machine with 24 MB of L2 cache handles 2800
>streams; a 16-way machine with 64 MB of L2 cache handles 22000
>streams. Same processor, same clock rate, same PCI busses; 4-way
>hyperchannel instead of 1-way, and one extra south bridge.

That fails to surprise me. My standard recommendation is that Intel
can handle 2 sockets but not 4, and AMD 4 but not 8.

>Are the memory and cache access counters on the Xeons accessible from a
>Linux environment?

As far as I know, they are still "work in progress" except for the
Itanium. Part of the problem is that Intel and AMD won't disclose
the interfaces.

>The Linux irq/dma balancer tuning is somewhere between witchcraft and
>black magic. You can nudge it so it performs well, but on the next boot
>it misperforms. It seems to need a few billion interrupts to actually
>get a good picture of where the interrupts are likely to be.

It's not something I looked at, but that doesn't surprise me, either.

>>One of the keys is to separate the
>>interrupt handling from the parallel applications - and I mean on
>>separate cores.
>
>This is one thing Linux does very well. But we see that the actual
>user mode code takes a very small amount of CPU, unless we are transcoding.

Then what you want to do is to separate the kernel threads from
the interrupt handling, and I doubt that you can. Interestingly,
that is where I had the main problems with Solaris - in the untuned
state, a packet could take over a second to get from the user code
to the device. And that was on a 72-CPU system with one (count it,
one) user process running. God alone knows what happened to it in
between.
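
On Linux you can at least steer the device interrupts by hand and keep
the application off those cores. A rough, untested sketch (the IRQ
number has to be read off /proc/interrupts, and irqbalance must be
stopped first or it will undo the change):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Route IRQ 'irq' to core 'irq_cpu' and pin the calling process onto
 * all the remaining cores. */
static int separate_irq_from_app(int irq, int irq_cpu)
{
    char path[64];
    FILE *f;
    cpu_set_t set;
    int cpu, ncpus = sysconf(_SC_NPROCESSORS_ONLN);

    snprintf(path, sizeof path, "/proc/irq/%d/smp_affinity", irq);
    f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%x\n", 1u << irq_cpu);   /* hex CPU mask */
    fclose(f);

    CPU_ZERO(&set);
    for (cpu = 0; cpu < ncpus; cpu++)
        if (cpu != irq_cpu)
            CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set);
}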

>I have tested media with SER/rtpproxy, Asterisk, and various RTP code
>written in-house.
>
>It makes a huge difference to do simple RTP NAT/mixing in a kernel
>driver, either as iptables programming or as custom code.

Yes. There was a time when you could transfer from SunOS to HP-UX
at 4 times the speed of the reverse direction (or the other way
round - I forget).


Regards,
Nick Maclaren.
From: Kim Enkovaara on
nmm1(a)cam.ac.uk wrote:
> Morten Reistad <first(a)last.name> wrote:
>> Are the memory and cache access counters on the Xeons accessible from a
>> Linux environment?
>
> As far as I know, they are still "work in progress" except for the
> Itanium. Part of the problem is that Intel and AMD won't disclose
> the interfaces.

My understanding is that they are quite well supported. For example,
see the "event type" section of the oprofile documentation
(http://oprofile.sourceforge.net/docs/).

There is also a new tool called perf for Linux, but I have not tried
it yet.
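
For what it is worth, the raw interface underneath perf is the
perf_event_open() system call; a small, untested sketch that reads the
hardware cache-miss counter for the calling process (needs 2.6.31 or
later):

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    struct perf_event_attr attr;
    uint64_t misses;
    int fd;

    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_CACHE_MISSES;
    attr.disabled = 1;                /* start the counter stopped */

    /* No glibc wrapper, so call the syscall directly:
     * this process (pid 0), any CPU (-1), no group, no flags. */
    fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) {
        perror("perf_event_open");
        return 1;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... run the workload of interest here ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    if (read(fd, &misses, sizeof misses) == sizeof misses)
        printf("cache misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}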

--Kim
From: Terje Mathisen on
Morten Reistad wrote:
> In article<LaKdnRhs_a_zRVXXnZ2dnUVZ8lydnZ2d(a)lyse.net>,
> Terje Mathisen<Terje.Mathisen(a)tmsw.no> wrote:
>> Did you stumble over the edge of a 3.5 MB working set cliff?
>
> I just cannot see what it is that has such effects per CPU.
> We have seen similar effects (order of magnitude) with and
> without hyperchannel linkage between the caches.
>
> This class of problem is one of running large numbers of identical,
> simple tasks, where the multiprogramming is done by the kernel and
> common driver software.
>
> There was obviously some edge there, but we are very clearly
> measuring Linux, not the application, because we can swap the

OK

[snip]

> The applications are coded using all three main apis, select(),
> poll() and monster numbers of synchronous threads. They behave
> equally well, differences are too small to be significant.
>
> These tasks are just bridging RTP streams of 160 octets payload,
> 24 octets RTP, 8 octets UDP, 20 IP and 16 Ethernet (2 extra for
> alignment); 228-octet frames, 50 per second.
>
> In the IP header the TOS, TTL, source and destination addresses
> plus header checksum are changed, UDP sees ports and checksum change,
> and RTP sees sequence, timestamp and SSRC change.

OK
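
For concreteness, the substitution being described might look roughly
like this in C (the struct and the per-stream table are illustrative
only; of the 24 RTP octets, the rewritten fields all live in the
12-octet fixed header):

#include <stdint.h>
#include <arpa/inet.h>
#include <linux/ip.h>
#include <linux/udp.h>

struct rtp_hdr {                 /* RFC 3550 fixed header, 12 octets */
    uint8_t  vpxcc;              /* version, padding, extension, CSRC count */
    uint8_t  mpt;                /* marker + payload type */
    uint16_t seq;
    uint32_t timestamp;
    uint32_t ssrc;
};

/* Hypothetical per-stream rewrite state. */
struct stream_map {
    uint32_t new_saddr, new_daddr;   /* network byte order */
    uint16_t new_sport, new_dport;   /* network byte order */
    uint32_t new_ssrc;
    uint16_t seq_offset;
    uint32_t ts_offset;
};

static void rewrite_packet(struct iphdr *ip, struct udphdr *udp,
                           struct rtp_hdr *rtp, const struct stream_map *m)
{
    ip->tos   = 0xB8;            /* e.g. EF DSCP for voice */
    ip->ttl   = 64;
    ip->saddr = m->new_saddr;
    ip->daddr = m->new_daddr;
    ip->check = 0;               /* header checksum recomputed afterwards */

    udp->source = m->new_sport;
    udp->dest   = m->new_dport;
    udp->check  = 0;             /* 0 = "no checksum" is legal for UDP/IPv4 */

    rtp->seq       = htons(ntohs(rtp->seq) + m->seq_offset);
    rtp->timestamp = htonl(ntohl(rtp->timestamp) + m->ts_offset);
    rtp->ssrc      = m->new_ssrc;
}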
>
> I am working on a kernel driver for this substitution, so I
> can put it directly in the routing code, and avoid all the
> excursions into user mode.

That will be interesting...
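
One obvious shape for such a driver would be a netfilter PRE_ROUTING
hook. A very rough, untested sketch against the 2.6-era API, with the
port range and all of the actual substitution left as placeholders:

#include <linux/module.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/ip.h>
#include <linux/udp.h>

static unsigned int rtp_hook(unsigned int hooknum, struct sk_buff *skb,
                             const struct net_device *in,
                             const struct net_device *out,
                             int (*okfn)(struct sk_buff *))
{
    struct iphdr *iph;
    struct udphdr *udph;

    if (!skb || !pskb_may_pull(skb, sizeof(*iph) + sizeof(*udph)))
        return NF_ACCEPT;

    iph = ip_hdr(skb);
    if (iph->protocol != IPPROTO_UDP || iph->ihl != 5)
        return NF_ACCEPT;        /* sketch: ignore IP options */

    udph = (struct udphdr *)(iph + 1);
    if (ntohs(udph->dest) < 10000 || ntohs(udph->dest) > 20000)
        return NF_ACCEPT;        /* hypothetical RTP port range */

    /* Here: look the stream up in a table, rewrite TOS/TTL/addresses,
     * UDP ports, RTP sequence/timestamp/SSRC, fix the checksums
     * (csum_replace4() and friends), and let routing deliver it. */
    return NF_ACCEPT;
}

static struct nf_hook_ops rtp_ops = {
    .hook     = rtp_hook,
    .owner    = THIS_MODULE,
    .pf       = PF_INET,
    .hooknum  = NF_INET_PRE_ROUTING,
    .priority = NF_IP_PRI_FIRST,
};

static int __init rtp_init(void)  { return nf_register_hook(&rtp_ops); }
static void __exit rtp_exit(void) { nf_unregister_hook(&rtp_ops); }

module_init(rtp_init);
module_exit(rtp_exit);
MODULE_LICENSE("GPL");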
>
> But I would like to see what the memory caches really are
> doing before I start optimising.

AFAIK, there is at least one portable (Linux) library that gives you
access to the performance monitoring counters on several CPU
architectures.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"