From: David Schwartz on
On May 10, 8:37 am, Chris Friesen <cbf...(a)mail.usask.ca> wrote:

> It seems like you are limiting a "properly designed multithreaded
> application" to those that use a "pool of worker threads" model.

No, no, not at all. However, a properly designed multithreaded
application would use a pool of worker threads unless another model
provided, on balance, more benefits than drawbacks compared to that
approach. That is, no properly designed multithreaded application can
be worse than a pool-of-worker-threads design, because if it were
worse, the choice not to use a pool of worker threads was improper.

> "properly designed" doesn't always mean "as fast as possible at
> runtime".  It also seems like by definition these "properly designed"
> apps must be rarely doing anything that would block, because that causes
> a context switch.  Does that mean that they need to use asynch I/O, or
> that such apps rarely have to wait for data from a disk?

Oh no, neither. You can certainly wait for data from a disk.

> Lastly, it
> seems like they must rarely execute system calls, because the cost of a
> thread-to-thread context switch isn't much more than that of a syscall.

I don't follow this argument. What does one have to do with the other?
You frequently have no choice but to do very expensive things. But
when you do make expensive system calls because you have to, that's
you making forward progress at full speed. That's good.

> Finally, the worker pool design doesn't rule out processes instead of
> threads.  In general it is possible to set up a multi-process model to
> mimic a multi-thread model, using shared memory and passing around file
> descriptors.  Once the setup is completed, at runtime the primary
> performance difference between the two models is the fact that you need
> to flush the TLB on a context switch between processes and not between
> threads.

I agree, that's just not very practical right now for the reasons
discussed elsewhere in this thread. But that might well be the best of
both worlds, allowing you to get most of the benefits of multi-
threading while avoiding contention when allocating and freeing
resources you don't need shared.

As I said, it would be a wonderful thing if someone would put the
effort needed into developing a library and toolkit for using this
model. It's much more doable now that 64-bit OSes are available, since
it helps to allocate a large chunk of address space before 'fork'ing
processes off so that you can dereference pointers in shared memory
without needing to manually alias them.

DS
From: Golden California Girls on
Scott Lurndal wrote:
> David Schwartz <davids(a)webmaster.com> writes:
>> On May 9, 4:08 pm, sc...(a)slp53.sl.home (Scott Lurndal) wrote:
>>
>>> Threads are a performance win because they don't need to flush the TLB's
>>> on context switches between threads in the same process.
>> Nope. That's like saying that cars are faster than bicycles because
>> they don't have pedals. While it's true that threads are a performance
>> win and it's true that context switches between threads of the same
>> process are faster than context switches between threads from
>> different processes, the latter does not cause the former.
>>
>>> A thread context switch is enormously less
>>> expensive than a process context switch. The larger the page size,
>>> the better.
>> It doesn't matter. In any sensible threaded application, there will be
>> so few context switches that making them faster will be lost in the
>> noise.
>
> I've never seen a thread that doesn't require a context switch, aside
> from the user-level M-N threads in the old SVR4.2MP threads library, and
> that was also a context switch, just done in the library rather than the
> kernel.
>
> If you degenerate your system to a single thread per core, and only have
> one process (i.e. a real-time embedded) system, then there won't be many
> context switches between threads.
>
> However, in real-world threaded applications there _are_ context switches,
> and there are _many_ context switches, and a thread context switch is
> more efficient than a process context switch.

I suspect David has been fortunate to have had only special-case uses of
threads and not to have run into the general case. Suppose the problem
to be solved requires a large per-thread stack (a recursive solution,
perhaps) and a large number of threads, enough that the stacks of
thread 1 and thread N span more physical memory than the available
cache can cover ...

From: Scott Lurndal on
David Schwartz <davids(a)webmaster.com> writes:
>On May 10, 7:32 am, sc...(a)slp53.sl.home (Scott Lurndal) wrote:
>
>> However, in real-world threaded applications there _are_ context switches,
>> and there are _many_ context switches, and a thread context switch is
>> more efficient than a process context switch.
>
>Why would there be many context switches? All the reasons processes
>need to switch contexts don't apply to threads. For example:

Simple. There are generally more threads than there are processing units,
and all the threads want to accomplish something.

There is generally more than one 'process' running on any given unix system
at any given time. On my 4-core ubuntu box, I count 115 kernel threads. On
my 192 core RHEL box, I count over 1000 kernel threads (watchdog, events, kintegrityd,
kswapd, crypto, SCSI error handlers, etc.)

I assume your system is performing I/O? kswapd (which reclaims memory,
writing dirty pages back to the device when it must) will run, resulting
in a context switch. SoftIRQ handling may require a context switch.

Then there are various system daemons (ntpd, xinetd, init, getty, nfs daemons).
(I exclude the 100 processes started by GNOME per logged-in user, since a real
server wouldn't have GNOME running, nor would it even have a head.)

The reason for all these threads _is_ that thread context switches are cheap.

Serving web pages (cf an earlier message in this thread) is a _solved_ problem[*]
and there is no need to craft a highly efficient web server anymore.

[*] scale out, not up.
From: Scott Lurndal on
David Schwartz <davids(a)webmaster.com> writes:
>On May 10, 7:40 am, Rainer Weikusat <rweiku...(a)mssgmbh.com> wrote:
>
>> Dedicating threads to particular subtasks of something which is
>> supposed to be done is also a sensible way to design 'a threaded
>> application', just one which is rather geared towards simplicity of
>> the implementation than maximum performance.
>
>You can always trade-off performance for something else. The point is
>that you *have* the performance in the first place and where that
>performance comes from.
>
>> Because a thread context
>> switch is cheaper than a process context switch, such simple designs
>> are useful for a wider range of tasks when using threads instead of
>> processes.
>
>Compared to the ability to avoid context switches entirely, the

Which _cannot be done_ on any reasonable modern general purpose
unix-like operating system.

>relative cost difference of process versus thread context switches is
>lost in the noise in realistic scenarios. Of course, things that only

Have you ever measured this? I have, several times, on various
architectures from mainframes to MPP boxes to hypervisors. The cost
difference is measurable and hardly in the noise.

>make things better are good, and this is certainly a small benefit to
>threads. But it isn't a game changer. On the other hand, the ability
>to reduce the number of context switches by an order of magnitude
>(because you never have a thread running that can't access the memory
>or file descriptor needed to make forward progress) *is* a game

So you never need to fault a page in? Context switch.
So you never need to read or write the file descriptor? Context switch.

scott
From: Scott Lurndal on
David Schwartz <davids(a)webmaster.com> writes:
>On May 10, 8:37 am, Chris Friesen <cbf...(a)mail.usask.ca> wrote:
>
>> It seems like you are limiting a "properly designed multithreaded
>> application" to those that use a "pool of worker threads" model.
>
>No, no, not at all. However, a properly designed multithreaded
>application would use a pool of worker threads unless another model
>provided, on balance, more benefits than drawbacks compared to that
>approach. That is, no properly designed multithreaded application can
>be worse than a pool-of-worker-threads design, because if it were
>worse, the choice not to use a pool of worker threads was improper.
>
>> "properly designed" doesn't always mean "as fast as possible at
>> runtime". It also seems like by definition these "properly designed"
>> apps must be rarely doing anything that would block, because that causes
>> a context switch. Does that mean that they need to use asynch I/O, or
>> that such apps rarely have to wait for data from a disk?
>
>Oh no, neither. You can certainly wait for data from a disk.
>
>> Lastly, it
>> seems like they must rarely execute system calls, because the cost of a
>> thread-to-thread context switch isn't much more than that of a syscall.
>
>I don't follow this argument. What does one have to do with the other?
>You frequently have no choice but to do very expensive things. But
>when you do make expensive system calls because you have to, that's
>you making forward progress at full speed. That's good.
>
>> Finally, the worker pool design doesn't rule out processes instead of
>> threads. In general it is possible to set up a multi-process model to
>> mimic a multi-thread model, using shared memory and passing around file
>> descriptors. Once the setup is completed, at runtime the primary
>> performance difference between the two models is the fact that you need
>> to flush the TLB on a context switch between processes and not between
>> threads.
>
>I agree, that's just not very practical right now for the reasons
>discussed elsewhere in this thread. But that might well be the best of
>both worlds, allowing you to get most of the benefits of multi-
>threading while avoiding contention when allocating and freeing
>resources you don't need shared.

This exact model has been de rigueur in the mainframe world since the
1960s. On Unix, use Tuxedo or other MQ middleware, or System V message
queues, or POSIX message queues.

This is also the model used by 90% of the web servers in existence. An
Alteon (Nortel, Cisco, etc) load balancer queues requests to a pool of
servers.


>
>As I said, it would be a wonderful thing if someone would put the
>effort needed into developing a library and toolkit for using this
>model. It's much more doable now that 64-bit OSes are available, since

I see no advantage to this model for any application. Why do you
think a multiple process (hence multiple address space) model is
superior to a multi-threaded process? If you're worried about
the cost of poll on a large pool of file descriptors, then you've
posed your problem poorly and should rethink your solution.

>it helps to allocate a large chunk of address space before 'fork'ing
>processes off so that you can dereference pointers in shared memory
>without needing to manually alias them.

c'est what?

s