From: Morten Reistad on
In article <hqpqks$q26$1(a)news.eternal-september.org>,
Stephen Fuld <SFuld(a)Alumni.cmu.edu.invalid> wrote:
>On 4/22/2010 3:04 AM, Morten Reistad wrote:
>> In article<86666a83-4bed-472c-aacd-9fc6ef47e9e6(a)k33g2000yqc.googlegroups.com>,
>> MitchAlsup<MitchAlsup(a)aol.com> wrote:
>>> On Apr 21, 11:02 am, Robert Myers<rbmyers...(a)gmail.com> wrote:
>>>> Even though the paper distinguishes between technical and commercial
>>>> workloads, and draws its negative conclusion only for commercial
>>>> workloads, it was interesting to me that, for instance, Blue Gene went
>>>> the same direction--many simple processors--for a technical workload so
>>>> as to achieve low power operation.
>>>
>>> Reading between the lines, commercial and DB workloads are better
>>> served by slower processors accessing a thinner cache/memory hierarchy
>>> than by faster processors accessing a thicker cache/memory hierarchy.
>>> That is: a commercial machine is better served with a larger first-level
>>> cache backed by a large second-level cache running at slower frequencies,
>>> while a technical machine would be better served with smaller first-level
>>> caches, a medium second-level cache and a large third-level cache
>>> running at higher frequencies.
>>
>> I can confirm this from benchmarks for real-life workloads for
>> pretty static web servers, media servers and SIP telephony systems.
>> The cache size means everything in this context.
>
>Isn't this the kind of workload (relatively small instruction footprint,
>lots of cache misses, and lots of independent threads)
>that could benefit from a multi-threaded CPU?

Lots of these Internet services have such a signature. Relatively
light processing compared to i/o, but simple transforms are done
on the data. The instruction footprint is small, with relatively good
short-term (nanoseconds) locality on the data, but the mid-term
(milliseconds) working set expands linearly, or more, with load.

This applies to web caches, real media broad-/manycasters, telephony,
and all kinds of authentication/profile servers (from dhcp to ldap/radius).

They either end up cache-limited, or they have enough processing
for the cpus to get loaded and the power usage to go up.

>> Actually, the Hyperchannel cache interconnects work very nicely to
>> make all the on-chip caches work as a system-wide cache, not as
>> die-wide. I may even suggest L4 cache attachment to static on-chip
>> hyperchannel memory; like a Xeon with no CPUs.
>
>Another advantage of this is that this "extra" chip provides more pins
>thus allowing higher system memory bandwidth for those applications that
>need it.

Yep. The 8-way Xeon (4 chips, multiple hyperchannels) has
fantastic performance for these services. The 32-way, with no
off-chip cache-to-cache links and a smaller L3 cache, performs
abysmally in comparison. But it makes a good transcoding and
media mixing engine.

-- mrr
From: Morten Reistad on
In article <hqpv0v$3vq$1(a)smaug.linux.pwf.cam.ac.uk>, <nmm1(a)cam.ac.uk> wrote:
>In article <hqpqks$q26$1(a)news.eternal-september.org>,
>Stephen Fuld <SFuld(a)Alumni.cmu.edu.invalid> wrote:
>>On 4/22/2010 3:04 AM, Morten Reistad wrote:

>>Isn't this the kind of workload (relatively small instruction footprint,
>>lots of cache misses, and lots of independent threads)
>>that could benefit from a multi-threaded CPU?
>
>Not really. That's a common mistake. There are two reasons:
>
> 1) Running lots of threads usually slows each thread down by
>causing conflicts in some of the shared components. Some designs
>do better and some worse, of course.

Some tasks scale well and have good locality, and we can let
Linux handle the coordination. Thus, the "threads" are full
processes with just minimal semaphore etc. handling.
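
Something like this minimal POSIX sketch, assuming Linux; the
worker body is a placeholder, not our actual service code:

    // One full process per hardware context, shared-nothing; the
    // kernel does the coordination, no cross-worker locking needed.
    #include <sys/wait.h>
    #include <unistd.h>
    #include <thread>

    int main() {
        unsigned n = std::thread::hardware_concurrency();
        for (unsigned i = 0; i < n; ++i) {
            if (fork() == 0) {
                // Worker: would serve requests from its own listener;
                // only its own address space, minimal semaphore traffic.
                _exit(0);
            }
        }
        while (wait(nullptr) > 0) { }   // reap the workers
    }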

> 2) The reason that throughput isn't pro-rata to the number of
>threads is that you also need cache size and bandwidth and memory
>bandwidth (and number of outstanding requests) pro rata to the
>number of threads. Now, back in the real world ....

Exactly. But there are three things that help enormously.
Interrupt coalescing, in Linux since 2.6.24, gives an order of
magnitude more i/o performance on networks and similar devices.
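
The shape of the trick, sketched generically (hypothetical driver
hooks, stubbed so it compiles; this is not the real Linux/NAPI
interface):

    #include <cstddef>

    // Hypothetical driver hooks, stubbed for the sketch.
    static void disable_device_interrupts() { }
    static void enable_device_interrupts()  { }
    static bool process_one_packet()        { return false; }

    constexpr std::size_t BUDGET = 64;  // assumed per-interrupt batch size

    void on_network_interrupt() {
        disable_device_interrupts();       // stop the interrupt storm
        std::size_t done = 0;
        while (done < BUDGET && process_one_packet())
            ++done;                        // drain a batch per interrupt
        enable_device_interrupts();        // re-arm only when drained
    }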

Inter-cache links help with the cache flushes. It is
amazing how much this helps. It leads me to suggest handling
the whole cache more intelligently; tasks that are not
cpu-bound become cache-bound.

And, this is from the real world. I compare such systems every day.

-- mrr
From: "Andy "Krazy" Glew" on
On 4/23/2010 12:51 AM, Andy "Krazy" Glew wrote:
> There are two sorts of database workloads:
> ...
>
> The second class tends to do things like joins. Which often have parts
> that look like
>
> LOOP small partition of left that stays in cache
> traverse left Btrees
> LOOP over all right that misses cache
> traverse Btree for right
> END LOOP
> END LOOP
>
> for various permutations of the inner and outer loops.
>
> For such JOINs, the inner loop may get clogged up with the cache misses,
> but the next iteration of the outer loop is essentially independent.
> E.g. if you are joining records, and not computing a sum. Even if you
> are computing a sum, you can parallelize all except the sum dependence.
>
>
> This is the classic example of code that does well with loop based
> speculative multithreading.
>
> Yes, software can do the parallelization too. And it often does. But it
> still tends to leave large chunks that look like the above. Which SpMT
> can handle.

By the way: in many ways such JOIN code is best expressed, in software, with explicit but lightweight multithreading:

LOOP small partition of left that stays in cache
    traverse left Btrees
    LOOP over all right that misses cache
        FORK a thread to traverse Btree for right
    END LOOP
END LOOP

where "best expressed" means natural, easy to read, etc. (The non-threaded version above is okay to read; but I have
seen attempts by compilers AND human programmers to software pipeline these things, i.e. interleaving separate index
tree traversals in the same loop. Mostly unsuccessful. IMHO one of the most important things about multithreaded code
is when it is easier and more natural to express than serial code.)

I.e., a more-outer serial loop that manages cache locality, and a more-inner loop that can be expressed as parallel.

This is a common pattern: serial outer, parallel inner, with the serial code managing locality. Probably with even
more-outer loops parallelizing across multiple processors, multiple machines, etc.
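
In rough C++ terms the FORK version might look like the sketch below, with std::map standing in for the Btrees and
std::async for the lightweight FORK; the names are illustrative assumptions, not any real database engine's API:

    #include <cstdio>
    #include <functional>
    #include <future>
    #include <map>
    #include <vector>

    using Btree = std::map<int, int>;   // stand-in index: key -> payload

    // One right-side traversal; in real code this is the walk that misses cache.
    static int probe(const Btree& right, int key) {
        auto it = right.find(key);
        return it == right.end() ? -1 : it->second;
    }

    int main() {
        Btree left{{1, 10}, {2, 20}, {3, 30}};
        Btree right{{2, 200}, {3, 300}};

        // Outer loop: a small partition of left that stays in cache.
        std::vector<std::future<int>> forks;
        for (const auto& kv : left) {
            // Inner loop: FORK a traversal of the right Btree.
            forks.push_back(std::async(std::launch::async, probe,
                                       std::cref(right), kv.first));
        }
        for (auto& f : forks)           // the synch point
            std::printf("joined: %d\n", f.get());
    }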

Why so? Management of locality. You could express everything in the maximally parallel form, but then your thread
scheduler, your run-time, has to take into account everything related to parallelism and performance. Nikhil's comment
on dataflow for serial machines versus parallel dataflow was that the von Neumann "bottleneck" provides a lot of clues to
things like load balancing, etc. There is more than enough parallelism. The problem is how to manage it. In the
semi-serial form, your programmer can manage it.

But... if you use the semi-serial/parallel form, you don't ALWAYS want to spawn a new thread. Nor do you ALWAYS want to
use task queues over existing threads. Rather, you want to do something like



LOOP small partition of left that stays in cache
    traverse left Btrees
    LOOP over all right that misses cache
        IF the hardware situation is such that a new thread can run
           efficiently, and would improve performance THEN
            spawn a new thread
        ELSIF there are too many threads THEN
            kill a thread (perhaps after it has finished its current
            task, perhaps not)
        END IF

        put an item on the task queue for one of the threads to run:
            traverse Btree for right
    END LOOP
END LOOP


Two problems with this:


(1) the IF statement in the innermost loop is really hardware- and workload-specific, and is also often dynamic. It
depends on stuff like CPU utilization, which the software has trouble handling.

Basically, it is something dynamic. Like managing a cache. Like a branch predictor.
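
To make (1) concrete: the sketch below is about the best software can do statically. The threshold is an assumed
stand-in for "the hardware situation", and it ignores exactly the dynamic facts (cache pressure, memory bandwidth,
utilization) that hardware can see and software mostly cannot:

    #include <atomic>
    #include <thread>

    std::atomic<unsigned> live_threads{0};

    bool worth_spawning() {
        // Static guess at a dynamic question: spawn only while idle
        // hardware contexts remain. A branch-predictor-like mechanism
        // could adapt this; a compile-time constant cannot.
        return live_threads.load() < std::thread::hardware_concurrency();
    }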


(2) "Putting an item on a task queue" is basically creating a continuation or closure. Which is a wonderful programming
style. But which is not something available to people in many programing languages, like C and C++. I.e. you have to
roll your own. Also, preparing a continuation or closure is often expensive. It's often a lot cheaper to just roll
into the next iteration of the loop than it is to prepare a task packet.

So you get into the issue "Do I prepare a task packet every 4 loop iterations? Every 16? ..."


Whatever the programmer decides will change (a) by the time you are on a different machine, but also, worse, (b) depending
on workload and on the type of data structures or database you are looking at.
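
Rolled by hand in C++, a task packet and a guessed grain size might look like this sketch; the queue, the GRAIN
constant, and the integer keys are all illustrative assumptions:

    #include <algorithm>
    #include <cstddef>
    #include <deque>
    #include <map>
    #include <mutex>
    #include <vector>

    using Btree = std::map<int, int>;    // same stand-in index as above

    struct TaskPacket {                  // the hand-rolled closure
        const Btree* right;              // captured environment
        std::vector<int> keys;           // a batch of loop iterations
    };

    constexpr std::size_t GRAIN = 16;    // "every 16 iterations" guess;
                                         // machine- and workload-dependent

    std::deque<TaskPacket> task_queue;
    std::mutex queue_lock;

    void enqueue_probes(const Btree& right, const std::vector<int>& keys) {
        for (std::size_t i = 0; i < keys.size(); i += GRAIN) {
            std::size_t end = std::min(i + GRAIN, keys.size());
            TaskPacket p;                // preparing the packet costs work
            p.right = &right;
            p.keys.assign(keys.begin() + i, keys.begin() + end);
            std::lock_guard<std::mutex> g(queue_lock);
            task_queue.push_back(std::move(p));
        }
    }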


My point is that the decisions for when to go parallel and when not are dynamic and hardware dependent.


In some ways things like speculative multithreading amount to providing efficient support for closures and continuations
- but only so long as those closures and continuations can fit within some register storage. Not generic support,
where the data has to spill to memory.




I personally prefer the EXPLICIT parallelism form of the code. Fork as many threads as possible, and let the
run-time, both hardware and software, take care of it.

It's especially nice because those explicit threads are explicitly indicated not to have cross-thread dependencies,
except at synch points. Therefore, you don't need all of the memory speculation hardware that OOO has.

I.e. I like explicit parallelism because it allows us to avoid unnecessary hardware.
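
A minimal sketch of that explicit form in C++ (the data and the transform are placeholders; std::thread stands in
for whatever lightweight fork the hardware might offer). Each thread owns a disjoint slice, the only synch point is
the join, and there is nothing left for memory speculation hardware to do:

    #include <cstddef>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<double> data(1 << 20, 1.0);
        unsigned n = std::thread::hardware_concurrency();
        if (n == 0) n = 1;
        std::vector<std::thread> pool;
        std::size_t chunk = data.size() / n;
        for (unsigned t = 0; t < n; ++t) {
            std::size_t lo = t * chunk;
            std::size_t hi = (t + 1 == n) ? data.size() : lo + chunk;
            // Disjoint slice per thread: independence by construction,
            // declared explicitly rather than discovered speculatively.
            pool.emplace_back([&data, lo, hi] {
                for (std::size_t i = lo; i < hi; ++i)
                    data[i] = data[i] * 2.0 + 1.0;
            });
        }
        for (auto& th : pool)
            th.join();                   // the one synch point
    }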

SpMT on single-threaded code is just a way of building such a dynamic, hardware- and workload-dependent thread
scheduling system.

In many ways, I would like to have such dynamic, hardware- and workload-dependent thread scheduling for fine-grained,
explicitly parallel lightweight threaded code. That is nicer.



---


In my mind the debate is not between

Heavyweight single-threaded hardware OOO

vs

Manycore explicit parallelism on a single chip.



It is, rather, between statically expressed parallelism and dynamic parallelism.


Where the dynamic parallelism can be managed (a) at compile time, (b) by dynamic JIT-like software, but also (c) via
hardware mechanisms. Where the latter are more likely to be able to respond to things like current CPU utilization, etc.




One of the reasons I am so excited about GPU microarchitectures is that they have found a new point in this space.
They have fairly sophisticated dynamic schedulers in hardware - e.g. the "Ultra Threaded Dispatch Processor" - but the
units of scheduling are heavier than individual instructions in OOO, albeit much lighter than threads and processes on
conventional systems.


From: Morten Reistad on
In article <iIadnbiYmNH4ak3WnZ2dnUVZ_vCdnZ2d(a)speakeasy.net>,
Rob Warnock <rpw3(a)rpw3.org> wrote:
>Morten Reistad <first(a)last.name> wrote:
>+---------------
>| Interrupt coalescing, in Linux since 2.6.24, gives an order of
>| magnitude more i/o performance on networks and similar devices.
>+---------------
>
>While your first and last phrases are certainly quite correct,
>the middle one gives the impression that Linux invented interrupt
>coalescing. Actually, it came *quite* late to that game!! E.g., I
>was doing interrupt coalescing in terminal device drivers[1] in
>TOPS-10 circa 1972 [and, yes, it *seriously* improved terminal I/O
>performance!!]; in Fortune Systems's FOR:OS in 1985; in SGI's Irix
>network code circa 1990; etc. And I certainly wasn't the only one.
>Linux is reinventing a very old wheel here.

Yep, Linux is picking up old tricks. But it is getting pretty
good at it, and has adopted a pretty long list of old tricks by
now. I just point out how dramatic such stuff can be for real-life
performance. It isn't just for theorists; this is about real
kilowatts in real internet services with tens or hundreds of
thousands of daily users. And as much of an old BSD fan as I may be,
Linux is where the big open source internet applications happen.

I was also referring to Linux because I have done extensive benchmarking
on a lot of internet-based service farms, and time and time again
we end up identifying either memory performance/cache size or
cpu power per watt as the limiting factors. All the internet services
we are dealing with either are extremely frugal on system power or
handle 64+ processors reasonably well, as long as the logic they
are configured with allows parallelism.

We had a stunning comparison of a 32-way system and an 8-way system,
with almost exactly the same clock rates and per-processor dhrystone
performance, but with cache size, cache interconnect and interrupt
coalescing all stacked against the 32-way system. Otherwise we had the
same kernel version, and the same application.

The 8-way system handles 12800 streams; the 32-way system struggles
with 2400. In terms of cpu power they are almost 5:1 (independent
dhrystone sums), but in terms of throughput they are almost 1:5,
in the other direction. That is more than a 20:1 performance hit due
to these three factors.

-- mrr

From: nmm1 on
In article <4BD15197.3050202(a)patten-glew.net>,
Andy \"Krazy\" Glew <ag-news(a)patten-glew.net> wrote:
>
>So, it's happened again. Just as OOO CPU research was the poor cousin
>during the early days of the RISC revolution, so SpMT is the poor
>cousin in the early days of MultiCore and ManyCore. If M&M run out of
>steam, I hope that Haitham Akkary's SpMT research will be there, still
>ongoing, to pick up the pieces.

Er, what have I missed?

"A Minimal Dual-Core Speculative Multi-Threading Architecture"

That abstract states a potential dual-core gain of 20% - not
bad, but not exciting. Now, extensive experience with branch
prediction and preloading analyses indicates that the gain with
number of parallel lookaheads is rarely much better than log(N),
so that isn't going to give more than a factor of two or so.

Multi-core, on the other hand, has a potential LARGE gain, once
people bite the bullet and start designing software for it. Yes,
that's a big hurdle, and we know that some applications can't
make use of it, but that's a hell of a lot more scalable. We
know that some applications can get gains in the hundreds, and
a few can do even better.


Regards,
Nick Maclaren.