From: MitchAlsup on
On Apr 22, 3:15 pm, Stephen Fuld <SF...(a)alumni.cmu.edu.invalid> wrote:
> I don't see how you get a multi-core design with the same transistor
> count as a multi-threaded one.  I have seen numbers of 5% additional
> logic for a second thread.  Mostly you duplicate the registers and add a
> little logic.  But with two cores, clearly you get 100% overhead,
> duplicating the registers, the execution units, the L1 caches and all
> the other logic.

On (say) Pentium4, once the pipeline was sufficiently "screwed up"
that adding threading was easy (5%), the design team was in the
position of having to do it.

We looked at this for K9; it would have added several pipe stages, a
bunch of instruction buffering, and some minor register state. It
ended up closer to 9% than 5%. It would also have delivered similar
throughput (+15%-ish), but it would have come with a monothreaded cost
of some 7% off the top. So after you lost 7%, you could add 15% back
in and look like a genius ((ahem, and with a deep-sounding voice
drawing out the enunciation): RIGHT). I wonder if some Intel engineer/
designer knows what was lost in P4 such that threading became easy. My
guess is that we will not know for a very long time (2 decades).
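The back-of-the-envelope arithmetic is worth making explicit (a sketch, assuming the 7% monothreaded cost and the +15% threaded gain compose multiplicatively, which is my reading rather than something the post states):

```python
# Hypothetical numbers from the post: adding threading costs ~7% of
# monothreaded performance, then the second thread buys ~15% back.
base = 1.00                       # unthreaded design, normalized
monothreaded = base * (1 - 0.07)  # single-thread perf after adding threading
threaded = monothreaded * 1.15    # throughput with both threads active

print(f"single-thread after threading added: {monothreaded:.3f}")
print(f"two-thread throughput:               {threaded:.3f}")
```

The two-thread "win" nets out to only about +7% over the unthreaded design, while every monothreaded program pays the 7% tax, which is the joke behind "look like a genius".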

On the other hand, one could build a 1-wide in-order core that gives
something like 37% of the K9 performance for half of the die additions
needed to add threading. That is: an attached 1-wide in-order core
added to a GreatBig OoO core would add more performance and less die
area than adding threading to the GB OoO core. {Caveat: compared to a
pipeline inherently slimmed for the highest possible frequency (say
5 GHz when Opterons were at 3 GHz).}

On the third hand, more medium-sized cores lose little in commercial
applications simply because they wait just as well as the GB OoO cores
wait (and just as long). And by not being so big, they consume less
power and area, and less design time and debugging time.

It's a multidimensional optimization puzzle. Once you drop the need for
maximal monothreaded performance, the GB OoO design point is no longer
optimal by any metric you want to apply to the commercial space (and
others). But since so much of today's benchmarks are SSSOOOOOO
monothreaded, the market gets what the benchmarks convince the
designers to build.

Also note: if you look at the volume of chips that go into servers and
other big iron, it represents an afternoon in the FAB per year compared
to the desktops and notebooks. A profitable afternoon, but not big
enough for an Intel or an AMD to alter design team directions.

Mitch
From: MitchAlsup on
On Apr 22, 3:15 pm, Stephen Fuld <SF...(a)alumni.cmu.edu.invalid> wrote:
> Of course, comparing one design with nearly twice the number of
> transistors could outperform the single core design.

Counting transistors is a poor way to judge a CPU design, or compare
CPU utility functions.

A small/medium core with 6X-8X the cache might fit in exactly the same
die area as a GB OoO core.

An old example I used several years ago compared an Opteron core with
a quad of postage-stamp cores sharing a 4-way interleaved 256 KB L2.
Same die area, greater throughput, higher transistor count, more
cache, greater ILP, greater MLP, smaller power dissipation.

Knock off 3 of those postage-stamp processors, and one could have one
small core and 512 KB of cache in the same die footprint as the
Opteron core (counting the core alone, with no L2 or NB or memory/DRAM
controller or pins).

Mitch
From: Quadibloc on
On Apr 22, 9:20 pm, MitchAlsup <MitchAl...(a)aol.com> wrote:
> Once you drop the need for
> maximal monothreaded performance, the GB OoO design point is no longer
> optimal by any metric you want to apply to the comercial space (and
> others). But since so much of todays benchmarks are SSSOOOOOO
> monothreaded, the market gets what the benchmaks convince the
> designers to build.

Hmm. While I think that maximal monothreaded performance is what is
generally needed - except in the relatively unusual application of
OLTP, where throughput is king - out-of-order execution involves a
great deal of complexity (although the note in this thread that
6600-style scoreboards require much less is food for thought). Would a
superscalar chip that uses multithreading to make full use of the
computational resources that are there anyway, with generous cache,
be a good design point?

It would seem to me that one only needs to put multiple cores per
chip, aside from satisfying strange things like Windows licensing
requirements, if it's important to have the processors tightly
coupled. Independent jobs from different users that don't share
memory, and which could be running in different boxes connected by
network cables, hardly need to share cache.

But the fact that very small caches, like that on the original five-
volt Pentium, or the 360/85, were already enough to vastly improve
performance means that cache size involves diminishing returns. That
seems like a good reason to consider putting another core on the chip.

John Savard
From: Anne & Lynn Wheeler on

Robert Myers <rbmyersusa(a)gmail.com> writes:
> If Intel management read this report, and I assume it did, it would
> have headed in the direction that Andy has lamented: lots of simple
> cores without energy-consuming cleverness that doesn't help much,
> anyway--at least for certain kinds of workloads. The only thing that
> really helps is cache.

in the time-frame we were doing cluster scaleup for both commercial
and numerical intensive ... commercial reference to jan92
http://www.garlic.com/~lynn/95.html#13

oracle made a big issue that they had done extensive tracing and
simulation work ... and major thruput factor at the time was having at
least 2mbyte processor caches ... and they worked with major server
vendors to have option for sufficient cache.

recent posts on cluster scaleup
http://www.garlic.com/~lynn/2010f.html#47 Nonlinear systems and nonlocal supercomputing
http://www.garlic.com/~lynn/2010f.html#50 Handling multicore CPUs; what the competition is thinking
http://www.garlic.com/~lynn/2010g.html#8 Handling multicore CPUs; what the competition is thinking
http://www.garlic.com/~lynn/2010g.html#52 Handling multicore CPUs; what the competition is thinking

the other issue was compare&swap had become widely used for large DBMS
multi-threaded operation (whether running multi-threaded or not) ... and
although rios/rs6000 did provide for smp operation ... it also didn't
provide an atomic compare&swap primitive. as a result, dbms thruput
suffered on the rs/6000 platform because kernel calls were required to
get serialized operation. eventually aix provided a simulation of
compare&swap semantics via a supervisor call (special fastpath in
supervisor call interrupt routine that operated while disabled for
interrupts ... works in a single processor environment). misc.
past posts mentioning compare&swap (&/or smp):
http://www.garlic.com/~lynn/subtopic.html#smp

compare&swap was originally invented by charlie working on fine-grain
multiprocessor cp67 kernel locking at the science center. an effort was
then made to try and get it included in 370 architecture ... which was
rebuffed by the favorite son operating system in pok (claiming test&set,
from 360, was more than sufficient). the 370 architecture owners then
provided an opening with the challenge to come up with uses for
compare&swap that weren't multiprocessor specific; thus were born the
descriptions of compare&swap for use by multithreaded applications.
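For readers who haven't met it, the compare&swap primitive has simple semantics: atomically replace a memory word with a new value only if it still holds the expected old value, and report whether the swap happened. A minimal sketch (real CAS is a single atomic instruction; the lock below only stands in for the hardware atomicity guarantee, and the retry loop shows the multithreaded, non-multiprocessor-specific usage pattern described above):

```python
import threading

class Word:
    """One memory word with a compare&swap primitive.

    Hardware CAS is a single atomic instruction; the lock here merely
    simulates that atomicity for the sake of the sketch.
    """
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True          # swap happened
            return False             # another thread got there first

    def load(self):
        with self._lock:
            return self._value

def atomic_increment(word):
    """Classic CAS retry loop: read, compute, attempt to publish."""
    while True:
        old = word.load()
        if word.compare_and_swap(old, old + 1):
            return

counter = Word(0)
threads = [threading.Thread(
               target=lambda: [atomic_increment(counter) for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter.load())  # 4000 -- every update serialized without a kernel call
```

The point of the retry loop is that serialization happens entirely in user space; this is exactly what the rs/6000 dbms ports lacked and had to simulate with a supervisor call.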

--
42yrs virtualization experience (since Jan68), online at home since Mar1970
From: MitchAlsup on
On Apr 21, 11:02 am, Robert Myers <rbmyers...(a)gmail.com> wrote:
> Even though the paper distinguishes between technical and commercial
> workloads, and draws its negative conclusion only for commercial
> workloads, it was interesting to me that, for instance, Blue Gene went
> the same direction--many simple processors--for a technical workload so
> as to achieve low power operation.

Reading between the lines, commercial and DB workloads are better
served by slower processors accessing a thinner cache/memory hierarchy
than by faster processors accessing a thicker cache/memory hierarchy.
That is: a commercial machine is better served with a larger first-level
cache backed up by a large second-level cache, running at slower
frequencies, while a technical machine would be better served with
smaller first-level caches, a medium second-level cache, and a large
third-level cache, running at higher frequencies.

What this actually shows is that "one design point" cannot cover all
bases, and that one should configure a technical machine differently
than a commercial machine, differently than a database machine.

We saw this develop with the interplay between Alpha and HP, Alpha
taking the speed-demon approach while HP took the brainiac approach.
Alpha had more layers of cache with thinner slices at each level. HP
tried to avoid even a second level of cache (7600 Snakes) and then
tried to avoid bringing the first-level cache on die until the die
area was sufficient. On certain applications Alpha wins, on others HP
wins. We also witnessed this as the Alpha evolved: 8 KB caches became
16 KB caches, then back to 8 KB caches, as the cache hierarchy was
continually rebalanced to the workloads (benchmarks) the designers
cared about.

Since this paper was written slightly before the x86 crushed out the
RISCs in their entirety, the modern reality is that technical,
commercial, and database applications are being held hostage to
PC-based thinking. It has become just too expensive to target (with
more than lip service) application domains other than PCs (for
non-mobile applications). Thus the high-end PC chips have neither the
memory systems nor the interconnects that would better serve other
workloads and larger-footprint server systems.

A shame, really

Mitch