From: Ken Hagan on
On Fri, 23 Apr 2010 08:50:33 +0100, Quadibloc <jsavard(a)ecn.ab.ca> wrote:

> Hmm. While I think that maximal monothreaded performance is what is
> generally needed

But that would be "per core" and there's no reason not to offer 4 or 8
cores on a single chip. In the consumer space, existing software will use
2 or 3 threads to put up flash-infested web pages and half a dozen when
running "heavy loads" such as multimedia packages.

Of course, the hot and expensive server chips already do this, but I can't
see any reason why the low-power chips intended for phones and netbooks
shouldn't be doing this too. The software is already out there.
From: Morten Reistad on
In article <86666a83-4bed-472c-aacd-9fc6ef47e9e6(a)k33g2000yqc.googlegroups.com>,
MitchAlsup <MitchAlsup(a)aol.com> wrote:
>On Apr 21, 11:02 am, Robert Myers <rbmyers...(a)gmail.com> wrote:
>> Even though the paper distinguishes between technical and commercial
>> workloads, and draws its negative conclusion only for commercial
>> workloads, it was interesting to me that, for instance, Blue Gene went
>> the same direction--many simple processors--for a technical workload so
>> as to achieve low power operation.
>
>Reading between the lines, Commercial and DB workloads are better
>served by slower processors accessing a thinner cache/memory hierarchy
>than by faster processors accessing a thicker cache/memory hierarchy.
>That is: a commercial machine is better served with a larger first level
>cache backed up by a large second level cache running at slower frequencies,
>while a technical machine would be better served with smaller first
>level caches, a medium second level cache and a large third level cache
>running at higher frequencies.

I can confirm this from benchmarks for real-life workloads for
pretty static web servers, media servers and SIP telephony systems.
The cache size means everything in this context.

Specifically, we see that an 8-way HP/Xeon with 8x8M L3 cache and
hyperchannel (and 2 cores per die) outperforms a 32-way Sun/Opteron with 32
cpus, 6 per die, by almost an order of magnitude: 12000 RTP calls
vs. 2600, or slightly higher numbers for RTP duplication.
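
To make the cache-size point concrete, here is a back-of-the-envelope
AMAT (average memory access time) sketch in Python. All hit times, miss
rates and the memory latency are made-up round numbers for illustration,
not measurements from the systems above:

  # Rough AMAT: hit_time + miss_rate * (time to service the miss at the next level).
  def amat(levels, mem_latency):
      # levels: list of (hit_time, miss_rate) from L1 outward; times in ns.
      t = mem_latency
      for hit_time, miss_rate in reversed(levels):
          t = hit_time + miss_rate * t
      return t

  # Fat, slower-clocked hierarchy (big L1 + big L2), pointer-chasing miss rates:
  print(amat([(2.0, 0.05), (12.0, 0.30)], 120.0))               # ~4.4 ns
  # Thin, faster-clocked hierarchy (small L1, medium L2, large L3), same workload:
  print(amat([(1.0, 0.10), (6.0, 0.40), (25.0, 0.50)], 120.0))  # ~5.0 ns

With these invented numbers the fatter, slower hierarchy wins on a
commercial-style access pattern, which is exactly the trade-off Mitch
describes above.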

>What this actually shows is that "one design point" cannot cover all
>bases, and that one should configure a technical machine differently
>than a commercial machine, differently than a database machine.

But all of the popular applications are hitting one of two walls:
either the memory latency wall or the power density wall.

Backing out slightly on raw per processor speed, getting lots and lots
of cache, and getting power consumption down will be huge wins
for the internet farms.
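
Dynamic CMOS power goes roughly as C * V^2 * f, and a lower clock usually
permits a lower supply voltage, so the saving is superlinear. A quick
sketch (the assumption that a 20% clock cut allows running at ~90% of
nominal voltage is illustrative, not a datasheet figure):

  # Relative dynamic power for given frequency and voltage scaling factors.
  def relative_power(freq_scale, volt_scale):
      return freq_scale * volt_scale ** 2

  print(relative_power(0.8, 0.9))  # ~0.65: 20% slower clock, ~35% less dynamic power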

>We saw this develop with the interplay between Alpha and HP, Alpha
>taking the speed demon approach, while HP took the brainiac approach.
>Alpha had more layers of cache with thinner slices at each level. HP
>tried to avoid even the second level of cache (7600 Snakes) and then
>tried to avoid bringing the first level cache on die until the die
>area was sufficient. On certain applications Alpha wins, on others HP
>wins. We also witnessed this as the Alpha evolved: 8 KB caches became
>16 KB caches, then back to 8 KB caches as the cache hierarchy was
>continually rebalanced to the workloads (benchmarks) the designers
>cared about.
>
>Since this paper was written slightly before the x86 crushed out RISCs
>in their entirety, the modern reality is that technical, commercial,
>and database applications are being held hostage to PC-based thinking.
>It has become just too expensive to target (with more than lip
>service) application domains other than PCs (for non-mobile
>applications). Thus the high end PC chips do not have the memory
>systems nor interconnects that would better serve other workloads and
>larger footprint server systems.
>
>A shame, really

Actually, the Hyperchannel cache interconnects work very nicely to
make all the on-chip caches work as a system-wide cache, not just a
die-wide one. I may even suggest attaching L4 cache as static on-chip
hyperchannel memory; like a Xeon with no CPUs.

Also, backing out a little on clock speed saves lots of power.

-- mrr