From: Nick Maclaren on

In article <d3e90bcf-a062-49d0-86b5-1e8445212dfb(a)s19g2000prg.googlegroups.com>,
David Kanter <dkanter(a)gmail.com> writes:
|>
|> > Don't believe what you read - half of it is propaganda :-)
|>
|> In this case, it would be your post. SMT is a very well established
|> technique, used by most of the high performance CPU vendors. Even the
|> embedded guys use SOEMT quite heavily.

I suggest that you read a thread before allowing your knees to jerk.
I specifically said that SOEMT was viable. Dammit, it was obviously
a good way to proceed even before the Tera MTA showed that it was!
Blindingly obvious is the expression that springs to mind.

|> > I took the trouble of reading Eggers' main paper (and others), and
|> > analysing her calculations. I started out impressed, and got less
|> > so as I proceeded. There was one very significant omission: the
|> > comparison between the throughput of a SMT system and a multi-core
|> > system with the same number of transistors and same amount of
|> > parallelism.
|>
|> Here's a hint Nick, if you think that ALU real estate was a 1970's
|> problem, then transistor count is a 1960's problem.
|>
|> CPUs today use around 400-600M transistors easily, with high-end ones
|> up to 2B. I don't know about you, but somehow I don't think
|> transistor count is a relevant metric today.

Would you like to explain what on earth you are wittering on about?
That was precisely the point that I made! And, before you jerk your
knees again, read my third point below.

|> SMT is an extremely power efficient technique, and that's a hell of a
|> lot more important than transistors. Every year, we get 2x as many
|> transistors, yet the number of watts remains the same. Which one do
|> you think is the bottleneck?

Sigh. Firstly, that is wrong. The number of watts fluctuates a bit
but, even over the past decade, the trend is up.

Secondly, all of the many chip designers, experts and vendors I have
spoken to have referred to process technologies as the key to keeping
power under control, followed by changing clock rates and/or turning
sections of chip off when unused. SMT was only mentioned before they
had tried it :-)

Thirdly, your claim that SMT is an extremely power efficient technique
doesn't make it so. You clearly haven't read Eggers' main paper with
any care, or haven't understood it. I used the transistor count as a
constraint only because that paper did.

Fourthly, you failed to understand the point I made above - replace the
constraint on number of transistors by the number of watts, and there
was STILL no comparison in those papers with an equivalent multi-core
system.

Fifthly, both Sun's and Intel's 'experimental' low-power systems use
larger numbers of simpler, non-SMT cores - which is PRECISELY the
technique that I was saying should also have been considered. And
THAT is the key to why you should update your beliefs!

|> > Also, the scalability was dire, and that was for the
|> > SMT-friendly MIPS chip - as history shows, Intel's attempt to use
|> > it on the x86 was not a great success.
|>
|> Here's a hint, implementations vary. Intel's first implementation
|> sucked. Their second one probably won't.

Here's a technique. Look at what Intel are experimenting with. It's
easy enough to find on the Web - if you can be bothered - I knew about
it earlier, of course, under NDA.

|> Every processor IBM has designed since the POWER5 has used SMT, and
|> generally, IBM has a tendency to make reasonable choices.

IBM's objective with those designs is not mainstream computing - more
like mainframe computing!

|> You should leave armchair architecture to those that actually
|> understand it...

So, are you claiming that you do? :-)


Regards,
Nick Maclaren.
From: Niels Jørgen Kruse on
Nick Maclaren <nmm1(a)cus.cam.ac.uk> wrote:

> In article
> <d3e90bcf-a062-49d0-86b5-1e8445212dfb(a)s19g2000prg.googlegroups.com>, David
> Kanter <dkanter(a)gmail.com> writes:
> |>
> |> > Don't believe what you read - half of it is propaganda :-)
> |>
> |> In this case, it would be your post. SMT is a very well established
> |> technique, used by most of the high performance CPU vendors. Even the
> |> embedded guys use SOEMT quite heavily.
>
> I suggest that you read a thread before allowing your knees to jerk.
> I specifically said that SOEMT was viable. Dammit, it was obviously
> a good way to proceed even before the Tera MTA showed that it was!
> Blindingly obvious is the expression that springs to mind.

The Tera MTA was not SOEMT, unless you accept every clock as an event.

> |> Every processor IBM has designed since the POWER5 has used SMT, and
> |> generally, IBM has a tendency to make reasonable choices.
>
> IBM's objective with those designs is not mainstream computing - more
> like mainframe computing!

The PS3 and Xbox 360 are hardly mainframes.

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark
From: Nick Maclaren on

In article <1iekddk.165zhckhowzvuN%nospam(a)ab-katrinedal.dk>,
nospam(a)ab-katrinedal.dk (Niels Jørgen Kruse) writes:
|>
|> > I suggest that you read a thread before allowing your knees to jerk.
|> > I specifically said that SOEMT was viable. Dammit, it was obviously
|> > a good way to proceed even before the Tera MTA showed that it was!
|> > Blindingly obvious is the expression that springs to mind.
|>
|> The Tera MTA was not SOEMT, unless you accept every clock as an event.

Well, Tera told me that they didn't actually switch until they needed
to load from memory :-) Anyway, exactly what you call it doesn't
matter - the fact is that it demonstrated that SOEMT was viable,
where the event is a wait for memory.

|> > |> Every processor IBM has designed since the POWER5 has used SMT, and
|> > |> generally, IBM has a tendency to make reasonable choices.
|> >
|> > IBM's objective with those designs is not mainstream computing - more
|> > like mainframe computing!
|>
|> The PS3 and Xbox 360 are hardly mainframes.

True. But they are also not mainstream, and exactly how the CPUs work
isn't public knowledge, as far as I know - though a hell of a lot of
people CLAIM to know, based on parroting Web magazine articles. My
last close contacts with IBM were in the POWER4 days, and the gaming
CPUs had a lot more differences from the 'mainframe' ones than was
commonly believed. Note that was NOT in the CPU cores, but the memory,
SMP and RAS control.

One of the things that I was told was being considered was using the
'SMT' aspect to configure some systems into SIMD engines and others
into SMP ones. Whether that has been done for the gaming and
'mainframe' CPUs, respectively, I don't know. Del might.

As I said right back when the term "SMT" became trendy, it covered
a multitude of sins, and I now notice a trend to use the SMT label for
an ever-widening set of techniques. I don't know what the CPU design
of the year 2000 will look like, but it will be called SMT[*]!


[*] With apologies to whoever said it about Fortran :-)


Regards,
Nick Maclaren.
From: John Dallman on
In article
<d3e90bcf-a062-49d0-86b5-1e8445212dfb(a)s19g2000prg.googlegroups.com>,
dkanter(a)gmail.com (David Kanter) wrote:

> Every processor IBM has designed since the POWER5 has used SMT, and
> generally, IBM has a tendency to make reasonable choices.

And they say, up-front, that if you're doing something CPU-limited, you
should turn it off. Note to people who don't have experience with
AIX-type machines: the PowerPC-based CPUs in the XBox 360 and PS/3,
while produced in huge volumes, aren't much used in IBM's own products.

SMT arguments tend to futility because everyone acts as if their kind of
workload is "typical", even though they know that it isn't really. SMT
seems to work quite well for tasks that can use a lot of threads that
don't have huge amounts of work each, and have other caps on
performance. File serving, and some kinds of web serving are examples,
where disk and/or network speed can also be limiters.

The stuff I work on has quite a bit of multi-threading, but was reliably
slower on first-generation HyperThreading, because the time costs of
locking weren't made up for by access to increased processing power.
Intel were disappointed by this, and hoped that the second-generation
implementation in the "Prescott" series of Pentium 4s would convert us.
It did not; while it was better, it did not give a significant speed-up.

One explanation came out as "HyperThreading is a way to get higher
utilisation from the pool of execution units. However, if the threads
are running very similar code, there aren't (many) spare execution units
of the type that's the bottleneck, so there's no significant increase in
throughput". Me, I reckoned the limits of memory bandwidth - the threads
weren't working with the same data - also had something to do with it.
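
The execution-unit explanation can be put into a toy model. This is only a
sketch under stated assumptions: the unit counts and instruction mixes are
hypothetical, both threads are assumed to advance at the same rate, and
dependency stalls, caches and memory bandwidth are ignored entirely.

```python
# Toy model of two threads sharing one core's execution units.
# All numbers are hypothetical; this describes no real CPU.
UNITS = {"int": 2, "fp": 1, "mem": 1}  # execution units available per cycle


def solo_ipc(mix, units=UNITS):
    """Peak instructions/cycle for one thread, set by its scarcest unit type.

    `mix` maps a unit type to the fraction of instructions that need it.
    """
    return min(units[u] / frac for u, frac in mix.items() if frac > 0)


def smt_ipc(mix_a, mix_b, units=UNITS):
    """Combined instructions/cycle for two threads sharing the units,
    assuming both threads retire instructions at the same rate."""
    demand = {u: mix_a.get(u, 0) + mix_b.get(u, 0) for u in units}
    rate = min(units[u] / d for u, d in demand.items() if d > 0)
    return 2 * rate


def time_sliced_ipc(mix_a, mix_b, units=UNITS):
    """Baseline: the same two threads take turns on a non-SMT core,
    each executing the same number of instructions."""
    return 2 / (1 / solo_ipc(mix_a, units) + 1 / solo_ipc(mix_b, units))


def smt_speedup(mix_a, mix_b, units=UNITS):
    """Throughput with SMT relative to time slicing on the same core."""
    return smt_ipc(mix_a, mix_b, units) / time_sliced_ipc(mix_a, mix_b, units)


FP_HEAVY = {"int": 0.2, "fp": 0.6, "mem": 0.2}
INT_HEAVY = {"int": 0.7, "fp": 0.0, "mem": 0.3}

if __name__ == "__main__":
    print("same code on both threads:", smt_speedup(FP_HEAVY, FP_HEAVY))
    print("different code per thread:", smt_speedup(FP_HEAVY, INT_HEAVY))
```

With these made-up mixes, two copies of the fp-heavy thread get no speed-up
at all (the single fp unit is saturated either way), while pairing it with
the int-heavy thread gives roughly 1.58x over time slicing. A
memory-bandwidth cap would sit on top of this as one more shared limit.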

--
John Dallman, jgd(a)cix.co.uk, HTML mail is treated as probable spam.
From: Nick Maclaren on

In article <memo.20080329133047.1696A(a)jgd.compulink.co.uk>,
jgd(a)cix.co.uk (John Dallman) writes:
|> In article
|> <d3e90bcf-a062-49d0-86b5-1e8445212dfb(a)s19g2000prg.googlegroups.com>,
|> dkanter(a)gmail.com (David Kanter) wrote:
|>
|> > Every processor IBM has designed since the POWER5 has used SMT, and
|> > generally, IBM has a tendency to make reasonable choices.
|>
|> And they say, up-front, that if you're doing something CPU-limited, you
|> should turn it off. Note to people who don't have experience with
|> AIX-type machines: the PowerPC-based CPUs in the XBox 360 and PS/3,
|> while produced in huge volumes, aren't much used in IBM's own products.

Interesting. It fails to surprise me but, as I said, I have had little
direct contact with IBM since the POWER4.

|> One explanation came out as "HyperThreading is a way to get higher
|> utilisation from the pool of execution units. However, if the threads
|> are running very similar code, there aren't (many) spare execution units
|> of the type that's the bottleneck, so there's no significant increase in
|> throughput". Me, I reckoned the limits of memory bandwidth - the threads
|> weren't working with the same data - also had something to do with it.

Precisely. Interestingly, a careful reading of Eggers' papers made
that quite clear - which shows how many people actually READ the
references that they rely on :-(

The same thing applied to the SMP vector processors - if you were
running too many processes, it was BAD NEWS, even if only one was using
the vector unit. I managed to double the throughput on one system by
reducing the run limit by a factor of three.

That's why I have liked the idea of switching threads on a cache miss
for a good many years now - given that memory latency is THE problem,
a solution that starts by assuming that seems good to me.
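
A minimal sketch of why switching on a cache miss looks attractive when
memory latency dominates. Everything here is an assumption for illustration:
each thread computes for a fixed `run_cycles`, then misses and waits a fixed
`miss_latency`; the core picks whichever thread becomes ready soonest; and
`switch_cost` defaults to zero.

```python
def soemt_utilisation(n_threads, run_cycles, miss_latency,
                      switch_cost=0, total_cycles=100_000):
    """Fraction of cycles a single core spends doing useful work when it
    switches to another thread on every cache miss (hypothetical model)."""
    ready_at = [0] * n_threads  # cycle at which each thread's data arrives
    now = 0                     # current cycle
    busy = 0                    # cycles spent executing instructions
    while now < total_cycles:
        # Choose the thread whose outstanding miss completes first.
        nxt = min(range(n_threads), key=lambda t: ready_at[t])
        if ready_at[nxt] > now:
            now = ready_at[nxt]  # every thread is waiting: the core idles
        now += run_cycles        # run until this thread's next miss
        busy += run_cycles
        ready_at[nxt] = now + miss_latency
        now += switch_cost       # overhead of switching threads, if any
    return busy / now


if __name__ == "__main__":
    for n in (1, 2, 3, 5, 8):
        u = soemt_utilisation(n, run_cycles=20, miss_latency=80)
        print(f"{n} thread(s): utilisation {u:.2f}")
```

With 20 cycles of work per 80-cycle miss, one thread keeps the core about
20% busy, and utilisation rises roughly in proportion to the thread count
until about five threads hide the latency completely. Real workloads have
variable run lengths and contend for cache and bandwidth, none of which
this models.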


Regards,
Nick Maclaren.