From: Paul A. Clayton on
On Mar 29, 9:30 am, j...(a)cix.co.uk (John Dallman) wrote:
> In article
> <d3e90bcf-a062-49d0-86b5-1e8445212...(a)s19g2000prg.googlegroups.com>,
>
> dkan...(a)gmail.com (David Kanter) wrote:
> > Every processor IBM has designed since the POWER5 has used SMT, and
> > generally, IBM has a tendency to make reasonable choices.
>
> And they say, up-front, that if you're doing something CPU-limited, you
> should turn it off. Note to people who don't have experience with
> AIX-type machines: the PowerPC-based CPUs in the XBox 360 and PS/3,
> while produced in huge volumes, aren't much used in IBM's own products.

Actually that assumes that the applications have enough parallelism to
at least saturate one kind of functional unit. If the code is heavy
with data dependencies then a second thread is more likely to have
opportunities for making progress with a wide processor. If the code
is heavy with unpredictable control flow changes, then adding a thread
(whether SMT or FineGrainedMT) will effectively halve the branch
misprediction penalty. (FGMT doesn't help much [any?] with data
dependencies unless they are multicycle.)

> SMT arguments tend to futility because everyone acts as if their kind of
> workload is "typical", even though they know that it isn't really. SMT
> seems to work quite well for tasks that can use a lot of threads that
> don't have huge amounts of work each, and have other caps on
> performance. File serving, and some kinds of web serving are examples,
> where disk and/or network speed can also be limiters.

The fact is that one can turn off SMT--so a single processor can look
like a very aggressive ILP non-MT processor or a pair of processors
targeted at more moderate ILP. Sure, two moderate-ILP cores might very
well outperform a 2-way SMT with twice the issue width, have comparable
or smaller area, and (I am guessing) likely have somewhat lower power
consumption; but then an application with some extra available ILP
that cannot be profitably converted to TLP because of the closeness of
communication required will have lower performance.

> The stuff I work on has quite a bit of multi-threading, but was reliably
> slower on first-generation HyperThreading, because the time costs of
> locking weren't made up for by access to increased processing power.
> Intel were disappointed by this, and hoped that the second-generation
> implementation in the "Prescott" series of Pentium 4s would convert us.
> It did not; while it was better, it did not give a significant speed-up.

HPC/high ILP code? (BTW, properly rescheduled FP code could
benefit from the effectively shorter latencies of operations by
reducing the number of active registers needed [and avoid register
spill/fill].)

> One explanation came out as "HyperThreading is a way to get higher
> utilisation from the pool of execution units. However, if the threads
> are running very similar code, there aren't (many) spare execution units
> of the type that's the bottleneck, so there's no significant increase in
> throughput". Me, I reckoned the limits of memory bandwidth - the threads
> weren't working with the same data - also had something to do with it.

One bad part about any kind of multithreading is that if there is a
bottleneck, contention for the resource can cause LOWER performance
(even ignoring additional communication/synchronization overheads).
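
As a rough illustration of the "pool of execution units" explanation
quoted above, here is a toy model. It is an editorial sketch, not
anything measured in this thread. It assumes each thread, run alone,
uses a fixed number of operations per cycle on each unit type, and that
two threads sharing a core are throttled equally until the most
oversubscribed unit fits its capacity. The unit names, the capacities
and the function smt_speedup are all hypothetical.

```python
def smt_speedup(caps, thread_a, thread_b):
    """Toy estimate of 2-way SMT throughput relative to one thread at a time.

    caps:     ops/cycle each unit type can sustain (hypothetical numbers)
    thread_*: ops/cycle each thread would use on each unit type running alone
    """
    slowdown = 1.0
    for unit, cap in caps.items():
        demand = thread_a.get(unit, 0.0) + thread_b.get(unit, 0.0)
        if demand > cap:
            # Oversubscribed unit: both threads are throttled proportionally.
            slowdown = min(slowdown, cap / demand)
    # Each thread progresses at `slowdown` of its solo rate, and there are two.
    return 2.0 * slowdown


# Hypothetical core and workloads, chosen only to show the two extremes.
caps = {"fp": 2.0, "int": 3.0, "mem": 2.0}
fp_heavy = {"fp": 2.0, "int": 0.5, "mem": 0.5}
int_heavy = {"fp": 0.0, "int": 1.5, "mem": 1.0}

print(smt_speedup(caps, fp_heavy, fp_heavy))   # two threads running similar code
print(smt_speedup(caps, fp_heavy, int_heavy))  # threads with complementary mixes
```

In this model two copies of the same FP-bound thread give a speedup of
1.0 (no gain), while the complementary pair gives 2.0. It does not
capture the actual slowdown case, such as cache or bandwidth thrashing,
which would need an explicit contention penalty on top.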

Paul A. Clayton
reachable as 'paaronclayton'
at "embarqmail.com"
From: Paul A. Clayton on
On Mar 29, 10:00 am, n...(a)cus.cam.ac.uk (Nick Maclaren) wrote:
[snip]
> That's why I have liked the idea of switching threads on a cache miss
> for a good many years now - given that memory latency is THE problem,
> a solution that starts by assuming that seems good to me.

Memory latency is the MAIN problem, but unpredictable control flow
changes are not entirely trivial in some applications. Even data
dependencies might be significant limitations.

Providing a very wide processor with a 10% larger core to support SMT
can make as much or more sense than providing a non-SMT version that
uses the 10% space for another much simpler core. (Sure it would
generate better throughput at lower power to use only simpler cores;
but it seems that ILP drives a lot of processor design.)


Paul A. Clayton
reachable as 'paaronclayton'
at "embarqmail.com"


From: John Dallman on
In article
<52acdbf3-c3ff-4de0-ae56-ed0db2381930(a)59g2000hsb.googlegroups.com>,
paaronclayton(a)earthlink.net (Paul A. Clayton) wrote:

> > The stuff I work on has quite a bit of multi-threading, but was
> > reliably slower on first-generation HyperThreading, because the
> > time costs of locking weren't made up for by access to increased
> > processing power. Intel were disappointed by this, and hoped that
> > the second-generation implementation in the "Prescott" series of
> > Pentium 4s would convert us. It did not; while it was better, it
> > did not give a significant speed-up.
>
> HPC/high ILP code?

Not exactly. A kind of mathematical modelling that has substantial
elements of goal-seeking. It tends to search memory in a good simulation
of a random pattern for a few micro-seconds, do some moderately serious
FP crunching for a vaguely similar length of time, albeit in algorithms
rather more complex than a typical HPC inner loop, and then go
memory-searching again. It repeats this until it gets an answer -
there's no real way to predict how long this will take - and then
returns to its caller. It bashes memory hardest, FP next and branch
prediction third.

> (BTW, properly rescheduled FP code could benefit from the effectively
> shorter latencies of operations by reducing the number of active
> registers needed [and avoid register spill/fill].)

That's for the compilers to handle. This stuff is hard enough to write
and maintain that we avoid trying to recode algorithms in processor-
specific ways. It has to run at good speed on quite a few platforms and
the requirements for consistency between platforms are extremely strict.
Wrong answers are no use at all, no matter how fast you can get them.

--
John Dallman, jgd(a)cix.co.uk, HTML mail is treated as probable spam.
From: Niels Jørgen Kruse on
John Dallman <jgd(a)cix.co.uk> wrote:

> In article
> <52acdbf3-c3ff-4de0-ae56-ed0db2381930(a)59g2000hsb.googlegroups.com>,
> paaronclayton(a)earthlink.net (Paul A. Clayton) wrote:
>
> > > The stuff I work on has quite a bit of multi-threading, but was
> > > reliably slower on first-generation HyperThreading, because the
> > > time costs of locking weren't made up for by access to increased
> > > processing power. Intel were disappointed by this, and hoped that
> > > the second-generation implementation in the "Prescott" series of
> > > Pentium 4s would convert us. It did not; while it was better, it
> > > did not give a significant speed-up.
> >
> > HPC/high ILP code?
>
> Not exactly. A kind of mathematical modelling that has substantial
> elements of goal-seeking. It tends to search memory in a good simulation
> of a random pattern for a few micro-seconds, do some moderately serious
> FP crunching for a vaguely similar length of time, albeit in algorithms
> rather more complex than a typical HPC inner loop, and then go
> memory-searching again. It repeats this until it gets an answer -
> there's no real way to predict how long this will take - and then
> returns to its caller. It bashes memory hardest, FP next and branch
> prediction third.

Smells like the problem is that you modify the data structures that you
search, in order to speed up searches?

You could have 2 versions of the search, one that improves the data
structures and one that doesn't. Using the former a percentage of the
time would be enough to evolve the data structures.
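
A minimal sketch of that suggestion follows, using a move-to-front list
as a stand-in. The thread does not say what data structures are
actually involved, so the class, its method names and the
improve_fraction parameter are all hypothetical. The read-only search
never writes to shared data; the improving search reorders the list so
that later lookups of the same key are faster.

```python
import random


class SelfOrganizingList:
    """Linear-search list with an optional move-to-front heuristic."""

    def __init__(self, items):
        self.items = list(items)

    def find_readonly(self, key):
        """Search without modifying the list (no writes to shared data)."""
        for item in self.items:
            if item == key:
                return True
        return False

    def find_and_improve(self, key):
        """Search and, on a hit, move the item to the front of the list."""
        for i, item in enumerate(self.items):
            if item == key:
                self.items.insert(0, self.items.pop(i))
                return True
        return False

    def find(self, key, improve_fraction=0.1, rng=random):
        """Use the improving search for roughly improve_fraction of calls."""
        if rng.random() < improve_fraction:
            return self.find_and_improve(key)
        return self.find_readonly(key)


lst = SelfOrganizingList([1, 2, 3, 4])
lst.find_readonly(3)       # hit, list order unchanged
lst.find_and_improve(3)    # hit, 3 moves to the front
print(lst.items)
```

Whether a given fraction is enough to keep the structure well organised
depends on the workload; the 0.1 default here is arbitrary.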

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark
From: Nick Maclaren on

In article <52acdbf3-c3ff-4de0-ae56-ed0db2381930(a)59g2000hsb.googlegroups.com>,
"Paul A. Clayton" <paaronclayton(a)earthlink.net> writes:
|> On Mar 29, 9:30 am, j...(a)cix.co.uk (John Dallman) wrote:
|> > In article
|> > <d3e90bcf-a062-49d0-86b5-1e8445212...(a)s19g2000prg.googlegroups.com>,
|> > dkan...(a)gmail.com (David Kanter) wrote:
|> >
|> > > Every processor IBM has designed since the POWER5 has used SMT, and
|> > > generally, IBM has a tendency to make reasonable choices.
|> >
|> > And they say, up-front, that if you're doing something CPU-limited, you
|> > should turn it off. Note to people who don't have experience with
|> > AIX-type machines: the PowerPC-based CPUs in the XBox 360 and PS/3,
|> > while produced in huge volumes, aren't much used in IBM's own products.
|>
|> Actually that assumes that the applications have enough parallelism to
|> at least saturate one kind of functional units. If the code is heavy
|> with data dependencies then a second thread is more likely to have
|> opportunities for making progress with a wide processor.

That is seriously misleading. The lesser point is that, yes, the
problem usually is saturation of one key 'unit' (and it's not always
a functional unit as such, but may be other CPU resources).

The major one is that it is an argument for several simple cores, with
a rather larger number of contexts, and switching thread on cache
miss. It is NOT an argument for SMT. Sun and Intel seem to have
realised that.
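
The appeal of switching thread on cache miss can be seen in a simple
back-of-envelope model. This is an editorial sketch under stated
assumptions, not a description of any Sun or Intel design. It assumes
every thread runs for a fixed number of cycles between misses, every
miss costs the same latency, and a switch costs a fixed number of
cycles. The function name and all the numbers are hypothetical.

```python
def core_utilization(contexts, run_cycles, miss_latency, switch_cost=0):
    """Fraction of cycles a switch-on-miss core spends doing useful work.

    contexts:     hardware thread contexts available to the core
    run_cycles:   useful cycles a thread executes between cache misses
    miss_latency: cycles a thread waits for each miss
    switch_cost:  cycles lost on every thread switch
    """
    # Too few contexts: the core idles while every thread waits on memory.
    unsaturated = contexts * run_cycles / (run_cycles + switch_cost + miss_latency)
    # Enough contexts: the miss latency is fully hidden, only switches cost.
    saturated = run_cycles / (run_cycles + switch_cost)
    return min(unsaturated, saturated)


# Hypothetical workload: 20 useful cycles between misses, 180-cycle misses.
for n in (1, 2, 4, 8, 16):
    print(n, core_utilization(n, run_cycles=20, miss_latency=180))
```

With these made-up numbers a single context keeps the core busy 10% of
the time, and utilisation grows linearly with the number of contexts
until the miss latency is covered. A simple core with many contexts is
exactly the design this model rewards.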

|> If the code is heavy
|> with unpredictable control flow changes, then adding a thread
|> (whether SMT or FineGrainedMT) will effectively halve the branch
|> misprediction penalty. (FGMT doesn't help much [any?] with data
|> dependencies unless they are multicycle.)

That is not true - analyse it more carefully. It would be true only
if no work were needed to recover from a misprediction. But, as
critical parts of the CPU are typically working flat-out to recover,
it doesn't help.

Now, on a coprocessor system (such as a vector machine), it CAN help,
because the parts of the CPU needed to execute instructions and the
parts needed to restart the pipeline are not the same. But it generally
doesn't, even on them, because of practical problems.

|> The fact is that one can turn off SMT--so a single processor can look
|> like a very aggressive ILP non-MT processor or a pair of
|> processors targeted at more moderate ILP. ...

You always pay for gimmicks you don't use, possibly by not being allowed
to have a feature that you WOULD use.

|> One bad part about any kind of multithreading is that if there is a
|> bottleneck, contention for the resource can cause LOWER performance
|> (even ignoring additional communication/synchronization overheads).

And the smaller the grain, the worse the latter problem becomes. SMT
is about as fine a grain as is likely to be attempted.


Regards,
Nick Maclaren.