From: Nick Maclaren on

In article <4d98c7c6-9019-4d59-af03-82af04489c67(a)f63g2000hsf.googlegroups.com>,
"Paul A. Clayton" <paaronclayton(a)earthlink.net> writes:
|
|> > That's why I have liked the idea of switching threads on a cache miss
|> > for a good many years now - given that memory latency is THE problem,
|> > a solution that starts by assuming that seems good to me.
|>
|> Memory latency is the MAIN problem, but unpredictable control flow
|> changes are not entirely trivial in some applications. Even data
|> dependencies might be significant limitations.

Agreed. I didn't mean "the only" but "the dominant".

See my other response as to why SMT doesn't usually help much.

|> Providing a very wide processor with a 10% larger core to support SMT
|> can make as much or more sense than providing a non-SMT version that
|> uses the 10% space for another much simpler core. (Sure it would
|> generate better throughput at lower power to use only simpler cores;
|> but it seems that ILP drives a lot of processor design.)

That is actually the key - and the target is not technical.

SMT is a compromise between producing a pure benchmarketing SPEC engine,
which would have a dire performance in most servers, and one that would
look bad but perform well. As such, it is a good technology. But the
claims that it actually solves the technical problems are not backed
up by evidence.


Regards,
Nick Maclaren.
From: Nick Maclaren on

In article <1ielzej.18dmntm1moby6hN%nospam(a)ab-katrinedal.dk>,
nospam(a)ab-katrinedal.dk (=?ISO-8859-1?Q?Niels_J=F8rgen_Kruse?=) writes:
|> John Dallman <jgd(a)cix.co.uk> wrote:
|>
|> > Not exactly. A kind of mathematical modelling that has substantial
|> > elements of goal-seeking. It tends to search memory in a good simulation
|> > of a random pattern for a few micro-seconds, do some moderately serious
|> > FP crunching for a vaguely similar length of time, albeit in algorithms
|> > rather more complex than a typical HPC inner loop, and then go
|> > memory-searching again. It repeats this until it gets an answer -
|> > there's no real way to predict how long this will take - and then
|> > returners to its caller It bashes memory hardest, FP next and branch
|> > prediction third.
|>
|> Smells like the problem is that you modify the datastructures that you
|> search, in order to speed up searches?
|>
|> You could have 2 versions of the search, one that improve the data
|> structures and one that don't. Using the former a percentage of the time
|> would be enough to evolve the data structures.

Eh? I understood John Dallman to mean that the data structures are
essentially read-only, but are being accessed in a most cache-unfriendly
way. That is the normal characteristic of such programs - and they are
not rare.

As I have posted before, there are some fairly basic algorithms that
are provably of that form - i.e. mathematically unoptimisable. The
only real hope for a considerable speedup in those cases is to solve
the problem using an entirely different mathematical model - which is
quite often possible, but needs a VERY high calibre of person to do
the transformation!


Regards,
Nick Maclaren.
From: Nick Maclaren on

In article <fsnlii$p10$1(a)gemini.csx.cam.ac.uk>,
nmm1(a)cus.cam.ac.uk (Nick Maclaren) writes:
|> In article <52acdbf3-c3ff-4de0-ae56-ed0db2381930(a)59g2000hsb.googlegroups.com>,
|> "Paul A. Clayton" <paaronclayton(a)earthlink.net> writes:
|>
|> |> If the code is heavy
|> |> with unpredictable control flow changes, then adding a thread
|> |> (whether SMT or FineGrainedMT) will effectively half the branch
|> |> misprediction penalty. (FGMT doesn't help much [any?] with data
|> |> dependencies unless they are multicycle.)
|>
|> That is not true - analyse it more carefully. It would be true only
|> if no work were needed to recover from a misprediction. But, as
|> critical parts of the CPU are typically working flat-out to recover,
|> it doesn't help.

Sorry - I was being misleading. I should have said "it doesn't
effectively halve the branch misprediction penalty". Exactly what
effect it will have will depend on a lot of details - starting with
the question of how many instruction decode paths there are and how
independent they are of the actual execution!

What I was thinking of was the comparison between SMT and two
independent cores, with the same chip constraints (be they area,
transistor count, watts or whatever). And I have seen no decent
analysis of that, though I should be interested to see one.


Regards,
Nick Maclaren.
From: Terje Mathisen on
Nick Maclaren wrote:
> Eh? I understood John Dallman to mean that the data structures are
> essentially read-only, but are being accessed in a most cache-unfriendly
> way. That is the normal characteristic of such programs - and they are
> not rare.

I believe the latest HPC oil codes are moving in this direction.
>
> As I have posted before, there are some fairly basic algorithms that
> are provably of that form - i.e. mathematically unoptimisable. The
> only real hope for a considerable speedup in those cases is to solve
> the problem using an entirely different mathematical model - which is
> quite often possible, but needs a VERY high calibre of person to do
> the transformation!

The big hump is possibly that the parallel friendly algorithms might
start out with 10% of the performance, requiring 10 times the hw
resources just to recover the initial algorithmic overhead, before you
get any gains at all.

Terje
--
- <Terje.Mathisen(a)hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
From: Nick Maclaren on

In article <Uu6dneNs3YKI6HLanZ2dnUVZ_oSunZ2d(a)giganews.com>,
Terje Mathisen <terje.mathisen(a)hda.hydro.com> writes:
|>
|> > Eh? I understood John Dallman to mean that the data structures are
|> > essentially read-only, but are being accessed in a most cache-unfriendly
|> > way. That is the normal characteristic of such programs - and they are
|> > not rare.
|>
|> I believe the latest HPC oil codes are moving in this direction.

I didn't know that, but it doesn't surprise me.

|> > As I have posted before, there are some fairly basic algorithms that
|> > are provably of that form - i.e. mathematically unoptimisable. The
|> > only real hope for a considerable speedup in those cases is to solve
|> > the problem using an entirely different mathematical model - which is
|> > quite often possible, but needs a VERY high calibre of person to do
|> > the transformation!
|>
|> The big hump is possibly that the parallel friendly algorithms might
|> start out with 10% of the performance, requiring 10 times the hw
|> resources just to recover the initial algorithmic overhead, before you
|> get any gains at all.

That is a separate matter, but is certainly true :-(

Exercise for the reader: work out a parallel-friendly 3-D Fourier
transform algorithm, and analyse its performance with respect to a
parallel implementation of a conventional 3-D FFT. It is actually
very easy to do, and most instructive - if not useful!


Regards,
Nick Maclaren.