From: "Andy "Krazy" Glew" on
On 4/23/2010 10:34 AM, nmm1(a)cam.ac.uk wrote:
> In article<4BD15197.3050202(a)patten-glew.net>,
> Andy "Krazy" Glew<ag-news(a)patten-glew.net> wrote:
>>
>> So, it's happened again. Just as OOO CPU research was the poor cousin
>> during the early days of the RISC revolution, so SpMT is the poor
>> cousin in the early days of MultiCore and ManyCore. If M&M run out of
>> steam, I hope that Haitham Akkary's SpMT research will be there, still
>> ongoing, to pick up the pieces.
>
> Er, what have I missed?
>
> "A Minimal Dual-Core Speculative Multi-Threading Architecture"
>
> That abstract states a potential dual-core gain of 20% - not
> bad, but not exciting. Now, extensive experience with branch
> prediction and preloading analyses indicates that the gain with
> number of parallel lookaheads is rarely much better than log(N),
> so that isn't going to give more than a factor of two or so.
>
> Multi-core, on the other hand, has a potential LARGE gain, once
> people bite the bullet and start designing software for it. Yes,
> that's a big hurdle, and we know that some applications can't
> make use of it, but that's a hell of a lot more scalable. We
> know that some applications can get gains in the hundreds, and
> a few can do even better.


My analysis says that speedup for OOO and SpMT grows as sqrt(N). But, yes, it's sublinear.

The part that attracts me is that SpMT can run on a multithreaded CPU, and SpMT can run on a MultiCore or ManyCore MP
machine. I.e. it is not either/or. Build your ManyCore, but also get speedup for existing legacy code. The one minor
downside is that SpMT tends to do better if you have a bit of multithreading support: the best way I know of to fork
a thread onto a different core is to fork the thread locally first, and then have that runt thread copy its state to the other core.
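A toy software model of that fork-then-copy trick (purely illustrative: the names and the dict-as-core-state representation are my own invention, not any real SpMT hardware interface):

```python
import threading
from copy import deepcopy

def spmt_fork(parent_state, target_core):
    """Fork a 'runt' thread locally; the runt, not the parent,
    pays the cost of copying architectural state to the other core."""
    def runt():
        target_core["state"] = deepcopy(parent_state)
        target_core["ready"] = True
    t = threading.Thread(target=runt)
    t.start()
    return t  # the parent continues executing immediately

# Hypothetical architectural state and a second "core" to migrate to.
core1 = {"state": None, "ready": False}
t = spmt_fork({"pc": 0x400, "regs": [0] * 16}, core1)
t.join()
print(core1["ready"], hex(core1["state"]["pc"]))
```

The point the sketch captures is that the state copy happens off the parent's critical path, so the forking core never stalls at the fork.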

I must admit that I totally blew it back in 2000 when I predicted that SpMT would drag ManyCore into the marketplace.
ManyCore happened first, and SpMT may never happen. Or it may happen in the opposite order to what I thought: first in
software, with not-so-good performance, and then with the relatively small hardware tweaks necessary to make SpMT run fast.

--

By the way, that dual-core paper from Akkary came about at a time when people in Intel research were being pressured
towards Itanium. Dual core - because dual-core Itanium was on the roadmap, but not 4 core. No threading - because
Itanium was not going to adopt threading in the style of the x86 Pentium 4.

Since I was not in the labs at that time, I had the luxury of studying SpMT on 4 cores with 4 threads/core. Haitham did
not - he had to toe the company line. And, besides, 20% wasn't bad at a time when (a) there was not much MP demand, and
(b) UP speedups in the 1% range were considered good.

I believe that Haitham is still pursuing this research, at low boil.

--

By the way^2: a few years back I encountered a paper which argued that parallel algorithm performance is dramatically
reduced once you incorporate communication cost - essentially that most interesting algorithms' speedups are limited
to the d-th root of the number of processors, where d is the dimension of the space the processors are laid out in, i.e.
sqrt(N) for 2D VLSI, cube root for 3D. Since that is similar to the scaling rule observed for most other speedup
mechanisms, it seems comparable.

However, it is reasonable to assume that while both ILP/OOO/SpMT and Multithreaded/ManyCore are O(sqrt(N)) in
performance, the constant for multicore is better. There is low-hanging fruit.
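As a numeric illustration of that scaling claim (the constants here are assumptions chosen for illustration, not measurements):

```python
def speedup(n, dim=2, constant=1.0):
    """d-th-root speedup law: communication cost in a d-dimensional
    layout limits speedup to constant * n**(1/d) -
    sqrt(N) for 2D VLSI, cube root for 3D."""
    return constant * n ** (1.0 / dim)

# Both mechanisms scale as sqrt(N) in 2D, but an assumed 2x better
# constant separates multicore from SpMT at every N:
for n in (4, 16, 64, 256):
    print(n, round(speedup(n), 1), round(speedup(n, constant=2.0), 1))
```

Under this model the gap between the two never closes, but neither curve escapes the sqrt(N) ceiling.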

I greatly irritated Uzi Vishkin, the UMd professor who pushes the PRAM programming model, when I prepared a formal
argument that what I call an ILP-RAM - an abstract, theoretical, machine, with all of the known ILP tricks - was
equivalent in performance to a PRAM. In the usual way that "equivalent" is shown in theoretical computer science: I
showed that an ILP-RAM could emulate a PRAM in polynomial time, and that a PRAM can emulate an ILP-RAM in polynomial time.
From: nmm1 on
In article <4BD2614F.9010607(a)patten-glew.net>,
Andy "Krazy" Glew <ag-news(a)patten-glew.net> wrote:
>>>
>>> So, it's happened again. Just as OOO CPU research was the poor cousin
>>> during the early days of the RISC revolution, so SpMT is the poor
>>> cousin in the early days of MultiCore and ManyCore. If M&M run out of
>>> steam, I hope that Haitham Akkary's SpMT research will be there, still
>>> ongoing, to pick up the pieces.
>>
>> "A Minimal Dual-Core Speculative Multi-Threading Architecture"
>>
>> That abstract states a potential dual-core gain of 20% - not
>> bad, but not exciting. Now, extensive experience with branch
>> prediction and preloading analyses indicates that the gain with
>> number of parallel lookaheads is rarely much better than log(N),
>> so that isn't going to give more than a factor of two or so.
>
>My analysis says that speedup for OOO and SpMT grows as sqrt(N).
>But, yes, it's sublinear.

I think that we are at cross-purposes. I am talking about the
delivered, not potential, speedup - and that is dominated by the
predictability of the code. Except for vectorisable codes, and
similar simple cases, O(log N) is about the best achievable.

Now, OOO and SpMT aren't as limited as compiler optimisations,
because they can use dynamic information as well as static. But
they can use only admissible data (i.e. data from the past or
that which can be predicted using data from the past). Oracles
are not available.

The killer is the amount of code that involves a load or branch
based on the contents of something that needs loading. You can
do nothing except either wait for the first load, or follow all
possible paths. The latter is O(log N) in a good case or
O(log log N) in a bad one!


Regards,
Nick Maclaren.
From: Robert Myers on
nmm1(a)cam.ac.uk wrote:
>
> The killer is the amount of code that involves a load or branch
> based on the contents of something that needs loading. You can
> do nothing except either wait for the first load, or follow all
> possible paths. The latter is O(log N) in a good case or
> O(log log N) in a bad one!
>

The price of having an imperfect oracle is that it is very hard to
predict when a speculative branch is worth the effort. In the current
environment, where energy costs are a big and growing consideration,
it's hard to imagine cases where speculative threading is an attractive
strategy from the PoV of energy costs alone.

Robert.
From: "Andy "Krazy" Glew" on
On 4/24/2010 2:10 AM, nmm1(a)cam.ac.uk wrote:

> The killer is the amount of code that involves a load or branch
> based on the contents of something that needs loading. You can
> do nothing except either wait for the first load, or follow all
> possible paths. The latter is O(log N) in a good case or
> O(log log N) in a bad one!

Tjaden and Flynn showed that it is sqrt(N), way back (1968?).

I.e. that the speedup you could get by following all possible paths N levels deep was sqrt(N).

This was empirical, for whatever workloads they had way back then.

---

The thing that you are missing, Nick, is that following all paths does NOT involve exponential work. Many paths
converge. There is much work that is common to all paths, or at least to a large number of paths. If you can
common-subexpression this work, and avoid having to redo it on every path, then the amount of work involved in following
all possible paths goes down dramatically.
From: "Andy "Krazy" Glew" on
On 4/24/2010 7:11 AM, Robert Myers wrote:
> nmm1(a)cam.ac.uk wrote:
>>
>> The killer is the amount of code that involves a load or branch
>> based on the contents of something that needs loading. You can
>> do nothing except either wait for the first load, or follow all
>> possible paths. The latter is O(log N) in a good case or
>> O(log log N) in a bad one!
>>
>
> The price of having an imperfect oracle is that it is very hard to
> predict when a speculative branch is worth the effort. In the current
> environment, where energy costs are a big and growing consideration,
> it's hard to imagine cases where speculative threading is an attractive
> strategy from the PoV of energy costs alone.
>
> Robert.


Urgh!!!! I am having a conversation with amateurs who aren't even aware of the basics of power management. You are
expressing a point of view that was discredited for laptop computers more than 10 years ago.

What you say is true ONLY IF THERE IS NO ENERGY COST TO NOT SPECULATING.

But leakage remains big. So, if by speculating you can reach the point where you can turn the whole processor off more
quickly, and drop into a no-leakage or lower-leakage mode, and if the energy spent on the speculative work, both correctly
and incorrectly speculated, is less than the leakage that would have been burnt waiting for the result that would
eliminate the need for the speculation - then SPECULATING CAN SAVE POWER.
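Back-of-the-envelope arithmetic, with made-up numbers (the power and time figures are illustrative assumptions, not measurements of any real part):

```python
def energy(run_power, run_time, idle_power, idle_time):
    """Total energy = active energy + leakage while idle-but-on."""
    return run_power * run_time + idle_power * idle_time

# Assumed figures: a 100-unit stall; 1 W leaked while stalled but
# powered on; 0.05 W in deep sleep; 3 W while speculating.
DEADLINE = 100.0

# No speculation: leak for the whole stall, waiting for the load.
no_spec = energy(run_power=0.0, run_time=0.0,
                 idle_power=1.0, idle_time=DEADLINE)

# Speculation: 3 W for 20 units of (partly wasted) work, then the
# core is done and drops into the 0.05 W sleep state for the rest.
spec = energy(run_power=3.0, run_time=20.0,
              idle_power=0.05, idle_time=DEADLINE - 20.0)

print(no_spec, spec)
```

With these numbers the speculating core burns 64 units of energy against 100 for the core that sits leaking while it waits: speculation saves power precisely because "doing nothing" is not free.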

And this has been proven time and again. It is why low power processors often have branch predictors. It's why OOO is
on ARM's and Samsung's roadmap.

Sometimes we call this "hurry up and wait" versus "slow and steady".

It's not black and white. You need to know the statistics, the probabilities, and the tradeoff moves back and forth.

If you have a really low leakage mode that you can get into and out of quickly, it is better to stop speculating and
wait for results.

If you can switch to another thread, it may be better to do that instead of speculating - so long as the hardware resources
for the other thread don't cause too much leakage.

If you get an improved predictor, you may want to speculate more.

If you have confidence predictors...
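A confidence predictor turns that tradeoff into an explicit expected-energy test; a minimal sketch (the cost model and all numbers are my own illustrative assumptions):

```python
def should_speculate(p_correct, e_spec, e_wait_leak, e_sleep):
    """Speculate iff expected energy beats always-waiting.
    If the prediction is right we pay the speculative work plus
    sleep-state leakage; if wrong we pay the work AND still leak
    while waiting for the real result."""
    expected = e_spec + p_correct * e_sleep + (1 - p_correct) * e_wait_leak
    return expected < e_wait_leak

# A confident predictor tips the balance toward speculating;
# a poor one tips it back toward waiting.
print(should_speculate(0.9, e_spec=50, e_wait_leak=100, e_sleep=5))  # True
print(should_speculate(0.3, e_spec=50, e_wait_leak=100, e_sleep=5))  # False
```

As the predictor, the leakage figures, or the sleep-state entry cost change, the threshold moves - which is exactly the "tradeoff moves back and forth" point.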