From: Andy 'Krazy' Glew on
On 1/25/2010 3:43 AM, nedbrek wrote:
> Hello all,
>
> "Andy "Krazy" Glew"<ag-news(a)patten-glew.net> wrote in message
> news:4B5C999C.9060301(a)patten-glew.net...
>> nedbrek wrote:
>>> That's where my mind starts to boggle. I would need to see branch predictor
>>> and serialization data showing a window this big would deliver
>>> significant performance gains. We were looking at Itanium runs of Spec2k
>>> built by Electron (i.e. super optimized). We were assuming very heavy
>>> implementation (few serializing conditions). We were unable to scale this
>>> far.

By the way, Ned's comment about needing to see branch prediction data indicates a fundamental misunderstanding of
speculative multithreading. This was almost exactly Jim Smith's misunderstanding of nigh on ten years ago. (Plus, Jim
was not aware, at that time, of the significant advances in branch prediction made during the Willamette era. Jim's
misunderstanding somewhat inspired my work on multilevel branch predictors, as a way of getting the accuracy of a larger
predictor with the latency, i.e. the short bubble on predicted-taken branches, of a smaller branch predictor. Since I
was under NDA for the P4 branch predictor, I had to invent something almost as good to make my work relevant.)
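
To make the multilevel idea concrete, here is a minimal sketch in C++ of an overriding predictor. This is my own
illustration for the newsgroup, not the actual P4 mechanism - the table sizes, indexing, and names are all made up.
The point is simply that the small table answers in the fetch cycle, while the large table overrides it a cycle or
two later, paying only a short re-steer bubble when the two disagree:

#include <cstdint>
#include <vector>

struct MultilevelPredictor {
    std::vector<uint8_t> small_bht;  // small, fast: answers in the fetch cycle
    std::vector<uint8_t> large_bht;  // large, slow: more accurate, arrives later
    MultilevelPredictor() : small_bht(1u << 10, 1), large_bht(1u << 16, 1) {}

    bool fast_predict(uint64_t pc) const {           // available immediately
        return small_bht[pc & (small_bht.size() - 1)] >= 2;
    }
    bool slow_predict(uint64_t pc, uint64_t ghist) const {  // a cycle or two later
        return large_bht[(pc ^ ghist) & (large_bht.size() - 1)] >= 2;
    }
    // When the slow predictor disagrees, fetch is re-steered: a short bubble,
    // far cheaper than a full pipeline flush on an actual misprediction.
    bool overrides(uint64_t pc, uint64_t ghist) const {
        return fast_predict(pc) != slow_predict(pc, ghist);
    }
    void update(uint64_t pc, uint64_t ghist, bool taken) {
        auto bump = [taken](uint8_t &c) {   // saturating 2-bit counter update
            if (taken && c < 3) ++c;
            if (!taken && c > 0) --c;
        };
        bump(small_bht[pc & (small_bht.size() - 1)]);
        bump(large_bht[(pc ^ ghist) & (large_bht.size() - 1)]);
    }
};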

If you have a single thread feeding your instruction window, then any branch misprediction might conceivably invalidate
all subsequent instructions.

However, in SpMT you have multiple instruction fetch threads, from the same logical thread of execution, feeding the
instruction window. Typically these threads are control independent of each other. E.g. stuff from after the return of
a function is control independent of any branch inside the function itself, except for branches that cause the function
not to return, e.g. to throw an exception. Ditto loops.
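
A trivial (and entirely hypothetical) code fragment shows the pattern:

#include <cstdio>

// Whichever way the branch inside f() goes, the caller's code after the
// call site executes regardless - so an SpMT machine can fetch it in a
// separate thread without waiting for f's branches to resolve.
static int f(int x) {
    if (x > 0)          // this branch only selects a path within f...
        return x * 2;
    return -x;          // ...it cannot keep the post-return code from running
}

int main(void) {
    int r = f(7);       // thread 1 fetches and executes f's body
    int post = r + 100; // thread 2 fetches here: control independent of f's
                        // branches, data dependent only on the value r
    printf("%d\n", post);
    return 0;
}

Note the last comment: the post-return code is control independent of f's branches, but still data dependent on r -
which is exactly the data value dependence problem mentioned below.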

I.e. exploitation of control independence removes branch mispredictions as an impediment to large instruction windows.
Of course, one then has to worry about data value dependence.

Similarly, Ned mentions serialization. Now, admittedly, much work on stuff like SpMT is still vulnerable to the
serialization bottleneck - particularly versions of SpMT or DMT or whatever that leave speculative state in the cache
waiting to be committed, snooping to determine if it is still correct. However, my log-based SpMT is NOT vulnerable to
the serialization bottleneck. You never have to stop speculation because of serialization, because serialization
constraints are completely satisfied during verify re-execution. The verify re-execution engine may serialize itself,
but it never needs to invalidate the speculative log that it is verifying - the process of verification avoids that
need. If you were to build a dedicated verify re-execution engine, it would be a fairly simple in-order machine, with
low serialization costs. However, my preference is to minimize the amount of dedicated logic, and reuse the normal
processor, which may be OOO. Nevertheless, serializing that will be less costly in verify re-execution mode than
otherwise, because verify re-execution mode is so parallel.
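
For concreteness, here is a software caricature of the log-and-verify idea in C++. I emphasize that this is a
simplification for the newsgroup, not the hardware structure: the speculative thread appends its results to a log in
program order, and the verify pass re-executes the same operations non-speculatively, comparing each result against
the log. Serializing operations simply get re-executed for real during verification, which is why speculation never
has to stop for them:

#include <cstdint>
#include <functional>
#include <vector>

struct LogEntry {
    uint64_t dest;   // register number or memory address written
    uint64_t value;  // value the speculative thread produced for it
};

struct SpeculativeLog {
    std::vector<LogEntry> entries;  // appended in program order
    void record(uint64_t dest, uint64_t value) {
        entries.push_back({dest, value});
    }
};

// Replay the logged operations in order on the non-speculative machine.
// reexec stands in for "re-execute this instruction and report what it
// writes"; serialization constraints are honored here, not in the
// speculative thread. Returns the count of entries verified - everything
// past the first mismatch is squashed, and non-speculative execution
// resumes from that point.
size_t verify_reexecute(const SpeculativeLog &log,
                        const std::function<uint64_t(uint64_t)> &reexec) {
    size_t verified = 0;
    for (const LogEntry &e : log.entries) {
        if (reexec(e.dest) != e.value)
            break;  // mis-speculation detected: stop committing the log
        ++verified;
    }
    return verified;
}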

Pretty much the only time you really need to serialize a log-based SpMT machine is when you change the control register
bit that says "Never run in SpMT mode."

Now, of course, what we have really done is convert serialization into a prediction problem. If the places where we
have speculated past what would have been a serialization point on an old machine lead to many mis-speculations...