From: Terje Mathisen "terje.mathisen at tmsw.no" on
nmm1(a)cam.ac.uk wrote:
> In article<5R4Yn.6004$cO.5854(a)newsfe09.iad>,
> Robert Myers<rbmyersusa(a)gmail.com> wrote:
>> Terje Mathisen wrote:
>>
>>>> 2. Anticipation cannot be anticipated. It must be learned.
>>>
>>> Experience has shown us that dynamic (i.e. learning) predictors perform
>>> far better than static compiler hints. :-)
>>
>> Was obvious to all right along, no? ;-)
>
> Well, yes and no. I don't actually think that it is right, as a
> universal rule.
>
> The point is that it has been known since time immemorial that
> highly disciplined languages enable much better static prediction.
> It always was obvious that static prediction was a clear loser in
> the sort of C/C++/etc. spaghetti that dominates so many areas at
> present. What wasn't clear (to me, at least) is whether dynamic
> prediction would be enough better to be worth the extra complexity.
> Well, the answer is that it is ....
>
> Also, all of the current approaches to static prediction that I
> know of seem to have been designed by hardware people living in
> an ivory tower that hasn't had any contact with the programmers
> since about 1960. In particular, using the program locality as
> the primary key is obviously inane, once you start using an
> 'object-oriented' approach.

This usually means that the HW has to know about some compiler
ABI details, like "the self pointer is always located in ECX/RCX", in
which case a branch predictor based on a combination of the current
EIP and ECX could do better than either of them alone.

I think I've read about such studies here in c.arch?
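I.e. something roughly like the following sketch (the table size and the
hash are invented, purely to illustrate the indexing idea):

#include <stdint.h>
#include <stddef.h>

// Illustrative only: an indirect-call predictor indexed by a hash of the
// call-site EIP and the object ("this") pointer, rather than the EIP alone.
struct ObjIndirectPredictor {
    static const size_t ENTRIES = 4096;      // invented size
    uint64_t target[ENTRIES];                // predicted call targets

    ObjIndirectPredictor() : target() {}

    static size_t index(uint64_t eip, uint64_t objptr) {
        // Fold the object pointer in, so two objects reached from the same
        // call site can get different predictions.
        return (eip ^ (objptr >> 4)) & (ENTRIES - 1);
    }
    uint64_t predict(uint64_t eip, uint64_t objptr) const {
        return target[index(eip, objptr)];
    }
    void update(uint64_t eip, uint64_t objptr, uint64_t actual) {
        target[index(eip, objptr)] = actual;
    }
};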
>
> I think that we could do a lot better if we approached static
> prediction more intelligently, even with current programs, and
> potentially better even than current dynamic prediction with
> improved programming paradigms.

Currently we know that some compilers have been able to achieve at least
some speedup for some object-oriented programs by a fairly heavy-handed
rewrite:

Since

call [reg+method_offset]

isn't handled (at all/very well) by the BP structures, the compiler
guesses (based on program feedback?) that most such virtual calls will
actually go to one specific type:

method_trampoline:
cmp reg,GUESSED_TYPE_VMT
je guessed_method
;; Some other type so we have to do the indirect call instead:
jmp [reg+method_offset]

Used this way the calling code will do a CALL method_trampoline instead
of the indirect call above.
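At the source level this is essentially the speculative devirtualization
trick; a rough C++ equivalent (class and method names invented for the
example, and the real compiler compares the vtable pointer directly rather
than going through typeid) might look like:

#include <typeinfo>

struct Shape  { virtual void draw() = 0; virtual ~Shape() {} };
struct Circle : Shape { void draw() { /* ... */ } };   // the guessed type

void call_draw(Shape *p) {
    if (typeid(*p) == typeid(Circle))                // plays the role of CMP reg,GUESSED_TYPE_VMT
        static_cast<Circle *>(p)->Circle::draw();    // direct, statically predictable call
    else
        p->draw();                                   // fall back to the indirect virtual call
}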

This is a win when both the static call and the CMP/JE are mostly
correctly predicted, but it hurts code size and can misbehave badly if
embedded in a library.

Personally I really dislike these kinds of contortions; they result in
code with quite brittle performance. It would be much better if the
hardware could learn to predict indirect calls!

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: nmm1 on
In article <ihq8g7-39v1.ln1(a)ntp.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>>
>> Also, all of the current approaches to static prediction that I
>> know of seem to have been designed by hardware people living in
>> an ivory tower that hasn't had any contact with the programmers
>> since about 1960. In particular, using the program locality as
>> the primary key is obviously inane, once you start using an
>> 'object-oriented' approach.
>
>This usually means that the HW either has to know about some compiler
>ABI details, like "the self pointer is always located in ECX/RCX", in
>which case a branch predictor which is based on a combination of the
>current EIP + ECX could do better than either of them alone.

Which makes it language-dependent and worse :-(

>I think I've read about such studies here in c.arch?

I haven't seen any, though I have ranted on about their necessity
a few times.

>Currently we know that some compilers have been able to achieve at least
>some speedup for some object-oriented programs by a fairly heavy-handed
>rewrite:

God help me, yes :-( Profile-directed inter-procedural optimisations
and all. They work marvellously when using libraries for essential
components.

>Personally I really dislike these kinds of contortions; they result in
>code with quite brittle performance. It would be much better if the
>hardware could learn to predict indirect calls!

Not just performance - RAS, too.


Regards,
Nick Maclaren.
From: Andy 'Krazy' Glew on
On 7/5/2010 1:04 AM, Torben Ægidius Mogensen wrote:

> Quite so. Dynamic prediction has the disadvantage that it (for
> practical reasons) can only look at a small fraction of the execution
> history (and only the past). Static prediction can model the entire
> execution, past present and future, albeit with some approximations (due
> to incomputability), so it can, theoretically, make better predictions.

This is a truism. It seems self evident.

Unfortunately, it is often not true.

First, let's talk about dynamic, hardware, analysis and prediction. Typical OOO instruction window sizes are circa 100
instructions, i.e. 15-25 basic blocks, spanning function calls and returns. P4's trace cache was circa 12,000 uops (Tom's
HW). Branch predictors extract small bits of history over longer periods of time. (BTW, this might be a nice
little research paper: how much history is contained in various predictors? E.g. what is the oldest piece of
information that, if changed, would change a prediction? Youngest? Average?)

Bottom line: HW dynamic predictors can look at spans of circa 100 to 1000s of instructions, with very few limitations
(they don't care about separate compilation, etc.).

This is the state of the art. Bleeding edge, blue sky: kilo-instruction processors. In my imagination: I have discussed
log-based processors here, with logs into the 100s of thousands of instructions. Also, we can imagine processors that look
at speculative execution, and analyze speculative paths that have not been committed to yet. This is much of the benefit
of highly speculative processors like DMT and RunAhead. So HW prediction can look some considerable distance into the
future.

By the way, we don't talk much about SW dynamic prediction, but it can be done, and might be a topic for another post.
It was described on my UIUC and UWISC websites for years, although I don't think I have yet put it up on my new
websites, now that I have the freedom to discuss such things (having left Intel).

Nevertheless, software analysis should be better.

But: software analysis often gets arbitrary limits imposed on it. E.g. intra-procedural only, not interprocedural. No
analysis across separately compiled units. Link time analysis, lack of. Non-applicability to dynamically linked code.
Non-applicability to dynamically generated code.

Oftentimes these limitations mean that SW static prediction models smaller scopes than HW dynamic prediction.

---

Let alone the leverage issue. HW dynamic prediction applies to all code on the system. Often SW static prediction only
applies to people using a particular tool chain, a particular VM or JITter.
From: Andy 'Krazy' Glew on
On 7/5/2010 3:59 AM, Terje Mathisen wrote:

> call [reg+method_offset]
>
> isn't handled (at all/very well) by the BP structures, the compiler
> guesses (based on program feedback?) that most such virtual calls will
> actually go to one specific type:
>
> method_trampoline:
> cmp reg,GUESSED_TYPE_VMT
> je guessed_method
> ;; Some other type so we have to do the indirect call instead:
> jmp [reg+method_offset]
>
> Used this way the calling code will do a CALL method_trampoline instead
> of the indirect call above.
>
> This is a win when both the static call and the CMP/JE are mostly
> correctly predicted, but it hurts code size and can misbehave badly if
> embedded in a library.
>
> Personally I really dislike these kinds of contortions; they result in
> code with quite brittle performance. It would be much better if the
> hardware could learn to predict indirect calls!


(1) Most modern machines have an indirect predictor. Indirect predictors work pretty well - up to the point where the
number of CALL-FROM locations exceeds the usually fairly small number of hardware entries allocated.

As Nick points out, this is using locality, specifically the PC or IP (instruction pointer) of the indirect call
instruction. With various flavors of history. But seldom with knowledge of what the actual object is. As you, Terje,
point out, knowing the object register means coupling the hardware to the code generation, and is usually not a good
thing. There are hardware schemes to deduce the object register. But, furthermore, knowing the object register only
helps you make a prediction when the branch is already in the hardware data structures - since what you do for an object
based predictor is have a table (IP,vtable_offset) that you join to tables of (object,class) and
(class,vtable_offset,target_IP). If you don't have the locality based table, then you can't make the prediction at the
instruction fetch pipestage; you have to wait until you have decoded the instruction. Which typically loses you several
- 6 or so - pipestages. Such a late, decode-time prediction can still be useful, since it may save you another 20 cycles
of branch misprediction latency. But it's not as good as predicting right up front.
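To make that join concrete, a rough software sketch (all table names and shapes are invented; real hardware would use
small set-associative arrays rather than hash maps):

#include <stdint.h>
#include <unordered_map>

struct ObjectBasedPredictor {
    std::unordered_map<uint64_t, uint32_t> callsite_to_vtbl_offset; // call IP -> vtable offset
    std::unordered_map<uint64_t, uint64_t> object_to_class;         // object ptr -> class (vtable) id
    std::unordered_map<uint64_t, uint64_t> class_method_to_target;  // (class,offset) key -> target IP

    bool predict(uint64_t call_ip, uint64_t objptr, uint64_t *target) const {
        // Without the locality-based (IP-indexed) entry we cannot even start,
        // which is the point above about fetch-time prediction.
        auto off = callsite_to_vtbl_offset.find(call_ip);
        if (off == callsite_to_vtbl_offset.end()) return false;
        auto cls = object_to_class.find(objptr);
        if (cls == object_to_class.end()) return false;
        auto tgt = class_method_to_target.find((cls->second << 16) | off->second);
        if (tgt == class_method_to_target.end()) return false;
        *target = tgt->second;
        return true;
    }
};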

(2) Take your example a bit further, Terje. I think that you are almost talking about the method caching that was in
SOAR (Smalltalk On A RISC). However, instead of inserting code at the callsite or trampoline

if (objptr.type == expected_type)
    goto (or call) expected_type.method
else
    indirect jmp or call objptr->vtable->entry_point[offset]

SOAR provided such checking at the call target.

And - you'll love this - went and modified the CALL instruction in the binary, changing it from an indirect jump or call
to a direct jmp or call.

I trust I don't need to explain the issues with dynamic self modifying code. But, nevertheless, even today this is
often the best approach - and often the JITter accomplishes the same thing.
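A rough software rendering of the idea (names and the fallback are invented approximations; SOAR did the check, and the
re-patching of the caller's CALL, on the actual machine code):

#include <stdio.h>

// The type check lives at the call target, not the call site.
struct Object { const void *vtable; };

static const void *CACHED_VTABLE = 0;        // class the direct call was patched for

static void slow_dispatch(Object *self) {
    // Stand-in for the full dynamic lookup; a real system would also re-patch
    // the caller so the next call goes straight to the right method.
    (void)self;
    printf("fallback to generic dispatch\n");
}

static void guessed_method(Object *self) {
    if (self->vtable != CACHED_VTABLE) {     // receiver is not the expected class
        slow_dispatch(self);
        return;
    }
    /* ... fast path: the method body proper ... */
}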

DAMN: must continue this posting later.
From: Andy 'Krazy' Glew on
On 7/4/2010 8:58 PM, Robert Myers wrote:
> Andy 'Krazy' Glew wrote:
>
>>
>> I think that the need for such mechanisms is not over. At least I
>> hope not. But I think that the money is always on the predictor guys.
>
> Because of a decision-making bias you don't agree with?

In this case, not really. It wasn't until I actually started working on branch predictors (warning: story below) that I
came to realize just how danged simple and cheap the successful predictors are.

I warned you: at UWisc, I wanted to work on big instruction windows. But Jim Smith said "big instruction windows cannot
be justified, because branch predictors are not good enough". Now, I knew something that Jim didn't: I knew that
Willamette's branch predictors were significantly better than any that had been published in academia. (I don't know the
exact numbers off the top of my head, but: if you go from 94% to 97% accuracy, the misprediction rate halves from 6% to
3%, so you have doubled the average number of instructions between mispredicted branches. Often more so.) Also, I knew
that main memory latencies of 300 cycles were coming.

So, in order to justify working on large instruction windows, I had to show that better branch predictors were possible.
But I couldn't talk about Willamette's branch predictors.

So I invented my own. Mainly, low hanging fruit in loop based branch prediction - removing the misprediction at the end
of a counted loop. And multilevel branch predictors: a small fast branch predictor, to make predictions without having
too big of a bubble on a predicted taken branch, and a larger, more accurate, branch predictor. I added a BPQ, branch
prediction queue, to hold the results of the L1 branch predictor: if the L2 predictor could finish before the L1
prediction from the BPQ was used, it was easy to handle. (Being me, I also worked on ways of delivering L2 branch
predictions later in the pipeline, even after the branch was in the scheduler.)

I also worked on a number of other branch predictors. However, the above was the low hanging fruit: loops, and just
building a big L2 predictor, using algorithms much the same as the L1 predictor, just more entries.
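For the loop case, the idea is roughly the following sketch (structure and sizes are invented for illustration): learn
the trip count of a branch, then predict the final, not-taken iteration.

#include <stdint.h>
#include <stddef.h>

struct LoopEntry {
    uint32_t learned_trips;   // trip count observed on the previous execution
    uint32_t current_count;   // iterations seen so far this time around
    bool     confident;       // trip count has repeated at least once
};

struct LoopPredictor {
    static const size_t N = 256;
    LoopEntry table[N];

    LoopPredictor() : table() {}

    bool predict_taken(uint64_t branch_ip) {
        LoopEntry &e = table[branch_ip % N];
        if (!e.confident) return true;                    // default: loop branch taken
        return e.current_count + 1 < e.learned_trips;     // last iteration -> fall through
    }
    void update(uint64_t branch_ip, bool taken) {
        LoopEntry &e = table[branch_ip % N];
        if (taken) {
            e.current_count++;
        } else {                                          // loop exit observed
            e.confident     = (e.learned_trips == e.current_count + 1);
            e.learned_trips = e.current_count + 1;
            e.current_count = 0;
        }
    }
};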

I found a lot of other predictors that work. For example, I experimented with variable length history predictors -
building what amount to ternary CAMs for history patterns: must be 0, must be 1, don't care. These are more efficient
in terms of number of entries; but it is very, very, very hard to be more efficient in perf/transistor than a large
table of 2 bit counters. Although Mitch has shown that adding a few tag bits helps greatly, even these you have to
trade off against increasing the number of two bit counters.
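For reference, the baseline everything has to beat per transistor is roughly this (table size invented):

#include <stdint.h>
#include <stddef.h>

// A plain table of 2-bit saturating counters indexed by the branch IP.
struct BimodalPredictor {
    static const size_t N = 16384;
    uint8_t counter[N];                       // 0..3; >= 2 means predict taken

    BimodalPredictor() : counter() {}

    bool predict(uint64_t ip) const { return counter[ip % N] >= 2; }

    void update(uint64_t ip, bool taken) {
        uint8_t &c = counter[ip % N];
        if (taken) { if (c < 3) c++; }
        else       { if (c > 0) c--; }
    }
};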

Prefetchers are a bit harder. But anything that can be represented in 2 bit counters...


>> E.g. my work in out of order execution and renaming: I remember being
>> *annoyed* when branch predictors became common, since I saw that as a
>> distraction. I was younger then.
>>
>> E.g. large window out-of-order execution tends to compete against
>> prefetchers.
>
> For cache space? Memory bandwidth?

Mainly, they get some of the same speedup.

Data prefetchers tend to do well on
a) stride-1 accesses
b) some stride-N accesses, although until recently they usually stopped at page boundaries

OOO execution tends to get all of the above, and, in addition
c) accesses that cross page boundaries - no problem
d) certain truly random accesses, e.g.

    loop
        a[i]->random_data_ptr

Data prefetchers tend not to do very well on the above, although there has been work.

Neither does well on pointer chasing

    loop
        ptr = ptr->next

or on hash table chasing

    loop
        h = hash(h + data[i])
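A rough sketch of what a simple per-PC stride prefetcher does (structure and sizes invented), which also shows why the
chasing loops above defeat it - the next address is not a fixed stride from the last one:

#include <stdint.h>
#include <stddef.h>

struct StrideEntry {
    uint64_t last_addr;
    int64_t  stride;
    int      confidence;
};

struct StridePrefetcher {
    static const size_t N = 512;
    StrideEntry table[N];

    StridePrefetcher() : table() {}

    // Returns true and fills *prefetch_addr when a prefetch should be issued.
    bool access(uint64_t load_pc, uint64_t addr, uint64_t *prefetch_addr) {
        StrideEntry &e = table[load_pc % N];
        int64_t d = (int64_t)(addr - e.last_addr);
        if (d != 0 && d == e.stride) {
            if (e.confidence < 2) e.confidence++;
        } else {
            e.stride = d;
            e.confidence = 0;
        }
        e.last_addr = addr;
        if (e.confidence >= 2) {
            *prefetch_addr = addr + e.stride;
            // Many real prefetchers also stop at page boundaries, which is
            // exactly the limitation mentioned in (b) above.
            return true;
        }
        return false;
    }
};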

(although I have discussed on comp.arch, and in ASPLOS WACI in what, 97?, how to use memory metadata freed up by
compression to do pointer prefetching. That rather merges OOO and prefetch.)

When prefetchers actually compete against the OOO machine, for cache space or memory bandwidth, the prefetcher is nearly
always wrong (except when the branch predictor is off in the weeds). Usually "demand" fetches are given priority over
prefetches. "Demand" fetches include OOO speculation, although one can imagine distinguishing true non-speculative
fetches from speculative ones.