From: Terje Mathisen "terje.mathisen at tmsw.no" on
Andy 'Krazy' Glew wrote:
>
> (2) Take your example a bit further, Terje. I think that you are almost
> talking about the method caching that was in SOAR (Smalltalk on a
> RISC). However, instead of inserting code at the callsite or trampoline
>
> if( objptr.type == expected_type )
> goto (or call) expected_type.method
> else
> indirect jmp or call objptr->vtable->entry_point[offset]
>
> SOAR provided such checking at the call target.
>
> And - you'll love this - went and modified the CALL instruction in the
> binary, changing it from an indirect jump or call to a direct jmp or call.

Ouch!

That's betting that this particular call site will never be reached by
any other object type: Pretty much never a safe bet unless the
compiler/JITter can determine that only a single object type exists.

OTOH, if that is the case, then you don't need indirect calls at all...
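
For concreteness, the guarded call in Andy's snippet maps onto C roughly like this (just a sketch; the type tag, names, and vtable layout are invented for illustration, not SOAR's actual scheme):

#include <stdio.h>

typedef struct Obj Obj;
typedef void (*Method)(Obj *);

struct VTable { Method entry_point[4]; };
struct Obj    { int type; const struct VTable *vtable; };

enum { TYPE_POINT = 1 };

static void point_draw(Obj *self) { printf("drawing %p\n", (void *)self); }

static const struct VTable point_vtable = { { point_draw } };

/* monomorphic inline cache: guess the hot type, fall back to the vtable */
static void call_draw(Obj *objptr)
{
    if (objptr->type == TYPE_POINT)                 /* expected_type check    */
        point_draw(objptr);                         /* direct, inlinable call */
    else
        objptr->vtable->entry_point[0](objptr);     /* generic indirect call  */
}

int main(void)
{
    Obj p = { TYPE_POINT, &point_vtable };
    call_draw(&p);
    return 0;
}
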
>
> I trust I don't need to explain the issues with dynamic self modifying
> code. But, nevertheless, even today this is often the best approach -
> and often the JITter accomplishes the same thing.

Fun stuff.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Andy 'Krazy' Glew on
On 7/7/2010 12:30 AM, Terje Mathisen wrote:
> Andy 'Krazy' Glew wrote:
>>
>> (2) Take your example a bit further, Terje. I think that you are almost
>> talking about the method caching that was in SOAR (Smalltalk on a
>> RISC). However, instead of inserting code at the callsite or trampoline
>>
>> if( objptr.type == expected_type )
>> goto (or call) expected_type.method
>> else
>> indirect jmp or call objptr->vtable->entry_point[offset]
>>
>> SOAR provided such checking at the call target.
>>
>> And - you'll love this - went and modified the CALL instruction in the
>> binary, changing it from an indirect jump or call to a direct jmp or
>> call.
>
> Ouch!
>
> That's betting that this particular call site will never be reached by
> any other object type: Pretty much never a safe bet unless the
> compiler/JITter can determine that only a single object type exists.
>
> OTOH, if that is the case, then you don't need indirect calls at all...
>>
>> I trust I don't need to explain the issues with dynamic self modifying
>> code. But, nevertheless, even today this is often the best approach -
>> and often the JITter accomplishes the same thing.
>
> Fun stuff.
>
> Terje


I meant to add: my recommendation in the P6 code generator's guide was to issue

CALL (reg) // indirect
CALL constant address of most likely method

This way, even if the indirect prediction did not work (P6 didn't have a fancy indirect predictor, just a BTB that
predicted that the indirect branch went the same place as it did last time), you would at least have branched to
whatever the compiler thought was the most likely method. So likely some prefetch would have worked.

You can think of this as a hammock, convergent code, with the divergence occupying one instruction.

Unfortunately, it required the called function to adjust the return address, which did not fly ABI-wise.

Another version was to have an indirect stub

....
CALL indirect_stub_for_calls_through_eax
....


indirect_stub_for_calls_through_eax:
JMP (eax)
JMP constant address of most likely method

but the extra instructions, the extra indirection, and the proliferation of such stubs outweighed the benefit.


In the end, I think that the only version of this recommendation that survived was for non-call indirect branches

JMP table(eax)
JMP predicted_case_label

since with a plain jump there is no ABI-mandated return address that has to be adjusted.
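
A source-level analogue of that pairing (not the same two-instruction trick, but the same idea: put the predicted case on the fall-through, statically predicted path) might look like this; a sketch, where __builtin_expect is the GCC/Clang likelihood hint and the opcodes are invented:

/* sketch: hoist the hot opcode ahead of the jump-table dispatch so that
   its handler sits on the statically predicted (fall-through) path      */
enum { OP_ADD, OP_SUB, OP_MUL, OP_HALT };

static long run(const unsigned char *code, long acc)
{
    for (;;) {
        unsigned op = *code++;
        if (__builtin_expect(op == OP_ADD, 1)) {   /* predicted case first */
            acc += 1;
            continue;
        }
        switch (op) {                              /* the table(eax) part  */
        case OP_SUB:  acc -= 1; break;
        case OP_MUL:  acc *= 2; break;
        case OP_HALT: return acc;
        }
    }
}

int main(void)
{
    static const unsigned char prog[] = { OP_ADD, OP_ADD, OP_MUL, OP_HALT };
    return (int)run(prog, 0);   /* exits with 4 */
}
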


---

I always thought this was a pity.




From: Robert Myers on
Andy 'Krazy' Glew wrote:

> But the benchmark in question was contrived. Run the program in the
> benchmark for a while, and you did not get this neat pattern. The
> linked list became randomized.
>
> I.e. your algo did not work as well in real life as it did in the
> benchmark, even for the program in the benchmark.
>
>
> However: your point is valid (becomes valid) when you have a garbage
> collector that tends to reorganize data so that certain pointer chases
> are allocated sequentially.
>

Some natural systems apparently acquire (learn) their own ad hoc
algorithms and lay them aside when they cease to be useful. Humans can
adjust to conditions they've never encountered before.

The surprise in computer architecture (for me) is the big payoff you get
with minimal resources, and my conjecture (or hope) is that elaboration
on that kind of success will allow increasingly accurate and elaborate
prediction without designing it in and without busting the power budget.

Robert.
From: Terje Mathisen "terje.mathisen at tmsw.no" on
Ira Baxter wrote:
> "Andy 'Krazy' Glew"<ag-news(a)patten-glew.net> wrote in message
> news:4C34A662.9080505(a)patten-glew.net...
>> On 7/7/2010 12:30 AM, Terje Mathisen wrote:
>>> Andy 'Krazy' Glew wrote:
>>>>
>> I meant to add: my recommendation in the P6 code generator's guide was to
>> issue
>>
>> CALL (reg) // indirect
>> CALL constant address of most likely method
>>
>> This way, even if the indirect prediction did not work (P6 didn't have a
>> fancy indirect predictor, just a BTB that predicted that the indirect
>> branch went the same place as it did last time), you would at least have
>> branched to whatever the compiler thought was the most likely method. So
>> likely some prefetch would have worked.
>>
>> You can think of this as a hammock, convergent code, with the divergence
>> occupying one instruction.
>>
>> Unfortunately, it required the called function to adjust the return
>> address, which did not fly ABI-wise.
>>
>
> I've often wanted to control where the return location of a subroutine
> was without doing a jump. A way to do this:
> MOV rn,=returnlocation
> JMP subroutine
> ....
> subroutine: PUSH rn
> ....
> RET
>
> (One can also just leave the return address in the register if the called
> procedure is a small leaf, but I'm interested in the PUSH-on-stack-case).
> One additional nice advantage is that it is easy to build in-line parameter
> lists.
>
> If you can do this, the double call above is possible without return address
> adjustment :-}
>
> MOV rn, =returnlocation
> JMP (reg)
> JMP constantaddress
> returnlocation:
> ...
>
> This raises register pressure some, but that's probably OK.
>
> But it has a worse defect on modern machines.
> The hidden return stack used by the branch predictor doesn't see a
> return address being PUSHed, just some data. So when the called routine
> returns, you get a mispredict because the shadow return stack is out of
> sync.
>
> To solve this problem, I've wished I had a
> PUSHRETURN<operand>
> to signal that a return address is being pushed so as to help the shadow
> return stack predictor.
> (On the x86 one could easily add this with a prefix byte on push
> instructions).
>
> Why isn't this a good idea, and why haven't we seen it in x86 machines?
> I think I saw something similar for the 370 architecture. Is it a patent
> issue?
> Or just not worth the silicon according to somebody's study?

Definitely not worth the silicon:

It can be emulated very cheaply with something like

mov regN,returnaddress
jmp subroutine_which_will_jump_via_regN

and then end subroutine_which_will_jump_via_regN with just that, i.e.

...
jmp regN

If otoh you want a regular function, mostly called in the normal
fashion, to return elsewhere, then you can either place a tiny
trampoline at that site:

jmp trampoline_return
....

trampoline_return:
call function_to_be_called

or you can accept the fact that the return stack cache will be messed up.

I.e. if you do this all the time, then you need to synthesize your own
call/return ABI, or if it happens only under exceptional circumstances,
then it is OK to take a small performance hit.
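
In GNU C the emulation above can be spelled out almost literally with the labels-as-values extension (a sketch, GCC/Clang only; the label names are invented, and, as noted, no call/ret pair ever reaches the hardware return-stack predictor):

#include <stdio.h>

int main(void)
{
    void *ret;                 /* the "regN" holding the return location */
    int acc = 0;

    ret = &&back;              /* MOV rn, =returnlocation                */
    goto subroutine;           /* JMP subroutine                         */
back:
    printf("acc = %d\n", acc);
    return 0;

subroutine:                    /* nothing is pushed; the "return address" */
    acc += 42;                 /* lives only in ret                       */
    goto *ret;                 /* JMP (rn): return without a RET          */
}
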

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Robert Myers on
nmm1(a)cam.ac.uk wrote:
>
> Yes. There is a lot of good mathematical work on this, in the area
> of real (i.e. Kolmogorov) complexity theory. Most of it can be
> summarised for practical use as "if you have a genuinely random
> finite state machine, you are up the creek without a paddle, so
> consider what empirical, semi-mathematical heuristics you can use."
>
The theory and practice of identifying patterns in semi-random time
series must be very well developed, if only because of those
front-running supercomputers humming away on Wall Street. Is there a
related subfield (Andy apparently has developed one for his own use)
where the available resources are extremely constrained? How much of
this is genuinely new ground and how much might already exist in some
other context?

The empirical, semi-mathematical heuristics must have themselves come
from a learning predictor somewhere, even if it is the brain of a
computer architect.

Robert.