From: MitchAlsup on
On Jul 6, 6:31 pm, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:
> Neither does well on pointer chasing
>                loop
>                     ptr = ptr->next
> or on hash table chasing
>                 loop
>                      h = hash(h+data[i])

The DRAM controller on my last design actually did data structures like
this with amazing clairvoyance. It turned out that the benchmark built,
then traversed, a very large data structure. After setup, the data
structure did not change "all that much" and left a clear pattern of
+1,+1,-3 <repeat> that our DRAM prefetcher latched onto, eliminating
the DRAM latency. One still paid the on-die latencies of the memory
controller and DRAM controller, however.
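
The flavor of the trick, as a toy C sketch (my reconstruction for
illustration, not the actual design; real controllers work on row/bank
addresses and have to filter noise):

  /* Track deltas between successive line addresses; when a short
     delta sequence repeats, predict the next delta and prefetch. */
  #include <stdint.h>

  #define HIST 8               /* delta history, [0] = newest      */
  #define PERIOD_MAX 4         /* longest repeating period sought  */

  typedef struct {
      int64_t last_addr;
      int64_t delta[HIST];
  } stride_pred;

  /* Record an access; return the predicted next delta, or 0. */
  int64_t stride_observe(stride_pred *p, int64_t line_addr)
  {
      for (int i = HIST - 1; i > 0; i--)
          p->delta[i] = p->delta[i - 1];
      p->delta[0] = line_addr - p->last_addr;
      p->last_addr = line_addr;

      /* Find the shortest period at which the history repeats;
         the next delta is then the one a full period back,
         e.g. the +1,+1,-3 pattern has period 3. */
      for (int period = 1; period <= PERIOD_MAX; period++) {
          int ok = 1;
          for (int i = 0; i + period < HIST; i++)
              if (p->delta[i] != p->delta[i + period]) { ok = 0; break; }
          if (ok)
              return p->delta[period - 1];
      }
      return 0;                /* no confident prediction */
  }

A prefetch of last_addr plus the predicted delta then hides the DRAM
access under the current one.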

Call Indirect and Jump Indirect tables, with and without hashing, are
generally pretty good given sufficient area to contain the footprint
and a little cleverness by the designers tuning the hash. Here, I
think, it is much more worthwhile for the software people to choose
reasonable data structures to contain the indirect tables, and then,
over time, let the HW guys stumble over things to optimize. So,
without SW doing anything special, all of a sudden it just gets
faster. AND it loses none of its obviousness or RAS. Not so with the
"special creative bent of the moment stuff."

Stride 1 and Stride N stuff is so 1970. If you can't get these right,
you might consider playing golf for a living.....
Gather/Scatter is where it is at today--yet I see so few that even make
a quality attempt here.
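
For anyone keeping score, the access pattern in question, in plain C
(names mine, purely for illustration):

  #include <stddef.h>

  /* The access pattern a gather/scatter memory pipe has to handle
     well: indexed reads and indexed writes with no regular stride. */
  void gather(float *dst, const float *src, const int *idx, size_t n)
  {
      for (size_t i = 0; i < n; i++)
          dst[i] = src[idx[i]];        /* gather: dst is dense   */
  }

  void scatter(float *dst, const float *src, const int *idx, size_t n)
  {
      for (size_t i = 0; i < n; i++)
          dst[idx[i]] = src[i];        /* scatter: dst is sparse */
  }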

Mitch
From: Andy 'Krazy' Glew on
On 7/6/2010 5:49 PM, jacko wrote:
> JMP (PC+(SP++))
>
> Is this an instruction worth having? SP is return stack pointer, for
> forth people this would be RP.

Get the syntax right. I think you mean

JMP (PC+(*SP++))

As in

tmp := load M[SP]
SP += sizeof(tmp)
tmp := PC + tmp
jmp (tmp)
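
A toy C model of that dispatch, if it helps (purely illustrative,
invented names, with the indirect jump modeled as a call through a
function-pointer table):

  #include <stdio.h>
  #include <stddef.h>

  typedef void (*op_fn)(void);
  static void op_a(void) { puts("A"); }
  static void op_b(void) { puts("B"); }

  int main(void)
  {
      op_fn code[] = { op_a, op_b, op_a, NULL }; /* the "PC" space      */
      ptrdiff_t stack[] = { 0, 1, 2, 3 };        /* PC-relative offsets */
      ptrdiff_t *sp = stack;                     /* the SP/RP above     */
      op_fn *pc = code;

      for (;;) {
          ptrdiff_t tmp = *sp++;       /* tmp := load M[SP]; SP++  */
          op_fn *target = pc + tmp;    /* tmp := PC + tmp          */
          if (*target == NULL)
              break;                   /* sentinel ends the demo   */
          (*target)();                 /* jmp (tmp)                */
      }
      return 0;
  }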


From: MitchAlsup on
On Jul 6, 7:49 pm, jacko <jackokr...(a)gmail.com> wrote:
> JMP (PC+(SP++))
>
> Is this an instruction worth having? SP is return stack pointer, for
> forth people this would be RP.

In the PDP-11 one had access to:

JSR PC,@(SP)+

which would use the SP to access the TOS (and pop it off), then jump
to that location, and push the return PC back onto the stack. They
used to call this a co-routine, and one could build some amazing
protocol machines using this technique. In effect, the state of the
other side of the protocol was kept in the return address instead of a
memory variable. So, when you called it back, you ended up at the code
that was appropriate for the state at hand.
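
A rough C analogue of the trick, with a function pointer standing in
for the saved return address (toy example, names invented):

  #include <stdio.h>

  /* The "state" of one side of the protocol is just the code
     pointer it will resume at, not a state variable. */
  typedef struct side {
      void (*resume)(struct side *self);
  } side;

  static void expect_ack(side *s);

  static void expect_syn(side *s)
  {
      puts("got SYN, now expecting ACK");
      s->resume = expect_ack;          /* next call-back lands here */
  }

  static void expect_ack(side *s)
  {
      puts("got ACK, handshake done");
      s->resume = expect_syn;          /* back to the initial state */
  }

  int main(void)
  {
      side peer = { expect_syn };
      peer.resume(&peer);              /* "call it back": SYN step */
      peer.resume(&peer);              /* "call it back": ACK step */
      return 0;
  }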

Mitch
From: Andy 'Krazy' Glew on
On 7/6/2010 7:31 PM, MitchAlsup wrote:
> On Jul 6, 6:31 pm, Andy 'Krazy' Glew<ag-n...(a)patten-glew.net> wrote:
>> Neither does well on pointer chasing
>>     loop
>>         ptr = ptr->next
>> or on hash table chasing
>>     loop
>>         h = hash(h+data[i])
>
> The DRAM controller on my last design actually did data structures like
> this with amazing clairvoyance. It turned out that the benchmark built,
> then traversed, a very large data structure. After setup, the data
> structure did not change "all that much" and left a clear pattern of
> +1,+1,-3 <repeat> that our DRAM prefetcher latched onto, eliminating
> the DRAM latency. One still paid the on-die latencies of the memory
> controller and DRAM controller, however.

Yes, Mitch.

But the benchmark in question was contrived. Run the program in the
benchmark for a while and you did not get this neat pattern: the
linked list became randomized.

I.e., your algorithm did not work as well in real life as it did in
the benchmark, even for the program in the benchmark.


However, your point becomes valid when you have a garbage collector
that tends to reorganize data so that certain pointer chases are
allocated sequentially.
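
A sketch of why that helps (hypothetical compaction pass, not any
particular collector): copy the list in traversal order into a fresh
arena, the way a copying GC tends to, and the pointer chase becomes a
stride the prefetcher can latch onto.

  #include <stdlib.h>

  typedef struct node { struct node *next; int payload; } node;

  /* Copy up to n nodes in traversal order; afterwards each ->next
     is the adjacent slot, i.e. a +1 stride in memory. */
  node *compact(const node *head, size_t n)
  {
      node *arena = malloc(n * sizeof *arena);
      if (arena == NULL)
          return NULL;
      size_t i = 0;
      for (const node *p = head; p != NULL && i < n; p = p->next, i++) {
          arena[i].payload = p->payload;
          arena[i].next    = &arena[i + 1];
      }
      if (i > 0)
          arena[i - 1].next = NULL;    /* terminate the copy */
      return arena;
  }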

From: Ira Baxter on

"Andy 'Krazy' Glew" <ag-news(a)patten-glew.net> wrote in message
news:4C34A662.9080505(a)patten-glew.net...
> I meant to add: my recommendation in the P6 code generator's guide was to
> issue
>
> CALL (reg) // indirect
> CALL constant address of most likely method
>
> This way, even if the indirect prediction did not work (P6 didn't have a
> fancy indirect predictor, just a BTB that predicted that the indirect
> branch went the same place as it did last time), you would at least have
> branched to whatever the compiler thought was the most likely method. So
> likely some prefetch would have worked.
>
> You can think of this as a hammock, convergent code, with the divergence
> occupying one instruction.
>
> Unfortunately, it required the called function to adjust the return
> address, which did not fly ABI-wise.
>

I've often wanted to control where the return location of a subroutine
was without doing a jump. A way to do this:
MOV rn,=returnlocation
JMP subroutine
....
subroutine: PUSH rn
....
RET

(One can also just leave the return address in the register if the called
procedure is a small leaf, but I'm interested in the PUSH-on-stack case.)
One additional nice advantage is that it is easy to build in-line parameter
lists.
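
In portable C, the nearest analogue I know is passing the return point
explicitly as a continuation (invented names, just to show the shape):

  #include <stdio.h>

  typedef void (*retcont)(void *ctx);

  /* The caller chooses where "return" goes: ret plays the role of
     rn, and ctx plays the in-line parameter list. */
  static void subroutine(int arg, retcont ret, void *ctx)
  {
      printf("working on %d\n", arg);
      ret(ctx);                        /* the PUSH rn / RET pair */
  }

  static void return_location(void *ctx)
  {
      printf("resumed with %s\n", (const char *)ctx);
  }

  int main(void)
  {
      subroutine(42, return_location, (void *)"caller state");
      return 0;
  }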

If you can do this, the double call above is possible without return address
adjustment :-}

MOV rn, =returnlocation
JMP (reg)
JMP constantaddress
returnlocation:
...

This raises register pressure some, but that's probably OK.

But it has a worse defect on modern machines. The hidden return stack
used by branch predictors doesn't see a return address being PUSHed,
just some data. So when the called routine returns, you get a
mispredict because the shadow return stack is out of sync.

To solve this problem, I've wished I had a

PUSHRETURN <operand>

to signal that a return address is being pushed, so as to help the
shadow return-stack predictor. (On the x86 one could easily add this
with a prefix byte on push instructions.)

Why isn't this a good idea, and why haven't we seen it in x86
machines? I think I saw something similar for the 370 architecture.
Is it a patent issue? Or just not worth the silicon according to
somebody's study?

Ira Baxter
Semantic Designs