From: EricP on
Terje Mathisen wrote:
> EricP wrote:
>> Brett Davis wrote:
>>> Only one in a hundred programmers knows an optimization like that; for
>>> half of comp.arch to be that good says good things about comp.arch.
>>
>> Ok, well, I'm going to take the risk of seeming a total doofus
>> on a global scale, but I don't see what you are getting at.
>>
>> A[x] and C[y] are referenced only once so there is no reason
>> for the compiler to enregister their values, and all other
>> variables are locals or pass-by-value, and therefore
>> no reason for aliasing to occur.
>>
>> The SH-4 has byte memory access instructions, so this is just the
>> difference between LD ST LD ST LD ST and LD LD LD ST ST ST.
>> The latter requires 2 extra temp regs, which on x86 causes
>> a var spill into the stack, but probably is ok on sh-4.
>>
>> So I don't see the 2x speedup here.
>
> The key is that without alias knowledge, it is perfectly possible for
> the two arrays to intersect, in which case it is illegal for the
> compiler to unroll the loads and stores.

Ok, so how would it do better without considering buffer overlap
and unrolling the loop? It still seems to me that there would
be the same number of loads and stores. To really benefit it would
have to change byte loads and stores to dwords, as you say below.

Note that in the example MS code I posted, the compiler
generated the following code for the 3 byte memcpy:

; Line 9
mov bx, WORD PTR [eax]
mov WORD PTR [ecx], bx
mov bl, BYTE PTR [eax+2]
mov BYTE PTR [ecx+2], bl

which would give different results if the source and dest overlapped
but is legal because memcpy is defined as not considering overlap.
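A minimal illustration of that distinction (the helper name is mine, not from the thread): memcpy leaves overlapping copies unspecified, which is exactly the freedom the compiler exploits when it reorders the loads and stores above, while memmove defines the result.

```c
#include <string.h>

/* Shift the first n bytes of buf right by one.  The regions overlap,
   so memmove is required; memcpy here would be undefined behavior,
   and that very freedom is what lets a compiler reorder memcpy's
   loads and stores as in the listing above. */
static void shift_right1(char *buf, size_t n)
{
    memmove(buf + 1, buf, n);   /* defined even though regions overlap */
}
```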

> However, the huge advantage here is to load 12 bytes = 4 pixels into 3
> registers, then do the same for the 3 following lines so that you have
> loaded a full 4x4 pixel block with 12 load instructions instead of 36.
>
> Next you do shifts & shuffles to rotate the 4x4 block, still ending up
> with 12 registers, before storing them back with 12 writes.
>
> This saves 2/3 of all the load/store operations.
>
> Terje

I wonder about this. Yes, it saves lots of loads and stores,
but unless the SH-4 has byte extract and insert ops (and it doesn't look like it does)
you are going to spend ops doing moves, shifts, ands, and ors.
Coupled with the fact that the transpose approach touches
each source cache line once, and each dest line twice
(once for the source-to-dest copy-transpose, then the row reverse),
I think this would lose to just a straight copy-while-rotating.

I was thinking just doing the rotate during the copy from source to dest,
but being cache aware so that each line is loaded just once.

It should start in the lower left corner of srcBuf, copy a
vertical stripe, column by column, from bottom to top so the
output to dstBuf is serial and contiguous.
Having stores be contiguous might help if there was store queue merging.

The width of the stripe is 32 elements (96 bytes).
It should move 32 (or fewer) of the 3-byte elements,
then go back and do the next column in the stripe.

dst[0,0] = src[h-1,0]
dst[0,1] = src[h-2,0]
....
dst[0,31] = src[h-32,0]
dst[1,0] = src[h-1,1]
dst[1,1] = src[h-2,1]
....
dst[31,31] = src[h-32,31]

This gives a data cache working set of 33 lines,
the writes of dst are contiguous,
and each line is loaded from main memory just once.
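The stripe scheme above can be sketched in C roughly as follows. This is my own sketch, not code from the thread: the function name rotate90, the STRIPE constant, and the clockwise indexing dst[r][c] = src[h-1-c][r] are assumptions chosen to match the dst/src pairs listed above.

```c
#define STRIPE 32   /* elements per stripe; 32 * 3 bytes = 96 bytes */

/* Sketch: rotate an h x w image of 3-byte pixels 90 degrees clockwise,
   so dst[r][c] = src[h-1-c][r] (dst is w rows of h pixels).  The source
   is walked in stripes of STRIPE bottom rows at a time, so the working
   set stays at about STRIPE + 1 cache lines, each source line is
   loaded from memory once, and the destination writes are contiguous. */
static void rotate90(unsigned char *dst, const unsigned char *src,
                     int h, int w)
{
    for (int c0 = 0; c0 < h; c0 += STRIPE) {       /* stripe of dst cols */
        int n = (h - c0 < STRIPE) ? h - c0 : STRIPE;
        for (int r = 0; r < w; r++) {              /* each dst row */
            unsigned char *d = dst + (r * h + c0) * 3;
            for (int c = c0; c < c0 + n; c++) {    /* contiguous writes */
                const unsigned char *s = src + ((h - 1 - c) * w + r) * 3;
                d[0] = s[0];
                d[1] = s[1];
                d[2] = s[2];
                d += 3;
            }
        }
    }
}
```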

Eric



From: EricP on
EricP wrote:
> Terje Mathisen wrote:
>> However, the huge advantage here is to load 12 bytes = 4 pixels into 3
>> registers, then do the same for the 3 following lines so that you have
>> loaded a full 4x4 pixel block with 12 load instructions instead of 36.
>>
>> Next you do shifts & shuffles to rotate the 4x4 block, still ending up
>> with 12 registers, before storing them back with 12 writes.
>>
>> This saves 2/3 of all the load/store operations.
>>
>> Terje
>
> I wonder about this. Yes, it saves lots of loads and stores,
> but unless the SH-4 has byte extract and insert ops (and it doesn't look
> like it does)
> you are going to spend ops doing moves, shifts, ands, and ors.
> Coupled with the fact that the transpose approach touches
> each source cache line once, and each dest line twice
> (once for the source-to-dest copy-transpose, then the row reverse),
> I think this would lose to just a straight copy-while-rotating.
>
> I was thinking just doing the rotate during the copy from source to dest,
> but being cache aware so that each line is loaded just once.
>
> It should start in the lower left corner of srcBuf, copy a
> vertical stripe, column by column, from bottom to top so the
> output to dstBuf is serial and contiguous.
> Having stores be contiguous might help if there was store queue merging.
>
> The width of the stripe is 32 elements (96 bytes).
> It should move 32 (or fewer) of the 3-byte elements,
> then go back and do the next column in the stripe.
>
> dst[0,0] = src[h-1,0]
> dst[0,1] = src[h-2,0]
> ....
> dst[0,31] = src[h-32,0]
> dst[1,0] = src[h-1,1]
> dst[1,1] = src[h-2,1]
> ....
> dst[31,31] = src[h-32,31]
>
> This gives a data cache working set of 33 lines,
> the writes of dst are contiguous,
> and each line is loaded from main memory just once.
>
> Eric

These two could be combined so it would rotate a 4x4 block of 3-byte
elements within a vertical stripe, writing 4 sequential rows at once.
I was thinking it would require too many registers to rotate
the block, but I think it could be done with just 6 if you
spill each pending row value as soon as it is ready:
4 to hold the pending write value for each row,
one for the just loaded value, one for shifting, masking, merging.
Other regs for holding dstPtr, dstTmpPtr, srcPtr, srcTmpPtr, height, etc,
and it looks like it would pretty much fit in 16 registers,
and only load each cache line once.
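The accumulate-and-spill idea for one block might look like the sketch below. This is my own illustration, not code from the thread: the function name and the little-endian packing are assumptions, and it shows the core byte shuffling as a plain transpose of one 4x4 block (the full rotation also reverses column order through the outer addressing).

```c
#include <stdint.h>
#include <string.h>

/* Sketch: transpose one 4x4 block of 3-byte elements using word
   loads, four "pending destination row" accumulators, and one scratch
   word -- roughly the six working registers described above.  Each
   pending word is spilled the moment its 4 bytes are full.
   Little-endian byte order is assumed. */
static void transpose4x4_px3(unsigned char *dst, int dstStride,
                             const unsigned char *src, int srcStride)
{
    uint32_t pend[4] = {0, 0, 0, 0};  /* pending word for each dst row */
    int fill[4] = {0, 0, 0, 0};       /* bytes accumulated so far */
    unsigned char *out[4];
    for (int j = 0; j < 4; j++)
        out[j] = dst + j * dstStride;

    for (int i = 0; i < 4; i++) {            /* each source row */
        for (int w = 0; w < 3; w++) {        /* three word loads per row */
            uint32_t v;
            memcpy(&v, src + i * srcStride + 4 * w, 4);
            for (int b = 0; b < 4; b++) {    /* route each byte */
                int idx = 4 * w + b;         /* byte 0..11 of the row */
                int j = idx / 3;             /* its pixel -> dst row j */
                pend[j] |= ((v >> (8 * b)) & 0xffu) << (8 * fill[j]);
                if (++fill[j] == 4) {        /* word full: spill it now */
                    memcpy(out[j], &pend[j], 4);
                    out[j] += 4;
                    pend[j] = 0;
                    fill[j] = 0;
                }
            }
        }
    }
    /* each dst row receives exactly 12 bytes = 3 words, so nothing
       is left pending here */
}
```

Twelve word loads come in and twelve word stores go out per block, matching Terje's 12-vs-36 count, at the cost of the shift-and-mask routing in between.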

That's probably as good as you're going to get.

Eric

From: Brett Davis on
In article <0cgh97-unc2.ln1(a)ntp.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
> EricP wrote:
> > So I don't see the 2x speedup here.
>
> The key is that without alias knowledge, it is perfectly possible for
> the two arrays to intersect, in which case it is illegal for the
> compiler to unroll the loads and stores.
>
> However, the huge advantage here is to load 12 bytes = 4 pixels into 3
> registers, then do the same for the 3 following lines so that you have
> loaded a full 4x4 pixel block with 12 load instructions instead of 36.
>
> Next you do shifts & shuffles to rotate the 4x4 block, still ending up
> with 12 registers, before storing them back with 12 writes.
>
> This saves 2/3 of all the load/store operations.
>
> Terje

That is an excellent idea, but you will note that I only suggested long
accesses for the reads, and writing bytes, because I have made that
optimization before and ended up with longer, slower code. Much to my
dismay, I might add. (~5% slower.)
The reason is that almost all CPUs have a write combiner to reduce
memory traffic. Four bytes written in sequence end up as one long write
to the cache or memory. So all those extra shifts and adds just waste
cycles. Plus many of the SH chips have slow shift instructions, making
things even worse.
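The trade-off Brett describes can be shown with the two store patterns side by side (both function names are mine, and little-endian packing is assumed):

```c
#include <stdint.h>
#include <string.h>

/* "Clever" version: pack four bytes into one 32-bit store with shifts
   and ORs.  Little-endian packing assumed. */
static void store4_packed(unsigned char *d, unsigned char a,
                          unsigned char b, unsigned char c,
                          unsigned char e)
{
    uint32_t w = (uint32_t)a | ((uint32_t)b << 8)
               | ((uint32_t)c << 16) | ((uint32_t)e << 24);
    memcpy(d, &w, 4);                        /* one word store */
}

/* Naive version: four byte stores.  On a CPU with a merging store
   buffer these combine into one line write anyway, so the packing
   shifts above are pure overhead. */
static void store4_bytes(unsigned char *d, unsigned char a,
                         unsigned char b, unsigned char c,
                         unsigned char e)
{
    d[0] = a; d[1] = b; d[2] = c; d[3] = e;  /* four byte stores */
}
```

If the write combiner merges the byte stores, the packed version buys nothing and spends extra shift ops, which fits the ~5% slowdown reported above, especially on chips with slow shifts.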

Brett
From: Brett Davis on
In article <a5mxn.262830$Dv7.69033(a)newsfe17.iad>,
EricP <ThatWouldBeTelling(a)thevillage.com> wrote:
> Brett Davis wrote:
> >>> If you do the following you can get a 2x speedup, it looks like
> >>> more code, but will generate less, and the results will be
> >>> pipelined correctly.
> >>> Extra bonus points to those that understand why. Half the posters here?
> >>>
> >>> {
> >>> unsigned char *C = B+(H*j+H-i-1)*3;
> >>> temp0 = A[0];
> >>> temp1 = A[1];
> >>> temp2 = A[2];
> >>> C[0] = temp0;
> >>> C[1] = temp1;
> >>> C[2] = temp2;
> >>> A += 3;
> >>> }
> >>>
> >>> Do not use *C++ = *A++;
> >
> > Only one in a hundred programmers knows an optimization like that; for
> > half of comp.arch to be that good says good things about comp.arch.
>
> Ok, well, I'm going to take the risk of seeming a total doofus
> on a global scale, but I don't see what you are getting at.
>
> A[x] and C[y] are referenced only once so there is no reason
> for the compiler to enregister their values, and all other
> variables are locals or pass-by-value, and therefore
> no reason for aliasing to occur.
>
> So I don't see the 2x speedup here.
>
> Eric

Benchmark it, and try to stay in cache; a 66 MHz embedded CPU does
not have memory 600 cycles away.
Only an Intel class chip has enough OoO stuff to have a chance
of coming close to the same speed as my code.

Brett
From: Brett Davis on
In article
<17e7e045-9599-4000-a41c-8885ccfc95d0(a)g30g2000yqc.googlegroups.com>,
MitchAlsup <MitchAlsup(a)aol.com> wrote:

> On Apr 11, 10:24 pm, Brett Davis <gg...(a)yahoo.com> wrote:
> > If you do the following you can get a 2x speedup, it looks like
> > more code, but will generate less, and the results will be
> > pipelined correctly.
> > Extra bonus points to those that understand why.
>
> You have explicitly gotten rid of the aliasing issues so the compiler
> can avoid having to assume aliasing conflicts between C[0] and
> A[1],...
>
> Mitch

Excellent start. There are at least two or three other reasons, all
of which affect performance, unlike the aliasing issue, which will
only rear its ugly head when an actual alias violation occurs.
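The hazard Mitch identifies can be sketched with the 3-byte copy from earlier in the thread (function names are mine):

```c
/* Without alias information the compiler must assume the store to
   C[0] might overwrite A[1] or A[2], so each later load has to wait
   for the store before it: */
void copy3_serial(unsigned char *A, unsigned char *C)
{
    C[0] = A[0];   /* store may alias A[1], A[2] ... */
    C[1] = A[1];   /* ... so this load cannot be hoisted */
    C[2] = A[2];
}

/* Reading everything first removes the ordering constraint even
   without 'restrict'; the three loads can issue back to back and the
   stores pipeline behind them: */
void copy3_pipelined(unsigned char *A, unsigned char *C)
{
    unsigned char t0 = A[0], t1 = A[1], t2 = A[2];
    C[0] = t0;
    C[1] = t1;
    C[2] = t2;
}
```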

Brett