From: EricP on
Brett Davis wrote:
>>> The most important number is cache line size, which you missed.
>>> If your image is 1,024 lines tall, that will completely thrash
>>> the cache, resulting in 3 bytes copied per cache line load/spill.
>>>
>>> If you do the following you can get a 2x speedup, it looks like
>>> more code, but will generate less, and the results will be
>>> pipelined correctly.
>>> Extra bonus points to those that understand why. Half the posters here?
>>>
>>> {
>>> unsigned char *C = B+(H*j+H-i-1)*3;
>>> temp0 = A[0];
>>> temp1 = A[1];
>>> temp2 = A[2];
>>> C[0] = temp0;
>>> C[1] = temp1;
>>> C[2] = temp2;
>>> A += 3;
>>> }
>>>
>>> Do not use *C++ = *A++;
>>>
>> Programming hotshots have done so much damage.
>
> Damage?
> That is clean code that is easy to read and understand.
>
>> And they brag about it.
>
> Only one in a hundred programers know an optimizaton like that, for
> half of comp.arch to be that good says good things about comp.arch.

Ok, well, I'm going to take the risk of seeming a total doofuss
on a global scale, but I don't see what you are getting at.

A[x] and C[y] are referenced only once so there is no reason
for the compiler to enregister their values, and all other
variables are locals or pass-by-value, and therefore
no reason for aliasing to occur.

The SH-4 has byte memory access instructions, so this is just the
difference between LD ST LD ST LD ST and LD LD LD ST ST ST.
The latter requires 2 extra temp regs, which on x86 causes
a var spill into the stack, but probably is ok on sh-4.

So I don't see the 2x speedup here.

Eric

From: Casper H.S. Dik on
EricP <ThatWouldBeTelling(a)thevillage.com> writes:

>A[x] and C[y] are referenced only once so there is no reason
>for the compiler to enregister their values, and all other
>variables are locals or pass-by-value, and therefore
>no reason for aliasing to occur.


A proper compiler will generate optimal code for:


for (i = 0; i < len; i++)
C[i] = A[i];

or similar code and will generate about the same code for:

pC = C; pA = A;
while (*pC < C[len])
*pC++ = *pA++;


If you can generate faster code by unrolling loops in C then you should get
a better compiler.

Casper
--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.
From: nmm1 on
In article <4bc601f5$0$22917$e4fe514c(a)news.xs4all.nl>,
Casper H.S. Dik <Casper.Dik(a)Sun.COM> wrote:
>EricP <ThatWouldBeTelling(a)thevillage.com> writes:
>
>>A[x] and C[y] are referenced only once so there is no reason
>>for the compiler to enregister their values, and all other
>>variables are locals or pass-by-value, and therefore
>>no reason for aliasing to occur.
>
>A proper compiler will generate optimal code for:
>
> for (i = 0; i < len; i++)
> C[i] = A[i];
>
>or similar code and will generate about the same code for:
>
> pC = C; pA = A;
> while (*pC < C[len])
> *pC++ = *pA++;

Including when A, C and len are arguments? Oh, come now! Few
start with a test for whether A and C overlap, which is what is
needed.

>If you can generate faster code by unrolling loops in C then you should get
>a better compiler.

You should know better than to post that nonsense. There is a LOT
of C code where the programmer knows that two arrays can't overlap,
but the compiler can't tell.

C99 restrict brings programs that use it religiously and correctly
up to about the optimisability of Fortran 77, but that's all. And,
without it, the compiler has one hand tied behind its back.


Regards,
Nick Maclaren.
From: EricP on
Casper H.S. Dik wrote:
> EricP <ThatWouldBeTelling(a)thevillage.com> writes:
>
>> A[x] and C[y] are referenced only once so there is no reason
>> for the compiler to enregister their values, and all other
>> variables are locals or pass-by-value, and therefore
>> no reason for aliasing to occur.
>
>
> A proper compiler will generate optimal code for:
>
>
> for (i = 0; i < len; i++)
> C[i] = A[i];
>
> or similar code and will generate about the same code for:
>
> pC = C; pA = A;
> while (*pC < C[len])
> *pC++ = *pA++;
>
>
> If you can generate faster code by unrolling loops in C then you should get
> a better compiler.
>
> Casper

Ok, but that isn't what he said or what I was commenting on.

The MS compiler does a surprisingly good job on the
original cache-hostile code:

void Rotate90ToRight
(int height, int width, const char *srcBuf, char *dstBuf)
{
int i, j;
for (i = 0; i < height; i++)
for (j = 0; j < width; j++)
memcpy (dstBuf+(height*j+height-1-i)*3, srcBuf+(width*i+j)*3, 3);
}


tv410 = -4 ; size = 4
tv434 = 8 ; size = 4
_height$ = 8 ; size = 4
_width$ = 12 ; size = 4
_srcBuf$ = 16 ; size = 4
_dstBuf$ = 20 ; size = 4
_Rotate90ToRight(a)16 PROC ; COMDAT
; Line 3
push ecx
; Line 7
mov eax, DWORD PTR _height$[esp]
test eax, eax
jle SHORT $LN4(a)Rotate90To
mov edx, DWORD PTR _dstBuf$[esp]
push ebx
push ebp
mov ebp, DWORD PTR _srcBuf$[esp+8]
push esi
mov esi, DWORD PTR _width$[esp+12]
push edi
lea ecx, DWORD PTR [esi+esi*2]
lea edi, DWORD PTR [eax+eax*2]
mov DWORD PTR tv410[esp+20], ecx
lea edx, DWORD PTR [edi+edx-3]
mov DWORD PTR tv434[esp+16], eax
npad 5
$LL13(a)Rotate90To:
; Line 8
test esi, esi
jle SHORT $LN5(a)Rotate90To
mov eax, ebp
mov ecx, edx
npad 8
$LL3(a)Rotate90To:
; Line 9
mov bx, WORD PTR [eax]
mov WORD PTR [ecx], bx
mov bl, BYTE PTR [eax+2]
mov BYTE PTR [ecx+2], bl
add eax, 3
add ecx, edi
sub esi, 1
jne SHORT $LL3(a)Rotate90To
; Line 8
mov esi, DWORD PTR _width$[esp+16]
$LN5(a)Rotate90To:
; Line 7
add ebp, DWORD PTR tv410[esp+20]
sub edx, 3
sub DWORD PTR tv434[esp+16], 1
jne SHORT $LL13(a)Rotate90To
pop edi
pop esi
pop ebp
pop ebx
$LN4(a)Rotate90To:
; Line 10
pop ecx
ret 16 ; 00000010H

Eric

From: Terje Mathisen "terje.mathisen at on
EricP wrote:
> Brett Davis wrote:
>> Only one in a hundred programers know an optimizaton like that, for
>> half of comp.arch to be that good says good things about comp.arch.
>
> Ok, well, I'm going to take the risk of seeming a total doofuss
> on a global scale, but I don't see what you are getting at.
>
> A[x] and C[y] are referenced only once so there is no reason
> for the compiler to enregister their values, and all other
> variables are locals or pass-by-value, and therefore
> no reason for aliasing to occur.
>
> The SH-4 has byte memory access instructions, so this is just the
> difference between LD ST LD ST LD ST and LD LD LD ST ST ST.
> The latter requires 2 extra temp regs, which on x86 causes
> a var spill into the stack, but probably is ok on sh-4.
>
> So I don't see the 2x speedup here.

The key is that without alias knowledge, it is perfectly possible for
the two arrays to intersect, in which case it is illegal for the
compiler to unroll the loads and stores.

However, the huge advantage here is to load 12 bytes = 4 pixels into 3
registers, then do the same for the 3 following lines so that you have
loaded a full 4x4 pixel block with 12 load instructions instead of 36.

Next you do shifts & shuffles to rotate the 4x4 block, still ending up
with 12 registers, before storing them back with 12 writes.

This saves 2/3 of all the load/store operations.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"