From: Terje Mathisen on
EricP wrote:
> These two could be combined so it would rotate a 4*4 block of 3 byte
> elements in a vertical stripe and writing 4 sequential rows at once.
> I was thinking it would require too many registers to rotate
> the block but I think it could be done with just 6 if you
> spill each pending row value as soon as it is ready:
> 4 to hold the pending write value for each row,
> one for the just loaded value, one for shifting, masking, merging.
> Other regs for holding dstPtr, dstTmpPtr, srcPtr, srcTmpPtr, height, etc,
> and it looks like it would pretty much fit in 16 registers,
> and only load each cache line once.
>
> That's probably as good as you're going to get.

This is effectively what I'm suggesting, i.e. do the rotate while
copying, trying to minimize the number of load/store operations.

Working in 4x4 blocks is a good compromise.
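A minimal C sketch of that rotate-while-copying idea (illustrative names; a plain tiled loop rather than the register-spilling version EricP describes):

```c
#include <string.h>

/* Illustrative sketch: rotate a W x H image of 3-byte pixels 90 degrees
 * clockwise, walking the source in 4x4-pixel tiles so each source cache
 * line is touched once and four destination rows are written close
 * together. Mapping, as in the thread: dest[j][H-1-i] = src[i][j]. */
static void rotate90_tiled(const unsigned char *src, unsigned char *dst,
                           int w, int h)
{
    for (int i0 = 0; i0 < h; i0 += 4)
        for (int j0 = 0; j0 < w; j0 += 4)
            for (int i = i0; i < i0 + 4 && i < h; i++)
                for (int j = j0; j < j0 + 4 && j < w; j++)
                    memcpy(dst + (j * h + (h - 1 - i)) * 3,
                           src + (i * w + j) * 3, 3);
}
```

The tile edge of 4 is just what EricP proposed; on a real part you would tune it to the cache-line size.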

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: EricP on
Brett Davis wrote:
>>>>> If you do the following you can get a 2x speedup, it looks like
>>>>> more code, but will generate less, and the results will be
>>>>> pipelined correctly.
>>>>> Extra bonus points to those that understand why. Half the posters here?
>>>>>
>>>>> {
>>>>> unsigned char *C = B+(H*j+H-i-1)*3;
>>>>> temp0 = A[0];
>>>>> temp1 = A[1];
>>>>> temp2 = A[2];
>>>>> C[0] = temp0;
>>>>> C[1] = temp1;
>>>>> C[2] = temp2;
>>>>> A += 3;
>>>>> }
>>>>>
>>>>> Do not use *C++ = *A++;
>>> Only one in a hundred programmers knows an optimization like that; for
>>> half of comp.arch to be that good says good things about comp.arch.
>
> Benchmark it, try and stay in cache as a 66 MHz embedded CPU does
> not have memory 600 cycles away.
> Only an Intel class chip has enough OoO stuff to have a chance
> of coming close to the same speed as my code.
>
> Brett

Oh, you are avoiding read-after-write data
dependency pipeline stalls on in-order CPUs.

The buffer-overlap aliasing considerations in C prevent the
compiler from automatically rearranging the original LD/ST
order to be more pipeline-friendly for a particular CPU,
but Fortran would allow it.
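For what it's worth, C99's `restrict` is the usual escape hatch here. A sketch (not from the thread) of the pixel copy with the no-overlap promise made explicit, so the compiler is free to hoist the loads above the stores:

```c
/* Sketch (assumption, not from the thread): C99 `restrict` promises the
 * compiler that the two buffers cannot overlap, so it may schedule all
 * three loads before any of the stores, much as a Fortran compiler is
 * allowed to assume for its array arguments. */
static void copy_pixel(unsigned char *restrict dst,
                       const unsigned char *restrict src)
{
    unsigned char t0 = src[0], t1 = src[1], t2 = src[2];
    dst[0] = t0;
    dst[1] = t1;
    dst[2] = t2;
}
```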

Yeah, OK, that could be about a 2x speedup for that kind of CPU.

Eric

From: robin on
"Noob" <root(a)127.0.0.1> wrote in message news:hphm36$puk$1(a)speranza.aioe.org...

| I need to rotate a picture clockwise 90 degrees.
|
| Conceptually, my picture is represented by a matrix
| (W columns, H rows) of pixels, with each pixel
| represented by 3 octets.
|
| pixel original_picture[W][H];
| pixel rotated_picture[H][W];
|
| At the implementation level, I'm dealing with 1D integer arrays.
|
| int src[W*H*3];
| int dest[W*H*3];
|
| Conceptually, the rotation operation comes down to
|
| rotated_picture[j][H-1-i] = original_picture[i][j]
|
| My actual code is
|
| for (i = 0; i < H; ++i)
| for (j = 0; j < W; ++j)
| memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3);

This is not going to be fast because:-
1. you are calling a subroutine 1,913,616 times to move
9 or 48 bits each time.
2. You are multiplying by 3 twice for each move.
3. You are using integer arrays.

1. Best to use ordinary assignment.
2. Eliminate *3 by changing to adds.
3. Use byte arrays (but I'm not clear about what you say.
You say that you have octets. Now 3 octets will be 9 bits,
not 3 integers equal to 48 bits.)
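Back in C, those three suggestions might look like the following sketch (names illustrative): byte pointers, plain assignments instead of memcpy, and the *3 multiplies strength-reduced into pointer increments:

```c
/* Sketch of the three suggestions above, in C: byte arrays, ordinary
 * assignment, and the index arithmetic replaced by pointer adds.
 * Mapping is the thread's dest[j][H-1-i] = src[i][j]. */
static void rotate90_ptr(const unsigned char *src, unsigned char *dst,
                         int w, int h)
{
    const unsigned char *s = src;        /* walks the source row-major */
    for (int i = 0; i < h; i++) {
        /* start of destination column h-1-i */
        unsigned char *d = dst + (h - 1 - i) * 3;
        for (int j = 0; j < w; j++) {
            d[0] = s[0];
            d[1] = s[1];
            d[2] = s[2];
            s += 3;          /* next source pixel in this row */
            d += h * 3;      /* next destination row, same column */
        }
    }
}
```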

This is how it would look in PL/I:-

declare (src(m,n), dest(n,m)) char (3);
do i = 1 to m;
dest(*, m-i+1) = src(i,*);
end;

If the color information really is octets,
then the declaration is changed to bit (9) aligned.


From: glen herrmannsfeldt on
In comp.lang.pl1 robin <robin_v(a)bigpond.com> wrote:
> "Noob" <root(a)127.0.0.1> wrote in message news:hphm36$puk$1(a)speranza.aioe.org...
> | I need to rotate a picture clockwise 90 degrees.

> | Conceptually, my picture is represented by a matrix
> | (W columns, H rows) of pixels, with each pixel
> | represented by 3 octets.

> | pixel original_picture[W][H];
> | pixel rotated_picture[H][W];

> | At the implementation level, I'm dealing with 1D integer arrays.

> | int src[W*H*3];
> | int dest[W*H*3];

> | Conceptually, the rotation operation comes down to

> | rotated_picture[j][H-1-i] = original_picture[i][j]

> | My actual code is

> | for (i = 0; i < H; ++i)
> | for (j = 0; j < W; ++j)
> | memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3);

This is basically a matrix transpose (with the difference that
one goes down instead of up). There are cache friendly matrix
transpose algorithms that should also speed this one up.
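A hedged sketch of such a cache-blocked transpose (the tile size is an assumption to be tuned, and plain ints stand in for the pixels):

```c
#define BLK 32  /* tile edge; tune to the cache-line / L1 size (assumption) */

/* Sketch of a cache-blocked transpose: process the matrix in BLK x BLK
 * tiles so that both the row-major reads and the column-major writes
 * stay within a few cache lines per tile. */
static void transpose_blocked(const int *a, int *b, int rows, int cols)
{
    for (int i0 = 0; i0 < rows; i0 += BLK)
        for (int j0 = 0; j0 < cols; j0 += BLK)
            for (int i = i0; i < i0 + BLK && i < rows; i++)
                for (int j = j0; j < j0 + BLK && j < cols; j++)
                    b[j * rows + i] = a[i * cols + j];
}
```

The rotation differs only in that one of the destination indices runs backwards.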

> This is not going to be fast because:-
> 1. you are calling a subroutine 1,913,616 times to move
> 9 or 48 bits each time.

Many C compilers implement memcpy() inline. I suppose you
believe that PL/I does a subroutine call for the MOD function?

> 2. You are multiplying by 3 twice for each move.

Most can also figure that one out.

> 3. You are using integer arrays.

> 1. Best to use ordinary assignment.
> 2. Eliminate *3 by changing to adds.
> 3. Use byte arrays (but I'm not clear about what you say.
> You say that you have octets. Now 3 octets will be 9 bits,
> not 3 integers equal to 48 bits.)

-- glen
From: Brett Davis on
In article <Kw6yn.224427$wr5.103281(a)newsfe22.iad>,
EricP <ThatWouldBeTelling(a)thevillage.com> wrote:

> Brett Davis wrote:
> >>>>> If you do the following you can get a 2x speedup, it looks like
> >>>>> more code, but will generate less, and the results will be
> >>>>> pipelined correctly.
> >>>>> Extra bonus points to those that understand why. Half the posters here?
> >>>>>
> >>>>> {
> >>>>> unsigned char *C = B+(H*j+H-i-1)*3;
> >>>>> temp0 = A[0];
> >>>>> temp1 = A[1];
> >>>>> temp2 = A[2];
> >>>>> C[0] = temp0;
> >>>>> C[1] = temp1;
> >>>>> C[2] = temp2;
> >>>>> A += 3;
> >>>>> }
> >>>>>
> >>>>> Do not use *C++ = *A++;
> >>> Only one in a hundred programmers knows an optimization like that; for
> >>> half of comp.arch to be that good says good things about comp.arch.
> >
> > Benchmark it, try and stay in cache as a 66 MHz embedded CPU does
> > not have memory 600 cycles away.
> > Only an Intel class chip has enough OoO stuff to have a chance
> > of coming close to the same speed as my code.
> >
> > Brett
>
> Oh, you are avoiding read after write data
> dependency pipeline stalls on in-order cpus.

No, nice guess.
I don't know of any CPU that stalls on a read after write; instead
they try to forward the data, and in the rare case when a violation
occurs the CPU will throw an interrupt and restart the instructions.
This is an important comp.arch point, so someone will correct me
if I am wrong.

> The buffer overlap aliasing considerations in C prevents the
> compiler from automatically rearranging the original LD ST
> order to be more pipeline friendly for a particular cpu,
> but Fortran would allow it.

This is close; the answer is a mundane but important point.
CPUs today have a load-to-use delay of ~2 cycles from the level-one
cache (less for slower chips, more in faster chips).
An OoO chip can try to find other instructions to execute, but
even it is subject to this delay. (I believe.)
This copy loop is so small I don't think there is much even an
Intel/AMD chip can do. I was hoping you would benchmark it. ;)

A lot of the dual-issue CPUs are partially OoO and support
hit-under-miss for reads, but these will epic-fail at running
code like this faster.

The future of computing (this decade) is lots of simple in-order CPUs.
Rules of die size, heat and efficiency kick in. Like ATI chips.

> Yeah ok that could be about 2x speed for that kind of cpu.
>
> Eric