From: Skybuck Flying on
original

.------------------->X
|
|
|
|
|
|
\|/
Y

rotated 90 degrees clockwise

y<--------------------.
|
|
|
|
|
|
\|/
x

new_x <---------------.
|
|
\|/
new_y

new_x = (height-1)-y;
new_y = x;

Source := 0;
Dest := (Height-1);
Count := 0;
for Index := 0 to PixelCount-1 do
begin
Source := Source + 1;
Dest := Dest-1;
Count := Count + 1;
if Count >= Height then
begin
Count := 0;
Dest := Dest + Height;
end;
Dest^ := Source^;
end;

Untested, but that's what I would conceptually try.

This should get rid of all the (slower?) multiplications, in return for one
single branch, and some fast adds and subs.
Plus a memory copy per pixel.

(What target you on ? ;))

Bye,
Skybuck.


From: nmm1 on
In article <4ec6f608-1d16-4c97-b47a-7c1d71879142(a)o30g2000yqb.googlegroups.com>,
Robert Myers <rbmyersusa(a)gmail.com> wrote:
>>
>> >> I need to rotate a picture clockwise 90 degrees.
>>
>> For extra marks, produce a much more cache-friendly version, using
>> blocks rather than rows ....
>>
>> You can even do that in place, and make it efficient. =A0But I suspect
>> that you (Terje) know all of this!
>>
>What would happen if you changed the default storage order for arrays
>of dimension two and greater, so that they were always stored in
>blocks with some cache or node-friendly order? You could make it
>nearly transparent in a language like c++ by defining appropriate
>objects.

The access time would go up, and some other operations would run a
lot more slowly. That technique works, but isn't any use as a
general solution. It can help for some applications.

TANSTAAFL.


Regards,
Nick Maclaren.
From: "Andy "Krazy" Glew" on
On 4/7/2010 8:59 AM, Terje Mathisen wrote:
> Follow-up restored to include a group I read)
> Noob wrote:
>> [ NB: CROSS-POSTED to comp.arch, comp.arch.embedded, comp.programming ]
>> [ Followup-To: set to comp.programming ]
>>
>> Hello,
>>
>> I imagine the following problem has been efficiently solved
>> over a million times in the past.
>>
>> I need to rotate a picture clockwise 90 degrees.
>>
>> Conceptually, my picture is represented by a matrix
>> (W columns, H rows) of pixels, with each pixel
>> represented by 3 octets.
>>
>> pixel original_picture[W][H];
>> pixel rotated_picture[H][W];
>>
>
> This is indeed a well-known problem, the key to make it fast is to
> realize that it is very similar to a transpose operation!
>
> I.e. first copy everything to the target array (in a single block move),
> then transpose it, finally you reverse each of the rows.
>
> The difference between clock and counter-clock-wise rotation is in doing
> the row reverse either before or after the transpose. Try both since I
> don't remeber which is which!
>
> The key here is that the transpose operation is quite cache-friendly,
> much better than your current naive code.

I usually defer to you wrt optimization, Terje, but unless you have an especially fast block copy instruction that the
user cannot roll himself (not unlike my P6-era fast strings), I doubt that it will be faster to copy and then transpose
in place.

It is probably better to subdivide the original picture into blocks, and perform the rotation reading a block of the
original_picture, and writing the block of the rotated_picture that it maps to.

The only real question is the size of the blocks: cache sized? Maybe not: if you have N write combining buffers, there
may be a benefit in writing blocks whose vertical height is equal to the number of WC buffers. However, when I have
tuned transposes in the past I have usually found it better to write out rows of the destination, reading columns of the
origin. This way you don't really depend on the number of WC buffers: you completely fill one WC buffer, and then move
to the next.

With the secondary issue of what width you read the columns in - e.g. if transposing byte sized elements, you might read
32 or 64 at a time, and then do a register transpose to assemble the blocks to write out. Nowadays, you have to look at
128 bit SSE and 256 bit AVX (or 512 bit LRB registers) as options. (Hmm, on a machine with scatter/gather, you might be
tempted to read in a full cache line of the source, and scatter write to the destination. But then you *really* want to
make sure that you fit into the WC buffers. Scatter read and single write probably better. But I suspect that an in
register transpose is better overall. How efficiently can LRB transpose a set of vector registers? E.g. with 32 bits
per element, a full cache line 64B register is 8 elements wide - so how efficiently can you transpose a matrix stored in
8 vector registers?

Interestingly, most GPUs allow indirect addressing on their register files. I think that I have seen index registers -
i.e. access register[register[i]] - as well as strided access TO THE GPU RF. In such a system, transposing or rotating
is just (1) read the data in, (2) write the data out. No shuffling about. Although there is cross lane movement - but
then the ALU lanes are not necessarily the same as the memory lanes)

In any case, you wat to avoid the unnecessary read-for-ownerships that you might naively get when writing into memory
one element of the time. SSE streaming stores.
From: Noob on
Terje Mathisen wrote:

> Noob wrote:
>
>> I imagine the following problem has been efficiently solved
>> over a million times in the past.
>>
>> I need to rotate a picture clockwise 90 degrees.
>>
>> Conceptually, my picture is represented by a matrix
>> (W columns, H rows) of pixels, with each pixel
>> represented by 3 octets.
>>
>> pixel original_picture[W][H];
>> pixel rotated_picture[H][W];
>
> This is indeed a well-known problem, the key to make it fast is to
> realize that it is very similar to a transpose operation!
>
> I.e. first copy everything to the target array (in a single block move),
> then transpose it, finally you reverse each of the rows.
>
> The difference between clock and counter-clock-wise rotation is in doing
> the row reverse either before or after the transpose. Try both since I
> don't remember which is which!
>
> The key here is that the transpose operation is quite cache-friendly,
> much better than your current naive code.

Hello Terje,

Your sig is the reason I thought it would be a good idea to include
comp.arch in the list :-)

Let me take a 5x3 image as an example.

0 1 2 3 4
5 6 7 8 9
10 11 12 13 14

Transpose

0 5 10
1 6 11
2 7 12
3 8 13
4 9 14

Reverse the rows

10 5 0
11 6 1
12 7 2
13 8 3
14 9 4

This is, indeed, equivalent to a 90� clockwise rotation.

I don't see why the transpose operation would be more
cache-friendly than directly doing the rotation.

TRANSPOSE = copy ROW(i) to COL(i)
90�CW ROT = copy ROW(i) to COL(H-1-i)

Is it because the column access is incremented rather than
decremented? I must be missing something...

Could you provide some insight, or some links?

Regards.
From: Terje Mathisen "terje.mathisen at on
Noob wrote:
> Let me take a 5x3 image as an example.
>
> 0 1 2 3 4
> 5 6 7 8 9
> 10 11 12 13 14
>
> Transpose
>
> 0 5 10
> 1 6 11
> 2 7 12
> 3 8 13
> 4 9 14
>
> Reverse the rows
>
> 10 5 0
> 11 6 1
> 12 7 2
> 13 8 3
> 14 9 4
>
> This is, indeed, equivalent to a 90� clockwise rotation.
>
> I don't see why the transpose operation would be more
> cache-friendly than directly doing the rotation.
>
> TRANSPOSE = copy ROW(i) to COL(i)
> 90�CW ROT = copy ROW(i) to COL(H-1-i)
>
> Is it because the column access is incremented rather than
> decremented? I must be missing something...
>
> Could you provide some insight, or some links?

The basic (fast) transpose operation works on squares that have power of
two sides, this is what allows it to work recursively, and thereby to
fit into lower cache levels as soon as possible.

The innermost loop of the transpose works with N registers each capable
of holding N elements, so you can transpose a 4x4 block of 32-bit values
using 4 SSE registers, or an 8x8 block of 16-bit values (as long as
you're in 64-bit mode so you have at least 16 regs available).

If you actually work with very small blocks, like your 5x3 example, then
it really doesn't matter, you can just generate unrolled code to do it
directly while copying, in which case the fastest code will be like
Andy's suggestion where you gather in data to be able to write full
cache lines to the target buffer.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"