From: Noob on
[ NB: CROSS-POSTED to comp.arch, comp.arch.embedded, comp.programming ]
[ Followup-To: set to comp.programming ]

Hello,

I imagine the following problem has been efficiently solved
over a million times in the past.

I need to rotate a picture clockwise 90 degrees.

Conceptually, my picture is represented by a matrix
(W columns, H rows) of pixels, with each pixel
represented by 3 octets.

pixel original_picture[W][H];
pixel rotated_picture[H][W];

At the implementation level, I'm dealing with 1D integer arrays.

int src[W*H*3];
int dest[W*H*3];

Conceptually, the rotation operation comes down to

rotated_picture[j][H-1-i] = original_picture[i][j]

My actual code is

for (i = 0; i < H; ++i)
for (j = 0; j < W; ++j)
memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3);

Consider as an example,
W = 1644
H = 1164
size = 5.74 MB

On my target platform, the rotation takes 1.65 seconds.

I'm trying to get this down under half a second.

I'm using the GNU tool chain. For some weird reason, we
compile everything -O0. The first thing I'll try is crank
gcc's optimization level.

I'm hoping gcc can perform some strength reduction, as the
index calculation seems to be taking a non-negligible fraction
of the total run-time.

Changing the loop to

for (i = 0; i < H; ++i)
for (j = 0; j < W; ++j)
{
memcpy(dest+(H*j+H-i-1)*3, src, 3);
src += 3;
}

brings a 5% improvement.

I thought changing the memcpy call to 3 explicit assignments
might help, but it actually slowed things down.

Perhaps I need to perform some tiling magic... I don't think
gcc performs automatic tiling?

Comments and insight would be greatly appreciated.

Regards.
From: Meindert Sprang on
"Noob" <root(a)127.0.0.1> wrote in message
news:hphm36$puk$1(a)speranza.aioe.org...
> My actual code is
>
> for (i = 0; i < H; ++i)
> for (j = 0; j < W; ++j)
> memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3);

Rewrite this in assembler and you'll be amazed. Calling memcpy to copy only
3 bytes is killing you.

Meindert


From: James Dow Allen on
On Apr 7, 5:17 pm, Noob <r...(a)127.0.0.1> wrote:
> I need to rotate a picture clockwise 90 degrees.

As Meindert mentions, bypassing 'memcpy' might help,
though some compilers might be doing that for you already.

As you imply, calculating addresses (and with '+' not '*')
in a more efficient way may help, though again some compilers
might be doing it for you.

Some other suggestions that might help:

1. Break large square into smaller squares for better
cache utilization. Your exact problem is discussed at
http://en.wikipedia.org/wiki/Cache-oblivious_algorithm
(Perhaps that wiki article links to source code.)

2. Often 4-byte pixels (allowing uint32 as datatype)
are used instead of 3-byte pixels. May seem very unlikely,
but using 33% *more* storage just *might* lead to speedup
because of faster pixel ops.

When I saw the subject line, I though you might have
been rotating by arbitrary angle, not 90-degrees.
Once you get fast 90-degree rotate. the elegant 3-shear
algorithm can build from that to get fast accurate rotation
by any angle.

James Dow Allen
From: Paul Carpenter on
In article <hphm36$puk$1(a)speranza.aioe.org>, root(a)127.0.0.1 says...
> [ NB: CROSS-POSTED to comp.arch, comp.arch.embedded, comp.programming ]
> [ Followup-To: set to comp.programming ]
>
> Hello,
>
> I imagine the following problem has been efficiently solved
> over a million times in the past.
>
> I need to rotate a picture clockwise 90 degrees.
>
> Conceptually, my picture is represented by a matrix
> (W columns, H rows) of pixels, with each pixel
> represented by 3 octets.
>
> pixel original_picture[W][H];
> pixel rotated_picture[H][W];
>
> At the implementation level, I'm dealing with 1D integer arrays.
>
> int src[W*H*3];
> int dest[W*H*3];

So you have an array of ints (undetermined width) and a lot of them
as a block of memory.

Why int for octets?

> Conceptually, the rotation operation comes down to
>
> rotated_picture[j][H-1-i] = original_picture[i][j]

At this point I went where does H come from?

Do mean dimensions where maximum i is always greater than maximum j
therefore you need to make a square o/p array, so the zeroth column
is still the same height as original and consists of the bottom row
of the input image.

> My actual code is
>
> for (i = 0; i < H; ++i)
> for (j = 0; j < W; ++j)
> memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3);

very messy

> Consider as an example,
> W = 1644
> H = 1164
> size = 5.74 MB
>
> On my target platform, the rotation takes 1.65 seconds.
>
> I'm trying to get this down under half a second.
>
> I'm using the GNU tool chain. For some weird reason, we
> compile everything -O0. The first thing I'll try is crank
> gcc's optimization level.

You nee to optimise your methods.

> I'm hoping gcc can perform some strength reduction, as the
> index calculation seems to be taking a non-negligible fraction
> of the total run-time.

Don't rely on compiler to sort out excessive algorith overhead.

> Changing the loop to
>
> for (i = 0; i < H; ++i)
> for (j = 0; j < W; ++j)
> {
> memcpy(dest+(H*j+H-i-1)*3, src, 3);
> src += 3;
> }
>
> brings a 5% improvement.

So you have seen that removing some complicated expression
from every iteration [e.g. W*i+j)*3 ] and replacing with
a simple addition speeds up operation.

This should give you a clue.

> I thought changing the memcpy call to 3 explicit assignments
> might help, but it actually slowed things down.

Because the array indecii are calculated three times (for input
and output), adding back lots of complicated expressions.

> Perhaps I need to perform some tiling magic... I don't think
> gcc performs automatic tiling?

You need to organise your data and use pointers, the rotation
is using fixed strides through arrays, which is simple pointer
addition (on most platforms)

You could still use memcpy but for three int/byte at a time is
wasteful. the following is an OUTLINE of doing three copies.

*op = *ip++;
op += H;
*op = *ip++;
op += H;
*op = *ip++;

You will need a few variables for pointers and tracking through inner
loop, to make sure things get reset correctly.

Only calculate your strides once and reuse them.

Simple do calculations minimum times.

> Comments and insight would be greatly appreciated.
>
> Regards.
>

--
Paul Carpenter | paul(a)pcserviceselectronics.co.uk
<http://www.pcserviceselectronics.co.uk/> PC Services
<http://www.pcserviceselectronics.co.uk/fonts/> Timing Diagram Font
<http://www.gnuh8.org.uk/> GNU H8 - compiler & Renesas H8/H8S/H8 Tiny
<http://www.badweb.org.uk/> For those web sites you hate
From: Noob on
[ Adding comp.arch.embedded back to the mix, since most of the replies come from that group ]

Paul Carpenter wrote:

> Noob wrote:
>
>> [ NB: CROSS-POSTED to comp.arch, comp.arch.embedded, comp.programming ]
>> [ Followup-To: set to comp.programming ]
>>
>> I imagine the following problem has been efficiently solved
>> over a million times in the past.
>>
>> I need to rotate a picture clockwise 90 degrees.
>>
>> Conceptually, my picture is represented by a matrix
>> (W columns, H rows) of pixels, with each pixel
>> represented by 3 octets.
>>
>> pixel original_picture[W][H];
>> pixel rotated_picture[H][W];
>>
>> At the implementation level, I'm dealing with 1D integer arrays.
>>
>> int src[W*H*3];
>> int dest[W*H*3];

I made a mistake. I meant to write
uint8_t src[W*H*3];
uint8_t dest[W*H*3];

> So you have an array of ints (undetermined width) and a lot of them
> as a block of memory.
>
> Why int for octets?

You're right. I meant uint8_t not int.

>> Conceptually, the rotation operation comes down to
>>
>> rotated_picture[j][H-1-i] = original_picture[i][j]
>
> At this point, I went where does H come from?

I'm not sure I understand what you mean.

I was trying to describe the algorithm at a slightly higher level.

pixel original_picture[W][H];
pixel rotated_picture[H][W];
for (i = 0; i < H; ++i)
for (j = 0; j < W; ++j)
rotated_picture[j][H-1-i] = original_picture[i][j]

There is perhaps a better (clearer) way to express the algorithm?

> Do mean dimensions where maximum i is always greater than maximum j
> therefore you need to make a square o/p array, so the zeroth column
> is still the same height as original and consists of the bottom row
> of the input image.
>
>> My actual code is
>>
>> for (i = 0; i< H; ++i)
>> for (j = 0; j< W; ++j)
>> memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3);
>
> very messy

In what sense?

>> Consider as an example,
>> W = 1644
>> H = 1164
>> size = 5.74 MB
>>
>> On my target platform, the rotation takes 1.65 seconds.
>>
>> I'm trying to get this down under half a second.
>>
>> I'm using the GNU tool chain. For some weird reason, we
>> compile everything -O0. The first thing I'll try is crank
>> gcc's optimization level.
>
> You nee to optimise your methods.

Methods? Do you mean the functions?

>> I'm hoping gcc can perform some strength reduction, as the
>> index calculation seems to be taking a non-negligible fraction
>> of the total run-time.
>
> Don't rely on compiler to sort out excessive algorithm overhead.
>
>> Changing the loop to
>>
>> for (i = 0; i< H; ++i)
>> for (j = 0; j< W; ++j)
>> {
>> memcpy(dest+(H*j+H-i-1)*3, src, 3);
>> src += 3;
>> }
>>
>> brings a 5% improvement.
>
> So you have seen that removing some complicated expression
> from every iteration [e.g. W*i+j)*3 ] and replacing with
> a simple addition speeds up operation.

Strength reduction is typically the kind of task where a compiler
should excel.

http://en.wikipedia.org/wiki/Strength_reduction

> This should give you a clue.
>
>> I thought changing the memcpy call to 3 explicit assignments
>> might help, but it actually slowed things down.
>
> Because the array indecii are calculated three times (for input
> and output), adding back lots of complicated expressions.

for (i = 0; i < H; ++i)
for (j = 0; j < W; ++j)
{
unsigned char *C = B+(H*j+H-i-1)*3;
C[0] = A[0];
C[1] = A[1];
C[2] = A[2];
A += 3;
}

was slower than

for (i = 0; i < H; ++i)
for (j = 0; j < W; ++j)
{
memcpy(B+(H*j+H-i-1)*3, A, 3);
A += 3;
}

on my platform.

>> Perhaps I need to perform some tiling magic... I don't think
>> gcc performs automatic tiling?
>
> You need to organise your data and use pointers, the rotation
> is using fixed strides through arrays, which is simple pointer
> addition (on most platforms)
>
> You could still use memcpy but for three int/byte at a time is
> wasteful. the following is an OUTLINE of doing three copies.
>
> *op = *ip++;
> op += H;
> *op = *ip++;
> op += H;
> *op = *ip++;
>
> You will need a few variables for pointers and tracking through inner
> loop, to make sure things get reset correctly.
>
> Only calculate your strides once and reuse them.
>
> Simple do calculations minimum times.

Thanks for the tips.

Regards.
 |  Next  |  Last
Pages: 1 2 3 4 5 6 7 8 9 10 11
Prev: #include "cpuid.os"
Next: aspect ratio algorithm needed.