Faster image rotation [Computer Architecture]


From: Robert Myers on 18 Apr 2010 20:05

Brett Davis wrote:

>
> The Story of Mel, a Real Programmer
> http://www.cs.utah.edu/~elb/folklore/mel.html
>
>> I am not him, my code is clear and readable.
>>

I've hacked machine code.

You probably know of the proof of the Poincare Conjecture and the
attempt to claim credit for it by filling in "missing" steps.

Grigory Perelman had a right to be annoyed. If what was obvious to him
was not obvious to others, that was not his failing.

The same rules, I claim, do not apply to programming. What you have
done, and why, should be obvious to anyone with enough competence to
read the syntax.

If it's not obvious through the code itself, then it should be obvious
from the comments.

Robert.

From: Brett Davis on 18 Apr 2010 21:59

In article <4BCB4EA2.4020706(a)patten-glew.net>,
"Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> wrote:

> On 4/18/2010 4:57 AM, Niels Jørgen Kruse wrote:
> > Andy "Krazy" Glew<ag-news(a)patten-glew.net> wrote:
> > Cortex A9 is not shipping in any product yet (I believe). Lots of
> > preannouncements though. The Apple A4 CPU is currently believed to be a
> > tweaked Cortex A8, perhaps related to the tweaked A8 that Intrinsity did
> > for Samsung before being acquired by Apple.
>
> One conspiracy-theorist type seems to think that it might actually be the
> PA Semi OOO PowerPC, running an ARM emulator.

A PowerPC running an emulator was within the realm of possibility.
I bet Apple looked at it.

The problem is that PowerPC offers little over what ARM provides.
(Besides 64 bit address mode, and a nice vector processor,
both of which would cost battery power.)
Thumb mode offers smaller code, a benefit for a handheld.

Apple knows that Thumb is a kludge; I hope Apple is looking at designing
their own CPU instruction set, engineering a competitive advantage.
Hopefully they design something nice like my CLIW.

If the ARM chip has a nice MMU then Apple can stay 32 bits for a decade,
otherwise the roadmap will be looking dire for ARM in ~2 years.
Fixing the MMU would be the easy solution; it buys time.

Brett

From: Anton Ertl on 19 Apr 2010 04:40

"Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> writes:
>Hmm... From my point of view, the Itanium was the first computer architecture driven mainly by academics with PhDs.

I guess the Berkeley RISC and Stanford MIPS were driven by the
students, then, not the advisors (who had PhDs).

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

From: Noob on 20 Apr 2010 10:32

John W Kennedy wrote:

> robin said:
>
>> Now 3 octets will be 9 bits, [...]

http://en.wikipedia.org/wiki/Octet_(computing)

> The rest of this is addressed to the original poster: I don't understand
> why you're using int variables for octets; they should be char.

You're right. It was a typo.

> For the rest, I'd do the following:
>
> typedef struct __Pixel {
>     unsigned char red, green, blue;
> } Pixel;
>
> Pixel src[W][H];
> Pixel dest[H][W];
>
> for (int i = 0; i < W; ++i)
>     for (int j = 0; j < H; ++j) {
>         dest[j][i].red = src[i][j].red;
>         dest[j][i].green = src[i][j].green;
>         dest[j][i].blue = src[i][j].blue;
>     }

Are you sure the above is a description of a rotation? :-)
Clockwise or counter-clockwise?
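
To make the distinction concrete, here is a minimal sketch in C. The names transpose and rotate_cw are illustrative only (they do not come from the quoted post), and the images are assumed to be flat row-major arrays of 32-bit pixels, H rows by W columns. Copying dest[j][i] = src[i][j] only swaps rows and columns, which is a transpose (a mirror about the diagonal). A clockwise quarter turn also has to reverse the order in which the source rows land in the destination columns.

```c
#include <stdint.h>

/* src: H rows by W columns, row-major. dst: W rows by H columns. */

/* Plain transpose: source row i becomes destination column i. */
void transpose(const uint32_t *src, uint32_t *dst, int W, int H)
{
    for (int i = 0; i < H; ++i)
        for (int j = 0; j < W; ++j)
            dst[H*j + i] = src[W*i + j];
}

/* Clockwise quarter turn: source row i becomes destination
   column H-1-i, so the top row ends up as the rightmost column. */
void rotate_cw(const uint32_t *src, uint32_t *dst, int W, int H)
{
    for (int i = 0; i < H; ++i)
        for (int j = 0; j < W; ++j)
            dst[H*j + (H-1-i)] = src[W*i + j];
}
```

Reversing the row index of the destination instead (dst[H*(W-1-j) + i]) would give the counter-clockwise turn.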

> I'd also try:
>
> for (int i = 0; i < W; ++i)
>     for (int j = 0; j < H; ++j)
>         memcpy(&dest[j][i], &src[i][j], sizeof (Pixel));
>
> and see which one is faster; it will depend on the individual compiler.
> (Make sure you test at the same optimization level you plan to use.)

The target system is
CPU: 266 MHz, dual issue, 5-stage integer pipeline, SH-4
RAM: Two 64-MB, 200-MHz, DDR1 SDRAM modules (on separate memory buses)

After much testing, it dawned on me that the system's memory
allocator returns non-cached memory. (I found no way to request
large contiguous buffers in cached memory.) All cache-specific
optimizations thus became irrelevant.

On this system, a load from non-cached memory has a latency of
~45 cycles, thus the only optimization that made sense was to
load 32-bit words instead of octets. I configured libjpeg to
output 32-bit pixels instead of 24-bit pixels.
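
As an illustration of why the wider pixel helps, the sketch below packs 24-bit RGB triples into one 32-bit word per pixel. It is not how libjpeg was configured here; the function name pack_rgb24 and the 0x00RRGGBB layout are assumptions made for the example. Once every pixel is a single aligned word, moving it costs one load and one store instead of three of each.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: expand npixels RGB triples (3 octets each)
   into 32-bit words laid out as 0x00RRGGBB (assumed layout). */
void pack_rgb24(const uint8_t *rgb, uint32_t *out, size_t npixels)
{
    for (size_t k = 0; k < npixels; ++k) {
        uint32_t r = rgb[3*k];
        uint32_t g = rgb[3*k + 1];
        uint32_t b = rgb[3*k + 2];
        out[k] = (r << 16) | (g << 8) | b;
    }
}
```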

Then I got away with trivial code:

void rotate_right(uint32 *A, uint32 *B, int W, int H)
{
    int i, j;
    for (i = 0; i < H; ++i)
        for (j = 0; j < W; ++j)
            B[H*j + H-1-i] = A[W*i + j]; /* B[j][H-1-i] = A[i][j] */
}

void rotate_left(uint32 *A, uint32 *B, int W, int H)
{
    int i, j;
    for (i = 0; i < H; ++i)
        for (j = 0; j < W; ++j)
            B[H*(W-1-j) + i] = A[W*i + j]; /* B[W-1-j][i] = A[i][j] */
}

gcc-4.2.4 -O2 was smart enough to strength-reduce the index
computation for both arrays.

00000000 <_rotate_right>:
   0:   86 2f   mov.l   r8,@-r15
   2:   15 47   cmp/pl  r7
   4:   96 2f   mov.l   r9,@-r15
   6:   63 68   mov     r6,r8
   8:   15 8f   bf.s    36 <_rotate_right+0x36>
   a:   73 61   mov     r7,r1
   c:   08 47   shll2   r7
   e:   83 69   mov     r8,r9
  10:   fc 77   add     #-4,r7
  12:   13 66   mov     r1,r6
  14:   7c 35   add     r7,r5
  16:   08 49   shll2   r9
  18:   04 77   add     #4,r7
  1a:   15 48   cmp/pl  r8
  1c:   07 8b   bf      2e <_rotate_right+0x2e>
  1e:   43 60   mov     r4,r0
  20:   53 63   mov     r5,r3
  22:   83 62   mov     r8,r2
  24:   06 61   mov.l   @r0+,r1
  26:   10 42   dt      r2
  28:   12 23   mov.l   r1,@r3
  2a:   fb 8f   bf.s    24 <_rotate_right+0x24>
  2c:   7c 33   add     r7,r3
  2e:   10 46   dt      r6
  30:   fc 75   add     #-4,r5
  32:   f2 8f   bf.s    1a <_rotate_right+0x1a>
  34:   9c 34   add     r9,r4
  36:   f6 69   mov.l   @r15+,r9
  38:   0b 00   rts
  3a:   f6 68   mov.l   @r15+,r8
  3c:   09 00   nop
  3e:   09 00   nop

The loop kernel is

  24:   06 61   mov.l   @r0+,r1
  26:   10 42   dt      r2
  28:   12 23   mov.l   r1,@r3
  2a:   fb 8f   bf.s    24 <_rotate_right+0x24>
  2c:   7c 33   add     r7,r3
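
For readers who do not read SH-4 assembly, the following is a rough hand-written C equivalent of what the compiler appears to have done, as far as the listing can be read: the source is walked linearly (the post-increment load through r0) and the destination advances by a constant stride of H words (the add of r7, which holds 4*H bytes, to r3), so no multiplication remains in the inner loop. The name rotate_right_reduced is hypothetical, uint32_t stands in for the uint32 typedef, and offsets are used instead of raw pointers so that the sketch never forms an out-of-range pointer.

```c
#include <stddef.h>
#include <stdint.h>

/* Same result as rotate_right, with the index arithmetic
   strength-reduced by hand into running offsets. */
void rotate_right_reduced(const uint32_t *A, uint32_t *B, int W, int H)
{
    if (W <= 0 || H <= 0)
        return;

    size_t w = (size_t)W, h = (size_t)H;
    size_t src = 0;   /* linear read position in A */
    size_t col = h;   /* destination column, counts down from H-1 */

    for (size_t i = 0; i < h; ++i) {
        --col;                    /* column H-1-i */
        size_t dst = col;         /* top of that column in B */
        for (size_t j = 0; j < w; ++j) {
            B[dst] = A[src++];    /* one load, one store */
            dst += h;             /* step down one row of B */
        }
    }
}
```

The inner loop is then a load, a store, a pointer bump and a counted branch, which matches the five-instruction kernel above.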

Thanks to all for your suggestions (especially Terje).

Regards.

From: MitchAlsup on 19 Apr 2010 12:17

On Apr 18, 4:32 pm, n...(a)cam.ac.uk wrote:
> Yup. In my view, interrupts are doubleplus ungood - message passing
> is good.

CDC was the only company to get this one right. The OS ran mostly* in
the peripheral processors, leaving the great big number cruncher to
(ahem) crunch numbers.

(*) The interrupt processing and I/O was run in the PPs and most of the
OS scheduling was run in the PPs.

I remember a time back in 1979, I was logged into a CDC 7600 in
California doing text editing. There were a dozen other members of the
SW team doing similarly. There was a long silent pause where no
character echoing was taking place. A few moments later (about 30
seconds) the processing returned to normal. However, we found out that
we were now logged into a CDC 7600 in Chicago. The California machine had
crashed, and the OS had picked up all the nonfaulting tasks, shipped
them up to another machine half way across the country and restarted
the processes.

Why can't we do this today? We could 30 years ago!

Mitch
