From: MitchAlsup on
On Apr 11, 10:24 pm, Brett Davis <gg...(a)> wrote:
> If you do the following you can get a 2x speedup, it looks like
> more code, but will generate less, and the results will be
> pipelined correctly.
> Extra bonus points to those that understand why.

You have explicitly gotten rid of the aliasing issues, so the compiler
can avoid having to assume aliasing conflicts between C[0] and A[0].

From: Robert Myers on
On Apr 11, 11:24 pm, Brett Davis <gg...(a)> wrote:
> Reposted from comp.arch.embedded
> >> I need to rotate a picture clockwise 90 degrees.
> > The data sheet states
> > SH-4 32-bit super-scalar RISC CPU
> > o 266 MHz, 2-way set associative 16-Kbyte ICache, 32-Kbyte DCache, MMU
> > o 5-stage pipeline, delayed branch support
> > o floating point unit, matrix operation support
> > o debug port, interrupt controller
> The most important number is cache line size, which you missed.
> If your image is 1,024 lines tall, that will completely thrash
> the cache, resulting in 3 bytes copied per cache line load/spill.
> If you copy 16x16 tiles you can get a 10x speedup.
> CopyTile16(source, dest, x, y, width, height)...
> You can also try 8x8 and 4x4; smaller loops can be faster due
> to all the args fitting in registers, and the loop getting unrolled,
> but the difference will be ~25%, which is hardly worth the time to code.
> >    for (i = 0; i < H; ++i)
> >      for (j = 0; j < W; ++j)
> >      {
> >        unsigned char *C = B+(H*j+H-i-1)*3;
> >        C[0] = A[0];
> >        C[1] = A[1];
> >        C[2] = A[2];
> >        A += 3;
> >      }
> If you do the following you can get a 2x speedup, it looks like
> more code, but will generate less, and the results will be
> pipelined correctly.
> Extra bonus points to those that understand why. Half the posters here?
>      {
>        unsigned char *C = B+(H*j+H-i-1)*3;
>        temp0 = A[0];
>        temp1 = A[1];
>        temp2 = A[2];
>        C[0] = temp0;
>        C[1] = temp1;
>        C[2] = temp2;
>        A += 3;
>      }
> Do not use *C++ = *A++;
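The tiling advice quoted above can be sketched roughly as follows. This is only an illustration of the idea, not code from the thread: the function name is hypothetical, and for brevity it assumes W and H are multiples of 16 (the destination index formula is the same one used in the quoted loop).

```c
#include <assert.h>

/* Rotate a W x H image of packed 24-bit RGB pixels 90 degrees
 * clockwise, one 16x16 tile at a time, so that both the source reads
 * and the destination writes stay within a small working set of
 * cache lines. Hypothetical helper; assumes W and H are multiples
 * of 16. A is the source, B the destination. */
static void rotate90_tiled(const unsigned char *A, unsigned char *B,
                           int W, int H)
{
    for (int ty = 0; ty < H; ty += 16)
        for (int tx = 0; tx < W; tx += 16)
            for (int i = ty; i < ty + 16; ++i)
                for (int j = tx; j < tx + 16; ++j) {
                    const unsigned char *src = A + (i * W + j) * 3;
                    /* Same destination formula as the quoted loop. */
                    unsigned char *dst = B + (H * j + H - i - 1) * 3;
                    /* Load into temporaries first, as advised above. */
                    unsigned char t0 = src[0];
                    unsigned char t1 = src[1];
                    unsigned char t2 = src[2];
                    dst[0] = t0;
                    dst[1] = t1;
                    dst[2] = t2;
                }
}
```

The win comes from the fact that a 16x16 tile of 3-byte pixels touches only a handful of cache lines in both source and destination, instead of one destination line per pixel.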

Programming hotshots have done so much damage.

And they brag about it.

I watched some doing one-upsmanship while the earth was still being
created, and I decided I wanted nothing to do with it. I think I
showed good judgment (rare for me).

From: Noob on
Hello Terje :-)

Terje Mathisen wrote:

> Noob wrote:
>> My target is a 266-MHz SH-4.
>> Assuming the vendor's implementation of memcpy is properly optimized,
>> I think it is safe to assume that it provides a lower bound for the
>> run-time of the rotation operation. Correct?

There seems to be something fishy with the vendor's memcpy: it handles (only)
~18 MB/s, while the straightforward foo() below handles ~21 MB/s.

void foo(unsigned int *dest, unsigned int *src, int wordcount)
{
    int i;
    for (i = 0; i < wordcount; ++i)
        *dest++ = *src++;
}

The loop kernel translates to

mov.l @r4+,r1
dt r2
mov.l r1,@r5
bf/s .L137
add #4,r5

Unrolling the loop twice makes the code 50% (huh?!?) slower

mov.l @r4,r1
dt r3
mov.l @(4,r4),r2
add #8,r4
mov.l r1,@r5
mov.l r2,@(4,r5)
bf/s .L137
add #8,r5

I'm starting to think that the array is allocated in a non-cached region.

>> As an example, copying a 5.7-MB picture (1644 x 1164) takes 320 ms
>> (i.e. 18 MB/s)
> Both dimensions are divisible by 4, but not by 8, so 4x4 basic blocks is
> a natural starting point.

The dimensions are, indeed, divisible by 4, because I do the following ;-)

cinfo.output_width &= ~3;
cinfo.output_height &= ~3;

>> Right now, my naive (strength-reduced, no tiling) C implementation
>> takes 1100 ms. I think there is still room for improvement.
>> One thing you may have overlooked in the advice you've given me is
>> that the pixel values are packed in 24-bits (RGB, no alpha).
> Aha!
> This is crucial, it probably means that you should work with 4x4 pixel
> blocks, each using 3x4x4 = 48 bytes.

Someone showed me how to get libjpeg to output 4-byte pixels, and
this alone gave me a 3x speedup.

> 48 bytes would fit in 12 32-bit registers, I don't know the details of
> the SH-4, but I believe it has at least 16 integer registers, right?


General register file:
o Sixteen 32-bit general registers (and eight 32-bit shadow registers)
o Seven 32-bit control registers
o Four 32-bit system registers

> Starting from this point I would first try a simple setup where I read
> in 4x4 pixels in 3x4 registers, then shuffle/shift/combine these into a
> rotated/transposed version, which I then store to 4 small temporary
> buffers.
> Each time these 4 buffers reach 64 bytes each (or 128, whatever the
> cache line size is)

The cache line size is 32 bytes.

I can play every cache trick in the book, but it won't help if I'm
working with non-cached memory.
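For reference, the register-resident 4x4 block rotation Terje describes might look something like this, assuming the 4-byte (32-bit) pixels mentioned above. The function name, signature, and pixel-stride convention are my assumptions, not code from the thread:

```c
#include <assert.h>
#include <stdint.h>

/* Rotate one 4x4 block of 32-bit pixels 90 degrees clockwise,
 * keeping the whole block in registers between the loads and the
 * stores. Strides are in pixels. Hypothetical sketch: a full-image
 * routine would call this once per 4x4 tile, and could accumulate
 * output columns in cache-line-sized buffers before storing. */
static void rotate90_block4x4(const uint32_t *src, int sstride,
                              uint32_t *dst, int dstride)
{
    for (int i = 0; i < 4; ++i) {
        /* Load one source row into temporaries (registers). */
        uint32_t p0 = src[i * sstride + 0];
        uint32_t p1 = src[i * sstride + 1];
        uint32_t p2 = src[i * sstride + 2];
        uint32_t p3 = src[i * sstride + 3];
        /* Row i of the source becomes column (3 - i) of the
         * destination under a clockwise rotation. */
        dst[0 * dstride + (3 - i)] = p0;
        dst[1 * dstride + (3 - i)] = p1;
        dst[2 * dstride + (3 - i)] = p2;
        dst[3 * dstride + (3 - i)] = p3;
    }
}
```

A 4x4 block of 32-bit pixels is 64 bytes, i.e. two 32-byte SH-4 cache lines in each direction, which is why the 4x4 granularity fits this machine well.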

> Back in the 486-33MHz days I used to say that 10 MB/s was my initial
> speed target, your SH-4 is probably quite a bit faster than this?

AFAICT, my 266-MHz SH-4 should be 5-10x faster than a 33-MHz 486 in
pure integer code.

I found a better description of the CPU:

o 32-bit internal data bus
o General register file:
- sixteen 32-bit general registers (and eight 32-bit shadow registers)
- seven 32-bit control registers
- four 32-bit system registers
o RISC-type instruction set
- fixed 16-bit instruction length for improved code efficiency
- load-store architecture
- delayed branch instructions
- conditional execution
o Superscalar architecture: parallel execution of two instructions
o Instruction execution time: maximum 2 instructions/cycle
o On-chip multiplier
o Five-stage pipeline (100, 200, 400 and 500 series)
o Seven-stage pipeline (300 series)
o Branch target cache for bubble-free branching along predicted path (300 series)

and the cache sub-system:

o Instruction cache (IC) features:
- 16 Kbyte, 2-way set associative
- 512 entries, 32-byte block length
- compatibility mode (8 Kbyte direct mapped)
- index mode
o Operand cache (OC) features:
- 32 Kbyte, 2-way set associative
- 1024 entries, 32-byte block length
- compatibility mode (16 Kbyte direct mapped)
- index mode
- RAM mode (16 Kbyte cache + 16 Kbyte RAM)
o Single-stage copy-back buffer, single-stage write-through buffer
o Cache memory contents can be accessed directly by address mapping (usable as on-chip memory)
o Store queue (32 bytes x 2 entries).

cf. also

From: Terje Mathisen "terje.mathisen at on
Noob wrote:
> I'm starting to think that the array is allocated in a non-cached region.


That sounds like a frame buffer...

Can you direct these temporary image buffers to be in cacheable ram instead?


- <Terje.Mathisen at>
"almost all programming can be viewed as an exercise in caching"
From: "Andy "Krazy" Glew" on
On 4/10/2010 5:13 PM, MitchAlsup wrote:
> On Apr 9, 4:42 pm, "Andy \"Krazy\" Glew"<ag-n...(a)>
> wrote:
>> The standard thing here, which usually turns out to be a bad idea, is to interleave cache lines across CPUs.
>> Not quite so bad, but still often surprisingly painful, is to interleave 4K pages across CPUs/MCs.
>> The best is usually to have several large contiguous chunks per CPU.
> It turns out that interleaving should be performed at the level of the
> DRAM page for maximal effect with memory controllers that are smart
> enough. This is far larger than the CPU page, but if the DRAMs do not
> all have the same page size, then the largest page that fits in all
> DRAMs is generally an excellent granule.
> A DRAM controller is sufficiently smart when it can hold dozens to
> hundreds of cache lines waiting to be written while still servicing
> demand reads with some protocol as to when writes actually happen and
> some means to access the write buffer on behalf of demand reads.
> This strategy commits the fewest number of DIMMs to the typical large
> scale strip mine active access patterns. {Leaving the rest of the DRAM
> banks to do more random activities.}
> Mitch

On many memory systems the DRAM page is approximately the same size as the virtual memory page - 4KB or thereabouts. So
in this situation our advice coincides.

When the DRAM page is smaller than the virtual memory page, it is probably a bad idea to interleave between CPUs at the
DRAM page, since system software cannot control placement below the virtual page granularity.

By the way, interleaving across channels between DRAM pages is often a win - if you are limited by channel bandwidth.
But this, unfortunately, makes the effective DRAM page even larger, exacerbating the following effect. It can be an
effective way of increasing a 2KB DRAM page to a 4KB page size that the OS can deal with.
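A toy model of the address-bit split being discussed: with page-granularity channel interleaving, the channel is selected by the address bits just above the DRAM-page offset. All the sizes here (2 channels, 4 KB DRAM page) are illustrative assumptions, not a description of any particular controller:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of channel interleaving at DRAM-page granularity.
 * Assumed parameters for illustration only: 2 channels, 4 KB DRAM
 * page. Interleaving at the page boundary (bit 12) rather than at
 * cache-line granularity (bit 6) means a streaming access keeps one
 * whole DRAM page open per channel instead of ping-ponging. */
#define DRAM_PAGE_BITS 12   /* 4 KB DRAM page (assumed) */
#define CHANNEL_BITS   1    /* 2 channels (assumed)     */

static unsigned channel_of(uint64_t paddr)
{
    return (unsigned)((paddr >> DRAM_PAGE_BITS)
                      & ((1u << CHANNEL_BITS) - 1));
}

static uint64_t dram_page_of(uint64_t paddr)
{
    return paddr >> (DRAM_PAGE_BITS + CHANNEL_BITS);
}
```

Note how consuming an address bit for the channel makes the contiguous-per-channel region twice the DRAM page: that is the "effective DRAM page even larger" effect mentioned above.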

When the DRAM page is larger than the e.g. 4KB virtual memory page, if you are only using that page size, on a long
running system all the address bits beyond bit 12 are effectively randomized. You might hope for page clustering or
large pages in the OS to help. Which they do. However, in my experience, large pages have usually only been available
for benchmark specials, and not for truly general purpose computing. Still, if you have an HPC system, use them if you
got them.