From: Terje Mathisen "terje.mathisen at tmsw.no" on
Noob wrote:
> My target is a 266-MHz SH-4.
>
> Assuming the vendor's implementation of memcpy is properly optimized,
> I think it is safe to assume that it provides a lower bound for the
> run-time of the rotation operation. Correct?
>
> As an example, copying a 5.7-MB picture (1644 x 1164) takes 320 ms
> (i.e. 18 MB/s)

Both dimensions are divisible by 4, but not by 8, so 4x4 basic blocks are
a natural starting point.
>
> Right now, my naive (strength-reduced, no tiling) C implementation
> takes 1100 ms. I think there is still room for improvement.
>
> One thing you may have overlooked in the advice you've given me is
> that the pixel values are packed in 24-bits (RGB, no alpha).

Aha!

This is crucial: it probably means that you should work with 4x4 pixel
blocks, each using 3x4x4 = 48 bytes.

48 bytes would fit in 12 32-bit registers. I don't know the details of
the SH-4, but I believe it has at least 16 integer registers, right?

Starting from this point I would first try a simple setup where I read
in 4x4 pixels in 3x4 registers, then shuffle/shift/combine these into a
rotated/transposed version, which I then store to 4 small temporary buffers.

Each time these 4 buffers reach 64 bytes (or 128, whatever the cache
line size is), I'd copy them to the target array: this has the big
benefit of always writing complete cache lines, while keeping the
working buffers small enough that they will stay put in the L1 cache.
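
Something like the following untested sketch is what I have in mind. The
names src/dst, W and H (source width and height in pixels, both assumed
to be multiples of 4) are my own, and the 96-byte staging rows are a
guess: 96 is a multiple of both one 4x4 step (12 bytes) and of a 32-byte
line, if I have the SH-4's line size right. Pixels are moved as plain
3-byte groups here; packing them into 32-bit registers and shuffling
would be the next refinement.

#include <string.h>

#define BPP   3      /* 24-bit RGB                                      */
#define STAGE 96     /* staging bytes per dst row: a multiple of one    */
                     /* 4x4 step (12 bytes) and of a 32-byte cache line */

void rotate_cw(const unsigned char *src, unsigned char *dst, int W, int H)
{
    /* dst is H x W pixels: dst row = src column, dst col = H-1 - src row */
    for (int j = 0; j < W; j += 4) {              /* 4 dst rows at a time */
        unsigned char stage[4][STAGE];
        int filled = 0;                           /* bytes in each stage row */
        int out_col = 0;                          /* dst column of stage[r][0] */

        for (int i = H - 4; i >= 0; i -= 4) {     /* dst columns, left to right */
            unsigned char blk[4][4 * BPP];

            /* load: 12 contiguous bytes from each of 4 source rows */
            for (int dr = 0; dr < 4; ++dr)
                memcpy(blk[dr], src + ((i + dr) * W + j) * BPP, 4 * BPP);

            /* rotate: src row i+dr -> block column 3-dr, src col j+r -> stage row r */
            for (int r = 0; r < 4; ++r)
                for (int dr = 0; dr < 4; ++dr)
                    memcpy(&stage[r][filled + (3 - dr) * BPP],
                           &blk[dr][r * BPP], BPP);
            filled += 4 * BPP;

            if (filled == STAGE) {                /* full: write out whole lines */
                for (int r = 0; r < 4; ++r)
                    memcpy(dst + ((j + r) * H + out_col) * BPP, stage[r], STAGE);
                out_col += STAGE / BPP;
                filled = 0;
            }
        }
        if (filled)                               /* leftover partial rows */
            for (int r = 0; r < 4; ++r)
                memcpy(dst + ((j + r) * H + out_col) * BPP, stage[r], filled);
    }
}

If I remember the SH-4 correctly, it also has MOVCA.L/OCBWB cache-block
operations; the flush step above is where those would pay off, since
every store there covers complete lines.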

During the load part you'll read 12 bytes from each of 4 different
addresses; this probably isn't quite optimal, but you'll at least get
some reasonable benefit from burst DRAM transfers.
>
> Thus I can't (blindly) deal with 32-bit words. There's some
> bit-twiddling to do to merge different parts of pixels, and
> then I have to worry about endian-ness. Good times.
>
>> If you actually work with very small blocks, like your 5x3 example, then
>> it really doesn't matter, you can just generate unrolled code to do it
>> directly while copying, in which case the fastest code will be like
>> Andy's suggestion where you gather in data to be able to write full
>> cache lines to the target buffer.
>
> I was surprised that I got slightly better performance when I iterated
> over consecutive bytes of the SOURCE buffer, than when I iterated over
> consecutive bytes of the DEST buffer.
>
> My current code handles slightly over 5 MB/s. It's fast enough for pictures
> smaller than, say, 2 MB. But some pictures weigh 6-7 MB. Having the user
> wait 1.5 seconds might be acceptable, but barely.

Back in the 486-33 MHz days I used to say that 10 MB/s was my initial
speed target; your SH-4 is probably quite a bit faster than that, right?

OTOH your library block copy speed of 18 MB/s seems quite low. :-(

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: "Andy "Krazy" Glew" on
On 4/9/2010 8:43 AM, nmm1(a)cam.ac.uk wrote:
> IBM did memory mapping across CPUS between the POWER3 and POWER4,
> and got it BADLY wrong. They backed off for the POWER5. And IBM
> aren't exactly tyros at that game. It's very easy to get wrong.
> The history of large computers is littered with worthy attempts at
> using memory banking in clever ways, most of which have failed to
> a greater or lesser degree.

So, tell us what they did?

The standard thing here, which usually turns out to be a bad idea, is to
interleave cache lines across CPUs. (Did the Power3 have a CPU-attached
memory controller?) Not usually a good idea, unless local and remote
memory are very close in latency and bandwidth.

Not quite so bad, but still often surprisingly painful, is to interleave
4K pages across CPUs/MCs. The OS can work around this by playing with
virtual address mappings, but it catches them by surprise.

The best is usually to have several large contiguous chunks per CPU,
where you can specify what physical address they appear at, and let the
OS do NUMA placement. Several, because there are occasionally
assumptions about physical address ranges (e.g. the OS assumes CPU0 has
exclusive access to physical addresses [64K,128K), CPU1 to [128K,+64K),
etc.). The remappability is just to cope with such assumptions, which
sometimes crop up wrt I/O, especially when you transition across a
boundary like 32-bit to 64-bit.

But I still like having the option of being able to interleave at finer
granularity - cache line, or page - across a subset of memory
controllers, for a subset of memory, because for certain apps it's a
win, e.g. if the bandwidth of a single channel cannot saturate a CPU.
From: MitchAlsup on
On Apr 9, 4:42 pm, "Andy \"Krazy\" Glew" <ag-n...(a)patten-glew.net>
wrote:
> The standard thing here, which usually turns out to be a bad idea, is to interleave cache lines across CPUs.
> Not quite so bad, but still often surprisingly painful, is to interleave 4K pages across CPUs/MCs.
> The best is usually to have several large contiguous chunks per CPU.

It turns out that, with memory controllers that are smart enough,
interleaving should be performed at the level of the DRAM page for
maximal effect. This is far larger than the CPU page, but if the DRAMs
do not all have the same page size, then the largest page that fits in
all DRAMs is generally an excellent granule.
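
In address-decode terms that amounts to something like this toy sketch
(the structure, field names and sizes are made up for illustration; the
granule is the smallest per-channel DRAM page, i.e. the largest page
that fits in every DRAM, and consecutive granule-sized chunks rotate
across the channels):

#include <stdint.h>

struct mem_config {
    unsigned nchannels;              /* populated channels (at most 8 here) */
    unsigned dram_page_bytes[8];     /* DRAM page (row) size per channel    */
};

static unsigned interleave_granule(const struct mem_config *m)
{
    unsigned g = m->dram_page_bytes[0];
    for (unsigned c = 1; c < m->nchannels; ++c)
        if (m->dram_page_bytes[c] < g)
            g = m->dram_page_bytes[c];
    return g;                        /* e.g. 2 KB rows -> 2 KB granule      */
}

static unsigned channel_of(uint64_t paddr, const struct mem_config *m)
{
    return (unsigned)((paddr / interleave_granule(m)) % m->nchannels);
}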

A DRAM controller is sufficiently smart when it can hold dozens to
hundreds of cache lines waiting to be written while still servicing
demand reads, with some protocol as to when writes actually happen and
some means to access the write buffer on behalf of demand reads.
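
A toy model of that last part, the write buffer being checked on behalf
of demand reads (everything here is made up for illustration: dram_read
stands in for whatever actually talks to the DIMMs, and the enqueue and
drain policy, the "protocol as to when writes actually happen", is
omitted; real controllers do this lookup in hardware):

#include <stdint.h>
#include <string.h>

#define LINE   32                     /* cache line bytes               */
#define WQ_MAX 256                    /* "dozens to hundreds" of lines  */

struct wq_entry { uint64_t addr; unsigned char data[LINE]; };

static struct wq_entry wq[WQ_MAX];    /* lines waiting to be written    */
static unsigned wq_len;

void dram_read(uint64_t line_addr, unsigned char *out);  /* hypothetical backend */

void demand_read(uint64_t line_addr, unsigned char *out)
{
    for (unsigned i = wq_len; i-- > 0; )      /* newest matching entry wins  */
        if (wq[i].addr == line_addr) {
            memcpy(out, wq[i].data, LINE);    /* forward from the write queue */
            return;
        }
    dram_read(line_addr, out);                /* otherwise go to the DIMMs    */
}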

This strategy commits the fewest DIMMs to the typical large-scale
strip-mine access patterns. {Leaving the rest of the DRAM banks to do
more random activities.}

Mitch
From: nmm1 on
In article <4BBF9F62.6020409(a)patten-glew.net>,
Andy \"Krazy\" Glew <ag-news(a)patten-glew.net> wrote:
>
>> IBM did memory mapping across CPUS between the POWER3 and POWER4,
>> and got it BADLY wrong. They backed off for the POWER5. And IBM
>> aren't exactly tyros at that game. It's very easy to get wrong.
>> The history of large computers is littered with worthy attempts at
>> using memory banking in clever ways, most of which have failed to
>> a greater or lesser degree.
>
>So, tell us what they did?

How? Until and unless IBM tell all, we peasants will merely have to
guess. All we know is that they did what you describe below, but
that wasn't the whole story and wasn't the reason for the major
problems.

>The standard thing here, which usually turns out to be a bad idea, is
>to interleave cache lines across CPUs. (Did the Power3 have a CPU-
>attached memory controller?) Not usually a good idea, unless local
>and remote memory are very close in latency and bandwidth.

That is, I believe, what the POWER4 did. And the details of the
memory controller have never been published, nor has the reason why
the change was so catastrophic.

Incidentally, it worked very well on the vector systems, and was very
common on them. My guess is that it would work very well with modern
Fortran codes (i.e. ones using arrays as first-class objects). Heaven
help C/C++, but what else is new?

>Not quite so bad, but still often surprisingly painful, is to interleave
>4K pages across CPUs/MCs. The OS can work around this by playing with
>virtual address mappings, but it catches them by surprise.

I would regard that as a bug in the OS design. The contiguous approach
can be almost as painful for many shared-memory languages/programs, as
almost everything gets allocated into a single bank. And I regard
that as a bug in the OS design, too ....


Regards,
Nick Maclaren.
From: Brett Davis on
Reposted from comp.arch.embedded

>> I need to rotate a picture clockwise 90 degrees.
>
> The data sheet states
>
> SH-4 32-bit super-scalar RISC CPU
> o 266 MHz, 2-way set associative 16-Kbyte ICache, 32-Kbyte DCache, MMU
> o 5-stage pipeline, delayed branch support
> o floating point unit, matrix operation support
> o debug port, interrupt controller

The most important number is cache line size, which you missed.
If your image is 1,024 lines tall, that will completely thrash
the cache, resulting in 3 bytes copied per cache line load/spill.

If you copy 16x16 tiles you can get a 10x speedup.

CopyTile16(source, dest, x, y, width, height)...
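
For example (untested sketch; this is one reading of the parameters
above, using the same index math as your code quoted below: A and B are
the source and destination base pointers, W and H the source width and
height in pixels, (x, y) the top-left corner of the tile in the source,
3 bytes per pixel, and W and H are assumed to be multiples of 16 here,
otherwise the ragged edges need a cleanup pass or smaller tiles):

static void CopyTile16(const unsigned char *A, unsigned char *B,
                       int x, int y, int W, int H)
{
    for (int i = y; i < y + 16; ++i) {              /* source rows     */
        const unsigned char *a = A + (i * W + x) * 3;
        for (int j = x; j < x + 16; ++j) {          /* source columns  */
            unsigned char *c = B + (H * j + H - 1 - i) * 3;
            unsigned char t0 = a[0], t1 = a[1], t2 = a[2];
            c[0] = t0;  c[1] = t1;  c[2] = t2;
            a += 3;
        }
    }
}

void RotateCW(const unsigned char *A, unsigned char *B, int W, int H)
{
    for (int y = 0; y < H; y += 16)                 /* walk the image  */
        for (int x = 0; x < W; x += 16)             /* tile by tile    */
            CopyTile16(A, B, x, y, W, H);
}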

You can also try 8x8 and 4x4; smaller loops can be faster due to all
the args fitting in memory and the loop getting unrolled, but the
difference will be ~25%, which is hardly worth the time to code.

> for (i = 0; i < H; ++i)
>     for (j = 0; j < W; ++j)
>     {
>         unsigned char *C = B+(H*j+H-i-1)*3;
>         C[0] = A[0];
>         C[1] = A[1];
>         C[2] = A[2];
>         A += 3;
>     }

If you do the following you can get a 2x speedup; it looks like more
code, but it will generate less, and the results will be pipelined
correctly.
Extra bonus points to those who understand why. Half the posters here?

{
    unsigned char *C = B+(H*j+H-i-1)*3;
    unsigned char temp0, temp1, temp2;  /* all loads before any stores */
    temp0 = A[0];
    temp1 = A[1];
    temp2 = A[2];
    C[0] = temp0;
    C[1] = temp1;
    C[2] = temp2;
    A += 3;
}

Do not use *C++ = *A++;

Brett