From: Noob on
Terje Mathisen wrote:

> Noob wrote:
>
>> I don't see why the transpose operation would be more
>> cache-friendly than directly doing the rotation.
>>
>> TRANSPOSE = copy ROW(i) to COL(i)
>> 90° CW ROT = copy ROW(i) to COL(H-1-i)
>>
>> Is it because the column access is incremented rather than
>> decremented? I must be missing something...
>>
>> Could you provide some insight, or some links?
>
> The basic (fast) transpose operation works on squares that have power of
> two sides; this is what allows it to work recursively, and thereby to
> fit into lower cache levels as soon as possible.
>
> The innermost loop of the transpose works with N registers each capable
> of holding N elements, so you can transpose a 4x4 block of 32-bit values
> using 4 SSE registers, or an 8x8 block of 16-bit values (as long as
> you're in 64-bit mode so you have at least 16 regs available).

Hold on. SSE? 64-bit mode? :-)

My target is a 266-MHz SH-4.
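
For reference, here is roughly what that blocked transpose looks like in
plain scalar C (no SSE, no 64-bit mode); the tile size and the names are
illustrative, not Terje's code:

#include <stddef.h>
#include <stdint.h>

/* Transpose in B x B tiles so that both the reads and the writes of
 * each tile stay cache-resident.  src is rows x cols, dst is cols x rows.
 */
void transpose_blocked(const uint32_t *src, uint32_t *dst,
                       int rows, int cols)
{
    enum { B = 8 };   /* tile side; tune against the cache line size */

    for (int i = 0; i < rows; i += B)
        for (int j = 0; j < cols; j += B)
            for (int ii = i; ii < i + B && ii < rows; ii++)
                for (int jj = j; jj < j + B && jj < cols; jj++)
                    dst[(size_t)jj * rows + ii] = src[(size_t)ii * cols + jj];
}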

Assuming the vendor's implementation of memcpy is properly optimized,
its run-time should provide a lower bound for the run-time of the
rotation operation. Correct?

As an example, copying a 5.7-MB picture (1644 x 1164) takes 320 ms
(i.e. 18 MB/s).

Right now, my naive (strength-reduced, no tiling) C implementation
takes 1100 ms. I think there is still room for improvement.

One thing you may have overlooked in the advice you've given me is
that the pixel values are packed into 24 bits (RGB, no alpha).

Thus I can't (blindly) deal with 32-bit words. There's some
bit-twiddling to do to merge different parts of pixels, and
then I have to worry about endian-ness. Good times.
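
The simplest way around that is to move three bytes per pixel, which
avoids the word merging and the endianness question entirely, at the
cost of byte-sized accesses. A minimal sketch of that approach (names
and layout assumptions are illustrative):

#include <stddef.h>

/* 90-degree clockwise rotation of packed 24-bit RGB pixels.
 * src holds height rows of width pixels; dst holds width rows of
 * height pixels.  Source row r becomes destination column height-1-r,
 * so source pixel (r, c) lands at destination (c, height-1-r).
 */
void rotate90cw_rgb24(const unsigned char *src, unsigned char *dst,
                      int width, int height)
{
    for (int r = 0; r < height; r++) {
        const unsigned char *s = src + (size_t)r * width * 3;
        for (int c = 0; c < width; c++) {
            unsigned char *d =
                dst + ((size_t)c * height + (height - 1 - r)) * 3;
            d[0] = s[0];
            d[1] = s[1];
            d[2] = s[2];
            s += 3;
        }
    }
}

This variant reads the source sequentially and scatters the writes.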

> If you actually work with very small blocks, like your 5x3 example, then
> it really doesn't matter; you can just generate unrolled code to do it
> directly while copying, in which case the fastest code will be like
> Andy's suggestion where you gather in data to be able to write full
> cache lines to the target buffer.

I was surprised that I got slightly better performance when I iterated
over consecutive bytes of the SOURCE buffer than when I iterated over
consecutive bytes of the DEST buffer.

My current code handles slightly over 5 MB/s. It's fast enough for pictures
smaller than, say, 2 MB. But some pictures weigh 6-7 MB. Having the user
wait 1.5 seconds might be acceptable, but barely.
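
Tiling should help with exactly that trade-off: rotating in small
square blocks keeps the scattered destination writes of each block
within a handful of cache lines before moving on. A hedged sketch (the
tile size is a guess, to be tuned against the SH-4's cache geometry):

#include <stddef.h>

#define T 32   /* tile side in pixels; tune experimentally */

/* Tiled variant of the 24-bit rotation: same pixel mapping as before,
 * but processed in T x T blocks for better write locality.
 */
void rotate90cw_rgb24_tiled(const unsigned char *src, unsigned char *dst,
                            int width, int height)
{
    for (int r0 = 0; r0 < height; r0 += T)
        for (int c0 = 0; c0 < width; c0 += T)
            for (int r = r0; r < r0 + T && r < height; r++) {
                const unsigned char *s = src + ((size_t)r * width + c0) * 3;
                for (int c = c0; c < c0 + T && c < width; c++) {
                    unsigned char *d =
                        dst + ((size_t)c * height + (height - 1 - r)) * 3;
                    d[0] = s[0];
                    d[1] = s[1];
                    d[2] = s[2];
                    s += 3;
                }
            }
}

Whether it pays off depends on how long each block's destination lines
stay resident, so the tile size is worth sweeping experimentally.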

Regards.
From: Robert Myers on
On Apr 8, 2:57 am, n...(a)cam.ac.uk wrote:
> In article <4ec6f608-1d16-4c97-b47a-7c1d71879...(a)o30g2000yqb.googlegroups.com>,
> Robert Myers  <rbmyers...(a)gmail.com> wrote:
>

> >What would happen if you changed the default storage order for arrays
> >of dimension two and greater, so that they were always stored in
> >blocks with some cache or node-friendly order?  You could make it
> >nearly transparent in a language like c++ by defining appropriate
> >objects.
>
> The access time would go up, and some other operations would run a
> lot more slowly.  That technique works, but isn't any use as a
> general solution.  It can help for some applications.
>
> TANSTAAFL.

Someone I worked with referred to this as conservation of difficulty.

I didn't actually mean to imply that c++ would be the best way to do
it, only that it was a way of doing it that would make the question
concrete.
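
For concreteness, such a block-major storage order boils down to an
index mapping like the following, sketched in plain C (the names, tile
size, and padded allocation are illustrative):

#include <stddef.h>

#define BLK 32   /* tile side; a cache- or node-friendly choice */

/* Block-major ("tiled") layout: the array is stored as BLK x BLK tiles,
 * tiles laid out row-major, each tile itself row-major.  The backing
 * buffer is assumed to be padded up to whole tiles, i.e.
 * ceil(nrows/BLK) * ceil(ncols/BLK) * BLK * BLK elements.
 */
static size_t tiled_index(size_t i, size_t j, size_t ncols)
{
    size_t tiles_per_row = (ncols + BLK - 1) / BLK;
    size_t tile = (i / BLK) * tiles_per_row + (j / BLK);

    return tile * (size_t)BLK * BLK + (i % BLK) * BLK + (j % BLK);
}

/* a[tiled_index(i, j, ncols)] then plays the role of a[i * ncols + j]. */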

Let me ask a more vague question.

Suppose that there were a programmable hardware memory remapper that
worked either at the node or cluster level.

Robert.

From: nmm1 on
In article <cd219a29-b16e-4c3e-9598-ab16f1144b45(a)x20g2000yqb.googlegroups.com>,
Robert Myers <rbmyersusa(a)gmail.com> wrote:
>
>> >What would happen if you changed the default storage order for arrays
>> >of dimension two and greater, so that they were always stored in
>> >blocks with some cache or node-friendly order?  You could make it
>> >nearly transparent in a language like c++ by defining appropriate
>> >objects.
>>
>> The access time would go up, and some other operations would run a
>> lot more slowly.  That technique works, but isn't any use as a
>> general solution.  It can help for some applications.
>>
>> TANSTAAFL.
>
>Someone I worked with referred to this as conservation of difficulty.
>
>I didn't actually mean to imply that c++ would be the best way to do
>it, only that it was a way of doing it that would make the question
>concrete.

I read it the way you meant.

>Let me ask a more vague question.
>
>Suppose that there were a programmable hardware memory remapper that
>worked either at the node or cluster level.

Same answer, with variations. Sorry.

While there are some applications where an 'odd' memory layout is
much better than the obvious one, most of those can be rewritten
to use a standard layout efficiently. What is far more common is
that one layout is best in one part of a program, and another in
another. Remapping CAN be worthwhile, but it isn't a cheap
operation.

I once got a factor of 4 improvement on a linear solver by turning
the arrays the other way round and keeping both the ordinary and
transposed code of one matrix - using whichever was best. And the
dreaded 3-D FFT algorithms are remapping ones. But it's rarely as
simple.

IBM did memory mapping across CPUs between the POWER3 and POWER4,
and got it BADLY wrong. They backed off for the POWER5. And IBM
aren't exactly tyros at that game. It's very easy to get wrong.
The history of large computers is littered with worthy attempts at
using memory banking in clever ways, most of which have failed to
a greater or lesser degree.

Randomising memory layout would have the advantage that it would
reduce the chances of an application suffering catastrophic loss
of efficiency, at the expense of making 'perfect' tuning infeasible.
Apparently, it's been done and it worked. But it didn't have the
expense of remapping.


Regards,
Nick Maclaren.
From: Robert Myers on
On Apr 9, 11:43 am, n...(a)cam.ac.uk wrote:
> In article <cd219a29-b16e-4c3e-9598-ab16f1144...(a)x20g2000yqb.googlegroups.com>,
> Robert Myers  <rbmyers...(a)gmail.com> wrote:
>
> >> >What would happen if you changed the default storage order for arrays
> >> >of dimension two and greater, so that they were always stored in
> >> >blocks with some cache or node-friendly order?  You could make it
> >> >nearly transparent in a language like c++ by defining appropriate
> >> >objects.
>
> >> The access time would go up, and some other operations would run a
> >> lot more slowly.  That technique works, but isn't any use as a
> >> general solution.  It can help for some applications.
>
> >> TANSTAAFL.
>
> >Someone I worked with referred to this as conservation of difficulty.
>
> >I didn't actually mean to imply that c++ would be the best way to do
> >it, only that it was a way of doing it that would make the question
> >concrete.
>
> I read it the way you meant.
>
> >Let me ask a more vague question.
>
> >Suppose that there were a programmable hardware memory remapper that
> >worked either at the node or cluster level.
>
> Same answer, with variations.  Sorry.
>
> While there are some applications where an 'odd' memory layout is
> much better than the obvious one, most of those can be rewritten
> to use a standard layout efficiently.  What is far more common is
> that one layout is best in one part of a program, and another in
> another.  Remapping CAN be worthwhile, but it isn't a cheap
> operation.
>
> I once got a factor of 4 improvement on a linear solver by turning
> the arrays the other way round and keeping both the ordinary and
> transposed code of one matrix - using whichever was best.  And the
> dreaded 3-D FFT algorithms are remapping ones.  But it's rarely as
> simple.
>
> IBM did memory mapping across CPUs between the POWER3 and POWER4,
> and got it BADLY wrong.  They backed off for the POWER5.  And IBM
> aren't exactly tyros at that game.  It's very easy to get wrong.
> The history of large computers is littered with worthy attempts at
> using memory banking in clever ways, most of which have failed to
> a greater or lesser degree.
>
Banking worked pretty well on the Cray-1.

I was actually thinking more in terms of transparency and portability
than I was in terms of efficiency. Take the burden off the programmer
and the cleverness out of the source code. Or, if there is burden on
a programmer, let it be a systems programmer who doesn't have to mess
with the source code to make it even less readable.

Robert.
From: nmm1 on
In article <e5285b8b-53b7-4ae2-b53a-21c6f283fc8a(a)z6g2000yqz.googlegroups.com>,
Robert Myers <rbmyersusa(a)gmail.com> wrote:
>>
>> IBM did memory mapping across CPUs between the POWER3 and POWER4,
>> and got it BADLY wrong.  They backed off for the POWER5.  And IBM
>> aren't exactly tyros at that game.  It's very easy to get wrong.
>> The history of large computers is littered with worthy attempts at
>> using memory banking in clever ways, most of which have failed to
>> a greater or lesser degree.
>>
>Banking worked pretty well on the Cray-1.

And on many other systems. The experience is that simple banking
is easy to get right, and works fairly well, though with some very
nasty gotchas. It's the attempts to be clever with it that I was
referring to.

>I was actually thinking more in terms of transparency and portability
>than I was in terms of efficiency. Take the burden off the programmer
>and the cleverness out of the source code. Or, if there is burden on
>a programmer, let it be a systems programmer who doesn't have to mess
>with the source code to make it even less readable.

That's where randomisation should score. It would be trivial to
transform every address, using a cheap and insecure 'cryptographic'
method - a fixed network of bit shuffles and XORs isn't exactly
hard to put into hardware, after all - it's in hardware 101 in
most courses :-)
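
As a toy software model of that sort of fixed shuffle-and-XOR network
(the constants and field widths are arbitrary illustrations, not any
real machine's hash):

#include <stdint.h>

/* Keep the byte offset within a 32-byte line intact and fold
 * higher-order address bits into the bank/set-selecting bits.
 * Each stage XORs a copy of higher bits into the low bits, which is
 * invertible, so the whole mapping is a bijection on addresses.
 */
static uint32_t scramble_addr(uint32_t a)
{
    uint32_t line_off = a & 0x1f;    /* byte within the cache line */
    uint32_t rest     = a >> 5;

    rest ^= (rest >> 7)  & 0xff;     /* fold tag bits into set/bank bits */
    rest ^= (rest >> 13) & 0xff;

    return (rest << 5) | line_off;
}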

That would eliminate 99% of the gotchas, at the expense of stopping
the extreme tuning of algorithm cores. The impact on benchmarketing
may well be why it isn't done.


Regards,
Nick Maclaren.