From: Thomas Womack on
In article <fb35005d-6ee0-42f4-818a-1b7120f6ca3e(a)11g2000yqr.googlegroups.com>,
MitchAlsup <MitchAlsup(a)aol.com> wrote:

>I remember a time back in 1979, I was logged into a CDC 7600 in
>California doing text editing. There were a dozen other members of the
>SW team doing similarly. There was a long silent pause where no
>character echoing was taking place. A few moments later (about 30
>seconds) the processing returned to normal. However, we found out that
>we were now logged into a CDC 7600 in Chicago. The CA machine had
>crashed, and the OS had picked up all the nonfaulting tasks, shipped
>them up to another machine halfway across the country and restarted
>the processes.
>
>Why can't we do this today? We could 30 years ago!

I think that, given the (nowadays clearly unaffordably high) level of
computing staff an organisation with two CDC 7600s would have had in
1979, this is entirely possible today: you log into a front-end
machine which connects to a back-end machine kept as a migratable VM.

Nobody bothers doing it for text editing because it's crazily
uneconomical.
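
The process-level analogue needs nothing more exotic than serialising
enough state that a fresh instance elsewhere can resume. A minimal C
sketch of the idea - the state layout and file handling are entirely
hypothetical:

    #include <stdio.h>

    struct editor_state {
        long cursor_pos;
        long dirty;
        char path[256];
    };

    /* Save enough state that a fresh instance, possibly on another
     * machine, can pick up where this one left off. */
    int checkpoint(const struct editor_state *s, const char *ckpt)
    {
        FILE *f = fopen(ckpt, "wb");
        if (!f)
            return -1;
        int ok = fwrite(s, sizeof *s, 1, f) == 1;
        fclose(f);
        return ok ? 0 : -1;
    }

    int restore(struct editor_state *s, const char *ckpt)
    {
        FILE *f = fopen(ckpt, "rb");
        if (!f)
            return -1;
        int ok = fread(s, sizeof *s, 1, f) == 1;
        fclose(f);
        return ok ? 0 : -1;
    }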

Tom
From: "Andy "Krazy" Glew" on
On 4/18/2010 2:29 PM, nmm1(a)cam.ac.uk wrote:
> In article<3782bf12-b3f5-4003-94a9-0299859358ed(a)y17g2000yqd.googlegroups.com>,
> MitchAlsup<MitchAlsup(a)aol.com> wrote:
>> On Apr 18, 1:15 pm, "Andy \"Krazy\" Glew"<ag-n...(a)patten-glew.net>
>> wrote:
>>
>>> System code tends to have unpredictable branches, which hurt many OOO machines.
>>
>> I think it is easier to think that system code has so many inherent
>> serializations that the efforts applied in doing OoO are "for naught"
>> and that these great big OoO machines degrade down to just about the
>> same performance as their absolutely in-order cousins.
>>
>> It's a far bigger issue than simple branch mispredictability. Pointer
>> chasing into poorly cached data structures is rampant; "dangerous"
>> instructions that are inherently serialized; and poor TLB translation
>> success rates. Overall, there just is not that much ILP left in many
>> of the paths through system codes.
>
> That was the experience in the days of the System/370. User code
> got a factor of two better ILP than system code.


I surprised a friend who is working on speculative multithreading when he asked what benchmark I used for my SpMT work.
I said "gcc". In my experience, gcc is the user mode benchmark tha is most challenging, and which most resembles system
code.

I reject "inherently serialized" instructions. Very little need be inherently serialized. Such serialiations tend to
happen because you have not wanted to rename or predict the result. Only true MSR/creg accesses need be inherently
serialized.
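
As a toy model of the alternative - rename the control register instead of serializing on it - consider this C sketch.
It models the idea only, not any real design, and every name in it is hypothetical:

    #include <stdint.h>

    #define NPHYS 8                 /* physical copies of the creg */

    static uint64_t phys[NPHYS];
    static int map;                 /* architectural -> physical mapping */
    static int next_free = 1;

    /* A write just allocates a fresh physical copy and updates the
     * map; nothing younger has to drain. */
    void creg_write(uint64_t v)
    {
        map = next_free;
        next_free = (next_free + 1) % NPHYS;
        phys[map] = v;
    }

    /* A younger read uses the current mapping - no serialization. */
    uint64_t creg_read(void)
    {
        return phys[map];
    }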

Pointer chasing: I'm the MLP guy. I can show you a dozen ways to make pointer chasing run faster. Mainly: very
seldom do you just access the pointer. Usually you access p = p->nxt or p = p->link, plus several fields p->field1,
p->field2. You always need to consider the ratio of non-pointer chases to pointer chases. Of late, the ratio has been
INCREASING, i.e. system code has been becoming more amenable.
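
In software terms the idiom looks something like this (a sketch assuming GCC/Clang's __builtin_prefetch; the struct and
field names are hypothetical): start the chase to the next node early, so its miss overlaps the non-pointer work on the
current one.

    #include <stddef.h>

    struct node {
        struct node *nxt;
        long field1;
        long field2;
    };

    long walk(struct node *p)
    {
        long sum = 0;
        while (p) {
            struct node *n = p->nxt;       /* start the next chase early */
            __builtin_prefetch(n);         /* hint only; harmless even if n is NULL */
            sum += p->field1 + p->field2;  /* non-pointer work overlaps the miss */
            p = n;
        }
        return sum;
    }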

TLB miss rates: again, I can show/have shown many ways to improve these. One of my favorites is to cache a predicted
TLB translation inside a data memory cache line, possibly using space freed up by compression.
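
A toy model of that favorite, purely illustrative - real hardware would do this in the cache and TLB pipeline, and
every name and width below is a placeholder:

    #include <stdint.h>

    struct cache_line {
        uint8_t  data[60];       /* compressed payload frees four bytes... */
        uint32_t predicted_ppn;  /* ...to hold a predicted page number */
    };

    /* The load pipeline uses the predicted translation at once
     * (4 KB pages assumed). */
    uint64_t speculate(const struct cache_line *cl, uint64_t vaddr)
    {
        return ((uint64_t)cl->predicted_ppn << 12) | (vaddr & 0xfff);
    }

    /* When the real TLB lookup completes, check the prediction and
     * train it; nonzero means the speculative load must be replayed. */
    int verify(struct cache_line *cl, uint32_t tlb_ppn)
    {
        int mispredict = (cl->predicted_ppn != tlb_ppn);
        cl->predicted_ppn = tlb_ppn;
        return mispredict;
    }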

Mitch: you're a brilliant guy, but you have only seen a small fraction of my ideas. Too bad we never got to work
together at AMD or Motorola.
From: "Andy "Krazy" Glew" on
On 4/19/2010 9:17 AM, MitchAlsup wrote:
> On Apr 18, 4:32 pm, n...(a)cam.ac.uk wrote:
>> Yup. In my view, interrupts are doubleplus ungood - message passing
>> is good.
>
> CDC was the only company to get this one right. The OS ran mostly* in
> the peripheral processors, leaving the great big number cruncher to
> (ahem) crunch numbers.
>
> (*) The interrupt processing and I/O was run in the PPs and most of the
> OS scheduling was run in the PPs.
>
> I remember a time back in 1979, I was logged into a CDC 7600 in
> California doing text editing. There were a dozen other members of the
> SW team doing similarly. There was a long silent pause where no
> character echoing was taking place. A few moments later (about 30
> seconds) the processing returned to normal. However, we found out that
> we were now logged into a CDC 7600 in Chicago. The CA machine had
> crashed, and the OS had picked up all the nonfaulting tasks, shipped
> them up to another machine halfway across the country and restarted
> the processes.
>
> Why can't we do this today? We could 30 years ago!
>
> Mitch


At Intel, on P6, we made some deliberate decisions that prevented this. We "deliberately" decided not to provide fault
containment within shared memory - most notably, we had incomplete cache tag snooping. When an error was detected, we
could not guarantee how far it had propagated - it might have propagated anywhere in cache coherent shared memory.

I quote "deliberately" because I was aware of this decision - I flagged it, and its consequences - I don't know how far
up the chain of command it propagated. Actually, I don't think it mattered - we would probably have made the smae
decision no matter what. The real problem was that, when demand arose to have better error containment, the knowledge
was lost, and had to be reconstructed. Usually without involving the original designer (me).

Nehalem has added error poison propagation, so this sort of thing can now be done.

When will you see OSes taking advantage? Don't hold your breath.

By the way, OSes have nearly always been able to do this using message passing. But apparently there was not enough demand.
From: nmm1 on
In article <4BCC9FCA.5010007(a)patten-glew.net>,
Andy \"Krazy\" Glew <ag-news(a)patten-glew.net> wrote:
>On 4/18/2010 2:29 PM, nmm1(a)cam.ac.uk wrote:
>> In article<3782bf12-b3f5-4003-94a9-0299859358ed(a)y17g2000yqd.googlegroups.com>,
>> MitchAlsup<MitchAlsup(a)aol.com> wrote:
>>> On Apr 18, 1:15 pm, "Andy \"Krazy\" Glew"<ag-n...(a)patten-glew.net>
>>> wrote:
>>>
>>>> System code tends to have unpredictable branches, which hurt many OOO machines.
>>>
>>> I think it is easier to think that system code has so many inherent
>>> serializations that the efforts applied in doing OoO are "for naught"
>>> and that these great big OoO machines degrade down to just about the
>>> same performance as their absolutely in-order cousins.
>>>
>>> It's a far bigger issue than simple branch mispredictability. Pointer
>>> chasing into poorly cached data structures is rampant; "dangerous"
>>> instructions that are inherently serialized; and poor TLB translation
>>> success rates. Overall, there just is not that much ILP left in many
>>> of the paths through system codes.
>>
>> That was the experience in the days of the System/370. User code
>> got a factor of two better ILP than system code.
>
>I surprised a friend who is working on speculative multithreading when
>he asked what benchmark I used for my SpMT work. I said "gcc". In my
>experience, gcc is the user mode benchmark that is most challenging, and
>which most resembles system code.

Isn't there a GUI benchmark? Most of that code is diabolical. But I
agree that gcc is an excellent bellwether for a lot of kernel, daemon
and utility code.

>I reject "inherently serialized" instructions. Very little need be
>inherently serialized. Such serializations tend to happen because you
>have not wanted to rename or predict the result. Only true MSR/creg
>accesses need be inherently serialized.

There are some, but they tend to be used sparsely in run-time systems
and language libraries, rather than open code. But I don't know
what you are counting as MSR/creg accesses.


Regards,
Nick Maclaren.
From: Chris Gray on
Noob <root(a)127.0.0.1> writes:

> The loop kernel is

> 24: 06 61 mov.l @r0+,r1
> 26: 10 42 dt r2
> 28: 12 23 mov.l r1,@r3
> 2a: fb 8f bf.s 24 <_rotate_right+0x24>
> 2c: 7c 33 add r7,r3

[Not referring to this specific code, but just following up.]

Why can't modern CPUs optimize the heck out of the relatively simple
code that a compiler might produce for a block copy? They have all of
the information they need - the addresses, the length, the alignments,
the position relative to page boundaries, cache lines, write buffers, etc.

Compilers often look at large chunks of code to figure out what they
are doing (e.g. Sun's "heroic optimizations" of a few years ago). CPUs
have transistors to burn now; why can't they look for patterns that
can be executed faster? Detect block copies, and turn them into
streaming fetches and stores, limited only by memory speeds. Don't
cache the data, don't purge any existing nonconflicting write buffers,
etc. Is the latency of detecting the situation too large?

Lots of code does a lot of copying - there could be a real benefit.
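
Software can already request part of this explicitly on x86. A sketch
using SSE2 non-temporal stores - assuming a 16-byte-aligned destination
and a length that is a multiple of 16 - which streams the data past the
cache:

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stddef.h>

    void stream_copy(char *dst, const char *src, size_t n)
    {
        for (size_t i = 0; i < n; i += 16) {
            __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
            _mm_stream_si128((__m128i *)(dst + i), v);  /* no cache fill */
        }
        _mm_sfence();  /* drain write-combining buffers; order the stores */
    }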

--
Experience should guide us, not rule us.

Chris Gray