From: "Andy "Krazy" Glew" on
On 4/18/2010 8:25 PM, Brett Davis wrote:
> In article<uf-dnbaq5J9o_lbWnZ2dnUVZ8sCdnZ2d(a)giganews.com>,
> jgd(a)cix.compulink.co.uk wrote:
>
>> In article<4BCB60FF.9030306(a)patten-glew.net>, ag-news(a)patten-glew.net (
>> Glew) wrote:
>>
>>> Itanium was designed by people who thought that P6-style out-of-order
>>> was going to fail.
>>
>> Ah, that makes sense. Thanks. In some ways the Itanium method of running
>> several instructions at once seems more "obvious". I was convinced by it
>> at first, and only gradually realised that in spite of its intuitive
>> appeal, it did not work well in this example.
>
> Intriguing, could you elaborate. (Bear in mind I would like to know the
> good points in Itanium, despite the mocking of Itanic.)

There are good points in Itanium.

Their system architecture, e.g. SMM, is very good.

Even in VLIW they are not so bad. I think they just pushed everything too far.
From: jgd on
In article <ggtgp-8BCC50.22254218042010(a)news.isp.giganews.com>,
ggtgp(a)yahoo.com (Brett Davis) wrote:
> In article <uf-dnbaq5J9o_lbWnZ2dnUVZ8sCdnZ2d(a)giganews.com>,
> jgd(a)cix.compulink.co.uk wrote:
> > Ah, that makes sense. Thanks. In some ways the Itanium method of
> > running several instructions at once seems more "obvious". I was
> > convinced by it at first, and only gradually realised that in spite
> > of its intuitive appeal, it did not work well in this example.
> Intriguing, could you elaborate. (Bear in mind I would like to know
> the good points in Itanium, despite the mocking of Itanic.)

All that I was trying to say was that I found it more obvious, twelve
years ago, that the Itanium explicit-parallelism approach could run
several instructions at once than it was that a complex system of
microinstructions with automatic dependency tracking, à la Pentium Pro,
could do the same.

By the time I was forming this opinion, it had been thoroughly
demonstrated that the Pentium Pro and its descendants worked well. It seemed to me
that the EPIC approach could also work, and save transistors to allow
more functional units.

Time has shown that I was wrong about that, and I no longer trust my
intuition on these things, but resort to measurements. It seems that I
was looking at it from a vaguely similar PoV to the Itanium architects,
not having properly appreciated the wealth of transistors becoming
available, and the way that latency and bandwidth in and out of the
cache are vital.

--
John Dallman, jgd(a)cix.co.uk, HTML mail is treated as probable spam.
From: Brett Davis on
In article <jwviq7n9wwm.fsf-monnier+comp.arch(a)gnu.org>,
Stefan Monnier <monnier(a)iro.umontreal.ca> wrote:

> >>>> 2.9: Register rotation, someone needs to be locked in a rubber room. ;)
> >>> Yes and no. They can be VERY effective, as on the Hitachi SR2201.
> >> How did they work?
> > Floating-point only, software-controlled, bypassing the cache.
> > It was called pseudovectorisation, which describes it very well.
>
> By "work" I meant: what were the instructions provided to setup/control
> the register rotation feature and what were their semantics?

HITACHI SR2201 Massively Parallel Processor
http://www.hitachi.co.jp/Prod/comp/hpc/eng/sr1.html

Has a short Do Loop example. Copied Cray after that boat sailed, and sank.


But ultimately is not register windowing just a horrid complex slow
way to get more register bits, in a fixed-width instruction set?

Are you not profoundly better off just adding an opcode extension word
with more register bits, like CLIW?

The overhead to keep track of all the windows is hardware CISC,
which cuts your clock rate to half of what real RISC and x86 provide.
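
To make the question concrete, here is a rough sketch of the
opcode-extension-word idea; the field layout, widths, and encoding below
are made up for illustration and are not CLIW's or any real ISA's:

/* Hypothetical encoding (made up for illustration): a 32-bit base
 * instruction with 5-bit register fields may be preceded by an extension
 * word that supplies 3 extra high bits per register specifier, giving
 * 8-bit register numbers (256 registers) without widening the base
 * instruction format. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    unsigned opcode;
    unsigned rd, rs1, rs2;    /* full (extended) register numbers */
} decoded_insn;

/* Base word:      [31:26] opcode, [25:21] rd, [20:16] rs1, [15:11] rs2 */
/* Extension word: [8:6] rd high bits, [5:3] rs1 high, [2:0] rs2 high   */
static decoded_insn decode(uint32_t ext, uint32_t base, int has_ext)
{
    decoded_insn d;
    d.opcode = (base >> 26) & 0x3F;
    d.rd  = (base >> 21) & 0x1F;
    d.rs1 = (base >> 16) & 0x1F;
    d.rs2 = (base >> 11) & 0x1F;
    if (has_ext) {            /* splice in the extra register bits */
        d.rd  |= ((ext >> 6) & 0x7) << 5;
        d.rs1 |= ((ext >> 3) & 0x7) << 5;
        d.rs2 |= ((ext >> 0) & 0x7) << 5;
    }
    return d;
}

int main(void)
{
    /* "add r200, r7, r66": high bits 6/0/2 go in the extension word,
       low 5 bits 8/7/2 stay in the base word */
    decoded_insn d = decode(0x182,
                            (1u << 26) | (8u << 21) | (7u << 16) | (2u << 11),
                            1);
    printf("opcode=%u rd=r%u rs1=r%u rs2=r%u\n", d.opcode, d.rd, d.rs1, d.rs2);
    return 0;
}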

Brett
From: "Andy "Krazy" Glew" on
On 4/19/2010 9:02 PM, Brett Davis wrote:
> In article<jwviq7n9wwm.fsf-monnier+comp.arch(a)gnu.org>,
> Stefan Monnier<monnier(a)iro.umontreal.ca> wrote:
>
>>>>>> 2.9: Register rotation, someone needs to be locked in a rubber room. ;)
>>>>> Yes and no. They can be VERY effective, as on the Hitachi SR2201.
>>>> How did they work?
>>> Floating-point only, software-controlled, bypassing the cache.
>>> It was called pseudovectorisation, which describes it very well.
>>
>> By "work" I meant: what were the instructions provided to setup/control
>> the register rotation feature and what were their semantics?
>
> HITACHI SR2201 Massively Parallel Processor
> http://www.hitachi.co.jp/Prod/comp/hpc/eng/sr1.html
>
> Has a short Do Loop example. Copied Cray after that boat sailed, and sank.
>
>
> But ultimately is not register windowing just a horrid complex slow
> way to get more register bits, in a fixed-width instruction set?
>
> Are you not profoundly better off just adding an opcode extension word
> with more register bits, like CLIW?
>
> The overhead to keep track of all the windows is hardware CISC,
> which cuts your clock rate to half of what real RISC and x86 provide.
>
> Brett

I'm not a big fan of register rotation, but let me jump to its defense:

Register rotation is NOT just a way of getting more registers without increasing the number of bits in the instruction.
In fact, one can propose useful register rotation implementations where the rotating register set is of size, say, 32, and
the space it maps into is the same size. I.e. where rotation gets you no more registers.

The usefulness of register rotation is that the registers rotate. This allows them to be used to create efficient
software-pipelined code, without having to unroll the loop (or copy values at the end of the loop).
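
A minimal sketch of what the rotation buys, simulated in plain C (the
rrb/loop-branch behavior below is modeled loosely on Itanium-style rotating
registers and a loop-closing branch; the names and sizes are made up for
illustration, not any exact ISA):

/* Architectural names r0..r7 map to phys[(name + rrb) mod 8], and the
 * loop-closing branch decrements rrb each iteration.  A value written to
 * r0 this iteration is therefore read back as r1 next iteration, so a
 * two-stage pipeline (load one ahead of use) needs only one copy of the
 * loop body and no end-of-loop register copies. */
#include <stdio.h>

#define NROT 8                        /* size of the rotating region */
static double phys[NROT];             /* physical rotating registers */
static int rrb;                       /* rotating register base */

static double *rot(int name)
{
    return &phys[((name + rrb) % NROT + NROT) % NROT];
}
static void loop_branch(void) { rrb--; }  /* what a ctop-style branch does */

int main(void)
{
    double a[6] = {1, 2, 3, 4, 5, 6}, sum = 0.0;

    *rot(1) = a[0];                   /* prologue: prime stage 2's input */
    for (int i = 1; i <= 6; i++) {
        /* single static loop body, two overlapped stages: */
        if (i < 6) *rot(0) = a[i];    /* stage 1: load for next iteration */
        sum += *rot(1);               /* stage 2: use last iteration's load */
        loop_branch();                /* rotate: this r0 becomes next r1 */
    }
    printf("sum = %g (expect 21)\n", sum);
    return 0;
}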

Register rotation is especially useful combined with predication, which allows the start-up and wind-down code of a
software-pipelined loop to be folded into the main body of the loop.

Register rotation is one of the few VLIW features that allows code size to be *reduced* rather than increased. For
example, I have found loops that I have had to unroll 4, 8, 16, and more times to get full performance. Yes, even on an
out-of-order machine. On a machine with register rotation, I could get away with only a single copy of the loop body.

One of the nice things about register rotation is that it almost removes the need for the compiler to make a decision:
optimize this loop by unrolling it and software pipelining it, or not? Register rotation makes that optimization
decision much simpler.

Register rotation does NOT need to slow the clock rate much beyond what x86 already has. In fact, x86's x87 floating
point already does a form of register rotation, in the x87 FP stack. It's painful, but it obviously has been made to run
at speed. The pain can be assuaged by adding a pipestage - a lesson we learned too late in P6.
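
For comparison, the x87 mapping looks like this: ST(i) names physical
register (TOP + i) mod 8, and pushes and pops move TOP, so the same
architectural name lands on a different physical register over time - the
same base-relative remapping a rotating register file does once per loop
iteration. A plain-C simulation of that addressing (not actual x87 code):

#include <stdio.h>

static double st_phys[8];                  /* the eight stack registers */
static int top = 0;                        /* the 3-bit TOP field, conceptually */

static double *ST(int i) { return &st_phys[(top + i) & 7]; }
static void fld(double v) { top = (top - 1) & 7; *ST(0) = v; }    /* push */
static double fstp(void)  { double v = *ST(0); top = (top + 1) & 7; return v; }

int main(void)
{
    fld(1.0);
    fld(2.0);                              /* now ST(0)=2.0, ST(1)=1.0 */
    printf("ST(0)=%g ST(1)=%g\n", *ST(0), *ST(1));
    printf("pop -> %g, then ST(0)=%g\n", fstp(), *ST(0));
    return 0;
}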

---

As I write this, I wonder why the GPUs do not provide register rotation. Perhaps because they don't care about
optimizing single threads at all.

From: "Andy "Krazy" Glew" on
On 4/20/2010 5:52 AM, Robert Myers wrote:
> Brett Davis wrote:
>
>> But ultimately is not register windowing just a horrid complex slow
>> way to get more register bits, in a fix width instruction set?
>
> Not in the case of Itanium, which has tons of registers.
>
> The purpose, as I understand it, is to permit more seamless operation
> across procedure calls.
>
>
> Robert.


Urghh. The conversation is mixing register windows and register rotation. Which Itanium did.

Register rotation: interesting. Hard to make generic. It's one of those ideas I keep thinking about, trying to get
its goodness into the sort of machine I like. But, until I do so, it's probably not worth it.

Register windows: questionable value.

However, multiple register contexts and a variable number of registers accessible per thread - the GPUs seem to be
demonstrating good value there.