From: "Andy "Krazy" Glew" on
Bill Todd wrote:
>[...Itanium...]
> Intel wasn't run by complete idiots, just by insufficiently skeptical
> (and/or 'easily impressed') but otherwise reasonably bright people. All
> they had to believe was that the expected performance domination would
> materialize (which was HP's area of expertise, and HP was at that time a
> reputable source) - and a hell of a lot of fairly bright people
> *outside* Intel bought into this right into the start of this decade,
> not just the middle of the last one.

Some of those people are still around.

Some of them don't understand the value of x86. Some of them have now
swung too far the other way, and think x86 is forever.

Most of them just don't understand OOO execution, or how you can
judiciously add hardware to solve problems.

Some of them think that the advent of Larrabee and Atom show that OOO is
a dead end. They think that we are resetting to simple P5-era in-order
machines. More, reversing evolution: retreating from Pentium 4
"fireball", backing out of OOO. Some think that we will never go back to
OOO. Me, I think that we are resetting. I think of it as a sawtooth
wave: backing out a bit, but probably advancing to dynamic techniques in
a few years.

Heck: Willamette / Pentium 4 was brought to you by people who thought
OOO was a bad idea. The original concept was anti-OOO. They were
forced to implement OOO, badly, because the anti-OOO approach did not fly.

Some of the people who brought you Itanium, who drank the VLIW koolaid,
are the people who are bringing you Larrabee. 'Nuff said.

Some of them think that the problem with VLIW was the instruction set.
Some of them think that binary translation will solve all those
problems. Me, I think that binary translation is a great idea - but
only if the target microarchitecture would make sense even in the
absence of binary translation, i.e. if we lived in an open source
world and x86 compatibility were not an issue. Too many binary
translation projects end up missing the point: you have got to have a
good target microarchitecture.
From: Terje Mathisen on
Robert Myers wrote:
> When I talked about rewriting code, I meant just that, not merely
> recompiling it. I wasn't all that interested in the standard task:
> how do you feed bad code to an Itanium compiler and get acceptable
> performance, because I was pretty sure that the answer was: you
> don't.
:-)
>
> I was more interested in the question: how do you write code so that a
> compiler can understand enough about it to emit code that could really
> exploit the architectural features of Itanium? I always assumed that

That didn't seem too hard to figure out:

You write your code so that it has short if/then/else blocks, preferably
of approximately the same size: This makes it easy for the compiler to
handle both paths simultaneously, with if-generated predicates to save
the proper results.
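
For instance (a made-up sketch, not from any real code base), both
arms here are one operation each, so the compiler can if-convert the
whole thing into predicated IA-64 code and never take a branch:

  /* Hypothetical example: both arms are short and balanced, so the
     compiler can evaluate them under predicates and simply keep the
     result whose predicate is true. */
  int abs_diff(int x, int limit)
  {
      int d;
      if (x > limit)
          d = x - limit;     /* arm 1: one subtract */
      else
          d = limit - x;     /* arm 2: one subtract */
      return d;
  }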

For loops, you unroll enough to cover the expected latency from L1 (or
L2 for fp), using the huge register file to hold all the intermediate
results.
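
Something like this (again a sketch, names made up), unrolled by four
so that four loads can be in flight while earlier adds are still
waiting for their data:

  /* Sketch: unrolling by four keeps several L1/L2 loads outstanding
     at once, and the large register file holds the partial sums.
     Assumes n is a multiple of 4 to keep the example short. */
  double sum4(const double *a, long n)
  {
      double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
      long i;

      for (i = 0; i < n; i += 4) {
          s0 += a[i];
          s1 += a[i + 1];
          s2 += a[i + 2];
          s3 += a[i + 3];
      }
      return (s0 + s1) + (s2 + s3);
  }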

You inline _very_ aggressively, since call/return is relatively
expensive, and you avoid all interrupt handling if at all possible. The
best solution here is probably to dedicate one cpu/core to interrupt
handling.

You also make sure that tasks run for a _long_ time between context
switches, since the overhead of saving/restoring the huge register
files is pretty significant.

I.e. this is/was a cpu which was very good at going fast in a straight
line, with the added capability of being able to do two-way splits for
short periods to absorb little branches.


> someone at Intel understood all that and briefed it to management and
> management said, "No problem. We'll have the only game in town, so
> people will conform their code to our hardware."
>
> If you accept that proposition, then all you need to do is to get
> enough code to run well to convince everyone else that it's either
> make their code do well on the architecture or die. I'm pretty sure
> that Intel tried to convince developers that that was the future they
> should prepare for.

Of course. :-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Terje Mathisen on
Andy "Krazy" Glew wrote:
> That's the whole point: you want to get as many cache misses outstanding
> as possible. MLP. Memory level parallelism.
>
> If you are serialized on the cache misses, e.g. in a linear linked list
>
> a) skip ahead to a piece of code that isn't. E.g. if you are pointer
> chasing in an inner loop, skip ahead to the next iteration of an outer
> loop. Or, to a next function.
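
For concreteness, here is the kind of thing that buys you (a made-up
sketch: two independent pointer chases interleaved so both misses can
be in flight at once):

  /* Sketch: the loads of a->next and b->next are independent, so
     their cache misses overlap instead of queueing one behind the
     other - memory level parallelism from plain C code. */
  struct node { struct node *next; int val; };

  int sum_two_lists(const struct node *a, const struct node *b)
  {
      int s = 0;
      while (a && b) {
          s += a->val + b->val;
          a = a->next;
          b = b->next;
      }
      while (a) { s += a->val; a = a->next; }
      while (b) { s += b->val; b = b->next; }
      return s;
  }

That works fine - as long as there is something independent to skip
ahead to.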

Andy, you really owe it to yourself to take a hard look at h264 and
CABAC: In approximately the same timeframe as DES was being replaced
by AES (which had a stated requirement of being easy to make
fast/efficient on a PentiumPro cpu), the MPEG working groups decided
that a "Context Adaptive Binary Arithmetic Coder" (CABAC) was the best
choice for the entropy coder in a video codec.

CABAC requires 3 or 4 branches for every single _bit_ decoded, and the
last of these branches depends on the value of that decoded bit.

Until you've made that branch you don't even know which context to apply
when decoding the next bit!

(I have figured out workarounds (either branchless code or making them
predictable) for most of those inline branches in the bit decoder, but
that last context branch is unavoidable.)
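
In rough C the core dependency looks like this (identifiers made up;
a real CABAC decoder is considerably more involved):

  /* decode_bin() stands in for the arithmetic-decode step, which is
     where the few "inline" branches above live.  The point is the
     last branch: the context for the next bit is only known once the
     current bit has been decoded. */
  int decode_run(int ctx, int nbits, int (*decode_bin)(int ctx),
                 const int *ctx_after_zero, const int *ctx_after_one)
  {
      int i, bit, ones = 0;

      for (i = 0; i < nbits; i++) {
          bit = decode_bin(ctx);          /* needs the current context */
          if (bit)                        /* the unavoidable branch    */
              ctx = ctx_after_one[ctx];
          else
              ctx = ctx_after_zero[ctx];
          ones += bit;
      }
      /* Even rewritten as a table lookup, the ctx -> bit -> ctx chain
         stays strictly serial, so there is nothing to overlap. */
      return ones;
  }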

The only possible skip ahead is a really big one: You have to locate the
next key frame and start another core/thread, but this approach is of
extremely limited value if you are in a realtime situation, e.g. video
conferencing.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Robert Myers on
On Oct 22, 2:06 am, Terje Mathisen <Terje.Mathi...(a)tmsw.no> wrote:

> I.e. this is/was a cpu which was very good at going fast in a straight
> line, with the added capability of being able to do two-way splits for
> short periods to absorb little branches.

I'm _way_ out on a limb here, Terje, but I think you can design a
GPGPU/stream-processor to do the same much more effectively. If
that's all Itanium could do, then we've all been snookered. That is
to say, Itanium was a ridiculously power-hungry GPGPU. I don't really
know if that's a fair characterization, but that's what your formula
seems to reduce to.

Robert.
From: nmm1 on
In article <1b3a5ckrqn.fsf(a)snowball.wb.pfeifferfamily.net>,
Joe Pfeiffer <pfeiffer(a)cs.nmsu.edu> wrote:
>Robert Myers <rbmyersusa(a)gmail.com> writes:
>> On Oct 21, 8:16 pm, Bill Todd <billt...(a)metrocast.net> wrote:
>>>
>>> > I think that Intel seriously expected that the entire universe of
>>> > software would be rewritten to suit its ISA.
>>>
>>> > As crazy as that sounds, it's the only way I can make sense of Intel's
>>> > idea that Itanium would replace x86 as a desktop chip.

No. Intel were suckered by the HP people who said that compiler
technology could handle that.

>>> Did you forget that the original plan (implemented in Merced and I'm
>>> pretty sure McKinley as well) was to include x86 hardware on the chip to
>>> run existing code natively?

It wasn't in the original plan. It was in the first post-panic
redesign.

>> I never took that capability seriously. Was I supposed to? I always
>> thought it was a marketing gimmick.
>
>We were sure supposed to take it seriously -- didn't Merced actually
>have a i386 core on it when delivered?

I can't remember - they changed plans several times, and I can't
remember which they delivered for the Merced.

The original plan was that ISA translation technology was advancing
fast enough that they could convert x86 code to IA64 code and beat
the best x86s by a factor of three. Like Alpha, only more so.

When they discovered that it didn't work (for reasons some of us
had predicted), they panicked and proposed to add a complete x86
core 'until the software was improved'. That went through a couple
of revisions.


Regards,
Nick Maclaren.