From: Noob on
Andrew Reilly wrote:

> Didn't SGI open-source their own in-house itanium compiler,
> (open64 or something like that)?

Correct.

http://en.wikipedia.org/wiki/Open64
http://www.open64.net/about-open64.html

"""
Formerly known as Pro64, Open64 was initially created by SGI and
licensed under the GNU Public License (GPL v2). It was derived from
SGI's MIPSPro compiler.
"""
From: Noob on
Robert Myers wrote:

> Open source depends on gcc, perhaps the cruftiest bit of code
> on the planet.

What kind of unsubstantiated BS is this?
From: Mayan Moudgill on
Andy "Krazy" Glew wrote:
> Mayan Moudgill wrote:
>

> Branch prediction:
>
> (1) branch predictors *have* gotten a lot better, and will continue to
> get better for quite a few more years.

There are at least 3 branch predictors you have to worry about:
direction predictors, return stack predictors and next-fetch-address
predictors. If you combine the effect of all 3, then, depending on your
code mix, your combined accuracy can be considerably lower than what most
published work reported. Database code, certain kinds of highly OO code,
and code that made a lot of OS calls were among the prominent culprits.

BTW: since the work we were doing was in simulation, we looked at
impractical structures: large tables, high associativity, next-cycle NFA
fixup for computed branches, perfect & early table update, etc. So our
accuracies in some of the models we looked at were considerably higher
than what was then practical, and may still be higher than what is now
practical.

>
> Cache misses:

I'm more worried about I$ misses. Even with 100% prediction accuracy,
you might still miss in the I$. At which point, what do you do?

Somehow you have to figure out some code that is:
- in the I$
- whose input registers are available (or predictable!)
- which has a reasonable chance of actually being executed.

One approach was to go back and execute some other path, such as the
other side of a weakly taken/not-taken branch. We didn't look at that,
partly because prior work suggested that it wasn't much of an
improvement, and partly because it would have been difficult to get the
renaming structures (particularly freeing the registers on
retirement/branch resolution) done easily.

Loop-ful code doesn't run into these problems. But then, loop-ful code
doesn't need fancy predictors to get very good results, either.

For non-loop-heavy code, I seem to remember that the number of
instructions between cache misses was small-ish (assuming caches in the
32K-128K region). My memory is hazy, but IIRC the one-sigma
was about 40? [Note that this is independent of prediction and
everything else - this is just the number of instructions on the taken
path between misses].

> (3) Recall that I am a fan of skip-ahead, speculative multithreading
> architectures such as Haitham Akkary's DMT. If you can't predict a
> branch, skip ahead to the next loop iteration or function return, and
> execute code that you know will be executed with high probability.

Possibly. I am a little skeptical of results based on Spec95, but it
seems worth looking into. But (and I may be overly cynical here) I
suspect that in a real implementation, it will end up giving the usual
+/-5% performance delta, with a 50:50 chance that it is -5% rather than
+5%. Note that the original late-'90s paper required broadside copying
of some fairly large arrays.

Slightly off-topic: did IBM or anyone else make real traces available to
researchers? I know they were talking about it, but did they follow through?
From: Mayan Moudgill on
Terje Mathisen wrote:

> For loops you unroll enough to cover the expected latency from L1 (or L2
> for fp), using the huge register arrays to save all the intermediate
> results.
>

Huh? Terje, I agree with your other points, but *SURELY* the compiler
would optimize loops by unrolling (and applying other techniques,
including register tiling and software pipelining) correctly? Are you
telling me that they didn't get even *THAT* implemented correctly?
From: eternal september on
"Andy "Krazy" Glew" <ag-news(a)patten-glew.net> wrote in message
news:4ADF1711.6060107(a)patten-glew.net...
> eternal september wrote:
>> "Andy "Krazy" Glew" <ag-news(a)patten-glew.net> wrote in message
>> news:4ADEA866.5090000(a)patten-glew.net...
>
> I'll reread the HSW papers and get back to comp.arch.

Looking forward to it!

> Note: I'm not just an OOO bigot. I also have neat ideas about parallel
> systems, MP, MIMD, Coherent Threading. But I am probably the most
> aggressive OOO computer architect on Earth.

I don't know, I am pretty aggressive. We will have to have an OOO arm
wrestling contest! I have argued in favor of OOO against Itanium
architects...

> This is why I get frustrated when people say "OOO CPU design has run into
> a wall, and can't improve: look at Intel and AMD's latest designs". I know
> how to improve things - but I was not allowed to work on it at Intel for
> the last 5 years. Now I can - in my copious free time.

You're right that power is not the issue. Power has set things back a few
generations (the Atom scheduler is as complex as first-gen OOO schedulers,
although there is no renamer; ARM has released an OOO machine). The
low-power guys can learn from P4's mistakes - but there was a lot of good
stuff there that was thrown out for Core 2.

The problems I see:
1) We are running out of Si generations - not many are left (from an
economics point of view), and each generation gives us less performance
than it used to (because of wire scaling and optimizing for leakage)

2) Management is risk averse
This is the big one. With a new core costing hundreds of millions, only
small improvements on existing cores can be justified.

Thanks!
Ned