From: Joe Pfeiffer on
Robert Myers <rbmyersusa(a)gmail.com> writes:

> On Oct 21, 8:16 pm, Bill Todd <billt...(a)metrocast.net> wrote:
>> Robert Myers wrote:
>> > On Oct 21, 6:08 am, Bill Todd <billt...(a)metrocast.net> wrote:
>>
>> >> Once again that's irrelevant to the question under discussion here:
>> >> whether Terje's statement that Merced "_would_ have been, by far, the
>> >> fastest cpu on the planet" (i.e., in some general sense rather than for
>> >> a small cherry-picked volume of manually-optimized code) stands up under
>> >> any real scrutiny.
>>
>> > I think that Intel seriously expected that the entire universe of
>> > software would be rewritten to suit its ISA.
>>
>> > As crazy as that sounds, it's the only way I can make sense of Intel's
>> > idea that Itanium would replace x86 as a desktop chip.
>>
>> Did you forget that the original plan (implemented in Merced and I'm
>> pretty sure McKinley as well) was to include x86 hardware on the chip to
>> run existing code natively?
>
> I never took that capability seriously. Was I supposed to? I always
> thought it was a marketing gimmick.

We were sure supposed to take it seriously -- didn't Merced actually
have an i386 core on it when delivered?
--
As we enjoy great advantages from the inventions of others, we should
be glad of an opportunity to serve others by any invention of ours;
and this we should do freely and generously. (Benjamin Franklin)
From: Bill Todd on
Robert Myers wrote:
> On Oct 21, 9:44 pm, Bill Todd <billt...(a)metrocast.net> wrote:
>> Robert Myers wrote:
>>> On Oct 21, 8:16 pm, Bill Todd <billt...(a)metrocast.net> wrote:
>>>> Robert Myers wrote:
>>>>> On Oct 21, 6:08 am, Bill Todd <billt...(a)metrocast.net> wrote:
>>>>>> Once again that's irrelevant to the question under discussion here:
>>>>>> whether Terje's statement that Merced "_would_ have been, by far, the
>>>>>> fastest cpu on the planet" (i.e., in some general sense rather than for
>>>>>> a small cherry-picked volume of manually-optimized code) stands up under
>>>>>> any real scrutiny.
>>>>> I think that Intel seriously expected that the entire universe of
>>>>> software would be rewritten to suit its ISA.
>>>>> As crazy as that sounds, it's the only way I can make sense of Intel's
>>>>> idea that Itanium would replace x86 as a desktop chip.
>>>> Did you forget that the original plan (implemented in Merced and I'm
>>>> pretty sure McKinley as well) was to include x86 hardware on the chip to
>>>> run existing code natively?
>>> I never took that capability seriously. Was I supposed to?
>> Why not? It ran x86 code natively in an integrated manner on a native
>> Itanic OS. As with most things Merced the original cut wasn't
>> impressive in terms of speed, but the relative sizes of the x86 and
>> Itanic processors (especially given the amount of the chip area
>> dedicated to cache) made it clear that full-fledged x86 cores could be
>> included later if necessary as soon as the next process generations
>> appeared.
>>
> The die area may have been available, but I don't think the watts
> were. It's hard to remember with any accuracy what I knew when, but
> it's pretty easy to tell at least some of what Intel knew. By the
> second half of the nineties, Intel knew and briefed that power was
> going to be a big problem.

Not necessarily. Intel didn't have working silicon until some time in
1998 and were holding out hope for power reductions before shipping
product well beyond that date (and further hope that McKinley would
achieve whatever power targets Merced failed to). The decision to
include the x86 core occurred far earlier (and Intel x86 cores at that
time were still relatively stingy in their power requirements).

- bill
From: Robert Myers on
On Oct 22, 12:00 am, Bill Todd <billt...(a)metrocast.net> wrote:
> Robert Myers wrote:

>
> > The die area may have been available, but I don't think the watts
> > were.  It's hard to remember with any accuracy what I knew when, but
> > it's pretty easy to tell at least some of what Intel knew.  By the
> > second half of the nineties, Intel knew and briefed that power was
> > going to be a big problem.
>
> Not necessarily.  Intel didn't have working silicon until some time in
> 1998 and were holding out hope for power reductions before shipping
> product well beyond that date (and further hope that McKinley would
> achieve whatever power targets Merced failed to).  The decision to
> include the x86 core occurred far earlier (and Intel x86 cores at that
> time were still relatively stingy in their power requirements).

I'm sorry. I should have been more explicit. Intel never admitted
that power was a problem for Itanium, but I have a Gelsinger
(Otellini?) briefing somewhere, c. 1997-1998, that extrapolates x86
power density to that of a space shuttle heat shield tile. It was
clear that they were headed to multicore, especially since Patterson
was a consultant and he'd been saying similar things for several
years by then.

Robert.

From: "Andy "Krazy" Glew" on
Mayan Moudgill wrote:
> About making windows bigger: my last work on this subject is a bit
> dated, but, at that time, for most workloads, you pretty soon hit a
> point of exponentially smaller returns. Path mispredicts & cache misses
> were a couple of the gating factors, but so were niggling little details
> such as store-queue sizes, retire resources & rename buffer sizes. There
> is also the nasty issue of cache pollution on mispredicted paths.

Not what I have seen. Unless a square-root law is what you call
diminishing returns. Which it is, but there is a big difference between
square root and worse.

Branch prediction:

(1) branch predictors *have* gotten a lot better, and will continue to
get better for quite a few more years. Seznec's O-GEHL predictor opened
up a whole new family of predictors with extremely long histories. The
multilevel branch predictor techniques - some of which I pioneered but
did not publish (apart from a thesis proposal for the Ph.D. I abandoned
when my daughter was born), which Adam Butts also pushed, and which
Daniel Jimenez published in combination with his neural net predictor -
provide a way in which you can get the accuracy of a big predictor, but
the latency of a small predictor. (A rough C sketch of this overriding
arrangement appears after point (3) below.)

Daniel later recanted, publishing a paper saying that the
microarchitectural complexity of handling late-arriving branch
predictions was not warranted. I disagree, in part because handling
them uses techniques very similar to Willamette's replay pipeline,
which was not well known when Daniel did his Ph.D.

Since I know Daniel reads this newsgroup, perhaps he would care to say
what he thinks about multilevel branch prediction now?

(2) In a really large window, many - perhaps most - mispredicted
branches are actually repaired before the instruction retires.

(3) Recall that I am a fan of skip-ahead, speculative multithreading
architectures such as Haitham Akkary's DMT. If you can't predict a
branch, skip ahead to the next loop iteration or function return, and
execute code that you know will be executed with high probability.
Control independence.
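
To make the overriding idea in (1) concrete, here is a rough C sketch
of a two-level arrangement: a small table answers immediately, a much
larger long-history table answers a couple of cycles later and
re-steers fetch if it disagrees. The table sizes, the index hash, and
all of the names are my own illustrative assumptions, not any
particular machine's design.

  /* Hedged sketch of an "overriding" two-level branch predictor.
   * The small table answers in one cycle; the large history-indexed
   * table answers later and, on disagreement, re-steers fetch -
   * a small bubble instead of a full misprediction flush. */

  #include <stdint.h>
  #include <stdbool.h>

  #define FAST_ENTRIES  1024u      /* small, single-cycle table */
  #define SLOW_ENTRIES  65536u     /* big, multi-cycle table    */

  static uint8_t  fast_ctr[FAST_ENTRIES];   /* 2-bit saturating counters */
  static uint8_t  slow_ctr[SLOW_ENTRIES];
  static uint64_t global_hist;              /* long global history       */

  /* Early prediction, available immediately to keep fetch moving. */
  static bool fast_predict(uint64_t pc)
  {
      return fast_ctr[pc % FAST_ENTRIES] >= 2;
  }

  /* Late prediction from the big table; real predictors hash
   * geometric history lengths far more carefully than this. */
  static bool slow_predict(uint64_t pc)
  {
      uint64_t idx = (pc ^ global_hist ^ (global_hist >> 16)) % SLOW_ENTRIES;
      return slow_ctr[idx] >= 2;
  }

  /* Called when the late prediction arrives; early_pred came from
   * fast_predict().  *resteer tells the front end whether to restart
   * fetch down the corrected path. */
  static bool late_override(uint64_t pc, bool early_pred, bool *resteer)
  {
      bool late = slow_predict(pc);
      *resteer = (late != early_pred);
      return late;
  }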

Cache misses:

That's the whole point: you want to get as many cache misses outstanding
as possible. MLP. Memory level parallelism.

If you are serialized on the cache misses, e.g. in a linear linked list:

a) skip ahead to a piece of code that isn't. E.g. if you are pointer
chasing in an inner loop, skip ahead to the next iteration of an outer
loop. Or, to a next function.

b) actually, serializing on a linear linked list is ideal: in the ASPLOS
WACI talk where I first publicized the notion of MLP, Memory Level
Parallelism, I sketched out how to use in-place cache line compression
to allow quite a few bits of extra data to be stored per cache line.
E.g. take 64 byte cache lines. Compression rates of 10-20% are quite
doable. It's easy to see that you can get an extra 32 bits of data into
the cache line. Store something like a skiplist pointer in the space
freed up. Leave the cache line at the same address.
I.e. use cache line compression not to reduce memory size, but to add
extra metadata to memory.

The fly in the ointment was that in-place cache line compression required
at least 1 extra bit per cache line, to handle the incompressible case.
This has recently been solved: you need the extra bit, but it can be
stored in a separate region of memory; you compress the data so that
*usually* you can tell whether a line is compressed just by looking at
the main 64 bytes. You only go get the extra bit(s) that tell you whether
the line was incompressible in the rare cases where the data doesn't fit.
I.e., instead of always having the extra bit and always accessing it, you
always have the extra bit, but probably don't need to access it.
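
A minimal sketch of that layout, assuming a trivial stand-in for the
compressor and an illustrative "skip pointer" format (nothing here is
the scheme from the WACI talk, just the shape of the idea): if the
64-byte line compresses by at least four bytes, a 32-bit skip pointer
is tucked into the freed space and the line stays at its original
address.

  #include <stdint.h>
  #include <stddef.h>
  #include <string.h>
  #include <stdbool.h>

  #define LINE_BYTES 64

  struct line { uint8_t bytes[LINE_BYTES]; };

  /* The "is this line compressed?" bit, kept in a separate region of
   * memory and consulted only in the rare ambiguous case. */
  extern bool line_compressed_bit(uintptr_t line_addr);

  /* Trivial stand-in for a real compressor: call the line compressible
   * only if its last 8 bytes happen to be zero (real schemes do much
   * better).  Returns compressed size, or LINE_BYTES if incompressible. */
  static size_t try_compress(const uint8_t *in, uint8_t *out)
  {
      for (int i = LINE_BYTES - 8; i < LINE_BYTES; i++)
          if (in[i] != 0) {
              memcpy(out, in, LINE_BYTES);
              return LINE_BYTES;
          }
      memcpy(out, in, LINE_BYTES - 8);
      return LINE_BYTES - 8;
  }

  /* Try to embed a skip pointer (e.g. the node two hops ahead in a
   * linked list) into the space freed by compressing the line. */
  bool embed_skip_pointer(struct line *l, uint32_t skip_ptr)
  {
      uint8_t tmp[LINE_BYTES];
      size_t  csize = try_compress(l->bytes, tmp);

      if (csize + sizeof skip_ptr > LINE_BYTES)
          return false;                 /* incompressible: no metadata */

      memcpy(l->bytes, tmp, csize);     /* same address, smaller payload */
      memcpy(l->bytes + LINE_BYTES - sizeof skip_ptr,
             &skip_ptr, sizeof skip_ptr);
      return true;
  }

A pointer-chasing loop (or the hardware walking it) that finds the skip
pointer can then launch the miss two nodes ahead, so the misses overlap
instead of serializing.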

OK, so linear linked lists are not a problem. What remains problematic
are data structures that have high valency - e.g. a B-tree with 70
children per node, visited randomly - and hash tables that are accessed
one after the other: hash(A0) -> A1 => hash(A1) -> A2 => hash(A2) -> A3
=> ... I don't know how to solve such "high valency" or "hash chasing"
MLP problems, except by skipping ahead, or running another thread.


Store-queue sizes:

You've got to make a scalable store queue. When I took Chuck Moore out
to dinner while he was interviewing at AMD, we agreed that store queues
were the biggest overall issue. However, there was an ISCA that had
several papers on scalable store queues. None of them are exactly how I
would like to do it, but several were variations on multilevel store
queues. I now consider store queues a solved problem. Take one of the
academic papers, or take my patent pending, or, if you are at AMD or
Intel, use one of the techniques that I invented at those places which I
can't use. By now I have solved this problem 3 or 4 times.
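
As one illustration of what "multilevel" might mean here - a sketch
under assumed sizes, not any of those papers' designs nor mine - a
small, fully searched first-level queue holds the youngest stores and
spills its oldest entries into a larger, loosely organized second-level
structure that loads probe only when the first level has no match:

  #include <stdint.h>
  #include <stdbool.h>

  #define SQ1_SIZE  16      /* small, fast, fully searched          */
  #define SQ2_SIZE  256     /* big, slower, probed on an L1 miss    */

  struct sq_entry { uint64_t addr; uint64_t data; bool valid; };

  static struct sq_entry sq1[SQ1_SIZE];
  static struct sq_entry sq2[SQ2_SIZE];
  static int sq1_head, sq1_tail, sq1_count;

  /* Insert a new store; spill the oldest L1 entry to L2 if full. */
  void sq_insert(uint64_t addr, uint64_t data)
  {
      if (sq1_count == SQ1_SIZE) {
          struct sq_entry old = sq1[sq1_head];
          sq2[old.addr % SQ2_SIZE] = old;      /* simplistic placement */
          sq1_head = (sq1_head + 1) % SQ1_SIZE;
          sq1_count--;
      }
      sq1[sq1_tail] = (struct sq_entry){ addr, data, true };
      sq1_tail = (sq1_tail + 1) % SQ1_SIZE;
      sq1_count++;
  }

  /* Load lookup: newest-first search of L1, then a cheap probe of L2. */
  bool sq_forward(uint64_t addr, uint64_t *data)
  {
      for (int i = 1; i <= sq1_count; i++) {
          int idx = (sq1_tail - i + SQ1_SIZE) % SQ1_SIZE;
          if (sq1[idx].valid && sq1[idx].addr == addr) {
              *data = sq1[idx].data;
              return true;
          }
      }
      struct sq_entry *e = &sq2[addr % SQ2_SIZE];
      if (e->valid && e->addr == addr) { *data = e->data; return true; }
      return false;
  }

A real design would filter the second-level probe (membership bits, a
Bloom filter, or the like) rather than index it on every load, and
would track sizes and partial overlaps; this just shows the two-level
shape.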

Notice the pattern? I keep saying "multilevel everything". Multilevel
branch predictor, multilevel instruction fetch from a single thread,
multilevel threading, multilevel register file, multilevel scheduler,
multilevel retirement. That's why I called my toy uarch from 2004
"Multi-Star".

Retire resources: I'm not sure what you mean by retire resources. If it
is the physical register file, multilevel it. If it is state that is no
longer needed for execution, but which is needed to commit, multilevel
it. In my dreams, I imagine storing this in DRAM - after filtering and
compressing it.

Rename buffer sizes: here I am puzzled. If you mean the actual renamer,
it is R entries, where R is the number of logical registers. The
entries might be lg(N) bits, where N is the size of the OOO window.
This grows at O(R*lg(N)), which is a lot better than the O(N^2) scaling
that stupid scheduler techniques have. But even here renaming can be
done multilevel, at the cost of data movement.
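
For concreteness, plugging in assumed numbers (64 logical registers, a
512-entry window - my numbers, not anyone's product) shows how small
the map itself stays compared to anything that scales as N^2:

  #include <stdio.h>

  int main(void)
  {
      int R = 64;              /* assumed logical registers          */
      int N = 512;             /* assumed OOO window / physical tags */
      int bits = 9;            /* lg(512)                            */

      printf("rename map: %d entries x %d bits = %d bits (~%d bytes)\n",
             R, bits, R * bits, R * bits / 8);
      /* versus an N^2 broadcast/compare structure at the same N: */
      printf("N^2 cells: %d\n", N * N);
      return 0;
  }

That is 576 bits of map versus roughly a quarter-million cells for a
naive N^2 structure, which is the point of the O(R*lg(N)) claim.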


> There is also the nasty issue of cache pollution on mispredicted paths.

Again, not a problem I have seen, once you (a) have good branch
predictors, (b) have a good memory dependency predictor, and (c) can
skip ahead to code that you are more confident of.

---

I think that all of these problems can be solved. But, you have to
solve *all* of these problems. Together. At the same time. You can't
just solve one of them, and stick it into a simulator with legacy,
non-scalable, versions of the other parts of the CPU.

I wish that academics could publish papers along the lines of "I will
just wave a magic wand, and imagine that all of the other parts of the
CPU can be expanded, except <just the part that I am going to show how
to scale in this paper>". Unfortunately, program committees are full of
people who will reject papers because you have made unjustified
assumptions. (No, they did not reject my papers. I have only submitted
papers in refereed situations twice - although they were rejected in
this way. Mainly, I am talking about program committees where I have
seen the other referees' comments.)

In order to make progress in small steps, you have to be willing to
suspend disbelief. So long as eventually you can pull it all together.

From: Robert Myers on
On Oct 21, 11:54 pm, Joe Pfeiffer <pfeif...(a)cs.nmsu.edu> wrote:
> Robert Myers <rbmyers...(a)gmail.com> writes:
> > On Oct 21, 8:16 pm, Bill Todd <billt...(a)metrocast.net> wrote:
> >> Robert Myers wrote:
> >> > On Oct 21, 6:08 am, Bill Todd <billt...(a)metrocast.net> wrote:
>
> >> >> Once again that's irrelevant to the question under discussion here:
> >> >> whether Terje's statement that Merced "_would_ have been, by far, the
> >> >> fastest cpu on the planet" (i.e., in some general sense rather than for
> >> >> a small cherry-picked volume of manually-optimized code) stands up under
> >> >> any real scrutiny.
>
> >> > I think that Intel seriously expected that the entire universe of
> >> > software would be rewritten to suit its ISA.
>
> >> > As crazy as that sounds, it's the only way I can make sense of Intel's
> >> > idea that Itanium would replace x86 as a desktop chip.
>
> >> Did you forget that the original plan (implemented in Merced and I'm
> >> pretty sure McKinley as well) was to include x86 hardware on the chip to
> >> run existing code natively?
>
> > I never took that capability seriously.  Was I supposed to?  I always
> > thought it was a marketing gimmick.
>
> We were sure supposed to take it seriously -- didn't Merced actually
> have an i386 core on it when delivered?

It had something or other, but PIII had to be in the works (Andy would
know) and it would have stomped anything that came before.

That is to say, I find it hard to believe that anyone took Itanium
seriously as an x86 competitor.

Robert.