From: Anton Ertl on
"Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> writes:
>There were several patents filed, with diagrams that
>looked very much like the ones I drew for the K10 proposal.

That's a different K10 than the Barcelona (marketed as, e.g., Phenom),
right?

>Anyway: if it has an L0 or L1 data cache in the cluster, with or
>without the scheduler, it's my MCMT.

That reminds me of a paper [kim&smith02] that I have always found very
impressive, and I wondered why I have never seen followup papers or an
implementation (but then I don't follow ISCA and Micro proceedings
closely these days); hmm, looking at the research summary at Smith's
home page, he is still working on something like this. Of course that
work was also done in Wisconsin (same University, right?), so it may
have been inspired by your ideas. Do you have any comments on that?

@InProceedings{kim&smith02,
author = {Ho-Seop Kim and James E. Smith},
title = {An Instruction Set and Microarchitecture for
Instruction Level Distributed Processing},
crossref = {isca02},
pages = {71--81},
url = {http://www.ece.wisc.edu/~hskim/papers/kimh_ildp.pdf},
annote = {This paper addresses the problems that wide
superscalars have with communication across the chip and
with the number of write ports in the register file. The
authors propose an architecture (ILDP) with
general-purpose registers and with accumulators; each
instruction accesses only one accumulator (read and/or
write) and one register (read or write), and an
accumulator's death is specified explicitly in the
instructions. The
microarchitecture builds \emph{strands} from
instructions working on an accumulator; a strand
starts with an instruction writing to an accumulator
without reading from it, continues with instructions
reading from (and possibly writing to) the
accumulator and ends with an instruction that kills
the accumulator. Strands are allocated to one out of
eight processing elements (PEs) dynamically (i.e.,
accumulators are renamed). A PE consists mainly of
one ALU data path (but also contains a copy of the
GPRs and an L1 cache). They evaluated this
architecture by translating Alpha binaries into it,
and comparing their architecture to a 4-wide or
8-wide Alpha implementation; their architecture has
a lower L1 cache latency, though. The performance of
ILDP in clock cycles is competitive, and one can
expect faster clocks for ILDP. The paper also
presents data for other stuff, e.g. general-purpose
register writes, which have to be promoted between
strands and which are relatively few.}
}
@Proceedings{isca02,
title = "$29^\textit{th}$ Annual International Symposium on Computer Architecture",
booktitle = "$29^\textit{th}$ Annual International Symposium on Computer Architecture",
year = "2002",
key = "ISCA 29",
}
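
To make the strand-formation rule in the annotation concrete, here is a
minimal sketch in Python. It is not from the paper; the Insn encoding, the
field names, and the round-robin strand-to-PE steering are illustrative
assumptions of mine (the paper steers strands to PEs dynamically with its
own heuristic).

# Minimal sketch of ILDP-style strand formation and PE steering.
# Toy encoding: each instruction names one logical accumulator and
# carries explicit read/write/kill flags (field names are mine, not
# the paper's). Steering is round-robin, a placeholder for the
# paper's dynamic strand-to-PE assignment.

from dataclasses import dataclass
from itertools import count

@dataclass
class Insn:
    acc: str          # logical accumulator named by the instruction
    writes_acc: bool  # instruction writes the accumulator
    reads_acc: bool   # instruction reads the accumulator
    kills_acc: bool   # accumulator death is specified explicitly

NUM_PES = 8

def dispatch(insns):
    # A strand starts with an instruction that writes an accumulator
    # without reading it, continues with instructions reading (and
    # possibly writing) it, and ends with the instruction that kills it.
    strand_ids = count()
    live = {}                    # logical accumulator -> (strand, PE)
    placement = []
    for insn in insns:
        if insn.writes_acc and not insn.reads_acc:
            sid = next(strand_ids)          # new strand: rename accumulator
            live[insn.acc] = (sid, sid % NUM_PES)
        sid, pe = live[insn.acc]
        placement.append((insn, sid, pe))
        if insn.kills_acc:
            del live[insn.acc]              # strand ends here
    return placement

# Two short strands on the same logical accumulator a0:
prog = [
    Insn("a0", writes_acc=True,  reads_acc=False, kills_acc=False),
    Insn("a0", writes_acc=True,  reads_acc=True,  kills_acc=False),
    Insn("a0", writes_acc=False, reads_acc=True,  kills_acc=True),
    Insn("a0", writes_acc=True,  reads_acc=False, kills_acc=False),
    Insn("a0", writes_acc=False, reads_acc=True,  kills_acc=True),
]
for insn, sid, pe in dispatch(prog):
    print("acc=%s strand=%d PE=%d" % (insn.acc, sid, pe))

The point is only that the same logical accumulator gets renamed into
independent strands, each of which can be steered to a different PE.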

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
From: "Andy "Krazy" Glew" on
Anton Ertl wrote:
> "Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> writes:
>> There were several patents filed, with diagrams that
>> looked very much like the ones I drew for the K10 proposal.
>
> That's a different K10 than the Barcelona (marketed as, e.g., Phenom),
> right?

So far as I can tell, Barcelona is a K8 warmed over.

There were several K10s. While I wanted to work on low power when I went
to AMD, I was hired to consult on low power and do high end CPU, since
the low power project was already rolling and did not need a new chef.
The first K10 that I knew at AMD was a low power part. When that was
cancelled I was sent off on my lonesome, then with Mike Haertel, to work
on a flagship, out-of-order, aggressive processor, while the original
low power team did something else. When that other low-power project was
cancelled, that team came over to the nascent K10 that I was working on.
My K10 was MCMT, plus a few other things. I had actually had to
promise Fred Weber that I would NOT do anything advanced for this K10 -
no SpMT, just MCMT. But when the other guys came on board, I thought
this meant that I could leave the easy stuff for them, while I tried to
figure out how to do SpMT and/or any other way of using MCMT to speed up
single threads. Part of my motivation was that I had just attended ISCA
2003 in San Diego, where several of the outstanding problems in big machines
had been solved, and I was scared that Intel would come out with
something if we did not.

In retrospect, that fear was unjustified.

Moral: don't give up power.

I don't know what happened to K10 after I left AMD, but I'm guessing
that the names were shifted, a warmed over K8 became K10, and the MCMT
K10 became Bulldozer. Slipping quite a few years.
From: Anton Ertl on
"Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> writes:
>Anton Ertl wrote:
>> "Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> writes:
>>> There were several patents filed, with diagrams that
>>> looked very much like the ones I drew for the K10 proposal.
>>
>> That's a different K10 than the Barcelona (marketed as, e.g., Phenom),
>> right?
>
>So far as I can tell, Barcelona is a K8 warmed over.
....
>I don't know what happened to K10 after I left AMD, but I'm guessing
>that the names were shifted, a warmed over K8 became K10

Yes, that was visible to us outsiders around the time of the release
of Barcelona. IIRC the K-name that we could read about was K8L at
first, later K10. My guess was that K9 was a cancelled project (or
maybe skipped because they did not want a project that was a dog:-).

>and the MCMT
>K10 became Bulldozer. Slipping quite a few years.

In the meantime I have read
<http://www.anandtech.com/printarticle.aspx?i=3674>, and at least
judging from the block diagrams there, Bobcat, the low-power part
from AMD, will be related to Bulldozer: a core consists of one cluster
(instead of two), and the FPU may be different, but the overall
structure is similar. So maybe your work also ended up in a low-power
part after all.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
From: "Andy "Krazy" Glew" on
Anton Ertl wrote:
> That reminds me of a paper [kim&smith02] that I have always found very
> impressive, and I wondered why I have never seen followup papers or an
> implementation (but then I don't follow ISCA and Micro proceedings
> closely these days); hmm, looking at the research summary at Smith's
> home page, he is still working on something like this. Of course that
> work was also done in Wisconsin (same University, right?), so it may
> have been inspired by your ideas. Do you have any comments on that?
>
> @InProceedings{kim&smith02,
> author = {Ho-Seop Kim and James E. Smith},
> title = {An Instruction Set and Microarchitecture for
> Instruction Level Distributed Processing},
> crossref = {isca02},

Same university, but I doubt direct inspiration. I know both Ho-Seop
and Jim Smith. But, they were in EE, while I was in CS under Sohi.

At the time there were many of us playing in the same space. Subbarao
Palacharla was, I believe, the first to get published in the area, with
the complexity-effective superscalar work, also with Jim Smith. Stole
much of my thunder.

However, Subrao's, and especially Ho-Seop's, work was very much of the
flavor of "big out-of-order machines are too complicated, so we have to
find a way to approximate them by putting together simpler components".
Whereas my work was more of "How do I take the OOO designs that I know
are being worked on, and scale them up?"  Also, at the time I knew
about Willamette - indeed, the scheduler structure of queues feeding
an RS arose from the debate between OOO (me) and in-order (Sager and
Upton) - but the academics at UWisc did not know. And I could not
tell them.

(This led, e.g., to me having to maintain 2 simulators, 1 with public
info and 1 with private info. It also led to me having arguments with
Jim Smith, where Jim said that it was useless to scale up window size
because branch predictors weren't accurate, whereas I knew about
Willamette's predictors, so I knew that much more accuracy was coming.
Much of my work in multilevel branch predictors was just designed to shut
Jim up, by showing that greater accuracy was possible, so I could just
go and work on big windows. Relatedly, I remember trying to persuade
Jim that 300 cycle memory latency was interesting, but he couldn't get
there. Years later he published "A Day in the Life of a Cache Miss",
imho a badly misnamed paper, that started exploring long latencies.)

Anyway, my big problem with ILDP is that it was a microoptimization. To
use it, you would basically have to throw away out-of-order CPUs, and
start over. And in the first generation, you would just be playing
catch up.

I've seen this many times. People think that they can get paid to
re-implement an existing CPU better, with a better, newer
microarchitecture. Maybe so - but remember that you are then competing
with the design team that is already going over the existing design with
a fine tooth comb. I've seen this several times with attempts to do a
new low power microarchitecture. I think that Wmt was much like this -
out-of-order done anew, rather than extending P6 OOO. The folks who
pushed run-ahead were in this camp: they weren't better than OOO, just
more efficient. Or so they believed. Because you also have to remember
that there is risk in doing anything new - so if the new supposedly more
efficient microarchitecture does not quickly make the phase change to
proven more efficient, it will get canned.

MCMT is more along the other line. It takes the existing pieces of a
microarchitecture, and rearranges them.

I think ILDP was somewhat in the same camp as RunAhead. A good idea,
but not clearly better than the incumbent OOO.
From: "Andy "Krazy" Glew" on
Anton Ertl wrote:
>> So far as I can tell, Barcelona is a K8 warmed over.
>
> Yes, that was visible to us outsiders around the time of the release
> of Barcelona. IIRC the K-name that we could read about was K8L at
> first, later K10. My guess was that K9 was a cancelled project (or
> maybe skipped because they did not want a project that was a dog:-).

Mitch Alsup was K9.