From: Anton Ertl on 15 Nov 2009 06:49

"Andy \"Krazy\" Glew" writes:
>There were several patents filed, with diagrams that
>looked very much like the ones I drew for the K10 proposal.

That's a different K10 than the Barcelona (marketed as, e.g., Phenom),
right?

>Anyway: if it has an L0 or L1 data cache in the cluster, with or
>without the scheduler, it's my MCMT.

That reminds me of a paper [kim&smith02] that I have always found very
impressive, and I wondered why I have never seen followup papers or an
implementation (but then I don't follow ISCA and Micro proceedings
closely these days); hmm, looking at the research summary at Smith's
home page, he is still working on something like this. Of course that
work was also done in Wisconsin (same University, right?), so it may
have been inspired by your ideas. Do you have any comments on that?

@InProceedings{kim&smith02,
  author =       {Ho-Seop Kim and James E. Smith},
  title =        {An Instruction Set and Microarchitecture for
                  Instruction Level Distributed Processing},
  crossref =     {isca02},
  pages =        {71--81},
  url =          {http://www.ece.wisc.edu/~hskim/papers/kimh_ildp.pdf},
  annote =       {This paper addresses the problems of wide
                  superscalars with communication across the chip and
                  the number of write ports in the register file. The
                  authors propose an architecture (ILDP) with
                  general-purpose registers and with accumulators
                  (with instructions only accessing one accumulator
                  (read and/or write) and one register (read or
                  write)); for the accumulators their death is
                  specified explicitly in the instructions. The
                  microarchitecture builds \emph{strands} from
                  instructions working on an accumulator; a strand
                  starts with an instruction writing to an
                  accumulator without reading from it, continues with
                  instructions reading from (and possibly writing to)
                  the accumulator and ends with an instruction that
                  kills the accumulator. Strands are allocated to one
                  out of eight processing elements (PEs) dynamically
                  (i.e., accumulators are renamed).
                  A PE consists mainly of one ALU data path (but also
                  a copy of the GPRs and an L1 cache). They evaluated
                  this architecture by translating Alpha binaries
                  into it, and comparing their architecture to a
                  4-wide or 8-wide Alpha implementation; their
                  architecture has a lower L1 cache latency, though.
                  The performance of ILDP in clock cycles is
                  competitive, and one can expect faster clocks for
                  ILDP. The paper also presents data for other stuff,
                  e.g. general-purpose register writes, which have to
                  be promoted between strands and which are
                  relatively few.}
}

@Proceedings{isca02,
  title =        "$29^\textit{th}$ Annual International Symposium on Computer Architecture",
  booktitle =    "$29^\textit{th}$ Annual International Symposium on Computer Architecture",
  year =         "2002",
  key =          "ISCA 29",
}

- anton
--
M. Anton Ertl                       Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at  Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

From: "Andy "Krazy" Glew" on 15 Nov 2009 11:54

Anton Ertl wrote:
> "Andy \"Krazy\" Glew" writes:
>> There were several patents filed, with diagrams that
>> looked very much like the ones I drew for the K10 proposal.
>
> That's a different K10 than the Barcelona (marketed as, e.g., Phenom),
> right?

So far as I can tell, Barcelona is a K8 warmed over.

There were several K10s. While I wanted to work on low power when I went to AMD, I was hired to consult on low power and do high end CPU, since the low power project was already rolling and did not need a new chef.

The first K10 that I knew at AMD was a low power part. When that was cancelled I was sent off on my lonesome, then with Mike Haertel, to work on a flagship, out-of-order, aggressive processor, while the original low power team did something else. When that other low-power project was cancelled, that team came over to the nascent K10 that I was working on.

My K10 was MCMT, plus a few other things.
I had actually had to promise Fred Weber that I would NOT do anything advanced for this K10 - no SpMT, just MCMT. But when the other guys came on board, I thought this meant that I could leave the easy stuff for them, while I tried to figure out how to do SpMT and/or any other way of using MCMT to speed up single threads.

Part of my motivation was that I had just attended ISCA 2003 in San Diego, where several of the outstanding problems in big machines had been solved, and I was scared that Intel would come out with something if we did not. In retrospect, that fear was unjustified.

Moral: don't give up power.

I don't know what happened to K10 after I left AMD, but I'm guessing that the names were shifted, a warmed over K8 became K10, and the MCMT K10 became Bulldozer. Slipping quite a few years.

From: Anton Ertl on 15 Nov 2009 12:07

"Andy \"Krazy\" Glew" writes:
>Anton Ertl wrote:
>> "Andy \"Krazy\" Glew" writes:
>>> There were several patents filed, with diagrams that
>>> looked very much like the ones I drew for the K10 proposal.
>>
>> That's a different K10 than the Barcelona (marketed as, e.g., Phenom),
>> right?
>
>So far as I can tell, Barcelona is a K8 warmed over.
....
>I don't know what happened to K10 after I left AMD, but I'm guessing
>that the names were shifted, a warmed over K8 became K10

Yes, that was visible to us outsiders around the time of the release
of Barcelona. IIRC the K-name that we could read about was K8L at
first, later K10. My guess was that K9 was a cancelled project (or
maybe skipped because they did not want a project that was a dog:-).

>and the MCMT
>K10 became Bulldozer. Slipping quite a few years.

In the meantime I have read and at least the block diagrams there look
like the Bobcat core, the low-power part from AMD, will be related to
Bulldozer: a core consists of one cluster (instead of two), and the
FPU may be different, but the overall structure is similar. So maybe
your work also ended up in a low-power part after all.

- anton
--
M. Anton Ertl                       Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at  Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

From: "Andy "Krazy" Glew" on 15 Nov 2009 21:15

Anton Ertl wrote:
> That reminds me of a paper [kim&smith02] that I have always found very
> impressive, and I wondered why I have never seen followup papers or an
> implementation (but then I don't follow ISCA and Micro proceedings
> closely these days); hmm, looking at the research summary at Smith's
> home page, he is still working on something like this. Of course that
> work was also done in Wisconsin (same University, right?), so it may
> have been inspired by your ideas. Do you have any comments on that?
>
> @InProceedings{kim&smith02,
>   author =       {Ho-Seop Kim and James E. Smith},
>   title =        {An Instruction Set and Microarchitecture for
>                   Instruction Level Distributed Processing},
>   crossref =     {isca02},

Same university, but I doubt direct inspiration.

I know both Ho-Seop and Jim Smith. But, they were in EE, while I was in CS under Sohi.

At the time there were many of us playing in the same space. Subrao Palacharla was, I believe, the first to get published in the area, with complexity effective, also with Jim Smith. Stole much of my thunder.

However, Subrao's, and especially Ho-Seop's, work was very much of the flavor of "big out-of-order machines are too complicated, so we have to find a way to approximate them by putting together simpler components". Whereas my work was more of "How do I take the OOO designs that I know are being worked on, and scale them up?"

Also, at the time, I knew about Willamette - indeed, the scheduler structure of queues feeding an RS arose from the debate between OOO (me) and in-order (Sager and Upton) - but the academics at UWisc did not. And I could not tell them. (This led, e.g., to me having to maintain 2 simulators, 1 with public info and 1 with private info.
It also led to me having arguments with Jim Smith, where Jim said that it was useless to scale up window size because branch predictors weren't accurate, whereas I knew about Willamette's predictors, so I knew that much more accuracy was coming. Much of my work in multilevel branch predictors was just designed to shut Jim up, by showing that greater accuracy was possible, so I could just go and work on big windows. Relatedly, I remember trying to persuade Jim that 300 cycle memory latency was interesting, but he couldn't get there. Years later he published "A Day in the Life of a Cache Miss", imho a badly misnamed paper, that started exploring long latencies.)

Anyway, my big problem with ILDP is that it was a microoptimization. To use it, you would basically have to throw away out-of-order CPUs, and start over. And in the first generation, you would just be playing catch up.

I've seen this many times. People think that they can get paid to re-implement an existing CPU better, with a better, newer, microarchitecture. Maybe so - but remember that you are then competing with the design team that is already going over the existing design with a fine tooth comb. I've seen this several times with attempts to do a new low power microarchitecture. I think that Wmt was much like this - out-of-order done anew, rather than extending P6 OOO. The folks who pushed run-ahead were in this camp: they weren't better than OOO, just more efficient. Or so they believed. Because you also have to remember that there is risk in doing anything new - so if the new supposedly more efficient microarchitecture does not quickly make the phase change to proven to be more efficient, it will get canned.

MCMT is more along the other line. It takes the existing pieces of a microarchitecture, and rearranges them.

I think ILDP was somewhat in the same camp as RunAhead. A good idea, but not clearly better than the incumbent OOO.
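
As an illustration of the strand definition in the [kim&smith02] annotation quoted in this thread, here is a minimal Python sketch. It is not taken from the paper or from either poster: the `Insn` record, the `assign_strands` function, and the round-robin choice of PE are assumptions made for illustration. The annotation only says that strands are allocated to one of eight PEs dynamically, without giving the policy.

```python
from dataclasses import dataclass

NUM_PES = 8  # the annotation mentions eight processing elements


@dataclass
class Insn:
    """Hypothetical instruction record, for illustration only."""
    name: str                 # label used in the output
    acc: str                  # architectural accumulator it touches
    reads_acc: bool
    writes_acc: bool
    kills_acc: bool = False   # explicit "accumulator dies here" marker


def assign_strands(insns):
    """Return (name, strand id, PE index) per instruction, in program order."""
    live = {}         # architectural accumulator -> id of its open strand
    next_strand = 0
    out = []
    for insn in insns:
        if insn.writes_acc and not insn.reads_acc:
            # A write without a read starts a new strand, which in effect
            # renames the accumulator.
            live[insn.acc] = next_strand
            next_strand += 1
        # Raises KeyError if an instruction reads an accumulator
        # that has no open strand.
        strand = live[insn.acc]
        # ASSUMPTION: round-robin strand-to-PE mapping; the real
        # allocation policy is not described in the thread.
        out.append((insn.name, strand, strand % NUM_PES))
        if insn.kills_acc:
            # The strand ends with the instruction that kills the accumulator.
            del live[insn.acc]
    return out
```

Feeding it a short sequence that interleaves two accumulators shows each accumulator's instructions staying in one strand until the kill, after which a fresh write to the same accumulator name opens a new strand.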
From: "Andy "Krazy" Glew" on 15 Nov 2009 21:16

Anton Ertl wrote:
>> So far as I can tell, Barcelona is a K8 warmed over.
>
> Yes, that was visible to us outsiders around the time of the release
> of Barcelona. IIRC the K-name that we could read about was K8L at
> first, later K10. My guess was that K9 was a cancelled project (or
> maybe skipped because they did not want a project that was a dog:-).

Mitch Alsup was K9.