|
From: Nick Maclaren on 29 Mar 2008 05:42

In article <d3e90bcf-a062-49d0-86b5-1e8445212dfb(a)s19g2000prg.googlegroups.com>, David Kanter <dkanter(a)gmail.com> writes:
|>
|> > Don't believe what you read - half of it is propaganda :-)
|>
|> In this case, it would be your post. SMT is a very well established
|> technique, used by most of the high performance CPU vendors. Even the
|> embedded guys use SOEMT quite heavily.

I suggest that you read a thread before allowing your knees to jerk. I specifically said that SOEMT was viable. Dammit, it was obviously a good way to proceed even before the Tera MTA showed that it was! Blindingly obvious is the expression that springs to mind.

|> > I took the trouble of reading Eggers' main paper (and others), and
|> > analysing her calculations. I started out impressed, and got less
|> > so as I proceeded. There was one very significant omission: the
|> > comparison between the throughput of a SMT system and a multi-core
|> > system with the same number of transistors and same amount of
|> > parallelism.
|>
|> Here's a hint Nick, if you think that ALU real estate was a 1970's
|> problem, then transistor count is a 1960's problem.
|>
|> CPUs today use around 400-600M transistors easily, with high-end ones
|> up to 2B. I don't know about you, but somehow I don't think
|> transistor count is a relevant metric today.

Would you like to explain what on earth you are wittering on about? That was precisely the point that I made! And, before you jerk your knees again, read my third point below.

|> SMT is an extremely power efficient technique, and that's a hell of a
|> lot more important than transistors. Every year, we get 2x as many
|> transistors, yet the number of watts remains the same. Which one do
|> you think is the bottleneck?

Sigh.

Firstly, that is wrong. The number of watts fluctuates a bit but, even over the past decade, the trend is up.

Secondly, all of the many chip designers, experts and vendors I have spoken to have referred to process technologies as the key to keeping power under control, followed by changing clock rates and/or turning sections of chip off when unused. SMT was only mentioned before they had tried it :-)

Thirdly, your claim that SMT is an extremely power efficient technique doesn't make it so. You clearly haven't read Eggers' main paper with any care, or haven't understood it. I used the transistor count as a constraint only because that paper did.

Fourthly, you failed to understand the point I made above - replace the constraint on number of transistors by the number of watts, and there was STILL no comparison in those papers with an equivalent multi-core system.

Fifthly, both Sun's and Intel's 'experimental' low-power systems use larger numbers of simpler, non-SMT cores - which is PRECISELY the technique that I was saying should also have been considered. And THAT is the key to why you should update your beliefs!

|> > Also, the scalability was dire, and that was for the
|> > SMT-friendly MIPS chip - as history shows, Intel's attempt to use
|> > it on the x86 was not a great success.
|>
|> Here's a hint, implementations vary. Intel's first implementation
|> sucked. Their second one probably won't.

Here's a technique. Look at what Intel are experimenting with. It's easy enough to find on the Web - if you can be bothered - I knew about it earlier, of course, under NDA.

|> Every processor IBM has designed since the POWER5 has used SMT, and
|> generally, IBM has a tendency to make reasonable choices.

IBM's objective with those designs is not mainstream computing - more like mainframe computing!

|> You should leave armchair architecture to those that actually
|> understand it...

So, are you claiming that you do? :-)

Regards,
Nick Maclaren.
From: Niels Jørgen Kruse on 29 Mar 2008 07:32

Nick Maclaren <nmm1(a)cus.cam.ac.uk> wrote:

> In article
> <d3e90bcf-a062-49d0-86b5-1e8445212dfb(a)s19g2000prg.googlegroups.com>, David
> Kanter <dkanter(a)gmail.com> writes:
> |>
> |> > Don't believe what you read - half of it is propaganda :-)
> |>
> |> In this case, it would be your post. SMT is a very well established
> |> technique, used by most of the high performance CPU vendors. Even the
> |> embedded guys use SOEMT quite heavily.
>
> I suggest that you read a thread before allowing your knees to jerk.
> I specifically said that SOEMT was viable. Dammit, it was obviously
> a good way to proceed even before the Tera MTA showed that it was!
> Blindingly obvious is the expression that springs to mind.

The Tera MTA was not SOEMT, unless you accept every clock as an event.

> |> Every processor IBM has designed since the POWER5 has used SMT, and
> |> generally, IBM has a tendency to make reasonable choices.
>
> IBM's objective with those designs is not mainstream computing - more
> like mainframe computing!

The PS3 and Xbox 360 are hardly mainframes.

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark
From: Nick Maclaren on 29 Mar 2008 08:58

In article <1iekddk.165zhckhowzvuN%nospam(a)ab-katrinedal.dk>, nospam(a)ab-katrinedal.dk (Niels Jørgen Kruse) writes:
|>
|> > I suggest that you read a thread before allowing your knees to jerk.
|> > I specifically said that SOEMT was viable. Dammit, it was obviously
|> > a good way to proceed even before the Tera MTA showed that it was!
|> > Blindingly obvious is the expression that springs to mind.
|>
|> The Tera MTA was not SOEMT, unless you accept every clock as an event.

Well, Tera told me that they didn't actually switch until they needed to load from memory :-)

Anyway, exactly what you call it doesn't matter - the fact is that it demonstrated that SOEMT was viable, where the event is a wait for memory.

|> > |> Every processor IBM has designed since the POWER5 has used SMT, and
|> > |> generally, IBM has a tendency to make reasonable choices.
|> >
|> > IBM's objective with those designs is not mainstream computing - more
|> > like mainframe computing!
|>
|> The PS3 and Xbox 360 are hardly mainframes.

True. But they are also not mainstream, and exactly how the CPUs work isn't public knowledge, as far as I know - though a hell of a lot of people CLAIM to know, based on parroting Web magazine articles.

My last close contacts with IBM were in the POWER4 days, and the gaming CPUs had a lot more differences from the 'mainframe' ones than was commonly believed. Note that was NOT in the CPU cores, but the memory, SMP and RAS control.

One of the things that I was told was being considered was using the 'SMT' aspect to configure some systems into SIMD engines and others into SMP ones. Whether that has been done for the gaming and 'mainframe' CPUs, respectively, I don't know. Del might.

As I said right back when the term "SMT" became trendy, it covered a multitude of sins, and I now notice a trend to use the SMT label for an ever-widening set of techniques.

I don't know what the CPU design of the year 2000 will look like, but it will be called SMT[*]!

[*] With apologies to whoever said it about Fortran :-)

Regards,
Nick Maclaren.
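[Editorial sketch, not part of any post.] The claim that switching threads on a wait for memory is viable can be illustrated with a very rough utilisation model. This is only a sketch under stated assumptions: every thread runs for a fixed number of cycles between memory waits, every wait has the same latency, and a switch costs nothing. The function name and the cycle counts are hypothetical and were not measured on any machine mentioned in the thread.

```python
def core_utilisation(threads, run_cycles, wait_cycles):
    """Fraction of cycles doing useful work under switch-on-memory-wait.

    Assumptions (all hypothetical): each thread alternates between
    run_cycles of work and wait_cycles of memory latency, and switching
    between threads is free.
    """
    # One thread keeps the core busy for run_cycles out of every
    # run_cycles + wait_cycles; extra threads fill the idle gaps
    # until the core is saturated.
    busy_fraction = threads * run_cycles / (run_cycles + wait_cycles)
    return min(1.0, busy_fraction)


if __name__ == "__main__":
    # Illustrative numbers only: 20 cycles of work per 200-cycle wait.
    for n in (1, 2, 4, 8, 11, 16):
        print(n, round(core_utilisation(n, 20, 200), 3))
```

With these made-up numbers a single thread leaves the core idle most of the time, and adding threads helps linearly until the waits are fully covered, after which more threads add nothing. A real design would also pay a switch cost and share cache and memory bandwidth, which this model ignores.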
From: John Dallman on 29 Mar 2008 09:30

In article <d3e90bcf-a062-49d0-86b5-1e8445212dfb(a)s19g2000prg.googlegroups.com>, dkanter(a)gmail.com (David Kanter) wrote:

> Every processor IBM has designed since the POWER5 has used SMT, and
> generally, IBM has a tendency to make reasonable choices.

And they say, up-front, that if you're doing something CPU-limited, you should turn it off. Note to people who don't have experience with AIX-type machines: the PowerPC-based CPUs in the XBox 360 and PS/3, while produced in huge volumes, aren't much used in IBM's own products.

SMT arguments tend to futility because everyone acts as if their kind of workload is "typical", even though they know that it isn't really. SMT seems to work quite well for tasks that can use a lot of threads that don't have huge amounts of work each, and have other caps on performance. File serving, and some kinds of web serving are examples, where disk and/or network speed can also be limiters.

The stuff I work on has quite a bit of multi-threading, but was reliably slower on first-generation HyperThreading, because the time costs of locking weren't made up for by access to increased processing power. Intel were disappointed by this, and hoped that the second-generation implementation in the "Prescott" series of Pentium 4s would convert us. It did not; while it was better, it did not give a significant speed-up.

One explanation came out as "HyperThreading is a way to get higher utilisation from the pool of execution units. However, if the threads are running very similar code, there aren't (many) spare execution units of the type that's the bottleneck, so there's no significant increase in throughput". Me, I reckoned the limits of memory bandwidth - the threads weren't working with the same data - also had something to do with it.

--
John Dallman, jgd(a)cix.co.uk, HTML mail is treated as probable spam.
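[Editorial sketch, not part of any post.] The execution-unit explanation quoted above can be put into a toy model. Everything here is an assumption made for illustration: the unit names and capacities, the per-thread issue rates and instruction mixes, and the simplification that all threads are throttled by the same factor when a unit type is oversubscribed. It is not a description of how HyperThreading actually schedules work.

```python
def smt_throughput(threads, capacity):
    """Total instructions per cycle for threads sharing execution units.

    threads:  list of (ipc_alone, mix) pairs, where mix maps a unit type
              to the fraction of that thread's instructions needing it.
    capacity: unit type -> instructions that unit type accepts per cycle.
    All values are hypothetical; threads are slowed uniformly when any
    unit type is oversubscribed.
    """
    # Demand each unit type would see if every thread ran at full speed.
    demand = {unit: 0.0 for unit in capacity}
    for ipc, mix in threads:
        for unit, share in mix.items():
            demand[unit] += ipc * share
    # The most oversubscribed unit type sets the common slowdown.
    ratios = [capacity[u] / d for u, d in demand.items() if d > 0]
    scale = min([1.0] + ratios)
    return scale * sum(ipc for ipc, _ in threads)


CAPACITY = {"fp": 1.0, "int": 2.0}            # made-up core
FP_HEAVY = (2.0, {"fp": 0.5, "int": 0.5})     # already saturates "fp" alone
INT_ONLY = (1.0, {"int": 1.0})

if __name__ == "__main__":
    print(smt_throughput([FP_HEAVY], CAPACITY))            # one thread
    print(smt_throughput([FP_HEAVY, FP_HEAVY], CAPACITY))  # two similar threads
    print(smt_throughput([FP_HEAVY, INT_ONLY], CAPACITY))  # two dissimilar threads
```

In this model two copies of the same FP-heavy thread give no more throughput than one, because the single "fp" unit was already the bottleneck, while pairing it with an integer-only thread does raise the total. Locking costs and memory bandwidth, both raised in the post, are left out entirely.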
From: Nick Maclaren on 29 Mar 2008 10:00

In article <memo.20080329133047.1696A(a)jgd.compulink.co.uk>, jgd(a)cix.co.uk (John Dallman) writes:
|> In article
|> <d3e90bcf-a062-49d0-86b5-1e8445212dfb(a)s19g2000prg.googlegroups.com>,
|> dkanter(a)gmail.com (David Kanter) wrote:
|>
|> > Every processor IBM has designed since the POWER5 has used SMT, and
|> > generally, IBM has a tendency to make reasonable choices.
|>
|> And they say, up-front, that if you're doing something CPU-limited, you
|> should turn it off. Note to people who don't have experience with
|> AIX-type machines: the PowerPC-based CPUs in the XBox 360 and PS/3,
|> while produced in huge volumes, aren't much used in IBM's own products.

Interesting. It fails to surprise me but, as I said, I have had little direct contact with IBM since the POWER4.

|> One explanation came out as "HyperThreading is a way to get higher
|> utilisation from the pool of execution units. However, if the threads
|> are running very similar code, there aren't (many) spare execution units
|> of the type that's the bottleneck, so there's no significant increase in
|> throughput". Me, I reckoned the limits of memory bandwidth - the threads
|> weren't working with the same data - also had something to do with it.

Precisely. Interestingly, a careful reading of Eggers's papers made that quite clear - which shows how many people actually READ the references that they rely on :-(

The same thing applied to the SMP vector processors - if you were running too many processes, it was BAD NEWS, even if only one was using the vector unit. I managed to double the throughput on one system by reducing the run limit by a factor of three.

That's why I have liked the idea of switching threads on a cache miss for a good many years now - given that memory latency is THE problem, a solution that starts by assuming that seems good to me.

Regards,
Nick Maclaren. |