Prev: What happened to computer architecture (and comp.arch?)
Next: Parallel huffman encoding and decoding algorithm/idea by Skybuck for sale !
From: Mayan Moudgill on 9 Sep 2009 12:28 nmm1 (a)cam.ac.uk wrote:> Your problem is that you aren't aware of the range of technologies > that have been developed and proven to work, and assume that anything > you don't know about doesn't exist. A lot of what I am proposing > dates from a long way back, and lost out to IBM marketing and its > dominance of process and production. Intel? Who they? > Enlighten us. How would you implement software fp efficiently? In particular, what code would you have us generate for for( i = 0; i < N; i++ ) { Z[i] = A[i]*B[i] - C[i]; if( Z[i] < D[i] ) { Z[i] = Z[i] + E[i]; } } (I picked a loop with a multiply, a subtract, an add and a compare ) Feel free to specify non-IEEE formats, including 32bit exponent and 32 bit signed mantissa (i.e. use a 32 bit register for exponent and a 32 bit register for mantissa).
From: nmm1 on 9 Sep 2009 13:28 In article <OMOdne9eFPwDSjrXnZ2dnUVZ_tGdnZ2d (a)bestweb.net>,Mayan Moudgill <mayan (a)bestweb.net> wrote:> >> Your problem is that you aren't aware of the range of technologies >> that have been developed and proven to work, and assume that anything >> you don't know about doesn't exist. A lot of what I am proposing >> dates from a long way back, and lost out to IBM marketing and its >> dominance of process and production. Intel? Who they? >> >Enlighten us. How would you implement software fp efficiently? In >particular, what code would you have us generate for > > for( i = 0; i < N; i++ ) { > Z[i] = A[i]*B[i] - C[i]; > if( Z[i] < D[i] ) { > Z[i] = Z[i] + E[i]; > } > } >(I picked a loop with a multiply, a subtract, an add and a compare ) > >Feel free to specify non-IEEE formats, including 32bit exponent and 32 >bit signed mantissa (i.e. use a 32 bit register for exponent and a 32 >bit register for mantissa). I am beginning to wonder whether you are posting seriously, or beginning to troll. In the hope that you are not, I shall respond. Firstly, I never denied that there was SOME loss of efficiency, but said that the losses were offset by gains with code that is currently a problem and need not be. You have chosen an example that provides minimal opportunities for the gains. Secondly, by breaking the floating-point operations down into multiple stages, you provide much more opportunity for the compiler to optimise for multiple issue, pipelining, parallelism and even VLIW. Again, you have chosen an example that doesn't expose that serious problem with existing architectures. Thirdly, the only fundamental difference between a hardware and software approach is that the former is invoked by a single ABI instruction and the latter is not. You may not remember the days when floating-point was often implemented in firmware, but those systems had fast floating-point. The last is perhaps the key. You optimise software floating-point in almost exactly the same way that you optimise firmware floating- point, which is very similar (and often identical) to the way that you optimise multi-stage hardware floating-point. The ONLY real difference is that the granularity is finer, and you are exposing the stages of the floating-point operation. Think of my proposal as firmware revisited. You may understand it better. And, if you think that firmware was necessarily slow, you haven't been keeping up. Even Seymour Cray couldn't get more than a factor of about two, and his cavalier attitude to accuracy accounted for as much of his speed gains as his use of hardware. Before you can ask how I can optimise it, you need to explain why doing that is necessarily much slower than hiding the stages behind a single instruction. Because those stages assuredly exist, and are necessarily serialised, in most floating-point unit designs. Though a lot of the complexity of floating-point hardware is to try to hide that. Take a look at the number of multiple-issue and VLIW designs that had and have restrictions on the number of instructions that can be floating-point, and the mind-blowing length of some of the pipelines. It wasn't and isn't easy to get those to run at full speed, when faced with an arbitrary set of code. No, my proposal doesn't solve that. If it did, it would have been adopted. But that problem is the reason that exposing the underlying stages is feasible without as much loss of efficiency as you think. And remember that it isn't rare for an HPC system to spend 2-5% of its 'CPU time' actually executing arithmetic instructions. Yes, really. So there is a hell of a lot of slack. Now, if you start to bring in the DSP and similar designs where they ARE dominated by the arithmetic unit, you will first have to explain why they need floating-point in the place. And remember that, while I did very little of it, I came in at the end of the period when using fixed-point for heavyweight numerics was common. Regards, Nick Maclaren.
From: Terje Mathisen on 9 Sep 2009 15:57 Mayan Moudgill wrote: > nmm1 (a)cam.ac.uk wrote:> > >> Your problem is that you aren't aware of the range of technologies >> that have been developed and proven to work, and assume that anything >> you don't know about doesn't exist. A lot of what I am proposing >> dates from a long way back, and lost out to IBM marketing and its >> dominance of process and production. Intel? Who they? >> > Enlighten us. How would you implement software fp efficiently? In > particular, what code would you have us generate for > > for( i = 0; i < N; i++ ) { > Z[i] = A[i]*B[i] - C[i]; > if( Z[i] < D[i] ) { > Z[i] = Z[i] + E[i]; > } > } > (I picked a loop with a multiply, a subtract, an add and a compare ) > > Feel free to specify non-IEEE formats, including 32bit exponent and 32 > bit signed mantissa (i.e. use a 32 bit register for exponent and a 32 > bit register for mantissa). With sufficient integer registers available, the code above can run very fast indeed, since each iteration is independent of the previous. I.e. you (or I) can pipeline it as much as the register set allows. Since Nick specifies that the critical building blocks should be exposed, I'll assume I have a normalize which returns the number of bits shifted (positive or negative): a_mant = a[i].mant; a_exp = a[i].exp; b_mant = b[i].mant; b_exp = b[i].exp; c_mant = c[i].mant; c_exp = c[i].exp; d_mant = d[i].mant; d_exp = d[i].exp; e_mant = e[i].mant; e_exp = e[i].exp; // FMUL, double-wide non-normalized result: mant2 m = (mant2) a_mant * (mant2) b_mant; e = a_exp + b_exp; // FSUB expdiff = e - c_exp; m_adj = max(0, -expdiff); // Shifts larger than the mantissa length will always return 0 m >>= m_adj; e += m_adj; c_adj = max(0, expdiff); c_mant >>= c_adj; m -= c_mant; // Assume the compare will be OK, so calculate Z+E // FADD expdiff = e - e_exp; m_adj = max(0, -expdiff); zm = m >> m_adj; ze = e + m_adj; e_adj = max(0, expdiff); e_mant >>= e_adj; zm += e_mant; // Normalize Z (e,m) in order to do the compare: e -= normalize(m); larger = (e > d_exp) || ((e == d_exp) && (m > d_mant); cmov(m, zm, larger); cmov(e, ze, larger); etc., for a total of ~30 operations. The key is that the latency of a single iteration is pretty bad, and there's no attempt on IEEE rounding, and my sign handling is non-existent (add a couple more building block opcodes), but with enough execution units an arbitrary number of these operations can run in parallel. With Nick's 1K cores, each with 32 or 64 registers, it seems like you could get ~32 results every cycle, with a single execution unit in each core. OTOH, having just slightly more complicated building blocks would double this, for a real win in power efficiency. I.e. I don't really believe Nick is right: FP math has such large usage that it does make sense to dedicate specialized hardware for it, although many (most these days?) can make do with well below IEEE float standard implementations: I'm thinking of all the flops taken by the world's GPU population. Terje -- - <Terje.Mathisen at tmsw.no> "almost all programming can be viewed as an exercise in caching"
From: nmm1 on 9 Sep 2009 17:00 In article <YIydnbaWAq0glTXXnZ2dnUVZ8kednZ2d (a)lyse.net>,Terje Mathisen <Terje.Mathisen (a)tmsw.no> wrote:> >etc., for a total of ~30 operations. > >The key is that the latency of a single iteration is pretty bad, and >there's no attempt on IEEE rounding, and my sign handling is >non-existent (add a couple more building block opcodes), but with enough >execution units an arbitrary number of these operations can run in parallel. Yes, but you are using existing software primitives, and that is NOT what I am proposing! >With Nick's 1K cores, each with 32 or 64 registers, it seems like you >could get ~32 results every cycle, with a single execution unit in each >core. > >OTOH, having just slightly more complicated building blocks would double >this, for a real win in power efficiency. And THAT is far closer to what I am proposing. My estimate is that the right scale of division is to split floating-point operations into 5-10 steps, many of which would be very useful for implementing numeric functions and/or extended precision integers. >I.e. I don't really believe Nick is right: > >FP math has such large usage that it does make sense to dedicate >specialized hardware for it, although many (most these days?) can make >do with well below IEEE float standard implementations: I'm thinking of >all the flops taken by the world's GPU population. Yes, but remember that those don't need floating-point, in the first place! Almost all GPU algorithms are fixed-point ones, implemented in floating-point because so few people understand fixed-point numerics nowadays. Realistically, it's that aspect that kills my idea, not the actual architectural ones. It doesn't matter how good the engineering is if the customers won't adopt it. Regards, Nick Maclaren.
From: Terje Mathisen on 10 Sep 2009 06:16
nmm1 (a)cam.ac.uk wrote:>> FP math has such large usage that it does make sense to dedicate >> specialized hardware for it, although many (most these days?) can make >> do with well below IEEE float standard implementations: I'm thinking of >> all the flops taken by the world's GPU population. > > Yes, but remember that those don't need floating-point, in the first > place! Almost all GPU algorithms are fixed-point ones, implemented Not any more. GPUs switched to using fp internally _before_ they ever exposed this to the end users/developers, simply because it made sense for the GPU vendors, and they already had working fixed-point firmware for previous generations of cards. > in floating-point because so few people understand fixed-point > numerics nowadays. All the texture access&filtering, which is responsible for the majority of all the flops, is defined using texture quads surrounding the fp coordinates of the sampling point. Using this approach avoids the need for a division per pixel (or small group of pixels, as in the original Quake), so it is a real win as long as you have enough hardware to handle all four coordinates simultaneously. Yes, this _could_ have been handled with fixed-point, but since you need 10 bits of fractional precision, you'd still end up with 32-bit chunks, and you'd have to be very careful to avoid overflows. Another place where fp really helps is when doing gamma-corrected blending/sampling. I'm guessing anisotropic filtering is in the same group. > Realistically, it's that aspect that kills my idea, not the actual > architectural ones. It doesn't matter how good the engineering is > if the customers won't adopt it. I believe you could indeed make a 'multiply_setup/mul_core1/mul_core2/mul_normalize' perform close to dedicated hw, but you would have to make sure that _nobody_ except the compiler writers ever needed to be exposed to it. Terje -- - <Terje.Mathisen at tmsw.no> "almost all programming can be viewed as an exercise in caching" |