Prev: Intel to sell processor which is 1000x faster than anything ever seen before !
Next: Which is the most beautiful and memorable hardware structure in a ?CPU?
From: Piotr Wyderski on 1 Apr 2010 06:19 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43599 They claim that "fsin/fcos are known to get wrong results for certain values and their precision is nowhere near acceptable". Put aside the range reduction process, mentioned in the manual, what else is known about their alleged lack of accuracy? I always thought that the floating-point unit always produces IEEE754-compliant results, i.e. accurate to ULP, even for complex functions. Best regards Piotr Wyderski
From: robertwessel2 on 1 Apr 2010 16:41 On Apr 1, 5:19 am, "Piotr Wyderski" <piotr.wyder... (a)mothers.against.spam.gmail.com> wrote:> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43599 > > They claim that "fsin/fcos are known to get wrong results for > certain values and their precision is nowhere near acceptable". > > Put aside the range reduction process, mentioned in the manual, > what else is known about their alleged lack of accuracy? I always > thought that the floating-point unit always produces IEEE754-compliant > results, i.e. accurate to ULP, even for complex functions. The transcendentals are not generally defined to produce a result to within an ULP (that's actually a pretty tall order). Intel used to define the transcendental opcodes to generate about 62 bits of precision* for inputs in the supported range (+/-2**63 radians for the trig functions). So if you're computing the sine of an 80-bit long double, you're short a few bits, but you're going to be within a couple of ULPs for singles and doubles. More recently (Pentium and later), they've changed the definition to be (approximately) within 1.5 ULPs. Range reduction is an issue its pretty easy to do wrong, but frankly the sine of a double containing 2**63 has little actual meaning. And I'm not aware of any widespread errors in implementation, although I know some of those instructions have been the subject of errata over the years. *Actually they defined it as a relative error - the error divided by the result will be less than 2**-62.
From: MitchAlsup on 1 Apr 2010 18:17 On Apr 1, 5:19 am, "Piotr Wyderski" <piotr.wyder... (a)mothers.against.spam.gmail.com> wrote:> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43599 > > They claim that "fsin/fcos are known to get wrong results for > certain values and their precision is nowhere near acceptable". So 1.5 bits ulp is not good enough--golly gee, there are plenty of books around that can allow the complainers to build these functions to get 0.5 bits ulp. The problem is that these might take 5X-10X as much time (I might be being generous, here). The 99%-ers would rather have these functions usefully fast and usefully accurate than the other way around. As it is, the x87s of which I am familiar go to extraordinary lengths to make these functions usefully accurate and usefully fast. Some burder the FPUs with a 66-bit or 67-bit data path so that the intermediate results (used ONLY for transendentals) leads to good final results. In any event, the world is switching to SSE which has none of these instructions. So, to the complainers -- have at it. > Put aside the range reduction process, mentioned in the manual, > what else is known about their alleged lack of accuracy? I always > thought that the floating-point unit always produces IEEE754-compliant > results, i.e. accurate to ULP, even for complex functions. The problem is that 754 does not specify that transendentals have any accuracy at all (SQRT excepted). The FPUs produce 0.5 bits ulp for your 6-bang instructions {+,-,*,/,SQRT,xINT} and provide support for libraries that can deal with 754<->ASCII with the required accuracy. Mitch
From: Terje Mathisen on 2 Apr 2010 07:55 Piotr Wyderski wrote: > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43599 > > They claim that "fsin/fcos are known to get wrong results for > certain values and their precision is nowhere near acceptable". > > Put aside the range reduction process, mentioned in the manual, > what else is known about their alleged lack of accuracy? I always > thought that the floating-point unit always produces IEEE754-compliant > results, i.e. accurate to ULP, even for complex functions. Assuming this is for the x87 fpu, I know that Peter Tang who designed the trancendental functions made sure that they are indeed below one ulp for 80-bit inputs/outputs, as long as the argument is within range. Range reduction for way out of range inputs is more of a philosophical question: Do you simply assume that the input is exact, with infinitely many trailing zero bits, or do you check how many fractional bits are actually present in the input? For a 64-bit input which is greater than about 2^53, I'd be willing to say that the proper result could be anything in the -1.0 to 1.0 range. The only real alternative is to do an exact range reduction, something which isn't really that hard: Calculate 1/pi with about 1080 bits precision, then split the result into an array of 24-bit numbers. Use a bitmask and fp subtraction to split the input value into two halves with about 27 bits in each. Inspect the input exponent to determine which (leading) part of the reciprocal array would generate only integer results, with no fractional part at all: That part of the array can be safely skipped! Next, take the following 3 or 4 array elements and multiply each by the two halves of the input, combining the products into 4 or 5 overlapping sums, each with 52 or less significant bits. At this point I'd merge & split these sums into a final pair, consisting of a single-precision leading fraction and a full double-precision trailing part, giving a 77-bit accurate conversion of the initial input into a circle fraction in the 0 to 1.0 range. At this point the rest of the algorithm is quite simple: Use the leading N bits of the fraction as a further range reduction, then use a simple double precision cheby poly to calculate the required number of terms, and then expand the range back with extended (24+53 bits) table lookups. The only exception is for inputs very close to zero, i.e. without any further range reduction needed: For these inputs it is sufficient to calculate all terms of the polynomial with double precision, except the initial term: This last term is also split into two parts and added in from the bottom up, leading to a final result which is extremely close to 0.5 ulp for all possible inputs. The range reduction stage would take about 10-15 fp operations, but only if the initial input was very large: For smaller inputs the reciprocal calculation could be simplified into just two terms and four multiplications. Terje PS. The algorithm given above would of course also work with 80-bit extended precision inputs, the key is simply to split the number into two parts so that calculations can be performed without loss of precision due to rounding.
From: robertwessel2 on 2 Apr 2010 20:39
On Apr 2, 6:55 am, Terje Mathisen <terje.mathi... (a)tmsw.no> wrote:> Piotr Wyderski wrote: > >http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43599 > > > They claim that "fsin/fcos are known to get wrong results for > > certain values and their precision is nowhere near acceptable". > > > Put aside the range reduction process, mentioned in the manual, > > what else is known about their alleged lack of accuracy? I always > > thought that the floating-point unit always produces IEEE754-compliant > > results, i.e. accurate to ULP, even for complex functions. > > Assuming this is for the x87 fpu, I know that Peter Tang who designed > the trancendental functions made sure that they are indeed below one ulp > for 80-bit inputs/outputs, as long as the argument is within range. > > Range reduction for way out of range inputs is more of a philosophical > question: Do you simply assume that the input is exact, with infinitely > many trailing zero bits, or do you check how many fractional bits are > actually present in the input? > > For a 64-bit input which is greater than about 2^53, I'd be willing to > say that the proper result could be anything in the -1.0 to 1.0 range. > > The only real alternative is to do an exact range reduction, something > which isn't really that hard: > > Calculate 1/pi with about 1080 bits precision, then split the result > into an array of 24-bit numbers. > > Use a bitmask and fp subtraction to split the input value into two > halves with about 27 bits in each. > > Inspect the input exponent to determine which (leading) part of the > reciprocal array would generate only integer results, with no fractional > part at all: That part of the array can be safely skipped! > > Next, take the following 3 or 4 array elements and multiply each by the > two halves of the input, combining the products into 4 or 5 overlapping > sums, each with 52 or less significant bits. > > At this point I'd merge & split these sums into a final pair, consisting > of a single-precision leading fraction and a full double-precision > trailing part, giving a 77-bit accurate conversion of the initial input > into a circle fraction in the 0 to 1.0 range. > > At this point the rest of the algorithm is quite simple: Use the leading > N bits of the fraction as a further range reduction, then use a > simple double precision cheby poly to calculate the required number of > terms, and then expand the range back with extended (24+53 bits) table > lookups. > > The only exception is for inputs very close to zero, i.e. without any > further range reduction needed: For these inputs it is sufficient to > calculate all terms of the polynomial with double precision, except the > initial term: This last term is also split into two parts and added in > from the bottom up, leading to a final result which is extremely close > to 0.5 ulp for all possible inputs. > > The range reduction stage would take about 10-15 fp operations, but only > if the initial input was very large: For smaller inputs the reciprocal > calculation could be simplified into just two terms and four > multiplications. > > Terje > > PS. The algorithm given above would of course also work with 80-bit > extended precision inputs, the key is simply to split the number into > two parts so that calculations can be performed without loss of > precision due to rounding. Ive been doing a little research on this since I posted my response, and there may be a bit more to this issue than I thought. The complaint the Java guys (who seem to be the main source) seem to have is that the range reduction used *internally* by the trig instructions is inadequate. And they may have a point. Intel documents that the trig instruction reduce their inputs by multiples of 2*pi, using a 66 bit value for pi. That seems like it should result in rounding errors accumulating. Now on the flip side, I cant seem to actually demonstrate the effect. The results of sin(1000000000), for example, appear sufficiently consistent between the output from my C compiler, maxima, and my trusty HP calculator. The compiled code is using FSIN. If I get a few minutes, Ill put together a more low level test, but for the time being, Im mostly just confused. Either Im seeing consistent errors across three rather different platforms, or Intel is not doing quite what they documented. |