From: Pete Becker on
On 2010-06-25 03:11:54 -0400, Andrew said:

> I understand and normally I trust GCC and don't trust VS but this time
> VS has the correct behaviour. I diff'd the output built via VS against
> the output built with GCC and selected a few lines that were
> different.

Well, okay, VC produced the results you expected. But was it reasonable
to expect what you expected? <g>

The problem is this: suppose you're using pencil and paper, and
recording values to four significant digits; you "read" the two values
0.33331 and 0.333315, and record them with four significant digits:

0.3333
0.3333

Now, you want to "write" the values. But they're identical, and they
have fewer digits than the originals, so it's not possible to recover
the original values. The best you can do is represent each as 0.3333,
which is different from both of the originals. And the difference
between the two original values has been lost.
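
In code, the same loss shows up if you print with only four significant
digits (a minimal sketch):

  #include <cstdio>

  int main()
  {
      // Two distinct values recorded with only four significant digits...
      std::printf("%.4g\n", 0.33331);   // prints 0.3333
      std::printf("%.4g\n", 0.333315);  // prints 0.3333
      // ...come out identical, so the original difference is gone.
  }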

Floating-point conversions run into the same problem: input values get
converted to a nearby representable floating-point value; this usually
loses some information from the input, and you'll only rarely get
identical output.

On the other hand, going the other way in general ought to work (C99
requires it): you should be able to write out a floating-point value as
text, read the text, and get back the original value. That's because
the external decimal representation has more precision than the
internal binary representation, so you don't lose information.
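
In C++ terms that means writing the value with 17 significant digits,
which is enough to distinguish any two IEEE-754 doubles (a minimal
sketch, assuming a conforming printf and strtod):

  #include <cstdio>
  #include <cstdlib>

  int main()
  {
      double original = -937566.2364699869;  // converted once by the compiler
      char text[64];
      std::sprintf(text, "%.17g", original); // write with 17 significant digits
      double back = std::strtod(text, 0);    // read the text back in
      std::printf("%s -> %s\n", text,
                  back == original ? "round-trips exactly" : "differs");
  }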

--
Pete
Roundhouse Consulting, Ltd. (www.versatilecoding.com)
Author of "The Standard C++ Library Extensions: a Tutorial and Reference"
(www.petebecker.com/tr1book)


[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Andrew on
On 25 June, 13:11, SG <s.gesem...(a)gmail.com> wrote:
> > Here are some examples of the differences I found:
>
> > VS                    GCC
> > -937566.2364699869    -937566.2364699868

> > Checking against the original input file, VS is the one that gets it
> > right. Can anyone comment on why the difference with GCC please?
>
> 937566.2364699869 =
> 11100100111001011110.001111001000100101001100000011000 01110...
>
> The closest representable number with an IEEE-754 64bit float is
>
> 11100100111001011110.001111001000100101001100000011000 =
> 937566.2364699868 485...
>
> The closest representable 16-digit decimal number is
>
> 937566.2364699868
>
> So, the program you compiled with GCC did a good job.

I'm not convinced.

> If you're interested in a lossless double->string->double roundtrip
> you should use 17 decimal digits and high quality conversions.

See my sample program in this thread that uses the value
-937566.2364699869. When GCC takes that string, converts it to a
double, then converts the double back to a string, it gives
-937566.2364699868. Adding an extra digit of precision gives
-937566.23646998685. AFAICS this means it is doing the rounding
incorrectly.
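
The round trip I am describing boils down to something like this (a
stripped-down sketch, not the real program):

  #include <cstdio>
  #include <cstdlib>

  int main()
  {
      double d = std::strtod("-937566.2364699869", 0);
      std::printf("%.16g\n", d);  // gcc/glibc: -937566.2364699868
      std::printf("%.17g\n", d);  // gcc/glibc: -937566.23646998685
  }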

The real program (whose source code I cannot reproduce here) was
written very quickly under quite a bit of pressure (and not by me
BTW). There are a few corner cases where the memory demands are huge
which is why the input numbers are converted from strings to doubles.
My approach would have been to keep them as strings; then there would
be no conversion/precision problems. But this would have forced the
program to solve the lack-of-memory problem. That can be done in
various ways, but it would have made the program much more complex
just to cover those few corner cases. We don't know yet how often
these corner cases will come up in practice. If it proves to be a
bigger problem than we thought then my suggestion is to build the
program in 64-bit mode. This would allow much more memory to be
addressed and would keep the program relatively simple.

Regards,

Andrew Marlow



From: Rui Maciel on
George Neuner wrote:

> IEEE double precision isn't 16 significant figures ... it's actually
> about 15.9 on average - some code that requires 16 works and some
> doesn't so it's best never to test the limits

Sorry for nit-picking but that isn't exactly true. According to IEEE 754, the single precision
floating point data type has a (23+1)-bit mantissa while the double precision type has a (52+1)-bit
mantissa. That means the maximum number of significant decimal digits that an IEEE 754-compliant
floating point representation can carry is:

for single precision: log_10(2^(23+1)) = 7.2247 => 8 significant digits
for double precision: log_10(2^(52+1)) = 15.955 => 16 significant digits

There is no such thing as 15.9 digits. We either have 15 digits or 16 digits. If we only handle 15
significant digits with an IEEE 754 double precision data type then we are needlessly failing to take
full advantage of double precision. If we opt to handle 16 significant digits then we take full
advantage of the data type's precision, at the expense of a precision loss in those cases where an
exact conversion from a 16-digit decimal representation to a double precision floating point
representation would require more than 53 bits.
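
The same arithmetic in code (just an illustrative sketch; digits10 is
the related constant the standard library reports):

  #include <cmath>
  #include <cstdio>
  #include <limits>

  int main()
  {
      // Decimal digits carried by the 24-bit and 53-bit significands.
      std::printf("single: %.4f\n", std::log10(std::pow(2.0, 24)));  // 7.2247
      std::printf("double: %.4f\n", std::log10(std::pow(2.0, 53)));  // 15.9546
      // Digits guaranteed to survive a decimal -> double -> decimal round trip.
      std::printf("digits10: %d\n", std::numeric_limits<double>::digits10);  // 15
  }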

But once again, this is just nit-picking.


Rui Maciel


From: Francis Glassborow on
Andrew wrote:
>> you should use 17 decimal digits and high quality conversions.
>
> See my sample program in this thread that uses the value
> -937566.2364699869. When GCC takes that string, converts it to a
> double, then converts the double back to a string, it gives
> -937566.2364699868. Adding an extra digit of precision gives
> -937566.23646998685. AFAICS this means it is doing the rounding
> incorrectly.
>
By what definition of incorrectly? If a decimal ends in a 5, and nothing
is known about whether that five is the result of rounding up or down,
then there is no unique correct solution to rounding (despite what has
been taught in many schools). Consistently rounding up or consistently
rounding down will introduce a systematic error. There are three
strategies that are statistically valid:

alternately round up and down
randomly (and remember that humans are bad random number generators
despite what individuals may think)
always to an even digit (so .15 rounds to .2 as does .25)

The point of such strategies is to ensure that such things as arithmetic
means are not adversely affected by rounding of terminal 5s.
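
The third strategy can be sketched in integer terms (a toy illustration
only, rounding a non-negative value to the nearest multiple of ten):

  #include <cstdio>

  // Round a non-negative integer to the nearest multiple of ten,
  // sending an exact terminal 5 to the even neighbour.
  long round_half_to_even(long n)
  {
      long q = n / 10, r = n % 10;
      if (r > 5 || (r == 5 && q % 2 != 0))
          ++q;  // past halfway, or a tie next to an odd quotient
      return q * 10;
  }

  int main()
  {
      std::printf("%ld %ld %ld\n",
                  round_half_to_even(15),   // 20 (tie, rounds up to even)
                  round_half_to_even(25),   // 20 (tie, stays at even)
                  round_half_to_even(35));  // 40 (tie, rounds up to even)
  }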

Now rounding binary numbers is problematic if all you know is that the
terminal bit is a 1: should you now round up or down?

The number you quote above is not exactly representable in binary so the
internal representation will not match the input string in value. It
seems that the conversion routines result in a value that is slightly
less. Now if that slightly smaller value has a terminal 1, it is not
unreasonable that the conversion the other way again generates something
that is less, nor is it unreasonable that if you push for one more digit
of precision the result in denary includes a terminal 5 (which might be
a rounding up from '46').

My point is that you cannot deduce whether a C++ implementation is
correctly rounding without a great deal more information than the
results of a single round trip conversion.

Mathematically, as soon as you apply some mathematical operation to a
floating point value in a limited precision context the last digit (or
bit or ..) becomes unreliable.


From: Jean-Marc Bourguet on
Andrew <marlow.andrew(a)googlemail.com> writes:

> See my sample program in this thread that uses the value
> -937566.2364699869. When GCC takes that string, converts it to a
> double, then converts the double back to a string, it gives
> -937566.2364699868. Adding an extra digit of precision gives
> -937566.23646998685. AFAICS this means it is doing the rounding
> incorrectly.

> The real program (whose source code I cannot reproduce here) was
> written very quickly under quite a bit of pressure (and not by me
> BTW). There are a few corner cases where the memory demands are huge
> which is why the input numbers are converted from strings to doubles.
> My approach would have been to keep them as strings then there would
> be no conversion/precision problres. But this would have forced the
> program to solve the lack of memory problem. This can be done with
> various ways and means but would have made the program much more
> complex just to cover those few corner cases. We don't know yet how
> often these corner cases will come up in practice. If it proves to be
> a bigger problem than we thought then my suggestion is to build the
> program in 64 bit mode. This would allow much more memory to be
> addressed and would keep the program relatively simple.

If you expect to keep 16 decimal digits with a double, you'll have a
problem whatever compiler and libraries you are using. Try putting
-937566.2364699868 instead of -937566.2364699869 as input in your
program, and you'll get the good result with gcc and the bad one with
VC++. Both numbers will be read to the same FP representation and so
will get the same string representation after writing.
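
That experiment boils down to something like this (an illustrative
sketch, assuming correctly rounded conversions):

  #include <cstdio>
  #include <cstdlib>

  int main()
  {
      // Both decimal strings have the same nearest double, so once read
      // the program can no longer tell them apart.
      double a = std::strtod("-937566.2364699869", 0);
      double b = std::strtod("-937566.2364699868", 0);
      std::printf("%s\n", a == b ? "same double" : "different doubles");
      std::printf("%.16g\n%.16g\n", a, b);  // identical output for both
  }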

That number is 937566.2364699868 485... Rounding it correctly to 10
fractional digits gives 937566.2364699868. Yes, if you first round it
to 11 fractional digits you get 937566.23646998685, and rounding that
to 10 digits you get 937566.2364699869. This difference is why double
rounding should be avoided.

Yours,

--
Jean-Marc
