From: Holger Sebert on
Hi all,

I was shocked when I read the thread "For binary files use only read() and
write()??" above, in which it was stated that using read()/write() for binary data
is unportable and may lead to undefined behaviour (!!).

I always thought myself to be on the safe side by doing things the following way:

- Use std::ofstream/std::ifstream together with read()/write()

- Only use types of standardized size, i.e. float, double, long, ...
(they _are_ standardized, aren't they?? I'm slowly becoming unsure of almost
everything concerning portable C++ *sigh*)

- Store information about endianness elsewhere and when reading binary data
flip the bytes if necessary.

Where are the pitfalls following this procedure?

How should I do binary i/o instead to achieve portability?

Note: Unfortunately I cannot use the portable boost libraries ... (because they
don't compile on one of my target architectures, what a funny world)

Many thanks in advance,
Holger

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Ulrich Eckhardt on
Holger Sebert wrote:
> I was shocked when I read the thread "For binary files use only read() and
> write()??" above, in which it was stated that using read()/write() for binary
> data is unportable and may lead to undefined behaviour (!!).
>
> I always thought myself to be on the safe side by doing things the
> following way:
>
> - Use std::ofstream/std::ifstream together with read()/write()

You also need the appropriate codecvt facet (from std::locale::classic()) and
the ios_base::binary open flag.

> - Only use types of standardized size, i.e. float, double, long, ...
> (they _are_ standardized, aren't they?? I'm slowly becoming unsure of
> almost everything concerning portable C++ *sigh*)

No. Neither their size nor their layout is standardized. There are a few
minimum requirements but that's all.

Of course, there is also the invalid assumption that CHAR_BIT==8, but while I
have seen such a beast (a DSP from Texas Instruments), I haven't seen the
need to write portable software for it.

Uli



From: Simon Bone on
On Mon, 21 Nov 2005 04:48:03 -0500, Holger Sebert wrote:

> Hi all,
>
> I was shocked when I read the thread "For binary files use only read() and
> write()??" above, in which it was stated that using read()/write() for binary data
> is unportable and may lead to undefined behaviour (!!).
>
> I always thought myself to be on the safe side by doing things the following way:
>
> - Use std::ofstream/std::ifstream together with read()/write()
>

The stream classes do formatting. You would use the streambuf classes if
you don't need that.

> - Only use types of standardized size, i.e. float, double, long, ...
> (they _are_ standardized, aren't they?? I'm slowly becoming unsure of
> almost everything concerning portable C++ *sigh*)
>

C++ standardizes minimum sizes for fundamental types. Implementations are
always free to use larger types if they think it makes sense for their
customers. For example, there is currently some variation in whether long
is 32 bits (the minimum allowed) or 64 bits (the widest native integral
type on many common processors).

In addition to this, there is some variation allowed in the format of the
types. E.g. integral types can be twos-complement, ones-complement or
signed-magnitude. You certainly do not want bit-for-bit copying of one of
these to another, since that would change the value and might even lead to
a trap representation. On the bright side, two's-complement is so common you
can probably rely on it.

> - Store information about endianness elsewhere and when reading binary data
> flip the bytes if necessary.
>

Bear in mind that there are some perverse choices possible. A 4 byte datum
could be written 1234 or 4321 or 2143...
And what about an 8 byte datum?

> Where are the pitfalls following this procedure?
>

You might cover enough for all the platforms you develop and test on, and
then find yourself asked to support a platform where all your assumptions
break down. How likely that is depends on your application.

If or when it happens you can possibly write a special program to convert
the data files you have already created to the new platform's expectations.
This is often hard, and with a legacy application where the original
source has become convoluted through long haphazard maintenance (or
just been lost), it is darn-near impossible. Most of us curse applications
that put us through this, so consider whether it is likely for your
applications.

> How should I do binary i/o instead to achieve portability?
>

At the least, use typedef names for the types you write/read, such as
int8_t, int32_t etc. from the C99 <stdint.h>. This encapsulates your
assumptions about the sizes of the types.

Your approach to include information about endian-ness in the file is OK,
but usually you can define a fixed format for the file. The time spent
waiting for IO to complete is likely to dwarf any time spent marshalling
the data to or from this format. If you are doing that limited formatting
anyway, you might consider going one step further and ditching binary IO
altogether. The advantage of a file format that can be used on any
platform is a big one.

> Note: Unfortunately I cannot use the portable boost libraries ...
> (because they don't compile on one of my target architectures, what a
> funny world)
>

There are many others out there. The boost library is worth looking at to
see how this can be done well. But also look at the serialization section
in the FAQ at http://www.parashift.com/c++-faq-lite/ for more ideas.

HTH

Simon Bone



From: Le Chaud Lapin on

Holger Sebert wrote:
> Where are the pitfalls following this procedure?
>
> How should I do binary i/o instead to achieve portability?

Your views seem good to me.

I implemented a serialization package (which turned out to be oddly
similar to one in Boost) that basically defined Source and Target
repositories for serializing the 13 scalar types in C++ and the 13
vector types. Source and Target have virtual functions that can be
overridden by any derived I/O class. I use this model extensively for
my inter-process distributed communication.

With regard to data format, you're right. It's better to follow the
receiver-makes-right rule, because in the vast majority of distributed
data sharing, the source and target architectures are identical
(PC-to-PC, SPARC-to-SPARC, etc.). For cases where they are not, I
include at the beginning of the transmission stream an object that completely
characterizes the format of the fundamental C++ types on the source
machine, so that any target machine can do a conversion if necessary.
One would be surprised at how compact this object can be made for the
13 fundamental C++ scalar types.

To do the same for files, I would simply put this descriptor object at
the beginning of the file, but I am not doing that yet.

Finally, since any aggregate can be recursively and ultimately
decomposed into scalar objects, it is trivial to serialize complex
types.

Caveats, which you are certainly aware of:

1. Polymorphic objects are intractable
2. If the structure of an object changes, you're in big trouble with all
that old-format data everywhere. Boost gets around this with embedded
versioning. I decided not to take this route, as I felt it would be
pushing the limit on what makes one type distinct from another. And
also, it raises the standard for defining nice clean data types. I
hear a little voice in my head as I write the serialization code..."You
sure you got the structure of this class right? Huh..huh...huh? You'll
suffer if you didn't."

-Le Chaud Lapin-



From: kanze on
Simon Bone wrote:

> On Mon, 21 Nov 2005 04:48:03 -0500, Holger Sebert wrote:


> > I was shocked when I read the thread "For binary files use
> > only read() and write()??" above, in which it was stated that
> > using read()/write() for binary data is unportable and may
> > lead to undefined behaviour (!!).


> > I always thought myself to be on the safe side by doing
> > things the following way:


> > - Use std::ofstream/std::ifstream together with read()/write()


> The stream classes do formatting. You would use the streambuf
> classes if you don't need that.


basic_ios also does error handling. What there is of it,
anyway. Use the streambuf if you don't need that.

The streambuf does character code translation. Don't use
streambuf if you don't want that.

In fact, it's a trade-off, which has to be evaluated each time.


> > - Only use types of standardized size, i.e. float, double,
> > long, ... (they _are_ standardized, aren't they?? I'm
> > slowly becoming unsure of almost everything concerning
> > portable C++ *sigh*)
> >


> C++ standardizes minimum sizes for fundamental types.
> Implementations are always free to use larger types if they
> think it makes sense for their customers. For example, there
> is currently some variation in whether long is 32 bits (the
> minimum allowed) or 64 bits (the widest native integral type
> on many common processors).


There are also machines with 32 bit char's, and at least one
with 9 bit char's and 36 bit 1's complement int's.

Not everybody has to deal with them, of course.


> In addition to this, there is some variation allowed in the
> format of the types. E.g. integral types can be
> twos-complement, ones-complement or signed-magnitude. You
> certainly do not want bit-for-bit copying of one of these to
> another, since that would change the value and might even lead
> to a trap representation. On the bright side, two's-complement
> is so common you can probably rely on it.


Probably. There's always the Unisys 2200's, but that's a pretty
small market.

Floating point is trickier, since the mainframe IBM's also have
a different format (and I've been told that IEEE isn't always
compatible between vendors, at least where NaN's are concerned).


> > - Store information about endianness elsewhere and when reading
> > binary data flip the bytes if necessary.


> Bear in mind that there are some perverse choices possible. A
> 4 byte datum could be written 1234 or 4321 or 2143... And what
> about an 8 byte datum?


I've actually used systems where long's were 3412. The
processor was Intel, and the compiler Microsoft, so I don't
think we can speak of obscure niche players, either.


> > Where are the pitfalls following this procedure?


> You might cover enough for all the platforms you develop and
> test on, and then find yourself asked to support a platform
> where all your assumptions break down. How likely that is
> depends on your application.


> If or when it happens you can possibly write a special program
> to convert the data files you have already created to the new
> platform's expectations. This is often hard, and with a legacy
> application where the original source has become convoluted
> through long haphazard maintenance (or just been lost), it is
> darn-near impossible. Most of us curse applications that put
> us through this, so consider whether it is likely for your
> applications.


The problem isn't so much writing the code to read the format,
once you know it. The problem is finding out what the format
was to begin with. Especially if the data written contained
struct's -- who knows where the original compiler inserted
padding?


> > How should I do binary i/o instead to achieve portability?


> At the least, use typedef names for the types you write/read,
> such as int8_t, int32_t etc. from the C99 <stdint.h>. This
> encapsulates your assumptions about the sizes of the types.


> Your approach to include information about endian-ness in the
> file is OK, but usually you can define a fixed format for the
> file.


I'd say that you have to do it anyway. You have to document the
exact format on disk; otherwise, sooner or later, it will be
unreadable. Given that, you might as well document endian-ness,
and stick to it. (And it is easy to write portably to a given
endianness.)


> The time spent waiting for IO to complete is likely to dwarf
> any time spent marshalling the data to or from this format. If
> you are doing that limited formatting anyway, you might
> consider going one step further and ditching binary IO
> altogether. The advantage of a file format that can be
> used on any platform is a big one.


In theory at least, any file format can be used on any platform.
I'll admit that I've never tested the extreme cases -- writing a
file on a machine with 9 bit char's, then trying to read it on
one with 8 bit char's, for example. But I regularly read and
write binary files which are shared between Sparc's (in both 32
bit and 64 bit modes) and PC's under Linux and Windows, using
the exact same code on every platform (no conditional byte
swapping).

Note that while globally, I agree with your recommendation for
using text whenever possible (it sure makes debugging easier),
it's worth pointing out that you need to define a few details of
the format there as well -- Unix and Windows typically expect
different line separators, and mainframe IBM's still use EBCDIC.


> > Note: Unfortunately I cannot use the portable boost
> > libraries ... (because they don't compile on one of my
> > target architectures, what a funny world)


Join the club:-(.


> There are many others out there. The boost library is worth
> looking at to see how this can be done well.


Sort of. The Boost libraries have different goals than normal
production code, and I would certainly never introduce so much
genericity in something that I knew would only be used for a
short time in one project.


> But also look at the serialization section in the FAQ at
> http://www.parashift.com/c++-faq-lite/ for more ideas.


--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34


