Designing fgetline - a perspective [General Programming]


From: Richard Harter on 11 Oct 2007 19:05

The following is an exercise in thinking out a design.
Comments and thoughts are welcome.

Writing a routine to read lines from a file is one of
those little design tasks that seems to create disagreement and
confusion. The routine has various names; in this article we
will call it fgetline. Here is a take on how to do it.

What we want is a routine that keeps reading and returning lines
from a file. A simple prototype for this function is

char * fgetline(FILE *);

There are some gotchas with this prototype - well, not really
gotchas, rather little bits of awkwardness and inefficiency. One
is that we are throwing away information that (usually) has to be
recomputed, namely the length of the line and, as we shall see,
status information. If we want to add the information to our
prototype there are several ways to do it - it can be
passed back through the calling sequence or it can be returned by
the function. (We could also pass it to a global - ugh, don't do
it.) My vote is to have it all returned by the function, which
means we create a struct (record) that looks like:

struct fgetline_info {
    char   *line;      /* the line read, terminated by 0 */
    size_t  len;       /* length of the line */
    int     status;    /* how the read went */
};

There are various choices that could be made in the types -
season to taste. (A pointer to the struct could be passed in as
an argument if need be.)

A second issue is that there is no limit to how large the line
can be. A plausible way to provide a limit is to pass one
in as an argument. If we do, our prototype looks like this:

struct fgetline_info fgetline(FILE * fptr, size_t maxsize);

This version has a little problem - where did the storage for the
line come from? One way to deal with that is for fgetline to
allocate storage for the line and for the calling routine to free
it when it is done with the line.

Okay, so how do we do this in fgetline? Well, there is a very
simple way to do it, one that has been reinvented time and time
again. We start out allocating a standard amount of space
(numbers like 32, 64, and 80 bytes are typical) for a line.
We read from the file until we either hit an EOL (end of line
marker), fail to read, or have read as many characters as
were allocated. If we haven't hit the EOL we reallocate the
array, commonly by doubling the size, and do another read.
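
To make the doubling idea concrete, here is a minimal sketch, assuming
the simple one-argument prototype. The name fgetline_simple and the
starting size of 64 are illustrative choices, not part of any library.
The caller frees each returned line.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical name. Returns a malloc'd line (EOL included if present),
   or NULL if nothing could be read or an allocation fails. */
char *fgetline_simple(FILE *fp)
{
    size_t size = 64;               /* illustrative starting size */
    size_t len = 0;
    char *line;
    int c;

    assert(fp != NULL);             /* precondition for this sketch */
    line = malloc(size);
    if (line == NULL) return NULL;

    while ((c = getc(fp)) != EOF) {
        if (len + 1 >= size) {      /* keep room for c and the final 0 */
            char *tmp = realloc(line, size * 2);
            if (tmp == NULL) { free(line); return NULL; }
            line = tmp;
            size *= 2;
        }
        line[len++] = (char)c;
        if (c == '\n') break;       /* EOL reached */
    }
    if (len == 0) { free(line); return NULL; }   /* immediate EOF or error */
    line[len] = '\0';
    return line;
}
```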

This works, but it is inefficient - we have to call malloc and
free for every line. One way to get around this is to use a
technique I call highwater buffering. In the file containing the
code for fgetline we have two file-scope variables:

static char * buffer = 0;
static size_t size = 0;

In fgetline we have some initialization code that allocates
buffer space when the buffer size is zero. Thereafter we use our
little doubling trick. The advantage of this scheme is that we
have at most a small number of calls to malloc (the number of
doublings needed to get to the largest line) and no calls to
free.
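
A minimal sketch of highwater buffering, assuming the same simple
one-argument prototype. The name fgetline_highwater and the initial
size of 64 are illustrative. Note that the pointer returned is the
shared buffer itself.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

static char  *buffer = 0;   /* shared by every call, never freed */
static size_t size   = 0;

/* Hypothetical name. Returns a pointer to the shared buffer, or NULL if
   nothing could be read or an allocation fails. The contents are
   overwritten by the next call, whatever FILE that call is given. */
char *fgetline_highwater(FILE *fp)
{
    size_t len = 0;
    int c;

    assert(fp != NULL);                 /* precondition for this sketch */
    if (size == 0) {                    /* first call: initial allocation */
        buffer = malloc(64);
        if (buffer == NULL) return NULL;
        size = 64;
    }
    while ((c = getc(fp)) != EOF) {
        if (len + 1 >= size) {          /* double; keep room for the 0 */
            char *tmp = realloc(buffer, size * 2);
            if (tmp == NULL) return NULL;
            buffer = tmp;
            size *= 2;
        }
        buffer[len++] = (char)c;
        if (c == '\n') break;
    }
    if (len == 0) return NULL;          /* immediate EOF or read error */
    buffer[len] = '\0';
    return buffer;
}
```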

The very real disadvantage of this scheme is that it produces a
dirty copy of the line. By a dirty copy, I mean one that can be
scribbled on elsewhere before we are done with it. This will
happen whenever fgetline is called anywhere else with any FILE
argument before our next read. This is double plus ungood.

Is there a way to get the efficiency of the highwater scheme
without being dirty? Yes; the trick is that the calling program
provides the initial buffer. If we add the buffer information to
the prototype we get:

struct fgetline_info fgetline(FILE * fptr,
                              char * buffer,
                              size_t size,
                              size_t maxsize);

There are three kinds of copies of the line that we might get
from fgetline - clean, transient, and dirty. A clean copy is one
that has no encumbrances - it lasts until it is specifically
deleted. A transient copy is one that lasts until the next
call to fgetline to get another line from the current file.
(There may be another file being read elsewhere.)

What kind of copy should fgetline provide, clean or transient?
There are arguments for each choice. Clean copies are more
expensive but safer. Transient copies are (usually) cheaper. In
questions like this there is much to be said for the design
principle that:

When there is a significant choice as to the kind of
output being delivered the library routine should let
the user make the choice rather than dictating the
choice unless offering the choice unduly complicates
usage.

So how do we provide a choice? That is simple; if the buffer
pointer in the calling sequence is NULL, fgetline must return a
clean copy; otherwise fgetline MAY return a transient copy. I
say "may" because it might have to increase the buffer size.
If it does the calling routine will have to check whether there
was a size increase and, if so, will have to free the returned
buffer. (The fgetline implementation can't realloc the passed
in buffer; it will have to do a malloc for the first resizing.)

It's probably best to have a value in the status field that
signifies the size has been increased; call that value
fg_increased and a normal read fg_normal. Then the usage for
transient copies might look like this (first cut):

...
struct fgetline_info fg;
char buffer[80];
int done = 0;
....
for(;!done;) {
    ...
    fg = fgetline(fptr,buffer,80,FG_MAXSIZE);
    if (fg.status == fg_normal || fg.status == fg_increased) {
        /* do stuff */
        if (fg.status == fg_increased) free(fg.line);
    } else done = 1;
}

If we want clean copies the corresponding loop body would be

fg = fgetline(fptr,0,0,FG_MAXSIZE);
if (fg.status == fg_normal) {
    /* do stuff */
    free(fg.line);
} else done = 1;
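
One way the loops above might be served, as a rough sketch rather than
a finished implementation. The status names fg_eof and fg_error are
assumptions (only fg_normal and fg_increased are named above), maxsize
is taken to bound the buffer size including the terminating 0, and on
any failure the partial line is simply discarded to keep the sketch
short. An EOF without an EOL is reported as a normal read here.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* fg_eof and fg_error are assumed names for this sketch. */
enum { fg_normal, fg_increased, fg_eof, fg_error };

struct fgetline_info {
    char  *line;
    size_t len;
    int    status;
};

struct fgetline_info fgetline(FILE *fptr, char *buffer,
                              size_t size, size_t maxsize)
{
    struct fgetline_info fg = { NULL, 0, fg_error };
    char *line = buffer;
    size_t len = 0;
    int owned = 0;          /* nonzero once line points at our own malloc */
    int c;

    /* Argument checks: buffer and size must both be zero or both nonzero. */
    if (fptr == NULL || maxsize == 0 || size > maxsize
        || (buffer == NULL) != (size == 0))
        return fg;

    while ((c = getc(fptr)) != EOF) {
        if (len + 1 >= size) {                  /* no room for c plus the 0 */
            size_t newsize = size ? size * 2 : 64;
            char *tmp;
            if (newsize > maxsize) newsize = maxsize;
            if (newsize < len + 2) {            /* line exceeds maxsize */
                ungetc(c, fptr);
                if (owned) free(line);
                return fg;
            }
            /* The caller's buffer cannot be realloc'd: malloc and copy. */
            tmp = owned ? realloc(line, newsize) : malloc(newsize);
            if (tmp == NULL) {
                if (owned) free(line);
                return fg;
            }
            if (!owned && len > 0) memcpy(tmp, line, len);
            line = tmp;
            size = newsize;
            owned = 1;
        }
        assert(len + 1 < size);                 /* room for c and the 0 */
        line[len++] = (char)c;
        if (c == '\n') break;
    }
    if (len == 0) {                             /* nothing was read */
        fg.status = ferror(fptr) ? fg_error : fg_eof;
        return fg;
    }
    line[len] = '\0';
    fg.line = line;
    fg.len = len;
    /* A clean copy (null buffer) is always reported as fg_normal. */
    fg.status = (owned && buffer != NULL) ? fg_increased : fg_normal;
    return fg;
}
```

With a null buffer every successful read returns malloc'd storage
under fg_normal, so the clean-copy loop frees unconditionally; with a
supplied buffer only fg_increased signals storage to free.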

What sort of return values should be available in the status
field? These are the successful ones that occur to me:

end_of_file        The last read (if any) found an EOL
                   marker. The current read finds an
                   immediate EOF.

no_increase        Normal read - either buffer was null or
                   else it was not increased.

increase           Normal read - the buffer size has been
                   increased.

abn_no_increase    Abnormal read - an EOF was found without
                   an EOL. Buffer was not increased.

abn_increase       Abnormal read - an EOF was found without
                   an EOL. Buffer was increased.

In addition there are numerous kinds of errors that can occur.
Calling sequence argument errors include:

no_file            The file pointer is null.
bad_buffer         Exactly one of buffer and size is zero.
bad_size           Size is 0 or is greater than maxsize.
bad_maxsize        Maxsize is 0.

Then there are the memory allocation failures:

bad_allocate       Malloc or realloc failure.
big_line           Line length is greater than maxsize.

When there is a memory allocation failure the line and len fields
hold what has been successfully read.

There are various conventions one could use for assigning code
values for the possible return values. My view is that there
should be a bit that is set only if there was an error, a bit
that is set only if there was a memory allocation error, a bit
that is set only if there was a buffer increase, and a bit that
is set only if an EOL occurred.

The point of doing this is that we can have simple tests to check
whether an error occurred, whether some space has to be freed,
and whether the file read is complete.
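
A sketch of one possible encoding along these lines. The bit names,
the numeric values, and the test macros are all assumptions made for
illustration; the upper bits merely keep the individual codes
distinct.

```c
/* Flag bits; names and values are illustrative assumptions. */
enum {
    FG_ERR_BIT   = 0x01,    /* set only if there was an error */
    FG_ALLOC_BIT = 0x02,    /* set only on a memory allocation error */
    FG_INCR_BIT  = 0x04,    /* set only if the buffer was increased */
    FG_EOL_BIT   = 0x08     /* set only if an EOL was found */
};

/* Status codes: flag bits in the low nibble, a distinguishing tag above. */
enum {
    fg_end_of_file     = 0x10,
    fg_no_increase     = 0x20 | FG_EOL_BIT,
    fg_increase        = 0x30 | FG_EOL_BIT | FG_INCR_BIT,
    fg_abn_no_increase = 0x40,
    fg_abn_increase    = 0x50 | FG_INCR_BIT,
    fg_no_file         = 0x60 | FG_ERR_BIT,
    fg_bad_buffer      = 0x70 | FG_ERR_BIT,
    fg_bad_size        = 0x80 | FG_ERR_BIT,
    fg_bad_maxsize     = 0x90 | FG_ERR_BIT,
    fg_bad_allocate    = 0xA0 | FG_ERR_BIT | FG_ALLOC_BIT,
    fg_big_line        = 0xB0 | FG_ERR_BIT | FG_ALLOC_BIT
};

/* Single-bit tests in place of comparisons against each code. */
#define fg_is_error(s)       (((s) & FG_ERR_BIT) != 0)
#define fg_is_alloc_error(s) (((s) & FG_ALLOC_BIT) != 0)
#define fg_must_free(s)      (((s) & FG_INCR_BIT) != 0)   /* transient usage */
#define fg_read_done(s)      (((s) & FG_EOL_BIT) == 0)    /* no complete line */
```

With such an encoding a reading loop can stop on fg_read_done(fg.status)
and free on fg_must_free(fg.status). An implementation would presumably
also OR FG_INCR_BIT into an allocation-failure code when storage it
allocated still holds a partial line.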

As a final note, there is one minor decision left unsettled,
namely whether the returned line should include an EOL marker
before the string-terminating 0. My take is that this matter is
too trivial to warrant adding a flag to the calling sequence, and
that it will be slightly less confusing if there is one present
even if it has to be manufactured. Perhaps someone has a convincing
argument one way or the other.
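
If the marker is to be manufactured, the step itself is small. A
sketch, assuming a hypothetical helper fg_ensure_eol that fgetline
would call on a final line that ended at EOF; it assumes '\n' is the
EOL marker and that the buffer size is known at that point.

```c
#include <stddef.h>

/* Hypothetical helper: make sure the line of length len, held in a buffer
   of size bytes, ends with '\n'. Returns the new length, or 0 if there is
   no room for the marker plus the terminating 0. */
size_t fg_ensure_eol(char *line, size_t len, size_t size)
{
    if (len > 0 && line[len - 1] == '\n') return len;   /* already there */
    if (len + 2 > size) return 0;                       /* no room */
    line[len++] = '\n';                                 /* manufacture it */
    line[len] = '\0';
    return len;
}
```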

Richard Harter, cri(a)tiac.net
http://home.tiac.net/~cri, http://www.varinoma.com
But the rhetoric of holistic harmony can generate into a kind of
dotty, Prince Charles-style mysticism. -- Richard Dawkins

From: Richard Heathfield on 11 Oct 2007 22:18

Richard Harter said:

<snip>

> What we want is a routine that keeps reading and returning lines
> from a file. A simple prototype for this function is
>
> char * fgetline(FILE *);
>
> There are some gotchas with this prototype

Check out http://www.cpax.org.uk/prg/writings/fgetdata.php in which I dealt
with these problems a few years back. Curiously, I too chose the name
fgetline (and fgetword, for token-based rather than line-based input).

You, um, wanna try a different name? Two fgetlines could get confusing.

--
Richard Heathfield <http://www.cpax.org.uk>
Email: -http://www. +rjh@
Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
"Usenet is a strange place" - dmr 29 July 1999

From: Thad Smith on 11 Oct 2007 22:32

Richard Harter wrote:
> The following is an exercise in thinking out a design.
> Comments and thoughts are welcome.

I enjoyed reading your design process.

> Is there a way to get the efficiency of the highwater scheme
> without being dirty? Yes; the trick is that the calling program
> provides the initial buffer. If we add the buffer information to
>
> the prototype we get:
>
> fgetline_info fgetline(FILE * fptr,
> char * buffer,
> size_t size,
> size_t maxsize);
>
> It's probably best to have a value in the status field that
> signifies the size has been increased; call that value
> fg_increased and a normal read fg_normal. Then the usage for
> transient copies might look like this (first cut):
>
> ...
> struct fgetline_info fg;
> char buffer[80];
> int done = 0;
> ....
> for(;!done;) {
> ...
> fg = fgetline(fptr,buffer,80,FG_MAXSIZE);
> if (fg.status == fg_normal || fg.status == fg_increased) {
> /* do stuff */
> if (fg.status == fg_increased) free(fg.line);
> } else done = 1;
> }

That is too messy for my taste. I can see giving the flexibility of
using either a supplied buffer or allocating one, but to have the
function choose based on the size of the line makes the application too
error prone, IMO. If accepting a supplied buffer, I would only use
that, giving an error indication if it were too short. The
simplification in code is a good payoff for me.
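
A sketch of that simpler variant, for comparison. It uses only the
supplied buffer and reports a too-short buffer as a status; the
function name and the status names are hypothetical, and it leans on
fgets, so lines containing embedded 0 bytes are not handled.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

enum { FL_OK, FL_EOF, FL_TOO_SHORT };   /* hypothetical status names */

/* Read one line into buf (size bytes). On FL_OK or FL_TOO_SHORT, *len is
   the number of characters stored; on FL_TOO_SHORT the rest of the line
   is still unread in the file. */
int fgetline_fixed(FILE *fp, char *buf, size_t size, size_t *len)
{
    int n = size > INT_MAX ? INT_MAX : (int)size;   /* fgets takes an int */
    int c;

    assert(fp != NULL && buf != NULL && len != NULL);
    *len = 0;
    if (n < 2) return FL_TOO_SHORT;                 /* no room for any char */
    if (fgets(buf, n, fp) == NULL) return FL_EOF;   /* EOF or read error */
    *len = strlen(buf);
    if (*len > 0 && buf[*len - 1] == '\n') return FL_OK;

    /* No EOL stored: either the file ended or the buffer filled up. */
    c = getc(fp);
    if (c == EOF) return FL_OK;                     /* last line, no EOL */
    ungetc(c, fp);
    return FL_TOO_SHORT;
}
```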

> What sort of return values should be available in the status
> field? These are the successful ones that occur to me:
>
> end_of file The last read (if any) found an EOL
> marker. The current read finds an
> immediate EOF.
>
> no_increase Normal read - either buffer was null or
> else it was not increased.
>
> increase Normal read - the buffer size has been
> increased.
>
> abn_no_increase Abnormal read - an EOF was found without
> an EOL. Buffer was not increased.
>
> abn_increase Abnormal read - an EOF was found without
> an EOL. Buffer was increased.
>
> In addition there are numerous kinds of errors that can occur.
> Calling sequence arguments include:
>
> no_file The file pointer is null.
> bad_buffer One and only one of buffere and size s
> zero.
> bad_size Size is 0 or is greater than maxsize
> bad_maxsize Maxsize is 0
>
> Then there are the memory allocation failures:
>
> bad_allocate Malloc or realloc failure
> big_line Line length is greater than maxsize.

Add I/O error.

> As a final note, there is one minor decision left unsettled,
> namely should the returned line include an EOL marker before the
> string terminating 0. My take is that this matter is too trivial
>
> to warrant adding a flag to the calling sequence, and that it
> will be slightly less confusing if there is one present even if
> it has to be manufactured.

I prefer a flag. Otherwise you need to usurp some character for an EOF
marker. The constant EOF is intentionally distinct from (unsigned char)c
for all characters c (excepting implementations with sizeof(int) == 1).

--
Thad

From: Richard Harter on 11 Oct 2007 23:27

On Fri, 12 Oct 2007 02:18:30 +0000, Richard Heathfield
<rjh(a)see.sig.invalid> wrote:

>Richard Harter said:
>
><snip>
>
>> What we want is a routine that keeps reading and returning lines
>> from a file. A simple prototype for this function is
>>
>> char * fgetline(FILE *);
>>
>> There are some gotchas with this prototype
>
>Check out http://www.cpax.org.uk/prg/writings/fgetdata.php in which I dealt
>with these problems a few years back. Curiously, I too chose the name
>fgetline (and fgetword, for token-based rather than line-based input).

Thanks, I've read it; it's quite good; however it doesn't go into
some of the issues I was concerned with.
>
>You, um, wanna try a different name? Two fgetlines could get confusing.

But that's the logical name - getline gets a line from stdin,
fgetline gets a line from an arbitrary file.

Richard Harter, cri(a)tiac.net
http://home.tiac.net/~cri, http://www.varinoma.com
But the rhetoric of holistic harmony can generate into a kind of
dotty, Prince Charles-style mysticism. -- Richard Dawkins

From: Richard Heathfield on 11 Oct 2007 23:43

Richard Harter said:

> On Fri, 12 Oct 2007 02:18:30 +0000, Richard Heathfield
> <rjh(a)see.sig.invalid> wrote:
>
<snip>

>>Check out http://www.cpax.org.uk/prg/writings/fgetdata.php in which I
>>dealt with these problems a few years back. Curiously, I too chose the
>>name fgetline (and fgetword, for token-based rather than line-based
>>input).
>
> Thanks, I've read it; it's quite good; however it doesn't go into
> some of the issues I was concerned with.

I'm sure, but I wasn't particularly fussed about that - I just wanted to
draw your attention to the name clash.

>>You, um, wanna try a different name? Two fgetlines could get confusing.
>
> But that's the logical name - getline gets a line from stdin,
> fgetline gets a line from an arbitrary file.

Er, I agree that it's the logical name (obviously!). Oh, well - if you want
a name clash, a name clash you can have, I guess.

--
Richard Heathfield <http://www.cpax.org.uk>
Email: -http://www. +rjh@
Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
"Usenet is a strange place" - dmr 29 July 1999
