Designing fgetline - a perspective [General Programming]

Prev: problem analysis chart
Next: Please help!

From: Richard Harter on 12 Oct 2007 00:01

On Thu, 11 Oct 2007 20:32:32 -0600, Thad Smith
<ThadSmith(a)acm.org> wrote:

>Richard Harter wrote:
>> The following is an exercise in thinking out a design.
>> Comments and thoughts are welcome.
>
>I enjoyed reading your design process.

Thank you.
>
>
>> Is there a way to get the efficiency of the highwater scheme
>> without being dirty? Yes; the trick is that the calling program
>> provides the initial buffer. If we add the buffer information to
>>
>> the prototype we get:
>>
>> fgetline_info fgetline(FILE * fptr,
>> char * buffer,
>> size_t size,
>> size_t maxsize);
>>
>> It's probably best to have a value in the status field that
>> signifies the size has been increased; call that value
>> fg_increased and a normal read fg_normal. Then the usage for
>> transient copies might look like this (first cut):
>>
>> ...
>> struct fgetline_info fg;
>> char buffer[80];
>> int done = 0;
>> ....
>> for(;!done;) {
>> ...
>> fg = fgetline(fptr,buffer,80,FG_MAXSIZE);
>> if (fg.status == fg_normal || fg.status == fg_increased) {
>> /* do stuff */
>> if (fg.status == fg_increased) free(fg.line);
>> } else done = 1;
>> }
>
>That is too messy for my taste. I can see giving the flexibility of
>using either a supplied buffer or allocating one, but to have the
>function choose based on the size of the line makes the application too
>error prone, IMO. If accepting a supplied buffer, I would only use
>that, giving an error indication if it were too short. The
>simplification in code is a good payoff for me.

Hmmm. My take is that I want the function to give me a complete
line if at all possible but to do it in a way that minimizes
calls to malloc/realloc/free. Passing a supplied buffer does this
provided that we still do the expanding buffer trick. Treating a
too small buffer as an error would move a bunch of mess to the
calling program.

That said, your point about being too error prone is well taken.
It may be better to actually have two routines - one without an
input buffer and one with. Call the one with no user supplied
buffer fgetline and the one with the user supplied buffer
fgetline_hp (hp for high performance :-)).

>
>
>> What sort of return values should be available in the status
>> field? These are the successful ones that occur to me:
>>
>> end_of file The last read (if any) found an EOL
>> marker. The current read finds an
>> immediate EOF.
>>
>> no_increase Normal read - either buffer was null or
>> else it was not increased.
>>
>> increase Normal read - the buffer size has been
>> increased.
>>
>> abn_no_increase Abnormal read - an EOF was found without
>> an EOL. Buffer was not increased.
>>
>> abn_increase Abnormal read - an EOF was found without
>> an EOL. Buffer was increased.
>>
>> In addition there are numerous kinds of errors that can occur.
>> Calling sequence arguments include:
>>
>> no_file The file pointer is null.
>> bad_buffer One and only one of buffere and size s
>> zero.
>> bad_size Size is 0 or is greater than maxsize
>> bad_maxsize Maxsize is 0
>>
>> Then there are the memory allocation failures:
>>
>> bad_allocate Malloc or realloc failure
>> big_line Line length is greater than maxsize.
>
>Add I/O error.
>
>> As a final note, there is one minor decision left unsettled,
>> namely should the returned line include an EOL marker before the
>> string terminating 0. My take is that this matter is too trivial
>>
>> to warrant adding a flag to the calling sequence, and that it
>> will be slightly less confusing if there is one present even if
>> it has to be manufactured.
>
>I prefer a flag. Otherwise you need to usurp some character for an EOF
>marker. The constant EOF is intentionally outside (unsigned char)c for
>all characters c (excepting implementations with sizeof int = 1).

I see that I was unclear. Here is the issue: When we return a
line do we include \n or not. For example, suppose the line is
"abc". Do we return abc\n\0 or abc\0? The "manufactured" comes
into play when the last line of a file isn't properly terminated
with a \n. If we are returning the EOL (not EOF) character then
we have to manufacture one that is not actually present in the
file. Upon reflection, it may be best to return abc\0.

Richard Harter, cri(a)tiac.net
http://home.tiac.net/~cri, http://www.varinoma.com
But the rhetoric of holistic harmony can generate into a kind of
dotty, Prince Charles-style mysticism. -- Richard Dawkins

From: Richard Harter on 12 Oct 2007 00:23

On Fri, 12 Oct 2007 03:43:48 +0000, Richard Heathfield
<rjh(a)see.sig.invalid> wrote:

>Richard Harter said:
>
[snip]
>>
>> But that's the logical name - getline gets a line from stdin,
>> fgetline gets a line from an arbitrary file.
>
>Er, I agree that it's the logical name (obviously!). Oh, well - if you want
>a name clash, a name clash you can have, I guess.

Hey, I'm easy. Do you have any good suggestions? getln, fgetln,
rdline, and readline are already in use.

Richard Harter, cri(a)tiac.net
http://home.tiac.net/~cri, http://www.varinoma.com
But the rhetoric of holistic harmony can generate into a kind of
dotty, Prince Charles-style mysticism. -- Richard Dawkins

From: Logan Shaw on 12 Oct 2007 00:24

Richard Harter wrote:
> Is there a way to get the efficiency of the highwater scheme
> without being dirty? Yes; the trick is that the calling program
> provides the initial buffer. If we add the buffer information to
> the prototype we get:
>
> fgetline_info fgetline(FILE * fptr,
> char * buffer,
> size_t size,
> size_t maxsize);
>
> There are three kinds of copies of the line that we might get
> from fgetline - clean, transient, and dirty. A clean copy is one
> that has no encumbrances - it lasts until it is specifically
> deleted. A transient copy is one that lasts until the next
> call to fgetline to get another line from the current file.
> (There may be another file being read elsewhere.)
>
> What kind of copy should fgetline provide, clean or transient?
> There are arguments for each choice. Clean copies are more
> expensive but safer. Transient copies are (usually) cheaper. In
> questions like this there is much to be said for the design
> principle that:
>
> When there is a significant choice as to the kind of
> output being delivered the library routine should let
> the user make the choice rather than dictating the
> choice unless offering the choice unduly complicates
> usage.
>
> So how do we provide a choice? That is simple; if the buffer
> pointer in the calling sequence is NULL, fgetline must return a
> clean copy; otherwise fgetline MAY return a transient copy.

I agree 100% so far, but this is where I diverge. IMHO, this is
all a lot simpler if you let the caller make the choice up front
before fgetline() is ever called. How do you do this when the
length of the line is not known in advance? You let the user
manage the storage for the "meta info" about the buffer (the
address of the buffer, the size of it, etc.), and you track
everything about the buffer either in local variables or in
the storage that the user has supplied. If the user wants that
meta info to be static, they declare it static. If they want
to have a clean copy on every call, they make a new structure
to track a buffer every time.Also notice that free_buffer() leaves the buffer metadata in a
consistent state so that you can throw away the dynamic memory as
soon as you know you don't need it yet still call freadline() on it
again if you feel like it.

It looks like this (extraneous fields omitted for clarity):

struct LineBuffer {
char *buffer;
size_t buffer_size;
};

And the code looks like this:

void init_buffer(LineBuffer *lb) {
lb->buffer = NULL;
lb->buffer_size = 0;
}

int increase_buffer(LineBuffer *lb, size_t newsize) {
char *newbuffer = realloc(buffer, newsize);
if (newbuffer == NULL) return 0;

lb->buffer = newbuffer;
lb->buffer_size = newsize;
return 1;
}

void free_buffer(LineBuffer *lb) {
free(lb->buffer);
init_buffer(lb); /* leave in a consistent state */
}

int freadline(FILE *f, LineBuffer *lb) {
int pos = 0;

while (.... whatever ....) {
if (pos >= lb->buffer_size &&
! increase_buffer(lb, 2 * lb->buffer_size + 1)) {
return 0;
}

lb->buffer[pos++] = .... this character ....;
}
}

Now the client code can do whatever it wants. It doesn't have to worry
about conditionally freeing the old storage. Because the client is
managing the metadata about the buffer and because it is passing that
metadata to freadline(), the freadline() function can free() it if
necessary.

On a side note, since init_buffer() allocates no dynamic storage and
merely leaves the metadata in a consistent state, no error handling
is necessary, and only calls to freadline() have to worry about an
error, thus reducing the amount of error-handling code needed.

- Logan

From: Alf P. Steinbach on 12 Oct 2007 00:38

* Richard Harter:
> The following is an exercise in thinking out a design.
> Comments and thoughts are welcome.
>
> Writing a routine to read lines from a file is one of
> those little design tasks that seems to create disagreement and
> confusion. The routine has various names; in this article we
> will call it fgetline. Here is a take on how to do it.
>
> What we want is a routine that keeps reading and returning lines
> from a file. A simple prototype for this function is
>
> char * fgetline(FILE *);
>
> There are some gotchas with this prototype - well, not really
> gotchas, rather little bits of awkwardness and inefficiency. One
>
> is that we are throwing away information that (usually) has to be
>
> recomputed, namely the length of the line and, as we shall see,
> status information. If we want to add the information to our
> prototype there are several that to do it - it can be
> passed back through the calling sequence or it can be returned by
>
> the function. (We could also pass it to a global - ugh, don't do
>
> it.) My vote is to have it all returned by the function which
> means we create a struct (record) that looks like:
>
> struct fgetline_info {
> char *line;
> size_t len;
> int status;
> };

This looks like a C++ std::string except it has mixed in that "status"
field which has nothing to do with strings.

> There are various choices that could be made in the types -
> season to taste. (A pointer to the struct could be passed in as
> an argument if need be.)
>
>
> A second issue is that there is no limit to how large the line
> can be. A plausible way to provide a limit that is to pass one
> in as an argument. If we do, our prototype looks like this:
>
> struct fgetline_info fgetline(FILE * fptr, size_t maxsize);
>
> This version has a little problem - where did the storage for the
>
> line come from? One way to deal with that is for fgetline to
> allocate storage for the line and for the calling routine to free
>
> it when it is done with the line.

Well, the C++ function is

std::ostream& std::getline( std::ostream&, std::string& );

As far as I can see this solves all of the problems mentioned.

What it doesn't solve is how to just ignore a line.

For that, streams have the ignore() member function.

OH! You have cross-posted to [comp.lang.c]! Well, darn, it shouldn't
be too difficult to translate the C++ solution to C.

Of course, then the programmer has to do what the compiler does for C++.

And I think that's very relevant: the functionality you want is
essentially to have all this stuff automated and enforced by the
compiler, like in C++ and more high level languages.

Cheers, & hth.,

- Alf

--
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?

From: Richard Heathfield on 12 Oct 2007 02:19

Richard Harter said:

> On Fri, 12 Oct 2007 03:43:48 +0000, Richard Heathfield
> <rjh(a)see.sig.invalid> wrote:
>
>>Richard Harter said:
>>
> [snip]
>>>
>>> But that's the logical name - getline gets a line from stdin,
>>> fgetline gets a line from an arbitrary file.
>>
>>Er, I agree that it's the logical name (obviously!). Oh, well - if you
>>want a name clash, a name clash you can have, I guess.
>
> Hey, I'm easy. Do you have any good suggestions? getln, fgetln,
> rdline, and readline are already in use.

Ah. Er, um... Hmmm. The trouble is that fgetline *is* the logical name! :-)

Okay, how about... oh. fgets has kinda gone, too, hasn't it?

Um... sgetline? (s for stream (the source) or for string (the target) at
your discretion).

Just a thought. Obviously, fgetline isn't trademarked or anything (well,
not as far as I know, anyway), so you're free to use that if you choose.
I'm not being 'precious' about it, honestly - but between you and me the
name clash is particularly bad because we can't even say "fgetline - you
know, Richard H's version"!

--
Richard Heathfield <http://www.cpax.org.uk>
Email: -http://www. +rjh@
Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
"Usenet is a strange place" - dmr 29 July 1999

First | Prev | Next | Last
Pages: 1 2 3 4 5 6 7 8 9 10
Prev: problem analysis chart
Next: Please help!