Designing fgetline - a perspective [General Programming]

Prev: problem analysis chart
Next: Please help!

From: Richard Harter on 12 Oct 2007 02:37

On Thu, 11 Oct 2007 23:24:50 -0500, Logan Shaw
<lshaw-usenet(a)austin.rr.com> wrote:

>Richard Harter wrote:

>> So how do we provide a choice? That is simple; if the buffer
>> pointer in the calling sequence is NULL, fgetline must return a
>> clean copy; otherwise fgetline MAY return a transient copy.
>
>I agree 100% so far, but this is where I diverge. IMHO, this is
>all a lot simpler if you let the caller make the choice up front
>before fgetline() is ever called. How do you do this when the
>length of the line is not known in advance? You let the user
>manage the storage for the "meta info" about the buffer (the
>address of the buffer, the size of it, etc.), and you track
>everything about the buffer either in local variables or in
>the storage that the user has supplied. If the user wants that
>meta info to be static, they declare it static. If they want
>to have a clean copy on every call, they make a new structure
>to track a buffer every time.Also notice that free_buffer() leaves the buffer metadata in a
>consistent state so that you can throw away the dynamic memory as
>soon as you know you don't need it yet still call freadline() on it
>again if you feel like it.
>
>It looks like this (extraneous fields omitted for clarity):
>
> struct LineBuffer {
> char *buffer;
> size_t buffer_size;
> };
>
>And the code looks like this:
>
> void init_buffer(LineBuffer *lb) {
> lb->buffer = NULL;
> lb->buffer_size = 0;
> }
>
> int increase_buffer(LineBuffer *lb, size_t newsize) {
> char *newbuffer = realloc(buffer, newsize);
> if (newbuffer == NULL) return 0;
>
> lb->buffer = newbuffer;
> lb->buffer_size = newsize;
> return 1;
> }
>
> void free_buffer(LineBuffer *lb) {
> free(lb->buffer);
> init_buffer(lb); /* leave in a consistent state */
> }
>
> int freadline(FILE *f, LineBuffer *lb) {
> int pos = 0;
>
> while (.... whatever ....) {
> if (pos >= lb->buffer_size &&
> ! increase_buffer(lb, 2 * lb->buffer_size + 1)) {
> return 0;
> }
>
> lb->buffer[pos++] = .... this character ....;
> }
> }
>
>Now the client code can do whatever it wants. It doesn't have to worry
>about conditionally freeing the old storage. Because the client is
>managing the metadata about the buffer and because it is passing that
>metadata to freadline(), the freadline() function can free() it if
>necessary.
>
>On a side note, since init_buffer() allocates no dynamic storage and
>merely leaves the metadata in a consistent state, no error handling
>is necessary, and only calls to freadline() have to worry about an
>error, thus reducing the amount of error-handling code needed.

Many thanks for the comments. You have, I opine, the right idea
(using metadata) but your actual code doesn't do the job. One
problem is that your init_buffer initializes the buffer to NULL.
This means that the client can't supply a buffer either from the
stack or from static storage; it will have to come from the heap
which is what we want to avoid. We can get around that by having
a pair of pointer/length variables, e.g.

struct fg_meta {
char * buf0;
size_t len0;
char * buffer;
size_t length;
};

void init_fgmeta(fg_meta * fg; char *buf, size_t len) {
if (!buf) len = 0;
if (!len) buf = 0;
fg->buf0 = buf; fg->buffer = buf;
fg->len0 = len; fg->length = len;
};

with the obvious adjustment to free_fgmeta/free_buffer.

More importantly we don't want the user (client if you like) to
have to manage the meta data, though they can have the option to
do so if they like. The implication is that at initialization
time the user *tells* the package how it wants to do things,
presumably by passing a flag to the initialization routine. Does
the user want a clean copy each time? Then the user takes
responsibility for freeing the storage. Will the user accept
transient copies? Then they don't have to anything about freeing
storage.

The way I would do this is to define a structure that holds all
this information; call the pointer to it a handle. Then we have
three routines, one to initialize the structure, one to do the
actual reads (and whatever back room storage management is
needed) and one to close the read. The latter takes care of any
cleanup that needs to be done.

That's the alternative to passing bits and pieces through calling
sequences, and it probably is the road that should be taken.

Richard Harter, cri(a)tiac.net
http://home.tiac.net/~cri, http://www.varinoma.com
But the rhetoric of holistic harmony can generate into a kind of
dotty, Prince Charles-style mysticism. -- Richard Dawkins

From: Alf P. Steinbach on 12 Oct 2007 02:48

* Alf P. Steinbach:
>
> std::ostream

I meant istream, of course.

--
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?

From: Richard Harter on 12 Oct 2007 05:04

On Fri, 12 Oct 2007 06:19:23 +0000, Richard Heathfield
<rjh(a)see.sig.invalid> wrote:

>Richard Harter said:
>
>> On Fri, 12 Oct 2007 03:43:48 +0000, Richard Heathfield
>> <rjh(a)see.sig.invalid> wrote:
>>
>>>Richard Harter said:
>>>
>> [snip]
>>>>
>>>> But that's the logical name - getline gets a line from stdin,
>>>> fgetline gets a line from an arbitrary file.
>>>
>>>Er, I agree that it's the logical name (obviously!). Oh, well - if you
>>>want a name clash, a name clash you can have, I guess.
>>
>> Hey, I'm easy. Do you have any good suggestions? getln, fgetln,
>> rdline, and readline are already in use.
>
>Ah. Er, um... Hmmm. The trouble is that fgetline *is* the logical name! :-)
>
>Okay, how about... oh. fgets has kinda gone, too, hasn't it?
>
>Um... sgetline? (s for stream (the source) or for string (the target) at
>your discretion).
>
>Just a thought. Obviously, fgetline isn't trademarked or anything (well,
>not as far as I know, anyway), so you're free to use that if you choose.
>I'm not being 'precious' about it, honestly - but between you and me the
>name clash is particularly bad because we can't even say "fgetline - you
>know, Richard H's version"!

I did a bit of googling; freadline is pretty much free, so
freadline it is.

>
>--
>Richard Heathfield <http://www.cpax.org.uk>
>Email: -http://www. +rjh@
>Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
>"Usenet is a strange place" - dmr 29 July 1999

Richard Harter, cri(a)tiac.net
http://home.tiac.net/~cri, http://www.varinoma.com
But the rhetoric of holistic harmony can generate into a kind of
dotty, Prince Charles-style mysticism. -- Richard Dawkins

From: Malcolm McLean on 12 Oct 2007 16:06

"Richard Harter" <cri(a)tiac.net> wrote in message
> The following is an exercise in thinking out a design.
> Comments and thoughts are welcome.
>
Firstly let me say I think this is a massively valuable exercise. Any
competent C programmer should be able to write an fgetline() without bugs,
the cghallenge is in developing an itnerface for it.
>
> My vote is to have it all returned by the function which
> means we create a struct (record) that looks like:
>
> struct fgetline_info {
> char *line;
> size_t len;
> int status;
> };
>
Sorry hopeless.
Caller wants
while(line = fgetline() )
{
processline();
}

He can't be bothered with fiddling rounf with the internals of structures.

> A second issue is that there is no limit to how large the line
> can be. A plausible way to provide a limit that is to pass one
> in as an argument. If we do, our prototype looks like this:
>
> struct fgetline_info fgetline(FILE * fptr, size_t maxsize);
>
> This version has a little problem - where did the storage for the
> line come from?
>
You're passing in a maximum size, so caller might as well make the line a
buffer in his scope, as in fgets().

> In fgetline we have some initialization code that allocates
> buffer space when the buffer size is zero. Thereafter we use our
> little doubling trick.
>
Is this relevant? Apart for the problem of malicious, massive lines designed
to tie up resources, I don't think most people care whether the function can
be implemented efficiently or not.
>
[ clen, transient, dirty copies ]
> When there is a significant choice as to the kind of
> output being delivered the library routine should let
> the user make the choice rather than dictating the
> choice unless offering the choice unduly complicates
> usage.
>
And the choice does unduly complicate usage. Caller doesn't want to be
bothered with all this.
>
> What sort of return values should be available in the status
> field? These are the successful ones that occur to me:
>
> end_of file The last read (if any) found an EOL
> marker. The current read finds an
> immediate EOF.
>
> no_increase Normal read - either buffer was null or
> else it was not increased.
>
> increase Normal read - the buffer size has been
> increased.
>
> abn_no_increase Abnormal read - an EOF was found without
> an EOL. Buffer was not increased.
>
> abn_increase Abnormal read - an EOF was found without
> an EOL. Buffer was increased.
>
> In addition there are numerous kinds of errors that can occur.
> Calling sequence arguments include:
>
> no_file The file pointer is null.
> bad_buffer One and only one of buffere and size s
> zero.
> bad_size Size is 0 or is greater than maxsize
> bad_maxsize Maxsize is 0
>
> Then there are the memory allocation failures:
>
> bad_allocate Malloc or realloc failure
> big_line Line length is greater than maxsize.
>
> When there is a memory allocation failure the line and len fields
> hold what has been successfuly read.
>
This is a real problem. Too many return values. You can't get away from
allocation failures unless you also redo your entire allocation strategy.
You can't prevent IO errors, though in practise they are unlikely. Premature
EOF is neither unlikely nor something you can avoid.
>
> As a final note, there is one minor decision left unsettled,
> namely should the returned line include an EOL marker before the
> string terminating 0.
>
Why not just extend fgets().
You're passing a maximum size, which 99% of the time will be less than 1024
bytes, so caller can easily declare it in his scope. He can malloc a massive
buffer if expecting a massive line. The only thing we lose is that if there
is huge but legitimate line of unknown length, caller has to allocate more
memory.

The problem of line length is easily solved, pass in a length pointer /
return it. That leaves only the problem of what to do on overlong lines. The
ideal thing is to take as much error handling code out of caller as
possible.
So caller might want to exit, he might want to print a message to stderr, he
might want the function to refuse to return any data, or he might want to do
what fgets() does and essentially ignore the problem.

So

/*
fgetsx() - fgets extension.
Params; buff - buffer to hold line
maxlen - length of buffer
flags - how to handle overlong lines and IO errors
fp - pointer to an open file
Returns: number of characters placed in buff, or errorcode
*/
int fgetsx(char *buff, size_t maxlen, int flags, FILE *fp);

So now we've just got to decide the flags.

FGETS_EXIT - call exit on error
FGETS_CODE - return an error code on error
FGETS_STDERR - print a message to stderr on error
FGETS_IO - report IO errors
FGETS_OVERFLOW - report overflow errors
FGETS_EOF - report premature EOF as an error

FGETS_ALL - report all errors

Return value -1 = premature EOF
-2 - IO error
-3 - buffer overflow error

Typical call

while(len = fgetsx(buff, 1024, FGETS_EXIT| FGETS_STDERR | FGETS_ALL, fp))
{
/* here we know that buff is a valid normal line. Else program quits with
an error message */
}

while(ret = fgets(buff, 1024, FGETS_CODE | FGETS_ALL, fp) )
{
if(ret < 0)
{
switch(ret)
{
case EOF: /* premature EOF, handle */
case FGETS_IOERROR: /* io error, handle */
case FGETS_OVERFLOW: /* line too long, handle */
}
}
}

--
Free games and programming goodies.
http://www.personal.leeds.ac.uk/~bgy1mm

From: Richard Harter on 13 Oct 2007 13:22

On Fri, 12 Oct 2007 21:06:10 +0100, "Malcolm McLean"
<regniztar(a)btinternet.com> wrote:

>"Richard Harter" <cri(a)tiac.net> wrote in message
>> The following is an exercise in thinking out a design.
>> Comments and thoughts are welcome.
>>
>Firstly let me say I think this is a massively valuable exercise. Any
>competent C programmer should be able to write an fgetline() without bugs,
>the cghallenge is in developing an itnerface for it.

Thanks.

>>
>> My vote is to have it all returned by the function which
>> means we create a struct (record) that looks like:
>>
>> struct fgetline_info {
>> char *line;
>> size_t len;
>> int status;
>> };
>>
>Sorry hopeless.
>Caller wants
>while(line = fgetline() )
>{
> processline();
>}
>
>He can't be bothered with fiddling rounf with the internals of structures.

Point well taken, though I don't agree with "can't be bothered" -
some will and some won't. There are a couple of lessons here
though:

(a) The usage should be consistent with common practice. It is
very common in C to use loops like:
while(line = foo()) {...}
and
while((len = foo()) >= 0) {...}

so the interface should provide for it.

(b) The desires of the "the can't be bothered" school of
programming should be respected - at least to the extent of
providing usage options that produce results that aren't too
gross.

>
>> A second issue is that there is no limit to how large the line
>> can be. A plausible way to provide a limit that is to pass one
>> in as an argument. If we do, our prototype looks like this:
>>
>> struct fgetline_info fgetline(FILE * fptr, size_t maxsize);
>>
>> This version has a little problem - where did the storage for the
>> line come from?
>>
>You're passing in a maximum size, so caller might as well make the line a
>buffer in his scope, as in fgets().

I disagree. A common usage is read all of the lines of a file
and keep them in memory. If the buffers holding the lines are
all maximum size then you get memory bloat. The purpose of
maxsize is to serve as a sanity check.

>
>> In fgetline we have some initialization code that allocates
>> buffer space when the buffer size is zero. Thereafter we use our
>> little doubling trick.
>>
>Is this relevant? Apart for the problem of malicious, massive lines designed
>to tie up resources, I don't think most people care whether the function can
>be implemented efficiently or not.

I care. I like library code to be mean, lean, and efficient. I
don't like library code produces bloated resource usage.

>>
>[ clen, transient, dirty copies ]
>> When there is a significant choice as to the kind of
>> output being delivered the library routine should let
>> the user make the choice rather than dictating the
>> choice unless offering the choice unduly complicates
>> usage.
>>
>And the choice does unduly complicate usage. Caller doesn't want to be
>bothered with all this.

This user does. I dare say that most users don't know, don't
care, and don't understand that their code is bloated and uses an
order of magnitude more resources than necessary. Often it
doesn't matter - the cost of writing more efficient code can
be greater than the savings.

That said, programmers should (in my personal opinion) at least
understand the costs of various choices. Part of programming is
about understanding tradeoffs. I feel that one ought to know and
appreciate that it is cheaper to reuse space rather than
continually allocate and deallocate it from the heap.

>>
>> What sort of return values should be available in the status
>> field? These are the successful ones that occur to me:
>>
>> end_of file The last read (if any) found an EOL
>> marker. The current read finds an
>> immediate EOF.
>>
>> no_increase Normal read - either buffer was null or
>> else it was not increased.
>>
>> increase Normal read - the buffer size has been
>> increased.
>>
>> abn_no_increase Abnormal read - an EOF was found without
>> an EOL. Buffer was not increased.
>>
>> abn_increase Abnormal read - an EOF was found without
>> an EOL. Buffer was increased.
>>
>> In addition there are numerous kinds of errors that can occur.
>> Calling sequence arguments include:
>>
>> no_file The file pointer is null.
>> bad_buffer One and only one of buffere and size s
>> zero.
>> bad_size Size is 0 or is greater than maxsize
>> bad_maxsize Maxsize is 0
>>
>> Then there are the memory allocation failures:
>>
>> bad_allocate Malloc or realloc failure
>> big_line Line length is greater than maxsize.
>>
>> When there is a memory allocation failure the line and len fields
>> hold what has been successfuly read.
>>
>This is a real problem. Too many return values. You can't get away from
>allocation failures unless you also redo your entire allocation strategy.
>You can't prevent IO errors, though in practise they are unlikely. Premature
>EOF is neither unlikely nor something you can avoid.

These aren't return values - they are information flags, bits in
the status word. At least that's the way I would do it. I tend
to take the view that library code ought to have a way to inform
the user of all the happenings of interest. Of course it
shouldn't be shoved down the user's throat.

>>
>> As a final note, there is one minor decision left unsettled,
>> namely should the returned line include an EOL marker before the
>> string terminating 0.
>>
>Why not just extend fgets().
>You're passing a maximum size, which 99% of the time will be less than 1024
>bytes, so caller can easily declare it in his scope. He can malloc a massive
>buffer if expecting a massive line. The only thing we lose is that if there
>is huge but legitimate line of unknown length, caller has to allocate more
>memory.

That's one way to do things but I really don't like it in part
for reasons already discussed, and because I don't trust coding
decisions made on the " most of the time it won't be a problem"
theory. More precisely my feeling is that code "down at the
bottom" should be very solid.

Having done the motherhood and apple pie commercial we now return
to your regular station.

>
>The problem of line length is easily solved, pass in a length pointer /
>return it. That leaves only the problem of what to do on overlong lines. The
>ideal thing is to take as much error handling code out of caller as
>possible.
>So caller might want to exit, he might want to print a message to stderr, he
>might want the function to refuse to return any data, or he might want to do
>what fgets() does and essentially ignore the problem.
>
> So
>
>/*
> fgetsx() - fgets extension.
> Params; buff - buffer to hold line
> maxlen - length of buffer
> flags - how to handle overlong lines and IO errors
> fp - pointer to an open file
> Returns: number of characters placed in buff, or errorcode
>*/
>int fgetsx(char *buff, size_t maxlen, int flags, FILE *fp);
>
>So now we've just got to decide the flags.
>
>FGETS_EXIT - call exit on error
>FGETS_CODE - return an error code on error
>FGETS_STDERR - print a message to stderr on error
>FGETS_IO - report IO errors
>FGETS_OVERFLOW - report overflow errors
>FGETS_EOF - report premature EOF as an error
>
>FGETS_ALL - report all errors

This idea I like - pass flags to the routine that tell it what to
do with errors.

>
>Return value -1 = premature EOF
> -2 - IO error
> -3 - buffer overflow error

I dunno know about this one - using special values as flags is an
ancient programming trick, but in my experience it is error prone
and maintainance problems prone. The problems come from having
two distinct meanings for a variable.
>
>Typical call
>
>while(len = fgetsx(buff, 1024, FGETS_EXIT| FGETS_STDERR | FGETS_ALL, fp))
>{
> /* here we know that buff is a valid normal line. Else program quits with
> an error message */
>}
>
>while(ret = fgets(buff, 1024, FGETS_CODE | FGETS_ALL, fp) )
>{
> if(ret < 0)
> {
> switch(ret)
> {
> case EOF: /* premature EOF, handle */
> case FGETS_IOERROR: /* io error, handle */
> case FGETS_OVERFLOW: /* line too long, handle */
> }
> }
>}

I assume that second call should be ret = fgetsx.

Where is buff? If it is on the stack you're making transient
copies; if it is allocated each time (and you have to do the
allocation in the user code) you're making clean bloated copies.
Notice that in one call you've got len in one call and ret in
another.

Thanks for the comments - you've given me things to think about.

Richard Harter, cri(a)tiac.net
http://home.tiac.net/~cri, http://www.varinoma.com
But the rhetoric of holistic harmony can generate into a kind of
dotty, Prince Charles-style mysticism. -- Richard Dawkins

First | Prev | Next | Last
Pages: 1 2 3 4 5 6 7 8 9 10
Prev: problem analysis chart
Next: Please help!