From: Jeff Koftinoff on

Wu Yongwei wrote:
>
> Why do you say `no exception thrown'? I would expect a std::bad_alloc,
> and, when it is not caught, an abort().
>

Your expectation would be correct on a conforming system. However, any
system that employs lazy memory allocation (allocating memory pages at
page-fault time instead of at malloc() time) behaves differently.

I have written a simple test program, available at:

https://clicker.jdkoftinoff.com/projects/trac/jdks/wiki/alloctests

which runs some stress tests that show the problem.
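
The gist of those tests is something like the following (a simplified
sketch, not the actual program at the URL above): keep allocating and
touch the pages, because with lazy allocation the failure only shows
up when the pages are touched, often as the process being killed
rather than as a std::bad_alloc.

#include <cstddef>
#include <cstdio>
#include <new>

int main()
{
    try
    {
        const std::size_t chunk = 64 * 1024 * 1024;     // 64 MB per step
        for (;;)
        {
            char* p = new char[chunk];          // may "succeed" even when no
                                                // real memory is left
            for (std::size_t i = 0; i < chunk; i += 4096)
                p[i] = 1;                       // touching the pages is what
                                                // actually commits them
            std::printf("committed another %lu bytes\n",
                        static_cast<unsigned long>(chunk));
        }
    }
    catch (std::bad_alloc&)
    {
        std::printf("caught std::bad_alloc\n"); // the conforming outcome
    }
    return 0;                                   // the blocks are deliberately
                                                // leaked; it is a stress test
}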

> >
> > Depending on where this code is used, it can be the basis of a security
> > hole - and in that sense fgets() could be a better solution!
>
> Yes, this is a problem, but I do not see it as a security hole. Every
> program can run out of memory for some kind of input, including Firefox
> and Internet Explorer. I often see Firefox occupying more than 500 MB of
> memory, which makes me feel it necessary to restart it after viewing a
> big page. I am sure it is possible to make a big page that crashes them.
>

It is a security hole when it is combined with the lazy allocation
problem. It allows an untrusted user to kill servers, potentially even
other unrelated admin servers running on the same system. Yes, it is a
problem with those systems' designs. But it is a real one that affects
many real servers. It cannot be stressed enough that catching
std::bad_alloc is not always enough!

> When this problem could be a real issue, there are ways to work
> around it, for example by using custom allocators. The point is that
> C++ does not force a strange, arbitrary limit on how long a line can
> be. And the system limitation can be handled somewhere other than in
> the processing logic.
>

Unfortunately, the interfaces of getline() and std::string also do not
allow me to put a limit on how long the line can be.

And if I can use this space to respond to James Kanze's comment:

James Kanze wrote:

> I agree that a version of the function with a maximum length
> would be nice. Or simply specifying that extractors respect
> setw too---that would be useful for a lot of other things as
> well. But I don't see it really as a security hole, any more
> than the possibility of any user function allocating more
> memory than it should is. If untrusted users have
> access to your program, you'll have taken the necessary steps
> outside the program to prevent DOS due to thrashing.

What, then, would the necessary steps be to ensure a maximum line
length in my protocol parsers that use iostream and std::string?
Should I write my own getline()? Or read one character at a time?
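
If I did roll my own, I imagine it would look something like this.
Just a sketch; the name bounded_getline() and its fail-on-overflow
policy are my own invention, not anything standard:

#include <istream>
#include <string>

std::istream& bounded_getline(std::istream& is, std::string& line,
                              std::string::size_type max_len,
                              char delim = '\n')
{
    line.clear();
    char c;
    while (is.get(c))
    {
        if (c == delim)
            return is;                       // delimiter consumed, not stored
        if (line.size() >= max_len)
        {
            is.setstate(std::ios::failbit);  // line too long: refuse it; the
            return is;                       // rest of the line stays unread
        }
        line += c;
    }
    return is;                               // EOF or stream error
}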

I, for one, would love to have a std::string class with an option
for setting a maximum allowable length.

Regards,
Jeff Koftinoff
www.jdkoftinoff.com


[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: kanze on
Jeff Koftinoff wrote:
> Wu Yongwei wrote:

> > Why do you say `no exception thrown'? I would expect a
> > std::bad_alloc, and, when it is not caught, an abort().

> Your expectation would be correct on a conforming system.
> However, any system that employs lazy memory allocation
> (allocating memory pages at page-fault time instead of at
> malloc() time) behaves differently.

Yes, but you really shouldn't allow such machines to be
connected to the Internet. It's the overcommit which is the
problem, not the code using getline(). Normally, any Linux
machine connected to the network should have the value 2 in
/proc/sys/vm/overcommit_memory. (I would, in fact, recommend
this on any Linux system, unless some of the applications
being run require overcommit or were designed with overcommit in
mind. The process which gets killed isn't necessarily the one
using too much memory; I have heard of at least one case where a
critical system process needed for login was killed.)
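
For belt and braces, the program itself can check the setting when it
starts up. A trivial, Linux-specific sketch:

#include <fstream>
#include <iostream>

int main()
{
    // 0 = heuristic overcommit, 1 = always overcommit, 2 = no overcommit
    std::ifstream f("/proc/sys/vm/overcommit_memory");
    int mode = -1;
    if (f >> mode && mode != 2)
        std::cerr << "warning: vm.overcommit_memory is " << mode
                  << " (2 is what you want on a server)\n";
    return 0;
}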

Note that not having overcommit isn't a panacea either. I can
remember Solaris 2.2 hanging for around 5 minutes, thrashing
like crazy but not advancing in any visible way, in one of the
stress tests I did on it. (This seems to have been fixed in
later versions. At least, my stress test caused no problems
with Solaris 2.4.)

[...]
> It is a security hole when it is combined with the lazy
> allocation problem.

Overcommit is a very serious security hole, but one which has nothing
to do with getline or fgets (unless you are considering the
possibility of writing the entire program without using any
dynamic memory).

> It allows an untrusted user to kill servers, potentially even
> other unrelated admin servers running on the same system.

The untrusted user doesn't get a choice with regards to which
processes are killed. For that matter, nor does the trusted
user:-). Anytime you run a system with overcommit, any program
which uses dynamic memory, OR forks, OR does any one of a number
of other things which could cause memory to be allocated, may
crash the system anytime it runs. Common sense says that you
don't run anything important on such machines.

> Yes, it is a problem with those systems' designs. But it is a
> real one that affects many real servers. It cannot be
> stressed enough that catching std::bad_alloc is not always
> enough!

IIRC, Andy Koenig wrote an article about the general problem a
long time ago, in some OO journal. In general: except on
specially designed systems, you can't count on catching
bad_alloc and recovering; there are generally cases of memory
allocation failure that escape its detection (the stack is an
obvious example). On the other hand, many programs don't
require 100% certainty of catching it, and on a well designed
system, if you exceed available memory and don't manage to catch
it, the system will still shut you down cleanly and free up most
resources. Also, if you're familiar with the system, you may be
able to avoid some of the problem areas---I've written programs
for Solaris where I could guarantee no insufficient memory due
to stack overflow after start-up.

> > When this problem could be a real issue, there are ways to
> > work around it, for example by using custom allocators. The
> > point is that C++ does not force a strange, arbitrary limit
> > on how long a line can be. And the system limitation can be
> > handled somewhere other than in the processing logic.

> Unfortunately, the interfaces of getline() and std::string also
> do not allow me to put a limit on how long the line can be.

> And if I can use this space to respond to James Kanze's comment:

> James Kanze wrote:
>
> > I agree that a version of the function with a maximum length
> > would be nice. Or simply specifying that extractors respect
> > setw too---that would be useful for a lot of other things as
> > well. But I don't see it really as a security hole, any more
> > than the possibility of any user function allocating more
> > memory than it should is. If untrusted users have
> > access to your program, you'll have taken the necessary steps
> > outside the program to prevent DOS due to thrashing.

> What, then, would the necessary steps be to ensure a maximum
> line length in my protocol parsers that use iostream and
> std::string?

One obvious solution would be to overload getline with an
additional parameter specifying maximum length. A perhaps more
general solution would be to systematically recognize the width
parameter on input---this would also be useful for reading files
with fixed width fields, rather than separators.
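
For raw character arrays, the extractor already respects the width, so
the read is bounded; the suggestion is essentially to extend the same
convention to std::string and to getline. A minimal illustration of
the existing behavior:

#include <iomanip>
#include <iostream>

int main()
{
    char word[16];
    // Stores at most 15 characters plus the terminating '\0'; anything
    // longer is simply left in the stream for the next extraction.
    if (std::cin >> std::setw(sizeof word) >> word)
        std::cout << "got: " << word << '\n';
    return 0;
}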

(Note that depending on the implementation, inputting with >>
into an int might suffer similar problems if you feed it too
many digits. I would expect any good implementation to stop
storing digits once it recognizes that it has more than enough,
but all the standard says is that if you feed it a number which
is too big, you have undefined behavior.)

> Should I write my own getline()? Or read one character at a
> time?

It depends on what you are doing. I'd start by getting rid of
overcommit:-). AIX stopped using it by default a long time ago
(but you can still turn it back on, on a user-by-user basis, if
you need it). Most Linux distributions seem to default to using
it (which seems fairly irresponsible), but it's easy to turn off
(globally). Solaris and HP/UX don't use it. So you're safe on
the major Unix platforms (Tru64 UNIX for Alpha? SGI? I don't
know.)

After that, it depends on the application, and what you're using
the standard streams for. My applications are mainly large,
reliable servers, and istream is used only for reading the
configuration file---if it crashes, it crashes during
initialization, due to an operator error (bad config file), but
the server doesn't go down once it is running. Unless it hits
some odd case in the system or the system library that I
couldn't protect against.

I also write a lot of little quicky programs for my own use.
There too, if they crash because of an excessively long line,
it's no big deal. (But if I were serious about it, I'd set up a
new_handler to emit a nice message before terminating the
program.)
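
Something along these lines, with the obvious caveat that under
overcommit it may never get called:

#include <cstdlib>
#include <iostream>
#include <new>

void out_of_memory()                 // the name is arbitrary, of course
{
    std::cerr << "fatal: out of memory\n";
    std::exit(EXIT_FAILURE);         // or abort(), according to taste
}

int main()
{
    std::set_new_handler(out_of_memory);
    // ... the rest of the quick little program ...
    return 0;
}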

> I, for one, would love to have a std::string class with an
> option for setting a maximum allowable length.

That's another issue. In many applications, it would, in fact,
be useful to be able to impose such a limit.

From: Hendrik Schober on
crhras <crhras(a)sbcglobal.net> wrote:
> > Wow ! I just used two different file IO methods and the performance
> > difference was huge.
>
> I should have mentioned that I'm using Borland Studio 2006 on Windows XP
> Pro. At this point, I think it might be caused by a flaw in the way
> getline( ) is implemented by Borland. I am going to post this question
> at borland.public.cppbuilder.language.cpp and if I discover anything
> there I'll post it here.

AFAIK Borland Studio now uses Dinkumware for a std
lib. Pete Becker, who already answered in this thread,
might have written this very function, so I doubt you
will get better answers in b.p.cb.l.cpp. It has been
pointed out to you several times that 'std::getline()'
does significantly more than 'fgets()' and that the
latter is more comparable to 'std::istream::getline()'.
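
To make the comparison concrete (the file name is just for
illustration): the member getline() fills a caller-supplied fixed
buffer, much like fgets(), while the free std::getline() manages a
growing std::string, and so may allocate:

#include <cstdio>
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    char buf[512];

    std::ifstream in1("input.txt");
    while (in1.getline(buf, sizeof buf))        // member getline: fixed buffer
        std::cout << buf << '\n';

    std::FILE* f = std::fopen("input.txt", "r");
    if (f)
    {
        while (std::fgets(buf, sizeof buf, f))  // fgets: also a fixed buffer,
            std::fputs(buf, stdout);            // but it keeps the '\n'
        std::fclose(f);
    }

    std::ifstream in2("input.txt");
    std::string line;
    while (std::getline(in2, line))             // free getline: growing string
        std::cout << line << '\n';

    return 0;
}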

> Thanks again for everyone's responses.

Schobi

--
SpamTrap(a)gmx.de is never read
I'm Schobi at suespammers dot org

"The sarcasm is mightier than the sword."
Eric Jarvis




From: crhras on


> will get better answers in b.p.cb.l.cpp. It has been
> pointed out to you several times that 'std::getline()'
> does significantly more than 'fgets()' and that the
> latter is more comparable to 'std::istream::getline()'.
>

You are correct. I didn't get better responses in b.p.cb.l.cpp.
And I understand that getline( ) does "significantly" more than
fgets( ). But does it do 4000 percent more?

Remember, my tests showed roughly 5 seconds for fgets( )
versus over 200 for getline( ).



From: kanze on
Earl Purple wrote:
> Hendrik Schober wrote:

> > That depends on 'std::getline()'s implementation,
> > the compiler's ability to inline/optimize, the memory
> > manager used and probably a lot more.

> In the example above he is re-using the same std::string. I
> would hope that getline called multiple times would attempt to
> use the string's already-allocated buffer if it has one, and
> therefore reallocations would only happen when you're reading
> a line that is longer than any you have encountered
> previously. If he initially reserves 512 in the string,
> assuming none of the lines are longer than that, then no
> reallocations would be necessary at all.

There is no way to access the string's already-allocated buffer
from outside the string, so this would require getline to be a
friend of std::string. I rather doubt that it is in most
implementations. (A quick grep on the preprocessor output of a
program which included <string> showed no friend in the g++
implementation.)

> Besides that, I would assume that both methods would write to
> a buffer before writing directly to the string / char-array.

I presume you are talking about the buffering done in filebuf.

> I would hope that it reads ahead to find the terminating '\n'
> character before checking the allocation length of the
> std::string.

I'm not sure how you propose to implement this. getline doesn't
have access to the internals of filebuf, any more than it does
to the internals of std::string. And the buffer in filebuf has
a fixed length anyway, so there's no guarantee that you'd find
the '\n' in it.
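
For reference, a simplified sketch of what the free getline() more or
less has to do (error handling and the max_size() check omitted): pull
characters one at a time through the streambuf:

#include <istream>
#include <streambuf>
#include <string>

std::istream& getline_sketch(std::istream& is, std::string& str,
                             char delim = '\n')
{
    typedef std::char_traits<char> traits;
    std::istream::sentry se(is, true);        // no whitespace skipping
    if (se)
    {
        str.erase();
        std::streambuf* sb = is.rdbuf();
        for (;;)
        {
            int c = sb->sbumpc();             // one character at a time
            if (c == traits::eof())
            {
                is.setstate(std::ios::eofbit);
                break;
            }
            if (traits::to_char_type(c) == delim)
                break;                        // delimiter extracted, not stored
            str += traits::to_char_type(c);
        }
        if (str.empty() && is.eof())
            is.setstate(std::ios::failbit);
    }
    return is;
}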

--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

