From: PerlFAQ Server on
This is an excerpt from the latest version of perlfaq5.pod, which
comes with the standard Perl distribution. These postings aim to
reduce the number of repeated questions as well as allow the community
to review and update the answers. The latest version of the complete
perlfaq is at http://faq.perl.org .

--------------------------------------------------------------------

5.29: How can I read in an entire file all at once?

Are you sure you want to read the entire file and store it in memory? If
you mmap the file, you can virtually load the entire file into a string
without actually storing it in memory:

use File::Map qw(map_file);

map_file my $string, $filename;

Once mapped, you can treat $string as you would any other string. Since
you don't actually load the data, mmap-ing is very fast and does not
increase your memory footprint.

If you really want to load the entire file, you can use the
"File::Slurp" module to do it in one step.

use File::Slurp;

my $all_of_it = read_file($filename); # entire file in scalar
my @all_lines = read_file($filename); # one line per element

The customary Perl approach for processing all the lines in a file is to
do so one line at a time:

open my $input, '<', $file or die "can't open $file: $!";
while (<$input>) {
    chomp;
    # do something with $_
}
close $input or die "can't close $file: $!";

This is tremendously more efficient than reading the entire file into
memory as an array of lines and then processing it one element at a
time, which is often--if not almost always--the wrong approach. Whenever
you see someone do this:

my @lines = <INPUT>;

You should think long and hard about why you need everything loaded at
once. It's just not a scalable solution. You might also find it more fun
to use the standard Tie::File module, or the DB_File module's $DB_RECNO
bindings, which allow you to tie an array to a file so that accessing an
element of the array actually accesses the corresponding line in the file.
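
For example, here is a minimal Tie::File sketch (the line index and the
replacement text are only illustrative):

use Tie::File;

tie my @lines, 'Tie::File', $file or die "can't tie $file: $!";
print $lines[9];               # reads only as far into the file as needed
$lines[0] = 'new first line';  # rewrites that line in place
untie @lines;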

You can read the entire filehandle contents into a scalar.

{
    local $/;
    open my $fh, '<', $file or die "can't open $file: $!";
    $var = <$fh>;
}

That temporarily undefs your record separator, and will automatically
close the file at block exit. If the file is already open, just use
this:

$var = do { local $/; <$fh> };

For ordinary files you can also use the read function.

read( $fh, $var, -s $fh );

The third argument uses the -s file test to get the byte size of the
file open on $fh, so read puts that many bytes into the buffer $var.



--------------------------------------------------------------------

The perlfaq-workers, a group of volunteers, maintain the perlfaq. They
are not necessarily experts in every domain where Perl might show up,
so please include as much relevant information as possible in any
corrections. The perlfaq-workers also don't have access to every
operating system or platform, so please include relevant details for
corrections to examples that do not work on particular platforms.
Working code is greatly appreciated.

If you'd like to help maintain the perlfaq, see the details in
perlfaq.pod.
From: Uri Guttman on

brian, here are some edits and comments for this faq:

>>>>> "PS" == PerlFAQ Server <brian(a)theperlreview.com> writes:

PS> 5.29: How can I read in an entire file all at once?

PS> Are you sure you want to read the entire file and store it in memory? If
PS> you mmap the file, you can virtually load the entire file into a string
PS> without actually storing it in memory:

Reading in an entire file at one time can be useful and more efficient
provided the file is small enough. On modern systems even a 1MB file can
be considered small, and almost all common text files (and many other
kinds of file) are under 1MB. Also, some files need to be processed as
whole entities (e.g. image formats) and are best loaded into a scalar.
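
For instance, an image or other binary file can be slurped whole into a
scalar in one call (a small sketch; the binmode option is File::Slurp's
and $image_file is just a placeholder):

use File::Slurp;

my $image_data = read_file( $image_file, binmode => ':raw' );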

PS> use File::Map qw(map_file);

PS> map_file my $string, $filename;

PS> Once mapped, you can treat $string as you would any other
PS> string. Since you don't actually load the data, mmap-ing is
PS> very fast and does not increase your memory footprint.

i disagree with that last point. mmap always needs virtual ram allocated
for the entire file to be mapped. it only saves ram if you map part of
the file into a smaller virtual window. the win of mmap is that it won't
do the i/o until you touch a section. so if you want random access to
sections of a file, mmap is a big win. if you are going to just process
the whole file, there isn't any real win over File::Slurp.
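
here is a rough sketch of the random access case i mean (the file name
and the offset/length are made up):

use File::Map qw(map_file);

map_file my $map, $huge_file;            # sets up the mapping, almost no i/o yet
my $chunk = substr $map, 500_000, 4096;  # only the pages you actually touch get read in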

PS> If you really want to load the entire file, you can use the
PS> "File::Slurp" module to do it in one step.

If you decide to load the entire file, you can use the "File::Slurp"
module to do it in one simple and efficient step.

PS> use File::Slurp;

PS> my $all_of_it = read_file($filename); # entire file in scalar
PS> my @all_lines = read_file($filename); # one line per element

PS> The customary Perl approach for processing all the lines in a file is to
PS> do so one line at a time:

PS> open my $input, '<', $file or die "can't open $file: $!";
PS> while (<$input>) {
PS> chomp;
PS> # do something with $_
PS> }
PS> close $input or die "can't close $file: $!";

PS> This is tremendously more efficient than reading the entire file into
PS> memory as an array of lines and then processing it one element at a
PS> time, which is often--if not almost always--the wrong approach. Whenever
PS> you see someone do this:

again, i disagree. you can easily benchmark slurping an array of lines
and looping vs line by line reading. the win with slurping (with
File::Slurp) is bypassing perl's i/o layer. the looping overhead is the
same and the ram overhead isn't so much for most files as i have said
above. also some parsing or regex stuff is MUCH faster with whole files
in ram. a single s///g done over a whole file in a scalar is way faster
than doing it over each line in a loop. parsing and munging whole files
can be much easier too as you can do multiline matches and such.
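
a quick sketch of the s///g point (the pattern and file name are made
up, and this is untested):

use File::Slurp;

# one s///g pass over the whole file in a scalar
my $text = read_file( $file );
$text =~ s/\bfoo\b/bar/g;

# versus firing up the regex engine once per line
open my $fh, '<', $file or die "can't open $file: $!";
while ( my $line = <$fh> ) {
    $line =~ s/\bfoo\b/bar/g;
    # do something with $line here
}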

here is a super fast way to read and parse a simple config file (key:
value lines):

use File::Slurp ;
my %config = read_file( $conf_file ) =~ /^(\w+):\s*(.+)$/mg ;

doing that line by line takes more code and is much slower as you need
to call the regex for each line.

PS> my @lines = <INPUT>;

PS> You can read the entire filehandle contents into a scalar.

PS> {
PS> local $/;
PS> open my $fh, '<', $file or die "can't open $file: $!";
PS> $var = <$fh>;
PS> }

PS> That temporarily undefs your record separator, and will automatically
PS> close the file at block exit. If the file is already open, just use
PS> this:

PS> $var = do { local $/; <$fh> };

you missed the coolest variant:

my $text = do { local( @ARGV, $/) = $file ; <> };

no open needed!

other than file::slurp not being in core (and it should be! :), there is
no reason to show the $/ = undef trick. it is always slower and more
obscure than calling read_file (which also does better error handling
and has more options).
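
for example (these options are from the File::Slurp docs as i remember
them, so check them against the version you have):

use File::Slurp;

my $text = read_file( $file, binmode => ':utf8', err_mode => 'carp' );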

PS> For ordinary files you can also use the read function.

PS> read( $fh, $var, -s $fh );

might as well use sysread as it is faster and has the same api. read is
almost never needed unless you are doing block reads on a file and
mixing in line reads (they share the perl stdio).
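
the same call with sysread (a quick untested sketch):

open my $fh, '<', $file or die "can't open $file: $!";
sysread( $fh, my $var, -s $fh );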

uri

--
Uri Guttman ------ uri(a)stemsystems.com -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
From: Eric Pozharski on
with <87mxu7j08z.fsf(a)quad.sysarch.com> Uri Guttman wrote:
>>>>>> "PS" == PerlFAQ Server <brian(a)theperlreview.com> writes:
*SKIP*
> PS> {
> PS> local $/;
> PS> open my $fh, '<', $file or die "can't open $file: $!";
> PS> $var = <$fh>;
> PS> }
>
> PS> That temporarily undefs your record separator, and will automatically
> PS> close the file at block exit. If the file is already open, just use
> PS> this:
>
> PS> $var = do { local $/; <$fh> };
>
> you missed the coolest variant:
>
> my $text = do { local( @ARGV, $/) = $file ; <> };
>
> no open needed!
>
> other than file::slurp not being in core (and it should be! :), there is
> no reason to show the $/ = undef trick. it is always slower and more
> obscure than calling read_file (which also does better error handling
> and has more options).
>
> PS> For ordinary files you can also use the read function.
>
> PS> read( $fh, $var, -s $fh );
>
> might as well use sysread as it is faster and has the same api. read is
> almost never needed unless you are doing block reads on a file and
> mixing in line reads (they share the perl stdio).

Please reconsider your 'always slower':

#!/usr/bin/perl

use strict;
use warnings;
use Benchmark qw{ cmpthese timethese };

use File::Slurp;
my $fname = '/etc/passwd';
read_file $fname;

cmpthese timethese -5, {
    code00 => sub { my $aa = read_file $fname; },
    code01 => sub { local $/; open my $fh, '<', $fname or die $!; my $aa = <$fh>; },
    code02 => sub { local( @ARGV, $/ ) = $fname; my $aa = <>; },
    code03 => sub { open my $fh, '<', $fname or die $!; defined read $fh, my $aa, -s $fh or die $!; },
    code04 => sub { open my $fh, '<', $fname or die $!; defined sysread $fh, my $aa, -s $fh or die $!; },
};

__END__
Benchmark: running code00, code01, code02, code03, code04 for at least 5 CPU seconds...
code00: 5 wallclock secs ( 3.34 usr + 2.01 sys = 5.35 CPU) @ 31214.95/s (n=167000)
code01: 5 wallclock secs ( 2.82 usr + 2.45 sys = 5.27 CPU) @ 41757.50/s (n=220062)
code02: 5 wallclock secs ( 2.58 usr + 2.68 sys = 5.26 CPU) @ 43446.01/s (n=228526)
code03: 5 wallclock secs ( 2.60 usr + 2.69 sys = 5.29 CPU) @ 47371.08/s (n=250593)
code04: 4 wallclock secs ( 2.36 usr + 3.02 sys = 5.38 CPU) @ 52458.92/s (n=282229)
           Rate code00 code01 code02 code03 code04
code00  31215/s     --   -25%   -28%   -34%   -40%
code01  41757/s    34%     --    -4%   -12%   -20%
code02  43446/s    39%     4%     --    -8%   -17%
code03  47371/s    52%    13%     9%     --   -10%
code04  52459/s    68%    26%    21%    11%     --

And that's for s{/etc/passwd}{/boot/vmlinuz}

Benchmark: running code00, code01, code02, code03, code04 for at least 5 CPU seconds...
code00: 5 wallclock secs ( 1.45 usr + 3.96 sys = 5.41 CPU) @ 223.84/s (n=1211)
code01: 5 wallclock secs ( 2.08 usr + 3.06 sys = 5.14 CPU) @ 365.18/s (n=1877)
code02: 6 wallclock secs ( 2.16 usr + 3.00 sys = 5.16 CPU) @ 366.28/s (n=1890)
code03: 6 wallclock secs ( 2.12 usr + 3.24 sys = 5.36 CPU) @ 372.20/s (n=1995)
code04: 5 wallclock secs ( 0.12 usr + 5.16 sys = 5.28 CPU) @ 583.14/s (n=3079)
         Rate code00 code01 code02 code03 code04
code00  224/s     --   -39%   -39%   -40%   -62%
code01  365/s    63%     --    -0%    -2%   -37%
code02  366/s    64%     0%     --    -2%   -37%
code03  372/s    66%     2%     2%     --   -36%
code04  583/s   161%    60%    59%    57%     --

Although, brian, please consider cleaning up that entry a bit. Those
who can read would find their way; those who can't wouldn't read that
anyway.

--
Torvalds' goal for Linux is very simple: World Domination
Stallman's goal for GNU is even simpler: Freedom
From: Uri Guttman on
>>>>> "EP" == Eric Pozharski <whynot(a)pozharski.name> writes:

EP> Please reconsider your 'always slower':

try the pass by scalar reference method of read_file. and check out the
much more comprehensive benchmark script that comes with the module. and
that was also redone in an unreleased version you can find on git at
perlhunter.com/git. for one thing it uses better names so you can see
what the results mean.
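
something like this (from memory, so check the File::Slurp docs):

use File::Slurp;

my $text_ref = read_file( $fname, scalar_ref => 1 );  # returns a reference to the contents
read_file( $fname, buf_ref => \my $buf );             # or fill a buffer you already own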

uri

--
Uri Guttman ------ uri(a)stemsystems.com -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
From: brian d foy on
In article <87mxu7j08z.fsf(a)quad.sysarch.com>, Uri Guttman
<uri(a)StemSystems.com> wrote:


> other than file::slurp not being in core (and it should be! :), there is
> no reason to show the $/ = undef trick.

That's a pretty big reason though.