From: Ilya Zakharevich on
On 2010-07-23, Uri Guttman <uri(a)StemSystems.com> wrote:
> mmap still needs space in the program. it may be allocated with malloc
> or even builtin these days (haven't used it directly in decades! :). now
> real ram could be saved but that is true for all virtual memory use. if
> you seek into the mmap space and only read/write parts, then the other
> sections won't be touched. so the issue comes down to random access vs
> processing a whole file. most uses of slurp are for processing a whole
> file so i would lean in that direction. someone sophisticated enough to
> use mmap directly for random access should know the resource usage issues.

I do not see it mentioned in this discussion that (a good
implementation of) mmap() also semi-unmaps-when-needed. So as long as
you have enough *virtual* memory, mmap() behaves as a "smartish"
intermediate ground between reading-by-line and slurping. And it
"almost scales"; the limit is the virtual memory, so on 64-bit systems
it might even "absolutely scale".

Of course, this can severely limit the amount of free physical memory
on the computer, so it may make life harder for other programs AND
decrease disk caching. However, if YOUR program is the only one on
the CPU, and THIS disk access is the only one in question, mmap() has a
chance to be a clear win...
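
A rough sketch of this in Perl, using the CPAN module File::Map (one of
several mmap wrappers; Sys::Mmap is another) and an invented file name:

    use strict;
    use warnings;
    use File::Map qw(map_file);

    my $file = 'huge.log';             # hypothetical multi-GB file
    map_file my $map, $file, '<';      # read-only mapping, nothing is copied

    # Random access: only the pages covering this slice get faulted in.
    my $chunk = substr $map, 1_000_000, 4096;

    # Whole-file processing works too, much like a slurped scalar,
    # without ever holding more than the touched pages in RAM.
    my $errors = 0;
    $errors++ while $map =~ /^ERROR\b/mg;

    print "slice length: ", length($chunk), ", ERROR lines: $errors\n";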

Yours,
Ilya
From: Peter J. Holzer on
On 2010-07-23 22:15, Uri Guttman <uri(a)StemSystems.com> wrote:
>>>>>> "TW" == Tim Watts <tw(a)dionic.net> writes:
>
> TW> Uri Guttman <uri(a)StemSystems.com>
> TW> wibbled on Sunday 04 July 2010 06:15
>
> >> i disagree with that last point. mmap always needs virtual ram allocated
> >> for the entire file to be mapped. it only saves ram if you map part of
> >> the file into a smaller virtual window. the win of mmap is that it won't
> >> do the i/o until you touch a section. so if you want random access to
> >> sections of a file, mmap is a big win. if you are going to just process
> >> the whole file, there isn't any real win over File::Slurp
>
> TW> I think it is worth some clarification - at least under linux:
> TW> mmap requires virtual address space, not RAM per se, for the
> TW> initial mmap.
>
> TW> Obviously as soon as you try to read any part of the file, those
> TW> blocks must be paged in to actual RAM pages.
>
> TW> However, if you then ignore those pages and have not modified
> TW> them, the LRU recovery sweeper can just drop those pages.
>
> but a slurped file in virtual ram behaves the same way. it may be
> swapped in when you read in the file and process it but as soon as that
> is done, and you free the scalar in perl, perl can reuse the space.

Well, *if* you free it. The nice thing about mmap is that RAM can be
reused even if you don't free it.

> the virtual ram can't be given back to the os

That depends on the malloc implementation. GNU malloc uses heap-based
allocation only for small chunks (less than 128 kB by default, I think),
but mmap-based allocation for larger chunks. So for a scalar larger than
128 kB, the space can and will be given back to the OS.
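
At the Perl level, a minimal sketch of the consequence (assuming a perl
built against glibc malloc rather than perl's own; the threshold and the
exact behaviour depend on the allocator):

    use strict;
    use warnings;

    my $big = 'x' x (10 * 1024 * 1024);  # ~10 MB string, far above the
                                         # ~128 kB threshold, so glibc backs
                                         # it with a private mmap
    # ... process $big ...
    undef $big;                          # the buffer is freed, the mmap'd
                                         # chunk is unmapped, and the pages
                                         # go straight back to the OS

    my $small = 'x' x 1000;              # small chunks come from the heap;
    undef $small;                        # freeing them returns space to the
                                         # allocator, not necessarily the OS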

> but the real ram is reused.


> TW> Compare to if you slurp the file into some virtual RAM that's been malloc'd:
>
> TW> The RAM pages are all dirty (because you copied data into them) -
> TW> so if the system needs to reduce the working page set, it will
> TW> have to page those out to swap rather than just dropping them - it
> TW> no longer has the knowledge that they are in practice backed by
> TW> the original file.
>
> that is true. the readonly aspect of a mmap slurp is a win. but given
> the small sizes of most files slurped it isn't that large a win.

Yes. Mmap is only a win for large files. And I suspect "large" means
really large - somewhere on the same order as available RAM.
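
The contrast in code, using File::Slurp and File::Map from CPAN with an
invented file name (a sketch of the paging behaviour, not a benchmark):

    use strict;
    use warnings;
    use File::Slurp qw(read_file);
    use File::Map   qw(map_file);

    my $file = 'big.dat';    # hypothetical file on the order of RAM size

    # Slurp: the whole file is copied into anonymous memory. Those pages
    # are dirty, so under memory pressure they have to be written to swap.
    my $copy = read_file($file);

    # Map: the pages stay file-backed, read-only and clean, so the kernel
    # can simply drop them and re-read them from the file later.
    map_file my $map, $file, '<';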

> today we have 4k or larger page sizes and many files are smaller than
> that. ram and vram are cheap as hell so fighting for each byte is a
> long lost art that needs to die. :)

I wish Perl would fight for each byte at the low level. The overhead for
each scalar, array element or hash element is enormous, and these really
add up if you have enough of them.
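
Devel::Size from CPAN makes that easy to see; a small sketch (the exact
numbers vary with perl version and build):

    use strict;
    use warnings;
    use Devel::Size qw(total_size);

    my $str   = 'x' x 100;                 # 100 bytes of payload
    my @array = ('abcdefghij') x 1000;     # 10 kB of payload in 1000 elements
    my %hash;
    $hash{$_} = 'abcdefghij' for 1 .. 1000;

    printf "100-byte scalar: %8d bytes\n", total_size(\$str);
    printf "10 kB in array:  %8d bytes\n", total_size(\@array);
    printf "10 kB in hash:   %8d bytes\n", total_size(\%hash);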

hp


From: Peter J. Holzer on
On 2010-07-25 13:35, Tim Watts <tw(a)dionic.net> wrote:
> Uri Guttman <uri(a)StemSystems.com>
> wibbled on Friday 23 July 2010 23:15
>> that is true. the readonly aspect of a mmap slurp is a win. but given
>> the small sizes of most files slurped it isn't that large a win.
>
> Yes that would be true of small files.
>
> But what if you're dealing with 1GB files or just multi-MB files? This is
> extremely likely if you are processing video or scientific data (ignoring
> the fact that you probably wouldn't be using perl for either!)

Perl was used in the Human Genome Project.

hp, who also routinely processes files in the range of a few GB.
From: Uri Guttman on
>>>>> "TW" == Tim Watts <tw(a)dionic.net> writes:

TW> Uri Guttman <uri(a)StemSystems.com>
TW> wibbled on Friday 23 July 2010 23:15


>> that is true. the readonly aspect of a mmap slurp is a win. but given
>> the small sizes of most files slurped it isn't that large a win. today
>> we have 4k or larger page sizes and many files are smaller than
>> that. ram and vram are cheap as hell so fighting for each byte is a long
>> lost art that needs to die. :)

TW> Yes that would be true of small files.

TW> But what if you're dealing with 1GB files or just multi-MB files?
TW> This is extremely likely if you are processing video or
TW> scientific data (ignoring the fact that you probably wouldn't be
TW> using perl for either!)

and your point is?

and someone else pointed out that perl was and is used for genetic
work. ever heard of bioperl? it is a very popular package for
biogenetics. look for the article about perl saving the human genome
project (that was done by the author of cgi.pm!). of course those
systems don't slurp in those enormous data files. but they can always
slurp in the smaller (for some definition of smaller) config, control,
and other files.
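
something like this minimal sketch (file names invented for the example):

    use strict;
    use warnings;
    use File::Slurp qw(read_file);

    # small config/control file: slurping it is simple and cheap
    my $config = read_file('analysis.conf');

    # enormous data file: stream it line by line so memory stays flat
    open my $fh, '<', 'reads.fasta' or die "reads.fasta: $!";
    while (my $line = <$fh>) {
        # process one record at a time
    }
    close $fh;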

uri

--
Uri Guttman ------ uri(a)stemsystems.com -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
From: Uri Guttman on
>>>>> "TW" == Tim Watts <tw(a)dionic.net> writes:

TW> BTW - I am surprised the genome project was done in perl. I
TW> *would* have thought, even from a perl fanboi perspective, that C
TW> would have been somewhat faster and the amount of data would have
TW> made it worth optimising the project even at the expense of
TW> simplicity. I shall have to read up on that.

the article i referred to can likely be found. it wasn't that the whole
project was done in perl. the issue was that worldwide they ended up
with about 14 different data formats and couldn't share data with each
other. so this one guy (as i said, the author of cgi.pm and several perl
books) wrote modules to convert each format to/from a common format,
which allowed full sharing of data. that 'saved' the project from its
babel hell. since then, perl has been a major language in biogen, both
for having bioperl and for its great string and regex support. c sucks
for both of those, and its faster run speed loses out to perl's much
better development time.

uri

--
Uri Guttman ------ uri(a)stemsystems.com -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------