File Position [Perl]

Prev: $^T is not working as expected from mod_perl
Next: FAQ 2.5 I grabbed the sources and tried to compile but gdbm/dynamic loading/malloc/linking/... failed. How do I make it work?

From: mud_saisem on 18 Feb 2010 01:12

On Feb 18, 4:56 pm, Jürgen Exner <jurge...(a)hotmail.com> wrote:
> mud_saisem <mud_sai...(a)hotmail.com> wrote:
> >Does anybody know how to read through a file searching for a word and
> >printing the file position of that word ?
>
> Please define 'position': are you talking about characters or bytes?
>
> Just slurp the whole file into a string and then use index() to get the
> position of the desired word in that string.
> This is very straight-forward and unless you are dealing with
> exceptionally large files (GB size) or unusual distribution of your
> 'word' (almost always very early in the file) probably also faster than
> any looping line by line or chunk by chunk.
>
> jue

The log files that I will be scanning range from 500 MB to 5 GB,
so loading the whole file into memory is not an option.

What I meant by position was, if I am looking for a word like
"slurp" (from your paragraph), it should tell me where in the file the
word is, so that I can use the seek function and jump directly to the
position in the file where the word "slurp" is.
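
For files too large to slurp, one option is to read fixed-size chunks and carry a short tail from one chunk to the next, so a word that straddles a chunk boundary is still found. The following is only a sketch under stated assumptions: the sub name `find_byte_offsets` and the chunk size are hypothetical, the search word is already a byte string, and the handle reads raw bytes (for a real file, open it with `'<:raw'`).

```perl
use strict;
use warnings;

# Hypothetical helper: return the byte offset of every occurrence of
# $needle (a byte string) in $fh, reading $chunk_size bytes at a time.
sub find_byte_offsets {
    my ($fh, $needle, $chunk_size) = @_;
    die "empty search string" unless length $needle;
    $chunk_size ||= 1 << 20;            # assumed default: 1 MiB per read
    my $keep = length($needle) - 1;     # tail bytes carried to the next chunk
    my ($buf, $base, @hits) = ('', 0);  # $base = file offset of $buf's first byte

    while (read($fh, my $chunk, $chunk_size)) {
        $buf .= $chunk;
        my $from = 0;
        while ((my $i = index($buf, $needle, $from)) >= 0) {
            push @hits, $base + $i;
            $from = $i + 1;
        }
        # Drop everything except the tail that could begin a match
        # spanning this chunk and the next one.
        my $drop = length($buf) - $keep;
        if ($drop > 0) {
            substr($buf, 0, $drop, '');
            $base += $drop;
        }
    }
    return @hits;
}

# Demo on an in-memory file with a deliberately tiny chunk size,
# so that matches straddle chunk boundaries.
my $data = "xx slurp yy slurp";
open my $fh, '<', \$data or die "can't open memory file: $!";
my @offsets = find_byte_offsets($fh, 'slurp', 4);
close $fh;
print "@offsets\n";    # prints: 3 12
```

Because the carried tail is always shorter than the search word, a match can never be reported twice, and memory use stays at roughly one chunk regardless of file size.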

From: Jürgen Exner on 18 Feb 2010 03:08

mud_saisem <mud_saisem(a)hotmail.com> wrote:
>On Feb 18, 4:56 pm, Jürgen Exner <jurge...(a)hotmail.com> wrote:
>> mud_saisem <mud_sai...(a)hotmail.com> wrote:
>> >Does anybody know how to read through a file searching for a word and
>> >printing the file position of that word ?
>>
>> Please define 'position': are you talking about characters or bytes?
[...]
>What I meant by position was, if I am looking for a word like
>"slurp" (from your paragraph), it should tell me where in the file the
>word is,

That is not any more specific than your first request. It could still
be bytes or characters.

>so that I can use the seek function

Now, that is the critical clue. seek() is based on bytes, so you need a
position in bytes in order to use seek().
Position in characters would do you no good and therefore my suggestion
with index() wouldn't do you any good, either, because it returns the
position in characters. As does the suggestion from Peter Makholm. His
regular expression search is character-based, too, therefore it will not
return the byte-based position that you need for seek().
That is unless your file is in a single-byte character set, of course,
but you didn't say.

jue
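
The difference between the two kinds of position can be seen in a small sketch. It assumes the text is UTF-8 on disk, and the sub name `char_and_byte_offsets` is made up for illustration: index() on the decoded string counts characters, while index() on the encoded octets gives the byte position that seek() needs.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return (character offset, byte offset) of $word
# in $text, where the byte offset assumes a UTF-8 encoded file.
sub char_and_byte_offsets {
    my ($text, $word) = @_;
    my $char_pos = index($text, $word);
    return if $char_pos < 0;
    # Searching the encoded octets yields a position usable with seek().
    my $byte_pos = index(encode('UTF-8', $text), encode('UTF-8', $word));
    return ($char_pos, $byte_pos);
}

# "\x{EF}" is i-diaeresis: one character, but two bytes in UTF-8.
my ($chars, $bytes) = char_and_byte_offsets("na\x{EF}ve slurp", "slurp");
print "character offset: $chars, byte offset: $bytes\n";
# prints: character offset: 6, byte offset: 7
```

In a single-byte character set the two numbers would be identical, which is why the distinction only matters for variable-width encodings.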

From: Randal L. Schwartz on 18 Feb 2010 08:38

>>>>> "Jürgen" == Jürgen Exner <jurgenex(a)hotmail.com> writes:

Jürgen> Now, that is the critical clue. seek() is based on bytes, so you need a
Jürgen> position in bytes in order to use seek().

Historical fact: fseek(3) was originally based on ftell(3)-"cookies", where
the stdio lib didn't promise to be able to return to any position that it
hadn't originally handed you from a tell. As it turns out, those "cookies"
were always byte positions on every operating system *I* saw stdio implemented
on.

print "Just another Perl hacker,";

--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn(a)stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Smalltalk/Perl/Unix consulting, Technical writing, Comedy, etc. etc.
See http://methodsandmessages.vox.com/ for Smalltalk and Seaside discussion

From: sln on 18 Feb 2010 11:13

On Thu, 18 Feb 2010 06:22:51 +0100, Peter Makholm <peter(a)makholm.net> wrote:

>mud_saisem <mud_saisem(a)hotmail.com> writes:
>
>> Does anybody know how to read through a file searching for a word and
>> printing the file position of that word ?
>
>If your file contains plain ASCII, ISO-8859, or another 8-bit charset
>it should be easy. The tell() function gives you the current location
>in the file, pos() gives you the location of a regexp match, and
>index() directly gives you the location.
>
>So this should work (untested, though)
>
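>    my $offset = 0;
>    while (<$fh>) {
>        while (/word/g) {
>            # $-[0] is the start of the match within the line;
>            # pos() is only set under /g and marks the end of the match
>            say "Found 'word' at location ", $offset + $-[0];
>        }
>        $offset = tell $fh;
>    }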
>
>If your file contains a variable-width Unicode encoding (like UTF-8) it
>gets a lot harder.
^^^
But probably not impossible.

-sln

------------------------
use strict;
use warnings;
use Encode;

binmode(STDOUT, ':encoding(UTF-8)');

my $word = "wo\x{2100}rd";
my $octet_search = encode('UTF-8', $word);
my @FileLocations = ();

my $filedata = encode ('UTF-8', "
This $word \x{2100} is a $word puzzle
It is not in this line,
but $word is in this one.
End.
");

open my $fh, '<', \$filedata or die "can't open memory file: $!";

my $linelength = 0;
print "\n";

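while (<$fh>)
{
    my $octet_dataline = $_;
    # \Q...\E keeps any regex metacharacters in the search bytes literal
    while ( /(\Q$octet_search\E)/g )
    {
        # pos() is the end of the match within this line, so subtract the
        # match length and add the bytes consumed by all previous lines
        my ($byte_offset, $byte_len) = (
            $linelength + pos() - length($octet_search),
            length $1
        );
        print "Found $word at $byte_offset\n";
        print "Byte length is $byte_len, byte string is '$1'\n";
        push @FileLocations, $byte_offset, $byte_len;
    }
    $linelength += length ($octet_dataline);
}
close $fh;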

# To reconstitute,
# seek to the offsets, and read length bytes
#
print "\nFile offset/length's:\n";
while (my ($offset,$len) = splice(@FileLocations, 0,2)) {
    print "$offset, $len\n";
}

__END__

Found wo℀rd at 7
Byte length is 7, byte string is 'wo+�-�-�rd'
Found wo℀rd at 24
Byte length is 7, byte string is 'wo+�-�-�rd'
Found wo℀rd at 69
Byte length is 7, byte string is 'wo+�-�-�rd'

File offset/length's:
7, 7
24, 7
69, 7
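
The "reconstitute" step mentioned in the comments above (seek to the offset, read that many bytes) might look like the following. This is a sketch rather than part of the posted script: the sub name `read_word_at` and the sample data are made up for illustration, and the data is assumed to be UTF-8.

```perl
use strict;
use warnings;
use Fcntl qw(:seek);
use Encode qw(encode decode);

# Hypothetical helper: seek to a byte offset, read $len bytes, and
# decode them from UTF-8 back into a character string.
sub read_word_at {
    my ($fh, $offset, $len) = @_;
    seek($fh, $offset, SEEK_SET) or die "seek failed: $!";
    my $got = read($fh, my $bytes, $len);
    die "read failed: $!" unless defined $got;
    die "short read: wanted $len bytes, got $got" unless $got == $len;
    return decode('UTF-8', $bytes);
}

# Sample data: "abc " is 4 bytes, and the word is 7 bytes in UTF-8
# (four ASCII letters plus the 3-byte encoding of U+2100).
my $word = "wo\x{2100}rd";
my $data = encode('UTF-8', "abc $word def\n");
open my $fh, '<', \$data or die "can't open memory file: $!";
binmode(STDOUT, ':encoding(UTF-8)');
print read_word_at($fh, 4, 7), "\n";
close $fh;
```

Decoding only after the read keeps all offset arithmetic in bytes, which is what seek() and read() operate on when the handle has no encoding layer.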

From: Ted Zlatanov on 18 Feb 2010 12:10

On Wed, 17 Feb 2010 20:21:19 -0800 (PST) mud_saisem <mud_saisem(a)hotmail.com> wrote:

ms> Does anybody know how to read through a file searching for a word and
ms> printing the file position of that word ?

Besides the great Perl solutions posted here, you may want to consider
`grep -b', which prints the byte offset of each matching line (with GNU
grep, adding -o gives the offset of the match itself), depending on
your needs of course.

Ted
