From: mud_saisem on
On Feb 18, 4:56 pm, Jürgen Exner <jurge...(a)hotmail.com> wrote:
> mud_saisem <mud_sai...(a)hotmail.com> wrote:
> >Does anybody know how to read through a file searching for a word and
> >printing the file position of that word ?
>
> Please define 'position': are you talking about characters or bytes?
>
> Just slurp the whole file into a string and then use index() to get the
> position of the desired word in that string.
> This is very straight-forward and unless you are dealing with
> exceptionally large files (GB size) or unusual distribution of your
> 'word' (almost always very early in the file) probably also faster than
> any looping line by line or chunk by chunk.
>
> jue

The log files that I will be scanning range from 500 MB to 5 GB,
so reading the whole file into memory is not an option.

What I meant about position was, if I am looking for a word like
"slurp" (from your paragraph), it should tell me where in the file the
word is, so that I can use the seek function and jump directly to the
position in the file where the word "slurp" is.
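
For files this large, one option is to scan in fixed-size chunks and track the byte offset by hand. The following is only a sketch under stated assumptions: find_byte_offsets is a hypothetical helper name, the search word is given as a non-empty byte string, and the file is opened with the :raw layer so that index() counts bytes rather than characters.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the byte offset of every
# occurrence of $needle in the file $path, reading $chunk_size bytes at a time.
sub find_byte_offsets {
    my ($path, $needle, $chunk_size) = @_;
    die "needle must not be empty" unless length $needle;
    $chunk_size ||= 1 << 20;            # default: 1 MiB per read
    my $keep = length($needle) - 1;     # tail kept so matches spanning two chunks are seen

    open my $fh, '<:raw', $path or die "can't open $path: $!";
    my @offsets;
    my $buf  = '';
    my $base = 0;                       # file offset of the first byte in $buf
    while (read($fh, my $chunk, $chunk_size)) {
        $buf .= $chunk;
        my $from = 0;
        while ((my $i = index($buf, $needle, $from)) >= 0) {
            push @offsets, $base + $i;
            $from = $i + 1;
        }
        # Discard everything except the last $keep bytes of the buffer.
        my $drop = length($buf) - $keep;
        if ($drop > 0) {
            substr($buf, 0, $drop, '');
            $base += $drop;
        }
    }
    close $fh;
    return @offsets;
}
```

Each returned offset can be passed straight to seek($fh, $offset, 0) on a handle opened on the same file. Memory use stays near one chunk plus the length of the word, whatever the file size.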
From: Jürgen Exner on
mud_saisem <mud_saisem(a)hotmail.com> wrote:
>On Feb 18, 4:56 pm, Jürgen Exner <jurge...(a)hotmail.com> wrote:
>> mud_saisem <mud_sai...(a)hotmail.com> wrote:
>> >Does anybody know how to read through a file searching for a word and
>> >printing the file position of that word ?
>>
>> Please define 'position': are you talking about characters or bytes?
[...]
>What I meant about position was, if i am looking for a word like
>"slurp" (from your paragraph), it should tell me where in the file the
>word is,

That is not any more specific than your first request. It could still
be bytes or characters.

>so that I can use the seek function

Now, that is the critical clue. seek() is based on bytes, so you need a
position in bytes in order to use seek().
Position in characters would do you no good and therefore my suggestion
with index() wouldn't do you any good, either, because it returns the
position in characters. As does the suggestion from Peter Makholm. His
regular expression search is character-based, too, therefore it will not
return the byte-based position that you need for seek().
That is unless your file is in a single-byte character set, of course,
but you didn't say.
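
To make the byte/character distinction concrete, here is a small sketch with made-up sample text (not the poster's data). It runs index() once on a decoded character string and once on the UTF-8 octets of the same text. On the octets, index() counts bytes, so reading the file without a decoding layer yields positions that seek() can use.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "\x{e9}" (e with acute accent) is one character but two bytes in UTF-8,
# so everything after it has a byte offset one higher than its character offset.
my $chars = "caf\x{e9} slurp";          # character string (illustrative sample)
my $bytes = encode('UTF-8', $chars);    # the octets as they would be stored on disk

my $char_pos = index($chars, 'slurp');  # counts characters
my $byte_pos = index($bytes, 'slurp');  # counts bytes; this is what seek() expects

print "character offset: $char_pos\n";  # prints 5
print "byte offset:      $byte_pos\n";  # prints 6
```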

jue
From: Randal L. Schwartz on
>>>>> "Jürgen" == Jürgen Exner <jurgenex(a)hotmail.com> writes:

Jürgen> Now, that is the critical clue. seek() is based on bytes, so you need a
Jürgen> position in bytes in order to use seek().

Historical fact: fseek(3) was originally based on ftell(3)-"cookies", where
the stdio lib didn't promise to be able to return to any position that it
hadn't originally handed you from a tell. As it turns out, those "cookies"
were always byte positions on every operating system *I* saw stdio implemented
on.

print "Just another Perl hacker,";

--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn(a)stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Smalltalk/Perl/Unix consulting, Technical writing, Comedy, etc. etc.
See http://methodsandmessages.vox.com/ for Smalltalk and Seaside discussion
From: sln on
On Thu, 18 Feb 2010 06:22:51 +0100, Peter Makholm <peter(a)makholm.net> wrote:

>mud_saisem <mud_saisem(a)hotmail.com> writes:
>
>> Does anybody know how to read through a file searching for a word and
>> printing the file position of that word ?
>
>If your file contains plain ASCII, ISO-8859, or another 8-bit charset
>it should be easy. The tell() function gives you the current location
>in the file, pos() gives you the location of regexp match, and
>index() directly gives you the location.
>
>So this should work (untested though)
>
>  my $offset = 0;
>  while (<$fh>) {
>      if (/word/) {
>          # $-[0] is the start of the match; pos() is only set by //g
>          say "Found 'word' at location ", $offset + $-[0];
>      }
>      $offset = tell $fh;
>  }
>
>If your file contains a variable-width Unicode encoding (like UTF-8) it
>gets a lot harder.
^^^
But probably not impossible.

-sln

------------------------
use strict;
use warnings;
use Encode;

binmode(STDOUT, ':encoding(UTF-8)');

my $word = "wo\x{2100}rd";
my $octet_search = encode('UTF-8', $word);   # search term as UTF-8 octets
my @FileLocations = ();

# Each data line begins with a single space; the offsets in the output
# below count it.
my $filedata = encode ('UTF-8', "
 This $word \x{2100} is a $word puzzle
 It is not in this line,
 but $word is in this one.
 End.
");

open my $fh, '<', \$filedata or die "can't open memory file: $!";

my $linelength = 0;   # total bytes in all lines read so far
print "\n";

while (<$fh>)
{
    my $octet_dataline = $_;
    # \Q...\E keeps any regex metacharacters in the search term literal
    while ( /(\Q$octet_search\E)/g )
    {
        my ($byte_offset, $byte_len) = (
            $linelength + pos() - length($octet_search),
            length $1
        );
        print "Found $word at $byte_offset\n";
        print "Byte length is $byte_len, byte string is '$1'\n";
        push @FileLocations, $byte_offset, $byte_len;
    }
    $linelength += length ($octet_dataline);
}
close $fh;

# To reconstitute,
# seek to the offsets, and read length bytes
#
print "\nFile offset/length's:\n";
while (my ($offset,$len) = splice(@FileLocations, 0,2)) {
    print "$offset, $len\n";
}

__END__


Found wo℀rd at 7
Byte length is 7, byte string is 'wo+�-�-�rd'
Found wo℀rd at 24
Byte length is 7, byte string is 'wo+�-�-�rd'
Found wo℀rd at 69
Byte length is 7, byte string is 'wo+�-�-�rd'

File offset/length's:
7, 7
24, 7
69, 7
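
The "seek to the offsets, and read length bytes" step mentioned in the comment can be sketched as follows. This is an illustration only: the sample data and the offset/length pairs are made up for this example and are not the values printed above, and an in-memory file stands in for a real one.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

binmode(STDOUT, ':encoding(UTF-8)');

# Stand-in for a real file: UTF-8 octets held in a scalar (illustrative data).
my $filedata = encode('UTF-8', "abc wo\x{2100}rd def wo\x{2100}rd\n");
my @pairs = (4, 7, 16, 7);   # byte offset/length pairs for this sample data

open my $fh, '<:raw', \$filedata or die "can't open memory file: $!";
my @found;
while (my ($offset, $len) = splice(@pairs, 0, 2)) {
    seek($fh, $offset, 0) or die "seek failed: $!";   # whence 0: from start of file
    my $got = read($fh, my $octets, $len);
    die "short read at $offset" unless defined $got && $got == $len;
    push @found, decode('UTF-8', $octets);            # octets back to characters
}
close $fh;

print "$_\n" for @found;
```

Decoding only after the read keeps all file positioning in bytes, which is the unit seek() and read() work in on a raw handle.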

From: Ted Zlatanov on
On Wed, 17 Feb 2010 20:21:19 -0800 (PST) mud_saisem <mud_saisem(a)hotmail.com> wrote:

ms> Does anybody know how to read through a file searching for a word and
ms> printing the file position of that word ?

Besides the great Perl solutions posted here, you may want to consider
`grep -b' which will print the byte offset of each match, depending on
your needs of course.

Ted