From: sln on
On Thu, 18 Feb 2010 08:13:22 -0800, sln(a)netherlands.com wrote:

>On Thu, 18 Feb 2010 06:22:51 +0100, Peter Makholm <peter(a)makholm.net> wrote:
>
>>mud_saisem <mud_saisem(a)hotmail.com> writes:
>>
>>> Does anybody know how to read through a file searching for a word and
>>> printing the file position of that word ?
>>
>>If your file contains plain ascii, iso-8859, or another 8bit charset
>>it should be easy. The tell() function gives you the current location
>>in the file, pos() gives you the location of regexp match, and
>>index() directly gives you the location.
>>
>>So this should work (untested though)
>>
>>   my $offset = 0;
>>   while (<$fh>) {
>>       if (/word/) {
>>           say "Found 'word' at location ", $offset + pos();
>>       }
>>       $offset = tell $fh;
>>   }
>>
>>If your file contains a variable-width Unicode encoding (like utf-8) it
>>gets a lot harder.
> ^^^
>But probably not impossible.
>
>-sln
>
I guess I'll keep this around as a curiosity, since I don't know
the particulars of how (or if) Perl auto-promotes byte strings
to utf8 during regex matching.

When I try it out on different encodings, it seems to work.
The only problem is a BOM (byte order mark), if the file has one:
the offset would need adjusting because of the bom/seek bug.
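
For what it's worth, here is a minimal sketch of the byte arithmetic for skipping a BOM: look at the first bytes of the file in byte mode and start scanning just past them. It is not a fix for the bom/seek bug itself. The sub name bom_length and the in-memory test data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the number of BOM bytes at the start
# of a byte string (0 if none). The 4-byte BOMs are checked first,
# because the UTF-32LE BOM begins with the UTF-16LE one.
sub bom_length {
    my ($head) = @_;
    return 4 if $head =~ /\A(?:\x00\x00\xFE\xFF|\xFF\xFE\x00\x00)/;  # UTF-32BE / UTF-32LE
    return 3 if $head =~ /\A\xEF\xBB\xBF/;                            # UTF-8
    return 2 if $head =~ /\A(?:\xFE\xFF|\xFF\xFE)/;                   # UTF-16BE / UTF-16LE
    return 0;
}

# Demo on an in-memory "file": a UTF-8 BOM followed by plain text.
my $data = "\xEF\xBB\xBFfind the word here\n";
open my $fh, '<', \$data or die "Can't open memory buffer: $!";

read($fh, my $head, 4);          # peek at up to 4 bytes
my $skip = bom_length($head);
seek($fh, $skip, 0);             # reposition just past the BOM

my $filepos = tell($fh);         # reported offsets still count the BOM bytes
while (my $line = <$fh>) {
    while ($line =~ /word/g) {
        my $start = $filepos + pos($line) - length('word');
        print "Found 'word' at byte offset $start\n";
    }
    $filepos = tell($fh);
}
close $fh;
```

One caveat: FF FE 00 00 is ambiguous between a UTF-32LE BOM and a UTF-16LE BOM followed by a NUL character, so the check order above is a guess that favours UTF-32LE.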

Depending on the OS, a given endianness won't map correctly to utf8.
For this reason I left out the 16/32 LE encodings, because the script
prints to STDOUT, which is binmode'd to UTF-8. Otherwise, all the
endiannesses work as far as getting offsets goes.

Same real estate, different code.
Btw, this may be a much faster way to run a regex over Unicode data.
Running regular expressions on a very large file opened in utf-8
mode slows the regex engine down significantly (by several orders
of magnitude).

-sln
--------------------
# Rx_Bytes_Unicode_misc1.pl
# -sln, 2/10
use strict;
use warnings;
use Encode;

binmode(STDOUT, ':encoding(UTF-8)');

## Try some encodings
#
for my $UTF ('ascii', 'UTF-8', 'UTF-16BE', 'UTF-32BE')
{
    ## Create pattern in encoded bytes
    #
    my $word = "wo\x{2100}rd";
    my $octet_pattern = encode($UTF, $word."|End|one");

    print "\n",'-'x20,"\nEncoding: $UTF\nPattern: '$octet_pattern'\n";

    ## Create file data in encoded bytes
    #
    my $filedata = encode ($UTF,
"This $word \x{2100} is a $word puzzle
It is not in this line,
but $word is in this one.
The End."
    );

    ## Open a memory buffer in byte mode
    #
    open my $fh, '<', \$filedata
        or die "Can't open memory buffer for read: $!";
    print "\n";

    ## Process file data
    #
    my @FileLocations = ();
    my ($filepos, $line_count, $byte_offset, $byte_len) = (0,0);

    while (<$fh>)
    {
        ++$line_count;
        while ( /($octet_pattern)/g )
        {
            $byte_len = length $1;
            $byte_offset = $filepos + pos() - $byte_len;

            print "(line $line_count) Found '",decode($UTF,$1),
                  "' (fpos= $byte_offset), byte string ",
                  "(len= $byte_len) is '$1'\n";
            # save offset/length of matched item
            push @FileLocations, $byte_offset, $byte_len;
        }
        # $filepos += length;
        # or ->
        $filepos = tell ($fh);
    }

    ## Reconstitute file data.
    ## Seek to offsets, read length bytes
    #
    if ( @FileLocations ) {
        print "\nFile offset/length:\n";
        my $buf = '';
        while (my ($offset,$len) = splice(@FileLocations, 0,2)) {
            seek ($fh, $offset, 0);
            read ($fh, $buf, $len);
            print "$offset, $len, ",
                  "$UTF: '$buf', UTF-8 string: '",
                  decode($UTF, $buf), "'\n";
        }
    }
    close $fh;
}
__END__
--------------------
Encoding: ascii
Pattern: 'wo?rd|End|one'

(line 3) Found 'one' (fpos= 96), byte string (len= 3) is 'one'
(line 4) Found 'End' (fpos= 115), byte string (len= 3) is 'End'

File offset/length:
96, 3, ascii: 'one', UTF-8 string: 'one'
115, 3, ascii: 'End', UTF-8 string: 'End'

--------------------
Encoding: UTF-8
Pattern: 'wo+�-�-�rd|End|one'

(line 1) Found 'wo℀rd' (fpos= 5), byte string (len= 7) is 'wo+�-�-�rd'
(line 1) Found 'wo℀rd' (fpos= 22), byte string (len= 7) is 'wo+�-�-�rd'
(line 3) Found 'wo℀rd' (fpos= 85), byte string (len= 7) is 'wo+�-�-�rd'
(line 3) Found 'one' (fpos= 104), byte string (len= 3) is 'one'
(line 4) Found 'End' (fpos= 123), byte string (len= 3) is 'End'

File offset/length:
5, 7, UTF-8: 'wo+�-�-�rd', UTF-8 string: 'wo℀rd'
22, 7, UTF-8: 'wo+�-�-�rd', UTF-8 string: 'wo℀rd'
85, 7, UTF-8: 'wo+�-�-�rd', UTF-8 string: 'wo℀rd'
104, 3, UTF-8: 'one', UTF-8 string: 'one'
123, 3, UTF-8: 'End', UTF-8 string: 'End'

--------------------
Encoding: UTF-16BE
Pattern: ' w o! r d | E n d | o n e'

(line 1) Found 'wo℀rd' (fpos= 10), byte string (len= 11) is ' w o! r d '
(line 1) Found 'wo℀rd' (fpos= 36), byte string (len= 11) is ' w o! r d '
(line 3) Found 'wo℀rd' (fpos= 158), byte string (len= 11) is ' w o! r d '
(line 3) Found 'one' (fpos= 192), byte string (len= 6) is ' o n e'
(line 4) Found 'End' (fpos= 230), byte string (len= 7) is ' E n d '

File offset/length:
10, 11, UTF-16BE: ' w o! r d ', UTF-8 string: 'wo℀rd'
36, 11, UTF-16BE: ' w o! r d ', UTF-8 string: 'wo℀rd'
158, 11, UTF-16BE: ' w o! r d ', UTF-8 string: 'wo℀rd'
192, 6, UTF-16BE: ' o n e', UTF-8 string: 'one'
230, 7, UTF-16BE: ' E n d ', UTF-8 string: 'End'

--------------------
Encoding: UTF-32BE
Pattern: ' w o ! r d | E n d | o n e'

(line 1) Found 'wo℀rd' (fpos= 20), byte string (len= 23) is ' w o ! r d '
(line 1) Found 'wo℀rd' (fpos= 72), byte string (len= 23) is ' w o ! r d '
(line 3) Found 'wo℀rd' (fpos= 316), byte string (len= 23) is ' w o ! r d '
(line 3) Found 'one' (fpos= 384), byte string (len= 12) is ' o n e'
(line 4) Found 'End' (fpos= 460), byte string (len= 15) is ' E n d '

File offset/length:
20, 23, UTF-32BE: ' w o ! r d ', UTF-8 string: 'wo℀rd'
72, 23, UTF-32BE: ' w o ! r d ', UTF-8 string: 'wo℀rd'
316, 23, UTF-32BE: ' w o ! r d ', UTF-8 string: 'wo℀rd'
384, 12, UTF-32BE: ' o n e', UTF-8 string: 'one'
460, 15, UTF-32BE: ' E n d ', UTF-8 string: 'End'
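
A note on the odd lengths in the output above (11 and 7 bytes under UTF-16BE, 23 and 15 under UTF-32BE): the script encodes the whole pattern, "|" included, so under UTF-16BE the "|" becomes the two bytes 00 7C. The regex engine still treats the 7C as alternation, and the 00 in front of it ends up as a trailing byte of the preceding alternative. Under ascii the unmappable character is encoded as "?", which then acts as a quantifier. One possible way around both, sketched here on the assumption that each alternative is meant to match literally, is to encode and quotemeta each alternative separately and only then join them with "|". The sub name octet_pattern is made up.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: build a byte-level alternation from character
# strings. Each word is encoded and quoted on its own, so the "|"
# joining them stays a single regex metacharacter byte.
sub octet_pattern {
    my ($encoding, @words) = @_;
    return join '|', map {
        my $copy = $_;                       # work on a copy, not the caller's string
        quotemeta encode($encoding, $copy);
    } @words;
}

my $word = "wo\x{2100}rd";
my $pat  = octet_pattern('UTF-16BE', $word, 'End', 'one');
my $data = encode('UTF-16BE', "This $word is the End.");

while ($data =~ /($pat)/g) {
    my $len   = length $1;
    my $start = pos($data) - $len;
    print "offset $start, length $len\n";
}
```

A byte-level match can in principle still begin in the middle of a code unit, so for UTF-16 or UTF-32 data it may be worth checking that the start offset is a multiple of 2 or 4 before trusting a hit.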

From: Ben Morrow on

Quoth merlyn(a)stonehenge.com (Randal L. Schwartz):
> >>>>> "Jürgen" == Jürgen Exner <jurgenex(a)hotmail.com> writes:
>
> Jürgen> Now, that is the critical clue. seek() is based on bytes, so you need a
> Jürgen> position in bytes in order to use seek().
>
> Historical fact: fseek(3) was originally based on ftell(3)-"cookies", where
> the stdio lib didn't promise to be able to return to any position that it
> hadn't originally handed you from a tell. As it turns out, those "cookies"
> were always byte positions on every operating system *I* saw stdio implemented
> on.

IIRC Win32's stdio in 'text' mode (the default) uses this mechanism to
get around the CRLF->LF translation.

Ben

From: Randal L. Schwartz on
>>>>> "Ben" == Ben Morrow <ben(a)morrow.me.uk> writes:

Ben> IIRC Win32's stdio in 'text' mode (the default) uses this mechanism to
Ben> get around the CRLF->LF translation.

Nice to know. I guess I'm lucky in that I've never had to use Windows
except in internet cafes, where the first step is "download putty"
so I can ssh to a real box.

--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn(a)stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Smalltalk/Perl/Unix consulting, Technical writing, Comedy, etc. etc.
See http://methodsandmessages.vox.com/ for Smalltalk and Seaside discussion
From: C.DeRykus on
On Feb 17, 9:22 pm, Peter Makholm <pe...(a)makholm.net> wrote:
> mud_saisem <mud_sai...(a)hotmail.com> writes:
> > Does anybody know how to read through a file searching for a word and
> > printing the file position of that word ?
>
> If your file contains plain ascii, iso-8859, or another 8bit charset
> it should be easy. The tell() function gives you the current location
> in the file, pos() gives you the location of regexp match, and
> index() directly gives you the location.
>
> So this should work (untested though)
>
>   my $offset = 0;
>   while (<$fh>) {
>       if (/word/) {
^^^^^^^^^^^^

if ( /word/g ) {


Maybe the OP assumed it was correct because of
the 'tell' addition.
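
To spell out the correction: pos() is only set by a successful match that uses /g. After a plain /word/ it stays undefined, so `$offset + pos()` in the quoted code adds nothing (and warns under 'use warnings'). A small check, using a made-up sample line:

```perl
use strict;
use warnings;

my $line = "a word in a line\n";

# Without /g the match succeeds, but pos() is left undefined.
my $without_g = $line =~ /word/ ? pos($line) : 'no match';
print "without /g: ", defined $without_g ? $without_g : 'undef', "\n";

# With /g in scalar context, pos() is the offset just past the match,
# so the start of the match is pos() minus the match length.
my $with_g = $line =~ /word/g ? pos($line) : 'no match';
print "with /g: $with_g (match starts at ", $with_g - length('word'), ")\n";
```

Note that even with /g, pos() points at the end of the match, which is why sln's script subtracts the match length.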



--
Charles DeRykus

From: C.DeRykus on
On Feb 18, 4:55 pm, "C.DeRykus" <dery...(a)gmail.com> wrote:
> On Feb 17, 9:22 pm, Peter Makholm <pe...(a)makholm.net> wrote:
> > mud_saisem <mud_sai...(a)hotmail.com> writes:
> > > Does anybody know how to read through a file searching for a word and
> > > printing the file position of that word ?
>
> > If your file contains plain ascii, iso-8859, or another 8bit charset
> > it should be easy. The tell() function gives you the current location
> > in the file, pos() gives you the location of regexp match, and
> > index() directly gives you the location.
>
> > So this should work (untested though)
>
> >   my $offset = 0;
> >   while (<$fh>) {
> >       if (/word/) {
>
>         ^^^^^^^^^^^^
>
>         if ( /word/g ) {
>
> Maybe the OP assumed it was correct because of
> the 'tell' addition.
>

You may need to loop to pick up multiple hits
per line too if that was the goal.
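
Something along these lines, as a sketch only: it assumes a byte-oriented (8-bit) file as in Peter's case and uses $-[0] for the start of each match. The sub name match_offsets and the sample data are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: return the byte offset of every match of $re
# in the file behind $fh, several per line if need be.
sub match_offsets {
    my ($fh, $re) = @_;
    my @offsets;
    my $line_start = tell $fh;
    while (my $line = <$fh>) {
        while ($line =~ /$re/g) {
            push @offsets, $line_start + $-[0];   # $-[0] is the start of the match
        }
        $line_start = tell $fh;                   # start of the next line
    }
    return @offsets;
}

my $data = "word one, word two\nno hit here\nlast word\n";
open my $fh, '<', \$data or die "Can't open memory buffer: $!";
print "Found 'word' at location $_\n" for match_offsets($fh, qr/word/);
close $fh;
```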

--
Charles DeRykus