Fast searching of large files [Ruby]

From: Stuart Clarke on 1 Jul 2010 05:47

Hey all,

Could anyone advise me on a fast way to search a single but very large
file (1 GB) for a string of text? Also, is there a library to
identify the offset within the file at which the string was found?

Thanks
--
Posted via http://www.ruby-forum.com/.

From: Michael Fellinger on 1 Jul 2010 06:40

On Thu, Jul 1, 2010 at 6:47 PM, Stuart Clarke
<stuart.clarke1986(a)gmail.com> wrote:
> Hey all,
>
> Could anyone advise me on a fast way to search a single but very large
> file (1 GB) for a string of text? Also, is there a library to
> identify the offset within the file at which the string was found?

You can use IO#grep like this:
File.open('qimo-2.0-desktop.iso', 'r:BINARY') { |io|
  io.grep(/apiKey/) { |m| p io.pos => m }
}

The pos is the position where the match ended, so just subtract the string length.
The above example was a 700 MB file; it took around 40 s the first
time and 2 s subsequently, so disk I/O is the limiting factor in terms of
speed (as usual).
Oh, and also don't use binary encoding if your file is in a different one ;)

--
Michael Fellinger
CTO, The Rubyists, LLC

From: Robert Klemme on 1 Jul 2010 07:03

2010/7/1 Michael Fellinger <m.fellinger(a)gmail.com>:
> On Thu, Jul 1, 2010 at 6:47 PM, Stuart Clarke
> <stuart.clarke1986(a)gmail.com> wrote:
>> Hey all,
>>
>> Could anyone advise me on a fast way to search a single but very large
>> file (1 GB) for a string of text? Also, is there a library to
>> identify the offset within the file at which the string was found?
>
> You can use IO#grep like this:
> File.open('qimo-2.0-desktop.iso', 'r:BINARY'){|io|
> io.grep(/apiKey/){|m| p io.pos => m } }
>
> The pos is the position where the match ended, so just subtract the string length.
> The above example was a 700 MB file; it took around 40 s the first
> time and 2 s subsequently, so disk I/O is the limiting factor in terms of
> speed (as usual).

If you only need to know whether the string occurs in the file you can do

found = File.foreach("foo").any? {|line| /apiKey/ =~ line}

This will stop searching as soon as the sequence is found.
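
The same early exit can also be had without depending on line boundaries, which may matter for binary files that contain long runs without a newline. The following is only a minimal sketch, not something posted in this thread: the method name `first_offset` and the 1 MiB default chunk size are assumptions chosen for illustration. It reads fixed-size chunks, carries over the last few bytes of each chunk so that a match straddling a chunk boundary is not missed, and returns as soon as the first match is found.

```ruby
# Return the byte offset of the first occurrence of needle in the file
# at path, or nil if it does not occur.
# first_offset and chunk_size are hypothetical names for this sketch.
def first_offset(path, needle, chunk_size = 1 << 20)
  needle = needle.b                 # compare raw bytes, not characters
  return nil if needle.empty?

  keep = needle.bytesize - 1        # bytes to carry over between chunks
  File.open(path, 'rb') do |io|
    tail = ''.b                     # carried-over end of the previous buffer
    base = 0                        # file offset of the first byte of tail
    while (chunk = io.read(chunk_size))
      buf = tail + chunk
      i = buf.index(needle)         # byte index, since buf is binary
      return base + i if i          # stop at the first match

      cut = [buf.bytesize - keep, 0].max
      tail = buf.byteslice(cut, keep)
      base += cut
    end
  end
  nil
end
```

A call might look like `first_offset('qimo-2.0-desktop.iso', 'apiKey')`. Whether this is actually faster than the line-based version depends on the file and the disk, so it is worth measuring.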

"fgrep -l foo" is likely faster.

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

From: Stuart Clarke on 1 Jul 2010 07:58

Thanks.

This seems to be pretty much the best logic for me; however, it takes a
good 20 minutes to scan a 2 GB file.

Any ideas?

Thanks

Michael Fellinger wrote:
> On Thu, Jul 1, 2010 at 6:47 PM, Stuart Clarke
> <stuart.clarke1986(a)gmail.com> wrote:
>> Hey all,
>>
>> Could anyone advise me on a fast way to search a single but very large
>> file (1 GB) for a string of text? Also, is there a library to
>> identify the offset within the file at which the string was found?
>
> You can use IO#grep like this:
> File.open('qimo-2.0-desktop.iso', 'r:BINARY'){|io|
> io.grep(/apiKey/){|m| p io.pos => m } }
>
> The pos is the position where the match ended, so just subtract the string
> length.
> The above example was a 700 MB file; it took around 40 s the first
> time and 2 s subsequently, so disk I/O is the limiting factor in terms of
> speed (as usual).
> Oh, and also don't use binary encoding if your file is in a different
> one ;)

--
Posted via http://www.ruby-forum.com/.

From: Joel VanderWerf on 1 Jul 2010 13:03

Michael Fellinger wrote:
> On Thu, Jul 1, 2010 at 6:47 PM, Stuart Clarke
> <stuart.clarke1986(a)gmail.com> wrote:
>> Hey all,
>>
>> Could anyone advise me on a fast way to search a single but very large
>> file (1 GB) for a string of text? Also, is there a library to
>> identify the offset within the file at which the string was found?
>
> You can use IO#grep like this:
> File.open('qimo-2.0-desktop.iso', 'r:BINARY'){|io|
> io.grep(/apiKey/){|m| p io.pos => m } }
>
> The pos is the position the match ended

Actually, pos will be the position of the end of the line on which the
match was found, because #grep works line by line.
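
Since #grep only gives the position after the whole line has been read, the offset of the match itself has to be computed from the start of the line. One way to do that is sketched below, assuming the search string contains no newline (so a match never spans two lines) and that the file is read in binary mode so that String#index counts bytes. The method name `each_match_offset` is made up for this example.

```ruby
# Yield the byte offset of every occurrence of needle in the file at path.
# Assumes needle contains no newline, so no match spans two lines.
# each_match_offset is a hypothetical name for this sketch.
def each_match_offset(path, needle)
  return enum_for(:each_match_offset, path, needle) unless block_given?

  needle = needle.b                  # compare raw bytes
  File.open(path, 'rb') do |io|
    line_start = 0                   # byte offset where the current line begins
    io.each_line do |line|
      from = 0
      while (i = line.index(needle, from))
        yield line_start + i         # offset of the match, not of the line end
        from = i + 1                 # continue after this match start
      end
      line_start += line.bytesize
    end
  end
end
```

For example, `each_match_offset('qimo-2.0-desktop.iso', 'apiKey') { |off| p off }` prints one offset per match. Note that on a binary file with few newlines a single "line" can be very large, so memory use depends on the file's contents.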
