From: brabuhr on
On Thu, Jul 1, 2010 at 7:03 AM, Robert Klemme
<shortcutter(a)googlemail.com> wrote:
> 2010/7/1 Michael Fellinger <m.fellinger(a)gmail.com>:
>> On Thu, Jul 1, 2010 at 6:47 PM, Stuart Clarke
>> <stuart.clarke1986(a)gmail.com> wrote:
>>> Could anyone advise me on a fast way to search a single, but very large
>>> file (1Gb) quickly for a string of text? Also, is there a library to
>>> identify the file offset this string was found within the file?
>>
>> You can use IO#grep like this:
>> File.open('qimo-2.0-desktop.iso', 'r:BINARY'){|io|
>> io.grep(/apiKey/){|m| p io.pos => m } }
>>
>> The pos is the position where the match ended, so just subtract the string length.
>> The above example was run against a 700 MB file; it took around 40s the
>> first time and 2s subsequently, so disk I/O is the limiting factor in
>> terms of speed (as usual).
>
> If you only need to know whether the string occurs in the file you can do
> found = File.foreach("foo").any? {|line| /apiKey/ =~ line}
> This will stop searching as soon as the sequence is found.
>
> "fgrep -l foo" is likely faster.

irb> `fgrep -l waters /usr/share/dict/words`.size > 0
=> true
irb> `fgrep -l watershed /usr/share/dict/words`.size > 0
=> true
irb> `fgrep -l watershedz /usr/share/dict/words`.size > 0
=> false

irb> `fgrep -ob waters /usr/share/dict/words`.split.map{|s| s.split(':').first}
=> ["153088", "153102", "204143", "234643", "472357", "856441",
"913606", "913613", "913623", "913635", "913646", "913656", "913668",
"913679", "913690", "913703"]
irb> `fgrep -ob watershed /usr/share/dict/words`.split.map{|s| s.split(':').first}
=> ["913613", "913623", "913635"]
irb> `fgrep -ob watershedz /usr/share/dict/words`.split.map{|s| s.split(':').first}
=> []
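
If shelling out is acceptable, the fgrep -ob trick above can be wrapped in
a small helper (the method name is made up) that returns the offsets as
Integers:

require 'shellwords'

def fgrep_offsets(string, path)
  # -o prints each match, -b prefixes it with its byte offset
  `fgrep -ob #{Shellwords.escape(string)} #{Shellwords.escape(path)}`.
    lines.map { |l| l.split(':', 2).first.to_i }
end

fgrep_offsets("watershed", "/usr/share/dict/words")
# => [913613, 913623, 913635]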

From: Roger Pack on
Stuart Clarke wrote:
> Hey all,
>
> Could anyone advise me on a fast way to search a single, but very large
> file (1Gb) quickly for a string of text? Also, is there a library to
> identify the file offset this string was found within the file?

A fast way is to do it in C :)

Here are a few other helpers, though:

1.9 has faster regexes.
Boost regexes: http://github.com/michaeledgar/ruby-boost-regex (you could
probably optimize it further than it currently is, as well...)

Rubinius might also help.

Also, make sure to open your file in binary mode if you're on 1.9; that
reads much faster, if binary is an option for you.
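
Something like this, for example (a rough sketch; the filename, chunk size
and search string are placeholders):

File.open('big_file.bin', 'rb') do |io|   # 'rb' = binary mode
  offset = 0
  while chunk = io.read(1024 * 1024)      # 1 MB at a time
    if idx = chunk.index('apiKey')
      puts "found at byte #{offset + idx}"
    end
    offset += chunk.bytesize
    # note: a match straddling a chunk boundary would be missed; overlap
    # the reads by the pattern length if that matters
  end
end
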
GL.
-rp