Count substrings in string, scan too slow [Ruby]

Prev: ruby gem installation problem on MAC PC which is behind firewall
Next: IO select. Reconnect procedure.

From: Robert Klemme on 24 Jun 2010 12:05

2010/6/24 Jesús Gabriel y Galán <jgabrielygalan(a)gmail.com>:
> On Thu, Jun 24, 2010 at 5:04 PM, Danny Challis <dannychallis(a)gmail.com> wrote:
>> Hello everyone,
>> I need to count the number of times a substring occurs in a string.
>> I am currently doing this using the scan method, but it is simply too
>> slow. I feel there should be a faster way to do this since the scan
>> method is really designed for more advanced things than this. I do not
>> need to do regex matching or to process the matches, just count
>> substrings. So what I want is something like this:
>>
>> s = "you like to play with your yo-yo"
>> s.magical_count_method("yo") => 4
>>
>> Once again, what I'm really looking for is something fast. I've tried
>> using external linux commands such as awk, but that was much much
>> slower. Any ideas?
>
> I don't know how slow is scan for you. An implementation using
> String#index and a loop is a little bit faster, but not too much:
>
> require 'benchmark'
>
> TIMES = 100_000
> s = "you like to play with your yo-yo"
>
> Benchmark.bmbm do |x|
> x.report("scan") do
> TIMES.times do
> s.scan("yo").size
> end
> end
> x.report("while") do
> TIMES.times do
> index = -1
> count = 0
> while (index = s.index("yo", index+1))
> count += 1
> end
> count
> end
> end
> end
>
> $ ruby scan_vs_while.rb
> Rehearsal -----------------------------------------
> scan 0.560000 0.020000 0.580000 ( 0.585972)
> while 0.440000 0.060000 0.500000 ( 0.492969)
> -------------------------------- total: 1.080000sec
>
> user system total real
> scan 0.510000 0.010000 0.520000 ( 0.519078)
> while 0.470000 0.020000 0.490000 ( 0.493562)
>
> Don't know if this is enough for you, probably not :-)

I took the liberty to extend the benchmark a bit:

http://gist.github.com/451622

I would have expected regexp to be faster...

Cheers

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

From: Jesús Gabriel y Galán on 24 Jun 2010 12:16

On Thu, Jun 24, 2010 at 6:05 PM, Robert Klemme
<shortcutter(a)googlemail.com> wrote:
> 2010/6/24 Jesús Gabriel y Galán <jgabrielygalan(a)gmail.com>:
>> On Thu, Jun 24, 2010 at 5:04 PM, Danny Challis <dannychallis(a)gmail.com> wrote:
>>> Hello everyone,
>>> I need to count the number of times a substring occurs in a string.
>>> I am currently doing this using the scan method, but it is simply too
>>> slow. I feel there should be a faster way to do this since the scan
>>> method is really designed for more advanced things than this. I do not
>>> need to do regex matching or to process the matches, just count
>>> substrings. So what I want is something like this:
>>>
>>> s = "you like to play with your yo-yo"
>>> s.magical_count_method("yo") => 4
>>>
>>> Once again, what I'm really looking for is something fast. I've tried
>>> using external linux commands such as awk, but that was much much
>>> slower. Any ideas?
>>
>> I don't know how slow is scan for you. An implementation using
>> String#index and a loop is a little bit faster, but not too much:
>>
>> require 'benchmark'
>>
>> TIMES = 100_000
>> s = "you like to play with your yo-yo"
>>
>> Benchmark.bmbm do |x|
>> x.report("scan") do
>> TIMES.times do
>> s.scan("yo").size
>> end
>> end
>> x.report("while") do
>> TIMES.times do
>> index = -1
>> count = 0
>> while (index = s.index("yo", index+1))
>> count += 1
>> end
>> count
>> end
>> end
>> end
>>
>> $ ruby scan_vs_while.rb
>> Rehearsal -----------------------------------------
>> scan 0.560000 0.020000 0.580000 ( 0.585972)
>> while 0.440000 0.060000 0.500000 ( 0.492969)
>> -------------------------------- total: 1.080000sec
>>
>> user system total real
>> scan 0.510000 0.010000 0.520000 ( 0.519078)
>> while 0.470000 0.020000 0.490000 ( 0.493562)
>>
>> Don't know if this is enough for you, probably not :-)
>
> I took the liberty to extend the benchmark a bit:
>
> http://gist.github.com/451622
>
> I would have expected regexp to be faster...

This thing about adding the length of the match can be argued
depending on the requirements, I think.
What would you expect from:

"yoyoyoyo".magical_count_method("yoyo")

2 or 3?

If you add the length to the index you get 2. If you add 1, you get 3.

irb(main):018:0> s = "yoyoyoyo"
=> "yoyoyoyo"
irb(main):019:0> count = 0
=> 0
irb(main):020:0> len = s.length
=> 8
irb(main):021:0> search = "yoyo"
=> "yoyo"
irb(main):023:0> len = search.length
=> 4
irb(main):024:0> index = -len
=> -4
irb(main):025:0> while (index = s.index(search, index + len))
irb(main):026:1> count += 1
irb(main):027:1> end
=> nil
irb(main):028:0> count
=> 2

irb(main):029:0> count = 0
=> 0
irb(main):030:0> index = -1
=> -1
irb(main):031:0> while (index = s.index(search, index + 1))
irb(main):032:1> count += 1
irb(main):033:1> end
=> nil
irb(main):034:0> count
=> 3

So, I don't know. Of course, if the requirement is to get 2 from the
above situation, adding the length is better.

Also of notice is that the block versions of scan are slower because
they have to call a block for each match.
I think I've read that the String#index method uses Rabin-Karp. It
would be interesting to compare this to a Boyer-Moore implementation.
Of course it will depend on the input data, if it's near the best or
worst case for each, but anyway.

Jesus.

From: Danny Challis on 24 Jun 2010 13:03

I'm looking for non-overlapping matches (so a 2 in your example)
I modified your code to do this for me like you showed and it works
fine. I was thinking of trying a Boyer-Moore implementation, but I
suspect if I implement this manually in Ruby it will be much slower.

Jesús Gabriel y Galán wrote:
> On Thu, Jun 24, 2010 at 6:05 PM, Robert Klemme
> <shortcutter(a)googlemail.com> wrote:
>>>> s = "you like to play with your yo-yo"

>
> So, I don't know. Of course, if the requirement is to get 2 from the
> above situation, adding the length is better.
>
> Also of notice is that the block versions of scan are slower because
> they have to call a block for each match.
> I think I've read that the String#index method uses Rabin-Karp. It
> would be interesting to compare this to a Boyer-Moore implementation.
> Of course it will depend on the input data, if it's near the best or
> worst case for each, but anyway.
>
> Jesus.
--
Posted via http://www.ruby-forum.com/.

From: botp on 24 Jun 2010 12:35

On Fri, Jun 25, 2010 at 12:05 AM, Robert Klemme
<shortcutter(a)googlemail.com> wrote:
> http://gist.github.com/451622
> I would have expected regexp to be faster...

you don't like strscan ? :)
best regards -botp

From: Michael Fellinger on 24 Jun 2010 13:16

On Fri, Jun 25, 2010 at 1:35 AM, botp <botpena(a)gmail.com> wrote:
> On Fri, Jun 25, 2010 at 12:05 AM, Robert Klemme
> <shortcutter(a)googlemail.com> wrote:
>> http://gist.github.com/451622
>> I would have expected regexp to be faster...
>
> you don't like strscan ? :)
> best regards -botp

I've just run some benchmarks with strscan, and it's at least in the
same ballpark as the other approaches, unless you're on rubinius, but
then all string processing is really slow on that anyway.

Benchmark with strscan here: http://gist.github.com/451675

--
Michael Fellinger
CTO, The Rubyists, LLC

First | Prev | Next | Last
Pages: 1 2 3 4
Prev: ruby gem installation problem on MAC PC which is behind firewall
Next: IO select. Reconnect procedure.