From: Iain Barnett on
Hi,

I've some more regex questions. I wrote a pattern to check for valid regexes and inspect the parts (we all have our reasons for the things we do :). It wasn't working, so I went down to simpler and simpler patterns, but I'm a bit surprised at the way Ruby 1.9 is handling the regexes. I tested the same pattern in Perl and it came out with the answers I'd expect.

Is this down to me using Perl regexes for so long, or is there something I'm missing about Ruby's implementation? It appears that ^ at the beginning of a pattern doesn't bind as strongly as I'd expect.


I believe this test should fail, as <delim> should be bound to the beginning of the string by the ^, and the match result is a little bit crazy. Shouldn't the main capture be "d\\d" if it's following the logical route it's chosen?
$ ruby -e '
md = /^(?<mors>m)?(?<delim>.)(?<pat>.+?)\g<delim>/.match( %q!/\d\d\\d! )
puts md.inspect
'
#<MatchData "/\\d" mors:nil delim:"d" pat:"\\">


Here I add a trailing slash to the string, which (I believe) should give me back what's between the / /:
$ ruby -e '
md = /^(?<mors>m)?(?<delim>.)(?<pat>.+?)\g<delim>/.match( %q!/\d\d\\d/! )
puts md.inspect
'
#<MatchData "/\\d" mors:nil delim:"d" pat:"\\">

Here's the first string in Perl 5.12:
$ perl -e '
if ( q(/\d\d\\d) =~ /^(?<mors>m)?(?<delim>.)(?<pat>.+?)\g{delim}/ ) {
while ( my ($key, $value) = each(%+) ) {
print "$key => $value\n";
}
}
'
<nothing here, what I'd expect>

And here it is with the "valid" string:
$ perl -e '
if ( q(/\d\d\\d/) =~ /^(?<mors>m)?(?<delim>.)(?<pat>.+?)\g{delim}/ ) {
while ( my ($key, $value) = each(%+) ) {
print "$key => $value\n";
}
}
'
pat => \d\d\d
delim => /

These are the answers I'd expect.


Even this seems unexpected to me. If I remove <mors>, then surely ^ should bind <delim> to the beginning?
$ ruby -e '
md = /^(?<delim>.)(?<pat>.+?)\g<delim>/.match( %q!/\d\d\\d/! )
puts md.inspect
'
#<MatchData "/\\d" delim:"d" pat:"\\">


These work as I'd expect once I add the end-of-line anchor $:
$ ruby -e '
md = /^(?<delim>.)(?<pat>.+?)\g<delim>$/.match( %q!/\d\d\\d/! )
puts md.inspect
'
#<MatchData "/\\d\\d\\d/" delim:"/" pat:"\\d\\d\\d">

$ ruby -e '
md = /^(?<mors>m)?(?<delim>.)(?<pat>.+?)\g<delim>$/.match( %q!/\d\d\\d/! )
puts md.inspect
'
#<MatchData "/\\d\\d\\d/" mors:nil delim:"/" pat:"\\d\\d\\d">

And finally, if I remove the caret but leave the $, I get the answer I'd expect (or at least the one I'm looking for):
$ ruby -e '
md = /(?<mors>m)?(?<delim>.)(?<pat>.+?)\g<delim>$/.match( %q!/\d\d\\d/! )
puts md.inspect
'
#<MatchData "/\\d\\d\\d/" mors:nil delim:"/" pat:"\\d\\d\\d">
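To make the comparison easier to repeat, here is a minimal sketch that gathers the one-liners above into a single script. The labels, the constant names and the run_all helper are made up for illustration; the patterns and test strings are the ones already shown. No expected output is given for the caret-only variants, since those are exactly the results in question.

```ruby
# The two test strings from above. Inside %q, \d stays as-is and \\ becomes \.
INVALID = %q!/\d\d\\d!    # no closing delimiter
VALID   = %q!/\d\d\\d/!   # closing delimiter present

# The pattern variants tried above, keyed by a hypothetical label.
PATTERNS = {
  caret_only:       /^(?<mors>m)?(?<delim>.)(?<pat>.+?)\g<delim>/,
  caret_no_mors:    /^(?<delim>.)(?<pat>.+?)\g<delim>/,
  caret_and_dollar: /^(?<mors>m)?(?<delim>.)(?<pat>.+?)\g<delim>$/,
  dollar_only:      /(?<mors>m)?(?<delim>.)(?<pat>.+?)\g<delim>$/,
}

# Match every pattern against every string and collect the MatchData (or nil).
def run_all(patterns, strings)
  patterns.each_with_object({}) do |(label, re), out|
    strings.each { |s| out[[label, s]] = re.match(s) }
  end
end

RESULTS = run_all(PATTERNS, [INVALID, VALID])
RESULTS.each do |(label, s), md|
  puts "#{label} #{s.inspect} => #{md.inspect}"
end
```

Each output line pairs a label and a test string with the MatchData, so the anchored and unanchored variants can be compared side by side.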


Regards,
Iain




From: Robert Klemme on
On 07/26/2010 06:01 PM, Iain Barnett wrote:
> Hi,
>
> I've some more regex questions. I wrote a pattern to check for valid
> regexes and inspect the parts (we all have our reasons for the things
> we do:) It wasn't working so I went down to simpler and simpler
> patterns, but I'm a bit surprised at the way Ruby 1.9 is handling the
> regexes. I tested the same pattern in Perl and it came out with the
> answers I'd expect.
>
> Is this down to me using perl regexes for so long, or is there
> something I'm missing about Ruby's implementation? It appears ^ at
> the beginning of a string doesn't bind as strongly as I'd expect.
>
>
> I believe this test should fail as <delim> should be bound to the
> beginning of the string by the ^ , and the match result is a little
> bit crazy - shouldn't the main capture be "d\\d" if it's following
> the logical route it's chosen?
> $ ruby -e '
> md = /^(?<mors>m)?(?<delim>.)(?<pat>.+?)\g<delim>/.match( %q!/\d\d\\d! )
> puts md.inspect
> '
> #<MatchData "/\\d" mors:nil delim:"d" pat:"\\">
>

I think you found a bug - probably related to referring to back
references to named capturing groups:

irb(main):013:0> s = %q!/\d\d\\d!
=> "/\\d\\d\\d"

irb(main):027:0> r = /^(?<mors>m)?(?<delim>.)(?<pat>.+?)/
=> /^(?<mors>m)?(?<delim>.)(?<pat>.+?)/
irb(main):028:0> md = r.match s
=> #<MatchData "/\\" mors:nil delim:"/" pat:"\\">

This must not match at all:

irb(main):029:0> r = /^(?<mors>m)?(?<delim>.)(?<pat>.+?)\g<delim>/
=> /^(?<mors>m)?(?<delim>.)(?<pat>.+?)\g<delim>/
irb(main):030:0> md = r.match s
=> #<MatchData "/\\d" mors:nil delim:"d" pat:"\\">

It seems to work better with numbered capturing groups

Normal greediness:

irb(main):035:0> r = /^(m)?(.)(.+)\2/
=> /^(m)?(.)(.+)\2/
irb(main):036:0> md = r.match s
=> nil

This works:

irb(main):038:0> /^(m)?(.)(.+)\2/.match 'abbba'
=> #<MatchData "abbba" 1:nil 2:"a" 3:"bbb">

Maybe the numbering gets out of order if we try to mix:

irb(main):039:0> /^(?<delim>m)?(.)(.+)\2/.match 'abbba'
SyntaxError: (irb):39: numbered backref/call is not allowed. (use name):
/^(?<delim>m)?(.)(.+)\2/
from /usr/local/bin/irb19:12:in `<main>'
irb(main):040:0> /^(?<delim>m)?(.)(.+)\k<2>/.match 'abbba'
SyntaxError: (irb):40: numbered backref/call is not allowed. (use name):
/^(?<delim>m)?(.)(.+)\k<2>/
from /usr/local/bin/irb19:12:in `<main>'
irb(main):041:0>
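
The error message's hint ("use name") suggests trying a named reference. As a sketch, and assuming Ruby's documented regex syntax applies to this build: \k<name> is the named backreference (the counterpart of Perl's \g{name}), while \g<name> is a subexpression call that runs the group's pattern again. If that reading is right, \g<delim> here means "match any character again" rather than "match the same delimiter", which might be why delim ends up as "d" above. The variable names below are made up for illustration.

```ruby
s_open   = %q!/\d\d\\d!    # no closing delimiter
s_closed = %q!/\d\d\\d/!   # closing delimiter present

# \k<delim> refers back to the text that <delim> captured.
backref = /^(?<mors>m)?(?<delim>.)(?<pat>.+?)\k<delim>/
# \g<delim> calls the <delim> subpattern (a single .) again.
call    = /^(?<mors>m)?(?<delim>.)(?<pat>.+?)\g<delim>/

p backref.match(s_open)    # nil: there is no second "/" to refer back to
p backref.match(s_closed)  # matches the whole string, delim "/" and pat \d\d\d
p call.match(s_open)       # compare with the irb result for \g<delim> above
```

With the backreference form the unterminated string is rejected and the terminated one yields what sits between the slashes, which is the behaviour the Perl pattern shows.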

irb(main):047:0> RUBY_VERSION
=> "1.9.1"
irb(main):048:0> RUBY_PATCHLEVEL
=> 376

Frankly, I have never used named capturing groups (simply out of habit and for compatibility). That was probably a good choice so far.

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

From: Iain Barnett on

On 26 Jul 2010, at 21:40, Robert Klemme wrote:
>
> I think you found a bug - probably related to referring to back references to named capturing groups:
>
>
>
> Frankly, I have never used named capturing groups (simply out of habit and for compatibility). That was probably a good choice so far.
>
> Kind regards
>
> robert
>

Thanks for checking that. While searching for more information on the Oniguruma engine, I noticed that there is a CPAN module for running it under Perl, so I installed it and ran the same regexes in Perl with that engine, and it had the same results as Ruby. This indicates that it's a problem with the engine and not something Ruby is doing along the way. I'll file a report with the Oniguruma team, include all your tests too, and see what happens.

With Oniguruma:

$ perl -Mre::engine::Oniguruma -e '
if ( q(/\d\d\\d/) =~ /^(?<delim>.)(?<pat>.+?)\g{delim}/ ) {
while ( my ($key, $value) = each(%+) ) {
print "$key => $value\n";
}
}
'
<nothing here>


Usual Perl engine:

$ perl -e '
if ( q(/\d\d\\d/) =~ /^(?<delim>.)(?<pat>.+?)\g{delim}/ ) {
while ( my ($key, $value) = each(%+) ) {
print "$key => $value\n";
}
}
'
pat => \d\d\d
delim => /

Regards,
Iain