From: kens on
On Apr 13, 9:07 pm, "Robbie Hatley" <lonew...(a)well.com> wrote:
> Today I was editing a URL-likifying program I wrote several
> weeks ago, and I ran across some issues with q{} and qr{}
> which are puzzling me.
>
> Here's an edited-for-brevity version of the program:
>
> my $Legal = q{[[:alnum:];/?:@=&#%$_.+!*'(),-]};
> my $Regex1 = qr{($Legal+\.$Legal+/$Legal+)}
> my $Regex2 = qr{(s?https?://$Legal+)};
> while (<>)
> {
> s{$Regex1}{http://$1}g;
> s{$Regex2}{\n<p><a href="$1">$1</a></p>\n}g;
> print ($_);
>
> }
>
> (As an afterthought, I also tacked the entire program on the
> end of this post, for anyone who's interested.)
>
> I have two questions:
>
> 1. I had a "\" before the "$" to prevent "$_" from being
> interpolated. But when I took the "\" out, the regexes
> still worked fine! Seems to me they should break, because
> $_ is now a variable rather than just "dollar sign followed
> by underscore". But $_ seems not to be interpolated.
> So, is variable interpolation always strictly "one pass"?

q{} is equivalent to the single-quote operator. Strings inside single
quotes do not get interpolated (as opposed to double quotes, "" or
qq{}).

>
> 2. I've read that qr{} "compiles" the regex; I'm hoping that
> means that the s/// operators in the while loop will not
> recompile $Regex1 and $Regex2 each iteration, even though
> I didn't use a /o flag? (No sense wasting CPU time
> recompiling, because the patterns are fixed.)
>
Based on the documentation (perldoc perlop), qr may invoke a
precompilation of the pattern. To me that implies that it is
implementation specific, but there are others with more expertise in
this area than me.
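
For what it's worth, here is a minimal sketch of the usual pattern: build
the regex once with qr{} before the loop, then use the resulting object as
the whole pattern. The key=value pattern and the sample lines are made up
for illustration, and how much recompilation this actually saves depends on
the perl version.

```perl
use strict;
use warnings;

# Build the pattern once, before the loop (the pattern itself is made up).
my $pair = qr{(\w+)=(\w+)};

# qr{} returns a Regexp object rather than a plain string.
my $kind = ref $pair;

# When the object is the whole pattern, the match can use it directly.
my @keys;
for my $line ('name=Fifi', 'no pair here', 'kind=cat') {
    push @keys, $1 if $line =~ $pair;
}

print "$kind: @keys\n";
```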

HTH, Ken

> lines deleted

>
> --
> Cheers,
> Robbie Hatley
> lonewolf aatt well dott com
> www dott well dott com slant user slant lonewolf slant

From: John W. Krahn on
Robbie Hatley wrote:
> Today I was editing a URL-likifying program I wrote several
> weeks ago, and I ran across some issues with q{} and qr{}
> which are puzzling me.
>
> Here's an edited-for-brevity version of the program:
>
> my $Legal = q{[[:alnum:];/?:@=&#%$_.+!*'(),-]};
> my $Regex1 = qr{($Legal+\.$Legal+/$Legal+)}
> my $Regex2 = qr{(s?https?://$Legal+)};
> while (<>)
> {
> s{$Regex1}{http://$1}g;
> s{$Regex2}{\n<p><a href="$1">$1</a></p>\n}g;
> print ($_);
> }
>
> (As an afterthought, I also tacked the entire program on the
> end of this post, for anyone who's interested.)
>
> I have two questions:
>
> 1. I had a "\" before the "$" to prevent "$_" from being
> interpolated.

That just adds a '\' character to the string; inside the regex, "\$"
in a character class is still only an escaped "$":

$ perl -le'$x = q{[$_]}; print qr{$x}'
(?-xism:[$_])
$ perl -le'$x = q{[\$_]}; print qr{$x}'
(?-xism:[\$_])

So the backslash is harmless, but you don't need it.
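
It is worth checking what each class ends up matching. The sketch below
(variable names are made up) compares the two strings and the classes built
from them. Under the regex rules, a backslash before "$" inside a bracketed
class acts as an escape, so both classes should accept "$" and "_" and
neither should accept a backslash.

```perl
use strict;
use warnings;

my $plain   = q{[$_]};     # 4 characters: [ $ _ ]
my $escaped = q{[\$_]};    # 5 characters: q{} keeps the backslash

my $re_plain   = qr{\A$plain\z};
my $re_escaped = qr{\A$escaped\z};

# For each test character, record whether each class matches it.
my %matches;
for my $char ('$', '_', '\\') {
    $matches{$char} = [
        ($char =~ $re_plain   ? 1 : 0),
        ($char =~ $re_escaped ? 1 : 0),
    ];
}

for my $char (sort keys %matches) {
    print "'$char': plain=$matches{$char}[0] escaped=$matches{$char}[1]\n";
}
```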

> But when I took the "\" out, the regexes
> still worked fine! Seems to me they should break, because
> $_ is now a variable rather than just "dollar sign followed
> by underscore". But $_ seems not to be interpolated.
> So, is variable interpolation always strictly "one pass"?

Read the "Gory details of parsing quoted constructs" section of:

perldoc perlop

> 2. I've read that qr{} "compiles" the regex; I'm hoping that
> means that the s/// operators in the while loop will not
> recompile $Regex1 and $Regex2 each iteration,

That is correct.

> even though
> I didn't use a /o flag? (No sense wasting CPU time
> recompiling, because the patterns are fixed.)

perldoc -q /o



John
--
Perl isn't a toolbox, but a small machine shop where you
can special-order certain sorts of tools at low cost and
in short order. -- Larry Wall

From: Ben Bullock on
On Apr 14, 10:07 am, "Robbie Hatley" <lonew...(a)well.com> wrote:

> my $Regex1 = qr{($Legal+\.$Legal+/$Legal+)}

;


> # This regex says "find a string which is probably a URL minus the 'http://'
> # part; save any such found string as a backreference":
> my $Regex1 = qr{($Legal+\.$Legal+/$Legal+)}

;

Also, here [a-z0-9-]{3,63} (ignoring case) is enough. Your regex will
get things which aren't valid URLs. The following catches anything
valid:

my $validdns = '[0-9a-z-]{2,63}';
m/\b(($validdns\.){1,62}$validdns)\b/i # Catches any valid thing.

> s{$Regex1}{http://$1}g;

> print ($_);

You can just say

print;

here if you like.

From: Robbie Hatley on

"John W. Krahn" wrote:

> Robbie Hatley wrote:
>
> > is variable interpolation always strictly "one pass"?
>
> Read the "Gory details of parsing quoted constructs" section of:
> perldoc perlop

Thanks for the tip, but that section doesn't actually say
whether Perl variable interpolation is single-pass or
multi-pass (recursive).

However, when I scrolled up from that section, I noticed
that one of the sections above that:
http://perldoc.perl.org/perlop.html#Quote-and-Quote-like-Operators
*does* specify what I was looking for. It says:

"Perl does not expand multiple levels of interpolation."

Bingo. That's what I was wondering. That explains why "$_"
wasn't being interpolated in my program.

perl -le 'my $Cat=q/Fifi/; my $Dog=q/$Cat/; print qq/$Dog/;'

Prints "$Cat", not "Fifi" as I had expected. Now that I
understand why, I can avoid being surprised by that.
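
The same one-level rule can be checked for the regex case. This is only a
sketch: the value 'X' assigned to $_ and the shortened character class are
made up for the test.

```perl
use strict;
use warnings;

local $_ = 'X';                # hypothetical current value of $_
my $Legal = q{[$_.]};          # literal text: [ $ _ . ]
my $re    = qr{\A$Legal+\z};   # $Legal is expanded once; its text is not rescanned

# '$', '_' and '.' are members of the class...
my $dollar_ok = ('$_.' =~ $re) ? 1 : 0;

# ...but the value of $_ never made it into the pattern.
my $x_ok = ('X' =~ $re) ? 1 : 0;

print "dollar_ok=$dollar_ok x_ok=$x_ok\n";
```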

--
Cheers,
Robbie Hatley
perl -le 'print "\154o\156e\167o\154f\100w\145ll\56c\157m"'
perl -le 'print "\150ttp\72//\167ww.\167ell.\143om/~\154onewolf/"'


From: Robbie Hatley on

"Ben Bullock" wrote:

> On Apr 14, 10:07 am, "Robbie Hatley" <lonew...(a)well.com> wrote:
>
> > # This regex says "find a string which is probably a URL minus the 'http://'
> > # part; save any such found string as a backreference":
> > my $Regex1 = qr{($Legal+\.$Legal+/$Legal+)}
>
> ... [a-z0-9-]{3,63} (ignoring case) is enough. Your regex will
> get things which aren't valid URLs. The following catches anything
> valid:
>
> my $validdns = '[0-9a-z-]{2,63}';
> m/\b(($validdns\.){1,62}$validdns)\b/i # Catches any valid thing.

I can see that your pattern looks for just the dns part
of the url, which has fewer valid characters; but since it
doesn't look for "/", it will convert this string:

references in Sec 35.74 paragraph B

to

references in Sec http://35.74 paragraph B

I believe you're right in that it will find most valid dns
strings; but it also catches things that aren't part of URLs
at all (such as numbers with decimal points), and it rejects
certain well-formed domain strings (such as "j.qbc.net.ca",
whose one-letter label "j" fails the "{2,63}" quantifier).

My pattern at least insists on "stuff.stuff/stuff", so it
rejects "35.74". It rejects domain-level URLs and only
linkifies document-level URLs. That may be a blessing or
a curse, depending on your expectations.

Also, both your pattern and mine are broken in that they match
http://www.asdf.com/qwer.html, and indeed convert it to
http://http://www.asdf.com/qwer.html .

Oops! What was really intended was to find "bare" URLs
(without "http://") and tack "http://" on the beginning.
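
Both problems can be reproduced with a small test. The two patterns are
copied from earlier in the thread; the add_http helper and the name
$DnsRegex are made up for this sketch.

```perl
use strict;
use warnings;

# Patterns copied from earlier in the thread.
my $Legal    = q{[[:alnum:];/?:@=&#%$_.+!*'(),-]};
my $Regex1   = qr{($Legal+\.$Legal+/$Legal+)};
my $validdns = '[0-9a-z-]{2,63}';
my $DnsRegex = qr{\b(($validdns\.){1,62}$validdns)\b}i;

# Hypothetical helper: prefix every match of $re in $text with http://
sub add_http {
    my ($text, $re) = @_;
    (my $copy = $text) =~ s{$re}{http://$1}g;
    return $copy;
}

my $sec = 'references in Sec 35.74 paragraph B';
my $url = 'http://www.asdf.com/qwer.html';

# The DNS-only pattern treats the decimal number as a host name;
# the stuff.stuff/stuff pattern leaves the sentence alone.
print add_http($sec, $DnsRegex), "\n";
print add_http($sec, $Regex1),   "\n";

# Both patterns prepend a second http:// to a URL that already has one.
print add_http($url, $Regex1),   "\n";
print add_http($url, $DnsRegex), "\n";
```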


Ok, this should do the trick; it blends features from your
approach and mine, and solves the bugs I just mentioned,
as well as some other bugs I've noticed:


#!/usr/bin/perl

# linkify.perl

# Converts any text document into an HTML document with all of the contents of
# the original, but with any HTTP URLs converted to clickable hyperlinks.

# First print the standard opening lines of an HTML file.
# The title will be "Linkifyed HTML Document",
# the body text is in a "div" element,
# and the paragraphs will have 5-pixel margins on all 4 sides:

use strict;
use warnings;

# Print initial tags for HTML file:
print ("<html>\n");
print ("<head>\n");
print ("<title>Linkifyed HTML Document</title>\n");
print ("<style>p{margin:5px;}</style>\n");
print ("</head>\n");
print ("<body>\n");
print ("<div>\n");
print ("<pre>\n");

# A valid URL must consist solely of the following 82 characters
#
#   alphanumeric:  [:alnum:]     62
#   reserved:      ;/?:@=&        7
#   anchor-id:     #              1
#   encoding:      %              1
#   special:       $_.+!*'(),-   11
#   Total:                       82
#

# Make a non-interpolated string version of a character class
# consisting of the above 82 URL-legal characters:
my $Legal = q<[[:alnum:];/?:@=&#%$_.+!*'(),-]>;

# Make a non-interpolated string version of a regex specifying
# a cluster of 1-63 DNS-valid characters:
my $Dns = q<[0-9A-Za-z-]{1,63}>;

# Make a non-interpolated string version of a regex specifying
# a URL header:
my $Header = q<s?https?://>;

# Make an interpolated string version of a regex specifying
# a URL suffix (qq<> expands $Dns and $Legal once, hence the doubled backslash):
my $Suffix = qq<(?:$Dns\\.){1,62}$Dns/$Legal+>;

# This regex says "find a string which is probably a URL suffix,
# at start of line, and save any such found suffix as a backreference":
my $Regex1 = qr{^($Suffix)};

# This regex says "find a string which is probably a URL suffix,
# preceded by some space, and save any such found suffix as a backreference":
my $Regex2 = qr{(\s+)($Suffix)};

# This regex says "find a string which is probably a URL with header,
# and save any such found URL as a backreference":
my $Regex3 = qr{($Header$Suffix)};

# Now loop through all lines of text in the original file. First add http:// to
# any URLs that need it; then wrap all URLs in "a" elements, with the
# URL used as both the text and the "href" attribute of the "a" element:

#print $Regex1,"\n";
#print $Regex2,"\n";
#print $Regex3,"\n";

while (<>)
{
# Tack 'http://' onto the beginning of any strings which are
# probably URLs but lack 'http://':

$_ =~ s{$Regex1}{http://$1}; # No sense using g here (beginning of line only).
#print ("Regex1 matched ", $&, "\n");

$_ =~ s{$Regex2}{$1http://$2}g; # This one could be anywhere on the line.
#print ("Regex2 matched ", $&, "\n");

# Wrap each found URL in an html anchor element with the found URL used both
# as the "href" attribute and as the text:
$_ =~ s{$Regex3}{<a href="$1">$1</a>}g;
#print ("Regex3 matched ", $&, "\n");

# Print the edited line. If the line did not contain a URL, it will be
# printed unexpurgated. To redirect output to a file, use ">" on the
# command line.
print;
}

# Print element-closure tags for pre, div, body, html:
print ("</pre>\n");
print ("</div>\n");
print ("</body>\n");
print ("</html>\n");
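
To check the three substitutions without feeding the script a whole file,
the same regexes can be wrapped in a small sub and tried on single lines.
This is a sketch only: linkify_line is a made-up name, and the sample lines
are the ones discussed above.

```perl
use strict;
use warnings;

# Same building blocks as the script above.
my $Legal  = q<[[:alnum:];/?:@=&#%$_.+!*'(),-]>;
my $Dns    = q<[0-9A-Za-z-]{1,63}>;
my $Header = q<s?https?://>;
my $Suffix = qq<(?:$Dns\\.){1,62}$Dns/$Legal+>;

my $Regex1 = qr{^($Suffix)};
my $Regex2 = qr{(\s+)($Suffix)};
my $Regex3 = qr{($Header$Suffix)};

# Hypothetical helper: apply the three substitutions to one line.
sub linkify_line {
    my ($line) = @_;
    $line =~ s{$Regex1}{http://$1};
    $line =~ s{$Regex2}{$1http://$2}g;
    $line =~ s{$Regex3}{<a href="$1">$1</a>}g;
    return $line;
}

print linkify_line("see www.asdf.com/qwer.html now\n");
print linkify_line("http://www.asdf.com/qwer.html\n");
print linkify_line("references in Sec 35.74 paragraph B\n");
```

The first line should gain both the header and the anchor, the second
should be wrapped without a doubled header, and the third should pass
through untouched.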