From: kens on 13 Apr 2008 21:31

On Apr 13, 9:07 pm, "Robbie Hatley" <lonew...(a)well.com> wrote:
> Today I was editing a URL-linkifying program I wrote several
> weeks ago, and I ran across some issues with q{} and qr{}
> which are puzzling me.
>
> Here's an edited-for-brevity version of the program:
>
> my $Legal = q{[[:alnum:];/?:@=&#%$_.+!*'(),-]};
> my $Regex1 = qr{($Legal+\.$Legal+/$Legal+)};
> my $Regex2 = qr{(s?https?://$Legal+)};
> while (<>)
> {
>    s{$Regex1}{http://$1}g;
>    s{$Regex2}{\n<p><a href="$1">$1</a></p>\n}g;
>    print ($_);
> }
>
> (As an afterthought, I also tacked the entire program on the
> end of this post, for anyone who's interested.)
>
> I have two questions:
>
> 1. I had a "\" before the "$" to prevent "$_" from being
>    interpolated. But when I took the "\" out, the regexes
>    still worked fine! Seems to me they should break, because
>    $_ is now a variable rather than just "dollar sign followed
>    by underscore". But $_ seems not to be interpolated.
>    So, is variable interpolation always strictly "one pass"?

q{} is equivalent to the single-quote operator. Strings inside single
quotes do not get interpolated (as opposed to double quotes, "" or
qq{}).

>
> 2. I've read that qr{} "compiles" the regex; I'm hoping that
>    means that the s/// operators in the while loop will not
>    recompile $Regex1 and $Regex2 each iteration, even though
>    I didn't use a /o flag? (No sense wasting CPU time
>    recompiling, because the patterns are fixed.)
>

Based on the documentation (perldoc perlop), qr may invoke a
precompilation of the pattern. To me that implies that it is
implementation specific, but there are others with more expertise in
this area than me.

HTH, Ken

> lines deleted
>
> --
> Cheers,
> Robbie Hatley
> lonewolf aatt well dott com
> www dott well dott com slant user slant lonewolf slant
From: John W. Krahn on 13 Apr 2008 23:01

Robbie Hatley wrote:
> Today I was editing a URL-linkifying program I wrote several
> weeks ago, and I ran across some issues with q{} and qr{}
> which are puzzling me.
>
> Here's an edited-for-brevity version of the program:
>
> my $Legal = q{[[:alnum:];/?:@=&#%$_.+!*'(),-]};
> my $Regex1 = qr{($Legal+\.$Legal+/$Legal+)};
> my $Regex2 = qr{(s?https?://$Legal+)};
> while (<>)
> {
>    s{$Regex1}{http://$1}g;
>    s{$Regex2}{\n<p><a href="$1">$1</a></p>\n}g;
>    print ($_);
> }
>
> (As an afterthought, I also tacked the entire program on the
> end of this post, for anyone who's interested.)
>
> I have two questions:
>
> 1. I had a "\" before the "$" to prevent "$_" from being
>    interpolated.

That just adds a '\' character to your character class:

$ perl -le'$x = q{[$_]}; print qr{$x}'
(?-xism:[$_])
$ perl -le'$x = q{[\$_]}; print qr{$x}'
(?-xism:[\$_])

Which it doesn't look like you intended to include.

>    But when I took the "\" out, the regexes
>    still worked fine! Seems to me they should break, because
>    $_ is now a variable rather than just "dollar sign followed
>    by underscore". But $_ seems not to be interpolated.
>    So, is variable interpolation always strictly "one pass"?

Read the "Gory details of parsing quoted constructs" section of:

perldoc perlop

> 2. I've read that qr{} "compiles" the regex; I'm hoping that
>    means that the s/// operators in the while loop will not
>    recompile $Regex1 and $Regex2 each iteration,

That is correct.

>    even though
>    I didn't use a /o flag? (No sense wasting CPU time
>    recompiling, because the patterns are fixed.)

perldoc -q /o

John
--
Perl isn't a toolbox, but a small machine shop where you
can special-order certain sorts of tools at low cost and
in short order. -- Larry Wall
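
A small sketch to go with the two one-liners above, in case it helps. The backslash does show up in the pattern text, but as far as the regex engine is concerned, `\$` inside a bracketed class is just an escaped dollar sign, so both classes should accept the same characters and neither should accept a literal backslash. The helper name class_matches is made up for illustration.

```perl
use strict;
use warnings;

# Two single-quoted strings, as in the one-liners: q{} never interpolates,
# so "$_" stays literal text and the backslash (if present) is kept.
my $plain   = q{[$_]};
my $escaped = q{[\$_]};

# Hypothetical helper: does the character class in $class match $char?
sub class_matches {
    my ($class, $char) = @_;
    my $re = qr{\A$class\z};    # $class is inserted once, not re-interpolated
    return $char =~ $re ? 1 : 0;
}

# Try a dollar sign, an underscore, a backslash and an ordinary letter.
for my $char ('$', '_', '\\', 'a') {
    printf "%-2s plain=%d escaped=%d\n",
        $char, class_matches($plain, $char), class_matches($escaped, $char);
}
```

So the extra backslash looks harmless here, just unnecessary.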
From: Ben Bullock on 14 Apr 2008 00:29

On Apr 14, 10:07 am, "Robbie Hatley" <lonew...(a)well.com> wrote:

> my $Regex1 = qr{($Legal+\.$Legal+/$Legal+)} ;

> # This regex says "find a string which is probably a URL minus the 'http://'
> # part; save any such found string as a backreference":
> my $Regex1 = qr{($Legal+\.$Legal+/$Legal+)} ;

Also, here [a-z0-9-]{3,63} (ignoring case) is enough. Your regex will
get things which aren't valid URLs. The following catches anything
valid:

my $validdns = '[0-9a-z-]{2,63}';
m/\b(($validdns\.){1,62}$validdns)\b/i # Catches any valid thing.

> s{$Regex1}{http://$1}g;
> print ($_);

You can just say

print;

here if you like.
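
To see what the $validdns pattern above actually captures, here is a minimal sketch that runs it over a few strings. The sample strings and the helper name first_dns_match are made up for this example. It suggests the pattern also picks up things like decimal numbers, and that a single-character label is skipped rather than matched.

```perl
use strict;
use warnings;

# The pattern from the post above, wrapped in a hypothetical helper that
# returns the first captured name, or undef when nothing matches.
my $validdns = '[0-9a-z-]{2,63}';

sub first_dns_match {
    my ($text) = @_;
    return $text =~ m/\b(($validdns\.){1,62}$validdns)\b/i ? $1 : undef;
}

# Sample inputs (assumptions for this sketch, not from a real document).
for my $text ('visit www.example.com today',
              'references in Sec 35.74 paragraph B',
              'j.qbc.net.ca') {
    my $hit = first_dns_match($text);
    printf "%-40s => %s\n", $text, defined $hit ? $hit : '(no match)';
}
```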
From: Robbie Hatley on 14 Apr 2008 04:35

"John W. Krahn" wrote:

> Robbie Hatley wrote:
>
> > is variable interpolation always strictly "one pass"?
>
> Read the "Gory details of parsing quoted constructs" section of:
> perldoc perlop

Thanks for the tip, but that section doesn't actually say
whether Perl variable interpolation is single-pass or multi-pass
(recursive). However, when I scrolled up from that section,
I noticed that one of the sections above that:

http://perldoc.perl.org/perlop.html#Quote-and-Quote-like-Operators

*does* specify what I was looking for. It says:

"Perl does not expand multiple levels of interpolation."

Bingo. That's what I was wondering. That explains why "$_"
wasn't being interpolated in my program.

perl -le 'my $Cat=q/Fifi/; my $Dog=q/$Cat/; print qq/$Dog/;'

Prints "$Cat", not "Fifi" as I had expected. Now that I understand
why, I can avoid being surprised by that.

--
Cheers,
Robbie Hatley
perl -le 'print "\154o\156e\167o\154f\100w\145ll\56c\157m"'
perl -le 'print "\150ttp\72//\167ww.\167ell.\143om/~\154onewolf/"'
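
The single-level rule can be sketched as a short script covering both the $Cat/$Dog one-liner and the $Legal character class from the original program. The names $once, $AllLegal and all_legal are made up for this sketch; the rest is taken from the thread.

```perl
use strict;
use warnings;

# q// is single-quoted, so $Dog holds the four characters "$Cat".
my $Cat = q/Fifi/;
my $Dog = q/$Cat/;

# qq// interpolates $Dog once; the "$Cat" text it contains is left alone.
my $once = qq/$Dog/;
print "$once\n";

# Same rule for a pattern: the "$_" inside $Legal reaches the regex engine
# as a literal dollar sign and underscore within the character class.
my $Legal    = q{[[:alnum:];/?:@=&#%$_.+!*'(),-]};
my $AllLegal = qr{\A$Legal+\z};

# Hypothetical helper: true when every character of $s is in the class.
sub all_legal {
    my ($s) = @_;
    return $s =~ $AllLegal ? 1 : 0;
}

printf "%d\n", all_legal('cost$_total');
printf "%d\n", all_legal('has space');
```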
From: Robbie Hatley on 14 Apr 2008 13:40

"Ben Bullock" wrote:

> On Apr 14, 10:07 am, "Robbie Hatley" <lonew...(a)well.com> wrote:
>
> > # This regex says "find a string which is probably a URL minus the 'http://'
> > # part; save any such found string as a backreference":
> > my $Regex1 = qr{($Legal+\.$Legal+/$Legal+)};
>
> ... [a-z0-9-]{3,63} (ignoring case) is enough. Your regex will
> get things which aren't valid URLs. The following catches anything
> valid:
>
> my $validdns = '[0-9a-z-]{2,63}';
> m/\b(($validdns\.){1,62}$validdns)\b/i # Catches any valid thing.

I can see that your pattern looks for just the dns part of the url,
which has fewer valid characters; but since it doesn't look for "/",
it will convert this string:

references in Sec 35.74 paragraph B

to

references in Sec http://35.74 paragraph B

I believe you're right in that it will find most valid dns strings;
but it also catches things that aren't part of URLs at all (such as
numbers with decimal points), and it rejects certain well-formed
domain strings (such as "j.qbc.net.ca", which fails the "{2,63}"
assertion).

My pattern at least insists on "stuff.stuff/stuff", so it rejects
"35.74". It rejects domain-level URLs and only linkifies
document-level URLs. That may be a blessing or a curse, depending on
your expectations.

Also, both your pattern and mine are broken in that they match
http://www.asdf.com/qwer.html, and indeed convert it to
http://http://www.asdf.com/qwer.html . Oops! What was really
intended was to find "bare" URLs (without "http://") and tack
"http://" on the beginning.

Ok, this should do the trick; it blends features from your approach
and mine, and solves the bugs I just mentioned, as well as some other
bugs I've noticed:

#!/usr/bin/perl

# linkify.perl

# Converts any text document into an HTML document with all of the contents of
# the original, but with any HTTP URLs converted to clickable hyperlinks.

# First print the standard opening lines of an HTML file.
# The title will be "Linkifyed HTML Document",
# the body text is in a "div" element,
# and the paragraphs will have 5-pixel margins on all 4 sides:

use strict;
use warnings;

# Print initial tags for HTML file:
print ("<html>\n");
print ("<head>\n");
print ("<title>Linkifyed HTML Document</title>\n");
print ("<style>p{margin:5px;}</style>\n");
print ("</head>\n");
print ("<body>\n");
print ("<div>\n");
print ("<pre>\n");

# A valid URL must consist solely of the following 82 characters
#
# alphanumeric:  [:alnum:]    62
# reserved:      ;/?:@=&       7
# anchor-id:     #             1
# encoding:      %             1
# special:       $_.+!*'(),-  11
# Total:                      82
#
# Make a non-interpolated string version of a character class
# consisting of the above 82 URL-legal characters:
my $Legal = q<[[:alnum:];/?:@=&#%$_.+!*'(),-]>;

# Make a non-interpolated string version of a regex specifying
# a cluster of 1-63 DNS-valid characters:
my $Dns = q<[0-9A-Za-z-]{1,63}>;

# Make a non-interpolated string version of a regex specifying
# a URL header:
my $Header = q<s?https?://>;

# Make an interpolated string version of a regex specifying
# a URL suffix, built from the pieces above:
my $Suffix = qq<(?:$Dns\\.){1,62}$Dns/$Legal+>;

# This regex says "find a string which is probably a URL suffix,
# at start of line, and save any such found suffix as a backreference":
my $Regex1 = qr{^($Suffix)};

# This regex says "find a string which is probably a URL suffix,
# preceded by some space, and save any such found suffix as a backreference":
my $Regex2 = qr{(\s+)($Suffix)};

# This regex says "find a string which is probably a URL with header,
# and save any such found URL as a backreference":
my $Regex3 = qr{($Header$Suffix)};

# Now loop through all lines of text in the original file. First add http:// to
# any URLs that need it; then wrap all URLs in "a" elements, with the
# URL used as both the text and the "href" attribute of the "a" element:

#print $Regex1,"\n";
#print $Regex2,"\n";
#print $Regex3,"\n";

while (<>)
{
   # Tack 'http://' onto the beginning of any strings which are
   # probably URLs but lack 'http://':
   $_ =~ s{$Regex1}{http://$1}; # No sense using g here (beginning of line only).
   #print ("Regex1 matched ", $&, "\n");
   $_ =~ s{$Regex2}{$1http://$2}g; # This one could be anywhere on the line.
   #print ("Regex2 matched ", $&, "\n");

   # Wrap each found URL in an html anchor element with the found URL used both
   # as the "href" attribute and as the text:
   $_ =~ s{$Regex3}{<a href="$1">$1</a>}g;
   #print ("Regex3 matched ", $&, "\n");

   # Print the edited line. If the line did not contain a URL, it will be
   # printed unexpurgated. To redirect output to a file, use ">" on the
   # command line.
   print;
}

# Print element-closure tags for pre, div, body, html:
print ("</pre>\n");
print ("</div>\n");
print ("</body>\n");
print ("</html>\n");
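
As a quick check of the final program, here is a minimal sketch that wraps its three substitutions in a function and runs them on a few lines. The pattern pieces are copied from the program; the function name linkify_line and the sample lines are assumptions made for this example, and `${1}http://` is only a more explicit spelling of `$1http://`.

```perl
use strict;
use warnings;

# Pattern pieces copied from the program above.
my $Legal  = q<[[:alnum:];/?:@=&#%$_.+!*'(),-]>;
my $Dns    = q<[0-9A-Za-z-]{1,63}>;
my $Header = q<s?https?://>;
my $Suffix = qq<(?:$Dns\\.){1,62}$Dns/$Legal+>;

my $Regex1 = qr{^($Suffix)};          # bare URL at start of line
my $Regex2 = qr{(\s+)($Suffix)};      # bare URL after whitespace
my $Regex3 = qr{($Header$Suffix)};    # URL with its header

# Hypothetical wrapper: apply the three substitutions to one line.
sub linkify_line {
    my ($line) = @_;
    $line =~ s{$Regex1}{http://$1};
    $line =~ s{$Regex2}{${1}http://$2}g;
    $line =~ s{$Regex3}{<a href="$1">$1</a>}g;
    return $line;
}

# Sample lines: a bare URL, a URL that already has its header, a bare URL
# in mid-line, and a decimal number that should be left alone.
for my $line ('www.asdf.com/qwer.html',
              'http://www.asdf.com/qwer.html',
              'see www.asdf.com/qwer.html now',
              'references in Sec 35.74 paragraph B') {
    my $out = linkify_line($line);
    print "$out\n";
}
```

If the sketch is right, the first two lines come out identical, with a single "http://", and the "35.74" line passes through unchanged.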