From: Ben Bullock
On Tue, 15 Apr 2008 13:35:27 -0700, Robbie Hatley wrote:

> "Ben Bullock" wrote:
>
>> Well OK but if I was going to do this for real, I would use something
>> like /\b(($validdns\.){1,62}(com|net|org|us|uk|ca|jp))\b/i or similar
>> (I haven't checked this regex with the machine yet but hopefully you
>> get the picture).
>
> The problem with "(com|net|org|us|uk|ca|jp)" or similar is that there
> are hundreds or thousands of such valid domain suffixes.

I think there are only about 200 or so, most of which are rare.
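
As a rough sketch of the idea (not a checked, production regex): the
suffix alternation can be generated from a list, so covering a couple of
hundred suffixes is mostly a matter of filling in the list. Here
$validdns is my assumption about what a valid DNS label looks like,
@tlds is a deliberately short illustrative list, and find_host is a
hypothetical helper name.

```perl
use strict;
use warnings;

# Assumption: a DNS label is 1-63 letters, digits or hyphens, and does
# not start or end with a hyphen.
my $validdns = qr/[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?/i;

# Illustrative only; a real list would hold every suffix you care about.
my @tlds    = qw(com net org us uk ca jp);
my $tld_alt = join '|', map { quotemeta($_) } @tlds;

# One or more "label." groups followed by one of the listed suffixes.
my $host_re = qr/\b((?:$validdns\.){1,62}(?:$tld_alt))\b/i;

# Return the first host name found in $text, or undef if there is none.
sub find_host {
    my ($text) = @_;
    return $text =~ $host_re ? $1 : undef;
}

for my $text ('see www.example.com for details', 'no host in a&b$c%d here') {
    my $host = find_host($text);
    print defined $host ? "matched: $host\n" : "no match: $text\n";
}
```

The first string should give "matched: www.example.com"; the second
contains &, $ and % but no dotted host name, so nothing matches.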

> You're
> forgetting "es" (Spain), "ru" (Russia), "uk" (Ukraine), "us" (USA), not
> to mention "mil", "gov", "edu", "biz", "info", etc, etc, etc.

Um, I have both "us" and "uk" there. I didn't know that uk was Ukraine
though.

> That's
> part of why my URL-matching regex was so vague.

>> I just wanted to make the point that the &$% stuff is not valid as part
>> of the web address.
>
> Those characters all appear in web addresses.

Did you really not understand my point?

> Hence I tend to go for a vague RE that I believe
> captures every valid document URL, at the cost of occasionally
> capturing a few invalid ones. Unless someone knows a better approach.

Well, even if they do know a better approach, they might not have the
energy to discuss it with you.
From: Ben Bullock
On Wed, 16 Apr 2008 12:49:32 -0700, Robbie Hatley wrote:

> Ok, I just downloaded Regexp-Common-2.120. Now I have a folder with a
> bunch of stuff in it. This may sound like an incredibly stupid
> question, but what do I do with it? I've never actually used a CPAN
> module before. Any hints a CPAN newbie should be aware of?

If I want to install a CPAN module, I usually don't download the .tar.gz
file directly. Instead I log in as root and type

cpan Regexp::Common

If you are not logged in as root, you might need to prefix that with
"sudo", for example on Ubuntu/Debian Linux.

If you are using ActiveState Perl on Windows, you are better off using
"ppm", the Perl Package Manager, which has precompiled versions of the
modules.