From: Jürgen Exner on
"Kyle T. Jones" <KBfoMe(a)realdomain.net> wrote:
>Tad McClellan wrote:
>> Kyle T. Jones <KBfoMe(a)realdomain.net> wrote:
>>> Steve wrote:
>>
>>>> like lets say I searched a site
>>>> that had 15 news links and 3 of them said "Hello" in the title. I
>>>> would want to extract only the links that said hello in the title.
>>> Read up on perl regular expressions.
>>
>>
>> While reading up on regular expressions is certainly a good idea,
>> it is a horrid idea for the purposes of parsing HTML.
>>
>
>Ummm. Could you expand on that?
>
>My initial reaction would be something like - I'm pretty sure *any*
>method, including the use of HTML::LinkExtor, or XML transform (both
>outlined upthread), involves using regular expressions "for the purposes
>of parsing HTML".

Regular expressions recognize regular languages. But HTML is a
context-free language and therefore cannot be recognized solely by a
regular parser.
Having said that, Perl's extended regular expressions are indeed more
powerful than strictly regular ones, but it is still a bad idea because
the expressions become way too complex.
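To illustrate that extra power, a toy sketch (mine, not from any
module; assumes perl 5.10 or later):

    # toy sketch: (?R) recurses into the whole pattern, so this
    # matches arbitrarily nested parentheses, which no strictly
    # regular expression can do
    if ( "(a(b)c)" =~ / \( (?: [^()]++ | (?R) )* \) /x ) {
        print "balanced\n";
    }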

>At best, you're just abstracting the regex work back to the includes.
>AFAIK, and feel free to correct me (I'll go take a look at some of the
>relevant module code in a bit), every CPAN module that is involved with
>parsing HTML uses fairly straightforward regex matching somewhere within
>that module's methods.

Using REs to do _part_ of the work of parsing any language is a
no-brainer; of course everyone does it, e.g. in the tokenizer.

But unless your language is a regular language (and there aren't many
useful regular languages because regular is just too restrictive) you
need additional algorithms that cannot be expressed as REs to actually
parse a context-free or context-sensitive language.
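A toy sketch of that division of labour (my illustration only; it
ignores comments, void elements like <br>, and '>' inside attribute
values):

    use strict;
    use warnings;

    # the RE tokenizes the tags; the stack checks the nesting,
    # which is exactly the part no RE alone can do
    my $html = '<table><tr><td>x</td></tr></table>';
    my @stack;
    while ( $html =~ m{<(/?)(\w+)[^>]*>}g ) {
        my ($close, $tag) = ($1, lc $2);
        if ($close) {
            my $open = pop @stack;
            warn "mismatched </$tag>\n" if !defined $open or $open ne $tag;
        }
        else {
            push @stack, $tag;
        }
    }
    warn "unclosed tags: @stack\n" if @stack;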

>I think there's an argument that, considering you can do this so easily
>(in under 15 lines of code) without the overhead of unnecessary
>includes, my way would be more efficient. We can run some benchmarks if
>you want (see further down for working code).

But you cannot! Ever heard of the Chomsky Hierarchy? No recollection of
Theory of Computer Languages or Basics of Compiler Construction?
What do people learn in Computer Science today?

jue
From: Ben Morrow on

Quoth Jürgen Exner <jurgenex(a)hotmail.com>:
>
> But you cannot! Ever heard of the Chomsky Hierarchy? No recollection of
> Theory of Computer Languages or Basics of Compiler Construction?
> What do people learn in Computer Science today?

I suspect that most people writing Perl have never formally studied
Computer Science. I certainly haven't, though I've picked up a fair bit
of the theory along the way because I'm interested.

Ben

From: Kyle T. Jones on
Jürgen Exner wrote:
> "Kyle T. Jones" <KBfoMe(a)realdomain.net> wrote:
>> Tad McClellan wrote:
>>> Kyle T. Jones <KBfoMe(a)realdomain.net> wrote:
>>>> Steve wrote:
>>>>> like lets say I searched a site
>>>>> that had 15 news links and 3 of them said "Hello" in the title. I
>>>>> would want to extract only the links that said hello in the title.
>>>> Read up on perl regular expressions.
>>>
>>> While reading up on regular expressions is certainly a good idea,
>>> it is a horrid idea for the purposes of parsing HTML.
>>>
>> Ummm. Could you expand on that?
>>
>> My initial reaction would be something like - I'm pretty sure *any*
>> method, including the use of HTML::LinkExtor, or XML transform (both
>> outlined upthread), involves using regular expressions "for the purposes
>> of parsing HTML".
>
> Regular expressions recognize regular languages. But HTML is a
> context-free language and therefore cannot be recognized solely by a
> regular parser.
> Having said that, Perl's extended regular expressions are indeed more
> powerful than strictly regular ones, but it is still a bad idea because
> the expressions become way too complex.
>
>> At best, you're just abstracting the regex work back to the includes.
>> AFAIK, and feel free to correct me (I'll go take a look at some of the
>> relevant module code in a bit), every CPAN module that is involved with
>> parsing HTML uses fairly straightforward regex matching somewhere within
>> that module's methods.
>
> Using REs to do _part_ of the work of parsing any language is a
> no-brainer; of course everyone does it, e.g. in the tokenizer.
>
> But unless your language is a regular language (and there aren't many
> useful regular languages because regular is just too restrictive) you
> need additional algorithms that cannot be expressed as REs to actually
> parse a context-free or context-sensitive language.
>
>> I think there's an argument that, considering you can do this so easily
>> (in under 15 lines of code) without the overhead of unnecessary
>> includes, my way would be more efficient. We can run some benchmarks if
>> you want (see further down for working code).
>
> But you cannot! Ever heard of the Chomsky Hierarchy? No recollection of
> Theory of Computer Languages or Basics of Compiler Construction?
> What do people learn in Computer Science today?
>
> jue

But isn't the Chomsky Hierarchy completely irrelevant in this (forgive
the pun) context? Surely you "get" that my input is analyzed as nothing
more or less than a sequence of characters. That it was originally
written in HTML, or any other CFG-based language, is meaningless; both
syntactic and semantic considerations of that original language are
irrelevant in the (again, forgive me) context of what I'm attempting,
which is simply to match one finite sequence of characters against
another finite sequence of characters. I couldn't care less what those
characters mean, what href indicates, what a <body> tag is, etc.

I don't need to understand English to count the # of e's in the above
passage, right? Neither does Perl.
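For instance, counting them in Perl is a one-liner (a trivial sketch):

    my $passage = "I don't need to understand English to count e's.";
    my $count   = () = $passage =~ /e/g;   # list context counts the matches
    print "$count\n";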

I believe what you say above is true: to truly "parse" the page AS HTML
is beyond the ability of REs. But I'm not parsing anything AS HTML, if
that makes sense. In fact, to take that a step further, I'm not
"parsing", period, so perhaps it was a mistake for me to use that term.
I meant it colloquially; sorry if that caused any confusion.

Cheers.


" 'Regular expressions' [...] are only marginally related to real
regular expressions. Nevertheless, the term has grown with the
capabilities of our pattern matching engines, so I'm not going to try to
fight linguistic necessity here. I will, however, generally call them
"regexes" (or "regexen", when I'm in an Anglo-Saxon mood)" - Larry Wall
From: Tad McClellan on
Kyle T. Jones <KBfoMe(a)realdomain.net> wrote:
> Tad McClellan wrote:
>> Kyle T. Jones <KBfoMe(a)realdomain.net> wrote:
>>> Steve wrote:
>>
>>>> like lets say I searched a site
>>>> that had 15 news links and 3 of them said "Hello" in the title. I
>>>> would want to extract only the links that said hello in the title.
>>> Read up on perl regular expressions.
>>
>>
>> While reading up on regular expressions is certainly a good idea,
>> it is a horrid idea for the purposes of parsing HTML.
>>
>
> Ummm. Could you expand on that?


I think the FAQ answer does a pretty good job of it.


> My initial reaction would be something like - I'm pretty sure *any*
> method, including the use of HTML::LinkExtor, or XML transform (both
> outlined upthread), involves using regular expressions "for the purposes
> of parsing HTML".


"pattern matching" is not at all the same as "parsing".

Regular expressions are *great* for pattern matching.

It is mathematically impossible to do a proper parse of a context-free
language such as HTML with nothing more than regular expressions.

They do not contain the requisite power.

Google for the "Chomsky hierarchy".

HTML allows a table within a table within a table within a table,
to an arbitrary depth; i.e., it is not "regular".


> I think there's an argument that, considering you can do this so easily
> (in under 15 lines of code) without the overhead of unnecessary
> includes, my way would be more efficient.


Do you want easy and wrong or hard and correct?


> you want (see further down for working code).


You have a strange definition of "working"...


>> Have you read the FAQ answers that mention HTML?
>>
>> perldoc -q HTML


Did you try that yet?

It points out at least one way that your code below can fail.


> I think this works fine:


You just haven't used a data set that exposes its flaws.

You are not "parsing", you are "pattern matching".

"pattern matching" is often "good enough", but you should realize
its fragility so that you can assess whether it is worth the ease
of implementation or not.


> #!/usr/bin/perl -w
                  ^^
                  ^^
> use strict;
> use warnings;
      ^^^^^^^^


Turning on warnings 2 times is kind of silly...

Lose the command-line switch; lexical warnings are much better.


Try it with this:

-------------------
my $contents = '
<html><body>
<!--
this is NOT a link...
<a href="google.com">Google</a>
-->
</body></html>
';
-------------------


It will make output when it should make none.
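For contrast, feed the same data to a real parser, e.g. HTML::LinkExtor
as mentioned upthread (a sketch; assumes $contents holds the HTML
above). It prints nothing, because the parser knows the link is inside
a comment:

-------------------
use HTML::LinkExtor;

my $p = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    print "$attr{href}\n" if $tag eq 'a' and defined $attr{href};
});
$p->parse($contents);
$p->eof;
-------------------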


> my @semiparsed=split(/href/i, $contents);
>
> foreach(@semiparsed){
> if($_=~/^\s*=\s*('|")(.*?)('|")/){


Gak!

Whitespace is not a scarce resource, feel free to use as much of it
as you like to make your code easier to read and understand.

Character classes are much more efficient than alternation.

Either be explicit in both places:

foreach $_ (
if ( $_ =~ /...

or in neither:

foreach (
if ( /...

be consistent.

So, let's rewrite that line as an experienced Perl programmer might:

if ( /^\s*=\s*['"](.*?)['"]/ ) { # now link will be in $1 instead of $2
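Put together, the loop might look like this (a sketch; the rest of your
loop is not quoted above, so the print is a guess at what it did, and
it is still fragile pattern matching):

-------------------
my @semiparsed = split /href/i, $contents;

foreach (@semiparsed) {
    if ( /^\s*=\s*['"](.*?)['"]/ ) {
        print "$1\n";   # the URL; comments and the like still fool this
    }
}
-------------------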


Also, your code does not address the OP's question.

It tests the URL for a string rather than testing the <a> tag's _contents_.

That is, he wanted to test

<a href="...">...</a>
              ^^^
              ^^^ here

rather than

<a href="...">...</a>
         ^^^
         ^^^
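
For the record, a sketch of testing the _contents_ properly, using
HTML::TreeBuilder from the CPAN HTML-Tree distribution (assumes the
page is in $contents):

-------------------
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_content($contents);

# <a> elements whose *text* contains "hello"
foreach my $a ( $tree->look_down( _tag => 'a',
                                  sub { $_[0]->as_text =~ /hello/i } ) ) {
    print $a->attr('href'), "\n";
}

$tree->delete;   # HTML::Element trees need explicit cleanup
-------------------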

--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.liamg\100cm.j.dat/"
The above message is a Usenet post.
I don't recall having given anyone permission to use it on a Web site.
From: Ben Morrow on

Quoth "Kyle T. Jones" <KBfoMe(a)realdomain.net>:
>
> But isn't the Chomsky Hierarchy completely irrelevant in this (forgive
> the pun) context? Surely you "get" that my input is analyzed as nothing
> more or less than a sequence of characters. That it was originally
> written in HTML, or any other CFG-based language, is meaningless; both
> syntactic and semantic considerations of that original language are
> irrelevant in the (again, forgive me) context of what I'm attempting,
> which is simply to match one finite sequence of characters against
> another finite sequence of characters. I couldn't care less what those
> characters mean, what href indicates, what a <body> tag is, etc.

This is correct, and treating HTML (or whatever) as plain text for the
purposes of grabbing something you want can be a valuable technique.
It's worth being aware that it's basically a hack, though, and that a
problem like 'find all the links in this document' is much better solved
by parsing the HTML properly than by trying to construct a regex to
match all possible forms of <a> tag.

Ben