RegEx help - how to check for a link in HTML? [Framework]

From: Alan Silver on 25 Feb 2010 15:15

>Try the HTMLAgilityPack, it's much better for getting the information
>you want.
>
>See Codeplex.com/HtmlAgilityPack

Harumph! If I'd seen that earlier, I could have saved a good few hours
of frustration!

Mind you, what I ended up with was very neat and compact, so I can't
complain I suppose.

Thanks for pointing that one out. It certainly deserves a close look.

--
Alan Silver
(anything added below this line is nothing to do with me)
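
[Editorial sketch] HtmlAgilityPack is a .NET library, and the code below is not its API. As a rough illustration of the parser-based approach it stands for, here is a minimal sketch using Python's standard html.parser module. The names NofollowLinkFinder and find_links are invented for this example, and the host www.fred.com is taken from the thread.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class NofollowLinkFinder(HTMLParser):
    """Collects (href, has_nofollow) for every <a> pointing at one host."""

    def __init__(self, host):
        super().__init__()
        self.host = host
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # Attribute values can be None (e.g. <a href>), so fall back to "".
        href = attr_map.get("href") or ""
        if urlparse(href).hostname != self.host:
            return
        # rel is a space-separated token list, e.g. rel="nofollow noopener".
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        self.results.append((href, "nofollow" in rel_tokens))


def find_links(html, host="www.fred.com"):
    """Hypothetical helper: return the links to `host` found in `html`."""
    finder = NofollowLinkFinder(host)
    finder.feed(html)
    finder.close()
    return finder.results
```

Because the parser hands over one tag at a time, the question of a match spanning two anchors never arises, and attribute order and quoting style stop mattering.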

From: Michael Wojcik on 2 Mar 2010 12:18

Alan Silver wrote:
>
> Say I want to look for a link to the domain www.fred.com, then the regex...
>
> <a .*?nofollow.*?http://www\.fred\.com.*?>.*?</a>
>
> ...will match the following...
>
> <a rel="nofollow" href="http://www.fred.com">fred</a>
>
> ...which is right, but it will also match...
>
> <a href="http://www.cnn.com/" rel="nofollow">CNN</a><a
> href="http://www.fred.com">fred</a>
>
> ...which I don't want. It seems that the regex is matching the nofollow
> part to the first link, and so telling me that the whole HTML fragment
> contains a nofollow link to www.fred.com. This is wrong.
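
[Editorial sketch] The behaviour described in the quoted text can be reproduced with a few lines of Python. This assumes Python's re module, which treats ".*?" the same way for this pattern; the variable names are invented for the example, and the second fragment is written on one line.

```python
import re

# The regex from the question, unchanged.
NAIVE = re.compile(r'<a .*?nofollow.*?http://www\.fred\.com.*?>.*?</a>')

wanted = '<a rel="nofollow" href="http://www.fred.com">fred</a>'
unwanted = ('<a href="http://www.cnn.com/" rel="nofollow">CNN</a>'
            '<a href="http://www.fred.com">fred</a>')

# Both fragments match. In the second, "nofollow" is taken from the CNN
# link and the URL from the fred link, because "." is free to cross ">".
wanted_match = NAIVE.search(wanted)
unwanted_match = NAIVE.search(unwanted)
```

In the second case the match covers both anchor elements, which is the false positive being complained about.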

It's difficult to do this completely reliably, because implementing
the entire HTML DTD *plus* violations of it that are accepted by
common UAs (browsers and such) in a DFA is very complicated.

If we make some assumptions about the quality of the HTML you're
dealing with, though, we can simplify it considerably. Let's say that
it has to be well-formed, and that there's no whitespace between "<"
and "a" of an anchor tag.

Then you can prevent your regex above from spanning multiple anchor
elements by:

- Ensuring you don't span the end of the <a> tag when matching the
attributes you're looking for within it. Change ".*" in that part of
the regex to "[^>]*", so the subexpression will stop at the closing ">".

- Ensuring we don't capture "</a" between the "<a>" tag and the
closing "</a>" tag - that is, that we stop at the first "</a>" and
don't continue on to a later one, swallowing additional entire anchor
elements in the process. You can do that with a regex that matches:

  - any number of:
    - any number of characters that aren't "<", then
    - either:
      - "<" followed by a character that isn't "/", or
      - "</" followed by a character that isn't "a"

That can be expressed by this regular expression:

([^<]*((<[^/])|(</[^a]))*)*

(Read it from the inside out. "(<[^/])" is "'<' followed by a
character that isn't '/'", and so on.)

That gives us:

<a [^>]*nofollow[^>]*http://www\.fred\.com[^>]*>([^<]*((<[^/])|(</[^a]))*)*</a>

(That's probably going to be wrapped. It should be all on one line,
obviously.)

Also note that the "?" after ".*" doesn't limit what the match can
span; it only makes the "*" lazy, so that it tries the shortest match
first. Once ".*" is replaced by "[^>]*", the "?" isn't needed.

This works with your examples above. It also correctly handles child
elements of the anchor element (other than <a> within <a>, which isn't
well-formed):

<a rel="nofollow" href="http://www.fred.com">f<b>r</b>ed</a>
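
[Editorial sketch] The claims above can be checked with a short script. It assumes Python's re module, which accepts this pattern as written, and uses www.fred.com in every test string so that the host matches the regex. The variable names are invented for the example.

```python
import re

# The full-element regex from the post, on one line.
FULL = re.compile(
    r'<a [^>]*nofollow[^>]*http://www\.fred\.com[^>]*>'
    r'([^<]*((<[^/])|(</[^a]))*)*</a>'
)

simple = '<a rel="nofollow" href="http://www.fred.com">fred</a>'
two_links = ('<a href="http://www.cnn.com/" rel="nofollow">CNN</a>'
             '<a href="http://www.fred.com">fred</a>')
with_child = '<a rel="nofollow" href="http://www.fred.com">f<b>r</b>ed</a>'
followed_by_other = simple + '<a href="http://www.cnn.com/">CNN</a>'

simple_match = FULL.search(simple)                # matches
two_links_match = FULL.search(two_links)          # no match
with_child_match = FULL.search(with_child)        # matches, <b> included
trailing_match = FULL.search(followed_by_other)   # stops at the first </a>
```

One caveat: the nested "*" quantifiers can backtrack heavily on long inputs that almost match, so this pattern is better suited to short fragments than to whole pages.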

It seems to me that there ought to be a way to handle the second half
of that regex with negative lookahead, which might be simpler, but I
couldn't get that to work with a couple of quick tries.
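
[Editorial sketch] One common way to write that second half with negative lookahead is to consume one character at a time, checking before each that "</a>" does not start there. This is not something the post arrived at; it is a widely used idiom, shown here with Python's re module (other engines with lookahead should accept the same construct). The variable names are invented for the example.

```python
import re

LOOKAHEAD = re.compile(
    r'<a [^>]*nofollow[^>]*http://www\.fred\.com[^>]*>'  # opening tag
    r'(?:(?!</a>).)*'   # any character, as long as "</a>" doesn't start here
    r'</a>',            # the first closing tag
    re.DOTALL,          # let "." match newlines inside the element content
)

simple = '<a rel="nofollow" href="http://www.fred.com">fred</a>'
with_child = '<a rel="nofollow" href="http://www.fred.com">f<b>r</b>ed</a>'
two_links = ('<a href="http://www.cnn.com/" rel="nofollow">CNN</a>'
             '<a href="http://www.fred.com">fred</a>')
followed_by_other = simple + '<a href="http://www.cnn.com/">CNN</a>'

simple_match = LOOKAHEAD.search(simple)
child_match = LOOKAHEAD.search(with_child)
two_links_match = LOOKAHEAD.search(two_links)
trailing_match = LOOKAHEAD.search(followed_by_other)
```

The content part has a single quantifier instead of nested ones, which also makes it easier to read.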

All this is assuming you actually need to match the entire anchor
element in the HTML source for some reason. If you just want to verify
whether the <a> tag is present with those attributes, you can ignore
what comes after the closing ">" and greatly simplify the regex.

--
Michael Wojcik
Micro Focus
Rhetoric & Writing, Michigan State University

From: Alan Silver on 3 Mar 2010 09:37

Wow, what a comprehensive reply! Comments below...

In article <hmjm6d01qpp(a)news4.newsguy.com>, Michael Wojcik
<mwojcik(a)newsguy.com> writes
>It's difficult to do this completely reliably, because implementing
>the entire HTML DTD *plus* violations of it that are accepted by
>common UAs (browsers and such) in a DFA is very complicated.

Yup, I was assuming (perhaps foolishly) that such a simple thing as an
anchor tag might be generally well-formed ;-)

<snip>
><a [^>]*nofollow[^>]*http://www\.fred\.com[^>]*>([^<]*((<[^/])|(</[^a]))*)*</a>
<Snip>
>All this is assuming you actually need to match the entire anchor
>element in the HTML source for some reason. If you just want to verify
>whether the <a> tag is present with those attributes, you can ignore
>what comes after the closing ">" and greatly simplify the regex.

I realised after posting that I was only interested in the opening part
of the tag, as my interest here is whether or not the link is there, and
if there is a nofollow value set. I ignored the anchor text and closing
tag.

So, how does your regex compare with the one I posted a couple of days
ago? I solved the problem I had in a similar way to yours (I think), and
ended up with...

<a [^<>]+nofollow[^<>]+http://www\.fred\.com[^<>]+>

This one only matches if there is a nofollow. I need to detect that, so
I had one regex to check for an anchor tag...

<a .*?http://www\.fred\.com.*?>.*?</a>

...and then the previous regex to match a nofollow before the href and a
similar one for when the nofollow is after the href.
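
[Editorial sketch] The three-regex approach might look something like this in Python. Two things here are assumptions rather than quotes from the post: the "nofollow after href" pattern is a guess at what the "similar one" looked like, and the link check is restricted to the opening tag with "[^<>]*" instead of the ".*?" version above. The names LINK, NOFOLLOW_BEFORE, NOFOLLOW_AFTER and classify are invented for the example.

```python
import re

# Is there a link to www.fred.com at all? (opening tag only)
LINK = re.compile(r'<a [^<>]*http://www\.fred\.com[^<>]*>')
# nofollow appears before the URL inside the same tag (regex from the post).
NOFOLLOW_BEFORE = re.compile(
    r'<a [^<>]+nofollow[^<>]+http://www\.fred\.com[^<>]+>')
# nofollow appears after the URL inside the same tag (assumed counterpart).
NOFOLLOW_AFTER = re.compile(
    r'<a [^<>]+http://www\.fred\.com[^<>]+nofollow[^<>]+>')


def classify(html):
    """Return 'none', 'nofollow' or 'followed' for links to www.fred.com."""
    if not LINK.search(html):
        return "none"
    if NOFOLLOW_BEFORE.search(html) or NOFOLLOW_AFTER.search(html):
        return "nofollow"
    return "followed"
```

If a fragment held both a followed and a nofollow link to the same host, this sketch would report "nofollow"; telling the two apart would need a per-tag check.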

Is there anything to choose between your method and mine? I'm a rank
beginner at regexes, so if yours has some distinct advantage, please
explain what. It could just be that they are two slightly different ways
of doing the same thing, I don't know.

Thanks very much for the reply

--
Alan Silver
(anything added below this line is nothing to do with me)

From: Michael Wojcik on 3 Mar 2010 12:48

Alan Silver wrote:
> In article <hmjm6d01qpp(a)news4.newsguy.com>, Michael Wojcik
> <mwojcik(a)newsguy.com> writes
>> It's difficult to do this completely reliably, because implementing
>> the entire HTML DTD *plus* violations of it that are accepted by
>> common UAs (browsers and such) in a DFA is very complicated.
>
> Yup, I was assuming (perhaps foolishly) that such a simple thing as an
> anchor tag might be generally well-formed ;-)

Alas, with HTML, you never know (unless you validate the HTML). User
Agents will accept all sorts of garbage, so many authors don't feel
any need to create valid markup. But usually you can get by with some
assumptions and live with a small probability of encountering bogus
markup that doesn't work.

>> <a [^>]*nofollow[^>]*http://www\.fred\.com[^>]*>([^<]*((<[^/])|(</[^a]))*)*</a>
>>
> <Snip>
>> All this is assuming you actually need to match the entire anchor
>> element in the HTML source for some reason. If you just want to verify
>> whether the <a> tag is present with those attributes, you can ignore
>> what comes after the closing ">" and greatly simplify the regex.
>
> I realised after posting that I was only interested in the opening part
> of the tag, as my interest here is whether or not the link is there, and
> if there is a nofollow value set. I ignored the anchor text and closing
> tag.
>
> So, how does your regex compare with the one I posted a couple of days
> ago? I solved the problem I had in a similar way to yours (I think), and
> ended up with...
>
> <a [^<>]+nofollow[^<>]+http://www\.fred\.com[^<>]+>

If you remove the part of mine that captures the element content and
closing tag, your regex and mine have a few differences, but in
practice they should be equally usable.

You're eliminating "<" from inside the a tag. It shouldn't appear
there (unless the page uses the SGML short tag syntax, but I've never
seen anyone do so), so in practice my "[^>]" and your "[^<>]" will
produce the same results. Use whichever you prefer. (Some people might
find yours more readable, due to its visual symmetry.)

You're using the + operator where I use the * operator. We expect that
at least one character will be matched in all of those places, so
again this shouldn't make any difference in practice.
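
[Editorial sketch] The two opening-tag patterns can be compared side by side. With quoted attribute values they agree, as stated above. The one case where "+" and "*" part ways is when nothing sits between the URL and the closing ">", for instance with an unquoted href as the last attribute. This uses Python's re module, and the names STAR and PLUS are invented for the example.

```python
import re

# Opening-tag part of the "[^>]*" version.
STAR = re.compile(r'<a [^>]*nofollow[^>]*http://www\.fred\.com[^>]*>')
# The "[^<>]+" version.
PLUS = re.compile(r'<a [^<>]+nofollow[^<>]+http://www\.fred\.com[^<>]+>')

quoted = '<a rel="nofollow" href="http://www.fred.com">'
unquoted = '<a rel=nofollow href=http://www.fred.com>'

# Quoted values: the closing quote gives "+" its one required character,
# so both patterns match.
star_quoted = STAR.search(quoted)
plus_quoted = PLUS.search(quoted)

# Unquoted values: nothing follows ".com" before ">", so only "*" matches.
star_unquoted = STAR.search(unquoted)
plus_unquoted = PLUS.search(unquoted)
```

Whether that edge case matters depends on how the markup being checked is written.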

> This one only matches if there is a nofollow. I need to detect that, so
> I had one regex to check for an anchor tag...
>
> <a .*?http://www\.fred\.com.*?>.*?</a>
>
> ...and then the previous regex to match a nofollow before the href and a
> similar one for when the nofollow is after the href.

You could combine all three of these into a single expression, but
frankly if all you're looking for is whether you have a match - you're
not capturing groups or anything like that - I'd stick with the three
regexes you have now. They work, and they're easier to read,
understand, and maintain.

People who write a lot of regexes tend to start viewing them as an
opportunity for cleverness to the point of obscurity, like TECO macros
were back in the day. Personally, I'm a fan of readability and
maintainability. Where I have hard-coded regexes in my code, I usually
split the string up into component parts with comments, so the reader
can see what I'm doing.
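
[Editorial sketch] As an illustration of splitting a regex into commented parts, here is one way to do it in Python, relying on adjacent string literals being concatenated. The function name nofollow_link_pattern and its host parameter are invented for the example, and the pattern is the opening-tag regex discussed above.

```python
import re


def nofollow_link_pattern(host):
    """Build a regex for an <a> tag with nofollow before a link to `host`."""
    return re.compile(
        r'<a '                          # start of an anchor tag
        r'[^>]*nofollow'                # "nofollow" somewhere inside the tag
        r'[^>]*http://' + re.escape(host) +  # then the URL, dots escaped
        r'[^>]*>'                       # rest of the tag up to its ">"
    )


FRED = nofollow_link_pattern("www.fred.com")
match = FRED.search('<a rel="nofollow" href="http://www.fred.com">fred</a>')
no_match = FRED.search(
    '<a href="http://www.cnn.com/" rel="nofollow">CNN</a>'
    '<a href="http://www.fred.com">fred</a>')
```

Building the pattern from a plain host name with re.escape also removes the need to escape the dots by hand.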

--
Michael Wojcik
Micro Focus
Rhetoric & Writing, Michigan State University

From: Alan Silver on 7 Mar 2010 14:12

In article <hmm9l4027gr(a)news3.newsguy.com>, Michael Wojcik
<mwojcik(a)newsguy.com> writes
>People who write a lot of regexes tend to start viewing them as an
>opportunity for cleverness to the point of obscurity, like TECO macros
>were back in the day. Personally, I'm a fan of readability and
>maintainability. Where I have hard-coded regexes in my code, I usually
>split the string up into component parts with comments, so the reader
>can see what I'm doing.

Hee hee, I'm with you. I remember my (fairly brief) foray in Perl. I got
the same impression there - some people were only interested in how
short (and therefore unreadable) they could make their coding.

Anyway, I'm glad what I did is basically the same as yours. I understand
it a lot better now. Thanks very much for the help.

--
Alan Silver
(anything added below this line is nothing to do with me)
