RegEx help - how to check for a link in HTML? [Framework]

From: Alan Silver on 24 Feb 2010 13:16

Hello,

I'm trying to write some code to check for a link in some HTML that has
been pulled from a web site. I think this should be easy with a RegEx,
but I can't get my head round it.

To make sure it's clear, a normal HTML link looks like...

<a href="http://www.microsoft.com/sompage.aspx">some page</a>

...but can also look like...

<a href="http://www.microsoft.com/sompage.aspx" rel="nofollow">some
page</a>

There are loads of other variations, but this is all that interests me
right now.

I want to check the HTML to see...

1) Is there a link to my target URL (which will be given), and
2) Does that link have the rel="nofollow" part or not?

Anyone any ideas how I would do this? I've tried all sorts of things,
but not got anything that works.

Just to throw a spanner in the works, the rel="nofollow" bit could
appear before or after the href="whatever" bit.

I would be really grateful for any help here.

TIA

--
Alan Silver
(anything added below this line is nothing to do with me)

From: Alan Silver on 24 Feb 2010 14:01

Just to follow up on my own post, I've finally got something that nearly
works, but it isn't quite there.

Say I want to look for a link to the domain www.fred.com, then the
regex...

<a .*?nofollow.*?http://www\.fred\.com.*?>.*?</a>

...will match the following...

<a rel="nofollow" href="http://www.fred.com">fred</a>

...which is right, but it will also match...

<a href="http://www.cnn.com/" rel="nofollow">CNN</a><a
href="http://www.fred.com">fred</a>

...which I don't want. It seems that the regex is matching the nofollow
part to the first link, and so telling me that the whole HTML fragment
contains a nofollow link to www.fred.com. This is wrong.

So, how do I modify this regex so that it won't look at the nofollow
part in another link?
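
The behaviour described above can be reproduced with a short sketch. The class and member names here are made up for illustration, and the two-link fragment is written on one line (in the post it is wrapped by the mail client).

```csharp
using System.Text.RegularExpressions;

public static class CrossTagDemo
{
    // The pattern from the post, unchanged.
    public const string Pattern =
        @"<a .*?nofollow.*?http://www\.fred\.com.*?>.*?</a>";

    // A single link that really is nofollow.
    public const string Wanted =
        "<a rel=\"nofollow\" href=\"http://www.fred.com\">fred</a>";

    // Two links: the nofollow belongs to CNN, not to fred.
    public const string Unwanted =
        "<a href=\"http://www.cnn.com/\" rel=\"nofollow\">CNN</a>" +
        "<a href=\"http://www.fred.com\">fred</a>";

    public static bool Matches(string html)
    {
        return Regex.IsMatch(html, Pattern);
    }

    // The text the pattern actually consumed, to see how far it reached.
    public static string MatchedText(string html)
    {
        return Regex.Match(html, Pattern).Value;
    }
}
```

Both Matches(Wanted) and Matches(Unwanted) come back true, and MatchedText(Unwanted) is the whole two-link fragment: the match starts at the CNN tag and runs on into the fred tag, because . is allowed to match < and >.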

Thanks for any help

--
Alan Silver
(anything added below this line is nothing to do with me)

From: Jesse Houwing on 25 Feb 2010 05:17

Try the HTMLAgilityPack, it's much better for getting the information
you want.

See Codeplex.com/HtmlAgilityPack
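
As a rough idea of what the parser route might look like, here is a minimal sketch. It assumes the HtmlAgilityPack assembly is referenced; LinkInspector and IsNofollow are hypothetical names, and treating "href contains the target URL" as a hit is an assumption about what counts as a link to the target.

```csharp
using System;
using HtmlAgilityPack; // third-party package, not part of the .NET Framework

public static class LinkInspector
{
    // Returns null if no link to targetUrl is found, otherwise
    // true/false for whether the first such link has rel="nofollow".
    public static bool? IsNofollow(string html, string targetUrl)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return null;

        foreach (HtmlNode a in anchors)
        {
            string href = a.GetAttributeValue("href", "");
            if (href.IndexOf(targetUrl, StringComparison.OrdinalIgnoreCase) < 0)
                continue; // a link, but not to the target

            // Attribute order no longer matters: rel is read by name.
            string rel = a.GetAttributeValue("rel", "");
            return rel.IndexOf("nofollow", StringComparison.OrdinalIgnoreCase) >= 0;
        }
        return null;
    }
}
```

Because each attribute is looked up by name on one element, the "nofollow before or after the href" problem and the "nofollow from a neighbouring link" problem should both go away.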

Jesse

--
Jesse Houwing
jesse.houwing at sogeti.nl

From: eBob.com on 25 Feb 2010 09:07

I haven't played with the HTMLAgilityPack or any other HTML parser so I
can't compare that approach to RegEx.

I highly recommend Expresso from UltraPico for experimenting with regular
expressions. (It's free.)

I think your problem is that .*? is sucking up too many characters and
overflowing into another tag. So instead of matching . (any character) you
could try matching any character other than "<".

Based on what you've told us, and just off the top of my head, I think my
expression would look for, in pseudo regex,

<a optional nofollow http://www\.fred\.com optional nofollow </a>

That would match some dumb html which had nofollow before and after the url,
but I'd guess that doesn't matter. I don't know if there is a way in regex
to insist that the nofollow can appear in one place or another but not both.
But using "named groups" (I think that's the right terminology) you could
determine where the nofollows had occurred.
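
One way the pseudo regex above might be written for .NET is sketched below. It uses [^<>] in place of . so the match stays inside one tag, and two optional named groups to record where the nofollow was seen. The class name, the group names "before" and "after", and the returned strings are all made up for this example.

```csharp
using System.Text.RegularExpressions;

public static class NofollowPosition
{
    // [^<>]* cannot cross a tag boundary, so everything matched
    // belongs to the same <a ...> tag.
    private static readonly Regex Link = new Regex(
        @"<a\s(?<before>[^<>]*nofollow)?[^<>]*http://www\.fred\.com" +
        @"(?<after>[^<>]*nofollow)?[^<>]*>");

    // Describes the first link to www.fred.com found in the HTML;
    // returns "none" when there is no such link at all.
    public static string Describe(string html)
    {
        Match m = Link.Match(html);
        if (!m.Success) return "none";

        // An optional group that took no part in the match has Success == false.
        bool before = m.Groups["before"].Success;
        bool after = m.Groups["after"].Success;

        if (before && after) return "both";
        if (before) return "before";
        if (after) return "after";
        return "follow";
    }
}
```

With this, the CNN-then-fred fragment from the earlier post should come out as "follow", since the only nofollow sits in a different tag.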

Good Luck, Bob

From: Alan Silver on 25 Feb 2010 15:10

Hello,

Thanks for the reply. I have Expresso, which is very good, but doesn't
necessarily tell you how to build the regex you want.

However, after some playing around, I came up with something that
worked. As you pointed out, the .*? was matching stuff outside of the
current tag, because . will happily match < and >. I replaced it with
[^<>], which can't cross a tag boundary, and it worked fine.

I had to do two regexes, one to catch the nofollow before the href, and
one when it was after. The code I ended up with was...

// Regex.Escape escapes every regex metacharacter in the URL, not just "."
string escapedUrl = Regex.Escape(targetUrl);

// [^<>]* (rather than .*?) keeps the match inside a single <a ...> tag
Regex regLink = new Regex(@"<a [^<>]*http://" + escapedUrl +
    @"[^<>]*>.*?</a>", RegexOptions.Singleline);

Regex regLinkNofollowL = new Regex(@"<a [^<>]+nofollow[^<>]+http://" +
    escapedUrl + @"[^<>]+>", RegexOptions.Singleline);

Regex regLinkNofollowR = new Regex(@"<a [^<>]+http://" + escapedUrl +
    @"[^<>]+nofollow[^<>]+>", RegexOptions.Singleline);

The string variable targetUrl contains the domain name of the link I
want to look for.

regLink.IsMatch(html) will be true if a link is found

regLinkNofollowL.IsMatch(html) will be true if the link has a nofollow
before the href

regLinkNofollowR.IsMatch(html) will be true if the link has a nofollow
after the href
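
For anyone wanting a single answer rather than three booleans, the checks could be wrapped up along these lines. This is only a sketch: LinkStatus and LinkChecker are invented names, the URL is escaped with Regex.Escape, and the "any link" pattern uses [^<>]* so that it too stays inside one tag.

```csharp
using System.Text.RegularExpressions;

public enum LinkStatus { NotFound, Follow, Nofollow }

public static class LinkChecker
{
    public static LinkStatus Check(string html, string targetUrl)
    {
        // Escape the whole URL so characters like . ? + are taken literally.
        string url = Regex.Escape(targetUrl);

        string anyLink   = @"<a [^<>]*http://" + url + @"[^<>]*>";
        string nofollowL = @"<a [^<>]+nofollow[^<>]+http://" + url + @"[^<>]+>";
        string nofollowR = @"<a [^<>]+http://" + url + @"[^<>]+nofollow[^<>]+>";

        if (!Regex.IsMatch(html, anyLink)) return LinkStatus.NotFound;

        // nofollow may sit on either side of the href.
        bool nofollow = Regex.IsMatch(html, nofollowL)
                     || Regex.IsMatch(html, nofollowR);
        return nofollow ? LinkStatus.Nofollow : LinkStatus.Follow;
    }
}
```

Called as LinkChecker.Check(html, "www.fred.com"). Note that if a page links to the target twice, once with nofollow and once without, this reports Nofollow, which may or may not be what is wanted.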

Hope this is of some use to someone.

Thanks again for the reply.

--
Alan Silver
(anything added below this line is nothing to do with me)
