From: Kyle T. Jones on
Steve wrote:
> On Mar 19, 11:01 am, Jürgen Exner <jurge...(a)hotmail.com> wrote:
>> Steve <st...(a)staticg.com> wrote:
>>> I started a little project where I need to search web pages for their
>>> text and return the links of those pages to me. I am using
>>> LWP::Simple, HTML::LinkExtor, and Data::Dumper. Basically all I have
>>> done so far is a list of URL's from my search query of a website, but
>>> I want to be able to filter this content based on the pages contents.
>>> How can I do this? How can I get the content of a web page, and not
>>> just the URL?
>> ???
>>
>> I don't understand.
>>
>> use LWP::Simple;
>> $content = get("http://www.whateverURL");
>>
>> will get you exactly the content of that web page and assign it to
>> $content and apparently you are doing that already.
>>
>> So what is your problem?
>>
>> jue
>
> Sorry I am a little overwhelmed with the coding so far (I'm not very
> good at perl). I have what you have posted, but my problem is that I
> would like to filter that content... like lets say I searched a site
> that had 15 news links and 3 of them said "Hello" in the title. I
> would want to extract only the links that said hello in the title.

Read up on perl regular expressions.

for instance, taking the above, you might first split it into an array
with one line per element -

my @stuff = split /\n/, $content;

then parse each line for hello -

foreach (@stuff) {
    if (/Hello/) {
        print "$_\n";   # or do whatever you need with the matching line
    }
}

Cheers.
From: Ben Morrow on

Quoth Steve <steve(a)staticg.com>:
> On Mar 19, 11:01 am, Jürgen Exner <jurge...(a)hotmail.com> wrote:
> > Steve <st...(a)staticg.com> wrote:
> > >I started a little project where I need to search web pages for their
> > >text and return the links of those pages to me.  I am using
> > >LWP::Simple, HTML::LinkExtor, and Data::Dumper.  Basically all I have
> > >done so far is a list of URL's from my search query of a website, but
> > >I want to be able to filter this content based on the pages contents.
> > >How can I do this? How can I get the content of a web page, and not
> > >just the URL?
> >
> >         use LWP::Simple;
> >         $content = get("http://www.whateverURL");
> >
> > will get you exactly the content of that web page and assign it to
> > $content and apparently you are doing that already.
>
> Sorry I am a little overwhelmed with the coding so far (I'm not very
> good at perl). I have what you have posted, but my problem is that I
> would like to filter that content... like lets say I searched a site
> that had 15 news links and 3 of them said "Hello" in the title. I
> would want to extract only the links that said hello in the title.

Ah, you don't want the content *pointed to* by the link, you want the
content of the <a> element itself. I don't think you can use
HTML::LinkExtor for that.

I would start by building a DOM for the page, and then going through and
finding the <a> elements and checking their content. XML::LibXML
(despite the name) has a decent HTML parser, though you will probably
want to set the 'recover' option if you are parsing random HTML from the
Web. You can then use DOM methods like ->getElementsByTagName to find
the <a> elements and ->textContent to find their contents (ignoring
further tags within the <a> element).
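
Something along these lines might work (an untested sketch, assuming
XML::LibXML is installed; the URL and the "Hello" filter are just
placeholders):

    use strict;
    use warnings;
    use LWP::Simple;
    use XML::LibXML;

    my $page = get('http://www.example.org/') or die "fetch failed";

    # recover => 2 makes libxml quietly tolerate the broken markup you
    # tend to find on real-world pages.
    my $dom = XML::LibXML->load_html(string => $page, recover => 2);

    # Walk every <a> element; keep the ones whose text mentions "Hello".
    for my $a ($dom->getElementsByTagName('a')) {
        my $text = $a->textContent;           # text inside the <a>, tags stripped
        my $href = $a->getAttribute('href');  # the link target
        next unless defined $href and $text =~ /Hello/i;
        print "$href\t$text\n";
    }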

Ben

From: Steve on
On Mar 19, 11:42 am, "J. Gleixner" <glex_no-s...(a)qwest-spam-no.invalid> wrote:
> Steve wrote:
> > On Mar 19, 11:01 am, Jürgen Exner <jurge...(a)hotmail.com> wrote:
> >> Steve <st...(a)staticg.com> wrote:
> >>> I started a little project where I need to search web pages for their
> >>> text and return the links of those pages to me.  I am using
> >>> LWP::Simple, HTML::LinkExtor, and Data::Dumper.  Basically all I have
> >>> done so far is a list of URL's from my search query of a website, but
> >>> I want to be able to filter this content based on the pages contents.
> >>> How can I do this? How can I get the content of a web page, and not
> >>> just the URL?
> >> ???
>
> >> I don't understand.
>
> >>         use LWP::Simple;
> >>         $content = get("http://www.whateverURL");
>
> >> will get you exactly the content of that web page and assign it to
> >> $content and apparently you are doing that already.
>
> >> So what is your problem?
>
> >> jue
>
> > Sorry I am a little overwhelmed with the coding so far (I'm not very
> > good at perl).  I have what you have posted, but my problem is that I
> > would like to filter that content... like lets say I searched a site
> > that had 15 news links and 3 of them said "Hello" in the title.  I
> > would want to extract only the links that said hello in the title.
>
> '"Hello" in the title'??.. The title element of the HTML????
> Or the 'a' element contains 'Hello'?? e.g. <a href="...">Hello Kitty</a>
>
> How are you using HTML::LinkExtor??
>
> That seems like the right choice.
>
> Why are you using Data::Dumper?
>
> That's helpful when debugging, or logging, so how are you using it?
>
> Post your very short example, because there's something you're
> missing and no one can tell what that is based on your description.

Based on what you all said, I can give a clearer description.
Essentially, I'm trying to search craigslist more efficiently. I want
the link the <a> tag points to, as well as the description. Here is
the code I have so far, which gets me only the links:
-----------------------------

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::LinkExtor;
use Data::Dumper;

###### VARIABLES ######
my $craigs = "http://seattle.craigslist.org";
my $source = "$craigs/search/sss?query=what+Im+Looking+for&catAbbreviation=sss";
my $browser = 'google-chrome';

###### SEARCH #######

my $page = get("$source");
my $parser = HTML::LinkExtor->new();

$parser->parse($page);
my @links = $parser->links;
open LINKS, ">/home/me/Desktop/links.txt";
print LINKS Dumper \@links;

open READLINKS, "</home/me/Desktop/links.txt";
open OUT, ">/home/me/Desktop/final.txt";
while (<READLINKS>){
if ( /html/ ){
my $url = $_;
for ($url){
s/\'//g;
s/^\s+//;
}

print OUT "$craigs$url";
}
}
open BROWSE, "</home/me/Desktop/final.txt";

system ($browser);
foreach(<BROWSE>){
system ($browser, $_);
}
-----------------------------

I've since created a different script that's a little more cleaned up.
From: J. Gleixner on
J. Gleixner wrote:
> Steve wrote:
>> On Mar 19, 11:01 am, Jürgen Exner <jurge...(a)hotmail.com> wrote:
>>> Steve <st...(a)staticg.com> wrote:
>>>> I started a little project where I need to search web pages for their
>>>> text and return the links of those pages to me. I am using
>>>> LWP::Simple, HTML::LinkExtor, and Data::Dumper. Basically all I have
>>>> done so far is a list of URL's from my search query of a website, but
>>>> I want to be able to filter this content based on the pages contents.
>>>> How can I do this? How can I get the content of a web page, and not
>>>> just the URL?
>>> ???
>>>
>>> I don't understand.
>>>
>>> use LWP::Simple;
>>> $content = get("http://www.whateverURL");
>>>
>>> will get you exactly the content of that web page and assign it to
>>> $content and apparently you are doing that already.
>>>
>>> So what is your problem?
>>>
>>> jue
>>
>> Sorry I am a little overwhelmed with the coding so far (I'm not very
>> good at perl). I have what you have posted, but my problem is that I
>> would like to filter that content... like lets say I searched a site
>> that had 15 news links and 3 of them said "Hello" in the title. I
>> would want to extract only the links that said hello in the title.
>
>
> '"Hello" in the title'??.. The title element of the HTML????
> Or the 'a' element contains 'Hello'?? e.g. <a href="...">Hello Kitty</a>
>
> How are you using HTML::LinkExtor??
>
> That seems like the right choice.
After looking at it further, HTML::LinkExtor only gives the
attributes, not the text that makes up the hyperlink. Seems
like that would be a useful enhancement.

This might help you:

http://cpansearch.perl.org/src/GAAS/HTML-Parser-3.64/eg/hanchors
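
The idea there, roughly, is to register HTML::Parser handlers that pair
each href with the text inside its <a> element. A rough, untested sketch
of that approach (again, the URL and the "Hello" filter are placeholders):

    use strict;
    use warnings;
    use LWP::Simple;
    use HTML::Parser;

    my $html = get('http://www.example.org/') or die "fetch failed";

    my ($href, $text);
    my @hits;    # [ href, link text ] pairs whose text matches

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $attr) = @_;
            ($href, $text) = ($attr->{href}, '') if $tag eq 'a';
        }, 'tagname, attr' ],
        text_h  => [ sub { $text .= shift if defined $href }, 'dtext' ],
        end_h   => [ sub {
            if (shift eq 'a' and defined $href) {
                push @hits, [ $href, $text ] if $text =~ /Hello/i;
                undef $href;
            }
        }, 'tagname' ],
    );

    $p->parse($html);
    $p->eof;

    print "$_->[0]\t$_->[1]\n" for @hits;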
From: Ben Morrow on

Quoth Steve <steve(a)staticg.com>:
>
> Based on what you all said, I can make a more clear description.
> Essentially, I'm trying to search craigslist more efficiently. I want

Are you sure craigslist's Terms of Use allow this? Most sites of this
nature don't.

> the link the a tag points to, as well as the description. here is
> code I used already that I made that gets me only the links:
> -----------------------------
>
> #!/usr/bin/perl -w
> use strict;
> use LWP::Simple;
> use HTML::LinkExtor;
> use Data::Dumper;
>
> ###### VARIABLES ######
> my $craigs = "http://seattle.craigslist.org";
> my $source = "$craigs/search/sss?query=what+Im+Looking+for&catAbbreviation=sss";
> my $browser = 'google-chrome';
>
> ###### SEARCH #######
>
> my $page = get("$source");
> my $parser = HTML::LinkExtor->new();
>
> $parser->parse($page);
> my @links = $parser->links;
> open LINKS, ">/home/me/Desktop/links.txt";

Use 3-arg open.
Use lexical filehandles.
*Always* check the return value of open.

open my $LINKS, ">", "/home/me/Desktop/links.txt"
    or die "can't write to 'links.txt': $!";

You may wish to consider using the 'autodie' module from CPAN, which
will do the 'or die' checks for you.
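
For example (a quick sketch; within its scope autodie makes open, close
and friends die with a sensible message on failure, so the explicit
'or die' goes away):

    use strict;
    use warnings;
    use autodie;

    open my $LINKS, '>', '/home/me/Desktop/links.txt';  # dies on failure
    print {$LINKS} "whatever\n";
    close $LINKS;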

> print LINKS Dumper \@links;
>
> open READLINKS, "</home/me/Desktop/links.txt";
> open OUT, ">/home/me/Desktop/final.txt";

As above.

> while (<READLINKS>){

Why are you writing the links out to a file only to read them in again?
Just use the array you already have:

for (@links) {

> if ( /html/ ){
> my $url = $_;
> for ($url){
> s/\'//g;
> s/^\s+//;
> }
>
> print OUT "$craigs$url";
> }
> }
> open BROWSE, "</home/me/Desktop/final.txt";

As above.
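
Putting that together, an untested sketch of the middle of the script
without the temporary files might look something like this. links()
returns [ $tag, %attributes ] array refs, so the hrefs can be pulled
straight out of what the parser gives you:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple;
    use HTML::LinkExtor;

    my $craigs = 'http://seattle.craigslist.org';
    my $source = "$craigs/search/sss?query=what+Im+Looking+for&catAbbreviation=sss";

    my $page = get($source) or die "couldn't fetch $source";

    my $parser = HTML::LinkExtor->new();
    $parser->parse($page);
    $parser->eof;

    open my $out, '>', '/home/me/Desktop/final.txt'
        or die "can't write final.txt: $!";

    # Keep only <a href=...> links that contain "html", as in the original.
    for my $link ($parser->links) {
        my ($tag, %attr) = @$link;
        next unless $tag eq 'a' and defined $attr{href};
        next unless $attr{href} =~ /html/;
        print {$out} "$craigs$attr{href}\n";
    }
    close $out;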

Ben