Extracting table in html page [Perl]

From: shankar_perl_rookie on 21 Jul 2010 18:21

Hello All,

I have an HTML file from which I am trying to extract a table. The problem
I am facing is that there are a lot of tables in the page, and the table I
am looking to extract appears after a particular string, say $some_text. I
know how to search for the string in the HTML page, but what I want to do
is capture the table that immediately follows $some_text.

Any suggestions on how to do this?

Thanks,
Shankar

From: Jim Gibson on 21 Jul 2010 19:08

In article
<233f66ab-b5eb-449d-b3b0-b2542b8dbe31(a)i31g2000yqm.googlegroups.com>,
shankar_perl_rookie <mulshankar(a)gmail.com> wrote:

> Hello All,
>
> I have an html file where I am trying to extract a table. The problem
> I am facing is there are lot of tables in the page and the table I am
> looking to extract appears after a particular string say $some_text. I
> know of a way that I can search for the string in the html page but
> what I want to do is capture a table that immediately follows the
> $some_text.
>
> Any suggestions on how to do this ??

The most reliable way would be to use the HTML::Parser module to parse
the html file, register appropriate handlers for the table elements
(<table>, <tr>, <td>) and one for text elements, look for your string,
and process the next table encountered in a callback (handler
subroutines are called as callbacks by the parsing method).
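
A minimal sketch of that handler approach, under some assumptions: the sample HTML and the marker 'Quarterly totals' are made up for illustration, the marker sits inside a single text segment, and the wanted table contains no nested tables.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical input; $some_text stands in for the poster's marker string.
my $some_text = 'Quarterly totals';
my $html = <<'HTML';
<table><tr><td>skip me</td></tr></table>
<p>Quarterly totals</p>
<table><tr><td>Q1</td><td>10</td></tr><tr><td>Q2</td><td>12</td></tr></table>
<table><tr><td>too late</td></tr></table>
HTML

my $seen_text = 0;   # true once the marker string has been seen
my $in_table  = 0;   # true while inside the wanted table
my $done      = 0;   # true once the wanted table has been closed
my @rows;            # collected rows, each an array ref of cell strings
my $cell;            # text of the cell being read, undef outside a cell

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub {
        my ($tag) = @_;
        return if $done;
        if    ( $tag eq 'table' && $seen_text && !$in_table )    { $in_table = 1 }
        elsif ( $in_table && $tag eq 'tr' )                      { push @rows, [] }
        elsif ( $in_table && ( $tag eq 'td' || $tag eq 'th' ) )  { $cell = '' }
    }, 'tagname' ],
    text_h => [ sub {
        my ($text) = @_;
        $seen_text = 1 if index( $text, $some_text ) >= 0;
        $cell .= $text if $in_table && defined $cell;
    }, 'dtext' ],
    end_h => [ sub {
        my ($tag) = @_;
        return unless $in_table;
        if ( ( $tag eq 'td' || $tag eq 'th' ) && defined $cell ) {
            push @rows, [] unless @rows;    # cell without an enclosing <tr>
            push @{ $rows[-1] }, $cell;
            undef $cell;
        }
        elsif ( $tag eq 'table' ) { $in_table = 0; $done = 1 }
    }, 'tagname' ],
);
$parser->unbroken_text(1);    # deliver each text run in one piece
$parser->parse($html);
$parser->eof;

print join( ' | ', @$_ ), "\n" for @rows;
```

Handling nested tables would need a depth counter in place of the simple $in_table flag.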

Another way would be to use a module that extracts tables from HTML. There
are at least two on CPAN: HTML::TableExtract and HTML::TableParser. The
difficulty with these is finding the table that follows the specified
text. Is there some other way of identifying the table?

The quick and dirty way is to use a regular expression (untested):

if ( $html =~ m{ \Q$some_text\E .*? <table\b[^>]*> (.*?) </table> }isx ) {
    # table contents in $1
}

However, this will not always work. It fails if you have nested tables,
for example, which are common in some HTML. Still, if you are in a hurry
it might work for you. It is always better to use a real parser for HTML.
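
To illustrate both the quick approach and its nested-table failure, here is a small sketch using only core Perl. The marker 'Totals (2010)' and the HTML strings are made up; the \Q...\E quoting and the attribute-tolerant <table\b[^>]*> are additions for illustration, so that a marker containing metacharacters and a table tag carrying attributes still match.

```perl
use strict;
use warnings;

# Hypothetical marker containing regex metacharacters and a space.
my $some_text = 'Totals (2010)';
my $html = <<'HTML';
<table><tr><td>skip</td></tr></table>
<h2>Totals (2010)</h2>
<table border="1"><tr><td>Q1</td><td>10</td></tr></table>
HTML

# \Q...\E makes the marker match literally; [^>]* tolerates attributes.
my $table_re = qr{ \Q$some_text\E .*? <table\b[^>]*> (.*?) </table> }isx;

my ($flat) = $html =~ $table_re;
print "flat:   $flat\n";

# With a nested table, the non-greedy match stops at the first </table>,
# which closes the inner table, so the outer table is cut short.
my $nested_html = "$some_text<table><tr><td>"
    . "<table><tr><td>inner</td></tr></table>"
    . "</td><td>lost</td></tr></table>";
my ($nested) = $nested_html =~ $table_re;
print "nested: $nested\n";
```

In the nested case the captured text ends before the cell containing "lost", which is the failure described above.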

--
Jim Gibson

From: HASM on 21 Jul 2010 21:06

Jim Gibson <jimsgibson(a)gmail.com> writes:

>> I have an html file where I am trying to extract a table. The problem
>> I am facing is there are lot of tables in the page and the table I am
>> looking to extract appears after a particular string say $some_text.

> The most reliable way would be to use the HTML::Parser module to parse
> the html file,

Or HTML::TreeBuilder:

use HTML::TreeBuilder;
use LWP::UserAgent;
my $url = 'http://www.example.com/...';
my $browser = LWP::UserAgent->new;
my $response = $browser->request (HTTP::Request->new(GET => $url));
if ($response->is_success) {
    my $tree = HTML::TreeBuilder->new;
    my $content =
        $tree->parse_content($response->decoded_content);
    # search for text with look_down (there are other ways)
    my $text = $content->look_down (...);
    # then for your table
    my $table = $content->look_down ('_tag', 'table', ...);

etc,
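
A self-contained sketch of the same idea, parsing an inline string rather than fetching a URL. The sample HTML and marker are hypothetical, and walking the tree via objectify_text is just one way to find "the first table after the text"; it assumes the marker sits inside a single text segment.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical input; $some_text stands in for the poster's marker string.
my $some_text = 'Quarterly totals';
my $html = <<'HTML';
<table><tr><td>skip me</td></tr></table>
<p>Quarterly totals</p>
<table><tr><td>Q1</td><td>10</td></tr><tr><td>Q2</td><td>12</td></tr></table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Turn plain text segments into '~text' pseudo-elements so that text and
# tags can be visited together, in document order.
$tree->objectify_text;

my ( $seen, $table );
for my $node ( $tree->look_down( sub { 1 } ) ) {
    if ( $node->tag eq '~text' ) {
        $seen = 1 if index( $node->attr('text'), $some_text ) >= 0;
    }
    elsif ( $seen && $node->tag eq 'table' ) {
        $table = $node;    # first table after the marker text
        last;
    }
}
die "no table found after '$some_text'\n" unless $table;

# Restore ordinary text nodes before reading the cell contents.
$table->deobjectify_text;

my @rows;
for my $tr ( $table->look_down( _tag => 'tr' ) ) {
    push @rows, [ map { $_->as_text } $tr->look_down( _tag => qr/^t[dh]$/ ) ];
}
print join( ' | ', @$_ ), "\n" for @rows;

$tree->delete;    # free the tree explicitly
```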

-- HASM

From: sopan.shewale on 22 Jul 2010 01:22

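The best way may be:
split on $some_text and throw away the first part. Quote the string with
\Q...\E so it is matched literally, and pass a limit of 2 so that
everything after the first occurrence stays in one piece.
my ($junk, $interest_html) = split (/\Q$some_text\E/, $html, 2);

Then use HTML::TreeBuilder on $interest_html to parse the tables.
Grab the first table and you are done.

Let me know if you find it difficult to use HTML::TreeBuilder.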
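
A small core-Perl sketch of the split step, with a made-up marker and HTML string. It uses \Q...\E so the parentheses in the marker are matched literally, and a limit of 2 so that a marker appearing more than once does not chop off the part holding the table; both are precautions for this illustration.

```perl
use strict;
use warnings;

# Hypothetical marker with regex metacharacters; it appears twice below.
my $some_text = 'Totals (2010)';
my $html = '<p>Jump to Totals (2010)</p>'
         . '<h2>Totals (2010)</h2>'
         . '<table><tr><td>wanted</td></tr></table>';

# Limit of 2: split only at the first occurrence and keep the rest whole.
my ( $before, $after ) = split /\Q$some_text\E/, $html, 2;
die "marker not found\n" unless defined $after;

print "$after\n";
```

Here $after still contains the second occurrence of the marker and the whole table, so it can be handed to an HTML parser to pick out the first table.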

--sopan shewale

On Jul 22, 3:21 am, shankar_perl_rookie <mulshan...(a)gmail.com> wrote:
> Hello All,
>
> I have an html file where I am trying to extract a table. The problem
> I am facing is there are lot of tables in the page and the table I am
> looking to extract appears after a particular string say $some_text. I
> know of a way that I can search for the string in the html page but
> what I want to do is capture a table that immediately follows the
> $some_text.
>
> Any suggestions on how to do this ??
>
> Thanks,
> Shankar
