From: shankar_perl_rookie on 21 Jul 2010 18:21

Hello All,

I have an HTML file from which I am trying to extract a table. The problem is that the page contains a lot of tables, and the one I want appears after a particular string, say $some_text. I know how to search for the string in the page, but what I want to do is capture the table that immediately follows $some_text.

Any suggestions on how to do this?

Thanks,
Shankar
From: Jim Gibson on 21 Jul 2010 19:08

In article <233f66ab-b5eb-449d-b3b0-b2542b8dbe31(a)i31g2000yqm.googlegroups.com>, shankar_perl_rookie <mulshankar(a)gmail.com> wrote:

> Hello All,
>
> I have an html file where I am trying to extract a table. The problem
> I am facing is there are lot of tables in the page and the table I am
> looking to extract appears after a particular string say $some_text. I
> know of a way that I can search for the string in the html page but
> what I want to do is capture a table that immediately follows the
> $some_text.
>
> Any suggestions on how to do this ??

The most reliable way would be to use the HTML::Parser module to parse the HTML file: register handlers for the table elements (<table>, <tr>, <td>) and one for text, look for your string in the text handler, and then process the next table encountered. (The handler subroutines are called as callbacks by the parsing method.)

Another way would be to use a module that extracts tables from HTML. There are at least two on CPAN: HTML::TableExtract and HTML::TableParser. The problem with these is finding the table that comes after the specified text. Is there some other way of identifying the table?

The quick and dirty way is to use a regular expression (untested):

    if ( $html =~ m{ \Q$some_text\E .*? <table\b[^>]*> (.*?) </table> }isx ) {
        # table contents in $1
    }

The \Q...\E keeps any regex metacharacters or whitespace in $some_text from being misread, and <table\b[^>]*> also matches a <table> tag that carries attributes.

This will not always work. It fails if you have nested tables, for example, which are common in some HTML. If you are in a hurry it might work for you, but it is always better to use a real parser for HTML.

--
Jim Gibson
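The handler-based approach described above could look roughly like the following. This is only a sketch, not Jim's code: the helper name table_after_text_events is made up for illustration, and it assumes that $some_text sits inside a single run of text (not split by inline tags such as <b>). It counts <table> nesting depth so that a nested table does not end the capture early, which is exactly where the regex falls down.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: returns the source of the first <table> that starts
# after $some_text has been seen in the page text, or undef if there is none.
sub table_after_text_events {
    my ($html, $some_text) = @_;
    my ($seen, $depth, $done, $captured) = (0, 0, 0, '');

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $src) = @_;
            if ($tag eq 'table') {
                if    ($depth > 0)      { $depth++ }    # nested table
                elsif ($seen && !$done) { $depth = 1 }  # the table we want
            }
            $captured .= $src if $depth > 0;
        }, 'tagname, text' ],
        end_h => [ sub {
            my ($tag, $src) = @_;
            return unless $depth > 0;
            $captured .= $src;
            $done = 1 if $tag eq 'table' && --$depth == 0;
        }, 'tagname, text' ],
        text_h => [ sub {
            my ($src, $decoded) = @_;
            $seen = 1 if !$seen && index($decoded, $some_text) >= 0;
            $captured .= $src if $depth > 0;
        }, 'text, dtext' ],
    );
    $parser->unbroken_text(1);   # deliver each text run in one piece
    $parser->parse($html);
    $parser->eof;

    return $done ? $captured : undef;
}
```

Comments and declarations inside the table are not copied by this sketch; add a comment_h or default_h handler if they matter.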
From: HASM on 21 Jul 2010 21:06

Jim Gibson <jimsgibson(a)gmail.com> writes:

>> I have an html file where I am trying to extract a table. The problem
>> I am facing is there are lot of tables in the page and the table I am
>> looking to extract appears after a particular string say $some_text.

> The most reliable way would be to use the HTML::Parser module to parse
> the html file,

Or HTML::TreeBuilder:

    use HTML::TreeBuilder;
    use LWP::UserAgent;

    my $url = 'http://www.example.com/...';
    my $browser = LWP::UserAgent->new;
    my $response = $browser->request(HTTP::Request->new(GET => $url));
    if ($response->is_success) {
        my $tree = HTML::TreeBuilder->new;
        my $content = $tree->parse_content($response->decoded_content);
        # search for the text with look_down (there are other ways)
        my $text = $content->look_down(...);
        # then for your table
        my $table = $content->look_down('_tag', 'table', ...);
    }

etc.

-- HASM
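One way the elided look_down steps might be filled in is to walk the parsed tree in document order instead, because look_down only visits elements and the search string lives in a text node. The sketch below is an assumption about how that could be done, not HASM's code; the names table_after_text_tree and _first_table_after are hypothetical, and it works on an HTML string that has already been fetched.

```perl
use strict;
use warnings;
use HTML::TreeBuilder ();

# Walk the children of $node in document order. Text children are plain
# strings; element children are HTML::Element objects.
sub _first_table_after {
    my ($node, $some_text, $seen_ref) = @_;
    for my $child ($node->content_list) {
        if (ref $child) {
            return $child if $$seen_ref && $child->tag eq 'table';
            my $found = _first_table_after($child, $some_text, $seen_ref);
            return $found if $found;
        }
        elsif (index($child, $some_text) >= 0) {
            $$seen_ref = 1;    # every later <table> now qualifies
        }
    }
    return;
}

# Hypothetical helper: returns the first table after $some_text,
# serialized as HTML, or undef if no such table exists.
sub table_after_text_tree {
    my ($html, $some_text) = @_;
    my $tree  = HTML::TreeBuilder->new_from_content($html);
    my $seen  = 0;
    my $table = _first_table_after($tree, $some_text, \$seen);
    my $out   = $table ? $table->as_HTML : undef;
    $tree->delete;             # free the tree explicitly
    return $out;
}
```

Because the whole <table> element is returned, nested tables come along with it. As with any text-node match, the string is not found if inline markup splits it across several nodes.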
From: sopan.shewale on 22 Jul 2010 01:22

The best way may be to split on $some_text and throw away the first part:

    my ($junk, $interest_html) = split /\Q$some_text\E/, $html, 2;

The limit of 2 keeps everything after the first occurrence in one piece, even if $some_text appears again later, and \Q...\E makes the string match literally.

Then use HTML::TreeBuilder on $interest_html to parse the tables. Grab the first table and you are done.

Let me know if you find HTML::TreeBuilder difficult to use.

--sopan shewale

On Jul 22, 3:21 am, shankar_perl_rookie <mulshan...(a)gmail.com> wrote:
> Hello All,
>
> I have an html file where I am trying to extract a table. The problem
> I am facing is there are lot of tables in the page and the table I am
> looking to extract appears after a particular string say $some_text. I
> know of a way that I can search for the string in the html page but
> what I want to do is capture a table that immediately follows the
> $some_text.
>
> Any suggestions on how to do this ??
>
> Thanks,
> Shankar
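Put together, the split-then-parse idea might look like this. It is a minimal sketch under the assumption that HTML::TreeBuilder copes with the truncated fragment left after the split (stray closing tags at the start of the fragment are expected); the helper name table_after_split is hypothetical.

```perl
use strict;
use warnings;
use HTML::TreeBuilder ();

# Hypothetical helper: discard everything up to the first occurrence of
# $some_text, then return the first table in the remainder as HTML.
sub table_after_split {
    my ($html, $some_text) = @_;

    # Limit of 2: one piece before the text, one piece with everything after.
    my (undef, $tail) = split /\Q$some_text\E/, $html, 2;
    return unless defined $tail;          # $some_text not found

    my $tree  = HTML::TreeBuilder->new_from_content($tail);
    my $table = $tree->look_down(_tag => 'table');
    my $out   = $table ? $table->as_HTML : undef;
    $tree->delete;
    return $out;
}
```

Unlike a tree walk over the whole page, this matches $some_text against the raw source, so it also hits occurrences inside attributes or comments.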