From: Nathan Nobbe on
On Sat, Apr 3, 2010 at 8:29 AM, tedd <tedd(a)sperling.com> wrote:

> Hi gang:
>
> Here's the problem.
>
> I have 184 HTML pages in a directory and each page contains a question. The
> question is noted in the HTML DOM like so:
>
> <p class="question">
> Who is Roger Rabbit?
> </p>
>
> My question is -- how can I extract the string "Who is Roger Rabbit?" from
> each page using php? You see, I want to store the questions in a database
> without having to re-type, or cut/paste, each one.
>

if the files are html on the server then it should be easy to loop over each
one, loading the markup into memory and searching for what you want. i'd go
for xpath myself; i tend to always start there and fall back to regex, since
xpath & xsl are so much cleaner for dealing w/ markup.

anyway, here's the demo
--------------
tedd.html
--------------
<html>
<div>
sadfasdf
</div>
<h1>hello</h1>
<p class="question">
Who is Roger Rabbit?
</p>
<h2>more stuff</h2>
<p class="question">
Who is Roger Rabbit?
</p>
</html>

---------------------
transform.php
---------------------
<?php
// here is where you load a single file or change to iterate over a
// directory of files
// loadHTMLFile() is an instance method; calling it statically fails on PHP 8
$oDomDoc = new DOMDocument();
$oDomDoc->loadHTMLFile('./tedd.html');

// here is where you search for the question sections of each file
$oDomXpath = new DOMXPath($oDomDoc);
$oNodeList = $oDomXpath->query("//p[@class='question']");

// here is where you extract the question sections of each file
foreach ($oNodeList as $oDomNode) {
    var_dump($oDomNode->nodeValue);
}


should be trivial to expand that to work w/ multiple files.
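
One way the multi-file version might look, as a rough sketch rather than a tested solution. The helper name extractQuestions() and the ./pages directory are made up for illustration; the XPath query is the same one used in transform.php above. Note that @class='question' only matches when the class attribute is exactly "question".

```php
<?php
// Sketch only: extractQuestions() and the ./pages directory are hypothetical names.

/**
 * Return the trimmed text of every <p class="question"> in an HTML string.
 */
function extractQuestions(string $html): array
{
    if ($html === '') {
        return []; // loadHTML() rejects empty input on PHP 8
    }

    $doc = new DOMDocument();
    // Real-world pages are rarely valid; keep libxml's parse warnings quiet.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $questions = [];
    foreach ($xpath->query("//p[@class='question']") as $node) {
        $questions[] = trim($node->nodeValue);
    }
    return $questions;
}

// Walk every .html file in the (assumed) directory and collect its questions,
// keyed by file name.
$all = [];
foreach (glob(__DIR__ . '/pages/*.html') ?: [] as $path) {
    $all[basename($path)] = extractQuestions(file_get_contents($path));
}
print_r($all);
```

From there, each entry in $all could be written to the database in whatever way the rest of the application already does it.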



> Now, I can extract each question by using javascript --
>
> document.getElementById("question").innerHTML;
>

tedd, are you slipping? i thought you were searching by the class
attribute, lol.

-nathan
From: tedd on
At 3:58 PM +0100 4/3/10, Ashley Sheridan wrote:
>I don't think there is a getElementsByClass function. HTML5 is
>proposing one, but that will most likely be implemented in
>Javascript before PHP Dom. There is a way to tidy up the HTML to
>make it XHTML, but I'm not sure what it is. If you know roughly
>where in the document the HTML snippet is you can use XPath to grab
>it.
>
>Failing that, what about a regex? It shouldn't be too hard to write
>a regex to match your example above.
>
>Thanks,
>Ash

Ash:

I don't have a problem solving the problem the long way, which is to:

1. Load the file;
2. Parse between the markers;
3. Strip tags and replace extra white space.
4. Save to the db.

In fact, here's the code I used to solve the problem:

//--------

$filesize = filesize($filename);
$file = fopen( $filename, "r" );
$text = fread( $file, $filesize );
fclose( $file );

$marker1 = "<p class=\"question\">";
$marker2 = "</p>";

// skip past the opening marker, then find the first </p> that follows it
$first = strpos($text, $marker1) + strlen($marker1);
$last = strpos($text, $marker2, $first);
$len = $last - $first;

$text = substr($text, $first , $len);
$text = strip_tags($text);

$space = array(' ', "\t", "\n", "\r", "\x0B", "\x0C");

$words = array();
$all_words = explode(' ', $text);
foreach ($all_words as $line)
   {
   $line = str_replace($space, '', $line);
   if (strlen($line) > 0)
      {
      $words[] = $line;
      }
   }

$text = implode(' ',$words);
$text = htmlspecialchars($text);

//---------
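
For comparison, the same parse / strip / tidy steps can be sketched with a regular expression, as suggested earlier in the thread. This is only an illustration: firstQuestion() is a hypothetical helper name, and the pattern assumes the opening tag is written as <p class="question"> with no other attributes. Collapsing whitespace with \s+ also handles words separated only by a newline, which splitting on single spaces would glue together.

```php
<?php
// Sketch only: firstQuestion() is a hypothetical helper, not part of the code above.

function firstQuestion(string $html): ?string
{
    // Capture everything between the opening marker and the next </p>.
    if (!preg_match('~<p\s+class="question"\s*>(.*?)</p>~is', $html, $m)) {
        return null; // no question paragraph in this page
    }
    $text = strip_tags($m[1]);
    // Collapse every run of whitespace (spaces, tabs, newlines) to one space.
    $text = trim(preg_replace('/\s+/', ' ', $text));
    return htmlspecialchars($text);
}

var_dump(firstQuestion("<p class=\"question\">\nWho is   Roger\nRabbit?\n</p>"));
```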

I was just exploring PHP's getElement thing and wasn't having much
luck with it.
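
For what it's worth, a class lookup can be approximated with the getElement* methods that DOMDocument does have. As far as I know the classic DOMDocument offers getElementById() and getElementsByTagName() but no class-based getter, so the sketch below filters <p> elements by their class attribute. questionsByClass() is a hypothetical helper name.

```php
<?php
// Sketch only: questionsByClass() is a hypothetical helper.

function questionsByClass(DOMDocument $doc, string $class): array
{
    $found = [];
    // Fetch every <p>, then keep the ones carrying the wanted class.
    foreach ($doc->getElementsByTagName('p') as $p) {
        // class="question intro" should still count, so split on whitespace.
        $classes = preg_split('/\s+/', trim($p->getAttribute('class')));
        if (in_array($class, $classes, true)) {
            $found[] = trim($p->textContent);
        }
    }
    return $found;
}

$doc = new DOMDocument();
$doc->loadHTML('<p class="question intro">Who is Roger Rabbit?</p><p>Not a question</p>');
print_r(questionsByClass($doc, 'question'));
```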

Cheers,

tedd
--
-------
http://sperling.com http://ancientstones.com http://earthstones.com
From: "Peter Pei" on

>> Some JavaScript engines already support GetElementByClass, for example
>> Opera does.
>
> My example shows how, namely:
>
> document.getElementById("question").innerHTML;
>
> will return the value within the class.
>
> Cheers,
>
> tedd
>

In your original post, you said the data you had was:

<p class="question">
Who is Roger Rabbit?
</p>

Does that still stand? Or was there a typo, and class should really be ID?
From: tedd on
At 8:14 AM -0600 4/3/10, Peter Pei wrote:
>No, JavaScript's getElementByID() won't work here. As "question" is a
>class, not an ID. But like what was mentioned here, you can use
>getElementByClass() with Opera, and that will work.

Sort of.

Like I said, the following will work:

document.getElementById("question").innerHTML;

While you are using getElementById, which returns an ID, adding
.innerHTML to it will return the class value.

Try it.

Cheers,

tedd
From: tedd on
At 5:16 PM +0200 4/3/10, Piero Steinger wrote:
>
>Hi
>
>You could replace the "class" with "id" and then go on with JavaScript.
>
>A possibly better way is regular expressions...
>
>
>Greetz
>Piero

I can go with javascript "as-is" (what I showed) and don't have to
change any html.

Cheers,

tedd