From: Raymond Irving on
Hello,

I'm experiencing another issue when attempting to use DOMDocument::loadXML()
to load the following HTML code:

<?php
$html = '
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<body>
<script type="text/javascript">
<!--
var i = 0, html = "<strong>Bold Text</strong>,Normal Text";
document.write(html);
i--; // this line causes the parser to fail
alert(html);
-->
</script>
</body>
</html>';
$dom = new DOMDocument();
$dom->loadXML($html);
echo $dom->saveHTML();
?>

The parser throws the following error when it encounters "i--" inside the
<script> tag:

Warning: DOMDocument::loadXML() [domdocument.loadxml]: Comment not
terminated <!-- var i = 0, html = "<strong>Bold Text< in Entity

If I remove the line "i--" it will load the HTML code just fine.

Any ideas as to why this throws an error?

__
Raymond
From: Adam Richardson on
On Sun, Jun 6, 2010 at 10:39 PM, Raymond Irving <xwisdom(a)gmail.com> wrote:

> The parser throws the following error when it encounters "i--" inside
> the <script> tag:
>
> Warning: DOMDocument::loadXML() [domdocument.loadxml]: Comment not
> terminated <!-- var i = 0, html = "<strong>Bold Text< in Entity
>
> If I remove the line "i--" it will load the HTML code just fine.
>
> Any ideas as to why this throws an error?
>
> __
> Raymond
>


A comment declaration starts with "<!", and ends with ">", with any number
of comments following the form --comment-- in between:
http://htmlhelp.com/reference/wilbur/misc/comment.html

You'll see at the bottom of the article that they advocate a simple rule in
comments:
An HTML comment begins with "<!--", ends with "-->" and does not contain "--"
or ">" anywhere in the comment.

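The "--" in "i--" breaks that rule. XML is stricter still: a comment may not
contain "--" anywhere before the closing "-->", so loadXML() rejects it.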
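
To see the rule in isolation, here is a minimal sketch (not taken from your
page; the <root> documents are made up for illustration). It assumes the
libxml functions libxml_use_internal_errors() and libxml_get_errors() are
available, which collect parser errors instead of printing warnings:

```php
<?php
// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical test documents: the only difference is the "--" inside the comment.
$withDoubleHyphen    = '<root><!-- i--; --></root>';
$withoutDoubleHyphen = '<root><!-- i -= 1; --></root>';

$domA   = new DOMDocument();
$first  = $domA->loadXML($withDoubleHyphen);    // "--" inside an XML comment
$errors = libxml_get_errors();                  // whatever the parser reported
libxml_clear_errors();

$domB   = new DOMDocument();
$second = $domB->loadXML($withoutDoubleHyphen); // same comment without "--"
libxml_clear_errors();

var_dump($first, $second);
```

The first load should be refused and the second accepted, which matches what
you saw when you removed the "i--" line.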

In your case, if you're maintaining the pages, you can place the JavaScript
in a separate file or wrap it in a CDATA section. If you're parsing pages you
don't maintain, you can strip out the JavaScript before performing DOM tasks
and parse it separately as needed to avoid potential issues.
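
A rough sketch of the CDATA option, assuming you control the markup. The
"//" in front of the CDATA markers is only there so that browsers treating
the page as HTML see them as JavaScript comments; the variable names are
mine:

```php
<?php
// Same page as in the question, but the script body sits in a CDATA
// section instead of an XML comment, so "--" is no longer a problem.
$xhtml = <<<'XML'
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<body>
<script type="text/javascript">
//<![CDATA[
var i = 0, html = "<strong>Bold Text</strong>,Normal Text";
document.write(html);
i--; // inside CDATA the parser treats this as plain character data
alert(html);
//]]>
</script>
</body>
</html>
XML;

$dom = new DOMDocument();
$ok  = $dom->loadXML($xhtml);

// The script source is available as the text content of the element.
$script = $dom->getElementsByTagName('script')->item(0);
echo $script->textContent;
```

The markup inside the string ("<strong>") is also left alone this way,
because nothing inside a CDATA section is parsed as tags.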

Adam

From: Raymond Irving on
Hi Adam,

Thanks for the update, but I'm thinking that it would be much easier if the
DOM parser could just ignore the contents of the <script> tags when parsing
HTML content. This way we would not have to strip out JavaScript or force users
to add JavaScript to a separate file.

What do you think?

__
Raymond Irving

On Sun, Jun 6, 2010 at 11:22 PM, Adam Richardson <simpleshot(a)gmail.com> wrote:

> In your case, if you're maintaining the pages, you can place the JavaScript
> in a separate file or place the JavaScript in a CDATA section. If you're
> parsing pages you don't maintain, you can rip out the javascript before
> performing DOM tasks and parse it separately as needed to avoid potential
> issues.
>
> Adam
From: Andrew Ballard on
On Mon, Jun 7, 2010 at 3:30 PM, Raymond Irving <xwisdom(a)gmail.com> wrote:
> Hi Adam,
>
> Thanks for the update but I'm thinking that it would be much easier if the
> DOM parser could just ignore the contents of the <script> tags when parsing
> HTML content. This way we would not have to strip out JavaScript or force users to
> add JavaScript to a separate file.
>
> What do you think?
>
> __
> Raymond Irving

You didn't tell it to open the contents as HTML; you told it to open
the contents as XML.
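
A small sketch of the difference, using the markup from the original post.
The variable names are mine, and libxml_use_internal_errors(true) is only
there to keep the warnings off the page. The HTML parser may still record
warnings of its own, depending on the libxml2 version:

```php
<?php
// Keep parser warnings out of the output; they are collected internally.
libxml_use_internal_errors(true);

$html = <<<'HTML'
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<body>
<script type="text/javascript">
<!--
var i = 0, html = "<strong>Bold Text</strong>,Normal Text";
document.write(html);
i--;
alert(html);
-->
</script>
</body>
</html>
HTML;

// Parsed as XML: the strict comment rule applies.
$asXml = new DOMDocument();
$xmlOk = $asXml->loadXML($html);
libxml_clear_errors();

// Parsed as HTML: the more forgiving HTML parser is used instead.
$asHtml = new DOMDocument();
$htmlOk = $asHtml->loadHTML($html);
libxml_clear_errors();

var_dump($xmlOk, $htmlOk);
```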

Andrew
From: Raymond Irving on
Well, it actually fails when loadHTML() is used as well.
The strange thing is that it fails regardless of the "--" characters:

"Unexpected end tag : strong in Entity"
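
For anyone who wants to see exactly what the parser reports here, a sketch
along these lines lists the collected messages instead of letting them
surface as PHP warnings. The shortened test page and the variable names are
made up for illustration, and which messages appear (if any) depends on the
libxml2 version:

```php
<?php
// Collect libxml errors so they can be inspected after parsing.
libxml_use_internal_errors(true);

// Hypothetical cut-down page: a script whose string contains markup.
$html = '<html><body><script type="text/javascript">'
      . 'var html = "<strong>Bold Text</strong>,Normal Text";'
      . '</script></body></html>';

$dom    = new DOMDocument();
$loaded = $dom->loadHTML($html);

$messages = array();
foreach (libxml_get_errors() as $error) {
    // Each entry is a LibXMLError with level, line and message properties.
    $messages[] = sprintf(
        '[level %d] line %d: %s',
        $error->level,
        $error->line,
        trim($error->message)
    );
}
libxml_clear_errors();

echo $loaded ? "loaded\n" : "not loaded\n";
echo implode("\n", $messages), "\n";
```

Comparing $loaded with the message list shows whether a report such as the
one above actually stopped the document from loading or was only a warning.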

__
Raymond Irving

On Mon, Jun 7, 2010 at 2:50 PM, Andrew Ballard <aballard(a)gmail.com> wrote:

> You didn't tell it to open the contents as HTML; you told it to open
> the contents as XML.
>
> Andrew
>