intolerant HTML parser [Python]

From: Jim on 6 Feb 2010 14:09

I generate some HTML and I want to include in my unit tests a check
for syntax. So I am looking for a program that will complain at any
syntax irregularities.

I am familiar with Beautiful Soup (use it all the time) but it is
intended to cope with bad syntax. I just tried feeding
HTMLParser.HTMLParser some HTML containing 'ab' and it
didn't complain.

That is, this:
import HTMLParser

h = HTMLParser.HTMLParser()
try:
    h.feed('ab')
    h.close()
    print "I expect not to see this line"
except Exception, err:
    print "exception:", str(err)
gives me "I expect not to see this line".

Am I using that routine incorrectly? Is there a natural Python choice
for this job?

Thanks,
Jim

From: John Nagle on 6 Feb 2010 14:43

Jim wrote:
> I generate some HTML and I want to include in my unit tests a check
> for syntax. So I am looking for a program that will complain at any
> syntax irregularities.
>
> I am familiar with Beautiful Soup (use it all the time) but it is
> intended to cope with bad syntax. I just tried feeding
> HTMLParser.HTMLParser some HTML containing 'ab' and it
> didn't complain.

Try HTML5lib.

http://code.google.com/p/html5lib/downloads/list

The syntax for HTML5 has well-defined notions of "correct",
"fixable", and "unparseable". For example, the common but
incorrect form of HTML comments,

<- comment ->

is understood.

HTML5lib is slow, though. Sometimes very slow. It's really a reference
implementation of the spec. There's code like this:

# Should speed up this check somehow (e.g. move the set to a constant)
if ((0x0001 <= charAsInt <= 0x0008) or
    (0x000E <= charAsInt <= 0x001F) or
    (0x007F <= charAsInt <= 0x009F) or
    (0xFDD0 <= charAsInt <= 0xFDEF) or
    charAsInt in frozenset([0x000B, 0xFFFE, 0xFFFF, 0x1FFFE,
                            0x1FFFF, 0x2FFFE, 0x2FFFF, 0x3FFFE,
                            0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE,
                            0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE,
                            0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE,
                            0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE,
                            0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE,
                            0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE,
                            0xFFFFF, 0x10FFFE, 0x10FFFF])):
    self.tokenQueue.append({"type": tokenTypes["ParseError"],
                            "data":
                            "illegal-codepoint-for-numeric-entity",
                            "datavars": {"charAsInt": charAsInt}})

Every time through the loop (once per character), they build that frozen
set again.

John Nagle
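
The comment in the quoted code already hints at the fix. Here is a minimal sketch of it, assuming nothing about html5lib beyond the snippet above. The names ILLEGAL_RANGES, ILLEGAL_CODEPOINTS and is_illegal_codepoint are made up for illustration and are not html5lib's. The set is built once at import time and reused on every check.

```python
# Hypothetical names; not part of html5lib.
# Both constants are built once at import time instead of on every check.
ILLEGAL_RANGES = (
    (0x0001, 0x0008),
    (0x000E, 0x001F),
    (0x007F, 0x009F),
    (0xFDD0, 0xFDEF),
)

# 0x000B plus the last two code points of each of the 17 Unicode planes
# (0xFFFE, 0xFFFF, 0x1FFFE, ..., 0x10FFFF): the same 35 values as the
# literal list in the quoted code.
ILLEGAL_CODEPOINTS = frozenset(
    [0x000B]
    + [(plane << 16) | low for plane in range(0x11) for low in (0xFFFE, 0xFFFF)]
)


def is_illegal_codepoint(char_as_int):
    """Return True for code points the quoted check reports as a parse error."""
    if char_as_int in ILLEGAL_CODEPOINTS:
        return True
    return any(lo <= char_as_int <= hi for lo, hi in ILLEGAL_RANGES)
```

Whether this matters in practice depends on how often the check runs; it only shows the "move the set to a constant" idea from the comment.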

From: Jim on 6 Feb 2010 14:35

Thank you, John. I did not find that by looking around; I must not
have used the right words. The speed of the unit tests is not
critical so this seems like the solution for me.

Jim
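
For anyone wanting to try this in a unit test, here is a rough sketch of how html5lib could be asked for the problems it found. It assumes the third-party html5lib package is installed; the helper name html5_parse_errors is made up for this example. In its default mode the parser repairs what it can and records each problem in its errors list.

```python
import html5lib  # third-party package, assumed to be installed


def html5_parse_errors(markup):
    """Parse markup with html5lib and return the error codes it recorded."""
    parser = html5lib.HTMLParser()  # default (non-strict) mode collects errors
    parser.parse(markup)
    # Each entry in parser.errors is a (position, error_code, datavars) tuple.
    return [code for _position, code, _datavars in parser.errors]
```

A test could then assert that the returned list is empty for the generated page. Constructing the parser with strict=True should instead make parse() raise on the first error, which may be closer to the "complain at any irregularity" behaviour asked for.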

From: Nobody on 6 Feb 2010 23:33

On Sat, 06 Feb 2010 11:09:31 -0800, Jim wrote:

> I generate some HTML and I want to include in my unit tests a check
> for syntax. So I am looking for a program that will complain at any
> syntax irregularities.
>
> I am familiar with Beautiful Soup (use it all the time) but it is
> intended to cope with bad syntax. I just tried feeding
> HTMLParser.HTMLParser some HTML containing 'ab' and it
> didn't complain.

HTMLParser is a tokeniser, not a parser. It treats the data as a
stream of tokens (tags, entities, PCDATA, etc); it doesn't know anything
about the HTML DTD. For all it knows, the above example could be perfectly
valid (the "b" element might allow both its start and end tags to be
omitted).
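
To make the tokeniser point concrete, here is a small sketch. It uses the Python 3 module name html.parser (the module is called HTMLParser in Python 2), and the sample markup, class and function names are invented for illustration. The recorder shows that mismatched tags simply come through as events with no exception. The second function is a naive stack check layered on top, which is not a validator because it knows nothing about the DTD.

```python
from html.parser import HTMLParser  # named HTMLParser in Python 2


class EventRecorder(HTMLParser):
    """Record the token stream the parser reports; no structure is checked."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def tokenize(markup):
    """Feed markup to the recorder and return the list of events."""
    recorder = EventRecorder()
    recorder.feed(markup)
    recorder.close()
    return recorder.events


def first_nesting_error(markup):
    """Naive stack check on the token stream; returns a message or None.

    Illustration only: optional end tags and empty elements such as <br>
    would be flagged wrongly, because no DTD is consulted.
    """
    open_tags = []
    for kind, value in tokenize(markup):
        if kind == "start":
            open_tags.append(value)
        elif kind == "end":
            if not open_tags or open_tags[-1] != value:
                return "unexpected </%s>" % value
            open_tags.pop()
    if open_tags:
        return "unclosed <%s>" % open_tags[-1]
    return None
```

Calling tokenize('<i>a<b>b</i>') returns five events without raising, while first_nesting_error on the same string reports the stray end tag.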

Does the validation need to be done in Python? If not, you can use
"nsgmls" to validate any SGML document for which you have a DTD. OpenSP
includes nsgmls along with the various HTML DTDs.
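
If the check still has to be driven from a Python test suite, the external validator can be called through subprocess. This is a sketch under assumptions: the executable name ("onsgmls" is what OpenSP installs on many systems, "nsgmls" on others), the catalog path and both function names are placeholders to adjust. The -s option suppresses the normal output so that only error messages are printed, and -c names a catalog file.

```python
import subprocess


def nsgmls_command(path, catalog=None, executable="onsgmls"):
    """Build the command line: -s suppresses output, -c names a catalog."""
    cmd = [executable, "-s"]
    if catalog is not None:
        cmd += ["-c", catalog]
    cmd.append(path)
    return cmd


def sgml_messages(path, catalog=None, executable="onsgmls"):
    """Run the validator and return what it printed on stderr, line by line."""
    result = subprocess.run(
        nsgmls_command(path, catalog, executable),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stderr.splitlines()
```

A unit test could write the generated HTML to a temporary file and assert that sgml_messages() returns an empty list.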

From: Stefan Behnel on 8 Feb 2010 04:16

Jim, 06.02.2010 20:09:
> I generate some HTML and I want to include in my unit tests a check
> for syntax. So I am looking for a program that will complain at any
> syntax irregularities.

First thing to note here is that you should consider switching to an HTML
generation tool that does this automatically. Generating markup manually is
usually not a good idea.

> I am familiar with Beautiful Soup (use it all the time) but it is
> intended to cope with bad syntax. I just tried feeding
> HTMLParser.HTMLParser some HTML containing 'ab' and it
> didn't complain.
>
> That is, this:
> h = HTMLParser.HTMLParser()
> try:
>     h.feed('ab')
>     h.close()
>     print "I expect not to see this line"
> except Exception, err:
>     print "exception:", str(err)
> gives me "I expect not to see this line".
>
> Am I using that routine incorrectly? Is there a natural Python choice
> for this job?

You can use lxml and let it validate the HTML output against the HTML DTD.
Just load the DTD from a catalog using the DOCTYPE in the document (see the
'docinfo' property on the parse tree).

http://codespeak.net/lxml/validation.html#id1

Note that when parsing the HTML file, you should disable the parser failure
recovery to make sure it barks on syntax errors instead of fixing them up.

http://codespeak.net/lxml/parsing.html#parser-options
http://codespeak.net/lxml/parsing.html#parsing-html

Stefan
