From: "Gary ." on
On 7/8/10, Richard Quadling wrote:
> On 8 July 2010 16:15, Gary wrote:
>> Okay. At least one of the problems with this so-called HTML seems to
>> be that the body tag looks like
>> <BODY vlink=#ffffff ...>
>> and xml_parse complains that "> required" on that line (i.e. it is
>> claiming it can't find the end of the tag!).
>>
>> I'm guessing that those attributes "must" be quoted in XML and
>> "should" be in HTML (but patently aren't)? Is there any way to get
>> xml_parse to ignore that? My element_handler functions never even get
>> a chance to see that line.
>>
>> Regex to insert quotes or remove the attributes entirely, perhaps?
>> *gulp* I hope there's a better way than that.
>
> So. Essentially, you want to parse some plain text which may or may
> not be well formed XML.

No. I don't *want* to.... And it isn't plain text, it's just sh*t HTML
(no doctype, missing closing tags in some cases, etc.). It's an
absolute mess. Browsers are pretty good at handling it. XML
parsers... less so.

> How badly formed is the file going to be?

It's not a file. It comes from an embedded web server on a device. I
could ask them to change it. I can hear the laughter already.

> If it is things like missing ", then this could be managed with regex.
> Essentially you are going to have to do the clean up that Tidy could
> do for you.

Yeah :(
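
For what it's worth, the "regex to insert quotes" idea discussed above could look roughly like the sketch below. This is only an illustration, not a tested fix for the device's output: the function name quote_unquoted_attributes() is made up for this example, it needs PHP 5.3+ for the closure, and it only handles the simple case of name=value pairs inside a tag. An "=" inside an already-quoted value, or a value that runs straight into "/>", will still trip it up, and it does nothing about missing closing tags.

```php
<?php
// Hypothetical helper (not part of PHP): wrap unquoted attribute values
// in double quotes so that a stricter parser such as xml_parse() can at
// least get past the tag. Already-quoted values are left alone.
function quote_unquoted_attributes($html)
{
    // Outer pass: only touch text that looks like an opening tag.
    return preg_replace_callback(
        '/<[A-Za-z][^<>]*>/',
        function ($m) {
            // Inner pass: name=value where value does not start with a quote.
            return preg_replace(
                '/(\s[A-Za-z_:][-\w:.]*)\s*=\s*(?!["\'])([^\s"\'<>]+)/',
                '$1="$2"',
                $m[0]
            );
        },
        $html
    );
}
```

The result would then be fed to xml_parse() in place of the raw response. Whether that is enough depends on how much else is wrong with the markup.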
From: "Gary ." on
On 7/8/10, Marc Guay wrote:
>> And yes, I'd rather use DOM, but I can't.
>
> Could you use this: http://simplehtmldom.sourceforge.net/?

Interesting.

Although I can't use DOM or Tidy (because they're normally built in,
but TPTB decided to recompile PHP and exclude them, and I am not
allowed to recompile it with them in), that's external so might be a
possibility.

Thanks.
From: "Gary ." on
On 7/8/10, Nisse Engström wrote:
> On Thu, 8 Jul 2010 17:15:02 +0200, "Gary ." wrote:

>> I'm guessing that those attributes "must" be quoted in XML and
>> "should" be in HTML (but patently aren't)?
>
> For that attribute value, it's a "must" in both cases.

Okay. Please tell L******! :)
From: Richard Quadling on
On 8 July 2010 18:55, Gary . <php-general(a)garydjones.name> wrote:
> On 7/8/10, Marc Guay wrote:
>>> And yes, I'd rather use DOM, but I can't.
>>
>> Could you use this: http://simplehtmldom.sourceforge.net/?
>
> Interesting.
>
> Although I can't use DOM or Tidy (because they're normally built in,
> but TPTB decided to recompile PHP and exclude them, and I am not
> allowed to recompile it with them in), that's external so might be a
> possibility.
>
> Thanks.

If it were Windows, the Tidy extension would be loadable via php.ini.

You could ask TPTB why they've removed the only tool that can read
this sh*t with any success?

Make the case for it. If they still say no, then tell them that the
sh*t is NOT XML and therefore the XML tools won't read it.