From: Helmut Richter on
For a seemingly simple problem with regular expressions I tried out several
solutions. One of them seems to be working now, but I would like to learn why
the solutions behave differently. Perl is 5.8.8 on Linux.

The task is to replace the characters # $ \ by their HTML entity, e.g. &#35;
but not within markup. The following code reads and consumes a variable
$inbuf0 and builds up a variable $inbuf with the result.

Solution 1:

while ($inbuf0) {
    $inbuf0 =~ /^(?:    # skip initial sequences of
          [^<\&#\$\\]+    # harmless characters
        | <[A-Za-z:_\200-\377](?:[^>"']|"[^"]*"|'[^']*')*>    # start tags
        | <\/[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*\s*>    # end tags
        | \&(?:[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*|\#(?:[0-9]+|x[0-9A-Fa-f]+));    # entity or character references
        | <!--(?:.|\n)*?-->    # comments
        | <[?](?:.|\n)*?[?]>    # processing instructions, etc.
    )*/x;
    $inbuf .= $&;
    $inbuf0 = $';
    if ($inbuf0) {
        $inbuf .= '&#' . ord($inbuf0) . ';';
        substr ($inbuf0, 0, 1) = '';
        $replaced = 1;
    };
};

Here the regexp eats up the maximal initial string (note the * at the end of
the regexp) that need not be processed and then processes the first character
of the remainder.

This version sometimes works and sometimes crashes with a segmentation
fault.

Another version has * instead of + at the "harmless characters". That one does
not try all alternatives, because the first one always matches; that is, the *
at the end of the regexp is not used in this case.

Yet another version has no quantifier instead of + at the "harmless
characters", thus eating zero or one character per iteration of the final *.
This should have the same net effect, but it always crashes with a
segmentation fault.


Solution 2:

while ($inbuf0) {
    if ($inbuf0 =~ /^    # skip initial
          [^<\&#\$\\]+    # harmless characters
        | <[A-Za-z:_\200-\377](?:[^>"']|"[^"]*"|'[^']*')*>    # start tags
        | <\/[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*\s*>    # end tags
        | \&(?:[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*|\#(?:[0-9]+|x[0-9A-Fa-f]+));    # entity or character references
        | <!--(?:.|\n)*?-->    # comments
        | <[?](?:.|\n)*?[?]>    # processing instructions, etc.
    /x) {
        $inbuf .= $&;
        $inbuf0 = $';
    } else {
        $inbuf .= '&#' . ord($inbuf0) . ';';
        substr ($inbuf0, 0, 1) = '';
        $replaced = 1;
    };
};

Here the regexp eats up an initial string, typically not maximal (note the
absence of * at the end of the regexp), that need not be processed and, if
nothing has been found, processes the first character of the input.

This version runs considerably slower, by a factor of three, but has so far
not yielded segmentation faults. I am using it now.

I am sure there are lots of other ways to do it. What knowledge would
have saved me the numerous trial-and-error cycles and let me get it
right from the beginning?
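
One general technique that applies to loops of this kind is to scan the
buffer in place with \G and the /gc modifiers instead of copying $' back into
the buffer on every iteration, since each such copy costs time proportional to
the remaining input. The following is only a minimal sketch under stated
assumptions: the sub name escape_outside_markup is hypothetical, and the
markup alternatives are deliberately simplified (for example, a ">" inside a
quoted attribute value is not handled). The alternatives are grouped so that
\G applies to all of them; an anchor written in front of an ungrouped
alternation binds only to its first branch.

```perl
use strict;
use warnings;

# Hypothetical helper: escape every character that is neither harmless text
# nor part of (simplified) markup, scanning in place with \G and /gc.
sub escape_outside_markup {
    my ($in) = @_;
    my $out = '';
    pos($in) = 0;
    while (pos($in) < length $in) {
        if ($in =~ /\G( [^<&#\$\\]+          # harmless characters
                      | <[^>]*>              # simplified: anything tag-like
                      | &\#?[0-9A-Za-z]+;    # simplified: entity or character reference
                      )/gcx) {
            $out .= $1;                      # copy the skipped piece unchanged
        }
        else {
            $in =~ /\G(.)/gcs;               # consume exactly one character
            $out .= '&#' . ord($1) . ';';    # and emit its character reference
        }
    }
    return $out;
}

print escape_outside_markup('a#b <a href="#x">$</a> &amp;'), "\n";
```

Because the group has no trailing quantifier, each match attempt is shallow,
and /c keeps pos() in place when a match fails, so the else branch can
continue from the same position.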

--
Helmut Richter
From: Peter Makholm on
Helmut Richter <hhr-m(a)web.de> writes:

> For a seemingly simple problem with regular expressions I tried out several
> solutions. One of them seems to be working now, but I would like to learn why
> the solutions behave differently. Perl is 5.8.8 on Linux.

The regexp engine in perl 5.8.8 is implemented by recursion. This is
known to cause segmentation faults on some occasions. See
http://www.nntp.perl.org/group/perl.perl5.porters/2006/05/msg113036.html

Upgrading to perl 5.10 solves this issue by making the regexp engine
iterative instead.
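
For those who cannot upgrade, a partial mitigation may be to avoid quantified
groups where a quantified single atom does the same job. As far as I
understand the 5.8 engine (this is an assumption, not checked against the
5.8.8 source), a construct such as (?:.|\n)*? costs one level of recursion per
character, whereas .*? under the /s modifier does not. The sketch below only
shows that the two forms match the same text; the sample string is made up.

```perl
use strict;
use warnings;

# Made-up sample input containing a multi-line comment.
my $doc = "text <!-- a\nmulti-line\ncomment --> more";

# Form used in the posted code: a quantified group, one iteration per character.
my ($grouped) = $doc =~ /(<!--(?:.|\n)*?-->)/;

# Same match with /s, where "." also matches a newline, so no group is needed.
my ($simple) = $doc =~ /(<!--.*?-->)/s;

print "same match\n" if $grouped eq $simple;
```

The outer (?: ... )* in Solution 1 is itself a quantified group, which would
be consistent with the observation that the one-character-per-iteration
variant crashes most reliably.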

> The task is to replace the characters # $ \ by their HTML entity, e.g. &#35;
> but not within markup. The following code reads and consumes a variable
> $inbuf0 and builds up a variable $inbuf with the result.

Trying to handle XML and HTML correctly by parsing it with regular
expressions isn't recommended at all. I would use some XML parser and
walk through the DOM and change the content of text nodes with the
trivial substitution on each text node.
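
The substitution itself could look like the sketch below. The sub name
escape_text is hypothetical, and the parser that hands over the content of
each text node is assumed to exist and is not shown.

```perl
use strict;
use warnings;

# Hypothetical helper: replace # $ \ in the content of one text node
# with numeric character references.
sub escape_text {
    my ($text) = @_;
    $text =~ s/([#\$\\])/'&#' . ord($1) . ';'/ge;
    return $text;
}

print escape_text('cost: $5 #1 C:\\tmp'), "\n";
# prints: cost: &#36;5 &#35;1 C:&#92;tmp
```

How the result is written back depends on the parser: a serializer that
escapes "&" in text nodes would turn &#36; into &amp;#36;, so the replacement
may have to be applied at serialization time or through an API that accepts
raw markup.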

//Makholm
From: Jürgen Exner on
Helmut Richter <hhr-m(a)web.de> wrote:
>For a seemingly simple problem with regular expressions I tried out several
>solutions. One of them seems to be working now, but I would like to learn why
>the solutions behave differently. Perl is 5.8.8 on Linux.
>
>The task is to replace the characters # $ \ by their HTML entity, e.g. &#35;
>but not within markup.
[...]

You may want to read up on the Chomsky hierarchy. HTML is not a regular
language but a context-free language. Therefore it cannot be parsed by a
regular engine.

Granted, Perl's Regular Expressions have extensions that make them
significantly more powerful than a formal regular engine, but they are
still the wrong tool for the job. Use any standard HTML parser to
dissect your file into its components and then apply your substitution
to those components where you want them applied.

jue
From: Helmut Richter on
On Fri, 12 Feb 2010, wrote:

> You may want to read up on the Chomsky hierarchy. HTML is not a regular
> language but a context-free language. Therefore it cannot be parsed by a
> regular engine.

But the distinction between markup and non-markup is regular. The only
parenthesis-like structure I have found so far is the nesting of brackets in
<![CDATA[ ... ]]>, but this is also regular, as ]]> cannot occur inside.
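
To illustrate the point, the markup/non-markup split can be written as a
single flat alternation with no nesting. This is a minimal sketch, not the
full grammar from the original post: the names $markup and escape_specials are
hypothetical, the name characters are restricted to ASCII, and, unlike the
posted loops, stray < and & characters are left untouched.

```perl
use strict;
use warnings;

# Simplified markup pattern (an assumption for illustration only).
my $markup = qr{
      <!--.*?-->                                  # comments
    | <!\[CDATA\[.*?\]\]>                         # CDATA sections
    | <[?].*?[?]>                                 # processing instructions
    | </?[A-Za-z](?:[^>"']|"[^"]*"|'[^']*')*>     # start and end tags
    | &(?:[A-Za-z][-.0-9A-Za-z]*|\#(?:[0-9]+|x[0-9A-Fa-f]+));   # references
}xs;

# Hypothetical helper: markup is copied through, # $ \ elsewhere are replaced.
sub escape_specials {
    my ($html) = @_;
    $html =~ s{($markup)|([#\$\\])}{ defined $1 ? $1 : '&#' . ord($2) . ';' }ge;
    return $html;
}

print escape_specials('a#b <a href="#x">$</a> <![CDATA[ $ ]]>'), "\n";
```

Each alternative consumes a complete piece of markup in one step, so a special
character is only ever seen by the second branch when it lies outside markup.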

*If* I were interested in the semantics of the tags, I would probably
follow the advice given here to use an XML parser, provided I keep
control of what to do when the input is not well-formed XML. Just being
told "your data is not okay, so nothing can be done with it" would not
suffice: even in an environment where the end user has full control of
everything, it is not always the best idea to have him fix every error
before proceeding; sometimes it is better to leave errors in the input and
fix them at a later step.

--
Helmut Richter
From: Dr.Ruud on
Helmut Richter wrote:

> [again parsing the wrong way]

Is there a newsgroup or mailing list that we can refer "them" to?
I am sure that we are well past our monthly share already.

--
Ruud