From: jellybean stonerfish on
When processing XML files that contain comments, I have been piping them
through this function first, to remove the comments before I do my
processing.

xml-decomment ()
{
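gawk '
# Each record starts right after a "<!--"; fields are split on "-->",
# so in every record after the first, field 1 is the comment text.
BEGIN { RS = "<!--"; FS = "-->"; ORS = ""; OFS = "" }
{ if (NR > 1) $1 = ""; print }
'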
}

Using a command like

xml-decomment < file.xml | script.awk

I am working on how to make a pattern/action rule to add to the top of my
awk program that would do the same. Something like this maybe.

gawk '
/<!--/,/-->/ { $0="" }
rest of script
'

This works fine on lines that are only comments, but fails on lines with
code and comments.

<!-- This comment
is spread over more than one line
and there is nothing else on these
lines but comments. This will filter out fine -->

<!ENTITY entity-name "Entity Value"> <!-- These comments break rule-->

My simple pattern rule above will delete the whole of the above line,
including the ENTITY declaration that I want to keep.

Any pointers would be appreciated.

From: Dave B on
jellybean stonerfish wrote:

> In processing xml files

I suggest you take a look at xmlgawk.

> with comments, I have been piping them through
> this function, before I do my processing. This is to remove the
> comments, before I process.
>
> xml-decomment ()
> {
> gawk '
> BEGIN { RS = "<!--"; FS = "-->"; ORS=""; OFS="" } { if ( NR > 1 )
> $1=""; print }
> '
> }

Note that "-->" can legitimately appear in regular XML content, for example
inside an attribute value or a CDATA section. Let's assume that it does not
occur anywhere outside a comment.

Given that assumption, and since you are using gawk, you can do this:

gawk -v RS='<!--|-->' 'NR%2' file.xml

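to remove all the comments. By default each surviving record is printed with
a trailing newline, so a line break appears wherever a comment was removed.
Add "-v ORS=" if you want a more compact output format.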
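
As a quick check of the command above, here is a small sketch that wraps it
in a shell function and feeds it two lines containing one inline and one
multiline comment. The function name strip_xml_comments and the sample input
are made up for illustration; the gawk invocation is the one shown above with
"-v ORS=" added.

```shell
# Hypothetical wrapper name; relies on gawk treating RS as a regular expression.
strip_xml_comments() {
    # Records alternate: text, comment, text, comment, ...
    # NR%2 is true for the odd-numbered (non-comment) records only.
    gawk -v RS='<!--|-->' -v ORS= 'NR%2' "$@"
}

# Sample input: an inline comment and a comment spanning two lines.
printf '%s\n' '<a><!-- c1 -->text<!-- c2' 'multi --></a>' | strip_xml_comments
# prints: <a>text</a>
```

This only holds under the assumption stated earlier: a "-->" outside a comment
would shift the odd/even alternation and make the command drop real content.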

> Using a command like
>
> xml-decomment < file.xml | script.awk
>
> I am working on how to make a pattern/action rule to add to the top of my
> awk program that would do the same. Something like this maybe.
>
> gawk '
> /<!--/,/-->/ { $0="" }
> rest of script
> '
>
> This works fine on lines that are only comments, but fails on lines with
> code and comments.
>
> <!-- This comment
> is spread over more than one line
> and there is nothing else on these
> lines but comments. This will filter out fine -->
>
> <!ENTITY entity-name "Entity Value"> <!-- These comments break rule-->
>
> My simple pattern rule above will delete the above line.
>
> Any pointers would be appreciated.

As I said above, you might find xmlgawk useful for processing XML data. If
you insist on using plain awk, doing what you need would require examining
each line to remove comments that begin and end on that line, and, for
multiline comments, keeping track of whether you are inside a comment or
not. Without any a priori knowledge of the format of your comments, that
might become a bit complicated.
If several comments are on a single line, intermixed with regular XML code,
the greedy nature of awk regular expressions gets in the way: a pattern such
as <!--.*--> matches from the first "<!--" to the last "-->" on the line and
swallows the code in between. That makes it quite difficult to remove only
the comments without making too many assumptions.
That could probably be done in Perl, but my knowledge of Perl is still too
basic to be of help here.
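
For what it is worth, the line-by-line approach described above can be
sketched in awk without regular expressions at all, by searching for the
literal delimiters with index() and remembering in a flag whether the
previous line ended inside a comment. This is only a sketch under the same
assumption as before (no "<!--" or "-->" outside comments). The function name
decomment_inline and the variable names are made up, and the final
"NF { print }" rule merely stands in for the rest of the script.

```shell
# Hypothetical function name; the first rule is what would go at the top
# of an existing awk program.
decomment_inline() {
    gawk '
    # Strip XML comments from $0. in_comment carries over between lines.
    {
        line = $0
        out = ""
        while (line != "") {
            if (in_comment) {
                pos = index(line, "-->")
                if (pos == 0)
                    break                      # comment continues on next line
                line = substr(line, pos + 3)   # resume after the "-->"
                in_comment = 0
            } else {
                pos = index(line, "<!--")
                if (pos == 0) {
                    out = out line             # no more comments on this line
                    break
                }
                out = out substr(line, 1, pos - 1)   # keep text before "<!--"
                line = substr(line, pos + 4)
                in_comment = 1
            }
        }
        $0 = out
    }
    # Placeholder for the rest of the script: print lines that still have content.
    NF { print }
    ' "$@"
}
```

Because index() always finds the first occurrence of the delimiter, several
comments on one line are removed one at a time, so the greediness problem does
not arise. On the sample from the original post, only the ENTITY declaration
survives.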

--
echo 0|sed 's909=oO#3u)o19;s0#0ooo)].O0;s()(0bu}=(;s#}#.1m"?0^2{#;
s)")9v2@3%"9$);so%op]t(p$e#!o;sz(z^+.z;su+ur!z"au;sxzxd?_{h)cx;:b;
s/\(\(.\).\)\(\(..\)*\)\(\(.\).\)\(\(..\)*#.*\6.*\2.*\)/\5\3\1\7/;
tb'|awk '{while((i+=2)<=length($1)-18)a=a substr($1,i,1);print a}'