From: jellybean stonerfish on 15 Jun 2008 10:48

In processing xml files with comments, I have been piping them through
this function, before I do my processing. This is to remove the
comments, before I process.

    xml-decomment ()
    {
        gawk '
        BEGIN { RS = "<!--"; FS = "-->"; ORS=""; OFS="" }
        { if ( NR > 1 ) $1=""; print }
        '
    }

Using a command like

    xml-decomment < file.xml | script.awk

I am working on how to make a pattern/action rule to add to the top of my
awk program that would do the same. Something like this maybe.

    gawk '
    /<!--/,/-->/ { $0="" }
    rest of script
    '

This works fine on lines that are only comments, but fails on lines with
code and comments.

    <!-- This comment
    is spread over more than one line
    and there is nothing else on these
    lines but comments. This will filter out fine -->

    <!ENTITY entity-name "Entity Value"> <!-- These comments break rule-->

My simple pattern rule above will delete the above line.

Any pointers would be appreciated.

From: Dave B on 15 Jun 2008 13:52
jellybean stonerfish wrote:

> In processing xml files

I suggest you take a look at xmlgawk.

> with comments, I have been piping them through
> this function, before I do my processing. This is to remove the
> comments, before I process.
>
> xml-decomment ()
> {
> gawk '
> BEGIN { RS = "<!--"; FS = "-->"; ORS=""; OFS="" } { if ( NR > 1 )
> $1=""; print }
> '
> }

Note that "-->" is perfectly valid content for a tag. Let's assume that
it does not occur inside a regular tag. Assuming that, since you use gawk,
you can do this:

    gawk -v RS='<!--|-->' 'NR%2' file.xml

to remove all the comments. Add "-v ORS=" if you want a more compact
output format.

> Using a command like
>
> xml-decomment < file.xml | script.awk
>
> I am working on how to make a pattern/action rule to add to the top of my
> awk program that would do the same. Something like this maybe.
>
> gawk '
> /<!--/,/-->/ { $0="" }
> rest of script
> '
>
> This works fine on lines that are only comments, but fails on lines with
> code and comments.
>
> <!-- This comment
> is spread over more than one line
> and there is nothing else on these
> lines but comments. This will filter out fine -->
>
> <!ENTITY entity-name "Entity Value"> <!-- These comments break rule-->
>
> My simple pattern rule above will delete the above line.
>
> Any pointers would be appreciated.

As I said above, you might find xmlgawk useful for processing xml data.

If you insist on using awk, doing what you need would require examining
each line to remove single-line comments and comments shorter than a
single line and, for multiline comments, determining whether you are
inside a comment or not. Without any a priori knowledge of the format of
your comments, that might become a bit complicated. If many comments are
on a single line, intermixed with regular xml code, the greedy nature of
awk regular expressions makes it quite difficult to remove only those
without making too many assumptions.

That could probably be done in perl, but my knowledge of perl is still
too basic to be of help here.

--
echo 0|sed 's909=oO#3u)o19;s0#0ooo)].O0;s()(0bu}=(;s#}#.1m"?0^2{#;
s)")9v2@3%"9$);so%op]t(p$e#!o;sz(z^+.z;su+ur!z"au;sxzxd?_{h)cx;:b;
s/\(\(.\).\)\(\(..\)*\)\(\(.\).\)\(\(..\)*#.*\6.*\2.*\)/\5\3\1\7/;
tb'|awk '{while((i+=2)<=length($1)-18)a=a substr($1,i,1);print a}'
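
For what it's worth, the line-by-line approach described in the reply (strip the comment parts of each line and remember across lines whether you are still inside a comment) can be sketched in plain awk. This is only a sketch under the same assumption made above, namely that "-->" never occurs outside a comment, plus the assumption that comments do not nest. The function name strip_xml_comments and the final NF rule are placeholders of mine, not something from the original posts. It uses index() and substr() rather than regular expressions, so greedy matching does not come into play, and it needs nothing gawk-specific.

```shell
# Hypothetical helper (name is an assumption): remove <!-- ... --> comments,
# including several on one line and comments that span lines.
strip_xml_comments() {
    awk '
    {
        out = ""        # text of this line that lies outside comments
        rest = $0       # part of the line not yet examined
        while (rest != "") {
            if (in_comment) {
                pos = index(rest, "-->")
                if (pos == 0)
                    break                      # comment continues on the next line
                rest = substr(rest, pos + 3)   # skip past the terminator
                in_comment = 0
            } else {
                pos = index(rest, "<!--")
                if (pos == 0) {
                    out = out rest             # no further comment on this line
                    break
                }
                out = out substr(rest, 1, pos - 1)
                rest = substr(rest, pos + 4)   # skip past the opener
                in_comment = 1
            }
        }
        $0 = out        # later rules see the line without comments
    }
    # Placeholder for the rest of the script: print lines that still have content.
    NF
    ' "$@"
}
```

Used as `strip_xml_comments < file.xml`, or with the first rule pasted at the top of an existing awk program in place of the range pattern, so that the rules after it see each line with the comments already removed.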