From: tomasorti on 16 Jun 2008 13:33

Hi.
I was wondering if it would be possible to filter the href from an HTML
file with sed.

For example: If a file has lots of these lines:

.....
</p><p></p><h1><a name="uh-55" href="http://www.grymoire.com/Unix/
Sed.html#toc-uh-55">Hold with h or H</a></h1><p>The
<p><br>Click here to get file: <a href="http://www.grymoire.com/Unix/
Scripts/grep3a.sh">grep3a.sh</a><br>
</p><p></p><h1><a name="uh-56" href="http://www.grymoire.com/Unix/
Sed.html#toc-uh-56">Keeping more than one line in the hold buffer</a></
h1><p>The
<p><br>Click here to get file: <a href="http://www.grymoire.com/Unix/
Scripts/grep_previous.sh">grep_previous.sh</a><br>
....

And I want to get the "http:// .........", without the quotes.
Is it possible to do it with sed? Do I need awk?
Help/hints are appreciated.

Cheers,
Tom.
From: Dave B on 16 Jun 2008 14:03

tomasorti wrote:
> Hi.
> I was wondering if it would be possible to filter the href from an HTML
> file with sed.
>
> For example: If a file has lots of these lines:
>
> ....
> </p><p></p><h1><a name="uh-55" href="http://www.grymoire.com/Unix/
> Sed.html#toc-uh-55">Hold with h or H</a></h1><p>The
> <p><br>Click here to get file: <a href="http://www.grymoire.com/Unix/
> Scripts/grep3a.sh">grep3a.sh</a><br>
> </p><p></p><h1><a name="uh-56" href="http://www.grymoire.com/Unix/
> Sed.html#toc-uh-56">Keeping more than one line in the hold buffer</a></
> h1><p>The
> <p><br>Click here to get file: <a href="http://www.grymoire.com/Unix/
> Scripts/grep_previous.sh">grep_previous.sh</a><br>
> ...
>
> And I want to get the "http:// .........", without the quotes.
> Is it possible to do it with sed? Do I need awk?
> Help/hints are appreciated.

Processing markup data with awk or sed can be difficult if the format of
the input file isn't known in advance. You don't say whether your input
above is all a single line, or represents many lines of data.

If you have a single link per input line, then you can do this with sed:

sed 's/.*href="\([^"]*\)".*/\1/' file.html

(assuming that the string you're interested in is the one between double
quotes following the href=)

With GNU awk, you can do something like this:

awk -v RS='http://[^"]+' '{print RT}' file.html

which, albeit by no means perfect, should work in most cases (beware that
it adds a blank line at the end). Keep in mind that it has several
problems, the first being that if http://something appears outside an href
attribute (inside a regular tag, for example), then that, and everything
that follows up to the first ", will be reported as a link. However, that
might work acceptably with your input.

-- 
echo 0|sed 's909=oO#3u)o19;s0#0ooo)].O0;s()(0bu}=(;s#}#.1m"?0^2{#;
s)")9v2@3%"9$);so%op]t(p$e#!o;sz(z^+.z;su+ur!z"au;sxzxd?_{h)cx;:b;
s/\(\(.\).\)\(\(..\)*\)\(\(.\).\)\(\(..\)*#.*\6.*\2.*\)/\5\3\1\7/;
tb'|awk '{while((i+=2)<=length($1)-18)a=a substr($1,i,1);print a}'
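
The "single link per input line" condition on the sed command matters because the leading .* is greedy. Here is a minimal sketch of that behaviour, assuming a made-up two-line sample written to a temporary file (the example.com URLs and the file contents are hypothetical, not the poster's data):

```shell
# Hypothetical sample input: line 1 has one link, line 2 has two.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<h1><a name="uh-55" href="http://www.grymoire.com/Unix/Sed.html#toc-uh-55">Hold with h or H</a></h1>
<a href="http://example.com/first">one</a> <a href="http://example.com/second">two</a>
EOF

# One link on the line: the whole line is replaced by the captured URL.
# Two links on the line: the leading .* is greedy, so it swallows the
# first href and only the last one on the line is captured.
result=$(sed 's/.*href="\([^"]*\)".*/\1/' "$tmp")
printf '%s\n' "$result"

rm -f "$tmp"
```

On this sample the second line yields only http://example.com/second, so for input with several links per line a different approach would be needed.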

From: Dave B on 16 Jun 2008 17:17

Dave B wrote:
> If you have a single link per input line, then you can do this with sed:
>
> sed 's/.*href="\([^"]*\)".*/\1/' file.html

If you have *at most* one link per input line, this is better, because it
prints only the lines where the substitution actually matched:

sed -n 's/.*href="\([^"]*\)".*/\1/p' file.html

> With GNU awk, you can do something like this:
>
> awk -v RS='http://[^"]+' '{print RT}' file.html

A similar result (with the same problems) can be obtained with GNU grep, if
links don't have newlines in them (as instead seems to be the case in the
data you posted, but maybe that was just line wrapping):

grep -o 'http://[^"]\{1,\}' file.html
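
To compare the two commands from this follow-up, here is a rough sketch run against a hypothetical sample file (the contents and the example.com URL are invented for illustration). It assumes a grep that supports -o, such as GNU grep:

```shell
# Hypothetical sample: line 1 has a URL in plain text but no href,
# line 2 has a real link.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<p>Visit http://example.com/not-a-link (see "notes")</p>
<a href="http://www.grymoire.com/Unix/Scripts/grep3a.sh">grep3a.sh</a><br>
EOF

# -n suppresses automatic printing and the p flag prints only lines where
# the substitution succeeded, so the line without an href is dropped.
sed_links=$(sed -n 's/.*href="\([^"]*\)".*/\1/p' "$tmp")

# grep -o prints every match of the pattern, so the plain-text URL is
# reported too, together with everything up to the next double quote.
grep_links=$(grep -o 'http://[^"]\{1,\}' "$tmp")

printf 'sed:\n%s\n' "$sed_links"
printf 'grep:\n%s\n' "$grep_links"

rm -f "$tmp"
```

The sed variant keys on href=" and so skips the plain-text URL, while the grep variant keys on http:// and reports it, which is the false positive described earlier in the thread.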
From: tomasorti on 17 Jun 2008 02:40

Great! Thank you very much!
Yes, I meant 1 link per line, with no newlines inside them.
It seems that pasting them into the post broke them.

You are the man! Thanks again.

On Jun 16, 11:17 pm, Dave B <da...(a)addr.invalid> wrote:
> Dave B wrote:
> > If you have a single link per input line, then you can do this with sed:
>
> > sed 's/.*href="\([^"]*\)".*/\1/' file.html
>
> If you have *at most* a link per input line, this is better:
>
> sed -n 's/.*href="\([^"]*\)".*/\1/p' file.html
>
> > With GNU awk, you can do something like this:
>
> > awk -v RS='http://[^"]+' '{print RT}' file.html
>
> A similar result (with the same problems) can be obtained with GNU grep, if
> links don't have newlines in them (as it instead seems the case in the data
> you posted. But maybe that was just line wraps):
>
> grep -o 'http://[^"]\{1,\}' file.html