From: tomasorti on
Hi.
I was wondering if it would be possible to filter the href from an HTML
file with sed.

For example: if a file has lots of lines like these:

.....
</p><p></p><h1><a name="uh-55" href="http://www.grymoire.com/Unix/
Sed.html#toc-uh-55">Hold with h or H</a></h1><p>The
<p><br>Click here to get file: <a href="http://www.grymoire.com/Unix/
Scripts/grep3a.sh">grep3a.sh</a><br>
</p><p></p><h1><a name="uh-56" href="http://www.grymoire.com/Unix/
Sed.html#toc-uh-56">Keeping more than one line in the hold buffer</a></
h1><p>The
<p><br>Click here to get file: <a href="http://www.grymoire.com/Unix/
Scripts/grep_previous.sh">grep_previous.sh</a><br>
....

And I want to get the "http:// .........", without the quotes.
Is it possible to do it with sed? Do I need awk?
Help/hints are appreciated.

Cheers,
Tom.
From: Dave B on
tomasorti wrote:

> Hi.
> I was wondering if it would be possible to filter the href from an HTML
> file with sed.
>
> For example: if a file has lots of lines like these:
>
> ....
> </p><p></p><h1><a name="uh-55" href="http://www.grymoire.com/Unix/
> Sed.html#toc-uh-55">Hold with h or H</a></h1><p>The
> <p><br>Click here to get file: <a href="http://www.grymoire.com/Unix/
> Scripts/grep3a.sh">grep3a.sh</a><br>
> </p><p></p><h1><a name="uh-56" href="http://www.grymoire.com/Unix/
> Sed.html#toc-uh-56">Keeping more than one line in the hold buffer</a></
> h1><p>The
> <p><br>Click here to get file: <a href="http://www.grymoire.com/Unix/
> Scripts/grep_previous.sh">grep_previous.sh</a><br>
> ...
>
> And I want to get the "http:// .........", without the quotes.
> Is it possible to do it with sed? Do I need awk?
> Help/hints are appreciated.

Processing markup data with awk or sed can be difficult if the format of the
input file isn't known in advance. You don't say if your input above is all
a single line, or represents many lines of data.

If you have a single link per input line, then you can do this with sed:

sed 's/.*href="\([^"]*\)".*/\1/' file.html

(assuming that the string you're interested in is the one between double
quotes following the href=)
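
To see how that command treats lines with zero, one, or two links, here is a minimal sketch. The temporary file and the example.com URLs are made up for illustration; they are not your data.

```shell
# Made-up test input: one line without a link, one with a single link,
# and one with two links.
f=$(mktemp)
cat > "$f" <<'EOF'
<p>no link here</p>
<a href="http://example.com/a.html">one</a>
<a href="http://example.com/b.html">b</a> <a href="http://example.com/c.html">c</a>
EOF

# Lines without href= pass through unchanged. Because the leading .* is
# greedy, only the LAST href on a line survives.
out=$(sed 's/.*href="\([^"]*\)".*/\1/' "$f")
printf '%s\n' "$out"
rm -f "$f"
```

This prints the first line untouched, then a.html, then only c.html for the two-link line, which is why the "single link per input line" assumption matters.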

With GNU awk, you can do something like this:

awk -v RS='http://[^"]+' '{print RT}' file.html

which, although far from perfect, should work in most cases (beware that it
prints a blank line at the end). Keep in mind that it has several problems,
the first being that if http://something appears anywhere outside an href
attribute (in ordinary text or in another tag, for instance), that, and
everything that follows up to the next ", will be reported as a link.
However, that might work acceptably with your input.
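
As a rough illustration of both caveats, the sketch below runs a slightly guarded variant on made-up input. It assumes GNU awk is installed as gawk (RT is a gawk extension); the file contents and example.com URLs are hypothetical.

```shell
# Made-up input: one ordinary link, then a line where http:// also
# appears in plain text before a real link.
f=$(mktemp)
cat > "$f" <<'EOF'
<a href="http://example.com/a.html">one</a>
<p>Visit http://example.com/plain or <a href="http://example.com/b.html">b</a></p>
EOF

if command -v gawk >/dev/null 2>&1; then
    # RT holds the text that matched RS. The last record has no
    # terminator, so RT is empty there; testing RT != "" drops the
    # trailing blank line.
    out=$(gawk -v RS='http://[^"]+' 'RT != "" { print RT }' "$f")
    printf '%s\n' "$out"
else
    out=''
    echo 'gawk not found, skipping' >&2
fi
rm -f "$f"
```

The second line of output is `http://example.com/plain or <a href=`, because the match runs from the plain-text URL to the next double quote. The real link after it is still reported on its own line.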

--
echo 0|sed 's909=oO#3u)o19;s0#0ooo)].O0;s()(0bu}=(;s#}#.1m"?0^2{#;
s)")9v2@3%"9$);so%op]t(p$e#!o;sz(z^+.z;su+ur!z"au;sxzxd?_{h)cx;:b;
s/\(\(.\).\)\(\(..\)*\)\(\(.\).\)\(\(..\)*#.*\6.*\2.*\)/\5\3\1\7/;
tb'|awk '{while((i+=2)<=length($1)-18)a=a substr($1,i,1);print a}'
From: Dave B on
Dave B wrote:

> If you have a single link per input line, then you can do this with sed:
>
> sed 's/.*href="\([^"]*\)".*/\1/' file.html

If you have *at most* one link per input line, this is better:

sed -n 's/.*href="\([^"]*\)".*/\1/p' file.html
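
A quick way to check the difference, as a sketch on made-up input (the file contents and the example.com URL are only for illustration):

```shell
# Made-up input: only the middle line contains a link.
f=$(mktemp)
cat > "$f" <<'EOF'
<p>no link here</p>
<a href="http://example.com/a.html">one</a>
<br>
EOF

# -n turns off automatic printing; the p flag prints a line only when
# the substitution succeeded, so lines without href= are dropped.
with_p=$(sed -n 's/.*href="\([^"]*\)".*/\1/p' "$f")
printf '%s\n' "$with_p"
rm -f "$f"
```

Only http://example.com/a.html is printed; without -n and p, the other two lines would come through unchanged.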

> With GNU awk, you can do something like this:
>
> awk -v RS='http://[^"]+' '{print RT}' file.html

A similar result (with the same problems) can be obtained with GNU grep,
provided the links don't contain newlines (the data you posted appears to
have them, but maybe that was just line wrapping):

grep -o 'http://[^"]\{1,\}' file.html
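
If several links can share a line, or if http:// also shows up in plain text, one possible variation is to match the whole href="..." attribute and then strip the wrapper with sed. This is only a sketch: it assumes a grep that supports -o (GNU and BSD grep do, but it is not POSIX), and the input below is made up.

```shell
# Made-up input: a plain-text URL that is not a link, then two links
# on a single line.
f=$(mktemp)
cat > "$f" <<'EOF'
<p>plain http://example.com/not-a-link text</p>
<a href="http://example.com/a.html">a</a> <a href="http://example.com/b.html">b</a>
EOF

# grep -o prints each href="..." match on its own line; sed then
# removes the leading href=" and the trailing quote.
links=$(grep -o 'href="[^"]*"' "$f" | sed 's/^href="//; s/"$//')
printf '%s\n' "$links"
rm -f "$f"
```

This prints a.html and b.html and ignores the plain-text URL. It also picks up relative links such as href="#top"; use 'href="http://[^"]*"' as the pattern if only absolute http links are wanted.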

From: tomasorti on
Great! Thank you very much!
Yes, I meant 1 link per line, and no newlines between them.
It seems that pasting them in the post broke them.

You are the man!
Thanks again.

On Jun 16, 11:17 pm, Dave B <da...(a)addr.invalid> wrote:
> Dave B wrote:
> > If you have a single link per input line, then you can do this with sed:
>
> > sed 's/.*href="\([^"]*\)".*/\1/' file.html
>
> If you have *at most* one link per input line, this is better:
>
> sed -n 's/.*href="\([^"]*\)".*/\1/p' file.html
>
> > With GNU awk, you can do something like this:
>
> > awk -v RS='http://[^"]+' '{print RT}' file.html
>
> A similar result (with the same problems) can be obtained with GNU grep,
> provided the links don't contain newlines (the data you posted appears to
> have them, but maybe that was just line wrapping):
>
> grep -o 'http://[^"]\{1,\}' file.html