From: Andreas Marschke on
> Wherever you build pipelines of cut, head, tail, sed, grep, tr, etc.,
> use (e.g.) awk(1) instead; it's "available on every possible machine",
> being standard on Unix, and available even for WinDOS if you like.
> Another option, if you're not repelled by its syntax, is perl (it's
> non-standard on Unixes, but generally available as well).
>
> Janis

TBH I haven't taken the time yet to look into awk, but I'm trying to
learn perl alongside my current work on C++ applications.

So, yes, I will have a look and see what I can do with your tool of
choice. Thanks!
From: mop2 on
On Mon, 22 Feb 2010 11:06:09 -0300, mop2 <invalid(a)mail.address>
wrote:

> On Mon, 22 Feb 2010 07:43:38 -0300, Andreas Marschke
> <xxtjaxx(a)gmail.com> wrote:

>> To start it off, here is a simple bash script scraping the daily
>> JARGON off the website for the new hacker's dictionary:
>>
>> |+-+-+-+-+-+--+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+|
>> #!/bin/bash
>>
>> wget http://www.jargon.net/ -O- 2>/dev/null \
>>   | grep '<A HREF="/jargonfile/[a-z]/[a-zA-Z0-9]*\.html">[a-zA-Z0-9]*</A>' \
>>   | sed 's:\(<[a-zA-Z0-9]*>\|</[a-zA-Z0-9]*>\|<A HREF="/[a-zA-Z0-9]*/[a-z]/[a-zA-Z0-9]*\.html">\|<[a-z]*>\|</[a-z]*>\)::g' \
>>   | sed 's/  */ /g'
>> |+-+-+-+-+-+--+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+|


>
> An alternative, as a single mono-spaced line, specific to that site:
>
> echo `wget -qO- http://www.jargon.net/|grep HR|sed 's/<[^>]*>//g'`
>
> For fragments of web pages, I think three generic functions would be
> convenient:
> f1 - get the page
> f2 - filter out the desired fragment
> f3 - remove the HTML tags and display the result as text, monospaced,
>      honoring newlines and, perhaps, bold tags
>
> Or perhaps one function with three parameters.
>


With this I can see the fragment of the page that Andreas suggested
as an example:

wget -qO- http://www.jargon.net/ | grep -A99 '^</sc' | grep -B99 -m1 '^<img'

It is a bit larger than the final desired target, and is intended as
the source for the question below.

If I define point A as the start and point B as the end in the stream:

A='<br />
</center>
<p>
<font size="+1">'

B="</font></p><center>
You're Visitor"

What would you suggest to get all the text between these two points,
but without the points themselves, and without the HTML tags?

I don't see anything generic, practical, and elegant.
Can someone help?
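
For what it's worth, the closest I have come so far is a sketch built
on bash parameter expansion, using the A and B just defined; the
quoted markers are taken literally, embedded newlines and all
(untested against the live page):

page=$(wget -qO- http://www.jargon.net/)    # inner newlines are kept
frag=${page#*"$A"}     # drop everything up to and including marker A
frag=${frag%%"$B"*}    # drop marker B and everything after it
printf '%s\n' "$frag" | sed 's/<[^>]*>//g'  # strip the remaining tags

Whether that counts as elegant is another question.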
From: Ivan Shmakov on
>>>>> "AM" == Andreas Marschke <xxtjaxx(a)gmail.com> writes:

[...]

AM> To start it off, here is a simple bash script scraping the daily
AM> JARGON off the website for the new hacker's dictionary:

AM> #!/bin/bash

AM> wget http://www.jargon.net/ -O- 2>/dev/null

[...]

Funnily enough, I have a similar script to fetch OISSTv2 [1]
data from an FTP server. It goes like this:

#!/bin/bash

p=~/public/hist/logs/download/ftp.emc.ncep.noaa.gov-$(date +%s).

## NB: here, we parse the Squid caching proxy output, not the FTP
## server's one (as the latter isn't going to be HTML.)
wget -qO - \
     --force-directories --timestamping \
     ftp://ftp.emc.ncep.noaa.gov/cmb/sst/oisst_v2/ \
    | sed -ne '\,.*<'A' HREF="\([^"]\+\)">[^"<>]*</A>/</H2>$, {
          ## directory heading: stash the directory path in the hold space
          s,,ftp.emc.ncep.noaa.gov\1, ;
          h ;
      } ;
      \,^<'A' HREF="\([^/"]\+\)">.*, {
          ## file link: prepend the stashed directory and print
          s//\1/ ; G ; s/\(.*\)\n\(.*\)/\2\1/ ; p ;
      }' \
    | grep -E -- '\<oisst\.[[:digit:]]*' \
    | LC_ALL=C sort -r \
    | (while read f ; do test -e "$f" || echo ftp://"$f" ; done) \
    > "$p"in

## NB: beware of the race here
LC_ALL=C wget -b -6 --quota=256M \
    --server-response \
    -i "$p"in -o "$p"out
sleep 1s
chmod =r -- "$p"{in,out}

But perhaps it should read instead:

....
## NB: race is still possible here (though a bit less likely)
exec > "$p"out
chmod =r -- "$p"out
## ... or should we just umask before exec instead?
LC_ALL=C wget -b -6 --quota=256M \
    --server-response \
    -i "$p"in -o /dev/stdout

As for the race condition, it's avoided by virtue of the fact
that this script isn't usually run in parallel at all.
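
For what it's worth, here's a sketch of the umask variant wondered
about above (my assumption, not the author's tested code): tightening
the umask before anything is created leaves no window in which the
logs are readable by others.  Note that -b is dropped here, so the
final chmod only runs once wget has finished.

umask 077                  # set near the top, before "$p"in is created:
                           # new files get mode 0600 (rw-------)
exec > "$p"out             # the log is private from the moment it exists
LC_ALL=C wget --quota=256M \
    --server-response \
    -i "$p"in -o /dev/stdout
chmod =r -- "$p"{in,out}   # relax to read-only after the download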

[1] http://www.esrl.noaa.gov/psd/data/gridded/data.noaa.oisst.v2.html

--
FSF associate member #7257
From: Ivan Shmakov on
>>>>> Ivan Shmakov <ivan(a)main.uusia.org> writes:

[...]

Oops, a silly mistake here.

## NB: here, we parse the Squid caching proxy output, not the FTP
## server's one (as the latter isn't going to be HTML.)
wget -qO - \
-    --force-directories --timestamping \
     ftp://ftp.emc.ncep.noaa.gov/cmb/sst/oisst_v2/ \

[...]

## NB: beware of the race here
LC_ALL=C wget -b -6 --quota=256M \
+    --force-directories --timestamping \
     --server-response \
     -i "$p"in -o "$p"out

[...]

Also note that the line below (and the second wget invocation
as well) implies that the script should be run from the directory
under which the retrieved data is stored.

> | (while read f ; do test -e "$f" || echo ftp://"$f" ; done) \
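
To make that explicit at the top of the script, something like the
following could be used (the path is made up, purely for illustration):

## pin the working directory so the relative paths above resolve
## consistently ("~/public/hist/data" is a hypothetical location)
cd ~/public/hist/data || exit 1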

--
FSF associate member #7257
From: mop2 on
On Mon, 22 Feb 2010 20:33:33 -0300, mop2 <invalid(a)mail.address>
wrote:

> On Mon, 22 Feb 2010 11:06:09 -0300, mop2 <invalid(a)mail.address>
> wrote:
>
>> On Mon, 22 Feb 2010 07:43:38 -0300, Andreas Marschke
>> <xxtjaxx(a)gmail.com> wrote:
>
>>> [...]
>
>> [...]
>
> With this I can see the fragment of the page that Andreas suggested
> as an example:
>
> wget -qO- http://www.jargon.net/ | grep -A99 '^</sc' | grep -B99 -m1 '^<img'
>
> It is a bit larger than the final desired target, and is intended as
> the source for the question below.
>
> If I define point A as the start and point B as the end in the stream:
>
> A='<br />
> </center>
> <p>
> <font size="+1">'
>
> B="</font></p><center>
> You're Visitor"
>
> What would you suggest to get all the text between these two points,
> but without the points themselves, and without the HTML tags?
>
> I don't see anything generic, practical, and elegant.
> Can someone help?


Without newlines in the marks, no problem (well, elegance??):

$ cat g
#!/bin/bash

wg()
{
    wget -qO- "$1" |
    tr -s '\n\t' ' ' |
    sed "s/> </></g;s|.*$2||;s|$3.*||;s/<[^>]*>//g" |
    fmt -w 78
}

case "$1" in
    jargon) wg http://www.jargon.net/ \
                ' 1995.<br /></center><p><font size="+1">' \
                "You're Visitor " ;;
esac


$ ./g jargon
flat-file /adj./ A flattened representation of some database or tree or
network structure as a single file from which the structure could
implicitly be rebuilt, esp. one in flat-ASCII form. See also sharchive.
$


However, newlines are important, particularly when tags like "<pre>"
are present, or when there are repetitive marks, where just one more
newline can make all the difference.
The search continues...

Sorry, I have a special interest in this kind of "toy". :)
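
One direction I may try next, following Janis's awk suggestion from
earlier in the thread, is a tag stripper that keeps the line structure
by turning break-implying tags into real newlines before deleting the
rest (a rough, untested sketch, not a finished tool):

wget -qO- http://www.jargon.net/ |
awk '{
    gsub(/<[Bb][Rr][^>]*>/, "\n")   # <br> implies a line break
    gsub(/<\/[Pp]>/, "\n\n")        # </p> implies a paragraph break
    gsub(/<[^>]*>/, "")             # strip the remaining tags
    print
}'

Tags split across input lines would still slip through, so a real
version would probably need to join the input first, as wg does.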