From: Ivan Shmakov on
IIRC, some time ago someone asked here about a better way
to extract specific data from HTML files. Having learned the
basics of the XSLT 1.0 language almost two years ago, I cannot
help feeling that XSLT is such a way.

Consider, e.g.:

$ xsltproc --html href.xsl \
http://en.wikipedia.org/wiki/
#column-one
#searchInput
/wiki/Wikipedia

http://wikimediafoundation.org/wiki/Privacy_policy
/wiki/Wikipedia:About
/wiki/Wikipedia:General_disclaimer
$

The XSLT code is as follows:

$ cat href.xsl
<?xml version="1.0"?> <!-- -*- XML -*- -->
<!-- href.xsl &mdash; Extract the payload of <a href="" /> -->
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">

  <xsl:output method="text" />

  <xsl:template match="*">
    <!--
    <xsl:message>* Processing <xsl:value-of select="local-name (.)" />
    </xsl:message>
    -->
    <xsl:apply-templates />
  </xsl:template>

  <xsl:template match="a">
    <!--
    <xsl:message>* Processing [a] <xsl:value-of select="local-name (.)" />
    </xsl:message>
    -->
    <xsl:apply-templates select="@href" />
  </xsl:template>

  <xsl:template match="a/@href">
    <xsl:value-of select="." />
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

  <xsl:template match="@*|text()|comment()">
    <!-- do nothing -->
  </xsl:template>

</xsl:stylesheet>
<!-- href.xsl ends here -->
$

(Sort of Awk-ish, isn't it?)

--
FSF associate member #7257

From: pk on
Ivan Shmakov wrote:

> </xsl:stylesheet>
> <!-- href.xsl ends here -->
> $
>
> (Sort of Awk-ish, isn't it?)

You may want to look into xmlgawk, in case you don't know it already.

From: Thomas 'PointedEars' Lahn on
Ivan Shmakov wrote:

> IIRC, some time ago someone asked here about a better way
> to extract specific data from HTML files. Having learned the
> basics of the XSLT 1.0 language almost two years ago, I cannot
> help feeling that XSLT is such a way.

HTML is not necessarily well-formed, so generally you cannot apply XSLT to
it. You can try to convert it to XHTML with e.g. htmltidy(1), and if you
are lucky you can apply XSLT to the result (BTDT), or you can transform
XML/XHTML to HTML with XSLT.
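
For comparison, here is a minimal sketch of pulling the same href values out of markup that is not well-formed, without converting it first. It is in Python rather than XSLT and uses only the standard library's html.parser; the class name, function name and sample markup are made up for illustration and appear nowhere in this thread.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values of <a> elements, tolerating tag soup."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser hands over tag and attribute names in lower case.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(markup):
    """Return the href values of all <a> elements, in document order."""
    parser = HrefCollector()
    parser.feed(markup)
    parser.close()
    return parser.hrefs


# Hypothetical sample: unclosed <p> and <li>, upper-case tag,
# unquoted attribute value. None of this is well-formed XML.
SAMPLE = """<html><body>
<p>See <a href="/wiki/Wikipedia">Wikipedia</a>
<ul><li><A HREF=#searchInput>search
<li><a name="top">no link here</a>
</ul></body></html>"""

if __name__ == "__main__":
    for href in extract_hrefs(SAMPLE):
        print(href)
```

This is an event-driven scan, not a tree transformation, so it only suits simple extractions like this one.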

> Consider, e.g.:
>
> $ xsltproc --html href.xsl \
> http://en.wikipedia.org/wiki/

This works by coincidence because the referenced document is written
in valid XHTML, not HTML. However, to extract specific data from
markup documents you would use XPath directly; XSLT using XPath is a
possibility (and a less efficient one at that), but not a necessity.
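
To illustrate "XPath directly", here is a rough sketch in Python, assuming the input really is well-formed XHTML. It uses the standard library's xml.etree.ElementTree, which implements only a limited subset of XPath; the sample document and the function name are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed XHTML sample (not the real Wikipedia page).
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div id="column-one"><a href="#column-one">top</a></div>
    <p><a href="/wiki/Wikipedia">Wikipedia</a> <a name="anchor">no href</a></p>
  </body>
</html>"""

# XHTML elements live in a namespace, so the path needs a prefix mapping.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}


def extract_hrefs(xhtml):
    """Return href values of all <a> elements that carry one."""
    root = ET.fromstring(xhtml)
    # ".//x:a[@href]" selects every descendant <a> with an href attribute.
    return [a.get("href") for a in root.findall(".//x:a[@href]", XHTML_NS)]


if __name__ == "__main__":
    for href in extract_hrefs(SAMPLE):
        print(href)
```

The single path expression does the work of the three templates in the stylesheet, but ET.fromstring raises a ParseError as soon as the input is not well-formed.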

What does this have to do with *x shells anyway?

> (Sort of Awk-ish, isn't it?)

Yes, like PHP 5 is sort of C++-ish. (Was that your question?)


PointedEars