From: Johann Spies on
I am trying to get CSV output from an HTML file.

With this code I had a little success:
=========================
from BeautifulSoup import BeautifulSoup
from string import replace, join
import re

f = open("configuration.html","r")
g = open("configuration.csv",'w')
soup = BeautifulSoup(f)
t = soup.findAll('table')
for table in t:
    rows = table.findAll('tr')
    for th in rows[0]:
        t = th.find(text=True)
        g.write(t)
        g.write(',')
    # print(','.join(t))

    for tr in rows:
        cols = tr.findAll('td')
        for td in cols:
            try:
                t = td.find(text=True).replace(' ','')
                g.write(t)
            except:
                g.write ('')
            g.write(",")
        g.write("\n")
===============================

producing output like this:

RULE,SOURCE,DESTINATION,SERVICES,ACTION,TRACK,TIME,INSTALL ON,COMMENTS,
1,,,,drop,Log,Any,,,
2,All Users(a)Any,,Any,clientencrypt,Log,Any,,,
3,Any,Any,,drop,None,Any,,,
4,,,,drop,None,Any,,,
....

It left out all the non-plaintext parts of the <td></td> cells.

I then tried using t.renderContents and got something like this (one line
broken into many for the sake of this email):

1,<img src=icons/group.png>&nbsp;<a href=#OBJ_sunetint>
sunetint</A><BR>,
<img src=icons/gateway_cluster.png>&nbsp;<a href=#OBJ_Rainwall_Cluster
>Rainwall_Cluster</A> <BR>,
<img src=icons/udp.png>&nbsp;<a href=#SVC_IKE >IKE</a><br>,
<img src=icons/drop.png>&nbsp;drop,
<img src=icons/log.png>&nbsp;Log&nbsp;,
<img src=icons/any.png>&nbsp;Any<br>&nbsp;,
<img src=icons/gateway_cluster.png>&nbsp;<a href=#OBJ_Rainwall_Cluster
>Rainwall_Cluster</A> <BR>&nbsp;,&nbsp;

How do I get Beautifulsoup to render (taking the above line as
example)

sunetint for <img src=icons/group.png>&nbsp;<a
href=#OBJ_sunetint>sunetint</A><BR>

and still provide the text-parts in the <td>'s with plain text?

I have experimented a little bit with regular expressions, but could
so far not find a solution.
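
For reference, a minimal regular-expression sketch of that attempt, in Python 3 and using only the standard library. It assumes each cell's markup is already available as a string; the names cell_text and TAG are made up for illustration. Stripping tags with a regex is fragile on general HTML (comments, scripts, a ">" inside an attribute value), so treat this as a rough illustration of the idea rather than a robust solution.

```python
import html
import re

# Anything between "<" and ">" is treated as a tag: a deliberately crude rule.
TAG = re.compile(r"<[^>]*>")


def cell_text(fragment):
    """Strip tags from one cell's markup and tidy the remaining text."""
    text = TAG.sub(" ", fragment)      # drop the tags, keep what lies between them
    text = html.unescape(text)         # &nbsp; becomes "\xa0"
    text = text.replace("\xa0", " ")   # treat non-breaking spaces as plain spaces
    return " ".join(text.split())      # collapse runs of whitespace and newlines


fragment = "<img src=icons/group.png>&nbsp;<a href=#OBJ_sunetint>sunetint</A><BR>"
print(cell_text(fragment))
```

A cell that holds plain text after the icon, such as "<img src=icons/log.png>&nbsp;Log&nbsp;", goes through the same function unchanged.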

Regards
Johann
--
Johann Spies Telefoon: 021-808 4599
Informasietegnologie, Universiteit van Stellenbosch

"Lo, children are an heritage of the LORD: and the
fruit of the womb is his reward." Psalms 127:3
From: Gabriel Genellina on
On Thu, 10 Dec 2009 06:15:19 -0300, Johann Spies <jspies(a)sun.ac.za>
wrote:

> How do I get Beautifulsoup to render (taking the above line as
> example)
>
> sunetint for <img src=icons/group.png>&nbsp;<a
> href=#OBJ_sunetint>sunetint</A><BR>
>
> and still provide the text-parts in the <td>'s with plain text?

Hard to tell if we don't see what's inside those <td>'s - please provide
at least a few rows of the original HTML table.

--
Gabriel Genellina

From: Johann Spies on
Gabriel Genellina wrote:
> On Thu, 10 Dec 2009 06:15:19 -0300, Johann Spies <jspies(a)sun.ac.za>
> wrote:
>
>> How do I get Beautifulsoup to render (taking the above line as
>> example)
>>
>> sunetint for <img src=icons/group.png>&nbsp;<a
>> href=#OBJ_sunetint>sunetint</A><BR>
>>
>> and still provide the text-parts in the <td>'s with plain text?
>
> Hard to tell if we don't see what's inside those <td>'s - please
> provide at least a few rows of the original HTML table.
>
Thanks for your reply.

Here are a few lines:

<!------- Rule 1 ------->
<tr style="background-color: #ffffff"><td class=normal>2</td><td><img
src=icons/usrgroup.png>&nbsp;All Users(a)Any<br><td><im$
</td><td><img src=icons/any.png>&nbsp;Any<br></td><td><img
src=icons/clientencrypt.png>&nbsp;clientencrypt</td><td><img src$
&nbsp;</td><td>&nbsp;</td></tr>

<!------- Rule 2 ------->
<tr style="background-color: #eeeeee"><td class=normal>3</td><td><img
src=icons/any.png>&nbsp;Any<br><td><img src=icons/any$
&nbsp;</td><td>&nbsp;</td></tr>

<!------- Rule 3 ------->
<tr style="background-color: #ffffff"><td class=normal>4</td><td><img
src=icons/group.png>&nbsp;<a href=#OBJ_Rainwall_Group$
<td><img src=icons/group.png>&nbsp;<a href=#OBJ_Rainwall_Group
>Rainwall_Group</A> <BR>
</td><td><img src=icons/udp.png>&nbsp;<a href=#SVC_RainWall_Stop
>RainWall_Stop</a><br></td><td><img src=icons/drop.png>&nb$
&nbsp;</td><td>&nbsp;</td></tr>

<!------- Rule 4 ------->
<tr style="background-color: #eeeeee"><td class=normal>5</td><td><img
src=icons/host.png>&nbsp;<a href=#OBJ_Rainwall_Broadc$
<img src=icons/group.png>&nbsp;<a href=#OBJ_Rainwall_Group
>Rainwall_Group</A> <BR>
<td><img src=icons/group.png>&nbsp;<a href=#OBJ_Rainwall_Group
>Rainwall_Group</A> <BR>
<img src=icons/host.png>&nbsp;<a href=#OBJ_Rainwall_Broadcast
>Rainwall_Broadcast</A> <BR>
</td><td><img src=icons/udp.png>&nbsp;<a href=#SVC_RainWall_Daemon
>RainWall_Daemon</a><br></td><td><img src=icons/accept.p$
&nbsp;</td><td>&nbsp;</td></tr>

Regards
Johann

From: Gabriel Genellina on
On Fri, 11 Dec 2009 04:04:38 -0300, Johann Spies <jspies(a)sun.ac.za>
wrote:

> Gabriel Genellina wrote:
>> On Thu, 10 Dec 2009 06:15:19 -0300, Johann Spies <jspies(a)sun.ac.za>
>> wrote:
>>
>>> How do I get Beautifulsoup to render (taking the above line as
>>> example)
>>>
>>> sunetint for <img src=icons/group.png>&nbsp;<a
>>> href=#OBJ_sunetint>sunetint</A><BR>
>>>
>>> and still provide the text-parts in the <td>'s with plain text?
>>
>> Hard to tell if we don't see what's inside those <td>'s - please
>> provide at least a few rows of the original HTML table.
>>
> Thanks for your reply. Here are a few lines:
>
> <!------- Rule 1 ------->
> <tr style="background-color: #ffffff"><td class=normal>2</td><td><img
> src=icons/usrgroup.png>&nbsp;All Users(a)Any<br><td><im$
> </td><td><img src=icons/any.png>&nbsp;Any<br></td><td><img
> src=icons/clientencrypt.png>&nbsp;clientencrypt</td><td><img src$
> &nbsp;</td><td>&nbsp;</td></tr>

I *think* I finally understand what you want (your previous example above
confused me).
If you want Rule 1 to generate a line like this:

2,All Users(a)Any,<im$,Any,clientencrypt,,

this code should serve as a starting point:

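from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, as in the original post
import csv

# html must hold the page source; reading configuration.html is an assumption
html = open("configuration.html").read()

lines = []
soup = BeautifulSoup(html)
for table in soup.findAll("table"):
    for row in table.findAll("tr"):
        line = []
        for cell in row.findAll("td"):
            # join every text node in the cell, not only the first one
            text = ' '.join(
                s.replace('\n',' ').replace('&nbsp;',' ')
                for s in cell.findAll(text=True)).strip()
            line.append(text)
        lines.append(line)

# "wb" is the mode the Python 2 csv module expects
with open("output.csv","wb") as f:
    writer = csv.writer(f)
    writer.writerows(lines)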

cell.findAll(text=True) returns a list of all text nodes inside a <td>
cell; I replace every \n and &nbsp; in each text node with a space, and
join them all. lines is a list of rows, each row a list with one entry per
cell, as expected by the csv module used to write the output file.
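
For readers on Python 3 without BeautifulSoup installed, the same approach (collect every text node inside each <td>, join them, hand the rows to csv) can be sketched with only the standard library. This is a swapped-in technique, not the code above: TableTextParser and the sample string are made up for illustration, nested tables are not handled, and a missing </td> is tolerated by closing the cell when the next <td> or the end of the row arrives.

```python
import csv
import io
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # &nbsp; arrives as "\xa0"
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the open cell, or None

    def _close_cell(self):
        if self._cell is not None:
            text = " ".join(self._cell).replace("\xa0", " ")
            self._row.append(" ".join(text.split()))  # collapse whitespace
            self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()
            self._row = []
        elif tag == "td" and self._row is not None:
            self._close_cell()  # tolerate a missing </td>
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td":
            self._close_cell()
        elif tag == "tr":
            self._close_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


# Hypothetical sample in the style of the rows posted earlier in the thread.
sample = (
    "<table><tr><td class=normal>4</td>"
    "<td><img src=icons/group.png>&nbsp;<a href=#OBJ_sunetint>sunetint</A><BR>"
    "<td><img src=icons/drop.png>&nbsp;drop</td>"
    "<td>&nbsp;</td></tr></table>"
)

parser = TableTextParser()
parser.feed(sample)
parser.close()

buffer = io.StringIO()  # stands in for open("output.csv", "w", newline="")
csv.writer(buffer).writerows(parser.rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Letting csv.writer produce the line also means a cell that happens to contain a comma is quoted correctly, which writing the commas by hand does not do.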

--
Gabriel Genellina

From: Johann Spies on
On Sun, Dec 13, 2009 at 07:58:55AM -0300, Gabriel Genellina wrote:

> this code should serve as a starting point:

Thank you very much!

> cell.findAll(text=True) returns a list of all text nodes inside a
> <td> cell; I preprocess all \n and &nbsp; in each text node, and
> join them all. lines is a list of lists (each entry one cell), as
> expected by the csv module used to write the output file.

I have struggled a bit to find the documentation for (text=True).
Most of the documentation for BeautifulSoup that I saw contained
examples without explaining what the options do. Thanks for your
explanation.

As far as I can see, no documentation was installed with the
Debian package.

Regards
Johann
--
Johann Spies Telefoon: 021-808 4599
Informasietegnologie, Universiteit van Stellenbosch

"But I will hope continually, and will yet praise thee
more and more." Psalms 71:14