Elegant Solution to a Seemingly Simple Problem? [Ruby]


From: Derek Cannon on 18 Apr 2010 06:17

Sure, I'll post HTML examples. In this non-simplified version, there are
20 columns per row which are:

availability, course_reference_number, subject, course_number, section,
campus, credit_hours, title, days, time, cap, registered, remaining,
xl_cap, xl_registered, xl_remaining, professor, date, location,
attributes
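
One way to keep 20 positional cells readable is to pair each cell with its column name. This is only a sketch: the COLUMNS constant and the course hash are illustrative names, not part of the original script, and the cell values are copied from the ACCT 2101 sample row in this post.

```ruby
# Column names in page order, as listed above (COLUMNS is a hypothetical name).
COLUMNS = %i[
  availability course_reference_number subject course_number section
  campus credit_hours title days time cap registered remaining
  xl_cap xl_registered xl_remaining professor date location attributes
]

# Stripped cell text for one row, as the nested collect would return it.
cells = ["NR", "80983", "ACCT", "2101", "01", "A", "3.000",
         "Intro to Financial Accounting", "MW", "08:00 am-09:15 am",
         "30", "0", "30", "0", "0", "0", "TBA", "08/23-12/09", "A 1880", ""]

# Pair names with values so fields can be read by name instead of index.
course = Hash[COLUMNS.zip(cells)]

puts course[:title]
puts course[:days]
```

Note that this only lines up when a row really has 20 cells; the TBA row with colspan="2" has 19 and would need padding first.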

Here are three specific examples of the HTML that cover all the
possibilities (normal class, course with TBA day, and labs):



<TR>
<TD class="dddefault"><ABBR title="Not available for
registration">NR</ABBR></TD>
<TD class="dddefault"><A
href="https://ggc.gabest.usg.edu/pls/B400/bwckschd.p_disp_listcrse?term_in=201008&subj_in=ACCT&crse_in=2101&crn_in=80983"
onmouseover="window.status='Detail'; return true"
onfocus="window.status='Detail'; return true"
onmouseout="window.status=''; return true"
onblur="window.status=''; return true">80983</A></TD>
<TD class="dddefault">ACCT</TD>
<TD class="dddefault">2101</TD>
<TD class="dddefault">01</TD>
<TD class="dddefault">A</TD>
<TD class="dddefault">3.000</TD>
<TD class="dddefault">Intro to Financial Accounting</TD>
<TD class="dddefault">MW</TD>
<TD class="dddefault">08:00 am-09:15 am</TD>
<TD class="dddefault">30</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">30</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault"><ABBR title="To Be Announced">TBA</ABBR></TD>
<TD class="dddefault">08/23-12/09</TD>
<TD class="dddefault">A 1880</TD>
<TD class="dddefault"> </TD>
</TR>
<TR>



<TR>
<TD class="dddefault"><ABBR title="Closed">C</ABBR></TD>
<TD class="dddefault"><A
href="https://ggc.gabest.usg.edu/pls/B400/bwckschd.p_disp_listcrse?term_in=201008&subj_in=BUSA&crse_in=4700&crn_in=81085"
onmouseover="window.status='Detail'; return true"
onfocus="window.status='Detail'; return true"
onmouseout="window.status=''; return true"
onblur="window.status=''; return true">81085</A></TD>
<TD class="dddefault">BUSA</TD>
<TD class="dddefault">4700</TD>
<TD class="dddefault">01</TD>
<TD class="dddefault">A</TD>
<TD class="dddefault">3.000</TD>
<TD class="dddefault">Selected Topics in Business</TD>
<TD colspan="2" class="dddefault"><ABBR title="To Be
Announced">TBA</ABBR></TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault"><ABBR title="To Be Announced">TBA</ABBR></TD>
<TD class="dddefault">08/23-12/09</TD>
<TD class="dddefault"><ABBR title="To Be Announced">TBA</ABBR></TD>
<TD class="dddefault"> </TD>
</TR>



<TR>
<TD class="dddefault"><ABBR title="Not available for
registration">NR</ABBR></TD>
<TD class="dddefault"><A
href="https://ggc.gabest.usg.edu/pls/B400/bwckschd.p_disp_listcrse?term_in=201008&subj_in=CHEM&crse_in=1151K&crn_in=80073"
onmouseover="window.status='Detail'; return true"
onfocus="window.status='Detail'; return true"
onmouseout="window.status=''; return true"
onblur="window.status=''; return true">80073</A></TD>
<TD class="dddefault">CHEM</TD>
<TD class="dddefault">1151K</TD>
<TD class="dddefault">01</TD>
<TD class="dddefault">A</TD>
<TD class="dddefault">4.000</TD>
<TD class="dddefault">Survey of Chemistry I w/Lab</TD>
<TD class="dddefault">MF</TD>
<TD class="dddefault">11:00 am-12:15 pm</TD>
<TD class="dddefault">20</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">20</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">0</TD>
<TD class="dddefault">David Pursell (<ABBR
title="Primary">P</ABBR>)</TD>
<TD class="dddefault">08/23-12/09</TD>
<TD class="dddefault">A 1400</TD>
<TD class="dddefault"> </TD>
</TR>
<TR>
<TD class="dddefault"> </TD>
<TD class="dddefault"> </TD>
<TD class="dddefault"> </TD>
<TD class="dddefault"> </TD>
<TD class="dddefault"> </TD>
<TD class="dddefault"> </TD>
<TD class="dddefault"> </TD>
<TD class="dddefault"> </TD>
<TD class="dddefault">W</TD>
<TD class="dddefault">11:00 am-01:45 pm</TD>
<TD class="dddefault"> </TD>
<TD class="dddefault"> </TD>
<TD class="dddefault"> </TD>
<TD class="dddefault"> </TD>
<TD class="dddefault"> </TD>
<TD class="dddefault"> </TD>
<TD class="dddefault">David Pursell (<ABBR
title="Primary">P</ABBR>)</TD>
<TD class="dddefault">08/23-12/09</TD>
<TD class="dddefault">A 1290</TD>
<TD class="dddefault"> </TD>
</TR>
--
Posted via http://www.ruby-forum.com/.

From: Derek Cannon on 18 Apr 2010 06:21

> There are lots of ways to identify more precisely which part of the HTML
> you want, using CSS selectors. Most easily, if the rows are inside
> <table id='courses'> then something like 'table#courses tr' could do it.

Since my original post, I've been playing around with the code some
more. I made a new way of getting courses that automatically filters out
"non-course" rows. The code is:

table = doc.css("tr").collect { |row|
  row.css(".dddefault").collect { |column|
    column.text.strip
  }
}

This way, "non-courses" appear as empty arrays. I still don't know how
to neatly get rid of the empty arrays... I tried .compact! but that
doesn't seem to work.

>doc = Nokogiri::HTML(open(url))
>raw_course_list = doc.css("tr").collect { |row|
> t_row = row.css("td").collect { |column| column.text.strip }
> t_row.insert(2, "") if (t_row[1] == "TBA")
>}.reject{ |i| i.size != 4 }

Excellent example, I think this is much better than what I had earlier.
I guess I could now replace your reject with i.empty?, right?

PS - I changed raw_course_list to table to make it more readable.

From: Ehsanul Hoque on 18 Apr 2010 07:42

> This way, "non-courses" appear as empty arrays. I still don't know how
> to neatly get rid of the empty arrays... I tried .compact! but that
> doesn't seem to work.

Try #flatten! instead. #compact! just gets rid of nil entries, and an empty array is not the same as nil.

- Ehsan
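
A caveat on #flatten!: it does remove the empty arrays, but it also merges every row into one flat list, so the row structure is lost. A minimal sketch of the difference, using plain arrays as a stand-in for the Nokogiri result (the sample values are made up for illustration):

```ruby
# Rows as the nested collect might return them; [] marks a non-course row.
table = [
  ["ACCT", "2101", "MW"],
  [],
  ["BUSA", "4700", "TBA"],
  []
]

# compact only removes nil, so the empty arrays survive.
compacted = table.compact

# flatten drops the empty arrays but also merges all rows into one list.
flattened = table.flatten

# reject(&:empty?) drops the empty rows and leaves the others intact.
cleaned = table.reject(&:empty?)

puts compacted.inspect
puts flattened.inspect
puts cleaned.inspect
```

If the goal is to keep one array per course, reject (or reject! to modify in place) with empty? is probably the closest fit.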


From: David A. Black on 18 Apr 2010 10:23

Hi --

On Sun, 18 Apr 2010, Derek Cannon wrote:

>> There are lots of ways to identify more precisely which part of the HTML
>> you want, using CSS selectors. Most easily, if the rows are inside
>> <table id='courses'> then something like 'table#courses tr' could do it.
>
> Since my original post, I've been playing around with the code some
> more. I made a new way of getting courses that automatically filters out
> "non-course" rows. The code is:
>
> table = doc.css("tr").collect { |row|
>   row.css(".dddefault").collect { |column|
>     column.text.strip
>   }
> }
>
> This way, "non-courses" appear as empty arrays. I still don't know how
> to neatly get rid of the empty arrays... I tried .compact! but that
> doesn't seem to work.
>
>> doc = Nokogiri::HTML(open(url))
>> raw_course_list = doc.css("tr").collect { |row|
>> t_row = row.css("td").collect { |column| column.text.strip }
>> t_row.insert(2, "") if (t_row[1] == "TBA")
>> }.reject{ |i| i.size != 4 }
>
> Excellent example, I think this is much better than what I had earlier.
> I guess I could now replace your reject with i.empty?, right?
>
> PS - I changed raw_course_list to table to make it more readable.

I think having the condition and the reject be the last things in the
code is going to make it hard to follow later. I wouldn't rule out
doing something a tiny bit more procedural but maybe a little easier
to parse visually, like this:

doc = Nokogiri::HTML(open(url))
table = []
doc.css("tr").each do |row|
  cells = row.css("td").map {|cell| cell.text.strip }
  # A TBA cell spans two columns, so pad the row before checking its size.
  cells.insert(2, "") if cells[1] == "TBA"
  # Skip rows that are not course rows.
  next unless cells.size == 4
  table << cells
end

You could also extract some methods, and end up with something like:

table = doc.css("tr").
  select {|row| valid_row?(row) }.
  map {|row| prepare_row(row) }

(The above is all untested.)
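
For illustration, here is one way those two extracted methods might look. This is a sketch under assumptions: valid_row? and prepare_row are hypothetical helpers, they take an array of already-stripped cell strings rather than a Nokogiri row, and the sample rows are made up to match the simplified four-column layout.

```ruby
EXPECTED_CELLS = 4 # row width in the simplified version of the table

# Hypothetical helper: pad a TBA row so it lines up with normal rows.
def prepare_row(cells)
  cells = cells.dup
  cells.insert(2, "") if cells[1] == "TBA"
  cells
end

# Hypothetical helper: a course row has the expected width once padded.
def valid_row?(cells)
  prepare_row(cells).size == EXPECTED_CELLS
end

rows = [
  ["Intro to Financial Accounting", "MW", "08:00 am-09:15 am", "TBA"],
  ["Selected Topics in Business", "TBA", "TBA"],
  [],
  ["Some header"]
]

table = rows.
  select { |cells| valid_row?(cells) }.
  map { |cells| prepare_row(cells) }

puts table.inspect
```

With Nokogiri, the same helpers could start by turning the row into cell text and then proceed as above.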

David

--
David A. Black, Senior Developer, Cyrus Innovation Inc.

THE Ruby training with Black/Brown/McAnally
COMPLEAT Coming to Chicago area, June 18-19, 2010!
RUBYIST http://www.compleatrubyist.com

From: Phrogz on 18 Apr 2010 13:49

On Apr 18, 1:23 am, Derek Cannon <novelltermina...(a)gmail.com> wrote:
> [...]
> Earlier, someone on the forum showed me a very elegant way to collect
> this information (I use Nokogiri). It was:
>
> doc = Nokogiri::HTML(open(url))
>
> raw_course_list = doc.css("tr").collect { |row|
> row.css("td").collect { |column|
> column.text.strip
> }
>
> }
> [...]
> This works perfectly, except in 3 main cases.
>
> *** Problem 1: The <tr> does not contain course information. (It's some
> irrelevant part of the HTML). In this case, I did the following:
> raw_course_data.reject! { |i| i.size != 4 }, which filtered out
> non-courses. Note: no tables without course data had the size of one
> with course data (in the non-simplified version, the size is actually
> much larger).
>
> So, already I think it's ugly coding! It first loads ALL <tr> contents
> into arrays, then rejects them after creation.
> [...]

Generalized, you have an array of values and you want to map a subset
of them to a new array. There are (at least) four patterns you can use
to handle this sort of situation:

1) Map the unwanted elements to a 'broken' value and then reject the
broken values later. (What you are doing now.) This can be hard if you
don't have a way of creating a broken value. For example, you might be
mapping all values directly to an object, but you don't have enough
information for the object constructor and no way of making up clearly
spurious values. Further, it's inefficient as you do the work and use
the memory of creating the object only to throw it out later.

2) Map the unwanted elements to nil and compact the array afterwards.
In your case, you'd need to look at the TDs in your row and decide if
you wanted to map the row to the mapping of them or nil. This is
convenient in terms of one-liners, but still slightly inefficient
because you're creating an intermediary array packed with nils that
you don't want. (You should be clear, though, that computational
inefficiency is not always more important than programmer convenience
or code clarity.)

3) Instead of using map (or the same effect under the longer name
'collect', as Robert apparently likes) to create a new array from your
original, explicitly create the new array and push values only as
valid. This is basically the same as above, but without the nil values
and the later compact. For example:
  raw_course_list = []
  doc.css("tr").each { |row|
    tds = row.css("td")
    if tds.have_the_values_I_want
      raw_course_list << tds.map{ |col| ... }
    end
  }

4) Use map (collect) on the array as in #1 or #2, but before that do a
pass through your source array and sanitize it. Sanitization might be
mapping values to nil and then compacting (thus very similar to #2),
or fixing values (as in your TBA or continued description case). This
feels cleaner, but note that this has you doing one (or two, in the
case of map+compact) passes on your data before you get around to
mapping it.
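
To make patterns 2 and 3 concrete, here is a small sketch that runs both on the same data. It assumes rows are plain arrays of cell text standing in for Nokogiri rows, and course_row? is a hypothetical check (four cells means a course row), not something from the original code.

```ruby
rows = [
  ["ACCT 2101 ", "MW", "08:00 am-09:15 am", "TBA"],
  ["navigation link"],
  ["CHEM 1151K", " MF", "11:00 am-12:15 pm", "David Pursell (P)"]
]

# Hypothetical validity check: a course row has exactly four cells.
def course_row?(cells)
  cells.size == 4
end

# Pattern 2: map unwanted rows to nil, then compact the result.
via_compact = rows.map { |cells|
  course_row?(cells) ? cells.map(&:strip) : nil
}.compact

# Pattern 3: build the result explicitly and push only valid rows.
via_push = []
rows.each do |cells|
  via_push << cells.map(&:strip) if course_row?(cells)
end

puts via_compact == via_push
```

Both produce the same table; the difference is only whether an intermediate array containing nils is built along the way.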

Here's (very roughly) what I might do given what you wrote:
  # Assuming you're using Ruby 1.9
  course_info = []
  trs = doc.css('tr')
  trs.each.with_index{ |row,i|
    tds = row.css('td')
    title = ...
    prof = ...
    days = ...
    times = ...
    desc = ...
    next_row = trs[i+1]
    if next_row && next_row.is_a_continuation?
      # Add content from next_row to description
      # If needed, invalidate next_row so it will be skipped
    elsif title && prof && days # If you have all the information you need
      course_info << Course.new( title, prof, days )
    end
  }

Regardless of the approach you use, remember that even though you're
annoyed that you are 'processing' (in one form or another) invalid
entries, you have to touch every row to find out if you like it or
not. It's up to you how you detect which are invalid and handle
them.
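
As a runnable variation on the rough outline above, here is a sketch that handles continuation rows (such as the lab row in the CHEM example) by looking back at the previous course instead of ahead at the next row. Everything here is an assumption made for illustration: the Course struct, the four-cell row layout, and the rule that a blank title marks a continuation row.

```ruby
# Hypothetical record type; meetings is a list of [days, time] pairs.
Course = Struct.new(:title, :professor, :meetings)

# Each row: [title, days, time, professor]; a continuation row has a blank title.
rows = [
  ["Intro to Financial Accounting", "MW", "08:00 am-09:15 am", "TBA"],
  ["Survey of Chemistry I w/Lab", "MF", "11:00 am-12:15 pm", "David Pursell (P)"],
  ["", "W", "11:00 am-01:45 pm", "David Pursell (P)"],
  ["not a course row"]
]

courses = []
rows.each do |title, days, time, professor|
  # Rows with too few cells are not course rows.
  next if days.nil? || time.nil?

  if title.empty?
    # Continuation row: attach its meeting to the previous course, if any.
    courses.last.meetings << [days, time] unless courses.empty?
  else
    courses << Course.new(title, professor, [[days, time]])
  end
end

courses.each { |c| puts "#{c.title}: #{c.meetings.inspect}" }
```

Every row is still visited once, which matches the point above: each row has to be touched to find out whether it is wanted.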
