From: joe chesak on

I don't have access to the database, but every day I get a csv dump of all
live customer listings.
There are around 15,000 live listings.
A record contains 3 fields: id, name, description
Every day some new accounts are created while others are deleted.
Every day I want to compare yesterday's dump with today's dump and print
the full record of each terminated account and each new account.
I am achieving this with the script below.
Those last two lines make the script run very slowly.

Is there a more elegant way to compare two
arrays|hashes|sets|FasterCSVtables while carrying a few fields along for the
ride?


WORKING CODE:
require 'rubygems'
require 'fastercsv'

sundays_dump = FasterCSV.read("./sunday.csv")
mondays_dump = FasterCSV.read("./monday.csv")

sundays_ids = sundays_dump.collect {|row| row[1]}
mondays_ids = mondays_dump.collect {|row| row[1]}

newaccount_ids = mondays_ids - sundays_ids
terminated_ids = sundays_ids - mondays_ids

sundays_dump.each {|row| puts 'delete,'+row[0]+','+row[1] if terminated_ids.include? row[1]}
mondays_dump.each {|row| puts 'create,'+row[0]+','+row[1] if newaccount_ids.include? row[1]}


TEST DATA:
sunday.csv
id,name,otherstuff
1,larry,bald
2,curly,tall
3,moe,meanie

monday.csv
id,name,otherstuff
1,larry,bald
4,shemp,greasy

OUTPUT:
delete,2,curly
delete,3,moe
create,4,shemp

From: Matthew K. Williams on
On Sat, 7 Aug 2010, joe chesak wrote:

> I don't have access to the database, but everyday I get a csv dump of all
> live customer listings.
> There are around 15,000 live listings.
> A record contains 3 fields: id, name, description
> Everyday some new accounts are created while others are deleted.
> Everyday I want to compare yesterday's dump with today's dump and print
> the full record of each terminated record and each new record.
> I am achieving this with the script below.
> That last line makes the script run very slow.
>
> Is there a more elegant way to compare two
> arrays|hashes|sets|FasterCSVtables while carrying a few fields along for the
> ride?
>
>
> WORKING CODE:
> require 'rubygems'
> require 'fastercsv'
>
> sundays_dump = FasterCSV.read("./sunday.csv")
> mondays_dump = FasterCSV.read("./monday.csv")
>
> sundays_ids = sundays_dump.collect {|row| row[1]}
> mondays_ids = mondays_dump.collect {|row| row[1]}
>
> newaccount_ids = mondays_ids - sundays_ids
> terminated_ids = sundays_ids - mondays_ids
>
> sundays_dump.each {|row| puts 'delete,'+row[0]+','+row[1] if
> terminated_ids.include? row[1]}
> mondays_dump.each {|row| puts 'create,'+row[0]+','+row[1] if
> newaccount_ids.include? row[1]}

Rather than go through both sets of data, I'd do something like this:

terminated_ids.each do |row|
  puts "delete,#{sundays_dump[row][0]},#{sundays_dump[row][1]}"
end

Same thing for the new account (substituting monday, of course)

This way you're only outputting the changed records.
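One wrinkle: the dumps are plain arrays of rows, so they can't be indexed by
id directly. A minimal sketch of one way to build that lookup, using a
hypothetical hash (sunday_by_key) keyed on the same column the original
script uses to build terminated_ids (row[1]):

# sunday_by_key is a hypothetical name; keyed on row[1], matching how
# terminated_ids is built in the original script
sunday_by_key = {}
sundays_dump.each { |row| sunday_by_key[row[1]] = row }

terminated_ids.each do |key|
  row = sunday_by_key[key]
  puts "delete,#{row[0]},#{row[1]}"
end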

Matt

From: Kirk Haines on
On Fri, Aug 6, 2010 at 9:51 AM, joe chesak <rubytalk(a)chesak.com> wrote:
> I don't have access to the database, but everyday I get a csv dump of all
> live customer listings.
> There are around 15,000 live listings.
> A record contains 3 fields: id, name, description
> Everyday some new accounts are created while others are deleted.
> Everyday I want to compare yesterday's dump with today's dump and print
> the full record of each terminated record and each new record.
> I am achieving this with the script below.
> That last line makes the script run very slow.
>
> Is there a more elegant way to compare two
> arrays|hashes|sets|FasterCSVtables while carrying a few fields along for the
> ride?
>
>
> WORKING CODE:
> require 'rubygems'
> require 'fastercsv'
>
> sundays_dump = FasterCSV.read("./sunday.csv")
> mondays_dump = FasterCSV.read("./monday.csv")
>
> sundays_ids = sundays_dump.collect {|row| row[1]}
> mondays_ids = mondays_dump.collect {|row| row[1]}
>
> newaccount_ids = mondays_ids - sundays_ids
> terminated_ids = sundays_ids - mondays_ids
>
> sundays_dump.each {|row| puts 'delete,'+row[0]+','+row[1] if
> terminated_ids.include? row[1]}
> mondays_dump.each {|row| puts 'create,'+row[0]+','+row[1] if
> newaccount_ids.include? row[1]}

The answer to, "it's too slow", is often, "don't do that, then."

If comparing arrays is too slow, don't compare arrays. Instead, do
something that's fast. Looking up data in a Hash is fast. And since
your data appears to be keyed by unique ids (though your code is
written with variable names assuming it is looking at ids, but with
array lookups that look at the names, which is confusing and might be
a bug), it is easy to stuff that data into a hash for later lookup.

-----
require 'rubygems'
require 'fastercsv'

def extract_data(dumpfile)
  dump = FasterCSV.read(dumpfile)
  ids = []
  data = {}
  dump.each do |row|
    ids << row[0]
    data[row[0]] = row
  end

  [ids, data]
end

sunday_ids, sunday_data = extract_data("./sunday.csv")
monday_ids, monday_data = extract_data("./monday.csv")

newaccount_ids = monday_ids - sunday_ids
terminated_ids = sunday_ids - monday_ids

terminated_ids.each do |id|
  row = sunday_data[id]
  puts "delete,#{id},#{row[1]}"
end

newaccount_ids.each do |id|
  row = monday_data[id]
  puts "create,#{id},#{row[1]}"
end
-----

If this were my script, I'd further clean it up so that all of the
logic lives in a simple library. Then either create a generic executable
that takes the file names as arguments, or write small scripts with
hard-coded file names that just call into that library. It'd be much
cleaner, unless you are absolutely sure that this code will never be
more than a one-off hack.
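For example, a rough sketch of that split might look like the following;
the names here (listing_diff.rb, ListingDiff) are made up, and it keeps the
same assumptions as the script above (unique ids in column 0):

# listing_diff.rb -- hypothetical library plus a small command-line wrapper
require 'rubygems'
require 'fastercsv'

module ListingDiff
  # Returns [created_rows, terminated_rows] for two CSV dump files.
  def self.diff(old_file, new_file)
    old_ids, old_data = index(old_file)
    new_ids, new_data = index(new_file)
    created    = (new_ids - old_ids).map { |id| new_data[id] }
    terminated = (old_ids - new_ids).map { |id| old_data[id] }
    [created, terminated]
  end

  # Builds the same [ids, data] pair as extract_data above.
  def self.index(dumpfile)
    ids = []
    data = {}
    FasterCSV.read(dumpfile).each do |row|
      ids << row[0]
      data[row[0]] = row
    end
    [ids, data]
  end
end

if __FILE__ == $0
  old_file, new_file = ARGV
  created, terminated = ListingDiff.diff(old_file, new_file)
  terminated.each { |row| puts "delete,#{row[0]},#{row[1]}" }
  created.each    { |row| puts "create,#{row[0]},#{row[1]}" }
end

Run as, for example: ruby listing_diff.rb sunday.csv monday.csv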


Kirk Haines

From: Caleb Clausen on
On 8/6/10, joe chesak <rubytalk(a)chesak.com> wrote:
> I don't have access to the database, but everyday I get a csv dump of all
> live customer listings.
> There are around 15,000 live listings.
> A record contains 3 fields: id, name, description
> Everyday some new accounts are created while others are deleted.
> Everyday I want to compare yesterday's dump with today's dump and print
> the full record of each terminated record and each new record.
> I am achieving this with the script below.
> That last line makes the script run very slow.
>
> Is there a more elegant way to compare two
> arrays|hashes|sets|FasterCSVtables while carrying a few fields along for the
> ride?
>
>
> WORKING CODE:
> require 'rubygems'
> require 'fastercsv'
>
> sundays_dump = FasterCSV.read("./sunday.csv")
> mondays_dump = FasterCSV.read("./monday.csv")
>
> sundays_ids = sundays_dump.collect {|row| row[1]}
> mondays_ids = mondays_dump.collect {|row| row[1]}
>
> newaccount_ids = mondays_ids - sundays_ids
> terminated_ids = sundays_ids - mondays_ids
>
> sundays_dump.each {|row| puts 'delete,'+row[0]+','+row[1] if
> terminated_ids.include? row[1]}
> mondays_dump.each {|row| puts 'create,'+row[0]+','+row[1] if
> newaccount_ids.include? row[1]}
>
>
> TEST DATA:
> sunday.csv
> id,name,otherstuff
> 1,larry,bald
> 2,curly,tall
> 3,moe,meanie
>
> monday.csv
> id,name,otherstuff
> 1,larry,bald
> 4,shemp,greasy
>
> OUTPUT:
> delete,2,curly
> delete,3,moe
> create,4,shemp

Here's an idea... although Matthew's suggestion is really the better
way to go (though I think his code won't quite work as written... you'd
have to save away array indexes somewhere to make that approach effective).

Anyway, it should be much faster if you use Sets instead of arrays for
terminated_ids and newaccount_ids. Like this:

require 'rubygems'
require 'fastercsv'
require 'set'

sundays_dump = FasterCSV.read("./sunday.csv")
mondays_dump = FasterCSV.read("./monday.csv")

sundays_ids = sundays_dump.collect {|row| row[1]}
mondays_ids = mondays_dump.collect {|row| row[1]}

newaccount_ids = Set[*mondays_ids - sundays_ids]
terminated_ids = Set[*sundays_ids - mondays_ids]

sundays_dump.each {|row| puts 'delete,'+row[0]+','+row[1] if terminated_ids.include? row[1]}
mondays_dump.each {|row| puts 'create,'+row[0]+','+row[1] if newaccount_ids.include? row[1]}


Turning sundays_ids and mondays_ids into Sets might also make it a tad
bit faster yet, but I think the above code has achieved the lion's
share of the potential speedups.
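
For reference, that variant would look something like this, keeping the
same column choice as the script above:

sundays_id_set = Set.new(sundays_dump.collect { |row| row[1] })
mondays_id_set = Set.new(mondays_dump.collect { |row| row[1] })

newaccount_ids = mondays_id_set - sundays_id_set   # Set#- returns a Set
terminated_ids = sundays_id_set - mondays_id_set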

From: Brian Candler on
Another option: sort them by id, then walk through them together. If the
current id on both list A and list B is the same, then skip forward on
both. If the current id on list A is less than the current id on list B,
then that id exists only in A (so report it, and skip forward on A). And
vice versa.
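
A minimal sketch of that walk, assuming both dumps are already sorted by id
and loaded as arrays of rows for brevity; ids are compared as strings here,
so the sort order has to match:

def diff_sorted(old_rows, new_rows)
  i = j = 0
  while i < old_rows.size && j < new_rows.size
    old_id, new_id = old_rows[i][0], new_rows[j][0]
    if old_id == new_id
      i += 1                                              # present in both: skip forward on both
      j += 1
    elsif old_id < new_id
      puts "delete,#{old_rows[i][0]},#{old_rows[i][1]}"   # only in the old dump
      i += 1
    else
      puts "create,#{new_rows[j][0]},#{new_rows[j][1]}"   # only in the new dump
      j += 1
    end
  end
  old_rows[i..-1].each { |row| puts "delete,#{row[0]},#{row[1]}" }   # old leftovers
  new_rows[j..-1].each { |row| puts "create,#{row[0]},#{row[1]}" }   # new leftovers
end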

The advantage of this is that it works with huge files - you can read
them one line at a time instead of reading them all into RAM. And there
are tools which can do an external sort efficiently, if they're not
already sorted that way.

To avoid coding this, just use the unix shell command 'join' to do the
work for you (which you can invoke from IO.popen if you like). Just
beware that it's very fussy about having the files correctly sorted
first.

$ join -t, -j1 -v1 sunday.csv monday.csv
2,curly,tall
3,moe,meanie
$ join -t, -j1 -v2 sunday.csv monday.csv
4,shemp,greasy
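
For completeness, a small sketch of driving those same join commands from
Ruby with IO.popen, as mentioned above (it assumes the files are already
sorted by id; note the output carries the full record after the action):

deleted = IO.popen("join -t, -j1 -v1 sunday.csv monday.csv") { |io| io.read }
created = IO.popen("join -t, -j1 -v2 sunday.csv monday.csv") { |io| io.read }

deleted.each_line { |line| puts "delete,#{line.chomp}" }
created.each_line { |line| puts "create,#{line.chomp}" }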