From: Jesús Gabriel y Galán on
On Sun, Aug 1, 2010 at 11:10 PM, Junhui Liao <junhui.liao(a)uclouvain.be> wrote:
> Hi, Jesus.
>
> Thanks a lot for your help!
> I modified the script a little and made it run as expected.
>
> Here is the code:
>
> def write_line_to_file(every_line, base_time = Hash.new(0))
>   every_line.each_slice(2).with_index do |(time, signal), index|
>     File.open("header_split_#{index}.tsv", "a") do |f|
>       f << "#{time.to_f - base_time[index].to_f}\t#{signal}\n"
>     end
>   end
> end
>
>   first_line = file.readline.chomp.split("\t")
>   first_line_times = first_line.each_slice(2).map { |time, signal| time }
>   file.each_line do |record|
>     line_data = record.chomp.split("\t")
>     write_line_to_file line_data, first_line_times
>   end
> end
>
>
> However, there are at least two items that need to be improved.
> Item 1: this code took ~2 hours to split the data into 4096 files.
> BTW, the original tsv file is around 250M. I wonder if there are
> some tricks to speed it up?

Maybe you can read it completely in memory, reorganize the contents
per file, and then write each file at once.
I think that should speed it up, although it implies a complete
refactor of the code.

> Item 2: the original data has a 21-line header. It could be
> deleted before the script reads the file, but I would like to update
> the script so it skips the first 21 header lines itself.

If you do a first file.readline after opening the file, you will read
the first line.
Then continue with what you already had.

Jesus.

From: Junhui Liao on
Hi, Jesus,


> Maybe you can read it completely in memory, reorganize the contents
> per file, and then write each file at once.
> I think that should speed it up, although it implies a complete
> refactor of the code.


Could you please give me some tips on how to organize this new script?
I have no idea how to do this at all.


>> Item 2: the original data has a 21-line header. It could be
>> deleted before the script reads the file, but I would like to update
>> the script so it skips the first 21 header lines itself.
>
> If you do a first file.readline after opening the file, you will read
> the first line.
> Then continue with what you already had.


As to this problem, I solved it by inserting this code:

until file.readline =~ /Data:/
  file.readline
end


Thanks a lot in advance !
Cheers,
Junhui
--
Posted via http://www.ruby-forum.com/.

From: Jesús Gabriel y Galán on
On Tue, Aug 3, 2010 at 12:15 AM, Junhui Liao <junhui.liao(a)uclouvain.be> wrote:
> Hi, Jesus,
>
>
>> Maybe you can read it completely in memory, reorganize the contents
>> per file, and then write each file at once.
>> I think that should speed it up, although it implies a complete
>> refactor of the code.
>
>
> Could you please give me some tips on how to organize this new script?
> I have no idea how to do this at all.

You could read the lines one by one as you are doing now, but instead
of writing them to each file at each step, accumulate them in arrays.
For example, you could have an array that contains an array of lines
for each file. Then, when you've read the whole file, iterate through
the array, writing each subarray to a file.
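The accumulate-then-write approach described above can be sketched like this. It is a minimal illustration, not the original script: the helper name, the output filenames, and the tiny two-line input are all made up for the demo, and only the time/signal column layout is taken from the thread.

```ruby
require "tmpdir"

# Accumulate each output file's lines in memory, then write every file in
# a single pass instead of reopening it once per input line.
def split_columns(input_path, out_dir)
  buffers = Hash.new { |h, k| h[k] = [] } # column index => lines for that file
  File.open(input_path) do |file|
    # the first line supplies the base time for each time/signal column
    base_times = file.readline.chomp.split("\t").each_slice(2).map { |t, _| t }
    file.each_line do |record|
      record.chomp.split("\t").each_slice(2).with_index do |(time, signal), i|
        buffers[i] << "#{time.to_f - base_times[i].to_f}\t#{signal}"
      end
    end
  end
  # one write per output file
  buffers.each do |i, lines|
    File.write(File.join(out_dir, "header_split_#{i}.tsv"), lines.join("\n") + "\n")
  end
  buffers.size
end

# Tiny demo: two data lines, three time/signal columns.
Dir.mktmpdir do |dir|
  input = File.join(dir, "input.tsv")
  File.write(input, "0.0\ta\t0.1\tb\t0.2\tc\n1.0\td\t1.1\te\t1.2\tf\n")
  split_columns(input, dir)
  puts File.read(File.join(dir, "header_split_0.tsv")) # => "1.0\td"
end
```

With 4096 output files this replaces millions of `File.open` calls with 4096, at the cost of holding the whole 250M of reorganized data in memory at once.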

>>> Item 2: the original data has a 21-line header. It could be
>>> deleted before the script reads the file, but I would like to update
>>> the script so it skips the first 21 header lines itself.
>>
>> If you do a first file.readline after opening the file, you will read
>> the first line.
>> Then continue with what you already had.
>
>
> As to this problem, I solved it by inserting this code:
>
>   until file.readline =~ /Data:/
>     file.readline
>   end

Be careful, you are doing two readlines per iteration, so you might
skip the important line. If you know you have only one line before the
important data you can just do file.readline and continue.
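A skip loop that does not depend on the header's line count reads exactly one line per iteration until the marker line is consumed. A minimal sketch, using `StringIO` to stand in for the real file and assuming the last header line matches `/Data:/` as described in the thread:

```ruby
require "stringio"

# Consume lines up to and including the "Data:" marker line.
# Exactly one readline per iteration, so no data line is ever swallowed.
def skip_header(io, marker = /Data:/)
  line = nil
  line = io.readline until line =~ marker # raises EOFError if marker is missing
end

# Demo: a short header followed by data.
io = StringIO.new("header 1\nheader 2\nData:\n0.0\tsignal\n")
skip_header(io)
puts io.readline # => "0.0\tsignal"
```

Because the loop body and the loop condition share the single `readline`, this works for a 21-line header, a 1-line header, or any other length.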

Jesus.

From: Junhui Liao on
Hi, Jesus,

Thanks a lot for your comments.

> Be careful, you are doing two readlines per iteration, so you might
> skip the important line. If you know you have only one line before the
> important data you can just do file.readline and continue.

In total, the header has 21 lines, and the last line of the header is
"Data:". Fortunately, 21 is an odd number, so the end result happens to
be the same as if the header had only one line :-). For sure, this is
not robust; I will try to update it.

Cheers,
Junhui

From: Junhui Liao on
Hi, Jesus,

> You could read the lines one by one as you are doing now, but instead
> of writing them to each file at each step, accumulate them in arrays.
> For example, you could have an array that contains an array of lines
> for each file. Then, when you've read the whole file, iterate through
> the array, writing each subarray to a file.
>

I tried to use the following code to collect all of the lines' data
into one array, but my code only keeps the last line of the file.
Could you please tell me why?

File.open("../data/test_2lines.tsv") do |file|
  total_data = [] # array containing all of the line_datas
  line_data = []  # array containing just one line's data
  first_line = file.readline.chomp.split("\t")
  first_line_times = first_line.each_slice(2).map { |time, signal| time }
  file.each_line do |record|
    line_data = record.chomp.split("\t") # write all times and signals into line_data
  end
  total_data = total_data + line_data # append line_data to total_data
  puts total_data.length
  File.open("total_data.txt", "w") do |f|
    f << total_data
  end
end

However, under irb,

irb(main):001:0> a = [1,2,3]
=> [1, 2, 3]
irb(main):002:0> b = [4,5,6]
=> [4, 5, 6]
irb(main):003:0> a + b
=> [1, 2, 3, 4, 5, 6]

Thanks a lot in advance!
Cheers,
Junhui