From: Cs Webgrl on
Hello,

I am working with scraping quite a bit of data and I would like to make
sure that I'm following some best practices for string manipulation. I
would like to be sure to take into account any speed and garbage
collection issues.

Does anyone know of any posts, websites, books or other resources that
provide "do this, not that" types of guidance?

For example, my understanding is that globbing everything into one line
when manipulating a string is not the best use of resources.

not good
"string+var".gsub('+','').strip.capitalize


better
s = "string+var"
s.gsub('+','')
s.strip!
s.capitalize
s => 'String Var'

Are there resources that explain why one is better than the other that
also provides more best practices like this?

Thanks.
--
Posted via http://www.ruby-forum.com/.

From: Peter Hickman on

Personally, I don't think doing things on one line is a sin in itself, only
when it is overdone. What counts as overdone depends on your reading ability.

Splitting things onto individual lines allows you to insert logging at
various points without fear of breaking the code, which the one-line approach
does not.
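
As a rough sketch of that point (the input string and log messages here are made up for illustration), the multi-line form leaves room for a log line between steps, while the one-line form needs something like tap to peek at an intermediate value:

```ruby
# Hypothetical input; any messy string would do.
raw = "  hello    WORLD  "

# Multi-line form: a log line can be dropped in between any two steps.
x = raw.downcase
warn "after downcase: #{x.inspect}"

x = x.gsub(/\s+/, ' ')
warn "after gsub: #{x.inspect}"

x = x.strip.capitalize
puts x

# One-line form: Object#tap is one way to log without breaking the chain.
y = raw.downcase
       .tap { |s| warn "after downcase: #{s.inspect}" }
       .gsub(/\s+/, ' ').strip.capitalize
puts y
```

Both forms produce the same cleaned-up string; the difference is only in how easy it is to inspect the steps.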

However the multiline approach can make an insignificant part of the code
take up lots of screen real estate which can make the larger code harder to
read.

For example, x.downcase.gsub(/\s+/, ' ').strip.capitalize is a fairly
easy-to-read cleanup of a string, but if it goes multiline

x.downcase!
x.gsub!(/\s+/, ' ')
x.strip!
x.capitalize!

not only does it take up more of the screen but it has also altered x,
something that the single line version did not.

Of course if things get really silly you could just create a function and
stuff all the code in there.
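
A minimal sketch of that idea, assuming a helper called clean_up (a hypothetical name, not something from this thread's code):

```ruby
# clean_up is a hypothetical helper name used only for this sketch.
# It uses the non-bang methods, so the caller's string is left untouched.
def clean_up(text)
  text.downcase.gsub(/\s+/, ' ').strip.capitalize
end

original = "  some   MESSY\ttext "
cleaned  = clean_up(original)

puts cleaned
puts original.inspect   # still the original, unmodified string
```

The chain then takes up one line at each call site, and logging can be added inside the method if it is ever needed.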

From: Brian Candler on
Cs Webgrl wrote:
> better
> s = "string+var"
> s.gsub('+','')
> s.strip!
> s.capitalize
> s => 'String Var'

(You need gsub! and capitalize! of course)

> Are there resources that explain why one is better than the other that
> also provides more best practices like this?

Methods like capitalize! work on the existing string buffer in memory.
The non-bang methods create a whole new string, which involves work
copying it, and then later garbage-collecting the original.

Most of the non-bang methods are implemented as a dup followed by
calling the bang method on the copy. They're written in C, but are
effectively like this:

class String
  def capitalize
    copy = dup
    copy.capitalize!   # returns nil when nothing changes, so return the copy itself
    copy
  end

  def capitalize!
    # scan the string and modify it in place
  end
end

Of course, in most apps the original chained code you wrote will be just
fine, and it's easy to write and understand. If you will be processing
files which are hundreds of megabytes long then it may be worthwhile
rewriting to the second form.
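
One way to see the difference for yourself is to check object identity with equal?. This is only a small sketch with a made-up string:

```ruby
# dup the literal so the example also runs when string literals are frozen.
s = "hello world".dup

copy = s.capitalize        # non-bang: builds and returns a new String
puts copy.equal?(s)        # false: a different object
puts s                     # hello world (unchanged)

result = s.capitalize!     # bang: rewrites the existing buffer
puts result.equal?(s)      # true: the very same object came back
puts s                     # Hello world
```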

Other thoughts:

* for large files, process them in chunks or lines rather than reading
them all in at once

* use block form when opening a file, to ensure it's closed as soon as
you've finished with it

File.open("/path/to/file","rb") do |f|
  f.each_line do |line|
    ...
  end
end
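
Putting the two ideas together, here is a self-contained sketch that cleans a file line by line with bang methods. The temporary file and its contents are invented just so the example can run on its own:

```ruby
require 'tempfile'

# Build a small throwaway file so the sketch is self-contained.
tmp = Tempfile.new('scrape')
tmp.write("  first+line  \n second+line \n")
tmp.close

cleaned = []
File.open(tmp.path, "rb") do |f|
  f.each_line do |line|
    # each_line yields a fresh String for every line, so mutating it
    # in place does not affect anything else.
    line.gsub!('+', '')
    line.strip!
    line.capitalize!
    cleaned << line
  end
end
tmp.unlink

p cleaned
```

Only one line is held in memory at a time, and the file is closed as soon as the block finishes.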

From: Cs Webgrl on
Thanks so much for the help and guidance. Most of my data is parsed
from mechanize and broken into smaller chunks that will be manipulated to
get the final format. From my understanding, I should be OK. I
definitely agree that concise code with fewer lines is easier
to read. I just wanted to make sure that I'm not compromising speed or
garbage collection for readability with these types of methods.

From: Josh Cheek on

On Wed, Jun 30, 2010 at 8:32 AM, Cs Webgrl <cschaller(a)gmail.com> wrote:

> Hello,
>
> I am working with scraping quite a bit of data and I would like to make
> sure that I'm following some best practices for string manipulation. I
> would like to be sure to take into account any speed and garbage
> collection issues.
>
> Does anyone know of any posts, websites, books or other resources that
> provide "do this, not that" types of guidance?
>
> For example, my understanding is that globbing everything into one line
> when manipulating a string is not the best use of resources.
>
> not good
> "string+var".gsub('+','').strip.capitalize
>
>
> better
> s = "string+var"
> s.gsub('+','')
> s.strip!
> s.capitalize
> s => 'String Var'
>
> Are there resources that explain why one is better than the other that
> also provides more best practices like this?
>
> Thanks.
>
I don't know about a specific site, but if you do not need to keep the value
of string, then string << var is better than string + var, since it mutates
string, rather than creating a new object. I once read benchmarks about
this, but I can't remember where I read them, and I can't seem to recreate
them, so maybe I am wrong.

# plus returns a new String
string , var = 'abc' , 'def'
string + var # => "abcdef"
string # => "abc"

# << mutates the receiver
string << var # => "abcdef"
string # => "abcdef"
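
If you want to measure this yourself, a rough sketch using the standard benchmark library might look like the following. The iteration count is arbitrary, and the timings will vary by Ruby version and machine, so treat it as a starting point rather than a result:

```ruby
require 'benchmark'

n = 10_000       # arbitrary iteration count, chosen only for illustration
piece = "abc"

plus_result = ""
plus_time = Benchmark.realtime do
  n.times { plus_result += piece }     # builds a new String on every pass
end

append_result = "".dup                 # dup so the literal is safe to mutate
append_time = Benchmark.realtime do
  n.times { append_result << piece }   # appends into the same String
end

puts "+= took #{plus_time}s, << took #{append_time}s"
puts plus_result == append_result      # both approaches build the same text
```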



You can use s.delete('+') instead of s.gsub('+','') and it will be faster,
prettier, and more expressive.
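
A quick sketch of the two side by side, with made-up strings. Note that delete treats its argument as a set of characters rather than a substring:

```ruby
s = "string+var+more"

with_gsub   = s.gsub('+', '')
with_delete = s.delete('+')
puts with_gsub == with_delete     # same result either way

# The argument is a character set: every '+' and every space is removed.
no_plus_or_space = "a+b c".delete('+ ')
puts no_plus_or_space
```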



I expect the reason you heard that it is better to do it on multiple lines
is that it then lets you use the bang methods, which cannot be chained safely
because they return nil when they make no change. In general, it is
faster to say s.capitalize! than s.capitalize: the bang version mutates s
itself, while the non-bang version creates a new, modified object. When we are
not interested in keeping the original value of s, creating all these extra
objects adds up.

# capitalize returns the capital version regardless of the original string
# so you can use it in the middle of a method chain
'Abc'.capitalize # => "Abc"
'abc'.capitalize # => "Abc"

# don't use capitalize! in the middle of a method chain because it can
# return nil
'Abc'.capitalize! # => nil
'abc'.capitalize! # => "Abc"

# capitalize creates a new string, so is less efficient if you don't care
# about the original
# also does not modify the receiver, so you have to capture its result
s = 'abc'
s.capitalize # => "Abc"
s # => "abc"

# capitalize! mutates the original string, so is more efficient if you don't
# care about the original
# does modify the receiver, so don't have to capture its result
# in fact, _don't_ capture its result, because as shown above, result could
# be nil
s = 'abc'
s.capitalize! # => "Abc"
s # => "Abc"
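
To make the chaining hazard concrete, here is a small sketch (the strings are made up) showing a chained bang call failing, followed by the one-call-per-statement form that avoids it:

```ruby
# Chaining bang methods: capitalize! returns nil here because "Abc" is
# already capitalized, so the next call in the chain is sent to nil.
s = "Abc".dup
chain_broke = false
begin
  s.capitalize!.strip!
rescue NoMethodError
  chain_broke = true
end
puts chain_broke

# Safer: one bang call per statement, ignoring the return values.
t = " abc ".dup
t.strip!
t.capitalize!
puts t
```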
