From: Bob05 Dr on
Hello,

I have some text files that I would like to extract text from, then join
them on one single line and save them to a text file.

Here is an example of the text I want to take out:

<Title>Protein complexes in Saccharomyces cerevisiae
(GPM06600002310)</Title>
<ShortLabel>GPM06600002310</ShortLabel>
<ProtocolName>None</ProtocolName>

Here is how I would like the text to be saved:

Protein complexes in Saccharomyces cerevisiae
(GPM06600002310)GPM06600002310 None


So far I have this:

require 'rexml/document'
include REXML
file = File.new("1.xml")
doc = Document.new(file)
puts doc
aFile = File.new("1.txt", "w")
aFile.write(doc)
aFile.close

I was wondering, how can you split out text and join them on one line?
--
Posted via http://www.ruby-forum.com/.

From: Jesús Gabriel y Galán on
On Wed, Jun 30, 2010 at 1:54 AM, Bob05 Dr <knightplayer(a)gmail.com> wrote:
> I was wondering, how can you split out text and join them on one line?

First of all, your document doesn't parse, because it has more than one
root element, and XML allows only one.
After fixing that, you need to get to each element and extract its text.
Take a look at:

http://www.germane-software.com/software/rexml/docs/tutorial.html

And the methods:

elements
[]
text

of the API. Experiment a little in IRB:

irb(main):001:0> s = <<EOF
irb(main):002:0" <Title>Protein complexes in Saccharomyces cerevisiae
irb(main):003:0" (GPM06600002310)</Title>
irb(main):004:0" <ShortLabel>GPM06600002310</ShortLabel>
irb(main):005:0" <ProtocolName>None</ProtocolName>
irb(main):006:0" EOF
=> "<Title>Protein complexes in Saccharomyces cerevisiae\n(GPM06600002310)</Title>\n<ShortLabel>GPM06600002310</ShortLabel>\n<ProtocolName>None</ProtocolName>\n"
irb(main):007:0>
irb(main):008:0*
irb(main):009:0* require 'rexml/document'
=> true
irb(main):010:0> include REXML
=> Object
irb(main):011:0> doc = Document.new s
REXML::ParseException: #<RuntimeError: attempted adding second root element to document>

Oops, more than one root element. I'll add a fake one surrounding everything:

irb(main):012:0> s = <<EOF
irb(main):013:0" <ROOT>
irb(main):014:0" <Title>Protein complexes in Saccharomyces cerevisiae
irb(main):015:0" (GPM06600002310)</Title>
irb(main):016:0" <ShortLabel>GPM06600002310</ShortLabel>
irb(main):017:0" <ProtocolName>None</ProtocolName>
irb(main):018:0" </ROOT>
irb(main):019:0" EOF
=> "<ROOT>\n<Title>Protein complexes in Saccharomyces cerevisiae\n(GPM06600002310)</Title>\n<ShortLabel>GPM06600002310</ShortLabel>\n<ProtocolName>None</ProtocolName>\n</ROOT>\n"
irb(main):020:0> doc = Document.new s
=> <UNDEFINED> ... </>
irb(main):025:0> doc.elements
=> #<REXML::Elements:0xb72907e0 @element=<UNDEFINED> ... </>>
irb(main):026:0> doc.elements.each {|el| p el}
<ROOT> ... </>
=> [<ROOT> ... </>]
irb(main):027:0> doc.to_a
=> [<ROOT> ... </>, "\n"]
irb(main):028:0> doc.elements.to_a
=> [<ROOT> ... </>]
irb(main):032:0> doc.elements["/Title"]
=> nil
irb(main):033:0> doc.elements["Title"]
=> nil
irb(main):034:0> root = doc.root
=> <ROOT> ... </>
irb(main):035:0> root.elements["Title"]
=> <Title> ... </>
irb(main):036:0> root.elements["Title"].to_s
=> "<Title>Protein complexes in Saccharomyces cerevisiae\n(GPM06600002310)</Title>"

Look, to_s gives me the whole Title element, tags included, so the text
is in there. Let's see if there's a way to get just the text:

irb(main):039:0> root.elements["Title"].methods.sort
=> ["<<", "==", "===", "=~", "[]", "[]=", "__id__", "__send__", "add",
"add_attribute", "add_attributes", "add_element", "add_namespace",
"add_text", "all?", "any?", "attribute", "attributes", "bytes",
"cdatas", "children", "class", "clone", "collect", "comments",
"context", "context=", "count", "cycle", "dclone", "deep_clone",
"delete", "delete_at", "delete_attribute", "delete_element",
"delete_if", "delete_namespace", "detect", "display", "document",
"drop", "drop_while", "dup", "each", "each_child", "each_cons",
"each_element", "each_element_with_attribute",
"each_element_with_text", "each_index", "each_recursive",
"each_slice", "each_with_index", "elements", "entries", "enum_cons",
"enum_for", "enum_slice", "enum_with_index", "eql?", "equal?",
"expanded_name", "extend", "find", "find_all", "find_first_recursive",
"find_index", "first", "freeze", "frozen?", "fully_expanded_name",
"get_elements", "get_text", "grep", "group_by", "has_attributes?",
"has_elements?", "has_name?", "has_text?", "hash", "id",
"ignore_whitespace_nodes", "include?", "indent", "index",
"index_in_parent", "inject", "insert_after", "insert_before",
"inspect", "instance_eval", "instance_exec", "instance_of?",
"instance_variable_defined?", "instance_variable_get",
"instance_variable_set", "instance_variables", "instructions",
"is_a?", "kind_of?", "length", "local_name", "map", "max", "max_by",
"member?", "method", "methods", "min", "min_by", "minmax",
"minmax_by", "name", "name=", "namespace", "namespaces",
"next_element", "next_sibling", "next_sibling=", "next_sibling_node",
"nil?", "node_type", "none?", "object_id", "one?", "parent",
"parent=", "parent?", "partition", "prefix", "prefix=", "prefixes",
"previous_element", "previous_sibling", "previous_sibling=",
"previous_sibling_node", "private_methods", "protected_methods",
"public_methods", "push", "raw", "reduce", "reject", "remove",
"replace_child", "replace_with", "respond_to?", "reverse_each",
"root", "root_node", "select", "send", "singleton_methods", "size",
"sort", "sort_by", "taint", "tainted?", "take", "take_while", "tap",
"text", "text=", "texts", "to_a", "to_enum", "to_s", "to_set", "type",
"unshift", "untaint", "whitespace", "write", "xpath", "zip"]

There's a text method in there. Would that do what I expect?

irb(main):040:0> root.elements["Title"].text
=> "Protein complexes in Saccharomyces cerevisiae\n(GPM06600002310)"

Bingo!

Is there a way to access it directly from the doc, instead of having a
root variable?

irb(main):042:0> doc.elements["ROOT/Title"].text
=> "Protein complexes in Saccharomyces cerevisiae\n(GPM06600002310)"
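
The same path syntax can also be used to loop over every child of the
fake root in one go. This is only a sketch: the ROOT wrapper and the
variable names (xml, texts) are made up for the example, and text
returns just the first text node of each element, which is enough for
simple elements like these.

```ruby
require 'rexml/document'

# Sample input, wrapped in the fake <ROOT> element so it parses.
xml = <<EOF
<ROOT>
<Title>Protein complexes in Saccharomyces cerevisiae
(GPM06600002310)</Title>
<ShortLabel>GPM06600002310</ShortLabel>
<ProtocolName>None</ProtocolName>
</ROOT>
EOF

doc = REXML::Document.new(xml)

# Collect the text of every direct child of ROOT, in document order.
texts = []
doc.elements.each("ROOT/*") { |el| texts << el.text }

p texts
```

The Title text still contains the newline from the source at this
point; it has to be replaced or squeezed out before everything goes on
one line.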


Now you can do the same for the other elements. I also recommend you
learn XPath and CSS selectors if you are going to be parsing markup,
and also look at other parsers like Nokogiri. This example was pretty
simple, but these things can get nasty.
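
Putting the pieces together, something along these lines might do what
you asked for. Treat it as a rough sketch with a few assumptions: the
helper name one_line is made up, the fragment has no XML declaration
(otherwise wrapping it in a fake ROOT element would not parse), the
values are separated by a single space, and the output file is 1.txt
as in your code. In your case the fragment would come from
File.read("1.xml") rather than a heredoc.

```ruby
require 'rexml/document'

# Hypothetical helper: wraps the fragment in a fake <ROOT> element so that
# REXML accepts it, then returns the text of the named elements on one line.
def one_line(fragment, names)
  doc = REXML::Document.new("<ROOT>#{fragment}</ROOT>")
  names.map { |name|
    el = doc.root.elements[name]
    # Missing or empty elements contribute an empty string; newlines and
    # runs of whitespace inside the text collapse to a single space.
    el ? el.text.to_s.split.join(" ") : ""
  }.join(" ")
end

fragment = <<EOF
<Title>Protein complexes in Saccharomyces cerevisiae
(GPM06600002310)</Title>
<ShortLabel>GPM06600002310</ShortLabel>
<ProtocolName>None</ProtocolName>
EOF

line = one_line(fragment, %w[Title ShortLabel ProtocolName])
puts line

# Save the single line to a text file.
File.open("1.txt", "w") { |f| f.puts line }
```

With the sample above this prints:
Protein complexes in Saccharomyces cerevisiae (GPM06600002310) GPM06600002310 None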

Jesus.