From: Priyank Shah on
Hi,

I read an HTML file using Nokogiri, and that works fine.

But when I print it after reading, it shows an unknown character like

" " in place of <somestarting>hello&nbsp;</somecomplete>

so it looks like "hello ".

The problem seems to be caused by the &nbsp; before the closing tag.

If anyone knows a solution, please help.

Thanks,
Priyank Shah
--
Posted via http://www.ruby-forum.com/.

From: Brian Candler on
Try using
p str
or
puts str.inspect
or
puts str.bytes.to_a.inspect

to get a better look at what character codes are in there.

From: Priyank Shah on
Brian Candler wrote:
> Try using
> p str
> or
> puts str.inspect
> or
> puts str.bytes.to_a.inspect
>
> to get a better look at what character codes are in there.

Hi,

Thanks for the reply.

But that is not useful for me: if I use inspect, it converts the string
to "hello\302\240".

I want a simple space.

Thanks,
Priyank Shah

From: Brian Candler on
Priyank Shah wrote:
> But it is not useful for me if i use inspect it convert "hello\302\240"

That is useful.

It shows that the &nbsp; has been converted into the byte sequence
\302\240 (octal) or \xc2\xa0 (hex).

That is the UTF-8 encoding of a non-breaking space, codepoint 160
(U+00A0):

$ irb19
>> 160.chr("UTF-8")
=> " "
>> 160.chr("UTF-8").bytes.to_a
=> [194, 160]
>> 160.chr("UTF-8").force_encoding("ASCII-8BIT")
=> "\xC2\xA0"

So the terminal you are trying to print it to is non-UTF-8. Perhaps a
Windows box? You didn't say what your platform was.

In that case, you need to re-encode it to the appropriate character set.
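
As a rough sketch of what that could look like (assuming Ruby 1.9 or
later; the sample string and the target encodings below are only
illustrations, since the actual platform is not known), you can either
swap the non-breaking space for an ordinary space, or re-encode the
string for the terminal:

```ruby
# -*- coding: utf-8 -*-
# Sample string standing in for the text read via Nokogiri (an assumption).
str = "hello\u00A0"

# Option 1: replace each non-breaking space (U+00A0) with an ordinary space.
plain = str.gsub("\u00A0", " ")

# Option 2: re-encode for a Latin-1 terminal, where NBSP is the single byte 0xA0.
# ISO-8859-1 is only an example target; use whatever the terminal expects.
latin1 = str.encode("ISO-8859-1")

# Option 3: re-encode to ASCII, putting a space wherever a character
# has no ASCII equivalent.
ascii = str.encode("US-ASCII", :undef => :replace, :replace => " ")

p plain.bytes.to_a   # => [104, 101, 108, 108, 111, 32]
p latin1.bytes.to_a  # => [104, 101, 108, 108, 111, 160]
p ascii              # => "hello "
```

If all you want is a simple space, option 1 is probably the least
surprising, because it leaves the rest of the string in UTF-8.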