From: Praveen on
Hi,

I am working on enhancing the IBM_DB Ruby driver (the database driver for
DB2 and Informix) by adding Unicode support.

I have had no luck googling for any documents or links that describe the
Ruby C extension APIs exposing the Unicode support of Ruby 1.9,
specifically how to:

1) Convert Ruby string (unicode) object received in the extension API
into wchar (like rb_str2cstr, in ruby-1.8)

2) Convert wchar* to a Ruby Object (like rb_str_new2, in ruby-1.8).

3) Convert string objects between different formats (UCS-2, UCS-4).

Could somebody shed some light on the above queries?

In addition, could you tell me whether Ruby is by default compiled to use
UCS-2, UCS-4, or some other string format, and how I can determine
programmatically, inside the extension, which format is in use?

Thanks

Praveen
From: KUBO Takehiro on
On Tue, Feb 9, 2010 at 9:15 PM, Praveen <praveendevarao(a)gmail.com> wrote:
> Hi,
>
> I am working on enhancing the IBM_DB Ruby driver (database driver for
> DB2 and Informix) by providing unicode support.
>
> I tried googling with no luck to find any documents or links which
> talk about the Ruby C extension API's that can be used to unleash the
> unicode support of Ruby-1.9 to

Look at ruby-1.9.1-pxxx/include/ruby/encoding.h and ruby-1.9.1-pxxx/string.c.

> 1) Convert Ruby string (unicode) object received in the extension API
> into wchar (like rb_str2cstr, in ruby-1.8)

There is no generic way, because wchar's encoding is platform-dependent.
As far as I know it is UCS-2 on Windows, UCS-4 on Linux, and a
locale-dependent encoding on Solaris.

If it is UCS-2,
rb_encoding *ucs2_enc = rb_enc_find("UCS-2");
VALUE ucs2_string = rb_str_export_to_enc(string, ucs2_enc);
const char *ucs2_cstr = StringValueCStr(ucs2_string);

> 2) Convert wchar* to a Ruby Object (like rb_str_new2, in ruby-1.8).

If the wchar's encoding is UCS-2,
rb_encoding *ucs2_enc = rb_enc_find("UCS-2");
VALUE ucs2_string = rb_external_str_new_with_enc(cstr, len, ucs2_enc);

> 3) Convert string objects between different formats (UCS-2, UCS-4).

rb_encoding *ucs2_enc = rb_enc_find("UCS-2");
rb_encoding *ucs4_enc = rb_enc_find("UCS-4");
VALUE ucs4_string = rb_str_conv_enc(ucs2_string, ucs2_enc, ucs4_enc);

From: KUBO Takehiro on
On Tue, Feb 9, 2010 at 10:06 PM, KUBO Takehiro <kubo(a)jiubao.org> wrote:
>  rb_encoding *ucs2_enc = rb_enc_find("UCS-2");
>  rb_encoding *ucs4_enc = rb_enc_find("UCS-4");

Sorry, UCS-2 and UCS-4 are not defined in ruby 1.9.1.
Use UTF-16LE, UTF-16BE, UTF-32LE or UTF-32BE instead.
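At the Ruby level the same conversion can be exercised with String#encode
(the Ruby counterpart of rb_str_conv_enc). A minimal sketch, using an
arbitrary sample string of my own choosing:

```ruby
# Transcoding between UTF-8, UTF-16LE and UTF-32LE with String#encode.
s8  = "GÜHRING文"            # 8 characters, UTF-8 source
s16 = s8.encode("UTF-16LE")   # BMP characters take 2 bytes each => 16 bytes
s32 = s16.encode("UTF-32LE")  # every character takes 4 bytes    => 32 bytes

# The round trip back to UTF-8 preserves the characters.
raise "round trip failed" unless s32.encode("UTF-8") == s8
```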

From: Praveen on
Hi Kubo,

Thanks for the information.

I will give it a try and get back to you on how I progress [with doubts/
success].

Thanks

Praveen

From: Praveen on
Hi Kubo,

I tried proceeding with the above-mentioned APIs. However, I am seeing
some interesting behavior, and I am not sure I am using the right
constructs.

Below is the Ruby script I am using:

======================================
#encoding: utf-8

puts "Results in C extension"
puts "----------------------"
require 'ibm_db'
str = "insert into woods (name) values ('GÜHRING文')"

conn = IBM_DB.connect 'DRIVER={IBM DB2 ODBC
DRIVER};DATABASE=devdb;HOSTNAME=9.124.159.74;PORT=50000;PROTOCOL=TCPIP;UID=db2admin;PWD=db2admin;','',''
stmt = IBM_DB.exec conn, str
IBM_DB.close conn

print "----------------------\n\n"

puts "Results in Ruby script"
puts "----------------------"

puts "str.length is :#{str.length}"
puts "str.bytesize: #{str.bytesize}"
puts "**Forcing encoding**"
str1 = str.force_encoding("UTF-16LE")
puts "str.length is :#{str1.length}"
puts "str.bytesize: #{str1.bytesize}"
======================================

In the script above, IBM_DB is the C extension module. The database call
has nothing to do with the Unicode API usage; I have just reused the
module to try out the Unicode support.

The snippet in C extension that uses the unicode functions is as
below:

======================================
VALUE ibm_db_exec(int argc, VALUE *argv, VALUE self) {
  VALUE connection, stmt, options, stmt_ucs2;

  rb_scan_args(argc, argv, "21", &connection, &stmt, &options);
  if (!NIL_P(stmt)) {
    rb_encoding *enc_received;
    rb_encoding *ucs2_enc = rb_enc_find("UTF-16LE");
    rb_encoding *ucs4_enc = rb_enc_find("UTF-32LE");

    enc_received = rb_enc_from_index(ENCODING_GET(stmt));

    printf("\nString in received format: %s\n", RSTRING_PTR(stmt));
    /* rb_str_length returns a VALUE (a tagged Fixnum), so %d prints its
       raw representation rather than the character count */
    printf("\nrb_str_length is: %d\n", rb_str_length(stmt));
    printf("\nRSTRING_LEN is: %ld\n", (long)RSTRING_LEN(stmt));
    printf("\nEncoding format received: %s\n", enc_received->name);

    stmt_ucs2 = rb_str_export_to_enc(stmt, ucs2_enc);

    printf("\nString in utf16 format: %s\n", RSTRING_PTR(stmt_ucs2));
    printf("\nrb_str_length is: %d\n", rb_str_length(stmt_ucs2));
    printf("\nRSTRING_LEN is: %ld\n", (long)RSTRING_LEN(stmt_ucs2));
    printf("\nEncoding after conversion: %s\n", ucs2_enc->name);
  }
  return Qnil;
}

======================================

Running the above Ruby script produces the following output:

======================================

Results in C extension
----------------------

String in received format: insert into woods (name) values
('GÃHRINGæ')

rb_str_length is: 89

RSTRING_LEN is: 47

Encoding format received: UTF-8

String in utf16 format: i   # expected, since printf stops at the first NUL byte of the UTF-16 data

rb_str_length is: 89

RSTRING_LEN is: 88

Encoding after conversion: UTF-16LE
----------------------

Results in Ruby script
----------------------
str.length is :44
str.bytesize: 47
**Forcing encoding**
str.length is :24
str.bytesize: 47

======================================

I am not sure why there is a difference in string length between the
original string [44] (UTF-8) and the string after changing the encoding
[24] (UTF-16LE). The same happens in the output from the C extension:
the bytesize and the length are nearly equal (+1 or -1), and the length
differs between encoding formats.

Could you tell me what I am doing wrong?
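For reference, here is a reduced, driver-free sketch of the discrepancy;
my (possibly wrong) understanding is that force_encoding merely relabels
the existing bytes, while encode actually transcodes them:

```ruby
# encoding: utf-8
str = "GÜHRING文"                            # 8 characters, 11 bytes in UTF-8

forced = str.dup.force_encoding("UTF-16LE")  # relabel only: bytes unchanged
p forced.bytesize                            # still 11 (odd, so invalid UTF-16LE)
p forced.valid_encoding?                     # false

encoded = str.encode("UTF-16LE")             # transcode: new byte representation
p encoded.length                             # 8 characters again
p encoded.bytesize                           # 16 bytes (2 per BMP character)
```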

Along with this, is there any API I can call in the C extension to check
whether a given string is in a particular encoding, or should I use
rb_enc_from_index, read the struct's name member, and make the
determination myself in the extension?

Thanks

Praveen