From: aldnin on
> Please configure your email client so we don't receive 5 copies of your
> mail.

I've just fixed that; it shouldn't happen again.

> This indicates that PHP not using UTF-8. That output is typical of
> UTF-8 output as Latin characters.

Well, maybe the output is not correct. When I run the PHP script on the console (CLI), the content comes out in the wrong charset, but that's not the problem: running it through utf8_decode() gives me output in the right charset.
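
For readers following along, the "UTF-8 output shown as Latin characters" effect can be reproduced in a few lines. This is only a sketch in Python (not PHP, and not code from this thread), and it assumes the console interprets bytes as ISO-8859-1 (Latin-1):

```python
# Illustration only: UTF-8 bytes misread as Latin-1.
text = "lacarrière"
utf8_bytes = text.encode("utf-8")        # è is stored as the two bytes 0xC3 0xA8
misread = utf8_bytes.decode("latin-1")   # a Latin-1 reader shows one character per byte

print(utf8_bytes.hex(" "))
print(misread)                           # the two-character garble seen on the console
```

On a Latin-1 console, PHP's utf8_decode() hides this because it converts the UTF-8 bytes back to single Latin-1 bytes before they are printed.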

> Not true, it only indicates that phpPgAdmin is is configured to handle
> UTF-8 correctly.

Well, I searched all the source code of phpPgAdmin for charsets and I found:

"echo "\t<meta http-equiv=\"Content-Type\" content=\"text/html; charset={$data->codemap[$dbEncoding]}\" />\r\n";"

So this means phpPgAdmin sets the output charset to the charset used by the database it is connected to. But that's still not the problem, because I also know how to fix charset output in browsers.

> Once again indicating your data needs to be converted from some other
> character set.

It's already converted to be UTF-8 compatible when it is fetched from the other resources.

> I had similar problems getting PHP to work with UTF-8 and MySQL. Many
> of PHP's function are not multibyte aware and assume a Latin character set.
> What, if any, output buffering are you using? What is your
> default_charset set to?

Well, I've set default_charset to UTF8; before that it was "" (empty). The output on the console (CLI) and the problem itself are still the same after the change, so this is not the problem. I also don't need proper console output without utf8_decode(): if I want proper output there I just do a decode, as I do when I want it displayed properly in the browser.

Maybe a clearer explanation of the problem:

I fetch something from the database, which looks like "lacarrière" when I output it in PHP (let's not get confused by PHP's output). Then I fetch something from another resource, which looks like "lacarrière". When I compare both strings in PHP, it tells me they are "not equal".

So I HAVE TO do either a utf8_encode() on the string from the other resource OR a utf8_decode() on the string from the database for them to compare as "equal".
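
The mismatch described above is a byte-level one, which a short sketch can make visible. The code is Python rather than PHP and is not from the thread. It assumes the database returns UTF-8 and the other source sends ISO-8859-1, which fits the utf8_encode()/utf8_decode() workaround but is not confirmed here. The helper names are made up for illustration:

```python
# Illustration only; the encodings of the two sources are assumptions.
word = "lacarrière"
db_string = word.encode("utf-8")      # what a UTF8 database would hand back
api_string = word.encode("latin-1")   # assumed encoding of the other source

def latin1_to_utf8(data: bytes) -> bytes:
    """Rough counterpart of PHP's utf8_encode()."""
    return data.decode("latin-1").encode("utf-8")

def utf8_to_latin1(data: bytes) -> bytes:
    """Rough counterpart of PHP's utf8_decode(); raises on unmappable characters."""
    return data.decode("utf-8").encode("latin-1")

print(db_string == api_string)                    # False: 11 bytes vs 10 bytes
print(db_string == latin1_to_utf8(api_string))    # True
print(utf8_to_latin1(db_string) == api_string)    # True
```

Either conversion makes the comparison succeed. Converting everything to UTF-8 once, at the point where the data enters the program, avoids repeating the conversion at every comparison.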

...and THIS means a lot more code in my classes.

Hint: The other resource is a socket connection (API) to another server.

The problem is quite simple, I think: everything coming from the database is UTF-8 encoded at the byte level and has to be UTF-8 decoded before you can work with it properly.

default_charset seems to affect only the output buffer, so the only real fix would be a mechanism that tells PHP to treat all strings as UTF-8 encoded bytes. That would take a lot more resources, so I understand that this is not a solution.

So the only solutions could be:

a) Decode and encode the UTF-8 data properly, taking care to check whether content is UTF-8 encoded so that it is decoded before being used with other strings.

b) A mechanism to tell the pg functions in PHP to decode all data that is UTF-8 encoded. The ADODB layer seems to do that properly, but as far as I can see the pg functions don't.

You can use this to reproduce it:

1. Create a table in Postgres, in a database initialized as UTF8, and insert something like "lacarrière" into it.

2. Check the normal output with psql. You should get either "lacarrière" or "lacarrière", so you can be sure it was inserted correctly.

3. Write a script which fetches the string from the database into $dbString.

4. Set a string $phpString = "lacarrière";

5. Compare both strings with "==" - you'll get "false"

Another hint:

Try sending "select 'lacarrière' as test;" with pg_query to any Postgres database. You'll get an error. If not... well, then I'm wrong and I've set up PHP incorrectly for handling UTF-8.

If you send "select '".utf8_encode('lacarrière')."' as test;" to your database, this should work.

This also means that $phpString is NOT EQUAL to the result you would get from "select '".utf8_encode('lacarrière')."' as test;". You would need to compare it to utf8_decode($dbString) for them to be EQUAL.
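
The query failure can be checked without a database. This sketch is Python, not PHP, and not code from the thread. It assumes the script file and its string literal are saved as ISO-8859-1, and first_invalid_utf8 is a hypothetical helper name. It locates the first bytes that are not valid UTF-8, which are the same three bytes PostgreSQL reports in the error quoted in this thread (0xe87265):

```python
# Illustration only: find the bytes a UTF8 server would reject.
def first_invalid_utf8(data: bytes, width: int = 3):
    """Return `width` bytes starting at the first invalid UTF-8 position, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return data[exc.start:exc.start + width]
    return None

query = "select 'lacarrière' as test;"
latin1_query = query.encode("latin-1")   # assumed: what the script actually sends
utf8_query = query.encode("utf-8")       # what utf8_encode() would produce from it

print(first_invalid_utf8(latin1_query).hex())   # e87265
print(first_invalid_utf8(utf8_query))           # None
```

In Latin-1, è is the single byte 0xE8. UTF-8 reads 0xE8 as the start of a three-byte sequence, but the following bytes "r" and "e" (0x72, 0x65) are not continuation bytes, so the sequence is rejected.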
From: aldnin on
> You did not answer the most important question. What, if any, output
> buffering are you using? Are you using the mbstring module? If so, is
> it set to overload the old string functions?

Well, I checked for the multibyte string functions: mbstring is enabled and was configured with "=all" at compile time.

After performing the query with pg_query, fetching the result with pg_fetch_all and putting the utf8 string into $dbString I tried to detect the encoding with:

mb_detect_encoding($dbSring)

It tells me:
ASCII

The content of $dbString is:
lacarrière

I overloaded the mbstring variables with:
mbstring.func_overload = 6
Setting it to "7" won't let me even echo something else.

mbstring.encoding_translation = On
mbstring.internal_encoding = UTF8

That's it, rest is default.

Is it possible for mbstring to overload the pg-functions I need?
From: aldnin on
Thanks a lot. What you wrote is really necessary for handling these problems in the future.

The reason I was looking for a faster solution is that I have to handle huge amounts of data which is sometimes UTF-8 and sometimes not... you understand what I mean? ;-)
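
For data that is "sometimes UTF-8 and sometimes not", one common approach is to normalize at the boundary: try UTF-8 first and fall back to a single legacy encoding. The sketch below is Python, not PHP, and is not a method proposed in this thread. The function name to_text and the ISO-8859-1 fallback are assumptions:

```python
# Illustration only: normalize mixed input to one text representation.
def to_text(raw: bytes, fallback: str = "latin-1") -> str:
    """Decode as UTF-8 when valid, otherwise assume the fallback encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)

from_db = "lacarrière".encode("utf-8")
from_api = "lacarrière".encode("latin-1")

print(to_text(from_db) == to_text(from_api))   # True
```

This is a heuristic. Some Latin-1 byte sequences are also valid UTF-8 and would be decoded the wrong way, so knowing each source's encoding up front remains the safer choice.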


Bruno Lustosa wrote:
> On 7/21/07, aldnin <aldnin(a)yahoo.de> wrote:
>> When I try to send this query (select 'lacarrière' as test;) to a UTF8
>> initialized pgsql-database (8.2.4) from PHP 5.2.3 I get this error:
>>
>> ERROR: invalid byte sequence for encoding "UTF8": 0xe87265
>
> Short answer: start using utf-8 for just everything, and your problems
> will be gone.
>
> Long explanation:
> This is usually the case when you get data from a form and put it in
> the database, and the two aren't using the same encoding.
> I guess your pg connection is using unicode (so the db expects unicode
> input), and your html is set to something else. To fix this, you have
> two choices:
>
> 1-Run utf8_encode() on the input from your forms; or
> 2-Set all your html pages to use utf-8 encoding.
>
> IMHO, option 2 is the way to go. I've been using utf-8 for everything
> for quite some time, and has solved all my problems dealing with
> accents, and so on.
> You will need:
> - All your HTML files encoded to utf-8 (quite easy with iconv, if you
> are using Linux);
> - Add a "Content-type: text/html; charset=utf-8" to all your pages.
> This is easily done using PHP's header() function in a file included
> by all your scripts.
>
> This way, the pages will be unicode, any data entered will be posted
> as unicode, and you will have no problems sending them to a database
> that uses unicode.
> Forget the <meta> tag that sets the encoding. It's only used in case
> the server doesn't send a Content-type header, which isn't the case
> normally. By default, I think at least apache sends the content-type
> as iso8859-1.
>
From: aldnin on
> output_handler=mb_output_handler

This fixed all output to the browser, so I no longer need utf8_decode() there, thanks.

> Setting it to "7" won't let me even echo something else.

Right, it's strange, but true... :-(

> mbstring.detect_order = UTF-8,eucjp-win,sjis-win

That solved the problem of mb_detect_encoding() returning ASCII. Now it says "UTF-8", BUT only when I run the script on the console; in the browser it still tells me ASCII. Well, not important.

But the comparison test still says "not equal", so utf8_decode() is still needed when data comes from the database. The result is the same in the browser and on the console (even though the console shows UTF-8 as detected).

> The other thing to be wary of, is output to the console. Some OSes do
> not support unicode in the console. So unless you're certain yours does,
> I wouldn't use it as a test.

I know, that's why I use the comparison test ;-)

Niel wrote:
> Hi
>
> You still haven't answered whether you're using any output handler, and
> if so which one. I use
>
> output_handler=mb_output_handler
>
>> I overloaded the mbstring variables with:
>> mbstring.func_overload = 6
>> Setting it to "7" won't let me even echo something else.
>
> Very strange, the only additional function overloaded is mail() and that
> shouldn't stop you using echo.
>
> As well as setting the internal encoding and enabling it with
> mbstring.encoding_translation = On
> mbstring.internal_encoding = UTF-8
>
> I would also use:
> mbstring.language = English
> ; or German in your case
> mbstring.detect_order = UTF-8,eucjp-win,sjis-win
> mbstring.http_input = UTF-8,SJIS,EUC-JP
> mbstring.http_output = UTF-8
>
>> Is it possible for mbstring to overload the pg-functions I need?
> No, and it shouldn't be needed. Those functions should be UTF-8 enabled
> in order to communicate with the database and supply the correct data
>
> You're still referring to 'UTF8' which as I pointed out isn't the
> official name of the encoding system. I have no idea if PHP will
> recognise it, but to be safe I suggest you use the official 'UTF-8'
> (hyphen between letters and number) in case it's causing problems.
> The other thing to be wary of, is output to the console. Some OSes do
> not support unicode in the console. So unless you're certain yours does,
> I wouldn't use it as a test.
>
> --
> Niel Archer