Convert UTF-8 to PHP defines [General]

From: Ashley Sheridan on 27 May 2010 12:13

On Thu, 2010-05-27 at 12:08 -0400, Adam Richardson wrote:

> On Thu, May 27, 2010 at 9:45 AM, Guus Ellenkamp
> <Ellenkamp_Guus(a)hotmail.com> wrote:
>
> > Thanks, but are you sure of that? I did some research a while ago and found
> > that officially PHP files should be ASCII and not have any specific
> > character encoding. I believe it will work anyhow (I did not try this one),
> > but I would like to stick with the standards.
> >
> > "Ashley Sheridan" <ash(a)ashleysheridan.co.uk> wrote in message
> > news:1274883714.2202.228.camel(a)localhost...
> > > On Wed, 2010-05-26 at 22:20 +0800, Guus Ellenkamp wrote:
> > >
> > >> We use PHP defines for defining text in different languages. As far as I
> > >> know PHP files are supposed to be ASCII, not UTF-8 or something like
> > >> that. What I want to make is a conversion program that would convert a
> > >> given UTF-8 file with the format
> > >>
> > >> definetext1=this is a text in random UTF-8, probably arabic or similar text
> > >> definetext2=this is another text in random UTF-8, probably arabic or similar text
> > >>
> > >> into a file with the following defines
> > >>
> > >> define('definetext1',chr(<t_value>).chr(<h_value>).chr(<i_value>)...chr(<x_value>).chr(<t_value>));
> > >> define('definetext2',chr(<t_value>).chr(<h_value>).chr(<i_value>)...chr(<x_value>).chr(<t_value>));
> > >>
> > >> Not sure if I'm using the correct chr/ord function, but I hope the above
> > >> is clear enough to show what I'm looking for. Basically the output file
> > >> should be ASCII and not contain any UTF-8.
> > >>
> > >> Any advice? htmlspecialchars() did not seem to work for the Vietnamese
> > >> text I tried to convert, so something seems to go wrong with just reading
> > >> an array of strings, converting the strings and putting them in defines.
> > >>
> > >
> > >
> > > PHP files can contain UTF-8, and in fact that is the preference of most
> > > developers I know of.
> > >
> > > Thanks,
> > > Ash
> > > http://www.ashleysheridan.co.uk
> >
> Because the lower range of UTF-8 matches the ASCII character set
> (intentionally, by design), you'll be able to use UTF-8 for PHP files without
> problems (i.e., 7-bit ASCII characters have the same encoding in UTF-8).
> http://www.cl.cam.ac.uk/~mgk25/unicode.html
>
> However, if you were to use any of the multibyte characters of UTF-8 in a
> PHP file, you could run into some trouble. I use UTF-8 for most of my PHP
> files, but I've been sticking to the ASCII subset exclusively.
>
> Adam
>

I don't use the higher range of characters often, but I do sometimes use
them for things like the graphical glyphs (½, etc.). I know I could do
those with regular text and the Wingdings font, but that's not available
on every computer, and breaks the semantic meaning behind the glyphs.

Thanks,
Ash
http://www.ashleysheridan.co.uk
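
For reference, the conversion asked about at the top of this thread could be sketched roughly as follows. This is a minimal sketch, not code from any poster: the function name lineToDefine is hypothetical, it assumes one "name=value" pair per line with a plain ASCII constant name, and it emits "\xNN" escapes in a double-quoted string rather than a chain of chr() calls (both give the same bytes at runtime, but the escapes are more compact). The generated file contains only 7-bit ASCII.

```php
<?php
// Hypothetical helper (not from the thread): turn one "name=value" line of
// UTF-8 text into a define() statement written entirely in 7-bit ASCII.
function lineToDefine(string $line): ?string
{
    $line = rtrim($line, "\r\n");
    $pos = strpos($line, '=');
    if ($pos === false) {
        return null; // no "name=value" separator
    }
    $name  = substr($line, 0, $pos);
    $value = substr($line, $pos + 1);

    // Assumption: constant names are plain ASCII identifiers.
    if (preg_match('/^[A-Za-z_][A-Za-z0-9_]*$/', $name) !== 1) {
        return null;
    }

    // Escape byte by byte: printable ASCII stays readable; every other byte
    // (including each byte of a multibyte UTF-8 sequence) and the characters
    // that are special inside double quotes become \xNN.
    $escaped = '';
    $length = strlen($value);
    for ($i = 0; $i < $length; $i++) {
        $char = $value[$i];
        $byte = ord($char);
        $printable = $byte >= 0x20 && $byte <= 0x7E;
        if ($printable && strpos('"\\$', $char) === false) {
            $escaped .= $char;
        } else {
            $escaped .= sprintf('\\x%02X', $byte);
        }
    }

    return "define('" . $name . "', \"" . $escaped . "\");";
}

$lines = [
    "definetext1=plain text",
    "definetext2=\xC2\xBD cup", // the bytes C2 BD are U+00BD (one half) in UTF-8
];
foreach ($lines as $line) {
    echo lineToDefine($line), "\n";
}
// Prints:
// define('definetext1', "plain text");
// define('definetext2', "\xC2\xBD cup");
```

Values that wrap onto a second line are not handled by this sketch; each input line is treated on its own.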

From: tedd on 27 May 2010 12:31

At 5:13 PM +0100 5/27/10, Ashley Sheridan wrote:
>
>I don't use the higher range of characters often, but I do sometimes use
>them for things like the graphical glyphs (½, etc.). I know I could do
>those with regular text and the Wingdings font, but that's not available
>on every computer, and breaks the semantic meaning behind the glyphs.
>
>Thanks,
>Ash

Ash:

I read briefly on the css-discuss list that there
is a movement to "force" the download of fonts
(i.e., char sets) to make layouts work. Apparently
some browsers allow for that, but I have not read
up on it and I may have the wrong impression.

With the exception of "evil" fonts, it seemed like a good idea.

Cheers,

tedd
--
-------
http://sperling.com http://ancientstones.com http://earthstones.com

From: "Bob McConnell" on 27 May 2010 14:06

From: Ashley Sheridan

> I don't use the higher range of characters often, but I do sometimes use
> them for things like the graphical glyphs (½, etc.). I know I could do
> those with regular text and the Wingdings font, but that's not available
> on every computer, and breaks the semantic meaning behind the glyphs.

What higher range? ASCII only defined 128 values, the bottom 32 being control characters that don't print. Anything outside of that is not ASCII, but a proprietary extension. In particular, the glyphs usually associated with 0-31 and 128-255 are IBM-specific and not guaranteed to be present outside of their original video ROM. So only the first 128 characters map directly into UTF-8.

Bob McConnell

Ref: pp 25-29 The Programmer's PC Sourcebook, 1988, Thom Hogan, Microsoft Press
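
The 0-127 boundary described above is easy to test for in PHP. The following is a minimal sketch; the function name isPureAscii is hypothetical and not from any poster. It treats the string as raw bytes and reports whether any byte falls outside the 7-bit ASCII range.

```php
<?php
// Hypothetical helper: true when every byte of $text is in the 7-bit ASCII
// range 0-127, false as soon as one byte is 128 or above.
function isPureAscii(string $text): bool
{
    // Without the /u modifier PCRE matches byte by byte.
    return preg_match('/[^\x00-\x7F]/', $text) === 0;
}

var_dump(isPureAscii("plain text"));   // bool(true)
var_dump(isPureAscii("\xC2\xBD cup")); // bool(false): C2 BD is U+00BD in UTF-8
```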

From: Ashley Sheridan on 27 May 2010 14:11

On Thu, 2010-05-27 at 14:06 -0400, Bob McConnell wrote:

> What higher range? ASCII only defined 128 values, the bottom 32 being control characters that don't print. Anything outside of that is not ASCII, but a proprietary extension. In particular, the glyphs usually associated with 0-31 and 128-255 are IBM-specific and not guaranteed to be present outside of their original video ROM. So only the first 128 characters map directly into UTF-8.
>
> Bob McConnell
>
> Ref: pp 25-29 The Programmer's PC Sourcebook, 1988, Thom Hogan, Microsoft Press

The higher range of UTF-8 characters that don't map to ASCII values.

Thanks,
Ash
http://www.ashleysheridan.co.uk

From: tedd on 27 May 2010 15:13

At 7:11 PM +0100 5/27/10, Ashley Sheridan wrote:
>On Thu, 2010-05-27 at 14:06 -0400, Bob McConnell wrote:
> > From: Ashley Sheridan
> > > I don't use the higher range of characters often, but I do sometimes use
> > > them for things like the graphical glyphs (½, etc.). I know I could do
> > > those with regular text and the Wingdings font, but that's not available
> > > on every computer, and breaks the semantic meaning behind the glyphs.
>>
>> What higher range? ASCII only defined 128 values, the bottom 32 being
>> control characters that don't print. Anything outside of that is not
>> ASCII, but a proprietary extension. In particular, the glyphs usually
>> associated with 0-31 and 128-255 are IBM-specific and not guaranteed to
>> be present outside of their original video ROM. So only the first 128
>> characters map directly into UTF-8.
>>
>> Bob McConnell
>>
>> Ref: pp 25-29 The Programmer's PC Sourcebook, 1988, Thom Hogan,
>> Microsoft Press
>
>
>The higher range of utf8 characters that don't map to ascii values.
>
>Thanks,
>Ash

Bob:

I understood what Ash was referring to with his
"higher range" statement, but his second
statement was somewhat confusing.

ASCII is defined as characters having a value of
0-127 DEC (00-7F HEX). The "higher range" of
128-255 DEC (80-FF HEX) has been loosely
characterized as "extended ASCII" but has never
been officially declared such. Both M$ and Apple
have their own characters appearing in that range
and have used different characters for different
things -- thus problems arose in using either. I
do not know if the problem was ever resolved.
It's probably best to never use such characters.

The Unicode database uses the same lower
character values (i.e., "code points") as does
ASCII, namely 0-127, and thus UTF-8 (an 8-bit
variable-width encoding) is really a super-set
which includes the sub-set of ASCII.
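
The variable width is easy to see by looking at the bytes. The following is a minimal PHP sketch; the function name describeBytes is hypothetical, and the characters are written as escaped bytes so the source itself stays pure ASCII. Code points 0-127 take one byte, identical to ASCII, and higher code points take two or more bytes, each of them 0x80 or above.

```php
<?php
// Hypothetical helper: report how many bytes a UTF-8 string occupies and
// show those bytes in hexadecimal.
function describeBytes(string $text): string
{
    return strlen($text) . ' byte(s): ' . bin2hex($text);
}

echo describeBytes("A"), "\n";            // U+0041 -> 1 byte(s): 41
echo describeBytes("\xC2\xBD"), "\n";     // U+00BD -> 2 byte(s): c2bd
echo describeBytes("\xE2\x9C\x93"), "\n"; // U+2713 -> 3 byte(s): e29c93
```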

The "Wingdings" font that Ash refers to
corresponds roughly to the "Dingbats" block in
Unicode, as shown here:

http://www.unicode.org/charts/PDF/U2700.pdf

These are real characters that can be used for
all sorts of things including URLs, for example:

http://xn--gci.com

Please forgive the Punycode URL, but IE does not
recognize "other than ASCII" characters in URLs,
whereas Safari will show the URL correctly.
Clearly, Safari has the upper hand in resolving
"other than English" issues -- perhaps that's why
Apple's overseas profits last year exceeded its
domestic ones -- but I digress.

Using UTF-8 encoding in everything (web and
PHP) should present far fewer problems globally
than trying to fight it.

Here are some references that may help:

[1] <http://webstandardsgroup.org/>
[2] <http://www.w3.org/People/Ishida/>
[3] <http://www.w3.org/International>
[4] <http://shiflett.org/archive/177>
[5] <http://en.wikipedia.org/wiki/Universal_character_set>
[6] <http://www.unicode.org/>

Cheers,

tedd

--
-------
http://sperling.com http://ancientstones.com http://earthstones.com
