Convert UTF-8 to PHP defines [General]

From: Richard Quadling on 28 May 2010 03:53

On 28 May 2010 04:47, Guus Ellenkamp <Ellenkamp_Guus(a)hotmail.com> wrote:
> And I need(ed) this stuff especially for non-ASCII characters like Chinese,
> Arabic and stuff :)
>
> "Ashley Sheridan" <ash(a)ashleysheridan.co.uk> wrote in message
> news:1274976794.2202.274.camel(a)localhost...
> On Thu, 2010-05-27 at 12:08 -0400, Adam Richardson wrote:
>
>> On Thu, May 27, 2010 at 9:45 AM, Guus Ellenkamp
>> <Ellenkamp_Guus(a)hotmail.com>wrote:
>>
>> > Thanks, but are you sure of that? I did some research a while ago and
>> > found
>> > that officially PHP files should be ascii and not have any specific
>> > character encoding. I believe it will work anyhow (did not try this
>> > one),
>> > but would like to stick with the standards.
>> >
>> > "Ashley Sheridan" <ash(a)ashleysheridan.co.uk> wrote in message
>> > news:1274883714.2202.228.camel(a)localhost...
>> > > On Wed, 2010-05-26 at 22:20 +0800, Guus Ellenkamp wrote:
>> > >
>> > >> We use PHP defines for defining text in different languages. As far
>> > >> as I
>> > >> know PHP files are supposed to be ASCII, not UTF-8 or something like
>> > >> that.
>> > >> What I want to make is a conversion program that would convert a
>> > >> given
>> > >> UTF-8
>> > >> file with the format
>> > >>
>> > >> definetext1=this is a text in random UTF-8, probably arabic or
>> > >> similar
>> > >> text
>> > >> definetext2=this is another text in random UTF-8, probably arabic or
>> > >> similar
>> > >> text
>> > >>
>> > >> into a file with the following defines
>> > >>
>> > >>
>> > define('definetext1',chr(<t_value>).chr(<h_value>).chr(<i_value>)...chr(<x_value>).chr(<t_value>));
>> > >>
>> > define('definetext2',chr(<t_value>).chr(<h_value>).chr(<i_value>)...chr(<x_value>).chr(<t_value>));
>> > >>
>> > >> Not sure if I'm using the correct chr/ord function, but I hope the
>> > >> above
>> > >> is
>> > >> clear enough to make clear what I'm looking for. Basically the output
>> > >> file
>> > >> should be ascii and not contain any utf-8.
>> > >>
>> > >> Any advice? The html_special_chars did not seem to work for
>> > >> Vietnamese
>> > >> text
>> > >> I tried to convert, so something seems to go wrong with just reading
>> > >> an
>> > >> array of strings and converting the strings and putting them in
>> > >> defines.
>> > >>
>> > >>
>> > >>
>> > >
>> > >
>> > > PHP files can contain UTF-8, and in fact that is the preference of most
>> > > developers I know of.
>> > >
>> > > Thanks,
>> > > Ash
>> > > http://www.ashleysheridan.co.uk
>> > >
>> > >
>> > >
>> >
>> >
>> >
>> > --
>> > PHP General Mailing List (http://www.php.net/)
>> > To unsubscribe, visit: http://www.php.net/unsub.php
>> >
>> >
>> Because the lower range of UTF-8 matches the ascii character set
>> (intentionally by design), you'll be able to use UTF-8 for PHP files
>> without
>> problem (i.e., ascii 7-bit chars have same encoding in UTF-8.)
>> http://www.cl.cam.ac.uk/~mgk25/unicode.html
>>
>> However, if you were to use any of the multibyte characters of UTF-8 in a
>> PHP file, you could run into some trouble. I use UTF-8 for most of my
>> PHP
>> files, but I've been sticking to the ASCII subset exclusively.
>>
>> Adam
>>
>
>
> I don't use the higher range of characters often, but I do sometimes use
> them for things like the graphical glyphs (½??, etc.). I know I could do
> those with regular text and the Wingdings font, but that's not available
> on every computer, and breaks the semantic meaning behind the glyphs.
>
> Thanks,
> Ash
> http://www.ashleysheridan.co.uk
>

Do you mean ...

<?php
echo '早晨好';
?>

If you cut and paste that into your editor, make sure the file is
saved as UTF-8 and that your font includes these characters. Otherwise
you will see the font's unknown-symbol glyph rather than the correct ones.

If your font doesn't have the symbols, it doesn't affect the code. The
editor is only displaying the code. It doesn't alter the code.
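
For the conversion originally asked about (a UTF-8 file of name=text lines turned into ASCII-only defines), here is a minimal sketch. It is not anything posted in the thread. The function name utf8_line_to_define is hypothetical, and instead of a chain of chr() calls it writes each non-ASCII byte as a \xNN escape inside a double-quoted string, which PHP expands back to the same bytes. It assumes the define names are plain ASCII identifiers.

```php
<?php
// Hypothetical helper: turn one "name=text" line (UTF-8 bytes) into an
// ASCII-only define() statement. Returns null for lines it cannot handle.
function utf8_line_to_define($line)
{
    // Split on the first "=" only; the text itself may contain "=".
    $parts = explode('=', rtrim($line, "\r\n"), 2);
    if (count($parts) !== 2) {
        return null;
    }
    list($name, $text) = $parts;

    // Assumption: constant names are plain ASCII identifiers.
    if (preg_match('/^[A-Za-z_][A-Za-z0-9_]*$/', $name) !== 1) {
        return null;
    }

    $literal = '';
    $length = strlen($text); // strlen() counts bytes, which is what we need
    for ($i = 0; $i < $length; $i++) {
        $char = $text[$i];
        $byte = ord($char);
        if ($byte < 0x20 || $byte > 0x7E) {
            // Control and non-ASCII bytes become two-digit \xNN escapes.
            $literal .= sprintf('\\x%02X', $byte);
        } elseif ($char === '"' || $char === '\\' || $char === '$') {
            // Characters that are special inside a double-quoted PHP string.
            $literal .= '\\' . $char;
        } else {
            $literal .= $char;
        }
    }

    return "define('" . $name . "', \"" . $literal . "\");";
}

// "café" in UTF-8: the é is the two bytes C3 A9.
echo utf8_line_to_define("greeting=caf\xC3\xA9"), "\n";
// prints: define('greeting', "caf\xC3\xA9");
```

The generated file contains only printable ASCII, yet the constant still holds the original UTF-8 bytes at run time. As noted above, this is optional: PHP reads a UTF-8 source file just as well.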

Richard.

--
-----
Richard Quadling
"Standing on the shoulders of some very clever giants!"
EE : http://www.experts-exchange.com/M_248814.html
EE4Free : http://www.experts-exchange.com/becomeAnExpert.jsp
Zend Certified Engineer : http://zend.com/zce.php?c=ZEND002498&r=213474731
ZOPA : http://uk.zopa.com/member/RQuadling

From: tedd on 28 May 2010 11:13

Bob wrote:

>>The real question is whether unicode is even relevant now that the UTF
>>series is available.

Ashley answered:

>Bob, UTF is unicode (Unicode Transformation Format)

Yes, Ashley is correct. UTF-8 is Unicode, as are UTF-16 and UTF-32,
which each use a different number of bytes per code point. Both
UTF-8 and UTF-16 are variable length whereas UTF-32 is a fixed length
of four bytes per code point.
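
To make the byte counts concrete, here is a small sketch that computes how many bytes one code point occupies in each encoding form. The three function names are hypothetical, and the thresholds follow the current definitions, in which UTF-8 uses at most four bytes and UTF-16 uses a surrogate pair above U+FFFF.

```php
<?php
// Bytes needed to encode a single Unicode code point (U+0000..U+10FFFF).
function utf8_length($cp)
{
    if ($cp < 0x80) {
        return 1;   // ASCII range
    }
    if ($cp < 0x800) {
        return 2;
    }
    if ($cp < 0x10000) {
        return 3;
    }
    return 4;       // supplementary planes
}

function utf16_length($cp)
{
    // One 16-bit unit inside the Basic Multilingual Plane, else a surrogate pair.
    return $cp < 0x10000 ? 2 : 4;
}

function utf32_length($cp)
{
    return 4;       // always one 32-bit unit
}

$samples = array(
    'U+0041 LATIN CAPITAL LETTER A' => 0x41,
    'U+00E9 LATIN SMALL LETTER E WITH ACUTE' => 0xE9,
    'U+4F60 CJK ideograph' => 0x4F60,
    'U+10348 GOTHIC LETTER HWAIR' => 0x10348,
);

foreach ($samples as $label => $cp) {
    printf(
        "%-40s UTF-8: %d  UTF-16: %d  UTF-32: %d\n",
        $label,
        utf8_length($cp),
        utf16_length($cp),
        utf32_length($cp)
    );
}
```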

As is my understanding, UTF-8 will accommodate all the languages
(glyphs) of the world and then some. It will be a while before we
need UTF-16 or UTF-32 but those are just larger supersets.

In any event, I always use UTF-8 in all my encoding.

Cheers,

tedd
--
-------
http://sperling.com http://ancientstones.com http://earthstones.com

From: tedd on 28 May 2010 11:39

At 8:33 PM +0100 5/27/10, Ashley Sheridan wrote:
>Tedd, does that URL actually go anywhere, as I got nothing when I
>tried visiting it, both the actual URL and the punycode version.

Ash:

Try it again (it worked for me).

In any event, the link was supposed to be redirected to this site:

http://xn--fci.com

If you run Safari, then the url will be shown as a check-mark.

My most popular IDNS site is square-root dot com (option v):

http://xn--19g.com

The story about that site is on the web page -- you may read if interested.

The site receives over 150 unique Mac visitors per day and that
number keeps climbing -- I don't know why. For example, one day I had
over 800 visitors from Spain -- why???

Obviously, I'm trying to sell the domain (for 6 figures), but have
had no takers.

I can always get back into Macintosh software development and use the
site to sell my own apps -- that's an option I ponder whenever my
clients don't call me for a week.

Who knows what may happen.

Cheers,

tedd

PS: I have over a dozen IDNS domains including the Pharmaceutical
Icon, Yin-Yang Symbol, Sigma, Delta, and DOT dot com (option 8).

--
-------
http://sperling.com http://ancientstones.com http://earthstones.com

From: Nisse Engström on 28 May 2010 14:52

On Fri, 28 May 2010 11:13:35 -0400, tedd wrote:

> Bob wrote:
>
>>>The real question is whether unicode is even relevant now that the UTF
>>>series is available.
>
> Ashley answered:
>
>>Bob, UTF is unicode (Unicode Transformation Format)

Or more precisely, UTF-{8,16,32} are different ways to
serialize Unicode code points into sequences of octets,
which makes it possible to store and transmit Unicode
data.

> Yes, Ashley is correct. UTF-8 is Unicode, as are UTF-16 and UTF-32,
> which each use a different number of bytes per code point. Both
> UTF-8 and UTF-16 are variable length whereas UTF-32 is a fixed length
> of four bytes per code point.
>
> As is my understanding, UTF-8 will accommodate all the languages
> (glyphs) of the world and then some. It will be a while before we
> need UTF-16 or UTF-32 but those are just larger supersets.

*blink*

They are all capable of representing the full Unicode
range, which is restricted to U+0000 - U+10ffff.

The theoretical limits are:

UTF-8 [0 - 7fffffff]
UTF-16 [0 - 10ffff]
UTF-32 [0 - ffffffff]
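
Those limits can be derived from the bit layouts. The sketch below assumes a 64-bit PHP build (so that the shifts do not overflow) and the six-byte form of UTF-8 as originally specified in RFC 2279. The current definition, RFC 3629, stops at four bytes and U+10FFFF.

```php
<?php
// Original UTF-8 allowed up to 6 bytes: the lead byte 1111110x carries
// 1 payload bit and each of the 5 continuation bytes carries 6.
$utf8_bits = 1 + 5 * 6;                          // 31 bits
$utf8_max  = (1 << $utf8_bits) - 1;

// UTF-16: the highest surrogate pair is DBFF DFFF. Each half carries
// 10 bits, offset by 0x10000.
$high = 0xDBFF - 0xD800;                         // 0x3FF
$low  = 0xDFFF - 0xDC00;                         // 0x3FF
$utf16_max = 0x10000 + ($high << 10) + $low;

// UTF-32: one plain 32-bit unit per code point.
$utf32_max = (1 << 32) - 1;

printf("UTF-8  [0 - %x]\n", $utf8_max);          // UTF-8  [0 - 7fffffff]
printf("UTF-16 [0 - %x]\n", $utf16_max);         // UTF-16 [0 - 10ffff]
printf("UTF-32 [0 - %x]\n", $utf32_max);         // UTF-32 [0 - ffffffff]
```

UTF-16 is the form with the smallest ceiling, which is why the Unicode range as a whole is capped at U+10FFFF.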

Also, there are many, many, *many* more glyphs than
characters (code points) in the world. As an example,
www.fonts.com lists 165,125 fonts. Every one has a
*different* glyph for the character "A"...

/Nisse

From: tedd on 28 May 2010 16:52

At 8:52 PM +0200 5/28/10, Nisse Engström wrote:
>On Fri, 28 May 2010 11:13:35 -0400, tedd wrote:
>
> > As is my understanding, UTF-8 will accommodate all the languages
>> (glyphs) of the world and then some. It will be a while before we
>> need UTF-16 or UTF-32 but those are just larger supersets.
>
>*blink*
>
>They are all capable of representing the full Unicode
>range, which is restricted to U+0000 - U+10ffff.
>
>The theoretical limits are:
>
> UTF-8 [0 - 7fffffff]
> UTF-16 [0 - 10ffff]
> UTF-32 [0 - ffffffff]
>
>Also, there are many, many, *many* more glyphs than
>characters (code points) in the world. As an example,
>www.fonts.com lists 165,125 fonts. Every one has a
>*different* glyph for the character "A"...
>
>/Nisse

*blink* *blink*

As you say, UTF-8 has a range of 0 to 7FFFFFFF

Forgive me, but isn't that 2,147,483,647 (DEC) code points?

Please note that 165,125 * 48 (upper/lower case) is only 7,926,000
code points -- IF -- each letter of each font were to have its own
code point, which is not the case for Unicode.

Code points are assigned to specific char sets that belong to
specific language sets, such as English being assigned to the code
point range that is common with ASCII. From that, we can have as many
fonts as your software can handle. However, ASCII 65 DEC (41 HEX) or
code point 65 (41 HEX) is still tied to the letter "A" regardless of
whether it is set in Helvetica or Times. So, don't confuse code points with
fonts.
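
A quick way to see this from PHP, as a minimal sketch: the value reported for "A" comes from the bytes in the string, and nothing about the display font enters into it.

```php
<?php
// A code point identifies the character; fonts only decide how it is drawn.
$code_point = ord('A');            // numeric value of the single byte "A"
printf("decimal %d, hex %X\n", $code_point, $code_point); // decimal 65, hex 41

// Going the other way yields the same letter, whatever font displays it.
$letter = chr(0x41);
var_dump($letter === 'A');         // bool(true)
```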

If you spend some time looking at the numerous char sets that Unicode
offers you will see that just about every symbol known to man has
been cataloged -- even Klingon was considered. From Dingbats to
Architectural symbols, from simplified Chinese to traditional
Chinese, from Greek to Cherokee, from skull/cross-bones to yin/yang
symbol, every language in the world and glyph known to man has been
included -- a truly massive project.

IMO, it will be a while before we use up the whole range that Unicode
code points provide.

Cheers,

tedd

--
-------
http://sperling.com http://ancientstones.com http://earthstones.com
