From: Thomas 'PointedEars' Lahn on
johncoltrane wrote:

> [Thomas 'PointedEars' Lahn wrote:]
>> [johncoltrane wrote:]
>>> AFAIK JavaScript is supposed to be UTF-8 compatible.
>>
>> You know nonsense; partially because you don't know what JavaScript is,
>> partially because you don't know what UTF-8 is.
>>
>> ,-[ECMAScript Language Specification, Edition 5 Final Draft]
>> |
>> | A conforming implementation of this International standard shall
>> | interpret characters in conformance with the Unicode Standard, Version
>> | 3.0 or later and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the
>> | adopted encoding form, implementation level 3. If the adopted ISO/IEC
>> | 10646-1 subset is not otherwise specified, it is presumed to be the BMP
>> | subset, collection 300. If the adopted encoding form is not otherwise
>> | specified, it is presumed to be the UTF-16 encoding form.
>>
>> The key phrase here being "If the adopted encoding form is not otherwise
>> specified". See below.
>>
>>> You can even use japanese hiragana as variable names.
>>
>> That is a subset of a character set (Unicode), not an encoding (UTF-8).
>> Learn to understand the difference.
>
> I know the difference. It was an example: variable names in non-ASCII
> characters do work in... that mostly browser-centric scripting language.

But you were not talking about characters, you were referring to a character
encoding. You were also talking about JavaScript in an unspecific way.
That makes your assumption questionable, if not wrong.
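
To illustrate the distinction: that hiragana works in identifiers is a
property of the supported *character set*. Per ECMAScript (ed. 3 and 5), an
IdentifierName may contain any Unicode letter, regardless of which encoding
the source file is transported in. A minimal sketch:

```javascript
// ECMAScript identifiers may contain any Unicode letter (UnicodeLetter in
// the IdentifierName production), independent of the source file's encoding.
var ひらがな = "works";            // hiragana letters in an identifier
var \u3072\u3089 = "also works";  // the same category of letters, written
                                  // with Unicode escape sequences instead
console.log(ひらがな);             // "works"
```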

> Think of it as a preemptive illustration of your rebuttal.

You're funny.

>>> I just ran a few quick tests in Firefox with the factory default charset
>>
>> Nonsense. Obviously you don't know what "charset" means to begin with.
>>
>>> (iso-8859-1).
>>
>> That is a character encoding, and its being the *HTTP default* in reality
>> is heavily overrated. And there is *no* default value for the `charset'
>> attribute specified in HTML.
>
> Well, what I know is that when talking about HTML, the difference
> between "character set" and "encoding" is practically non-existent,
> both words being used (wrongly, I give you that) interchangeably.

Nonsense. The HTML Document Character Set is clearly specified and
implemented to be UCS (ISO/IEC 10646-1:1993, which is character-by-character
equivalent to Unicode 3.0).

As for the HTML encoding: there is *no* default character encoding, which is
the reason why the encoding should be declared for an interoperable HTML
document. (See below for why even US-ASCII should be declared.)

I gave you the reference already. Go read it.

> Also I was referring to the default settings of Firefox, here.

But Firefox is not the only browser that is Mozilla-based, and not the only
one that implements JavaScript[tm] (short of other browsers that implement
other ECMAScript implementations, which must be considered given the OP's
unspecific use of the term). There was also no mention of Firefox before
you named it (which is another indication that "JavaScript" was not supposed
to mean JavaScript[tm]). And the default may very well depend not only on
the browser, but also on its localization.

> HTML has only the "charset" attribute

No.

> and it's not supposed to accept "Unicode" or "Hiragana" or "Occidental" as
> value.

Because the name of a *character encoding* is needed. "Unicode",
"Hiragana", or "Occidental" are not; those are names of character sets or
names of subsets of character sets. Got it?

> We are left with "utf-8"

Nonsense.

> (the most widely used way of representing the full/most of the Unicode
> standard, including Hiragana)

Whether UTF-8 would be most widely used (which remained for you to show) was
largely irrelevant. There are at least three widely *supported* encodings
for Unicode characters: UTF-7, UTF-8, and UTF-16 (UTF-32 is specified, but I
would rate support for it below the three others). In IE/MSHTML, UTF-*7* is
the fallback encoding and so, given IE/MSHTML's current dominant market
share, could be considered the most widely supported character encoding on
the Web.

Regardless, each of the specified UTFs can represent *all* Unicode
characters; what is often missing is only a renderer to support them beyond
the Basic Multilingual Plane and a font to display them.

<http://unicode.org/faq/>
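
The code unit mechanics are easy to verify: in a UTF-16/UCS-2 based
implementation (which is what the quoted specification prescribes), a
character beyond the Basic Multilingual Plane occupies *two* string
elements, a surrogate pair:

```javascript
// U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP; in a UTF-16 based
// implementation it is stored as the surrogate pair U+D834 U+DD1E.
var clef = "\uD834\uDD1E";
console.log(clef.length);                     // 2 -- two UTF-16 code units
console.log(clef.charCodeAt(0).toString(16)); // "d834" (high surrogate)
console.log(clef.charCodeAt(1).toString(16)); // "dd1e" (low surrogate)
```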

> or "iso-8859-1" or a slew of other possibilities.

"We are left with ... anything." Great argument.

> Hell, in XML/XHTML we even have to use both terms.

Apples and oranges. By contrast to HTML, XML (and so XHTML -- when served
as application/xhtml+xml, application/xml, or text/xml --, as an application
of XML) has two default character encodings defined (that therefore do not
need to be declared), UTF-8 and UTF-16LE. The X(HT)ML Document Character
Set is the same as in HTML, though, UCS.

>> Learn to quote.
>
> Like that?

Yes, you only "forgot" a leading attribution line for each quotation level.


PointedEars
--
Danny Goodman's books are out of date and teach practices that are
positively harmful for cross-browser scripting.
-- Richard Cornford, cljs, <cife6q$253$1$8300dec7(a)news.demon.co.uk> (2004)
From: Thomas 'PointedEars' Lahn on
Thomas 'PointedEars' Lahn wrote:

> By contrast to HTML, XML (and so XHTML -- when served as
> application/xhtml+xml, application/xml, or text/xml --, as an
> application of XML) has two default character encodings defined (that
> therefore do not need to be declared), UTF-8 and UTF-16LE. The X(HT)ML
> Document Character Set is the same as in HTML, though, UCS.

Correction: The default is not limited to UTF-8 and UTF-16LE. At least
UTF-16BE must be supported, too.

,-<http://www.w3.org/TR/2008/REC-xml-20081126/#charencoding>
|
| [...]
| Each external parsed entity in an XML document may use a different
| encoding for its characters. All XML processors MUST be able to read
| entities in both the UTF-8 and UTF-16 encodings. The terms "UTF-8" and
| "UTF-16" in this specification do not apply to related character
| encodings, including but not limited to UTF-16BE, UTF-16LE, or CESU-8.
|
| Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin
| with the Byte Order Mark described by Annex H of [ISO/IEC 10646:2000],
| section 16.8 of [Unicode] (the ZERO WIDTH NO-BREAK SPACE character,
| #xFEFF). This is an encoding signature, not part of either the markup or
| the character data of the XML document. XML processors MUST be able to
| use this character to differentiate between UTF-8 and UTF-16 encoded
| documents.
| [...]
| In the absence of information provided by an external transport protocol
| (e.g. HTTP or MIME), it is a fatal error for an entity including an
| encoding declaration to be presented to the XML processor in an encoding
| other than that named in the declaration, or for an entity which begins
| with neither a Byte Order Mark nor an encoding declaration to use an
| encoding other than UTF-8. Note that since ASCII is a subset of UTF-8,
| ordinary ASCII entities do not strictly need an encoding declaration.
| [...]
| Unless an encoding is determined by a higher-level protocol, it is also a
| fatal error if an XML entity contains no encoding declaration and its
| content is not legal UTF-8 or UTF-16.

I could not find normative definitions of what "including but not limited
to" refers to. Appendix F (non-normative) mentions some possibilities, but
they should probably not be relied upon.
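
The detection procedure the quoted passage prescribes can be sketched
roughly as follows; `sniffEncoding` and its octet-array input are
hypothetical illustrations, not part of any XML processor's actual API:

```javascript
// Sketch of the BOM-based encoding detection described by the XML spec.
// `bytes` is assumed to be an array of octets from the start of the entity.
function sniffEncoding(bytes) {
  if (bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) {
    return "UTF-8";     // UTF-8 encoding signature (BOM)
  }
  if (bytes[0] === 0xFE && bytes[1] === 0xFF) {
    return "UTF-16BE";  // big-endian BOM
  }
  if (bytes[0] === 0xFF && bytes[1] === 0xFE) {
    return "UTF-16LE";  // little-endian BOM
  }
  // No BOM and no encoding declaration: the entity must be UTF-8.
  return "UTF-8";
}
```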


PointedEars
--
Use any version of Microsoft Frontpage to create your site.
(This won't prevent people from viewing your source, but no one
will want to steal it.)
-- from <http://www.vortex-webdesign.com/help/hidesource.htm> (404-comp.)
From: Jukka K. Korpela on
Hans-Georg Michna wrote:

> I'm having a problem with a UTF-8 HTML page containing a
> <script> tag that calls in a JavaScript file that is also
> encoded in UTF-8.

Technically, a <script> tag with a src attribute refers to an external file
for use via inclusion mechanism. It's nothing comparable to a subroutine
mechanism, so the word "call" is not quite adequate. But it seems that you
have already got your share of besserwisserism, nitpicking, and
pointless lecturing (on wrong topics) from the resident troll, so let's
discuss in common terms and expressions.

The encoding of Javascript files is tricky issue. There are no defined
defaults and no well-defined mechanism for specifying the encoding.

The common browser default seems to be that the encoding of the referring
HTML document is used. This sounds natural especially if you think of a
<script> element as a simple inclusion mechanism (for Javascript files). There
is no standard or law on this, but if it does not work that way in your case
(in which browser[s]?), you should really post the URL for analysis. In
general, it is best to post a URL for a starter. It won't prevent trolling,
but at least people who are willing to help have a sporting chance of being
able to see the problem.

> Am I right in assuming that a JavaScript file inserted by means
> of the <script> tag is interpreted as being encoded in the same
> character set as the HTML page itself?

I'd say that it's the expectable behavior, but we make our assumptions at
our own risk. It seems that the authors of HTML specifications just didn't
think of this issue.

On general grounds, we can expect that browsers honor the encoding
information in HTTP headers. However, if the Javascript resource is served
as application/javascript, then it's supposed to be binary, with all
encoding issues resolved within the binary format.

There's RFC 4329, "Scripting media types", but it is classified as
Informational, despite its language that refers to "requirements" and even
uses "MUST". And it illogically defines a charset parameter for
application/javascript. Oh well.

For text/javascript, declared as "obsolete" by the informational RFC, the
charset parameter is much more logical. And it seems that browsers honor it.

On the practical side, if you work with a typical Apache server environment,
you should put e.g.
AddType text/javascript;charset=utf-8 js
or
AddType application/javascript;charset=utf-8 js
in the .htaccess file in the directory that contains your .js file, if they
are actually utf-8 encoded. (Note that Ascii files are trivially a special
case of utf-8 encoded files, but ISO-8859-1 files containing any non-Ascii
data are not.)

Using the charset="..." attribute in the <script> tag is possible, too, but
it cannot override the encoding information in HTTP headers, if present. On
the other hand, it can be useful when a page has been saved locally - so
that when the page is opened, there will be no HTTP headers.
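
Such a declaration might look like this (`app.js` is a hypothetical file
name; a charset parameter in the HTTP Content-Type header, if present,
takes precedence):

```html
<!-- Hypothetical example: hinting the encoding of an external script.
     Ignored where the HTTP headers already state a charset. -->
<script type="text/javascript" src="app.js" charset="utf-8"></script>
```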

(Note that <meta> tags can only specify the encoding of an HTML document. If
this affects the default encoding used for an external resource, then that's
something in the realm of actual browser behavior, not specifications.)

--
Yucca, http://www.cs.tut.fi/~jkorpela/

From: Dr J R Stockton on
In comp.lang.javascript message <ev1v26tr7fv7buh7j0herp1gvs6a5ljlm8(a)4ax.
com>, Sat, 3 Jul 2010 21:03:39, Hans-Georg Michna <hans-
georgNoEmailPlease(a)michna.com> posted:

>
>The JavaScript program, among other things, contains a string
>literal, which contains an umlaut, and dynamically puts the
>string into an HTML tag. But the umlaut is not displayed
>properly and displays as a little square box instead. What could
>be the cause of this problem?

Is that a naked umlaut, or is it sitting over a well-known vowel?

You can add test code to the page, with charCodeAt, to see exactly what
is delivered to the browser (refer to the Unicode site for official
tables); if that delivery is wrong, you can encode the character as
\uhhhh or like &auml; in your source script.
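
Such a check might look like this (the string literal here is a stand-in
for the one in the problem page):

```javascript
// Dump the code unit value of each character as the script engine sees it;
// a mis-decoded UTF-8 umlaut typically shows up as *two* code units
// (e.g. c3 bc) instead of one (fc).
var s = "Gr\u00FC\u00DFe";   // "Grüße", written with escapes to be safe
var codes = [];
for (var i = 0; i < s.length; i++) {
  codes.push(s.charCodeAt(i).toString(16));
}
console.log(codes.join(" ")); // "47 72 fc df 65"
```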

Perhaps more likely, your viewer does not have that character in the
current font; add test code to display the string also in popular fonts.

If all else fails, consult
<URL:http://www.merlyn.demon.co.uk/quotings.HTM#FredHoyle>
- it has clearly worked with you!

--
(c) John Stockton, nr London UK. ?@merlyn.demon.co.uk Turnpike v6.05 MIME.
Web <URL:http://www.merlyn.demon.co.uk/> - FAQish topics, acronyms, & links.
Proper <= 4-line sig. separator as above, a line exactly "-- " (RFCs 5536/7)
Do not Mail News to me. Before a reply, quote with ">" or "> " (RFCs 5536/7)
From: Hans-Georg Michna on
On Sat, 3 Jul 2010 20:36:00 +0100, Richard Cornford wrote:

>Hans-Georg Michna wrote:

>> I'm having a problem with a UTF-8 HTML page containing a
>> <script> tag that calls in a JavaScript file that is also
>> encoded in UTF-8.
>>
>> The JavaScript program, among other things, contains a
>> string literal, which contains an umlaut, and dynamically
>> puts the string into an HTML tag. But the umlaut is not
>> displayed properly and displays as a little square box
>> instead. What could be the cause of this problem?
>>
>> Am I right in assuming that a JavaScript file inserted by
>> means of the <script> tag is interpreted as being encoded
>> in the same character set as the HTML page itself?

>Without a reference to an HTML spec saying as much that would be an
>assumption, although not an unreasonable one as it would be a sensible
>strategy. Though I would expect the above description to assert that you
>have examined the HTML traffic (using an HTTP monitor/proxy such as
>Fiddler or Charles) and verified first that the javascript is being
>served to appropriate content type headers (either asserting UTF-8, or
>at least not contradicting it),

Thanks for responding!

I've looked at the HTTP header from the server, using Firebug,
and it specifies UTF-8.

>and second, that the actual bytes being
>sent includes the correct sequence of bytes for the UTF-8 encoding of
>the offending character (by looking at the hex representation of the
>resource in the HTTP monitor).

Have yet to do this.

>If nothing else, trying the SCRIPT element with an explicit CHARSET
>attribute (asserting UTF-8) might prove instructive.

Will try that too, although it seems strange that I should have
to do that.

Hans-Georg