From: Mark Space on
Stefan Ram wrote:
> ram(a)zedat.fu-berlin.de (Stefan Ram) writes:
>> new java.io.InputStreamReader
>> ( httpURLConnection.getInputStream(), "UTF-8" );
>
> A more specific question:
>
> Shouldn't I use the document encoding instead of "UTF-8"?

The default for HTTP is "8859_1" (that's the Java charset name).
There's a special protocol for negotiating a different charset, which
you won't support because your GET is too primitive.

The server will either send you 8859-1 if it can, or it'll close the
connection, I think.
From: Mark Space on
Mark Space wrote:
> Stefan Ram wrote:
>> ram(a)zedat.fu-berlin.de (Stefan Ram) writes:
>>> new java.io.InputStreamReader
>>> ( httpURLConnection.getInputStream(), "UTF-8" );
>>
>> A more specific question:
>>
>> Shouldn't I use the document encoding instead of "UTF-8"?
>
> The default for HTTP is "8859_1" (that's the Java charset name). There's
> a special protocol for negotiating a different charset, which you won't
> support because your GET is too primitive.
>
> The server will either send you 8859-1 if it can, or it'll close the
> connection, I think.

P.S. the openStream() method for URL seems to open the type of
connection you need directly.

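  // Needs java.io.BufferedReader, java.io.InputStreamReader, java.net.URL.
  // arg[0] is assumed to hold the URL string.
  URL url = new URL( arg[0] );
  // No charset is passed here, so the platform default is used for decoding.
  BufferedReader bin = new BufferedReader(
      new InputStreamReader( url.openStream() ));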


I think. Better check that. It's fewer lines though.
From: Arne Vajhøj on
Mark Space wrote:
> Stefan Ram wrote:
>> ram(a)zedat.fu-berlin.de (Stefan Ram) writes:
>>> new java.io.InputStreamReader
>>> ( httpURLConnection.getInputStream(), "UTF-8" );
>>
>> A more specific question:
>>
>> Shouldn't I use the document encoding instead of "UTF-8"?
>
> The default for HTTP is "8859_1" (that's the Java charset name). There's
> a special protocol for negotiating a different charset, which you won't
> support because your GET is too primitive.
>
> The server will either send you 8859-1 if it can, or it'll close the
> connection, I think.

What ?

HttpURLConnection and its InputStream fetch bytes from the
server. No negotiations possible.

When the client needs to interpret the bytes it needs to
decide on an encoding.

The code snippet above creates an InputStreamReader expecting
UTF-8 encoding.

If that is known to be the encoding, then it is fine. If the encoding
is unknown, it should be based on HTTP header and HTML META tag info.
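
To make the point concrete, here is a minimal sketch of why the choice
matters: the same bytes give different text depending on the charset
handed to the decoder. The class and method names are made up for
illustration; only the standard java.nio.charset API is used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
final class DecodeDemo {
    // Decode raw bytes with an explicitly chosen charset.
    static String decode(byte[] raw, Charset cs) {
        return new String(raw, cs);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) encoded as UTF-8 is the two bytes 0xC3 0xA9.
        byte[] raw = "\u00e9".getBytes(StandardCharsets.UTF_8);
        // Decoded as UTF-8 the bytes form one character.
        System.out.println(decode(raw, StandardCharsets.UTF_8).length());      // prints 1
        // Decoded as ISO-8859-1 each byte becomes its own character.
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1).length()); // prints 2
    }
}
```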

There is no default ISO-8859-1 in either HTTP or Java. HTTP is
always explicit and the Java default is system specific.

Arne
From: Mark Space on
Arne Vajhøj wrote:

>
> HttpURLConnection and its InputStream fetch bytes from the
> server. No negotiations possible.

I think that's what I'm saying. Although I'm no longer sure that
HttpURLConnection doesn't fully support HTTP character sets. It might.


> There is no default ISO-8859-1 in either HTTP or Java. HTTP is
> always explicit and the Java default is system specific.

For a socket, yes, there is no default encoding. For HTTP, I think that
is not true. 8859-1 is the default if nothing is specified, and it is
legal to leave out the charset encoding -- in both the GET and the response.

I think, anyway. I could be all wrong about that.

Stefan has a valid question: If the content type isn't specified until
you read the header, and you don't know the content type, how do you
know what to open the stream as? The answer I think is that it's
defined to be 8859-1 by default.

Let me see if I can dig something up...

Content Negotiation for HTTP:
<http://en.wikipedia.org/wiki/Content_negotiation>

Some info on "Missing Charset" in the RFC:
<http://tools.ietf.org/html/rfc2616>
Search for 8859.


Back to Java: Also, URLConnection looks like it will allow one to read
things like the content type and mime type before getting a Java
InputStream to the content:

URLConnection c = url.openConnection();
String mimeType = c.getContentType();
System.out.println( mimeType );

And similarly for getContentEncoding();
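
One caveat: getContentEncoding() returns the Content-Encoding header
(things like gzip), not the character set. The charset, when the server
sends one, is a parameter of the Content-Type value. Below is a rough
sketch of pulling it out of that string and falling back to ISO-8859-1
when it is missing, as discussed above. The class and method names are
made up, and the parsing is deliberately simple rather than a full
header parser.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of the JDK.
final class ContentTypeCharset {
    // Returns the charset named in a Content-Type value such as
    // "text/html; charset=UTF-8", or ISO-8859-1 if none is usable.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String param : contentType.split(";")) {
                String p = param.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).trim();
                    // The value may be quoted: charset="utf-8"
                    if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                        name = name.substring(1, name.length() - 1);
                    }
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Illegal or unsupported charset name: use the fallback.
                        break;
                    }
                }
            }
        }
        return StandardCharsets.ISO_8859_1;
    }
}
```

With the URLConnection c from above, that would be used roughly as
new InputStreamReader( c.getInputStream(), ContentTypeCharset.charsetOf( c.getContentType() ) ).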

I gotta run. I hope I didn't booger things up too badly replying to
Stefan. Apologies if I did.
From: Arne Vajhøj on
Mark Space wrote:
> Arne Vajhøj wrote:
>> There is no default ISO-8859-1 in either HTTP or Java. HTTP is
>> always explicit and the Java default is system specific.
>
> For a socket, yes, there is no default encoding. For HTTP, I think that
> is not true. 8859-1 is the default if nothing is specified, and it is
> legal to leave out the charset encoding -- in both the GET and the
> response.

> Let me see if I can dig something up...
>
> Content Negotiation for HTTP:
> <http://en.wikipedia.org/wiki/Content_negotiation>
>
> Some info on "Missing Charset" in the RFC:
> <http://tools.ietf.org/html/rfc2616>
> Search for 8859.

You are right. If nothing is specified it means ISO-8859-1. Which
is rather bad since the world is moving from ISO-8859-1 to UTF-8.

> Stefan has a valid question: If the content type isn't specified until
> you read the header, and you don't know the content type, how do you
> know what to open the stream as? The answer I think is that it's
> defined to be 8859-1 by default.
>
> Back to Java: Also, URLConnection looks like it will allow one to read
> things like the content type and mime type before getting a Java
> InputStream to the content:
>
> URLConnection c = url.openConnection();
> String mimeType = c.getContentType();
> System.out.println( mimeType );
>
> And similarly for getContentEncoding();

Encoding in HTTP header is easy, because the headers are US-ASCII, so
the client can read the headers and determine the encoding before
reading the body.

Encoding in HTML META tag is not so nice.
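
The problem is that the client has to start reading the body before it
knows how to decode it. A rough sketch of one common workaround, assuming
the document is in an ASCII-compatible encoding: decode the first chunk
of bytes as ISO-8859-1 (which cannot fail, every byte maps to one char),
look for a charset declaration in a META tag, then decode the whole body
again with the charset that was found. The class name, method name and
regular expression are illustrative only, not a complete HTML parser.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the JDK.
final class MetaCharsetSniffer {
    // Matches both <meta charset="..."> and
    // <meta http-equiv="Content-Type" content="text/html; charset=...">.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
        Pattern.CASE_INSENSITIVE);

    // Returns the charset name declared in the given leading bytes of
    // the document, or null if no declaration is found there.
    static String sniff(byte[] head) {
        // ISO-8859-1 maps each byte to exactly one char, so ASCII markup
        // stays readable regardless of the real encoding.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(text);
        return m.find() ? m.group(1) : null;
    }
}
```

The returned name would still need to go through Charset.forName() with
a fallback for unknown names.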

Arne