From: Mark Space on
Arne Vajhøj wrote:

>
> Encoding in HTTP header is easy, because the headers are US-ASCII, so
> the client can read the headers and determine the encoding before
> reading the body.
>
> Encoding in HTML META tag is not so nice.

Yes, HTML != HTTP. Sorry if the original question was about HTML
instead of HTTP, I may be out in left field here.
From: Mark Space on
Stefan Ram wrote:

> Shouldn't I use the document encoding instead of "UTF-8"?
>
> But I will only know this after I have read the response!
> (Or, at least part of it.)

So I'm no expert, and I hope I'm not wasting your time by blathering,
but the question is interesting to me so I did a bit of work on it.
Here's what I have so far.


static void method4() throws MalformedURLException, IOException {
    String TEST_URL =
            "http://cnn.com";
    URL url = new URL(TEST_URL);
    URLConnection c = url.openConnection();
    String type = c.getContentType();
    System.out.println("Mime type: " + type );
    if( type == null || type.contains("text") )
    {
        String enc = c.getContentEncoding();
        System.out.println( "Encoding: " + enc );
        if( enc == null )
        {
            enc = "ISO-8859-1";
        }
        InputStreamReader inr = new InputStreamReader(
                c.getInputStream(),
                enc ); // I have no idea if http encoding strings
                       // will work here
        List<CharBuffer> result = new ArrayList<CharBuffer>();
        int byteCount = 0;
        for( ;; )
        {
            int read;
            CharBuffer cb = CharBuffer.allocate( 4 * 1024 );
            if( ( read = inr.read( cb )) != -1 )
            {
                byteCount += read;
                result.add( cb );
            }
            else
            {
                break;
            }
        }
        System.out.println( "Read: " + byteCount );
    }
    else // binary
    {
        System.out.println("binary...");
    }
}

Some other thoughts:

1. If the URL string depends on user input, you may have to use
URLEncoder if the user input goes in the parameter part of the URL.

2. Don't forget that other protocols besides HTTP exist. The Java API
also supports FTP and JAR I believe. You might get one of those instead
of HTTP. You may wish to check the protocol expressly if you don't set
it yourself.

3. Both the MIME type and the character encoding may be null. The defaults
are "text" and ISO-8859-1 respectively, but URLConnection also has static
"guess" methods (guessContentTypeFromName() and
guessContentTypeFromStream()).

4. If you don't have text, you might have an image. It might be nice to
return an Image in that case. I didn't get that far though.

5. I can't find any expandable buffers for Java. StringBuilder or
StringWriter seem like a good idea. I made my own by stuffing
CharBuffers into a List. The idea is to avoid testing each character
for an end-of-line, which readLine() must do. Hopefully the CharBuffer
is faster.

6. You could also read the data raw (ByteBuffer) and decide what to do
with it later. This might be more in the spirit of a "slurp" operation.

7. I looked for a way to get a channel from the URLConnection and didn't
find one. I think this is a defect in the Java API, myself. Using
direct buffers might be a big performance win here. You'll need a raw
socket for that I guess.
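
A minimal sketch of points 5 and 6, not code from this thread:
ByteArrayOutputStream serves as the expandable buffer, and decoding is put
off until the charset is known. The names Slurp, readAll and decode are
made up for illustration, and the in-memory stream in main() only stands in
for c.getInputStream().

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

// Hypothetical helper class, for illustration only.
final class Slurp {
    // Read the stream to end-of-file into a buffer that grows as needed.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4 * 1024];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    // Decode the raw bytes later, once the charset has been worked out.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a response body: "caf" plus 0xE9 (e-acute in ISO-8859-1).
        byte[] body = { 'c', 'a', 'f', (byte) 0xE9 };
        byte[] raw = readAll(new ByteArrayInputStream(body));
        System.out.println(raw.length + " bytes, decoded: " + decode(raw, "ISO-8859-1"));
    }
}
```

Keeping the bytes around also means the body can be decoded a second time
if a different charset turns up later, for instance in a META tag.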
From: Tom Anderson on
On Sat, 19 Jul 2008, Mark Space wrote:

> Mark Space wrote:
>> Stefan Ram wrote:
>>> ram(a)zedat.fu-berlin.de (Stefan Ram) writes:
>>>> new java.io.InputStreamReader
>>>> ( httpURLConnection.getInputStream(), "UTF-8" );
>>>
>>> A more specific question:
>>>
>>> Shouldn't I use the document encoding instead of "UTF-8"?
>>
>> The default for HTTP is "8859_1" (that's the Java charset name).
>> There's a special protocol for negotiating a different charset, which
>> you won't support because your get is too primitive.
>>
>> The server will either send you 8859.1 if it can, or it'll close the
>> connection, I think.

My understanding is that the server may, in pretty much any situation,
send whatever charset it likes, as long as it declares it in the
content-type header.

> P.S. the openStream() method for URL seems to open the type of connection
> you need directly.
>
> BufferedReader bin = null;
>
> URL url = new URL( arg[0] );
> bin = new BufferedReader(
> new InputStreamReader( url.openStream() ));
>
> I think. Better check that.

You're absolutely right.

A slightly more correct approach (which might have been expounded
downthread already) would be to use a URLConnection, get the content-type,
parse it to identify a charset, and then use that to configure the
InputStreamReader correctly.

Sadly, and shockingly, there doesn't seem to be anything to parse
content-type headers in the standard library. There is a
javax.mail.internet.ContentType in J2EE, though, and it's not too hard to
write yourself.
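
To illustrate "not too hard to write yourself", here is a rough sketch of
pulling the charset parameter out of a Content-Type value. The names
ContentTypeCharset, charsetParam and charsetOrDefault are hypothetical. It
splits on ';' naively, so it will not cope with a quoted parameter value
that itself contains a semicolon, and it is not a full header parser.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
final class ContentTypeCharset {
    // Return the charset parameter of a Content-Type value, or null if absent.
    static String charsetParam(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String p = parts[i].trim();
            int eq = p.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String name = p.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            if (!name.equals("charset")) {
                continue;
            }
            String value = p.substring(eq + 1).trim();
            // Strip optional double quotes around the value.
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            return value.isEmpty() ? null : value;
        }
        return null;
    }

    // Map the header to a Charset, using the fallback when the charset is
    // missing, malformed, or not supported by this JVM.
    static Charset charsetOrDefault(String contentType, Charset fallback) {
        String name = charsetParam(contentType);
        try {
            if (name != null && Charset.isSupported(name)) {
                return Charset.forName(name);
            }
        } catch (IllegalArgumentException e) {
            // Illegal charset name: fall through to the fallback.
        }
        return fallback;
    }
}
```

The result can go straight into new InputStreamReader(stream, charset),
with ISO-8859-1 as the fallback if you want the default discussed above.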

There's also an intriguing getContent() method that sounds like it should
be even closer to what Stefan wanted - it downloads the bytes, then uses
the content-type to convert them into an object. However, it's not
entirely clear exactly what kind of object you're supposed to get, which
makes it more or less useless. In practice, getting HTML text gives you an
InputStream, and getting an image gives you a
java.awt.image.ImageProducer. That's not enormously useful here.

tom

--
Sometimes it takes a madman like Iggy Pop before you can SEE the logic
really working.
From: Mark Space on
Stefan Ram wrote:
> Mark Space <markspace(a)sbc.global.net> writes:
>> String enc = c.getContentEncoding();
>> System.out.println( "Encoding: " + enc );
>> if( enc == null )
>> {
>> enc = "ISO-8859-1";
>
> In spite of its name, getContentEncoding() does /not/
> designate the content character encoding.

Yup, I shoulda read the docs better. I'll correct my example, thanks.
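
A sketch of the distinction, not the corrected example promised above:
getContentEncoding() returns the Content-Encoding header, which names a
content coding such as gzip, while the charset travels as a parameter of
Content-Type. The class and method names below are made up, only gzip is
handled, and the charset name is assumed to have been obtained separately.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.zip.GZIPInputStream;

// Hypothetical helper class, for illustration only.
final class ContentEncodingDemo {
    // contentEncoding is what c.getContentEncoding() returned (may be null).
    // It says how the bytes were compressed, not which charset they use.
    static InputStream unwrap(InputStream body, String contentEncoding) throws IOException {
        if (contentEncoding != null && contentEncoding.trim().equalsIgnoreCase("gzip")) {
            return new GZIPInputStream(body);
        }
        return body;
    }

    // Undo the content coding first, then decode with the charset that was
    // taken from the Content-Type header.
    static String readText(InputStream body, String contentEncoding, String charsetName)
            throws IOException {
        Reader r = new InputStreamReader(unwrap(body, contentEncoding), charsetName);
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4 * 1024];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } finally {
            r.close();
        }
    }
}
```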
From: Arne Vajhøj on
Mark Space wrote:
> So I'm no expert, and I hope I'm not wasting your time by blathering,
> but the question is interesting to me so I did a bit of work on it.
> Here's what I have so far.
>
> static void method4() throws MalformedURLException, IOException {
> String TEST_URL =
> "http://cnn.com";
> URL url = new URL(TEST_URL);
> URLConnection c = url.openConnection();
> String type = c.getContentType();
> System.out.println("Mime type: " + type );
> if( type == null || type.contains("text") )
> {
> String enc = c.getContentEncoding();
> System.out.println( "Encoding: " + enc );
> if( enc == null )
> {
> enc = "ISO-8859-1";
> }
> InputStreamReader inr = new InputStreamReader(
> c.getInputStream(),
> enc ); // I have no idea if http encoding strings
> // will work here
> List<CharBuffer> result = new ArrayList<CharBuffer>();
> int byteCount = 0;
> for( ;; )
> {
> int read;
> CharBuffer cb = CharBuffer.allocate( 4 * 1024 );
> if( ( read = inr.read( cb )) != -1 )
> {
> byteCount += read;
> result.add( cb );
> }
> else
> {
> break;
> }
> }
> System.out.println( "Read: " + byteCount );
> }
> else // binary
> {
> System.out.println("binary...");
> }
> }

You need to also handle the META HTTP-EQUIV way of specifying charset.

My suggestion for code:

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HttpDownloadCharset {
    private static Pattern encpat =
        Pattern.compile("charset=([A-Za-z0-9-]+)", Pattern.CASE_INSENSITIVE);

    // Pull the charset out of a Content-Type value; default to ISO-8859-1.
    private static String parseContentType(String contenttype) {
        if(contenttype == null) { // the header may be missing
            return "ISO-8859-1";
        }
        Matcher m = encpat.matcher(contenttype);
        if(m.find()) {
            return m.group(1);
        } else {
            return "ISO-8859-1";
        }
    }

    private static Pattern metaencpat =
        Pattern.compile("<META\\s+HTTP-EQUIV\\s*=\\s*[\"']Content-Type[\"']\\s+CONTENT\\s*=\\s*[\"']([^\"']*)[\"']>",
                        Pattern.CASE_INSENSITIVE);

    // Look for a META HTTP-EQUIV Content-Type tag; keep defenc if there is none.
    private static String parseMetaContentType(String html, String defenc) {
        Matcher m = metaencpat.matcher(html);
        if(m.find()) {
            return parseContentType(m.group(1));
        } else {
            return defenc;
        }
    }

    private static final int DEFAULT_BUFSIZ = 1000000;

    public static String download(String urlstr) throws IOException {
        URL url = new URL(urlstr);
        HttpURLConnection con = (HttpURLConnection)url.openConnection();
        con.connect();
        if (con.getResponseCode() == HttpURLConnection.HTTP_OK) {
            String enc = parseContentType(con.getContentType());
            int bufsiz = con.getContentLength();
            if(bufsiz < 0) {
                bufsiz = DEFAULT_BUFSIZ;
            }
            // Note: a body longer than bufsiz is silently truncated.
            byte[] buf = new byte[bufsiz];
            InputStream is = con.getInputStream();
            int ix = 0;
            int n;
            while((n = is.read(buf, ix, buf.length - ix)) > 0) {
                ix += n;
            }
            is.close();
            con.disconnect();
            // Decode only the ix bytes actually read, not the whole buffer.
            String temp = new String(buf, 0, ix, "US-ASCII");
            enc = parseMetaContentType(temp, enc);
            return new String(buf, 0, ix, enc);
        } else {
            // Fetch the message before disconnecting.
            String msg = con.getResponseMessage();
            con.disconnect();
            throw new IllegalArgumentException("URL " + urlstr + " returned " + msg);
        }
    }
}

Arne