From: Mark Space on 19 Jul 2008 23:02

Arne Vajhøj wrote:
>
> Encoding in HTTP header is easy, because the headers are US-ASCII, so
> the client can read the headers and determine the encoding before
> reading the body.
>
> Encoding in HTML META tag is not so nice.

Yes, HTML != HTTP. Sorry if the original question was about HTML instead of HTTP, I may be out in left field here.

From: Mark Space on 20 Jul 2008 16:20

Stefan Ram wrote:
> Shouldn't I use the document encoding instead of "UTF-8"?
>
> But I will only know this after I have read the response!
> (Or, at least part of it.)

So I'm no expert, and I hope I'm not wasting your time by blathering, but the question is interesting to me so I did a bit of work on it. Here's what I have so far.

    static void method4() throws MalformedURLException, IOException {
        String TEST_URL = "http://cnn.com";
        URL url = new URL(TEST_URL);
        URLConnection c = url.openConnection();
        String type = c.getContentType();
        System.out.println("Mime type: " + type );
        if( type == null || type.contains("text") )
        {
            String enc = c.getContentEncoding();
            System.out.println( "Encoding: " + enc );
            if( enc == null )
            {
                enc = "ISO-8859-1";
            }
            InputStreamReader inr = new InputStreamReader(
                    c.getInputStream(),
                    enc );  // I have no idea if http encoding strings
                            // will work here
            List<CharBuffer> result = new ArrayList<CharBuffer>();
            int byteCount = 0;
            for( ;; )
            {
                int read;
                CharBuffer cb = CharBuffer.allocate( 4 * 1024 );
                if( ( read = inr.read( cb )) != -1 )
                {
                    byteCount += read;
                    result.add( cb );
                }
                else
                {
                    break;
                }
            }
            System.out.println( "Read: " + byteCount );
        }
        else // binary
        {
            System.out.println("binary...");
        }
    }

Some other thoughts:

1. If the URL string depends on user input, you may have to use URLEncoder if the user input goes in the parameter part of the URL.

2. Don't forget that other protocols besides HTTP exist. The Java API also supports FTP and JAR I believe. You might get one of those instead of HTTP. You may wish to check the protocol expressly if you don't set it yourself.

3. Both mime type and the character encoding may be null. The defaults are "text" and ISO-8859-1 respectively, but there are also "guess" methods in the URLConnection object.

4. If you don't have text, you might have an image. It might be nice to return an Image in that case. I didn't get that far though.

5. I can't find any expandable buffers for Java. StringBuilder or StringWriter seem like a good idea. I made my own by stuffing CharBuffers into a List. The idea is to avoid testing each character for an end-of-line, which readLine() must do. Hopefully the CharBuffer is faster.

6. You could also read the data raw (ByteBuffer) and decide what to do with it later. This might be more in the spirit of a "slurp" operation.

7. I looked for a way to get a channel from the URLConnection and didn't find one. I think this is a defect in the Java API, myself. Using direct buffers might be a big performance win here. You'll need a raw socket for that I guess.
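
For what it's worth, the StringBuilder idea from point 5 could look roughly like the sketch below. It reads fixed-size chunks and appends them, so no per-character end-of-line test is needed. The names ReaderSlurp and readAll are made up for illustration and are not from the post; the 4 KB chunk size simply mirrors the CharBuffer size used above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper, not part of the original post.
final class ReaderSlurp {

    /** Reads everything from the reader into one String, chunk by chunk. */
    static String readAll(Reader in) throws IOException {
        StringBuilder out = new StringBuilder();   // grows as needed
        char[] chunk = new char[4 * 1024];         // same size as the CharBuffers above
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.append(chunk, 0, n);               // append only the chars actually read
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the InputStreamReader on the connection.
        String text = readAll(new StringReader("line one\nline two\n"));
        System.out.println("Read: " + text.length());
    }
}
```

With a real connection, the StringReader would be replaced by an InputStreamReader wrapped around c.getInputStream() with the charset taken from the response.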

From: Tom Anderson on 22 Jul 2008 15:08

On Sat, 19 Jul 2008, Mark Space wrote:

> Mark Space wrote:
>> Stefan Ram wrote:
>>> ram(a)zedat.fu-berlin.de (Stefan Ram) writes:
>>>> new java.io.InputStreamReader
>>>> ( httpURLConnection.getInputStream(), "UTF-8" );
>>>
>>> A more specific question:
>>>
>>> Shouldn't I use the document encoding instead of "UTF-8"?
>>
>> The default for HTTP is "8859_1" (that's the Java charset name).
>> There's a special protocol for negotiating a different charset, which
>> you won't support because your get is too primitive.
>>
>> The server will either send you 8859.1 if it can, or it'll close the
>> connection, I think.

My understanding is that the server may, in pretty much any situation, send whatever charset it likes, as long as it declares it in the content-type header.

> P.S. the openStream() method for URL seems to open the type of connection
> you need directly.
>
> BufferedReader bin = null;
>
> URL url = new URL( arg[0] );
> bin = new BufferedReader(
>     new InputStreamReader( url.openStream() ));
>
> I think. Better check that.

You're absolutely right.

A slightly more correct approach (which might have been expounded downthread already) would be to use a URLConnection, get the content-type, parse it to identify a charset, and then use that to configure the InputStreamReader correctly. Sadly, and shockingly, there doesn't seem to be anything to parse content-type headers in the standard library. There is a javax.mail.internet.ContentType in J2EE, though, and it's not too hard to write yourself.

There's also an intriguing getContent() method that sounds like it should be even closer to what Stefan wanted - it downloads the bytes, then uses the content-type to convert them into an object. However, it's not entirely clear exactly what kind of object you're supposed to get, which makes it more or less useless. In practice, getting HTML text gives you an InputStream, and getting an image gives you a java.awt.image.ImageProducer. That's not enormously useful here.

tom

--
Sometimes it takes a madman like Iggy Pop before you can SEE the logic really working.
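
A hand-written charset parser along the lines Tom describes might look something like this. It is only a sketch: it assumes Java 7 or later (for StandardCharsets), handles just the simple "type/subtype; charset=name" shape rather than the full header grammar, and falls back to ISO-8859-1 as discussed in this thread. The class and method names are invented for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not from the thread.
final class ContentTypeCharset {

    /** Returns the charset named in a Content-Type value, or ISO-8859-1 if none is usable. */
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            String[] parts = contentType.split(";");
            // parts[0] is the media type; parameters start at index 1
            for (int i = 1; i < parts.length; i++) {
                String p = parts[i].trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).trim();
                    // the value may be a quoted string
                    if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                        name = name.substring(1, name.length() - 1);
                    }
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // illegal or unsupported name: use the default
                    }
                }
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8"));
        System.out.println(charsetOf("text/html"));
    }
}
```

The result can be passed straight to the InputStreamReader constructor that takes a Charset, with charsetOf(c.getContentType()) as the argument.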

From: Mark Space on 22 Jul 2008 16:16

Stefan Ram wrote:
> Mark Space <markspace(a)sbc.global.net> writes:
>> String enc = c.getContentEncoding();
>> System.out.println( "Encoding: " + enc );
>> if( enc == null )
>> {
>> enc = "ISO-8859-1";
>
> In spite of its name, getContentEncoding() does /not/
> designate the content character encoding.

Yup, I shoulda read the docs better. I'll correct my example, thanks.
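
To make the distinction concrete: the Content-Encoding header names a transformation applied to the body, such as gzip, while the character set travels in the charset parameter of Content-Type. The sketch below is an illustration only, with invented names; it works on an in-memory byte array instead of a live connection and handles only the "gzip" coding. It undoes the content encoding first and only then decodes characters.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not from the thread.
final class ContentEncodingDemo {

    /** Undoes the Content-Encoding (only gzip is handled here); other values pass through. */
    static InputStream unwrap(InputStream body, String contentEncoding) throws IOException {
        if (contentEncoding != null && contentEncoding.trim().equalsIgnoreCase("gzip")) {
            return new GZIPInputStream(body);
        }
        return body;
    }

    /** Step 1: undo the content encoding. Step 2: decode bytes to chars with the charset. */
    static String decode(byte[] body, String contentEncoding, String charsetName) throws IOException {
        try (InputStream in = unwrap(new ByteArrayInputStream(body), contentEncoding)) {
            ByteArrayOutputStream raw = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                raw.write(chunk, 0, n);
            }
            return new String(raw.toByteArray(), charsetName);
        }
    }

    /** Test helper: compresses data the way a server sending "Content-Encoding: gzip" would. */
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] body = gzip("h\u00e9llo".getBytes("UTF-8"));
        System.out.println(decode(body, "gzip", "UTF-8"));
    }
}
```

In the example quoted above, that would mean taking the charset from getContentType() and treating a non-null getContentEncoding() as a hint about compression, not about characters.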

From: Arne Vajhøj on 27 Jul 2008 18:05

Mark Space wrote:
> So I'm no expert, and I hope I'm not wasting your time by blathering,
> but the question is interesting to me so I did a bit of work on it.
> Here's what I have so far.
>
> static void method4() throws MalformedURLException, IOException {
>     String TEST_URL = "http://cnn.com";
>     URL url = new URL(TEST_URL);
>     URLConnection c = url.openConnection();
>     String type = c.getContentType();
>     System.out.println("Mime type: " + type );
>     if( type == null || type.contains("text") )
>     {
>         String enc = c.getContentEncoding();
>         System.out.println( "Encoding: " + enc );
>         if( enc == null )
>         {
>             enc = "ISO-8859-1";
>         }
>         InputStreamReader inr = new InputStreamReader(
>                 c.getInputStream(),
>                 enc );  // I have no idea if http encoding strings
>                         // will work here
>         List<CharBuffer> result = new ArrayList<CharBuffer>();
>         int byteCount = 0;
>         for( ;; )
>         {
>             int read;
>             CharBuffer cb = CharBuffer.allocate( 4 * 1024 );
>             if( ( read = inr.read( cb )) != -1 )
>             {
>                 byteCount += read;
>                 result.add( cb );
>             }
>             else
>             {
>                 break;
>             }
>         }
>         System.out.println( "Read: " + byteCount );
>     }
>     else // binary
>     {
>         System.out.println("binary...");
>     }
> }

You need to also handle the META HTTP-EQUIV way of specifying charset.

My suggestion for code:

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HttpDownloadCharset {
    private static Pattern encpat = Pattern.compile("charset=([A-Za-z0-9-]+)", Pattern.CASE_INSENSITIVE);

    private static String parseContentType(String contenttype) {
        // getContentType() returns null when the header is missing
        if(contenttype != null) {
            Matcher m = encpat.matcher(contenttype);
            if(m.find()) {
                return m.group(1);
            }
        }
        return "ISO-8859-1";
    }

    private static Pattern metaencpat = Pattern.compile(
            "<META\\s+HTTP-EQUIV\\s*=\\s*[\"']Content-Type[\"']\\s+" +
            "CONTENT\\s*=\\s*[\"']([^\"']*)[\"']>",
            Pattern.CASE_INSENSITIVE);

    private static String parseMetaContentType(String html, String defenc) {
        Matcher m = metaencpat.matcher(html);
        if(m.find()) {
            return parseContentType(m.group(1));
        } else {
            return defenc;
        }
    }

    private static final int DEFAULT_BUFSIZ = 1000000;

    public static String download(String urlstr) throws IOException {
        URL url = new URL(urlstr);
        HttpURLConnection con = (HttpURLConnection)url.openConnection();
        con.connect();
        if (con.getResponseCode() == HttpURLConnection.HTTP_OK) {
            String enc = parseContentType(con.getContentType());
            int bufsiz = con.getContentLength();
            if(bufsiz < 0) {
                bufsiz = DEFAULT_BUFSIZ;
            }
            byte[] buf = new byte[bufsiz];
            InputStream is = con.getInputStream();
            int ix = 0;
            int n;
            // stops at end of stream or when buf is full (longer bodies are truncated)
            while((n = is.read(buf, ix, buf.length - ix)) > 0) {
                ix += n;
            }
            is.close();
            con.disconnect();
            // decode only the ix bytes actually read, not the unused tail of buf
            String temp = new String(buf, 0, ix, "US-ASCII");
            enc = parseMetaContentType(temp, enc);
            return new String(buf, 0, ix, enc);
        } else {
            // read the message before disconnecting
            String msg = con.getResponseMessage();
            con.disconnect();
            throw new IllegalArgumentException("URL " + urlstr + " returned " + msg);
        }
    }
}

Arne