From: John Nagle on
(Repost with better indentation)
I'm reading a URL which is a .gz file, and decompressing
it. This works, but it seems far too complex. Yet
none of the "wrapping" you might expect to work
actually does. You can't wrap a GzipFile around
an HTTP connection, because GzipFile, reasonably enough,
needs random access, and tries to do "seek" and "tell".
Nor is the output descriptor from gzip general; it fails
on "readline", but accepts "read". (No good reason
for that.) So I had to make a second copy.

John Nagle

def readurl(url) :
    if url.endswith(".gz") :
        nd = urllib2.urlopen(url,timeout=TIMEOUTSECS)
        td1 = tempfile.TemporaryFile() # compressed file
        td1.write(nd.read()) # fetch and copy file
        nd.close() # done with network
        td2 = tempfile.TemporaryFile() # decompressed file
        td1.seek(0) # rewind
        gd = gzip.GzipFile(fileobj=td1, mode="rb") # wrap unzip
        td2.write(gd.read()) # decompress file
        td1.close() # done with compressed copy
        td2.seek(0) # rewind
        return(td2) # return file object for decompressed data
    else :
        return(urllib2.urlopen(url,timeout=TIMEOUTSECS))
From: Thomas Jollans on
On Thursday 12 August 2010, it occurred to John Nagle to exclaim:
> (Repost with better indentation)

Good, good.

>
> def readurl(url) :
>     if url.endswith(".gz") :

The file name could be anything. You should be checking the response
Content-Type header -- that's what it's for.
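For instance, a minimal header check might look like this (a hypothetical helper, not part of urllib2; headers shown as a plain dict for illustration):

```python
def looks_gzipped(headers):
    """Guess from response headers (not the URL) whether the body is gzip.

    `headers` maps header names to values, as returned by the server.
    """
    ctype = headers.get("Content-Type", "").lower()
    cenc = headers.get("Content-Encoding", "").lower()
    return cenc == "gzip" or ctype in ("application/x-gzip", "application/gzip")
```

With a real urllib2 response you would consult `nd.info()` rather than a dict, but the decision logic is the same.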

>         nd = urllib2.urlopen(url,timeout=TIMEOUTSECS)
>         td1 = tempfile.TemporaryFile() # compressed file

You can keep the whole thing in memory by using StringIO.

>         td1.write(nd.read()) # fetch and copy file

You're reading the entire file into memory anyway ;-)

>         nd.close() # done with network
>         td2 = tempfile.TemporaryFile() # decompressed file

Okay, maybe there is something missing from GzipFile -- but still you could use
StringIO again, I expect.
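Something like this sketch, say -- on a modern Python that's io.BytesIO rather than StringIO, but the idea is the same: the in-memory buffer satisfies GzipFile's seek()/tell() calls, so no temporary files are needed:

```python
import gzip
import io

def gunzip_bytes(compressed):
    # Wrap the downloaded bytes in a file-like object; BytesIO supports
    # seek() and tell(), so GzipFile is happy and no temp file is needed.
    return gzip.GzipFile(fileobj=io.BytesIO(compressed), mode="rb").read()
```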

> Nor is the output descriptor from gzip general; it fails
> on "readline", but accepts "read".

>>> from gzip import GzipFile
>>> GzipFile.readline
<unbound method GzipFile.readline>
>>> GzipFile.readlines
<unbound method GzipFile.readlines>
>>> GzipFile.__iter__
<unbound method GzipFile.__iter__>
>>>

What exactly is it that's failing, and how?
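A quick in-memory check (with a modern Python, bytes rather than str) suggests readline() does work on a GzipFile:

```python
import gzip
import io

# Build a small gzip stream in memory, then read it back line by line.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(b"first line\nsecond line\n")

gd = gzip.GzipFile(fileobj=io.BytesIO(buf.getvalue()), mode="rb")
print(gd.readline())  # b'first line\n'
print(gd.readline())  # b'second line\n'
```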


>         td1.seek(0) # rewind
>         gd = gzip.GzipFile(fileobj=td1, mode="rb") # wrap unzip
>         td2.write(gd.read()) # decompress file
>         td1.close() # done with compressed copy
>         td2.seek(0) # rewind
>         return(td2) # return file object for decompressed data
>     else :
>         return(urllib2.urlopen(url,timeout=TIMEOUTSECS))
From: Aahz on
In article <4c645c39$0$1595$742ec2ed(a)news.sonic.net>,
John Nagle <nagle(a)animats.com> wrote:
>
>I'm reading a URL which is a .gz file, and decompressing it. This
>works, but it seems far too complex. Yet none of the "wrapping"
>you might expect to work actually does. You can't wrap a GzipFile
>around an HTTP connection, because GzipFile, reasonably enough, needs
>random access, and tries to do "seek" and "tell". Nor is the output
>descriptor from gzip general; it fails on "readline", but accepts
>"read". (No good reason for that.) So I had to make a second copy.

Also consider using zlib directly.
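Roughly like this, for example (a sketch, written for a modern Python): passing wbits = 16 + zlib.MAX_WBITS makes zlib expect the gzip header, and a decompressobj works on data arriving in chunks, so no seek()/tell() -- and no temp files -- are needed:

```python
import gzip
import zlib

def gunzip_stream(chunks):
    """Decompress a gzip stream fed in arbitrary-sized chunks."""
    # 16 + MAX_WBITS tells zlib to decode the gzip framing itself.
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)
    out = [d.decompress(chunk) for chunk in chunks]
    out.append(d.flush())
    return b"".join(out)

# Round-trip check without any network I/O:
blob = gzip.compress(b"x" * 1000)
print(gunzip_stream(blob[i:i+7] for i in range(0, len(blob), 7)) == b"x" * 1000)
```

The chunked loop is what lets this run directly over a socket-like stream, which is exactly what GzipFile's seek()/tell() requirement rules out.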
--
Aahz (aahz(a)pythoncraft.com) <*> http://www.pythoncraft.com/

"...if I were on life-support, I'd rather have it run by a Gameboy than a
Windows box." --Cliff Wells