write a 20GB file [Python]


From: Jackie Lee on 14 May 2010 05:40

Hello there,

I have a 22 GB binary file, and I want to change the values at specific
positions. Because of the size of the file, I doubt my code is an
efficient one:

#! /usr/bin/env python
#coding=utf-8
import sys
import struct

try:
    f=open(sys.argv[1],'rb+')
except (IOError,Exception):
    print '''usage:
scriptname segyfilename
'''
    sys.exit(1)

#skip EBCDIC header
try:
    f.seek(3200)
except Exception:
    print 'Oops! your file is broken..'

#read binary header
binhead = f.read(400)
ns = struct.unpack('>h',binhead[20:22])[0]
if ns < 0:
    print 'file read error'
    sys.exit(1)

#read trace header
while True:
    f.seek(28,1)
    f.write(struct.pack('>h',1))
    f.seek(212,1)
    f.seek(ns*4,1)

f.close()

From: Dave Angel on 14 May 2010 07:04

I don't see a question anywhere. So perhaps you just want comments on
your code.

1) How do you plan to test this?
2) Consider doing a lot more checking to see that you have in fact a
file of the right type.
3) Fix indentation - perhaps you've accidentally used a tab in the source.
4) Provide a termination condition for the while True loop, which
currently will (I think) go forever, or perhaps until the disk fills up.
5) Depending on the purpose of this file, you should consider making the
changes on a copy, then deleting and renaming. As it stands, if the
program gets aborted part way through, there's no way to know how far it
got. Since it's just clobbering bytes, it would be safe to rerun the
same program again, but many times that's not the case. And this
program clearly isn't finished yet, so perhaps it's not true here either.
6) I don't see anything inefficient about it. The nature of the problem
is going to be very slow (for small values of ns), but I don't know what
your code could do to speed it up. Perhaps make sure the file is on a
fast drive, and not RAID 5.

DaveA
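
Points 2 and 4 above can be sketched roughly as follows. This is only a sketch under stated assumptions, not the poster's code: it is written for Python 3 (the thread's scripts are Python 2), it assumes a 240-byte trace header with the 2-byte field at offset 28 and 4-byte samples, and the names patch_traces and make_test_file are hypothetical. Note that the original loop skips 28 + 2 + 212 = 242 bytes per header, so that offset is worth checking. The loop count is derived from the file size, so it terminates, and the size check rejects files that do not hold a whole number of traces.

```python
import os
import struct

TEXT_HEADER = 3200    # EBCDIC header, skipped as in the original script
BINARY_HEADER = 400   # binary file header
TRACE_HEADER = 240    # assumed trace header length
FIELD_OFFSET = 28     # assumed offset of the 2-byte field in each trace header


def patch_traces(path, value=1):
    """Write `value` into the 2-byte field of every trace header.

    Returns the number of traces patched.
    """
    size = os.path.getsize(path)
    with open(path, 'rb+') as f:
        f.seek(TEXT_HEADER)
        binhead = f.read(BINARY_HEADER)
        if len(binhead) != BINARY_HEADER:
            raise ValueError('file too short to hold the headers')
        ns = struct.unpack('>h', binhead[20:22])[0]
        if ns <= 0:
            raise ValueError('bad sample count: %d' % ns)
        trace_len = TRACE_HEADER + ns * 4   # assumes 4-byte samples
        data_len = size - TEXT_HEADER - BINARY_HEADER
        if data_len % trace_len:
            raise ValueError('file size is not a whole number of traces')
        ntraces = data_len // trace_len
        packed = struct.pack('>h', value)
        start = TEXT_HEADER + BINARY_HEADER + FIELD_OFFSET
        for i in range(ntraces):
            # Absolute seeks: a rerun after an abort lands on the same bytes.
            f.seek(start + i * trace_len)
            f.write(packed)
    return ntraces


def make_test_file(path, ns=5, ntraces=3):
    """Build a tiny zero-filled file with the same layout, for testing."""
    binhead = bytearray(BINARY_HEADER)
    binhead[20:22] = struct.pack('>h', ns)
    with open(path, 'wb') as f:
        f.write(b'\x00' * TEXT_HEADER)
        f.write(binhead)
        for _ in range(ntraces):
            f.write(b'\x00' * (TRACE_HEADER + ns * 4))
```

Running patch_traces on a file produced by make_test_file is one cheap answer to point 1, and trying it on a copy of the real data first follows point 5.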

From: Jackie Lee on 14 May 2010 07:32

Thx, Dave,

The code works fine. I just don't know how f.write works. The
documentation says that file.write won't write to the file until
file.close or file.flush is called. So I don't know if the following
version is more efficient (sorry, I forgot to add a condition to break
the loop):

#! /usr/bin/env python
#coding=utf-8
import sys
import struct

try:
    f=open(sys.argv[1],'rb+')
except (IOError,Exception):
    print '''usage:
scriptname segyfilename
'''
    sys.exit(1)

#skip EBCDIC header
try:
    f.seek(3200)
except Exception:
    print 'Oops! your file is broken..'

#read binary header
binhead = f.read(400)
ns = struct.unpack('>h',binhead[20:22])[0]
if ns < 0:
    print 'file read error'
    sys.exit(1)

#read trace header
while True:
    f.seek(28,1)
    if f.read(2) == '':
        break
    f.seek(-2,1)
    f.write(struct.pack('>h',1))
    f.seek(210,1)
    f.seek(ns*4,1)

f.close()


--
Jackie

From: J on 14 May 2010 10:50

On Fri, May 14, 2010 at 07:32, Jackie Lee <jackie.space(a)gmail.com> wrote:
> Thx, Dave,
>
> The code works fine. I just don't know how f.write works. It says that
> file.write won't write the file until file.close or file.flush. So I
> don't know if the following one is more efficient (sorry I forget to
> add condition to break the loop):

Someone smarter than me can correct me, but file.write() will write
when its buffer is filled, or when close() or flush() is called.
I don't know what the default buffer size for file.write() is, though.
close() flushes the buffer before closing the file, and flush()
flushes the buffer and leaves the file open for further writing.

> try:
> f=open(sys.argv[1],'rb+')
> except (IOError,Exception):
> print '''usage:
> scriptname segyfilename
> '''

You can just add an f.flush() after every write to the file, but I
tend to open files with a buffer size of 0, like this:

f = open(filename,"rb+",0)

Then again, I don't deal with files of that size, so there could be a
problem with my way once you start scaling up to the 20GB or larger
that you're working with.

Again, I could be wrong about all of that, so if so, I hope someone
will correct me and fix my understanding...

Cheers,

Jeff
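
The difference between the two options above can be checked on a scratch file. This is a small sketch in Python 3 (an assumption; the thread itself uses Python 2), and the helper name sizes_after_write is made up for the illustration. It writes two bytes and records the on-disk size before and after flush(), once unbuffered and once with default buffering.

```python
import os
import tempfile


def sizes_after_write(buffering):
    """Return (size before flush, size after flush) for one 2-byte write."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, 'rb+', buffering) as f:
            f.write(b'\x00\x01')
            before = os.path.getsize(path)
            f.flush()
            after = os.path.getsize(path)
        return before, after
    finally:
        os.remove(path)


unbuffered = sizes_after_write(0)    # same call shape as open(filename, "rb+", 0)
buffered = sizes_after_write(-1)     # -1 selects the default buffer size
print(unbuffered, buffered)
```

With a buffer size of 0 the bytes reach the file as soon as write() returns, at the cost of one system call per write. With the default buffer they stay in memory until flush(), close(), or a full buffer pushes them out.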

From: Martin v. Loewis on 14 May 2010 11:07

> The code works fine. I just don't know how f.write works. It says that
> file.write won't write the file until file.close or file.flush.

You are misinterpreting the documentation. It certainly won't keep the
entire file in memory. Instead, it has a fixed-size buffer (something
like 8 KiB or 32 KiB) into which it writes and which it flushes when that
buffer is full.

The comment about flush and close merely refers to the problem that some
data may still be in the buffer at any point in time, unless you just
called close or flush.

HTH,
Martin
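
The fixed-size buffer described above can be observed directly. The sketch below is Python 3 (an assumption, since the thread's code is Python 2) and picks an explicit 4096-byte buffer purely for the demonstration. It writes three 3000-byte chunks without ever calling flush() and records the on-disk size after each write.

```python
import os
import tempfile

BUF = 4096  # explicit buffer size, chosen only for this demo

fd, path = tempfile.mkstemp()
os.close(fd)

sizes = []
with open(path, 'wb', BUF) as f:
    for _ in range(3):
        f.write(b'x' * 3000)
        # No flush() here: any growth on disk comes from the buffer filling up.
        sizes.append(os.path.getsize(path))

final_size = os.path.getsize(path)  # after close(), everything is on disk
os.remove(path)

print(sizes, final_size)
```

The first chunk fits in the buffer, so the file is still empty after it. Later chunks no longer fit next to what is already buffered, so data is pushed out on its own, and close() writes whatever remains. Memory use stays bounded by the buffer size regardless of how large the file is.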
