parsing tab and newline delimited text [Python]

From: elsa on 3 Aug 2010 22:14

Hi,

I have a large file of text I need to parse. Individual 'entries' are
separated by newline characters, while fields within each entry are
separated by tab characters.

So, an individual entry might have this form (in printed form):

Title date position data

with each field separated by tabs, and a newline at the end of data.
So, I thought I could simply open the file, read each line in turn,
and parse it....

f = open('MyFile')
line = f.readline()
parts = line.split('\t')

etc...

However, 'data' is a fairly random string of characters. Because the
files I'm processing are large, there is a good chance that in every
file, there is a data field that might look like this:

899998dlKKlS\lk3#kdf\nllllKK99

or like this:

LLLSDKJJJdkkf334$\ttttks)))K99

So, you see, the random strings '\n' and '\t' are stopping me from
being able to parse my file correctly. Any suggestions on how to
overcome this problem would be greatly appreciated.

Many thanks,

Elsa

From: James Mills on 3 Aug 2010 22:32

On Wed, Aug 4, 2010 at 12:14 PM, elsa <kerensaelise(a)hotmail.com> wrote:
> I have a large file of text I need to parse. Individual 'entries' are
> separated by newline characters, while fields within each entry are
> separated by tab characters.

Sounds to me like a job for the csv module.
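
A minimal sketch of what that could look like, in Python 3 syntax (the thread itself uses Python 2). The sample rows are invented for illustration, and passing csv.QUOTE_NONE is an assumption made here so that quote characters in the random data field are left alone. This only helps if the data field holds a backslash followed by a letter, not real tab or newline characters:

```python
import csv
import io

# Invented sample: two tab-delimited records. The first data field contains
# a backslash followed by 'n' (two characters), the second a double quote.
sample = (
    "Title1\t2010-08-02\t42\t899998dlKKlS\\lk3#kdf\\nllllKK99\n"
    'Title2\t2010-08-03\t43\tab"cd\n'
)

# QUOTE_NONE tells the reader not to give quote characters any special meaning.
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)

for row in rows:
    print(row)
```

With a real file, the csv documentation recommends opening it with newline='' and passing the file object to csv.reader in place of the StringIO.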

cheers
James

--
-- James Mills
--
-- "Problems are solved by method"

From: Tim Chase on 3 Aug 2010 22:49

On 08/03/10 21:14, elsa wrote:
> I have a large file of text I need to parse. Individual 'entries' are
> separated by newline characters, while fields within each entry are
> separated by tab characters.
>
> So, an individual entry might have this form (in printed form):
>
> Title date position data
>
> with each field separated by tabs, and a newline at the end of data.
> So, I thought I could simply open a file, read each line in in turn,
> and parse it....
>
> f=open('MyFile')
> line=f.readline()
> parts=line.split('\t')
>
> etc...
>
> However, 'data' is a fairly random string of characters. Because the
> files I'm processing are large, there is a good chance that in every
> file, there is a data field that might look like this:
>
> 899998dlKKlS\lk3#kdf\nllllKK99

My first question is whether the line contains actual newline/tab
characters within the field data, or the string-representation of
the line. For one of the lines in question, what does

print repr(line)

(or "print line.encode('hex')") produce? If the line has extra
literal tabs, then you may be stuck; if the line has escaped text
(a backslash followed by an "n" or "t", i.e. 2 characters) then
it's pretty straightforward. Ideally, you'd see something like

>>> print repr(line)
'MyTitle\t2010-08-02\t42\t89998dlKKlS\\lk3#kdf\\nlllKK99'
        ^tab        ^tab^tab         ^backslash^

where the backslashes are literal.
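
To make the distinction concrete, here is a small sketch in Python 3 syntax (the thread uses Python 2, where print is a statement). The two short strings are invented for illustration:

```python
escaped = "kdf\\nlll"  # backslash followed by 'n': two separate characters
actual = "kdf\nlll"    # a real newline: one character

print(repr(escaped))   # 'kdf\\nlll'
print(repr(actual))    # 'kdf\nlll'

# The escaped form is one character longer and does not break the line.
print(len(escaped), len(actual))  # 8 7
escaped_pieces = escaped.split("\n")
actual_pieces = actual.split("\n")
```

Only the second string is cut in two when the text is split on newlines.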

If you know that it's only the last ("data") field that can contain
such characters, you can at least cope with embedded tabs by
limiting the number of splits to the first N:

parts = line.split('\t', 3)
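
A short sketch of the effect, in Python 3 syntax, assuming four fields per record and a made-up line whose data field contains a real tab character:

```python
# Invented record: the last field contains an actual tab character.
line = "MyTitle\t2010-08-02\t42\tLLLSDKJJJdkkf334$\tttks)))K99\n"

# Strip only the trailing record terminator, not other whitespace.
record = line.rstrip("\n")

naive = record.split("\t")       # splits inside the data field as well
limited = record.split("\t", 3)  # at most 3 splits, so at most 4 fields

print(len(naive))    # 5
print(len(limited))  # 4
print(repr(limited[3]))
```

The second argument to split is the maximum number of splits, so everything after the third tab stays in the last field.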

Limiting the splits doesn't solve the newline problem, though: your
file's definition prevents you from being able to discern the record
boundaries in something like

filedata = 'title1\tdate1\tpos1\tdata1\nxxxx\tyyyy\tzzzz\twwww\n'

Would xxxx/yyyy/zzzz/wwww be a continuation of data1 or are they
the items in the next row?
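
The ambiguity can be shown directly. In this sketch (Python 3 syntax), serialize is a hypothetical helper that writes records the way the file format is described in this thread, with tabs between fields and a newline after each record:

```python
def serialize(records):
    """Join fields with tabs and terminate each record with a newline."""
    return "".join("\t".join(fields) + "\n" for fields in records)

# Interpretation 1: two records of four fields each.
two_records = [
    ["title1", "date1", "pos1", "data1"],
    ["xxxx", "yyyy", "zzzz", "wwww"],
]

# Interpretation 2: one record whose data field contains a real newline
# and real tabs.
one_record = [
    ["title1", "date1", "pos1", "data1\nxxxx\tyyyy\tzzzz\twwww"],
]

filedata = 'title1\tdate1\tpos1\tdata1\nxxxx\tyyyy\tzzzz\twwww\n'
same_text = serialize(two_records) == serialize(one_record) == filedata
print(same_text)  # True
```

Both interpretations produce exactly the same text, so nothing in the file itself can tell a parser which one was meant.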

-tkc

From: MRAB on 3 Aug 2010 23:05

elsa wrote:
> Hi,
>
> I have a large file of text I need to parse. Individual 'entries' are
> separated by newline characters, while fields within each entry are
> separated by tab characters.
>
> So, an individual entry might have this form (in printed form):
>
> Title date position data
>
> with each field separated by tabs, and a newline at the end of data.
> So, I thought I could simply open a file, read each line in in turn,
> and parse it....
>
> f=open('MyFile')
> line=f.readline()
> parts=line.split('\t')
>
> etc...
>
> However, 'data' is a fairly random string of characters. Because the
> files I'm processing are large, there is a good chance that in every
> file, there is a data field that might look like this:
>
> 899998dlKKlS\lk3#kdf\nllllKK99
>
> or like this:
>
> LLLSDKJJJdkkf334$\ttttks)))K99
>
> so, you see the random strings '\n' and '\t' are stopping me from
> being able to parse my file correctly. Any
> suggestions on how to overcome this problem would be greatly
> appreciated.
>
When you say random strings '\n', etc, are they the backslash character
\ followed by the letter n? If so, then you don't have a problem. They
are \ followed by n.

If, on the other hand, by '\n' you mean the newline character, then,
well, that's a newline character, and there's (probably) nothing you can
do about it.

From: elsa on 3 Aug 2010 23:35

On Aug 4, 12:49 pm, Tim Chase <python.l...(a)tim.thechases.com> wrote:
> On 08/03/10 21:14, elsa wrote:
>
>
>
> > I have a large file of text I need to parse. Individual 'entries' are
> > separated by newline characters, while fields within each entry are
> > separated by tab characters.
>
> > So, an individual entry might have this form (in printed form):
>
> > Title date position data
>
> > with each field separated by tabs, and a newline at the end of data.
> > So, I thought I could simply open a file, read each line in in turn,
> > and parse it....
>
> > f=open('MyFile')
> > line=f.readline()
> > parts=line.split('\t')
>
> > etc...
>
> > However, 'data' is a fairly random string of characters. Because the
> > files I'm processing are large, there is a good chance that in every
> > file, there is a data field that might look like this:
>
> > 899998dlKKlS\lk3#kdf\nllllKK99
>
> My first question is whether the line contains actual newline/tab
> characters within the field data, or the string-representation of
> the line. For one of the lines in question, what does
>
> print repr(line)

Here is what I get at the interactive prompt:

>>> line = """IIIIIIIIIIIIIIIIIIIIIG=4448>IIIIIIIIIIIIIIIIIIIIIIIIIIIIIG666HIIIIII;;;IIIIIIEIIII??55
.... :E?IEEEEFHGCACIIIII699;66IG11G???IIIIIIIIIIIIG???GGGII@@@@GG?;;
9>CCIIIIIIIIIIICCCCGHHIIIGEEDBB?9951//////6=ABB=EEGII98AEIECCC>>;A=F@;;
44//11::=<<?ADECCCEEEEEIIIIHHHIIGCCCEI99"""

>>> line
'IIIIIIIIIIIIIIIIIIIIIG=4448>IIIIIIIIIIIIIIIIIIIIIIIIIIIIIG666HIIIIII;;;IIIIIIEIIII??
55\n:E?IEEEEFHGCACIIIII699;66IG11G???IIIIIIIIIIIIG???GGGII@@@@GG?;;
9>CCIIIIIIIIIIICCCCGHHIIIGEEDBB?9951//////6=ABB=EEGII98AEIECCC>>;A=F@;;
44//11::=<<?ADECCCEEEEEIIIIHHHIIGCCCEI99'

>>> print repr(line)
'IIIIIIIIIIIIIIIIIIIIIG=4448>IIIIIIIIIIIIIIIIIIIIIIIIIIIIIG666HIIIIII;;;IIIIIIEIIII??
55\n:E?IEEEEFHGCACIIIII699;66IG11G???IIIIIIIIIIIIG???GGGII@@@@GG?;;
9>CCIIIIIIIIIIICCCCGHHIIIGEEDBB?9951//////6=ABB=EEGII98AEIECCC>>;A=F@;;
44//11::=<<?ADECCCEEEEEIIIIHHHIIGCCCEI99'

Basically, these are numeric values encoded as ASCII symbols. So '\' is
a value, 'n' is a value, 'E' is a value, etc. It's all part of the same
data field. It's just unfortunate that '\' and 'n' have ended up
together. (I didn't design this file, btw, I'm just expected to
process it!)
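
Note that a non-raw string literal typed or pasted at the prompt turns a backslash followed by 'n' into a newline, so the repr of such a literal cannot show which of the two the file actually contains. A minimal sketch of checking the bytes as stored on disk instead, in Python 3 syntax; the temporary file and its contents are made up for illustration:

```python
import os
import tempfile

# Invented record: the data field contains a backslash followed by 'n'
# (two bytes), and the record ends with a real newline (one byte).
sample = b"Title\t2010-08-02\t42\tIII??55\\n:E?IEE\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    with open(path, "rb") as f:  # binary mode: no newline translation
        raw = f.readline()
finally:
    os.remove(path)

print(repr(raw))
escaped_pairs = raw.count(b"\\n")  # backslash + 'n' pairs
real_newlines = raw.count(b"\n")   # actual newline bytes
print(escaped_pairs, real_newlines)
```

If the escaped pairs show up inside a single line read this way, the data is two ordinary characters and line-by-line reading is safe.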

Elsa.
