Reading messy files with Fortran [Fortran]

Prev: reading complex data using implied do loops
Next: FTP libraries

From: Arjen Markus on 29 Jan 2010 10:49

On 29 jan, 15:33, nos...(a)see.signature (Richard Maine) wrote:
> Arjen Markus <arjen.markus...(a)gmail.com> wrote:
> > But a / not enclosed in ' or " in the input for a list-directed read is
> > defined to stop the input! That may be the cause for the Fortran program
> > to indicate an end-of-record.
>
> No, that would not constitute an end-of-record to Fortran. In fact, you
> can't get an end-of-record with list-directed input at all. Might be a
> good guess for user confusions with sloppy terminology, but not for a
> literal end-of-record.
>
> I'm not going to try to speculate about the problem from the data given.
>
> --
> Richard Maine | Good judgment comes from experience;
> email: last name at domain . net | experience comes from bad judgment.
> domain: summertriangle | -- Mark Twain

Oops, you are right. It was the slash and a discussion about using
list-directed reads to read incomplete CSV records, plus the unusual
appearance of HTML tags, that must have put me off guard.

It remains a mystery then - perhaps it is a UTF-8 character sequence
that does this ...

Regards,

Arjen

From: analyst41 on 29 Jan 2010 21:15

On Jan 29, 9:44 am, dpb <n...(a)non.net> wrote:
> analys...(a)hotmail.com wrote:
> > On Jan 28, 3:15 am, Arjen Markus <arjen.markus...(a)gmail.com> wrote:
> >> On 28 jan, 00:51, "analys...(a)hotmail.com" <analys...(a)hotmail.com>
> >> wrote:
>
> >>> I posted on this topic before and this is my latest take on it:
> >>> (1) In my case the messy files are csv extracts from a database (whose
> >>> character encoding is Unicode - I don't know if it has anything to do
> >>> with the problem).
> >>> (2) I discovered that Fortran sees spurious EOR markers within
> >>> character fields and I couldn't see a rhyme or reason why.
> >>> (3) But since I control the input - I inserted row numbers at the
> >>> beginning and end of each row extracted from the database and I added
> >>> 2000000000 to the row number to make sure it's unlikely that this data
> >>> would show up naturally.
> >>> (4) I then read each record and make sure that it has at least 18
> >>> characters (if not it is simply concatenated to cum_buffer - see
> >>> below).
> >>> I use the statement (adapted from Cooper Redwine's book)
> >>> read (unit = nn, fmt = '(A)', advance = 'no', iostat = read_stat, size
> >>> = num_chars) buffer
> >>> you must have EOR or EOF or error on each read - otherwise the buffer
> >>> is too small and the program has to be halted.
> >>> I then check if the record number is showing up at the end which is
> >>> the same as the one on the left. If yes, you have a complete record -
> >>> if not - you have a spurious EOR and simply concatenate the buffer
> >>> to another buffer called cum_buffer.
> >>> when cum_buffer looks like
> >>> 2000000127stuff2000000127
> >>> You have a facsimile of a row 127 from the database.
> >>> You might still have to struggle with separating 'stuff' into fields -
> >>> but that's a purely programming task having nothing to do with the file
> >>> system or operating system or character encoding schemes.
> >>> I hope others find this useful and suggestions for improvements would
> >>> be good.
> >> I do not remember your previous postings, but I am curious about these
> >> end-of-records. Can you send me an example? (I want to look at CSV
> >> files more closely, as I recently was confronted with some of their
> >> nastier aspects in the context of my Flibs project - http://flibs.sf.net).
>
> >> Regards,
>
> >> Arjen
>
> > I'd love to give you actual files that show fake EORs - but it is
> > copyright/proprietary data and I didn't have the time to clean it up
> > from that standpoint.
>
> > But here are three cases (the occurrence of these strings causes
> > Fortran to see a fake EOR - LF95 running on Windows):
>
> > <br />
>
> > </STRONG>
>
> > </B>
>
> > These seem to be terminators of HTML phrases - I don't know why
> > Fortran thinks these are EORs. Excel would trip up similarly as would
> > the language R - in fact, Fortran, R and Excel may see a different
> > number of rows in the same csv file.
>
> Can you post a short section of the file surrounding the offending
> characters as seen by a hex dump program so we can see what's actually in
> the data stream?
>
> Do these strings fail when read on their own in any length record or
> only in the generated output file from the database?
>
> If you can make it fail repeatedly it should be quite simple to at least
> figure out what is the root cause and whether that is a data problem or
> a bug in the particular compiler i/o library.
>
> Which raises a point of what happens w/ another compiler?

I can tell you that it's not a Fortran issue. Notepad, Excel and the R
language are unable to split the file up into records so that the
records correspond to rows in the database.

I actually don't know the Windows/DOS command to produce a HEX dump -
if someone knows it, please post it. I have reduced the problem
row set to a few rows - it should be possible to post the entire data
here as a HEX dump.
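
A minimal sketch of the reassembly scheme described in the quoted post above: read each physical record with a non-advancing read, append it to cum_buffer, and treat the row as complete once the same row tag appears at both ends. It assumes a Fortran 2003/2008 compiler (deferred-length strings, newunit=, is_iostat_eor), which LF95 may not provide. The file name messy_demo.csv, the 256-character chunk size and the 10-digit tag length are assumptions made for this demo, not details from the original post.

```fortran
program reassemble_rows
    implicit none
    integer, parameter :: tag_len = 10       ! assumed tag width, e.g. 2000000127
    integer, parameter :: chunk_len = 256    ! assumed maximum physical record length
    character(len=chunk_len) :: buffer
    character(len=:), allocatable :: cum_buffer
    integer :: u, read_stat, num_chars, n

    ! Build a tiny demo file in which row 127 is split by a stray line break.
    open (newunit=u, file='messy_demo.csv', status='replace', action='write')
    write (u, '(A)') '2000000126,plain row,2000000126'
    write (u, '(A)') '2000000127,text with <br />'
    write (u, '(A)') 'more text,2000000127'
    close (u)

    open (newunit=u, file='messy_demo.csv', status='old', action='read')
    cum_buffer = ''
    do
        read (u, '(A)', advance='no', iostat=read_stat, size=num_chars) buffer
        if (is_iostat_end(read_stat)) exit
        ! Anything other than end-of-record means an I/O error or a record
        ! longer than the buffer; both halt the program, as in the post.
        if (.not. is_iostat_eor(read_stat)) then
            error stop 'read error, or buffer too small for this record'
        end if

        ! Append the fragment; the spurious line break itself is dropped.
        cum_buffer = cum_buffer // buffer(1:num_chars)
        n = len(cum_buffer)

        ! A row is complete when the leading and trailing tags agree.
        if (n >= 2*tag_len) then
            if (cum_buffer(1:tag_len) == cum_buffer(n-tag_len+1:n)) then
                print '(A)', cum_buffer
                cum_buffer = ''
            end if
        end if
    end do
    if (len(cum_buffer) > 0) print '(A)', 'incomplete row left over: ' // cum_buffer
    close (u, status='delete')
end program reassemble_rows
```

Run as is, this prints the two reassembled rows, with the second one joined back into a single line. Splitting the text between the tags into fields is a separate step.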

From: Dr Ivan D. Reid on 30 Jan 2010 07:15

On Fri, 29 Jan 2010 18:15:06 -0800 (PST), analyst41(a)hotmail.com
<analyst41(a)hotmail.com>
wrote in <a31f5cab-b0b1-4cdf-ab66-ed1432409861(a)g39g2000vba.googlegroups.com>:

> I actually don't know the Windows/DOS command to produce a HEX dump -
> if someone knows it, please post it. I have reduced the problem
> row set to a few rows - it should be possible to post the entire data
> here as a HEX dump.

Use debug in a DOS window. Example (on the first short file I saw):

C:\cygwin\home\Compaq_Owner>cat undupe
#! /bin/bash
export DUPE=$1
export WIN_NT='$WIN_NT'
find $DUPE -type f -ls | gawk -f finddupe.awk

C:\cygwin\home\Compaq_Owner>debug undupe
-d
1554:0100 23 21 20 2F 62 69 6E 2F-62 61 73 68 0A 65 78 70 #! /bin/bash.exp
1554:0110 6F 72 74 20 44 55 50 45-3D 24 31 0A 65 78 70 6F ort DUPE=$1.expo
1554:0120 72 74 20 57 49 4E 5F 4E-54 3D 27 24 57 49 4E 5F rt WIN_NT='$WIN_
1554:0130 4E 54 27 0A 66 69 6E 64-20 24 44 55 50 45 20 2D NT'.find $DUPE -
1554:0140 74 79 70 65 20 66 20 2D-6C 73 20 7C 20 67 61 77 type f -ls | gaw
1554:0150 6B 20 2D 66 20 66 69 6E-64 64 75 70 65 2E 61 77 k -f finddupe.aw
1554:0160 6B 0A 0A 00 00 00 00 00-00 00 00 00 00 00 00 00 k...............
1554:0170 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 ................
-

--
Ivan Reid, School of Engineering & Design, _____________ CMS Collaboration,
Brunel University. Ivan.Reid@[brunel.ac.uk|cern.ch] Room 40-1-B12, CERN
KotPT -- "for stupidity above and beyond the call of duty".

From: glen herrmannsfeldt on 30 Jan 2010 13:46

analyst41(a)hotmail.com <analyst41(a)hotmail.com> wrote:
(big snip)

> I actually don't know the Windows/DOS command to produce a HEX dump -
> if someone knows it, please post it. I have reduced the problem
> row set to a few rows - it should be possible to post the entire data
> here as a HEX dump.

The DOS DEBUG command is still available, but only for files that
it can fit into memory.

A better choice is the port of the GNU file utilities, including
the od command (with the -x option) or xd. I believe if you
search for UNXUTILS at sourceforge you can find them.

That includes some very useful utilities such as grep and diff.

-- glen
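
If none of those utilities is at hand, a hex dump can also be produced from Fortran itself. The following is only a sketch, assuming a Fortran 2003/2008 compiler with stream access; the program name and the output layout (offset, 16 hex bytes, printable text) are choices made here, not something taken from the posts above. It reads one byte at a time, which is slow but adequate for a file cut down to a few rows.

```fortran
program hexdump
    implicit none
    character(len=256) :: fname
    character(len=1)   :: byte
    character(len=16)  :: text
    integer :: u, ios, offset, n, code

    ! The file to dump is taken from the first command-line argument.
    call get_command_argument(1, fname)
    if (len_trim(fname) == 0) error stop 'usage: hexdump <file>'

    ! Stream access delivers the raw bytes, with no record interpretation.
    open (newunit=u, file=trim(fname), access='stream', form='unformatted', &
          status='old', action='read', iostat=ios)
    if (ios /= 0) error stop 'cannot open file'

    offset = 0
    n = 0
    text = ''
    do
        read (u, iostat=ios) byte
        if (ios /= 0) exit                       ! end of file (or read error)
        if (n == 0) write (*, '(Z8.8, 2X)', advance='no') offset
        code = iand(ichar(byte), 255)            ! force the value into 0..255
        write (*, '(Z2.2, 1X)', advance='no') code
        n = n + 1
        if (code >= 32 .and. code <= 126) then
            text(n:n) = byte                     ! printable ASCII
        else
            text(n:n) = '.'                      ! CR, LF, UTF-8 bytes, etc.
        end if
        if (n == 16) then
            write (*, '(1X, A)') text
            offset = offset + 16
            n = 0
        end if
    end do

    ! Pad a final partial line so the text column stays aligned.
    if (n > 0) write (*, '(A, 1X, A)') repeat('   ', 16 - n), text(1:n)
    close (u)
end program hexdump
```

In the output, a stray 0A or 0D 0A in the middle of a field would show up as a line break inside the data, and bytes above 7F would point to a multi-byte encoding such as UTF-8.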

From: robin on 31 Jan 2010 06:35

"Arjen Markus" <arjen.markus895(a)gmail.com> wrote in message
news:89ef5ea7-4e37-4232-bf9c-3e4c446777ee(a)g1g2000yqi.googlegroups.com...
On 29 jan, 00:39, "analys...(a)hotmail.com" <analys...(a)hotmail.com>
wrote:
> On Jan 28, 3:15 am, Arjen Markus <arjen.markus...(a)gmail.com> wrote:

>> > > read (unit = nn, fmt = '(A)', advance = 'no', iostat = read_stat, size
>> > > = num_chars) buffer

>But a / not enclosed in ' or " in the input for a list-directed read
>is defined to stop the input! That may be the cause for the Fortran
>program to indicate an end-of-record.

No, because he is using formatted READ (see his READ statement).
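
The difference between the two kinds of READ can be seen in a small sketch. The record contents and variable names below are made up for illustration, and the program assumes a Fortran 2008 compiler for error stop. An unquoted slash ends a list-directed read and leaves the remaining items unchanged, whereas a formatted read with '(A)' treats the slash as an ordinary character.

```fortran
program slash_demo
    implicit none
    character(len=40) :: record, whole
    character(len=20) :: a, b

    record = 'alpha / beta'

    ! List-directed read: the unquoted slash terminates the input list,
    ! so b is never assigned and keeps its previous value.
    a = 'unset'
    b = 'unset'
    read (record, *) a, b
    print '(5A)', 'list-directed: a=[', trim(a), '] b=[', trim(b), ']'

    ! Formatted read with (A): the slash is just data.
    read (record, '(A)') whole
    print '(3A)', 'formatted (A): [', trim(whole), ']'

    ! Self-checks on the behaviour described above.
    if (a /= 'alpha' .or. b /= 'unset') error stop 'unexpected list-directed result'
    if (whole /= record) error stop 'unexpected formatted result'
end program slash_demo
```

Neither read reports an end-of-record here, which matches Richard Maine's point that the slash cannot explain the spurious EORs.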
