Reading messy files with Fortran [Fortran]


From: Arjen Markus on 29 Jan 2010 02:52

On 29 jan, 00:39, "analys...(a)hotmail.com" <analys...(a)hotmail.com>
wrote:
> On Jan 28, 3:15 am, Arjen Markus <arjen.markus...(a)gmail.com> wrote:
>
> > On 28 jan, 00:51, "analys...(a)hotmail.com" <analys...(a)hotmail.com>
> > wrote:
>
> > > I posted on this topic before and this is my latest take on it:
>
> > > (1) In my case the messy files are csv extracts from a database (whose
> > > character encoding is Unicode - I don't know if it has anything to do
> > > with the problem).
>
> > > (2) I discovered that Fortran sees spurious EOR markers within
> > > character fields and I couldn't see a rhyme or reason why.
>
> > > (3) But since I control the input, I inserted row numbers at the
> > > beginning and end of each row extracted from the database, and I added
> > > 2000000000 to the row number to make sure it's unlikely that this tag
> > > would show up naturally in the data.
>
> > > (4) I then read each record and make sure that it has at least 18
> > > characters (if not it is simply concatenated to cum_buffer - see
> > > below).
>
> > > I use the statement (adapted from Cooper Redwine's book)
>
> > > read (unit = nn, fmt = '(A)', advance = 'no', iostat = read_stat, size
> > > = num_chars) buffer
>
> > > You must get EOR, EOF, or an error on each read; otherwise the buffer
> > > is too small and the program has to be halted.
>
> > > I then check whether the record number at the end of the buffer is
> > > the same as the one at the start. If yes, you have a complete record;
> > > if not, you have a spurious EOR and simply concatenate the buffer
> > > to another buffer called cum_buffer.
>
> > > when cum_buffer looks like
>
> > > 2000000127stuff2000000127
>
> > > You have a facsimile of a row 127 from the database.
>
> > > You might still have to struggle with separating 'stuff' into fields,
> > > but that's purely a programming task having nothing to do with the file
> > > system, operating system, or character encoding schemes.
>
> > > I hope others find this useful and suggestions for improvements would
> > > be good.
>
> > I do not remember your previous postings, but I am curious about these
> > end-of-records. Can you send me an example? (I want to look at CSV files
> > more closely, as I recently was confronted with some of their nastier
> > aspects in the context of my Flibs project - http://flibs.sf.net).
>
> > Regards,
>
> > Arjen
>
> I'd love to give you actual files that show fake EORs, but it is
> copyright/proprietary data and I don't have the time to clean it up
> from that standpoint.
>
> But here are three cases (the occurrence of these strings causes
> Fortran to see a fake EOR - LF95 running on Windows):
>
> 
>
> 
>
> 
>
> These seem to be terminators of HTML phrases - I don't know why
> Fortran thinks these are EORs. Excel would trip up similarly as would
> the language R - in fact, Fortran, R and Excel may see a different
> number of rows in the same csv file.

Hm, that seems rather unusual for a CSV file.

But a / not enclosed in ' or " in the input for a list-directed read
is defined to stop the input! That may be the cause for the Fortran
program to indicate an end-of-record.

One way forward would be to read the entire line into a buffer and
replace the / by some other character that does not belong to the
regular data, perhaps a ~, so that list-directed reads do not
interpret it.
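
A minimal sketch of that replacement, assuming the line has already been
read into a character buffer with an (A) format. The module name
slash_guard and the function replace_char are illustrative names chosen
for this example, not part of any library.

```fortran
module slash_guard
  implicit none
contains
  ! Return a copy of line in which every occurrence of the single
  ! character old has been replaced by the single character new.
  pure function replace_char(line, old, new) result(res)
    character(len=*), intent(in) :: line
    character(len=1), intent(in) :: old, new
    character(len=len(line))     :: res
    integer :: i

    res = line
    do i = 1, len(line)
      if (res(i:i) == old) res(i:i) = new
    end do
  end function replace_char
end module slash_guard
```

After buffer = replace_char(buffer, '/', '~'), a list-directed internal
read such as read (buffer, *) no longer sees a slash that would end the
input early. Any character fields obtained this way need the substitution
reversed afterwards, and the replacement character must not occur in the
real data.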

Regards,

Arjen

From: analyst41 on 29 Jan 2010 07:39

On Jan 29, 2:52 am, Arjen Markus <arjen.markus...(a)gmail.com> wrote:
> Hm, that seems rather unusual for a CSV file.
>
> But a / not enclosed in ' or " in the input for a list-directed read
> is defined to stop the input! That may be the cause for the Fortran
> program to indicate an end-of-record.
>
> One way forward would be to read the entire line into a buffer and
> replace the / by some other character that does not belong to the
> regular data, perhaps a ~, so that list-directed reads do not
> interpret it.
>
> Regards,
>
> Arjen

But I am reading the file only with the A format (a straight read into
a buffer), and I attempt field separation only by reading the buffer.
There are other situations where a space causes an EOR. If I get a
chance I'll make a list of all the cases that I saw.

Is there a read method that ignores EOR markers (get me up to 10000
characters, and tell me how many you got, ignoring special chars of
any kind) and lets the programmer decide where the EORs are?
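
One candidate for such a read is Fortran 2003 unformatted stream access,
which treats the file as a plain sequence of bytes with no record
structure. The sketch below is only an illustration under several
assumptions: the compiler supports access='stream' (an older Fortran 95
compiler may not), one default character occupies one byte, and the file
size fits in a default integer. The names raw_reader and read_raw are
made up for this example.

```fortran
module raw_reader
  implicit none
contains
  ! Read up to len(chunk) bytes starting at byte position pos (1-based)
  ! from a unit opened with access='stream', form='unformatted'.
  ! Line terminators and control characters are returned like any other
  ! byte. nread is the number of bytes delivered; it is 0 once pos lies
  ! beyond the end of the file or if the read fails.
  subroutine read_raw(unit, pos, chunk, nread)
    integer,          intent(in)  :: unit
    integer,          intent(in)  :: pos
    character(len=*), intent(out) :: chunk
    integer,          intent(out) :: nread
    integer :: fsize, ios

    ! File size in file storage units (assumed here to be bytes).
    inquire (unit=unit, size=fsize)

    nread = max(0, min(len(chunk), fsize - pos + 1))
    chunk = ''
    if (nread > 0) then
      read (unit, pos=pos, iostat=ios) chunk(1:nread)
      if (ios /= 0) nread = 0
    end if
  end subroutine read_raw
end module raw_reader
```

The unit would be connected with something like
open (unit=nn, file=..., access='stream', form='unformatted', action='read'),
and the caller advances pos by nread after each call. Deciding where the
real records end (for example by looking for the row-number tags) is then
entirely up to the program.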

From: Richard Maine on 29 Jan 2010 09:33

Arjen Markus <arjen.markus895(a)gmail.com> wrote:

> But a / not enclosed in ' or " in the input for a list-directed read is
> defined to stop the input! That may be the cause for the Fortran program
> to indicate an end-of-record.

No, that would not constitute an end-of-record to Fortran. In fact, you
can't get an end-of-record with list-directed input at all. It might be a
good guess for user confusion caused by sloppy terminology, but not for a
literal end-of-record.

I'm not going to try to speculate about the problem from the data given.

--
Richard Maine | Good judgment comes from experience;
email: last name at domain . net | experience comes from bad judgment.
domain: summertriangle | -- Mark Twain

From: dpb on 29 Jan 2010 09:40

Richard Maine wrote:
> Arjen Markus <arjen.markus895(a)gmail.com> wrote:
>
>> But a / not enclosed in ' or " in the input for a list-directed read is
>> defined to stop the input! That may be the cause for the Fortran program
>> to indicate an end-of-record.
>
> No, that would not constitute an end-of-record to Fortran. In fact, you
> can't get an end-of-record with list-directed input at all. Might be a
> good guess for user confusions with sloppy terminology, but not for a
> literal end-of-record.
>
> I'm not going to try to speculate about the problem from the data given.

I think the only certain way to get a clue would be to see a section of
the offending file, showing the data string itself as a byte string, and
a sample program that reproduces the problem. My supposition would be that
the encoding is embedding a control character that is being interpreted,
but that's the murky crystal ball talking, not actual logic... :)

--

From: dpb on 29 Jan 2010 09:44

analyst41(a)hotmail.com wrote:
> I'd love to give you actual files that show fake EORs, but it is
> copyright/proprietary data and I don't have the time to clean it up
> from that standpoint.
>
> But here are three cases (the occurrence of these strings causes
> Fortran to see a fake EOR - LF95 running on Windows):
>
> 
>
> 
>
> 
>
> These seem to be terminators of HTML phrases - I don't know why
> Fortran thinks these are EORs. Excel would trip up similarly as would
> the language R - in fact, Fortran, R and Excel may see a different
> number of rows in the same csv file.

Can you post a short section of the file surrounding the offending
characters as seen by a hex dump program, so we can see what's actually
in the data stream?
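
If no hex dump utility is at hand, the bytes of a buffer can be printed
from Fortran itself. This is a rough sketch, assuming one byte per
default character and that ichar returns the byte value (masked here to
0..255 as a precaution). The names hex_dump and to_hex are invented for
this example.

```fortran
module hex_dump
  implicit none
contains
  ! Return the bytes of s as two-digit uppercase hexadecimal values,
  ! each followed by one blank, e.g. 'A/' becomes '41 2F '.
  pure function to_hex(s) result(hex)
    character(len=*), intent(in) :: s
    character(len=3*len(s))      :: hex
    character(len=*), parameter  :: digits = '0123456789ABCDEF'
    integer :: i, b, hi, lo

    hex = ''
    do i = 1, len(s)
      b  = iand(ichar(s(i:i)), 255)   ! byte value, forced into 0..255
      hi = b / 16 + 1                 ! index of the high hex digit
      lo = mod(b, 16) + 1             ! index of the low hex digit
      hex(3*i-2:3*i-2) = digits(hi:hi)
      hex(3*i-1:3*i-1) = digits(lo:lo)
    end do
  end function to_hex
end module hex_dump
```

Printing to_hex(buffer(1:num_chars)) for the records around a suspected
fake EOR shows any control characters directly: 0A is a line feed,
0D a carriage return, and 1A the Ctrl-Z that some Windows runtimes treat
as end-of-file.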

Do these strings fail when read on their own in any length record or
only in the generated output file from the database?

If you can make it fail repeatedly, it should be quite simple to at least
figure out what the root cause is and whether that is a data problem or
a bug in the particular compiler's I/O library.

Which raises the question of what happens with another compiler?

--
