From: analyst41 on
On Oct 11, 9:20 am, dpb <n...(a)non.net> wrote:
> analys...(a)hotmail.com wrote:
>
> ...
>
> > I am asking this question for a practical reason.  I extract a csv
> > file from a database and apparently their native character encoding
> > isn't 7-bit ASCII.  When I read this extracted file with Fortran two
> > things happen:
>
> > (1) (trivial) some characters are misinterpreted - even excel, notepad
> > etc, do the same thing.
>
> > (2) (non-trivial) spurious "End of record" markers are seen by
> > Fortran, Excel etc. (If I eliminate the character fields from the
> > database extract, this doesn't happen) and the file as read in sees
> > more records than there are rows in the database.
>
> > (3) I posed this problem earlier to the ng. and although I received
> > some suggestions, I still haven't solved the problem.
>
> This doesn't seem to be a Fortran problem, really, but one in the file
> generation from the database.
>
> What hardware, OS, database, etc.?  Knowing that might lead to somebody
> having input toward resolving the problem.
>
> --

The database is Microsoft SQL Server.

The download engine is the Microsoft SQL client.

The OS is Windows XP, running Lahey Fortran. But you are right - it is
not necessarily a Fortran problem, since Excel, Notepad etc. have the
same problem. I thought I might be able to reconcile rows and
records using "byte by byte" processing in Fortran, but with no
luck so far. The suggestions I received helped me to resolve
delimiters embedded within delimiters (thanks to everybody who
contributed - the SOBs who built the database even use '|' as a datum
instead of a delimiter, and of course commas, spaces and periods within
'"'-quoted fields happen all the time, as do HTML-type delimiters "</" and "/>"),
but I don't know how to remove spurious EOR markers (even DOS's 'type'
command sees them.)

I can provide any other info. needed.
From: Dan Nagle on
Hello,

On 2009-10-11 10:20:34 -0400, analyst41(a)hotmail.com said:

> But you are right - it is
> not necessarily a Fortran problem, since Excel, Notepad etc. have the
> same problem. I thought I might be able to reconcile rows and
> records using "byte by byte" processing in Fortran, but with no
> luck so far. The suggestions I received helped me to resolve
> delimiters embedded within delimiters (thanks to everybody who
> contributed - the SOBs who built the database even use '|' as a datum
> instead of a delimiter, and of course commas, spaces and periods within
> '"'-quoted fields happen all the time, as do HTML-type delimiters "</" and "/>"),
> but I don't know how to remove spurious EOR markers (even DOS's 'type'
> command sees them.)

If you can find a compiler that supports the f08
i/o encoding= specifier, you might be able to tinker
with the character set (just as another knob to twist).
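
For example, a minimal sketch (the file name is mine, it assumes the
extract really is UTF-8, and encoding= / newunit= need a Fortran
2003/2008-capable compiler, which an older Lahey release may not be):

  ! Sketch only: open the extract with an explicit encoding and read
  ! it line by line as before.
  program read_utf8_csv
     implicit none
     character(len=1024) :: line
     integer :: unit, ios

     open (newunit=unit, file='extract.csv', form='formatted', &
           action='read', status='old', encoding='UTF-8', iostat=ios)
     if (ios /= 0) stop 'could not open extract.csv'

     do
        read (unit, '(A)', iostat=ios) line
        if (ios /= 0) exit
        ! ... parse the line as before ...
     end do

     close (unit)
  end program read_utf8_csv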

--
Cheers!

Dan Nagle

From: dpb on
analyst41(a)hotmail.com wrote:
....
>>> (2) (non-trivial) spurious "End of record" markers are seen by
>>> Fortran, Excel etc. (If I eliminate the character fields from the
>>> database extract, this doesn't happen) and the file as read in sees
>>> more records than there are rows in the database.
....
> but I don't know how to remove spurious EOR markers (even DOS's 'type'
> command sees them.)
....

That implies these are embedded into the character fields' data then?

If so, I would see only two ways to attack--

1) Get the originators of the database to fix the problem (not likely, I
gather)

2) Clean up after their mess (which is obviously what you're trying to do)

I don't think 2) is possible unequivocally unless there is some way to
tell what a record and field length should be a priori and know there
are a fixed number of fields per record. Or, iff the field separators
are reliable, then you should be able to count fields.

If that is the case, then it would seem that the only way would be to open
the file first as a "binary" (sorry for the vernacular usage, Richard :) )
stream, count field delimiters, and simply toss out EOR characters
that don't belong, then rewrite the file before processing it as csv.

That seems to me to be doable in theory; whether it would work in
practice I don't know.
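
A rough sketch of that approach (everything here is assumed: the file
names, the column count, CR/LF record endings, and ',' as the field
delimiter; quoted commas are not handled and would have to be layered
on top):

  ! Sketch only: copy the file byte by byte, dropping CR/LF bytes that
  ! arrive before the expected number of field delimiters has been seen.
  program strip_spurious_eor
     implicit none
     integer, parameter :: nfields = 12        ! assumed number of columns
     character(len=1) :: c
     integer :: in, out, ios, ncommas

     open (newunit=in,  file='extract.csv', access='stream', &
           form='unformatted', action='read',  status='old')
     open (newunit=out, file='clean.csv', access='stream', &
           form='unformatted', action='write', status='replace')

     ncommas = 0
     do
        read (in, iostat=ios) c
        if (ios /= 0) exit
        select case (iachar(c))
        case (13, 10)                          ! CR or LF
           if (ncommas == nfields - 1) then    ! a real end of record
              write (out) c
              if (iachar(c) == 10) ncommas = 0 ! LF closes the record
           end if                              ! otherwise spurious: drop it
        case default
           if (c == ',') ncommas = ncommas + 1
           write (out) c
        end select
     end do

     close (in)
     close (out)
  end program strip_spurious_eor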

--
From: dpb on
dpb wrote:
....
> I don't think 2) is possible unequivocally unless there is some way to
> tell what a record and field length should be a priori and know there
> are a fixed number of fields per record. Or, iff the field separators
> are reliable, then you should be able to count fields.
>
> If that is the case, then it would seem that the only way would be to open
> the file first as a "binary" (sorry for the vernacular usage, Richard :) )
> stream, count field delimiters, and simply toss out EOR characters
> that don't belong, then rewrite the file before processing it as csv.

Of course, it would require all your previous logic for parsing character
fields correctly, as I presume there will be embedded record delimiters
in them as well, so it isn't simply a matter of counting their occurrence.

Sounds like a PITA, indeed... :(
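
One possible shape for that extra logic, sketched under the assumption
that fields are quoted with '"' and that doubled quotes inside a field
do not occur (if they do, this would need another tweak):

  ! Sketch only: track whether the scan is currently inside a quoted
  ! field, so commas and CR/LF bytes inside quotes are left alone.
  module quote_state
     implicit none
     logical :: in_quotes = .false.
  contains
     logical function is_field_delim(c)
        character(len=1), intent(in) :: c
        if (c == '"') in_quotes = .not. in_quotes   ! toggle on each quote
        is_field_delim = (c == ',') .and. .not. in_quotes
     end function is_field_delim
  end module quote_state

In the byte-by-byte loop above, is_field_delim(c) would replace the bare
c == ',' test, and the CR/LF branch would only drop bytes when in_quotes
is .false. and the delimiter count is short.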

--
From: analyst41 on
On Oct 11, 10:35 am, dpb <n...(a)non.net> wrote:
> analys...(a)hotmail.com wrote:
>
> ...
> >>> (2) (non-trivial) spurious "End of record" markers are seen by
> >>> Fortran, Excel etc. (If I eliminate the character fields from the
> >>> database extract, this doesn't happen) and the file as read in sees
> >>> more records than there are rows in the database.
> ...
> > but I don't know how to remove spurious EOR markers (even DOS's 'type'
> > command sees them.)
>
> ...
>
> That implies these are embedded into the character fields' data then?
>
> If so, I would see only two ways to attack--
>
> 1) Get the originators of the database to fix the problem (not likely, I
> gather)
>
> 2) Clean up after their mess (which is obviously what you're trying to do)
>
> I don't think 2) is possible unequivocally unless there is some way to
> tell what a record and field length should be a priori and know there
> are a fixed number of fields per record.  Or, iff the field separators
> are reliable, then you should be able to count fields.

The problems are caused by large varchar columns - so I don't think
the notion of record length makes sense here.

>
> If that is the case, then it would seem that the only way would be to open
> the file first as a "binary" (sorry for the vernacular usage, Richard :) )
> stream, count field delimiters, and simply toss out EOR characters
> that don't belong, then rewrite the file before processing it as csv.

That sounds interesting: we know that "true" EORs can only occur
after the last column in the database.  So if one sees them "in
between", one can throw them out.

Is there a Windows/DOS tool that will let me see the EOR characters?

I haven't used binary files in ages, so any pointers as to how I can do
that would be appreciated.  I suppose I can look for the "EOR" (real or
spurious) via the IACHAR value of the EOR marker (line endings are
control-J/linefeed on Unix and usually CR followed by LF on Windows, but
I don't exactly know what the csv downloader in the database client puts
at the end of records.)
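
A minimal way to look at them from Fortran itself, sketched with stream
I/O (Fortran 2003; the file name is assumed):

  ! Sketch only: report the position and code of every control character
  ! (IACHAR < 32), which shows exactly what ends each record.
  program show_eor_bytes
     implicit none
     character(len=1) :: c
     integer :: unit, ios, pos

     open (newunit=unit, file='extract.csv', access='stream', &
           form='unformatted', action='read', status='old')
     pos = 0
     do
        read (unit, iostat=ios) c
        if (ios /= 0) exit
        pos = pos + 1
        if (iachar(c) < 32) print '(a,i0,a,i0)', 'byte ', pos, ': code ', iachar(c)
     end do
     close (unit)
  end program show_eor_bytes

A code of 13 is CR (control-M) and 10 is LF (control-J); a Windows CSV
writer will normally end each record with the pair 13,10.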

>
> That seems to me to be doable in theory; whether it would work in
> practice I don't know.
>
> --