From: Dave Allured on
Gaetano Esposito wrote:
>
> The problem I am going to detail used to occur with "random" (i.e.
> multiple) combination of compilers and machines/architectures. Because
> of this uncertainty, I switched a while ago to static compilation on
> one machine with the program running on others. At first I was
> satisfied with this solution, but now the problem is back here, and I
> ran out of ideas...
>
> I must use a legacy code which interfaces with other codes through an
> unformatted binary file. The legacy code writes a binary file, which
> in turn is read by another program unit. This is really messy; I tried
> to rewrite everything without the useless binary read/write, but the
> legacy code is too massive to be cleanly dealt with, and I am running
> out of time.
>
> The piece of legacy code writing the binary is:
>
>       WRITE (LINKTP) VERS, PREC, KERR
>       ...
>       WRITE (LINKTP) LENIMC, LENRMC, NO, KK, NLITE
>       WRITE (LINKTP) PATM, (WT(K), EPS(K), SIG(K), DIP(K),
>      1               POL(K), ZROT(K), NLIN(K), K=1,KK),
>      2               ((COFLAM(N,K),N=1,NO),K=1,KK),
>      4               ((COFETA(N,K),N=1,NO),K=1,KK),
>      5               (((COFD(N,J,K),N=1,NO),J=1,KK),K=1,KK),
>      6               (KTDIF(N),N=1,NLITE),
>      7               (((COFTD(N,J,L),N=1,NO),J=1,KK),L=1,NLITE)
>
> There is a corresponding READ statement in the subroutine that is
> supposed to handle this:
>
>       READ (LINKMC, ERR=999) VERS, PREC, KERR
>       READ (LINKMC, ERR=999) LI, LR, NO, NKK, NLITE
>       READ (LINKMC) PATMOS, (RMCWRK(NWT+N-1),
>      1   RMCWRK(NEPS+N-1), RMCWRK(NSIG+N-1),
>      2   RMCWRK(NDIP+N-1), RMCWRK(NPOL+N-1), RMCWRK(NZROT+N-1),
>      3   IMCWRK(INLIN+N-1), N=1,NKK),
>      4   (RMCWRK(NLAM+N-1), N=1,NK), (RMCWRK(NETA+N-1), N=1,NK),
>      5   (RMCWRK(NDIF+N-1), N=1,NK2),
>      6   (IMCWRK(IKTDIF+N-1), N=1,NLITE), (RMCWRK(NTDIF+N-1), N=1,NKT)
>
> For a reason that is beyond my comprehension, the file created by the
> WRITE statements cannot be read by the READ statements.
>
> I investigated this problem for a long time, comparing a working
> version of the unformatted file (which I had kept from another
> computation) to the ones which are now not working (by "working" I
> mean readable by the READ statements).
> It turns out that the two files are **almost** (of course) identical.
> While the output of the unix command "cmp" is unreadable for me, if I
> dump the binary files' content using "od -x > linking_file" for both
> (working and not-working) versions and "diff" them, I get just one
> different line:
> $ diff tplink_working_x tplink_x
> 5c5
> < 0000100 0002 0000 0014 0000 523c 0006 0000 0000
> ---
> > 0000100 0002 0000 0014 0000 0014 0000 0000 0000
>
> Now, I found out that the problem for the non-working binary is in
> the third record (the third READ): because I am not an expert in
> fortran I/O, I manually added variables to be read in the last record
> until I got the error (luckily it was happening at the 3rd value!).
>
> Code:
> "
> [first two READ statements OK]
> ...
> read(l2) data1, data2, data3
> ...
> "
>
> Runtime error message:
> "
> forrtl: severe (67): input statement requires too much data [...]
> "
> Therefore, I hypothesize that the problem is in the WRITE statement.
>
> My questions are:
> _ What is causing this erratic behavior in WRITE? Is there a way to
> isolate it?
> _ What could cause a forced end-of-record instruction to appear?
> _ I understand that sometimes erratic behaviors are associated with
> memory faults. Can this be the case? If yes, note that all the
> bounds-checking flags give no errors. Moreover, there are several
> COMMON blocks in the legacy code.
>
> I appreciate your always incredibly competent and spot-on comments.

A few things come to mind. First, I concur with glen, Steve, and others
that the problem is in the binary file, not in the read statements.
This answers your second question: the apparent end of record is caused
by a bad record length in the file. Glen's dissection of the hex dump
was right on.

On your third question, YES, memory faults can cause problems like
this. This is one of my leading candidates for your problem.

FYI the record format of an unformatted sequential binary file on most
systems is [LENGTH] [DATA] [LENGTH], where LENGTH is the number of bytes
in the DATA block. This is usually a 4-byte integer, as seen here.
(Some newer "64-bit" systems use 8-byte integers.) The trailing LENGTH
integer should be an exact copy of the leading LENGTH.
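
If you want to inspect those LENGTH words directly, here is a minimal
sketch of a marker walker. It assumes Fortran 2003 stream access (which
ifort 11.x provides), 4-byte markers, and a hypothetical file name
'tplink'; adjust all three to your case.

      PROGRAM DUMPREC
C     Walk an unformatted sequential file and print each record's
C     leading and trailing LENGTH control words.  Marker size and
C     file name are assumptions, not taken from the original code.
      IMPLICIT NONE
      INTEGER*4 LEN1, LEN2
      INTEGER*8 POS
      INTEGER NREC, IOS
      OPEN (10, FILE='tplink', ACCESS='STREAM',
     1      FORM='UNFORMATTED', STATUS='OLD')
      NREC = 0
      POS  = 1
   10 READ (10, POS=POS, IOSTAT=IOS) LEN1
      IF (IOS .NE. 0) GO TO 20
C     Skip the data block to land on the trailing marker.
      POS = POS + 4 + LEN1
      READ (10, POS=POS, IOSTAT=IOS) LEN2
      IF (IOS .NE. 0) GO TO 20
      NREC = NREC + 1
      PRINT *, 'record', NREC, ': leading', LEN1, ', trailing', LEN2
      POS = POS + 4
      GO TO 10
   20 CLOSE (10)
      END

A leading/trailing pair that disagree, or a marker that does not match
the byte count of the I/O list, pinpoints the corrupted record.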

With this in mind, it seems that the leading LENGTH control word of the
third record was somehow changed from hex 0006523C (414268 bytes) to
hex 00000014 (20 bytes), as others already noted. This means that the
reported READ error is exactly what it should be for this condition, a
"short record". This binary file must be considered corrupted, at least
by the time it is seen by your third read statement.

BTW, if all of the data in the file is in units that are multiples of 4
bytes (such as normal integers, floats, and doubles), then "od -Ad -i"
dumps the file with decimal addresses and decimal 4-byte integers. Then
it is much easier to review the LENGTH control words.

A couple of respondents speculated about a possible defect in the I/O
subsystem. In my experience these kinds of things are usually (but not
always) NOT in the compiler or runtime, but something subtle in user
code. When I catch myself thinking it might be the compiler, it's time
to look harder at my own code.

Now here is some rampant speculation about possible causes, OTHER THAN
the compiler. Some of these are supposed to be "impossible" yet I have
heard about or seen them crop up on rare occasions. I am tossing these
out casually to provoke thought about the possibilities, because I can't
say exactly where the real problem is.

* Overwriting an old copy of the data file without completely erasing it
first. (This one is in CLT archives.)

* Same file written simultaneously on two unit numbers, or by two
different processes.

* Using the same unit number in two different places when you assumed
they were different.

* Same unit number used intermittently in an obscure subroutine, a
subset of the previous idea.

* Using a unit number with a special meaning on the new system.

* REWIND not working as expected. I have been caught by this one, by my
own fault in misunderstanding REWIND. I still can't remember just how
it works in some cases. ;-) I think your code has it right, but it is
something to consider...

* Reading the file back before closing it after writing, which might
result in an unflushed buffer. This does not fit what you have shown
so far, but I never saw a CLOSE statement after writing.

* Fault in system disk buffering, which may be brought out by a failure
to close the file between write and read.

* Multithreading problem (a long shot).

* The old standby, out-of-bounds subscripting, which is a form of memory
corruption. Also its relative, storage length mismatches in COMMON.
The extensive use of COMMON in legacy code is fertile ground for length
mismatches and thwarting automatic bounds checking (see the sketch just
below).
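
To make that last bullet concrete, here is a minimal sketch of the
failure mode, with invented names and sizes. Declaring a named COMMON
block with different lengths in different program units is
nonconforming, but most compilers accept it silently:

      PROGRAM MAIN
C     Invented example: MAIN lays out /WORK/ as ten doubles followed
C     by an integer, and plants a value in the integer.
      DOUBLE PRECISION A(10)
      INTEGER MARKER
      COMMON /WORK/ A, MARKER
      MARKER = 414268
      CALL SUB
C     Prints 0: SUB's perfectly "legal" stores clobbered MARKER.
      PRINT *, 'MARKER is now', MARKER
      END

      SUBROUTINE SUB
C     Same block, declared one element longer here.  Every subscript
C     below is in bounds for THIS declaration, so bounds checking
C     reports nothing, yet A(11) overlays MAIN's MARKER.
      DOUBLE PRECISION A(11)
      COMMON /WORK/ A
      INTEGER I
      DO 10 I = 1, 11
         A(I) = 0.0D0
   10 CONTINUE
      END

The same mechanism can just as easily corrupt a count variable or a
buffer that feeds one of your WRITE statements.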

Here are some simple things to try; you have probably already done some
of them:

* Rather than REWIND, explicitly close the file after writing, and open
it again each time to read it.

* Manually delete the previous copy of the file, before running the
program.

* Add STATUS='NEW' to the open statement for writing.

* Give a single unique file name to LINKTP/LINKMC, and use it explicitly
with FILE=name in all open statements for both read and write.

* Try writing and reading the file in free-format text mode, rather than
binary. Just change the mode in the OPEN statements, and change WRITE
(unit) to WRITE (unit,*), and the same for the READs. No need to use
format statements if it's all ordinary numeric variables (see the
sketch just below this list).
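
For the text-mode item, here is a minimal sketch of the change; the
unit number, file name, and values are placeholders, not the ones from
the real code:

      PROGRAM TEXTIO
C     Write and read back one record in free-format text mode instead
C     of unformatted binary.  All names and values are placeholders.
      IMPLICIT NONE
      INTEGER LENIMC, LENRMC, NO, KK, NLITE
      INTEGER LI, LR, NO2, NKK, NLITE2
      LENIMC = 100
      LENRMC = 200
      NO     = 4
      KK     = 5
      NLITE  = 2
C     FORM='FORMATTED' plus WRITE (unit,*) replaces
C     FORM='UNFORMATTED' plus WRITE (unit).
      OPEN (25, FILE='tplink.txt', FORM='FORMATTED',
     1      STATUS='REPLACE')
      WRITE (25, *) LENIMC, LENRMC, NO, KK, NLITE
      CLOSE (25)
      OPEN (25, FILE='tplink.txt', FORM='FORMATTED', STATUS='OLD')
      READ (25, *) LI, LR, NO2, NKK, NLITE2
      CLOSE (25)
      PRINT *, LI, LR, NO2, NKK, NLITE2
      END

Expect a small loss of precision for REALs unless you widen the output
with an explicit format; for pure integers the round trip is exact.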

Subdivide the problem:

* Disable unrelated subroutines, if not difficult to do so.

* Close file and halt the program immediately after the SECOND write
statement. Examine the file with "od" and see if it is exactly what it
should be, two records with correct start and end LENGTHs.

* Close and halt immediately after the THIRD write statement, and check
for correctness, three records with correct start and end LENGTHs. If
the file is correct at this point, the fault is probably later in your
program. In that case, determining exactly WHERE the file changes will
be important.

* Remove items from the end of the output list in the THIRD write, halt
the program, and see if the PARTIAL third record and its LENGTHs are
correct. Repeat with more or fewer I/O items until you find the minimal
difference that creates the problem.

* Convert the WRITE statements into an isolated test case as follows.
Copy the write statements and variable declarations to an isolated test
program. Remove COMMON, and make all the variables just simple local
variables. Make a second copy of the WRITE statements in front of the
first, and convert them to READ statements. Then, open your old "good"
copy of the binary file and read its contents into the local variables.
Open a new file with a different and unique name, and write this file
using the original WRITE statements. Examine the newly written file for
defects. (A skeleton of such a harness follows just below.)
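
A skeleton of that harness, covering only the first two records: the
types of VERS, PREC, and KERR are guesses and must be copied from the
real declarations, and the file names are hypothetical.

      PROGRAM RNDTRP
C     Round-trip harness sketch: read the first records of a
C     known-good file into simple local variables, rewrite them to a
C     new file with the same WRITE statements, then compare the two
C     files with cmp or od.  Types of VERS, PREC, KERR are guesses.
      IMPLICIT NONE
      CHARACTER*16 VERS, PREC
      LOGICAL KERR
      INTEGER LENIMC, LENRMC, NO, KK, NLITE
      OPEN (35, FILE='tplink_working', FORM='UNFORMATTED',
     1      STATUS='OLD')
      READ (35) VERS, PREC, KERR
      READ (35) LENIMC, LENRMC, NO, KK, NLITE
      CLOSE (35)
      OPEN (36, FILE='tplink_copy', FORM='UNFORMATTED',
     1      STATUS='REPLACE')
      WRITE (36) VERS, PREC, KERR
      WRITE (36) LENIMC, LENRMC, NO, KK, NLITE
      CLOSE (36)
C     Now "cmp tplink_working tplink_copy" should report no
C     difference over these records.
      END

Extending it to the third record means declaring the arrays with their
true sizes and copying the big WRITE verbatim, as described above.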

This is likely to prove that, in an isolated program, your compiler is
capable of correctly writing this exact file with data identical to
what is desired. Therefore the corruption starts somewhere in the more
complex environment of your real program. That would then point to a
more subtle cause like the ones I mentioned before, unfortunately
harder to find.

Okay, these are just some ideas and easy stuff to try. Good luck and
write back often. ;-)

--Dave
From: Louis Krupp on
Dave Allured wrote:
<snip>

> BTW, if all of the data in the file is in units that are multiples of 4
> bytes (such as normal integers, floats, and doubles), then "od -Ad -i"
> dumps the file with decimal addresses and decimal 4-byte integers. Then
> it is much easier to review the LENGTH control words.

FWIW, I usually use od -t x1 to dump mystery files. You'll have to
reorder bytes to compute 16- and 32-bit quantities on little-endian
machines, but you won't have od trying to do it for you.

Louis
From: onateag on
On Mar 4, 8:36 pm, Steve Lionel <steve.lio...(a)intel.invalid> wrote:
> On 3/4/2010 6:09 PM, onateag wrote:
>
> > yes, I am using the Intel compiler.
>
> Ah, I see you provided the information in another post.
>
> Which version of Intel Fortran are you using?  Show the output of an
> "ifort -V" command.
>
> --
> Steve Lionel
> Developer Products Division
> Intel Corporation
> Nashua, NH
>
> For email address, replace "invalid" with "com"
>
> User communities for Intel Software Development Products
>    http://software.intel.com/en-us/forums/
> Intel Software Development Products Support
>    http://software.intel.com/sites/support/
> My Fortran blog
>    http://www.intel.com/software/drfortran

Steve, I was using ifort 11.0 installed on the server in my office,
but then I thought the compiler could have played a role in this, so,
as a test, I downloaded the latest version of the compiler (11.1) on
my personal home machine, and the results were no different:

gle6b(a)gle6b-desktop:~$ ifort -V
Intel(R) Fortran Compiler Professional for applications running on
IA-32, Version 11.1 Build 20091130 Package ID: l_cprof_p_11.1.064
From: Louis Krupp on
onateag wrote:
<snip>
> I tried to quickly put a STOP statement after the last WRITE, but this
> didn't change either.

Exactly what didn't change? If you still get a corrupt file, you might
be closer to a very short program that reproduces the problem. And that
could prove to be a big help.

Louis
From: onateag on
On Mar 5, 2:49 pm, Dave Allured <nos...(a)nospom.com> wrote:
<snip>

Dave,
my jaw simply dropped when I read your incredible list of
suggestions!

I suspected that all the COMMON blocks could have caused some memory
faults, because that had happened in the past (the unit numbers of
some output files would just take odd integer values, negative or
huge, and that resulted in runtime faults).
However, as I said in a previous post, besides spending 2 hours of my
life substituting all the COMMON blocks with MODULEs and compiling
again with all the bounds-checking flags on, without anything coming
up, I don't know what else to do.
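
For the record, each conversion looked more or less like this sketch
(the array sizes here are invented):

C     Before: a COMMON block repeated in several routines.
C
C         DOUBLE PRECISION RMCWRK(5000)
C         INTEGER IMCWRK(500)
C         COMMON /MCWORK/ RMCWRK, IMCWRK
C
C     After: one module, USE'd by the same routines in place of the
C     COMMON declaration.
      MODULE MCWORK
      IMPLICIT NONE
      DOUBLE PRECISION RMCWRK(5000)
      INTEGER IMCWRK(500)
      END MODULE MCWORK

With a single declaration in one place, length mismatches cannot occur
anymore, and the bounds-checking flags finally see the true array
extents.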

I have tried many of the quick tests you suggested, and they all
failed to nail the error. (I haven't gone through the more intense
"subdivide the problem" part yet; I will eventually have a look at it
if necessary.)

As anticipated, even before your suggestion along the same lines, I
got everything working by using free-format text linking files
instead of the unformatted WRITE/READ.
I performed some test cases with my application, and I am getting
consistent results, albeit with a tiny difference (negligible so far
for my purposes) compared to what I used to get with the unformatted
linking files. I figured that the difference could come from the
precision lost to the text formatting of the data, and I can live
with that.

A question that I'd like you guys to speculate on is: why would the
free-format file not be affected by the same writing problem as the
unformatted file? Even if the test cases give me some comfort, I have
some trust issues, because I cannot compare a "surely working" version
of the ASCII linking file with the new one (as I tried to do with the
unformatted ones).