From: Gaetano Esposito on
The problem I am about to describe used to occur with "random" (i.e.
multiple) combinations of compilers and machines/architectures. Because
of this uncertainty, I switched a while ago to static compilation on
one machine, with the program running on others. At first I was
satisfied with this solution, but now the problem is back, and I
have run out of ideas...

I must use a legacy code which interfaces with other codes through an
unformatted binary file. The legacy code writes a binary file, which
in turn is read by another program unit. This is really messy. I tried
to rewrite everything without the needless binary read/write, but the
legacy code is too massive to be dealt with cleanly, and I am running
out of time.

The piece of legacy code writing the binary is:

      WRITE (LINKTP) VERS, PREC, KERR
      ...
      WRITE (LINKTP) LENIMC, LENRMC, NO, KK, NLITE
      WRITE (LINKTP) PATM, (WT(K), EPS(K), SIG(K), DIP(K),
     1               POL(K), ZROT(K), NLIN(K), K=1,KK),
     2               ((COFLAM(N,K),N=1,NO),K=1,KK),
     4               ((COFETA(N,K),N=1,NO),K=1,KK),
     5               (((COFD(N,J,K),N=1,NO),J=1,KK),K=1,KK),
     6               (KTDIF(N),N=1,NLITE),
     7               (((COFTD(N,J,L),N=1,NO),J=1,KK),L=1,NLITE)

There is a corresponding READ statement in the subroutine supposed to
handle this:

      READ (LINKMC, ERR=999) VERS, PREC, KERR
      READ (LINKMC, ERR=999) LI, LR, NO, NKK, NLITE
      READ (LINKMC) PATMOS, (RMCWRK(NWT+N-1),
     1   RMCWRK(NEPS+N-1), RMCWRK(NSIG+N-1),
     2   RMCWRK(NDIP+N-1), RMCWRK(NPOL+N-1), RMCWRK(NZROT+N-1),
     3   IMCWRK(INLIN+N-1), N=1,NKK),
     4   (RMCWRK(NLAM+N-1), N=1,NK), (RMCWRK(NETA+N-1), N=1,NK),
     5   (RMCWRK(NDIF+N-1), N=1,NK2),
     6   (IMCWRK(IKTDIF+N-1), N=1,NLITE), (RMCWRK(NTDIF+N-1), N=1,NKT)


For a reason that is beyond my comprehension, the file created by the
WRITE statement cannot be read by the READ statement.

I investigated this problem for a long time, comparing a working
version of the unformatted file (which I had kept from another
computation) to the ones which are now not working (by "working" I
mean readable by the READ statement).
It turns out that the two files are **almost** (of course) identical.
Since the output of the Unix command "cmp" is unreadable to me, I dumped
the contents of both binary files (working and non-working) using
"od -x > linking_file" and ran "diff" on the dumps. I get just one
differing line:
$ diff tplink_working_x tplink_x
5c5
< 0000100 0002 0000 0014 0000 523c 0006 0000 0000
---
> 0000100 0002 0000 0014 0000 0014 0000 0000 0000

Now, I found out that the problem with the non-working binary is in the
third record (the third READ). Because I am not an expert in
Fortran I/O, I manually added variables one at a time to the READ of
that record until I got the error (luckily it happened at the 3rd value!).

Code:
"
[first two READ statements OK]
....
read(l2) data1, data2, data3
....
"

Runtime error message:
"
forrtl: severe (67): input statement requires too much data [...]
"
Therefore, I hypothesize that the problem is in the WRITE statement.

My questions are:
_ What is causing this erratic behavior in WRITE? Is there a way to
isolate it?
_ What could cause a forced end of record instruction to appear?
_ I understand that erratic behavior is sometimes associated with
memory faults. Can this be the case? If so, note that none of the
bounds-checking flags report errors. Moreover, there are several COMMON
blocks in the legacy code.

I appreciate your always incredibly competent and spot on comments.
From: Richard Maine on
Gaetano Esposito <gaetano.esposito(a)gmail.com> wrote:

> The piece of legacy code writing the binary is:
>
> WRITE (LINKTP) VERS, PREC, KERR
> ...
> WRITE (LINKTP) LENIMC, LENRMC, NO, KK, NLITE
> WRITE (LINKTP) PATM, (WT(K), EPS(K), SIG(K), DIP(K),
> 1 POL(K), ZROT(K), NLIN(K), K=1,KK),
> 2 ((COFLAM(N,K),N=1,NO),K=1,KK),
> 4 ((COFETA(N,K),N=1,NO),K=1,KK),
> 5 (((COFD(N,J,K),N=1,NO),J=1,KK),K=1,KK),
> 6 (KTDIF(N),N=1,NLITE),
> 7 (((COFTD(N,J,L),N=1,NO),J=1,KK),L=1,NLITE)
>
> There is a correspondent READ statement in the subroutine supposed to
> handle this:
>
> READ (LINKMC, ERR=999) VERS, PREC, KERR
> READ (LINKMC, ERR=999) LI, LR, NO, NKK, NLITE
> READ (LINKMC) PATMOS, (RMCWRK(NWT+N-1),
> 1 RMCWRK(NEPS+N-1), RMCWRK(NSIG+N-1),
> 2 RMCWRK(NDIP+N-1), RMCWRK(NPOL+N-1), RMCWRK(NZROT+N-1),
> 3 IMCWRK(INLIN+N-1), N=1,NKK),
> 4 (RMCWRK(NLAM+N-1), N=1,NK), (RMCWRK(NETA+N-1), N=1,NK),
> 5 (RMCWRK(NDIF+N-1), N=1,NK2),
> 6 (IMCWRK(IKTDIF+N-1), N=1,NLITE), (RMCWRK(NTDIF+N-1), N=1,NKT)
>
>
> For a reason that is behind my comprehension, the file created in the
> WRITE statement, cannot be read by the READ statement.

Looks like either a typo or a simple confusion to me.

First, note one of the most common FAQs here. Declarations matter. A
lot. You haven't shown them. There is no way that anyone can be sure
that the above code is correct without looking at the declarations. We
might be able to find things wrong with it, but no way that we can
guarantee it is correct. In particular, all the data types better match
exactly in the READ and WRITE code. But that aside...

Just go through the above READ, item at a time and compare it to the
WRITE. It doesn't take long to get to a discrepancy. The first 2 reads
look to correspond (assuming the right data types). But look at the 3rd
one more closely. Let me work through it for you.

The read of patmos corresponds to the write of patm. Ok.

Then you read 7 arrays in an implied DO loop with an index from 1 to
nkk. That's 7*nkk values, where nkk was read from record 2. That looks
to correspond with the 7*kk values written (and the nkk in the read
corresponds to the kk from the write). No way I can check the array
dimensions with the data given, but I'll ignore that. Ok.

Next is a read with an implied DO loop from 1 to NK. And NK is.... what?
There is no hint of where this came from or what value it might have.
Maybe it is a typo for NO, or NKK, or ???

Anyway, you need to go through and compare the read and write piece by
piece like that. You ought to be reading the same number of elements
that you wrote (or at least no more). The above code sure doesn't look
like it does that. If it does, then it depends on other code not shown
to define appropriate values for NK (and NK2 and NKT, which also seemed
to pop up from nowhere).
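
To make that piece-by-piece comparison concrete, here is a minimal sketch that counts the items in the third WRITE list and the items the third READ list would ask for. The sizes are hypothetical values chosen only for illustration, and the relations NK = NO*KK, NK2 = NO*KK*KK and NKT = NO*KK*NLITE are an assumption about what the unseen code computes; they are simply the bounds for which the two counts agree.

```fortran
program count_items
  implicit none
  ! Hypothetical sizes, for illustration only (not taken from the real run).
  integer, parameter :: no = 4, kk = 3, nlite = 2
  integer :: nk, nk2, nkt, nwritten, nread

  ! Items produced by the third WRITE list, term by term:
  ! PATM, 7 values per K, COFLAM, COFETA, COFD, KTDIF, COFTD.
  nwritten = 1 + 7*kk + no*kk + no*kk + no*kk*kk + nlite + no*kk*nlite

  ! Assumed bounds: what NK, NK2 and NKT must be for the READ to match.
  nk  = no*kk
  nk2 = no*kk*kk
  nkt = no*kk*nlite

  ! Items requested by the third READ list, term by term.
  nread = 1 + 7*kk + nk + nk + nk2 + nlite + nkt

  print '(a,i0)', 'items written: ', nwritten
  print '(a,i0)', 'items read:    ', nread
  if (nread > nwritten) print *, 'READ asks for more than the record holds'
end program count_items
```

If NK, NK2 or NKT come out larger than these products in the real program, the READ requests more data than the record contains.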

> [first two READ statements OK]
> ...
> read(l2) data1, data2, data3

No declarations again. Are these arrays? Yes, it matters. A lot.

Also, if KK happens to be 0 in the writing code, there might not be 3
values written.

> _ What could cause a forced end of record instruction to appear?

Nothing. Not a constructive avenue to pursue.

--
Richard Maine | Good judgment comes from experience;
email: last name at domain . net | experience comes from bad judgment.
domain: summertriangle | -- Mark Twain
From: onateag on
Richard,

Thanks for the reply. I understand that often the problem is simpler
than it looks, but before posting I had checked every declaration and
other basic stuff (I read a lot of your other posts), and all of them are
fine. All the integers you see popping up out of nowhere are
calculated beforehand, and they are definitely not the problem.

Because of that, my focus now is to understand why the old linking
file works and the new one does not. If I understand what is wrong in
the non-working linking file, I can go back to the source of the
error.
For this purpose, I thought it would be useful, as you
correctly suggested, to check the READ item by item, comparing the two
linking files. Keep in mind that the two linking files were created by
**exactly** the same legacy code that I posted previously, and that is
the puzzling part to me.
I am going to copy all the code I wrote for checking, and the output;
I am sure it will answer most of your doubts.
Code:

program test

implicit none
integer, parameter :: l1=40, l2=41

integer :: ierr

double precision :: data1, data2, data3
integer, dimension(5) :: d1, d2
double precision, dimension(100) :: dd1, dd2

open(l1,form='unformatted',file='tplink_working')
open(l2,form='unformatted',file='tplink')

print*,'first record, l1'
READ (l1) data1, data2, data3
print*, data1, data2, data3

print*,'first record, l2'
READ (l2) data1, data2, data3
print*, data1, data2, data3

print*,'second record, l1'
read(l1) d1
print*,d1

print*,'second record, l2'
read(l2) d2
print*,d2

print*,'third record, l1'
! 100, just for the sake of showing that it goes on reading
read(l1) dd1
print*,dd1

! ATTENTION : If I stop at "data1, data2", no error!
print*,'third record, l2'
read(l2) data1, data2, data3
print*,data1, data2, data3

end program test

Output:
$ ./test.exe
first record, l1
6.013470019174502E-154 6.013470016999068E-154
6.067619314340693E-154
first record, l2
6.013470019174502E-154 6.013470016999068E-154
6.067619314340693E-154
second record, l1
446 237984 4 111 2
second record, l2
446 237984 4 111 2
third record, l1
1013250.00000000 39.9480018615723
136.500000000000
3.33000000000000
[...]
[ lots of data ...]
[...]
0.000000000000000E+000 0.000000000000000E+000
3.785766995733680E-270
-9.255965345320369E+061
third record, l2
forrtl: severe (67): input statement requires too much data, unit 41,
file /gluster/bigtmp/gle6b/Grid-Test/test-tplink/tplink


Something I want to point out is that even if I replace the last
read and print with:
print*,'third record, l2'
read(l2) data1, data2
print*,data1, data2
print*,'possible extra record?'
read(l2) data3
print*,data3

I get:
[same output as before, up to the error]
third record, l2
1013250.00000000 39.9480018615723
possible extra record?
forrtl: severe (39): error during read, unit 41, file /gluster/bigtmp/
gle6b/Grid-Test/test-tplink/tplink


So, I have two unformatted files of the same size (I know it doesn't matter,
but it underlines that I am writing neither junk nor less data
than expected), with a small difference, created by the **same** code. But
there is something I am missing.



From: glen herrmannsfeldt on
Gaetano Esposito <gaetano.esposito(a)gmail.com> wrote:

> The problem I am going to detail used to occur with "random" (i.e.
> multiple) combination of compilers and machines/architectures. Because
> of this uncertainty, I switched a while ago to static compilation on
> one machine with the program running on others. At first I was
> satisfied with this solution, but now the problem is back here, and I
> ran out of ideas...

(snip)

> I investigated this problem for a long time comparing a working
> version of the unformatted file (which I had kept from other
> computation) to the ones which now are not working (by "working" I
> mean it is readable by the READ statement).
> It turns out that the two files are **almost** (of course) identical.
> If the unix command "cmp" is unreadable for me, if I dump the binary
> files content using "od -x > linking_file" for both (working and not-
> working) versions and "diff" them, I get just one different line:
> $ diff tplink_working_x tplink_x
> 5c5
> < 0000100 0002 0000 0014 0000 523c 0006 0000 0000
> ---
>> 0000100 0002 0000 0014 0000 0014 0000 0000 0000

Post the first 20 lines of od -x for each file.

Also, post all statements between the second and third
READ statement in the real program.

-- glen
From: Richard Maine on
onateag <gaetano.esposito(a)gmail.com> wrote:

> thanks for the reply. I understand that often the problem is simpler
> than it looks, but before posting, I had checked every declaration and
> other basic stuff (I read a lot of your other posts), all of them are
> fine. All the integers you see popping up out of nowhere, are
> calculated beforehand, and they are definitively not the problem.

I can only debug what I see. If I'm just assured that everything has
been checked and is fine, even though I don't see any of it, that
doesn't leave me much to go on. When someone assures me that the parts
they didn't show me are all fine, that often tends to make me more
suspicious of those parts, rather than less so. I don't see anything
else in what was posted that I can help with.

From the first post:

> $ diff tplink_working_x tplink_x
> 5c5
> < 0000100 0002 0000 0014 0000 523c 0006 0000 0000
> ---
> > 0000100 0002 0000 0014 0000 0014 0000 0000 0000

I am slightly puzzled in that I can't quite match the single octal dump
line shown with the test program. It looks like it might plausibly have
the end of the second record and the beginning of a third. The 2 0 14 0
could plausibly be the 2 at the end of the second record, followed by a
trailing record size (14 hex = 20 dec, which would be right). But it
doesn't look aligned right, unless the first record is longer than the
read suggests... which might be possible; those are funny looking values
in the first record, maybe Hollerith? The first record ought to have
taken 32 bytes (3*8 for the data, and 2*4 for the record header and
trailers, assuming the most common 32-bit structures). Then the second
should have taken 28 bytes (5*4 data plus 2*4 header/trailer). But this
is showing what looks like the end of the second record after 72 bytes,
which seems 12 bytes too far in, if I got all the arithmetic straight.

As Glen says, maybe a full hex dump of the first bit of the file might
help more; the one line in isolation isn't enough. I'm still not sure
that would tell me enough, but it might help some.
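
As an alternative to reading a hex dump by eye, the record markers can be listed directly. The sketch below assumes the common layout described above (a 4-byte length before and after each record, in native byte order), which is a compiler convention and not something the Fortran standard guarantees. The file name demo_link.bin is hypothetical; the program first writes a small two-record file so that it runs on its own, then reopens it with stream access and steps from marker to marker.

```fortran
program walk_records
  use, intrinsic :: iso_fortran_env, only: int32, int64
  implicit none
  character(len=*), parameter :: fname = 'demo_link.bin'  ! hypothetical name
  double precision :: vers(3)
  integer(int32)   :: sizes(5), head, tail
  integer(int64)   :: pos, fsize
  integer          :: u, ios, irec

  ! Write a small sequential unformatted file: 3 reals, then 5 integers.
  vers  = [1.0d0, 2.0d0, 3.0d0]
  sizes = [446, 237984, 4, 111, 2]
  open(newunit=u, file=fname, form='unformatted', status='replace')
  write(u) vers
  write(u) sizes
  close(u)

  ! Reopen as a raw byte stream and step from marker to marker.
  ! Assumes 4-byte record markers; other compilers or options may differ.
  open(newunit=u, file=fname, form='unformatted', access='stream', &
       status='old', action='read')
  inquire(unit=u, size=fsize)
  pos  = 1
  irec = 0
  do while (pos + 7 <= fsize)
    read(u, pos=pos, iostat=ios) head
    if (ios /= 0) exit
    if (head < 0 .or. pos + 7 + head > fsize) then
      print '(a,i0,a,i0)', 'implausible record length ', head, &
            ' at byte offset ', pos - 1
      exit
    end if
    read(u, pos=pos + 4 + head, iostat=ios) tail
    if (ios /= 0) exit
    irec = irec + 1
    print '(a,i0,a,i0,a,i0,a,i0)', 'record ', irec, ': offset ', pos - 1, &
          ', header ', head, ', trailer ', tail
    pos = pos + 8 + head
  end do
  close(u)
end program walk_records
```

To inspect a real file, drop the writing part and point fname at it (for example tplink). A header that disagrees with its trailer, or a length that runs past the end of the file, marks the record where the structure breaks.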

If the above is the end of the second record, then it looks like one of
the files has a longish 3rd record (6523c hex = 414268 decimal bytes),
while the other has a 3rd record with only 20 bytes of data. That would
reasonably well match your observed ability to read 2 8-byte values from
it, but fail reading a third. Why the file would be that way, I have no
data to see.
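
As a cross-check on that 414268 figure, here is a minimal sketch that adds up the bytes the third WRITE should produce. It assumes 8-byte reals and 4-byte integers (the declarations were never posted; NLIN and KTDIF are taken to be integers because the READ stores them in IMCWRK), and it takes NO=4, KK=111, NLITE=2 from the second record printed by the test program. Under those assumptions the total comes to 414268 bytes, the length found in the working file.

```fortran
program third_record_size
  implicit none
  ! Values printed for the second record by the test program.
  integer, parameter :: no = 4, kk = 111, nlite = 2
  ! Assumed storage sizes; the real declarations were not posted.
  integer, parameter :: rbytes = 8, ibytes = 4
  integer :: total

  total = rbytes                          ! PATM
  total = total + kk*(6*rbytes + ibytes)  ! WT, EPS, SIG, DIP, POL, ZROT, NLIN
  total = total + 2*no*kk*rbytes          ! COFLAM and COFETA
  total = total + no*kk*kk*rbytes         ! COFD
  total = total + nlite*ibytes            ! KTDIF
  total = total + no*kk*nlite*rbytes      ! COFTD

  ! Print in decimal and hex for comparison with the od -x output.
  print '(a,i0,a,z0)', 'expected data bytes: ', total, '  hex: ', total
end program third_record_size
```

So the working file's third record has the size the WRITE list predicts, which suggests that it is the 20-byte header in the other file that is wrong, not the amount of data that follows it.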

--
Richard Maine | Good judgment comes from experience;
email: last name at domain . net | experience comes from bad judgment.
domain: summertriangle | -- Mark Twain