From: Craig Powers
monir wrote:
>
> 2) Here's again an abbreviated sample code for easy reference:
> (F77, g95)

The problem with the abbreviated sample is that it's so abbreviated, it
cuts out the problem (as has already been said multiple times by
multiple people).

I don't disagree with you that 22k-ish lines is not practical to post.
However, there are a couple of things you can and *should* do:
* Try running it with absolutely every check the compiler offers turned
on. That's spelled "-check all" in ifort. Other compilers may offer a
similar ability with a different spelling, or they may require you to
specify multiple options; RTFM. (I see below you've tried to do this
and it didn't get you anywhere... in that case, see next.)
* Try to cut it down to a manageable size. In the process, maybe you'll
discover what the problem is yourself. If it goes away when you take
out a particular piece, that alone gives you an avenue to pursue in
trying to find your problem. If you succeed in producing a manageable
size example, well, now you've got something to post.
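
While cutting the code down, it can help to know where the NaN first appears. Here is a minimal sketch of that idea, not taken from the original program: the names nan_trace and check_nan are made up for illustration, and it assumes IEEE arithmetic without floating-point trapping and without aggressive options such as -ffast-math.

```fortran
program nan_trace
  implicit none
  real :: zero, x

  zero = 0.0
  x = 1.0
  call check_nan('after init', x)

  ! Deliberately produce a NaN at run time (assumes non-trapping IEEE arithmetic).
  x = zero / zero
  call check_nan('after divide', x)

contains

  ! Hypothetical helper: reports whether v is a NaN at the labelled point.
  subroutine check_nan(label, v)
    character(len=*), intent(in) :: label
    real, intent(in) :: v
    ! A NaN is the only value that compares unequal to itself.
    if (v /= v) then
      print *, 'NaN detected: ', label
    else
      print *, 'ok: ', label
    end if
  end subroutine check_nan

end program nan_trace
```

Dropping calls like these between the major steps of the real code, and moving them closer together each run, is one way to bisect towards the statement that first produces the NaN.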


> 3) There appears to be some confusion on when the (current) program
> correctly works and when it doesn't.
> Here's a summary for clarification:
> (ref is to a SINGLE statement in the above abbreviated sample code)
>
> a) with "! pause" and "!! implicit none" NOT activated:
> .......................... program returns x = NaN
> c) with "! pause" NOT activated and "!! implicit none" Activated :
> .......................... program returns x = -1.0676971 (correct)

This is rather interesting. I don't think adding IMPLICIT NONE should
change the meaning of a program that continues to compile successfully.
Most compilers have an option that lets you produce assembly output;
have you tried comparing the results for the routine in question with
and without IMPLICIT NONE?
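
For reference, this is what implicit typing does when IMPLICIT NONE is absent. The sketch below is illustrative only (the names kount and total are invented); it shows the default rule that names beginning with I through N are INTEGER and everything else is REAL.

```fortran
program implicit_demo
  ! No IMPLICIT NONE on purpose: names starting with I-N default to INTEGER,
  ! all other names default to REAL.
  kount = 2.7      ! implicitly INTEGER, so the value is truncated to 2
  total = 2.7      ! implicitly REAL, keeps the fractional part
  print *, 'kount = ', kount
  print *, 'total = ', total
end program implicit_demo
```

With IMPLICIT NONE added, this program would no longer compile until kount and total were declared, which is the usual way the statement exposes a declaration problem.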

> 5) Some have indicated that mismatched arguments could have caused the
> error.
> A very valid point, and I've been looking at this for some time now.
> But think about it for a moment. If there are mismatched arguments,
> how would/could inserting a "Pause" statement in one of the routines
> or just adding "implicit none" in another (with no additional
> declarations) correct the mismatch and force the algorithms to work
> "perfectly" producing the correct results throughout ??
> This is the other part of the mystery!

Heisenbugs happen when the effect of the bug is to write to and/or read
from memory that isn't supposed to be written to and/or read from. In
that case, it becomes important how variables are laid out in memory,
what is in the memory before it is read from, and so on. Adding a
statement may cause the code generator to change something which isn't
visible to you and changes the manifestation of the bug.
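
To make the memory-layout point concrete, here is a small sketch that is not from the original code. It uses a COMMON block so that the layout is fixed by the language (storage association) and the effect is repeatable; the names blk, layout_demo and fill_block are hypothetical. In a real out-of-bounds bug the neighbour that gets overwritten depends on whatever the compiler happened to place next to the array.

```fortran
program layout_demo
  implicit none
  real :: a(3), b
  common /blk/ a, b          ! a(1:3) is followed directly by b in storage

  a = 0.0
  b = 1.0
  call fill_block            ! hypothetical routine, views the block as one array
  print *, 'b after call = ', b
end program layout_demo

subroutine fill_block
  implicit none
  real :: c(4)
  integer :: i
  common /blk/ c             ! the same four storage units, seen as c(1:4)

  do i = 1, 4                ! the 4th element occupies the caller's b
    c(i) = 99.0
  end do
end subroutine fill_block
```

Here b comes back as 99.0 even though the main program never assigns it after the initial 1.0. An accidental overrun does the same thing, only to a victim that can change whenever a statement is added or removed.
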
From: dpb
monir wrote:
....

> 2) Here's again an abbreviated sample code for easy reference:
> (F77, g95)
>
> PROGRAM main
> ....................
> call dCpZeros()
> ......................
> End main
> ------------------------------------
> SUBROUTINE dCpZeros()
> .....................
> do i=1, 9
> do j=1, 10
> do k=1, 30
> .....................
....
> ..................
> Return
> End Subroutine dCpZeros
> ------------------------------------
> SUBROUTINE Polin2(w1, w2, w3, w4, val)
> !! implicit none
> ....................

Which contains nary a single declaration, making it totally useless for
anybody trying to see what argument, type or dimension mismatch might be
there; one of those is, with at least moderately high likelihood, the
underlying culprit.... :(

From: glen herrmannsfeldt
monir <monirg(a)mondenet.com> wrote:
(big snip, including points 1 through 4)

> 5) Some have indicated that mismatched arguments could have caused the
> error.
> A very valid point, and I've been looking at this for some time now.
> But think about it for a moment. If there are mismatched arguments,
> how would/could inserting a "Pause" statement in one of the routines
> or just adding "implicit none" in another (with no additional
> declarations) correct the mismatch and force the algorithms to work
> "perfectly" producing the correct results throughout ??
> This is the other part of the mystery!

(snip)

Unfortunately, fairly easily.

Reminds me of a PL/I program that I wrote a loooong time ago,
which used CONTROLLED variables. (PL/I equivalent to ALLOCATABLE.)

The program was working fine until I changed something and then
it didn't work right anymore. I don't remember how I tracked it
down, but the result was that it was deallocating and later
reallocating arrays of the same size and worked as long as they were
reallocated in the same place! So I was lucky for a while...

Argument mismatch can easily depend on values on the stack not
changing at appropriate points. PAUSE likely does a subroutine
call to the routine implementing the pause operation, which
involves data on the stack.
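
One way to let the compiler catch argument mismatches is to give it an explicit interface, for instance by putting the routine in a module. The following is only a sketch under that assumption: the module checked_routines and the routine scale_val are invented for illustration and are not part of the posted code.

```fortran
module checked_routines
  implicit none
contains
  ! Hypothetical routine: both arguments are DOUBLE PRECISION.
  subroutine scale_val(x, val)
    double precision, intent(in)  :: x
    double precision, intent(out) :: val
    val = 2.0d0 * x
  end subroutine scale_val
end module checked_routines

program mismatch_demo
  use checked_routines
  implicit none
  real :: xs
  double precision :: x, val

  xs = 1.5
  ! Passing xs directly would be rejected at compile time, because the
  ! module gives the compiler the full interface of scale_val.
  x = dble(xs)               ! convert explicitly instead
  call scale_val(x, val)
  print *, 'val = ', val
end program mismatch_demo
```

With an F77-style external routine and no interface, the same mistake (a REAL actual argument for a DOUBLE PRECISION dummy) may compile silently, and the routine then reads bytes that were never part of the argument.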

-- glen
From: aerogeek
On Apr 1, 12:08 am, Craig Powers <craig.pow...(a)invalid.invalid> wrote:
> monir wrote:
>
> > 2) Here's again an abbreviated sample code for easy reference:
> >    (F77, g95)
>
> The problem with the abbreviated sample is that it's so abbreviated, it
> cuts out the problem (as has already been said multiple times by
> multiple people).
>
> I don't disagree with you that 22k-ish lines is not practical to post.
> However, there are a couple of things you can and *should* do:
> * Try running it with absolutely every check the compiler offers turned
> on.  That's spelled "-check all" in ifort.  Other compilers may offer a
> similar ability with a different spelling, or they may require you to
> specify multiple options; RTFM.  (I see below you've tried to do this
> and it didn't get you anywhere... in that case, see next.)
> * Try to cut it down to a manageable size.  In the process, maybe you'll
> discover what the problem is yourself.  If it goes away when you take
> out a particular piece, that alone gives you an avenue to pursue in
> trying to find your problem.  If you succeed in producing a manageable
> size example, well, now you've got something to post.
>
> > 3) There appears to be some confusion on when the (current) program
> > correctly works and when it doesn't.
> > Here's a summary for clarification:
> > (ref is to a SINGLE statement in the above abbreviated sample code)
>
> > a) with "! pause" and "!! implicit none"  NOT activated:
> > .......................... program returns x = NaN
> > c) with "! pause" NOT activated and "!! implicit none"  Activated :
> > .......................... program returns x = -1.0676971 (correct)
>
> This is rather interesting.  I don't think adding IMPLICIT NONE should
> change the meaning of a program that continues to compile successfully.
>   Most compilers have an option that lets you produce assembly output;
> have you tried comparing the results for the routine in question with
> and without IMPLICIT NONE?
>
> > 5) Some have indicated that mismatched arguments could have caused the
> > error.
> > A very valid point, and I've been looking at this for some time now.
> > But think about it for a moment.  If there are mismatched arguments,
> > how would/could inserting a "Pause" statement in one of the routines
> > or just adding "implicit none" in another (with no additional
> > declarations) correct the mismatch and force the algorithms to work
> > "perfectly" producing the correct results throughout ??
> > This is the other part of the mystery!
>
> Heisenbugs happen when the effect of the bug is to write to and/or read
> from memory that isn't supposed to be written to and/or read from.  In
> that case, it becomes important how variables are laid out in memory,
> what is in the memory before it is read from, and so on.  Adding a
> statement may cause the code generator to change something which isn't
> visible to you and changes the manifestation of the bug.

I had this very specific problem. A non-interfering statement, like
PAUSE in your case, was causing the same kind of problem in my code.

The code ran perfectly well on a Windows system, but I saw this
problem once I tried the program on a Linux system.

So, if possible, try compiling and running your program on a
different system.

> Heisenbugs happen when the effect of the bug is to write to and/or read
> from memory that isn't supposed to be written to and/or read from. In
> that case, it becomes important how variables are laid out in memory,
> what is in the memory before it is read from, and so on. Adding a
> statement may cause the code generator to change something which isn't
> visible to you and changes the manifestation of the bug.

For me the problem had something to do with incorrect array bounds,
which was not apparent and didn't come to notice until I used dbx,
the debugger.

So get a debugger and step through the code under the conditions where
it's failing. I am sure you will get to the bottom of the
problem.

cheers
From: monir
On Apr 1, 1:39 am, aerogeek <sukhbinder.si...(a)gmail.com> wrote:
> On Apr 1, 12:08 am, Craig Powers <craig.pow...(a)invalid.invalid> wrote:

> > monir wrote:

> > > 2) Here's again an abbreviated sample code for easy reference:
> > > (F77, g95)

> > The problem with the abbreviated sample is that it's so abbreviated, it
> > cuts out the problem.

> > I don't disagree with you that 22k-ish lines is not practical to post.
> > However, there are a couple of things you can and *should* do:
> > * Try running it with absolutely every check the compiler offers turned
> > on. (I see below you've tried to do this
> > and it didn't get you anywhere... in that case, see next.)
> > * Try to cut it down to a manageable size. In the process, maybe you'll
> > discover what the problem is yourself. If it goes away when you take
> > out a particular piece, that alone gives you an avenue to pursue in
> > trying to find your problem. If you succeed in producing a manageable
> > size example, well, now you've got something to post.

> > > monir wrote:
> > > 3) There appears to be some confusion on when the (current) program
> > > correctly works and when it doesn't.
> > > Here's a summary for clarification:
> > > (ref is to a SINGLE statement in the above abbreviated sample code)

> > > a) with "! pause" and "!! implicit none" NOT activated:
> > > .......................... program returns x = NaN
> > > c) with "! pause" NOT activated and "!! implicit none" Activated :
> > > .......................... program returns x = -1.0676971 (correct)

> > This is rather interesting. I don't think adding IMPLICIT NONE should
> > change the meaning of a program that continues to compile successfully.
> > Most compilers have an option that lets you produce assembly output;
> > have you tried comparing the results for the routine in question with
> > and without IMPLICIT NONE?

......YES, I have, many times. ALL routines work perfectly when tested
in isolation.
......I got the assembly output (~ 2,000 pages), but I'm not sure what to
look for.
......For example, at the top it displays:
.........................................
.comm _abscisae_, 36000 # 36000
.comm _crt_, 496 # 484
.comm _d2cp_, 144000 # 144000
.comm _d9mach_, 160 # 152
.........................................

......ARE the two numbers in each pair above (bytes?) supposed to be the
same, or do they refer to something else ??

> > > monir wrote:
> > > 8) Based on my rather limited knowledge of Fortran, here's a thought
> > > for you experts to critique.
> > > As indicated earlier, the code (work-in-progress, ~ 22,000 lines and ~ 80
> > > routines) is mostly in F77, but with some limited patches of F90, e.g.;
> > > use of unlabeled loops, vectors & matrices & array operations, some new
> > > intrinsic functions, one Contains and one explicit Interface, but no
> > > modules, no dynamic arrays, no defined data types, no Pointers, no ...
> > > I've always had some suspicions about such programming practice, even
> > > though the g95 compiler never complained. But it seems reasonable to
> > > expect at some point (depending on the complexity of the code and the
> > > extent of the mix) that there would be a conflict that wouldn't be
> > > detected/resolved by the compiler, leading to possible confusion or
> > > misinterpretation or memory disruption or whatever.
> > > The "g95" compiler, or any other comparable compiler for that matter,
> > > can't possibly detect and resolve each and every conflict that might arise
> > > from a mixed F77+F90 programming. Correct ??
> > > Just a thought! ... you don't have to take it seriously if you don't
> > > want to!

> > > 5) Some have indicated that mismatched arguments could have caused the
> > > error.
> > > A very valid point, and I've been looking at this for some time now.
> > > But think about it for a moment. If there are mismatched arguments,
> > > how would/could inserting a "Pause" statement in one of the routines
> > > or just adding "implicit none" in another (with no additional
> > > declarations) correct the mismatch and force the algorithms to work
> > > "perfectly" producing the correct results throughout ??
> > > This is the other part of the mystery!

> aerogeek wrote:
> I had this very specific problem. A non interfering statement like in
> your case pause, was causing the same problem for my code.

> This code was running perfectly well in windows system but i saw this
> problem once i tried the program on a linux system.

> So if possible can you try compiling and running your program on a
> different system. If possible.

..... UNFORTUNATELY, I don't have access to other systems.

> For me the problem had something to do with incorrect array bounds,
> which was not apparent and didn't come to notice until I used dbx,
> the debugger.

> So get a debugger and run through the code via a debugger for the
> conditions it's failing. I am sure you will get to the bottom of the
> problem.

$$ ===================== $$

NOT being able so far to trap the problem or the code violation, if
any, leaves me with a couple of options:

1) POST the entire F77 code
as a zip file, including the input files, for you to look at.
It is a good idea, but with no documentation it would be extremely
difficult even for you experts to follow the program logic.
And reducing it to a meaningful size for posting while ensuring it
still generates the NaN error is not an easy task, and would still be
considered as an (extended) abbreviated version, and I might in the
process cut out the source of the problem!

2) USE a modern debugger.
In the past I used the MS Fortran metacommand "$DEBUG:" for debugging
(I believe that's what it was called!) by inserting it in the source
code (it could appear multiple times). It was part of the MS Fortran
compiler.

What modern Fortran debugger would you recommend (Win XP OS) ??
Is there a connection between the Fortran compiler g95 and the
debugger, or does the debugger work independently ?
Does it matter if the code is F77 or F90 or F77+F90 ??
(I hope it is free!)

3) BACK to the problem in hand.
The general consensus among the responders is that the problem could
be attributed to:
a- declaration issues
b- arrays out of bounds
c- mismatched arguments
d- data on the stack unexpectedly or unintentionally moved around as a
result of a non-interfering statement such as "PAUSE" or "IMPLICIT
NONE"
e- any combinations of the above
f- none of the above!
I'm reasonably confident, after so much re-checking and testing, that
it is NOT a- , b- or c- above, but I could be wrong!

4) I suggested earlier:
>... it seems reasonable to expect at some point
>(depending on the complexity of the code and the
>extent of the mix) that there would be a conflict that wouldn't be
>detected/resolved by the compiler, leading to possible confusion or
>misinterpretation or memory disruption or whatever.
>The "g95" compiler, or any other comparable compiler for that matter,
>can't possibly detect and resolve each and every conflict that might arise
>from a mixed F77+F90 programming.
>Just a thought! ... you don't have to take it seriously if you don't want to!

Richard Maine and others responded:
>>... I consider it incorrect to even label it as mixed f77+f90.
>>Almost all of f77 is also part of f95. The very few exceptions are
>>matters of mostly academic interest, as all f95 compilers do them anyway
>>and they are *NOT* things that are prone to obscure interactions. So
>>what you have is just f95 code.

5) OK. Here is my latest attempt:
a- I took a version of the offending code Test1.FOR, and made sure there
was NO "PAUSE" in Sub dCpzeros() and NO "IMPLICIT NONE" in Sub Polin2()
b- re-compiled and ran the program
....got (as expected) ..... x = NaN
c- renamed the source code (self-contained single file) as Test1F.F90
....The MinGW-g95 manual states: " ... with F90 name extension, the
source code is pre-processed with the C preprocessor."
Not knowing exactly what that means, I took it to imply that something
is done by the g95 compiler when using .F90 extension that otherwise
is NOT done (with .FOR).
Let me try it.

d- changed the F77 style to F90 style throughout, namely:
....replaced "c" in col 1 by "!"
....added "&" for continuation lines and removed char from col 6
....deleted blanks between digits (initially for easy reading/editing
long numbers)
......e.g.; Data GaussWg ( 7) / 0.0910282619 8296364981 1497220702 892 d0 /
...........(which is allowed in *.FOR, but gave a DATA syntax error in *.F90)
.......... was changed to:
...........Data GaussWg ( 7) / 0.091028261982963649811497220702892d0 /
That was all. Nothing else was changed.

e- compiled:
....>g95 -fbounds-check -ftrace=full -o Test1F Test1F.F90
and ran.
PROGRAM Works Fine!!!! returning:
............ x = -1.0676971 (correct)
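
For readers following step d-, the free-form conventions involved look roughly like this. It is only a sketch: the array size of 7 and the program name freeform_demo are assumptions, and only the DATA value is taken from the post.

```fortran
program freeform_demo
  implicit none
  ! Assumed size; the real GaussWg array is declared elsewhere in the code.
  double precision :: GaussWg(7)

  ! Free form: comments start with "!" anywhere on the line, and blanks
  ! inside a numeric constant are not allowed, so the digits run together.
  data GaussWg(7) / 0.091028261982963649811497220702892d0 /

  ! Free form: a statement is continued by ending the line with "&"
  ! instead of putting a character in column 6 of the next line.
  print *, 'GaussWg(7) = ', &
           GaussWg(7)
end program freeform_demo
```

In fixed form blanks are not significant, which is why the spaced-out digits were accepted in the .FOR file. Double precision holds roughly 15 to 16 significant decimal digits, so the extra digits in the constant are harmless but are not retained.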

6) THE above may or may not be the cure, since it neither directly
supports nor refutes the earlier suggestion (Item 4 above).
Furthermore, it might be just temporarily masking the problem!

PLEASE provide at your convenience the name of a modern debugger (Item
2 above), and I will go through the code line-by-line to identify the
culprit once and for all and get to the bottom of the problem in
Test1.FOR.

Thank you kindly for your patience!
Monir