From: dpb on
monir wrote:
....

> NOT being able so far to trap the problem or the code violation, if
> any, leaves me with a couple of options:
>
> 1) POST the entire F77 code:
> as a zip file and include the input files to look at.
> It is a good idea, but with no documentation it would be extremely
> difficult even for you experts to follow the program logic.
> And reducing it to a meaningful size for posting while ensuring it
> still generates the NaN error is not an easy task, and would still be
> considered as an (extended) abbreviated version, and I might in the
> process cut out the source of the problem!

But you've posted versions before that weren't very big _BUT_ left out
the most important parts, the ones that would allow anybody to decipher
problems of mismatched arguments, subscripts, conventions, etc., which
experts could see by inspection if they were made available.

As another respondent has said, it's a real puzzle why you're so
reluctant to provide relatively simple information that could at least
eliminate areas of concern in lieu of continuing to protest why you
can't... :(

....

> The general consensus among the responders is that the problem could
> be attributed to:
> a- declaration issues
> b- arrays out of bounds
> c- mismatched arguments
> d- data on the stack unexpectedly or unintentionally moved around as a
> result of a non-interfering statement such as "PAUSE" or "IMPLICIT
> NONE"
> e- any combinations of the above
> f- none of the above!
> I'm reasonably confident, after so much re-checking and testing, that
> it is NOT a- , b- or c- above, but I could be wrong!

Indeed, I don't think you've provided anything here that gives any of
the previous respondents (and surely not this particular one) any
confidence at all that the above confidence is well-founded. (That's not
personal. It's just that we've not seen the evidence, and, as another
respondent noted, we've seen instances of confusion over syntax that
raise serious concerns about whether you would recognize certain
problems, when they exist, simply by looking at the source code.)

> 4) I suggested earlier:
>> ... it seems reasonable to expect at some point
>> (depending on the complexity of the code and the
>> extent of the mix) that there would be a conflict that wouldn't be
>> detected/resolved by the compiler, leading to possible confusion or
>> misinterpretation or memory disruption or whatever.

As RM said, there is no "mix". That there _could_ be a compiler bug is
possible, but it's certainly not the place to start until all other
issues have been resolved. The first step toward eliminating that
possibility is, of course, to ensure you're using the latest compiler
release. As has also been suggested, you should also run another
compiler (or two or three) on the code, since each has its own strengths
in terms of diagnostics, etc.

....
> 5) OK. Here is my latest attempt:
> a- I took a version of the offending code Test1.FOR, and made sure NO
> "PAUSE" in Sub dCpzeros() and NO "IMPLICIT NONE" in Sub Polin2()
> b- re-compiled and ran the program
> ...got (as expected) ..... x = NaN
> c- renamed the source code (self-contained single file) as Test1F.F90
> ...The MinGW-g95 manual states: " ... with F90 name extension, the
> source code is pre-processed with the C preprocessor."
> Not knowing exactly what that means, I took it to imply that something
> is done by the g95 compiler when using .F90 extension that otherwise
> is NOT done (with .FOR).

Yes, the source file is preprocessed by the C preprocessor (imagine
that! :) ). It won't make any difference whatsoever unless you have
preprocessor directives in the code. If you don't know what a
preprocessor directive is, I'd presume that it's fairly unlikely you're
using them unless this is inherited code.

> d- changed the F77 style to F90 style throughout, namely:
> ...replaced "c" in col 1 by "!"
> ...added "&" for continuation lines and removed char from col 6
> ...deleted blanks between digits (initially for easy reading/editing
> long numbers)...(which is allowed in *.FOR, but gave DATA syntax error in
> *.F90)

Specifically, it is a syntax error in free-form source, not in F90 per
se, since fixed-form F90 is also F90. One incompatibility is that, in
order to implement free-form source, spaces had to become significant.
They weren't (and still aren't) significant in fixed form. But note
that it's the source form that makes the difference (albeit free-form
source was introduced in F90).
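Here is a minimal sketch of the same DATA statement in free form. It is wrapped in a module so that it compiles on its own. The array name comes from the DATA statement quoted later in this thread, and only element 7 is filled in. In free form the digits of the constant must run together, and a statement is continued with a trailing `&` rather than a mark in column 6.

```fortran
! Sketch only: free-form version of the DATA statement discussed in this thread.
module gauss_data_sketch
  implicit none
  ! Only element 7 is initialized here; the other weights are omitted.
  double precision :: gausswg(7)
  ! Fixed form ignores blanks, so the original could be written as
  !       Data GaussWg ( 7) / 0.0910282619 8296364981 1497220702
  !      &892 d0 /
  ! In free form blanks are significant: the digits must be contiguous,
  ! and continuation is a trailing '&' instead of a character in column 6.
  data gausswg(7) / &
       0.091028261982963649811497220702892d0 /
end module gauss_data_sketch
```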

....

>
> e- compiled:
> ...>g95 -fbounds-check -ftrace=full -o Test1F Test1F.F90
> and ran.
> PROGRAM Works Fine!!!! returning:
> ........... x = -1.0676971 (correct)
>
> 6) THE above may or may not be the cure, since it does not directly
> support or refute the earlier suggestion (Item 4 above).
> Furthermore, it might be just temporarily masking the problem!

Nor does it necessarily refute any of the previous a) through f),
particularly if there are still instances of argument association that
make bounds checking difficult or impossible.

I think you would have the most success with the (again, previously
suggested) method of also placing the code in modules, so that the
interfaces are generated automagically. That alone can uncover many of
the issues mentioned above.
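Here is a minimal sketch of the modules route. It uses a hypothetical routine `scale_vector`, which is not from the code under discussion. Any program unit that does `use checked_routines` sees the routine's explicit interface. The compiler can then check the number, type and kind of each argument at every call.

```fortran
! Sketch only: the module and routine names are hypothetical.
module checked_routines
  implicit none
contains
  ! Scales the first n elements of v by factor.
  subroutine scale_vector(n, factor, v)
    integer, intent(in) :: n
    double precision, intent(in) :: factor
    double precision, intent(inout) :: v(n)
    v = factor * v
  end subroutine scale_vector
end module checked_routines
```

Consider a call such as `call scale_vector(3, 2.0, x)`, which passes a default REAL constant to a DOUBLE PRECISION dummy. With the interface visible, it is rejected at compile time. The same call to an external F77-style routine in a separate file may well compile silently.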

What would be my firstest step after the above, though, would be to
reintroduce "IMPLICIT NONE" and redo the above. If the error is back,
you've got a problem, Houston...

> PLEASE provide at your convenience the name of a modern debugger...

I still think you'll get where you need to go faster by taking the
modules route and supplying the information previously requested than
by using a debugger at this point... I don't think you've done the
prerequisite work to make that fruitful yet.

--
From: Gordon Sande on
On 2010-04-02 16:09:09 -0300, monir <monirg(a)mondenet.com> said:

> On Apr 1, 1:39 am, aerogeek <sukhbinder.si...(a)gmail.com> wrote:
>> On Apr 1, 12:08 am, Craig Powers <craig.pow...(a)invalid.invalid> wrote:
>
>>> monir wrote:
>
>>>> 2) Here's again an abbreviated sample code for easy reference:
>>>> (F77, g95)
>
>>> The problem with the abbreviated sample is that it's so abbreviated, it
>>> cuts out the problem.
>
>>> I don't disagree with you that 22k-ish lines is not practical to post.
>>> However, there are a couple of things you can and *should* do:
>>> * Try running it with absolutely every check the compiler offers turned
>>> on. (I see below you've tried to do this
>>> and it didn't get you anywhere... in that case, see next.)
>>> * Try to cut it down to a manageable size. In the process, maybe you'll
>>> discover what the problem is yourself. If it goes away when you take
>>> out a particular piece, that alone gives you an avenue to pursue in
>>> trying to find your problem. If you succeed in producing a manageable
>>> size example, well, now you've got something to post.
>
>>>> monir wrote:
>>>> 3) There appears to be some confusion on when the (current) program
>>>> correctly works and when it doesn't.
>>>> Here's a summary for clarification:
>>>> (ref is to a SINGLE statement in the above abbreviated sample code)
>
>>>> a) with "! pause" and "!! implicit none" NOT activated:
>>>> .......................... program returns x = NaN
>>>> c) with "! pause" NOT activated and "!! implicit none" Activated :
>>>> .......................... program returns x = -1.0676971 (correct)
>
>>> This is rather interesting. I don't think adding IMPLICIT NONE should
>>> change the meaning of a program that continues to compile successfully.
>>> Most compilers have an option that lets you produce assembly output;
>>> have you tried comparing the results for the routine in question with
>>> and without IMPLICIT NONE?
>
> .....YES I have many times. Every routine works perfectly when tested
> in isolation.
> .....I got the assembly output (~ 2,000 pages), but not sure what to
> look for ?
> .....For example, at the top it displays:
> .........................................
> .comm _abscisae_, 36000 # 36000
> .comm _crt_, 496 # 484
> .comm _d2cp_, 144000 # 144000
> .comm _d9mach_, 160 # 152
> .........................................
>
> .....ARE the above pairs of numbers (bytes?) supposed to be the same
> or they're ref to something else ??
>
>>>> monir wrote:
>>>> 8) Based on my rather limited knowledge of Fortran, here's a thought
>>>> for you experts to critique.
>>>> As indicated earlier, the code (work-in-progress, ~ 22,000 lines and ~ 80
>>>> routines) is mostly in F77, but with some limited patches of F90, e.g.;
>>>> use of unlabeled loops, vectors & matrices & array operations, some new
>>>> intrinsic functions, one Contains and one explicit Interface, but no
>>>> modules, no dynamic arrays, no defined data types, no Pointers, no ...
>>>> I've always had some suspicions about such programming practice, even
>>>> though the g95 compiler never complained. But it seems reasonable to
>>>> expect at some point (depending on the complexity of the code and the
>>>> extent of the mix) that there would be a conflict that wouldn't be
>>>> detected/resolved by the compiler, leading to possible confusion or
>>>> misinterpretation or memory disruption or whatever.
>>>> The "g95" compiler, or any other comparable compiler for that matter,
>>>> can't possibly detect and resolve each and every conflict that might arise
>>>> from a mixed F77+F90 programming. Correct ??
>>>> Just a thought! ... you don't have to take it seriously if you don't
>>>> want to!
>
>>>> 5) Some have indicated that mismatched arguments could have caused the
>>>> error.
>>>> A very valid point, and I've been looking at this for some time now.
>>>> But think about it for a moment. If there are mismatched arguments,
>>>> how would/could inserting a "Pause" statement in one of the routines
>>>> or just adding "implicit none" in another (with no additional
>>>> declarations) correct the mismatch and force the algorithms to work
>>>> "perfectly" producing the correct results throughout ??
>>>> This is the other part of the mystery!
>
>> aerogeek wrote:
>> I had this very specific problem. A non-interfering statement, like
>> PAUSE in your case, was causing the same problem for my code.
>
>> This code was running perfectly well on a Windows system, but I saw
>> this problem once I tried the program on a Linux system.
>
>> So, if possible, can you try compiling and running your program on a
>> different system?
>
> .... UNFORTUNATELY, I don't have access to other systems.
>
>> For me the problem had something to do with incorrect array bounds,
>> which was not apparent and didn't come to notice until I used dbx,
>> the debugger.
>
>> So get a debugger and run through the code via a debugger for the
>> conditions under which it's failing. I am sure you will get to the
>> bottom of the problem.
>
> $$ ===================== $$
>
> NOT being able so far to trap the problem or the code violation, if
> any, leaves me with a couple of options:
>
> 1) POST the entire F77 code:
> as a zip file and include the input files to look at.
> It is a good idea, but with no documentation it would be extremely
> difficult even for you experts to follow the program logic.
> And reducing it to a meaningful size for posting while ensuring it
> still generates the NaN error is not an easy task, and would still be
> considered as an (extended) abbreviated version, and I might in the
> process cut out the source of the problem!
>
> 2) USE a modern debugger.
> In the past I used the MS Fortran metacommand "$DEBUG:" for debugging
> (I believe that's what it was called!) by inserting it in the source
> code (could appear multiple times). It was part of the MS Fortran
> compiler.
>
> What modern Fortran Debugger would you recommend (Win XP OS) ??
> Is there a connection between the Fortran compiler g95 and the
> debugger, or does it work independently?
> Does it matter if the code is F77 or F90 or F77+F90 ??
> (I hope it is free!)

Silverfrost F95 (Salford F95 as it used to be called) is free for
personal use. It runs under Windows either from its own IDE or a Windows
command line. Salford have an older F77 but you want the F95. There is
no such thing as a mix of F77 and F90 as F77 is a subset of F90 so it
is all F90. It may be true that some of your code would be F77, but when
it is mixed in with stuff that is only in F90 the result is all F90.

Compile the entire program with both subscript and undefined variable
checking and run the result. It does no good to compile with the options
set, then toss the object and finally recompile and run without the options.

Salford has been suggested to you in the past. Is there some reason why
you have to repeatedly ask the same question? You are unlikely to
get differing advice.

> 3) BACK to the problem in hand.
> The general consensus among the responders is that the problem could
> be attributed to:
> a- declaration issues
> b- arrays out of bounds
> c- mismatched arguments
> d- data on the stack unexpectedly or unintentionally moved around as a
> result of a non-interfering statement such as "PAUSE" or "IMPLICIT
> NONE"
> e- any combinations of the above
> f- none of the above!
> I'm reasonably confident, after so much re-checking and testing, that
> it is NOT a- , b- or c- above, but I could be wrong!

You have the classical symptoms of storing outside of array bounds. Often
this will cause termination for invalid code or, if you are lucky, merely
the changing of unexpected variables. The symptom and the cause are often
widely separated. In other circumstances this is called a buffer overrun,
and it is a major cause of the much-publicized browser security problems.
You have an "a", "b" and "c". "d" makes no sense and mostly just shows that
you have not understood a highly technical answer. You asked why the
symptom comes and goes. The explanation you got was that the code and data
layouts differ when you make minor changes, and that can radically change
the symptoms of the out-of-bounds subscript. Sometimes you shoot yourself
in the foot and other times you miss. The problem is the shooting, not
whether you have slightly moved your foot.

If you have no type mismatches, as you claim, and the program runs with
subscript checking, the best guess is that you are not describing the
array layouts across calls correctly. The actual sequence association you
get is then not what you think it is. I recall seeing many questions from
you about this. It is unlikely that you have a single mistake; more likely
there are several versions of the same logic error. So you need good
debugging tools like the Salford debugging compiler. It carries much extra
information across calls so it can do full checking. This extra work is
not usually done in Fortran, because the programmer is required to get it
right. More efficient object code is therefore perfectly OK, and in fact
insisted upon by many users.

The diagnostic that I expect will indicate that some array as declared in
some subroutine is not properly contained in the array passed to the
subroutine. Ordinary subscript checking will happily allow the subroutine
to store into the elements that are outside the passed array, otherwise
known as an array overrun. The symptoms of this will not be obviously
related to the actual error, which is why debugging these errors is
difficult.
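Here is a sketch of the usual remedy, using a hypothetical routine `fill_first` in a module. With an assumed-shape dummy `a(:)`, the callee receives the real extent of the caller's array. It can therefore check `size(a)` itself. Compiler subscript checking (for example gfortran's `-fcheck=bounds`) then works against that extent instead of against a size passed in by hand.

```fortran
! Sketch only: the module and routine names are hypothetical.
module safe_fill
  implicit none
contains
  ! Sets a(1:n) to value. a(:) is assumed-shape, so SIZE(a) is the extent
  ! of the actual array, not a number supplied by the caller.
  subroutine fill_first(a, n, value)
    double precision, intent(inout) :: a(:)
    integer, intent(in) :: n
    double precision, intent(in) :: value
    ! An F77-style dummy declared as a(n) would take n on trust and, if n
    ! exceeded the actual array, store past its end without complaint.
    if (n > size(a)) error stop 'fill_first: n exceeds the size of the actual array'
    a(1:n) = value
  end subroutine fill_first
end module safe_fill
```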

> 4) I suggested earlier:
>> ... it seems reasonable to expect at some point
>> (depending on the complexity of the code and the
>> extent of the mix) that there would be a conflict that wouldn't be
>> detected/resolved by the compiler, leading to possible confusion or
>> misinterpretation or memory disruption or whatever.
>> The "g95" compiler, or any other comparable compiler for that matter,
>> can't possibly detect and resolve each and every conflict that might arise
>> from a mixed F77+F90 programming.
>> Just a thought! ... you don't have to take it seriously if you don't want to!
>
> Richard Maine and others responded:
>>> ... I consider it incorrect to even label it as mixed f77+f90.
>>> Almost all of f77 is also part of f95. The very few exceptions are
>>> matters of mostly academic interest, as all f95 compilers do them anyway
>>> and they are *NOT* things that are prone to obscure interactions. So
>>> what you have is just f95 code.
>
> 5) OK. Here is my latest attempt:
> a- I took a version of the offending code Test1.FOR, and made sure NO
> "PAUSE" in Sub dCpzeros() and NO "IMPLICIT NONE" in Sub Polin2()
> b- re-compiled and ran the program
> ...got (as expected) ..... x = NaN
> c- renamed the source code (self-contained single file) as Test1F.F90
> ...The MinGW-g95 manual states: " ... with F90 name extension, the
> source code is pre-processed with the C preprocessor."
> Not knowing exactly what that means, I took it to imply that something
> is done by the g95 compiler when using .F90 extension that otherwise
> is NOT done (with .FOR).
> Let me try it.
>
> d- changed the F77 style to F90 style throughout, namely:
> ...replaced "c" in col 1 by "!"
> ...added "&" for continuation lines and removed char from col 6
> ...deleted blanks between digits (initially for easy reading/editing
> long numbers)
> .....e.g.; Data GaussWg ( 7) / 0.0910282619 8296364981 1497220702 892 d0 /
> ..........(which is allowed in *.FOR, but gave DATA syntax error in
> *.F90)
> ......... was changed to:
> ..........Data GaussWg ( 7) / 0.091028261982963649811497220702892d0 /
> That was all. Nothing else was changed.
>
> e- compiled:
> ...>g95 -fbounds-check -ftrace=full -o Test1F Test1F.F90
> and ran.
> PROGRAM Works Fine!!!! returning:
> ........... x = -1.0676971 (correct)
>
> 6) THE above may or may not be the cure, since it does not directly
> support or refute the earlier suggestion (Item 4 above).
> Furthermore, it might be just temporarily masking the problem!
>
> PLEASE provide at your convenience the name of a modern debugger (Item
> 2 above) and will go through the code line-by-line to identify the
> culprit once and for all and get to the bottom of the problem in
> Test1.FOR.
>
> Thank you kindly for your patience!
> Monir


From: steve on
On Apr 2, 2:09 pm, monir <mon...(a)mondenet.com> wrote:
> ....
> NOT being able so far to trap the problem or the code violation, if
> any, leaves me with a couple of options:
>
> 1) POST the entire F77 code:
> as a zip file and include the input files to look at.
> It is a good idea, but with no documentation it would be extremely
> difficult even for you experts to follow the program logic.
> And reducing it to a meaningful size for posting while ensuring it
> still generates the NaN error is not an easy task, and would still be
> considered as an (extended) abbreviated version, and I might in the
> process cut out the source of the problem!
>

I would not call myself an expert in Fortran (I suspect
some here would even endorse that notion :), but I do know
that I can take your problematic code and try

gfortran -fcheck=all -ffpe-trap=invalid -fbacktrace Test1.FOR
./a.out

with at least some expectation of a core dump if NaN occurs.
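Trapping can be complemented by testing suspect intermediate results explicitly. Below is a minimal sketch using the standard `ieee_arithmetic` intrinsic module. That module is Fortran 2003, so it needs a compiler that supports it. The module and function names are hypothetical.

```fortran
! Sketch only: the module and function names are hypothetical.
module nan_check
  use, intrinsic :: ieee_arithmetic, only: ieee_is_nan, ieee_is_finite
  implicit none
  private
  public :: is_bad
contains
  ! True if x is a NaN or an infinity.
  logical function is_bad(x)
    double precision, intent(in) :: x
    is_bad = ieee_is_nan(x) .or. .not. ieee_is_finite(x)
  end function is_bad
end module nan_check
```

Calling, say, `if (is_bad(x)) print *, 'x is NaN or Inf here'` after each stage of the computation helps narrow down where the NaN first appears.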

So, once again, post a URL to a zip archive.

--
steve

From: Craig Powers on
monir wrote:
> .....YES I have many times. Every routine works perfectly when tested
> in isolation.
> .....I got the assembly output (~ 2,000 pages), but not sure what to
> look for ?
> .....For example, at the top it displays:
> .........................................
> .comm _abscisae_, 36000 # 36000
> .comm _crt_, 496 # 484
> .comm _d2cp_, 144000 # 144000
> .comm _d9mach_, 160 # 152
> .........................................
>
> .....ARE the above pairs of numbers (bytes?) supposed to be the same
> or they're ref to something else ??

My very specific suggestion with respect to assembly output was to
compare the result with IMPLICIT NONE with the result without; I don't
think there should be any differences at all, but you said you got
different behavior.

From: glen herrmannsfeldt on
Gordon Sande <Gordon.Sande(a)eastlink.ca> wrote:
(really big snip)

> There is
> no such thing as a mix of F77 and F90 as F77 is a subset of F90 so it
> is all F90. It may be true that some of you code would be F77 but when
> it is mixed in with stuff that is only in F90 the result is all F90.

There are a few features in Fortran 77 that were removed in
Fortran 95. Some of them were only added in Fortran 77, so they didn't
have very long to come into use in actual programs.

Fortran 77 added the use of REAL and DOUBLE PRECISION variables
in DO loops. While many problems can be caused through the use
of such variables, I don't see the need to remove them from the
standard. (All the other languages that I know with a looping
statement allow them.)
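For code that has to build with a strict Fortran 95 compiler, the common rewrite drives the loop with an integer counter and computes the real value from it. A minimal sketch follows; the module and function names are hypothetical.

```fortran
! Sketch only. Deleted F77 form:   DO 10 X = 0.0, 1.0, 0.1
! Its trip count is computed from REAL values, so in general it can
! depend on rounding. The rewrite counts with an integer instead.
module real_loop_sketch
  implicit none
contains
  ! Sums X = 0.0, 0.1, ..., 1.0 (exactly 11 passes).
  real function sum_of_steps()
    real :: x
    integer :: i
    sum_of_steps = 0.0
    do i = 0, 10
      x = 0.1 * real(i)
      sum_of_steps = sum_of_steps + x
    end do
  end function sum_of_steps
end module real_loop_sketch
```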

Next is branching to ENDIF from outside its block. Why that
was added, I don't know. It isn't hard to fix, either, so
this one is fine with me.
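Here is a minimal sketch of that fix, with hypothetical names. The label moves from the END IF to a CONTINUE immediately after it, which keeps the control flow the same.

```fortran
! Sketch only. F77 allowed a GO TO from outside an IF block to land on its END IF:
!       IF (X .LT. 0.0) GO TO 20
!       IF (Y .GT. 0.0) THEN
!         R = X*Y
!  20   END IF
! That branch was deleted in F95; labelling a CONTINUE after END IF is equivalent.
module endif_branch_sketch
  implicit none
contains
  function product_if_positive(x, y) result(r)
    real, intent(in) :: x, y
    real :: r
    r = 0.0
    if (x < 0.0) go to 20
    if (y > 0.0) then
      r = x * y
    end if
20  continue
  end function product_if_positive
end module endif_branch_sketch
```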

Then there is ASSIGN, assigned GOTO, and the use of ASSIGN
with Format statement numbers. ASSIGN goes back to Fortran I,
while the use for FORMAT was only added in Fortran 77.
Reminds me of trying, in a very early program that I wrote, to
use a variable for format statement number in a WRITE statement.
(Fortran 66 allows arrays, but not scalar variables.)
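The Fortran 90 replacement for an assigned format is a CHARACTER variable (or expression) used as the format. A minimal sketch follows; the names are hypothetical.

```fortran
! Sketch only. F77 could pick a FORMAT at run time with ASSIGN:
!       ASSIGN 100 TO IFMT
!       WRITE (*, IFMT) X
!  100  FORMAT (F8.3)
! ASSIGN was deleted in F95; a CHARACTER variable holding the format does the same job.
module format_choice_sketch
  implicit none
contains
  function format_value(x, wide) result(text)
    real, intent(in) :: x
    logical, intent(in) :: wide
    character(len=20) :: text
    character(len=10) :: fmt
    if (wide) then
      fmt = '(F12.5)'
    else
      fmt = '(F8.3)'
    end if
    ! Internal WRITE into TEXT; the rest of the 20 characters is blank-filled.
    write (text, fmt) x
  end function format_value
end module format_choice_sketch
```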

Last is the H format descriptor. I have known Fortran
preprocessors to generate these on output to be strictly
compatible with Fortran 66. Maybe one of the more popular
extensions was allowing apostrophes in FORMAT. I don't know
anyone who misses the H descriptor.
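Here is a minimal sketch of the substitution, with hypothetical names. The H descriptor's text becomes an apostrophe-delimited character string edit descriptor.

```fortran
! Sketch only.
!  100  FORMAT (6H X IS , F6.2)     <- 6H takes the next 6 characters (deleted in F95)
!  100  FORMAT (' X IS ', F6.2)     <- equivalent character string edit descriptor
module hollerith_sketch
  implicit none
contains
  function labelled(x) result(text)
    real, intent(in) :: x
    character(len=12) :: text
    write (text, 100) x
100 format (' X IS ', F6.2)
  end function labelled
end module hollerith_sketch
```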

It seems that these were removed in Fortran 95, so technically
all Fortran 77 programs are also Fortran 90 programs, but not
necessarily Fortran 95 programs.

-- glen