From: deltaquattro on
Hi,

I was wondering whether global array operations, introduced in f90,
can have a negative impact on performance. Compare:

do i = 1, ntheta
  r(0,i) = rhub
  r(nr+1,i) = rmax
  dt(0,i) = 0.0
  dft(0,i) = 0.0
  dfr(0,i) = 0.0
  dt(nr+1,i) = 0.0
  dft(nr+1,i) = 0.0
  dfr(nr+1,i) = 0.0
end do

with:

r(0,:) = rhub
r(nr+1,:) = rmax
dt(0,:) = 0.0
dft(0,:) = 0.0
dfr(0,:) = 0.0
dt(nr+1,:) = 0.0
dft(nr+1,:) = 0.0
dfr(nr+1,:) = 0.0

I found the execution time of the latter to be higher than the former,
as if many DO loops were executed instead of just one. Why use
global array operations then? Isn't it better to stick to plain old DO
loops? Thanks,

regards,

deltaquattro




From: Dennis Wassel on
On 17 Jun., 17:41, deltaquattro <deltaquat...(a)gmail.com> wrote:
> Hi,
>
> I was wondering whether global array operations, introduced in f90,
> can have a negative impact on performance.
>
> [snip]
>
> I found the execution time of the latter to be higher than the former,
> as if many DO loops were executed instead of just one. Why use
> global array operations then? Isn't it better to stick to plain old DO
> loops? Thanks,
>
> regards,
>
> deltaquattro

This is quite a strange observation and raises some questions:

1) What optimisation options did you use?

2) Which compiler did you use?
The gcc 4.0 and 4.1 Fortran compilers for instance are still pretty
much in their infancy, so one would expect bugs and strange behaviour
there. Use 4.2 or 4.3 instead, if you use gfortran.

3) How did you measure execution time?
I find that accurate timing on a computer is a nontrivial task. The
'time' command on my machine shows up to 200% variance. I can only
assume you used some clever and appropriately precise way of
measuring.
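
A minimal sketch of one way to time the two variants from the original
post, using the standard SYSTEM_CLOCK intrinsic and repeating each
variant many times so the interval is well above the clock resolution.
The array sizes, the repetition count, and the values of rhub and rmax
are assumptions made up for illustration; the original post gives none
of them. An optimising compiler is free to simplify the repetition
loop, so treat the numbers as indicative only and compare several runs.

```fortran
program time_boundary_init
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)
  ! Hypothetical sizes and values, chosen only for this sketch.
  integer, parameter :: nr = 500, ntheta = 2000, nrep = 500
  real, parameter :: rhub = 0.1, rmax = 1.0
  real, allocatable :: r(:,:), dt(:,:), dft(:,:), dfr(:,:)
  integer :: i, rep
  integer(i8) :: t0, t1, rate

  allocate(r(0:nr+1, ntheta), dt(0:nr+1, ntheta))
  allocate(dft(0:nr+1, ntheta), dfr(0:nr+1, ntheta))
  r = 0.5; dt = 1.0; dft = 1.0; dfr = 1.0

  ! Variant 1: one fused DO loop over all boundary values.
  call system_clock(t0, rate)
  do rep = 1, nrep
    do i = 1, ntheta
      r(0,i) = rhub
      r(nr+1,i) = rmax
      dt(0,i) = 0.0
      dft(0,i) = 0.0
      dfr(0,i) = 0.0
      dt(nr+1,i) = 0.0
      dft(nr+1,i) = 0.0
      dfr(nr+1,i) = 0.0
    end do
  end do
  call system_clock(t1)
  print '(a, f10.6, a)', 'DO loop:        ', real(t1 - t0) / real(rate), ' s'
  ! Use the results so the work is less likely to be optimised away.
  print *, 'checksum:', sum(r(0,:)) + sum(r(nr+1,:)) + sum(dt(0,:))

  ! Variant 2: eight array-section assignments.
  call system_clock(t0)
  do rep = 1, nrep
    r(0,:) = rhub
    r(nr+1,:) = rmax
    dt(0,:) = 0.0
    dft(0,:) = 0.0
    dfr(0,:) = 0.0
    dt(nr+1,:) = 0.0
    dft(nr+1,:) = 0.0
    dfr(nr+1,:) = 0.0
  end do
  call system_clock(t1)
  print '(a, f10.6, a)', 'array sections: ', real(t1 - t0) / real(rate), ' s'
  print *, 'checksum:', sum(r(0,:)) + sum(r(nr+1,:)) + sum(dt(0,:))
end program time_boundary_init
```

Building it at more than one optimisation level and running it several
times gives a feel for how much of any difference is just noise.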

I'm not a compiler specialist but AFAIK, array operations should not
usually be slower than explicit loop constructs.

Why? When using array operations like, say, x = MATMUL(A,b) in
contrast to two nested DO-loops, the compiler has a greater amount of
information at hand about what it is you want to do, which allows it
to use more aggressive optimisation methods to generate code, or to
generate calls to (more or less optimised) runtime libraries; the
latter is done by all compilers I know.
Additionally, the gfortran compiler has the '-fexternal-blas' option
which tells the compiler to automagically generate calls to an
optimised vendor BLAS for certain array operations, instead of using
the runtime library. I've never tried this, but using a tuned ATLAS
library will surely speed things up nicely.
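
To make the comparison concrete, here is a small sketch of the
matrix-vector product written both ways. The size n and the random
test data are assumptions for illustration only. The two results may
differ by rounding, since the order of the additions is not necessarily
the same.

```fortran
program matmul_vs_loops
  implicit none
  integer, parameter :: n = 4          ! hypothetical size
  real :: a(n,n), b(n), x_loop(n), x_intr(n)
  integer :: i, j

  call random_number(a)
  call random_number(b)

  ! Explicit version: two nested DO loops, inner loop over the
  ! first index so that a(:,j) is read consecutively in memory.
  x_loop = 0.0
  do j = 1, n
    do i = 1, n
      x_loop(i) = x_loop(i) + a(i,j) * b(j)
    end do
  end do

  ! Array version: a single intrinsic call.
  x_intr = matmul(a, b)

  print *, 'max abs difference:', maxval(abs(x_loop - x_intr))
end program matmul_vs_loops
```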

A second benefit of array operations is their conciseness. Take the
MATMUL example again: A single call as opposed to two nested DO-loops. Or
think of copying part of an array into another array, anything really!
IMHO, a lot of scientific code completely disregards maintainability
issues for the sake of the highest possible degree of code
optimisation.
Using array operations makes your code more concise, more readable and
therefore easier to maintain in the long run! It *should* also
improve, or at least not hurt, performance.

Cheers,
Dennis


From: James Van Buskirk on
"deltaquattro" <deltaquattro(a)gmail.com> wrote in message
news:9c706700-2861-4d17-a3b8-7e2291fa0b5f(a)2g2000hsn.googlegroups.com...

> do i=1, ntheta
> r(0,i) = rhub
> r(nr+1,i) = rmax
> dt(0,i) = 0.0
> dft(0,i) = 0.0
> dfr(0,i) = 0.0
> dt(nr+1,i) = 0.0
> dft(nr+1,i) = 0.0
> dfr(nr+1,i) = 0.0
> end do

> r(0,:) = rhub
> r(nr+1,:) = rmax
> dt(0,:) = 0.0
> dft(0,:) = 0.0
> dfr(0,:) = 0.0
> dt(nr+1,:) = 0.0
> dft(nr+1,:) = 0.0
> dfr(nr+1,:) = 0.0

> I found the execution time of the latter to be higher than the former,
> as if many DO loops were executed instead of just one. Why use
> global array operations then? Isn't it better to stick to plain old DO
> loops? Thanks,

Normally an initialization loop like this one would be faster as
separate loops than as one fused loop, because it's faster to access
memory consecutively rather than jumping around as implied by the
fused loop. However, in this case the loops appear to be setting
boundary values, so they are traversing rows rather than columns of
the arrays, and Fortran stores arrays column by column. As a
consequence, the code jumps around in memory no matter what the
compiler does, and loop fusion can win out because it implies less
loop overhead, which otherwise would be of negligible importance
compared to memory access considerations (assuming that the data set
is too large to fit in cache).
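
The row-versus-column point can be checked with a tiny sketch. The
array is filled so that each element holds its own position in
storage; the sizes are made up for illustration.

```fortran
program column_major_demo
  implicit none
  integer, parameter :: nr = 3, ntheta = 4   ! hypothetical sizes
  integer :: a(0:nr+1, ntheta)
  integer :: k

  ! RESHAPE fills in array element order, so every element ends up
  ! holding its 1-based position in memory.
  a = reshape([(k, k = 1, size(a))], shape(a))

  print '(a, 5i4)', 'column a(:,1): ', a(:, 1)   ! 1 2 3 4 5
  print '(a, 4i4)', 'row    a(0,:): ', a(0, :)   ! 1 6 11 16
end program column_major_demo
```

Walking along the column touches neighbouring positions, while walking
along the boundary row a(0,:) jumps nr+2 elements at every step.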

One thing to investigate is whether the r(i,j), dt(i,j), dft(i,j),
and dfr(i,j) always get accessed together. If so, you could group
them as a derived type, and the above loop could then go up to 4X as
fast as the structure-of-arrays code listed above.
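
A rough sketch of what that grouping might look like. The type name
cell, the array name c, the sizes, and the values of rhub and rmax are
all hypothetical; only the component names follow the original arrays.
With this layout the four values for one (i,j) sit next to each other
in memory, so each boundary point costs one jump instead of four.

```fortran
program boundary_aos
  implicit none
  ! Hypothetical derived type grouping the four fields of one grid point.
  type :: cell
    real :: r, dt, dft, dfr
  end type cell
  integer, parameter :: nr = 8, ntheta = 16   ! made-up sizes
  real, parameter :: rhub = 0.1, rmax = 1.0   ! made-up values
  type(cell), allocatable :: c(:,:)
  integer :: i

  allocate(c(0:nr+1, ntheta))

  ! Same boundary initialization as the original loop: each structure
  ! constructor sets r, dt, dft and dfr of one boundary point at once.
  do i = 1, ntheta
    c(0,i)    = cell(rhub, 0.0, 0.0, 0.0)
    c(nr+1,i) = cell(rmax, 0.0, 0.0, 0.0)
  end do

  print *, 'inner boundary r:', c(0,1)%r
  print *, 'outer boundary r:', c(nr+1,1)%r
end program boundary_aos
```

Whether this pays off depends on how the rest of the code accesses the
fields; if other loops sweep over only one field, the separate arrays
may still be the better layout.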

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end


From: Richard Maine on
deltaquattro <deltaquattro(a)gmail.com> wrote:

> I was wondering whether global array operations, introduced in f90,
> can have a negative impact on performance. Compare:
>
[elided initialization with DO loops and whole array operations]

> I found the execution time of the latter to be higher than the former,
> as if many DO loops were executed instead of just one. Why use
> global array operations then? Isn't it better to stick to plain old DO
> loops? Thanks,

Note that the usual terminology is something more like "whole array
operations" or even just an unmodified "array operations" instead of
"global".

The main reasons are clarity and conciseness. If it doesn't help clarity
and conciseness, don't do it. That is, no doubt, an oversimplification;
there are exceptions, etc. But it's a good first approximation. Every
once in a while they might also get you faster execution, but if that is
your primary reason for using them, and you don't have specific
knowledge of exactly why to expect faster execution from your particular
case, then your efforts are probably misplaced.

Execution time is actually not the sole measure of code "goodness". In
many cases, it isn't even particularly high on the list of important
things. Sometimes it isn't on the list at all. Other times it is at the
very top of the list. All generalizations are false, your mileage may
vary, etc.

In that regard, is execution time of an initialization such as this
actually significant in your code? While possible, that would be
unusual, and might suggest that the choice of algorithms is less than
ideal. There can be efficient algorithms like that, but they are rare.
As Dennis says, it can be tricky to even measure execution times
precisely enough to time initializations like this. I'm supposing that
perhaps you are just using this as an example of more "interesting"
cases.

In answer to Dennis, by the way, it is *VERY* common for whole array
operations to be slower than DO loops. No, it is not at all strange. It
is much closer to the usual state of affairs. There are a whole host of
reasons.

1. Compilers have had over 5 decades of time to develop techniques of
optimizing loops. Progress has been made in that time. There has only
been about a decade or two (some work preceded the f90 standard; other
compilers didn't really start until later) of significant work on
optimizing array expressions. Things have improved and are still
improving, but it just is not at the level of experience of DO-loop
optimization.

2. Array temporaries are often a big deal in whole-array expressions. A
naive (aka straightforward) application of the rules very often involves
such array temporaries, which are expensive in time. The compiler has to
do a fair amount of work to figure out whether they can be elided. See
point 1. That's probably not the case for your example, but it is a
common one.

3. Your example illustrates the problem of "loop fusion". The naive
(again aka straightforward) application of the rules for your code
example *DOES* imply a separate loop for each array operation (complete
with all loop overhead). That's how the operations are defined. It is an
optimization for the compiler to recognize when it can usefully fuse
these multiple loops. See point 1.
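
Points 2 and 3 can be illustrated with a short sketch; the array names
and sizes are made up. The first part shows why an array temporary can
be needed: an array assignment behaves as if the whole right-hand side
were evaluated before anything is stored, which a naive forward loop
does not reproduce. The second part shows two array-section statements
next to the single fused loop that gives the same result.

```fortran
program array_semantics_demo
  implicit none
  integer, parameter :: nr = 3, ntheta = 4   ! hypothetical sizes
  integer :: a(5), b(5), i
  real :: p(0:nr+1, ntheta), q(0:nr+1, ntheta)
  real :: p2(0:nr+1, ntheta), q2(0:nr+1, ntheta)

  ! Point 2: overlapping sections. The array assignment must act as if
  ! a(1:4) were copied first (a temporary, or a cleverly ordered loop).
  a = [1, 2, 3, 4, 5]
  b = a
  a(2:5) = a(1:4)
  do i = 2, 5              ! naive forward loop reads overwritten values
    b(i) = b(i-1)
  end do
  print '(a, 5i3)', 'array assignment: ', a   ! 1 1 2 3 4
  print '(a, 5i3)', 'forward DO loop:  ', b   ! 1 1 1 1 1

  ! Point 3: each array statement is, conceptually, its own loop ...
  p = 1.0; q = 1.0; p2 = 1.0; q2 = 1.0
  p(0,:) = 0.0
  q(0,:) = 0.0
  ! ... while the hand-written version does both in one fused loop.
  do i = 1, ntheta
    p2(0,i) = 0.0
    q2(0,i) = 0.0
  end do
  print *, 'same result: ', all(p == p2) .and. all(q == q2)
end program array_semantics_demo
```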

--
Richard Maine | Good judgement comes from experience;
email: last name at domain . net | experience comes from bad judgement.
domain: summertriangle | -- Mark Twain


From: Dennis Wassel on
On 17 Jun., 20:31, nos...(a)see.signature (Richard Maine) wrote:
>
> [snipety-snip]
>
> In answer to Dennis, by the way, it is *VERY* common for whole array
> operations to be slower than DO loops. No, it is not at all strange. It
> is much closer to the usual state of affairs. There are a whole host of
> reasons.

Beats me.
After all, when using array operations you have additional information
either readily available, or you are able to extract it fairly easily,
which is not always the case in DO loops. Think of aliasing,
strides, contiguous memory locations etc. But then again, "See point
1".
I actually find this hard to believe, but given that I'm rather new to
the delightful post-77 Fortran world and haven't really done any
serious benchmarking on array operations vs DO loops, I'll gladly
trust your judgement on this. Thanks for enlightening us!

> 1. Compilers have had over 5 decades of time to develop techniques of
> optimizing loops. Progress has been made in that time. There has only
> been about a decade or two (some work preceded the f90 standard; other
> compilers didn't really start until later) of significant work on
> optimizing array expressions. Things have improved and are still
> improving, but it just is not at the level of experience of DO-loop
> optimization.

OK, here's a newfound corner of gcc development that I feel like
working on, as soon as I have more time on my hands than right now. After
all, despite my earlier ramblings about conciseness and
maintainability, performance DOES matter in many cases that are
relevant to me :)

> 2. Array temporaries are often a big deal in whole-array expressions. A
> naive (aka straightforward) application of the rules very often involves
> such array temporaries, which are expensive in time. The compiler has to
> do a fair amount of work to figure out whether they can be elided. See
> point 1. That's probably not the case for your example, but it is a
> common one.

The Intel compiler (10.1, maybe earlier versions as well) throws a
warning at runtime if it finds itself needing to create an array
temporary; I found myself changing pieces of my code due to those
warnings.
Gonna have a look if the gfortran guys already have a feature request
about this...

> 3. Your example illustrates the problem of "loop fusion". The naive
> (again aka straightforward) application of the rules for your code
> example *DOES* imply a separate loop for each array operation (complete
> with all loop overhead). That's how the operations are defined. It is an
> optimization for the compiler to recognize when it can usefully fuse
> these multiple loops. See point 1.
>
> --
> Richard Maine | Good judgement comes from experience;
> email: last name at domain . net | experience comes from bad judgement.
> domain: summertriangle | -- Mark Twain
