From: Hifi-Comp on
On Jun 19, 12:24 am, steve <kar...(a)comcast.net> wrote:
> On Jun 18, 8:22 pm, Hifi-Comp <wenbinyu.hea...(a)gmail.com> wrote:
>
> > This revisits a problem I posted some time ago. However, the problem
> > is not completely resolved.
> > I have the following code to test the efficiency of OO.
>
> Oddly, you haven't shown a -O0 result.
>
> (Code elided)
>
>
>
>
>
> > When I put all the code in one single file named test.f90 and compile
> > it using gfortran -O3 -ffast-math -march=native -fwhole-file
> > test.f90, I obtain excellent efficiency for operator overloading:
> > Analysis runs for 0.141 sec and DNAD runs for 0.062 sec.
> > However, when I split it into three separate files (put
> > program test in main.f90, the CPUTime module in CPUtime.f90, and the DNAD
> > module in DNAD.f90), and use the following series of commands:
> > gfortran -c -O3 -ffast-math -march=native -fwhole-file  CPUtime.f90
> > gfortran -c -O3 -ffast-math -march=native -fwhole-file  DNAD.f90
> > gfortran -c -O3 -ffast-math -march=native -fwhole-file  main.f90
>
> > gfortran -o -O3 -ffast-math -march=native -fwhole-file  CPUtime.o
> > DNAD.o main.o
>
> > I lost much of the efficiency; now Analysis runs for 0.156 sec and
> > DNAD runs for 1.25 sec.
>
> > Any hints on how to optimize the multiple-file code are greatly
> > appreciated.
>
> Read the documentation?
>
> Hopefully, Google Groups doesn't mess up the formatting.
>
> % cat run
> #! /bin/csh
>
> echo "Case 1"
> gfc4x -O3 -march=native -fwhole-program -ffast-math \
>   -funroll-loops -ftree-vectorize -o z a.f90
> ./z
> echo "Case 2"
> gfc4x -O3 -march=native -fwhole-file -ffast-math \
>   -funroll-loops -ftree-vectorize -c c.f90
> gfc4x -O3 -march=native -fwhole-file -ffast-math \
>   -funroll-loops -ftree-vectorize -c d.f90
> gfc4x -O3 -march=native -fwhole-program -ffast-math \
>   -funroll-loops -ftree-vectorize -o z b.f90 d.o c.o
> ./z
>
> echo "Case 3"
> gfc4x -flto -O3 -march=native -fwhole-file -ffast-math \
>   -funroll-loops -ftree-vectorize -c c.f90
> gfc4x -flto -O3 -march=native -fwhole-file -ffast-math  \
>   -funroll-loops -ftree-vectorize -c d.f90
> gfc4x -flto -O3 -march=native -fwhole-program -ffast-math \
>  -funroll-loops -ftree-vectorize -o z b.f90 d.o c.o
> ./z
>
> % ./run
> Case 1
>  Analysis Runs for    7.50000030E-02  Seconds.
>   -19999999.999023605
>  DNAD Runs for    7.59999976E-02  Seconds.
>   -19999999.999023605        24999999.999999996
> Case 2
>  Analysis Runs for    7.50000030E-02  Seconds.
>   -19999999.999023605
>  DNAD Runs for     1.5360000      Seconds.
>   -19999999.999023605        24999999.999999996
> Case 3
>  Analysis Runs for    7.59999976E-02  Seconds.
>   -19999999.999023605
>  DNAD Runs for    7.50000030E-02  Seconds.
>   -19999999.999023605        24999999.999999996
>
> --
> steve

I am using gfortran on Windows, and -flto is not available in the win32
build of gcc version 4.6.0 20100524.
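
One workaround that does not need -flto, as a rough sketch: glue the three sources back into one file at build time, so the compiler sees everything at once as in the test.f90 case, while the sources stay separate on disk. This assumes a POSIX shell is available (for example MSYS or Cygwin on Windows); the function name make_unity and the file name all.f90 are made up for illustration.

```shell
#!/bin/sh
# make_unity OUT SRC...: concatenate Fortran sources into OUT, in the order
# given. Modules must be listed before the program units that USE them.
make_unity() {
    out=$1
    shift
    : > "$out"                               # create or truncate the output
    for src in "$@"; do
        printf '! ---- %s ----\n' "$src" >> "$out"   # Fortran comment marker
        cat "$src" >> "$out"
        printf '\n' >> "$out"                # in case a file lacks a final newline
    done
}

# Build only when the three sources from the post and gfortran are present.
if [ -f CPUtime.f90 ] && [ -f DNAD.f90 ] && [ -f main.f90 ] \
   && command -v gfortran >/dev/null 2>&1; then
    make_unity all.f90 CPUtime.f90 DNAD.f90 main.f90
    gfortran -O3 -ffast-math -march=native -fwhole-file all.f90 -o test
fi
```

The price is that every change recompiles everything, which is fine for a program of this size but may not scale to a large code base.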
From: Hifi-Comp on
This revisits a problem I posted some time ago. However, the problem
is not completely resolved.
I have the following code to test the efficiency of OO.

MODULE CPUTime
IMPLICIT NONE
PRIVATE
PUBLIC TIC, TOC
INTEGER::start, rate, finish
CONTAINS
    SUBROUTINE TIC
        CALL SYSTEM_CLOCK(start,rate)
    END SUBROUTINE TIC

    FUNCTION TOC() RESULT(sec)
        REAL::sec
        CALL SYSTEM_CLOCK(finish)
        IF(finish>start) THEN
            sec=REAL(finish-start)/REAL(rate)
        ELSE
            sec=0.0
        ENDIF
    END FUNCTION TOC
END MODULE CPUTime


MODULE DNAD
IMPLICIT NONE
PRIVATE

TYPE,PUBLIC:: DUAL_NUM
    REAL(8)::x_ad_
    REAL(8)::xp_ad_
END TYPE DUAL_NUM

PUBLIC OPERATOR (-)
INTERFACE OPERATOR (-)
    MODULE PROCEDURE MINUS_DD
END INTERFACE

PUBLIC OPERATOR (*)
INTERFACE OPERATOR (*)
    MODULE PROCEDURE MULT_DD
END INTERFACE

PUBLIC OPERATOR (/)
INTERFACE OPERATOR (/)
    MODULE PROCEDURE DIV_DD
END INTERFACE

CONTAINS
    ELEMENTAL FUNCTION MINUS_DD(u,v) RESULT(res)
        TYPE (DUAL_NUM), INTENT(IN)::u,v
        TYPE (DUAL_NUM)::res
        res%x_ad_ = u%x_ad_-v%x_ad_
        res%xp_ad_= u%xp_ad_-v%xp_ad_
    END FUNCTION MINUS_DD

    ELEMENTAL FUNCTION MULT_DD(u,v) RESULT(res)
        TYPE (DUAL_NUM), INTENT(IN)::u,v
        TYPE (DUAL_NUM)::res
        res%x_ad_ = u%x_ad_*v%x_ad_
        res%xp_ad_= u%xp_ad_*v%x_ad_ + u%x_ad_*v%xp_ad_
    END FUNCTION MULT_DD

    ELEMENTAL FUNCTION DIV_DD(u,v) RESULT(res)
        TYPE (DUAL_NUM), INTENT(IN)::u,v
        REAL(8)::tmp
        TYPE (DUAL_NUM)::res
        tmp=1.D0/v%x_ad_
        res%x_ad_ = u%x_ad_*tmp
        res%xp_ad_ =(u%xp_ad_- res%x_ad_*v%xp_ad_)*tmp
    END FUNCTION DIV_DD
END MODULE DNAD

PROGRAM Test
USE DNAD
USE CPUTime
IMPLICIT NONE
REAL(8):: x_,y_,z_,f_,ftot_
TYPE(DUAL_NUM):: x,y,z,f,ftot
INTEGER:: I

x_=1.0d0;y_=2.0d0;z_=0.3d0
ftot_=0.0d0
CALL TIC

DO i=1,50000000
    f_=x_-y_*z_/x_
    ftot_ = ftot_ - f_
ENDDO
WRITE(*,*)'Analysis Runs for ', TOC(),' Seconds.'
write(*,*)ftot_

x=DUAL_NUM(1.0d0,0.1D0);y=DUAL_NUM(2.0d0,0.2D0);z=DUAL_NUM(0.3d0,0.3D0)
ftot=DUAL_NUM(0.0d0,0.0D0)
CALL TIC

DO i=1,50000000
    f=x-y*z/x
    ftot = ftot - f
ENDDO
WRITE(*,*)'DNAD Runs for ', TOC(),' Seconds.'
write(*,*)ftot
END PROGRAM Test

When I put all the code in one single file named test.f90 and compile
it using gfortran -O3 -ffast-math -march=native -fwhole-file
test.f90, I obtain excellent efficiency for operator overloading:
Analysis runs for 0.141 sec and DNAD runs for 0.062 sec.

However, when I split it into three separate files (put
program test in main.f90, the CPUTime module in CPUtime.f90, and the DNAD
module in DNAD.f90), and use the following series of commands:
gfortran -c -O3 -ffast-math -march=native -fwhole-file CPUtime.f90
gfortran -c -O3 -ffast-math -march=native -fwhole-file DNAD.f90
gfortran -c -O3 -ffast-math -march=native -fwhole-file main.f90

gfortran -o -O3 -ffast-math -march=native -fwhole-file CPUtime.o
DNAD.o main.o

I lost much of the efficiency; now Analysis runs for 0.156 sec and
DNAD runs for 1.25 sec.

Any hints on how to optimize the multiple-file code are greatly
appreciated.

From: yaqi on
On Jun 18, 9:22 pm, Hifi-Comp <wenbinyu.hea...(a)gmail.com> wrote:
> This revisits a problem I posted some time ago. However, the problem
> is not completely resolved.
> I have the following code to test the efficiency of OO.
>
> (Code elided)
>
> When I put all the code in one single file named test.f90 and compile
> it using gfortran -O3 -ffast-math -march=native -fwhole-file
> test.f90, I obtain excellent efficiency for operator overloading:
> Analysis runs for 0.141 sec and DNAD runs for 0.062 sec.
>
> However, when I split it into three separate files (put
> program test in main.f90, the CPUTime module in CPUtime.f90, and the DNAD
> module in DNAD.f90), and use the following series of commands:
> gfortran -c -O3 -ffast-math -march=native -fwhole-file  CPUtime.f90
> gfortran -c -O3 -ffast-math -march=native -fwhole-file  DNAD.f90
> gfortran -c -O3 -ffast-math -march=native -fwhole-file  main.f90
>
> gfortran -o -O3 -ffast-math -march=native -fwhole-file  CPUtime.o
> DNAD.o main.o
>
> I lost much of the efficiency; now Analysis runs for 0.156 sec and
> DNAD runs for 1.25 sec.
>
> Any hints on how to optimize the multiple-file code are greatly
> appreciated.

Hi Hifi-Comp,

I tested your code with Intel Visual Fortran. What matters is setting
Interprocedural Optimization to Multi-file (/Qipo); single-file
optimization does not give good performance. With /Qipo the time is
0.125s; without it, 2.45s.

I am not sure whether gfortran can do something similar. If not, you may
want to consider switching to another compiler.

Either way, this shows that compilers can perform this optimization.

yaqi
From: glen herrmannsfeldt on
Hifi-Comp <wenbinyu.heaven(a)gmail.com> wrote:
(big snip)

> When I put all the code in one single file named test.f90 and compile
> it using gfortran -O3 -ffast-math -march=native -fwhole-file
> test.f90, I obtain excellent efficiency for operator overloading:
> Analysis runs for 0.141 sec and DNAD runs for 0.062 sec.

> However, when I split it into three separate files (put
> program test in main.f90, the CPUTime module in CPUtime.f90, and the DNAD
(snip)

> I lost much of the efficiency; now Analysis runs for 0.156 sec and
> DNAD runs for 1.25 sec.

> Any hints on how to optimize the multiple-file code are greatly
> appreciated.

There is a story about a guy who goes to the doctor, complaining,
"It hurts when I go like this. What should I do?"

The doctor says, don't go like that.

Optimizing over the whole program allows it to inline the call.

Without it, there is at least the subroutine call overhead.
There is no way to inline the called routine if the compiler
can't see it at compile time, no matter how many times you ask.

Some calling conventions are more efficient than others, but
none are more efficient than not doing a call.
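
One way to see this from the outside, as a rough sketch: list the undefined symbols in main.o. Assuming GNU binutils nm and gfortran's usual __<module>_MOD_<procedure> naming for module procedures, any __dnad_MOD_... entry is an operator call that was left as a real call at compile time. The helper name module_calls is made up for illustration.

```shell
#!/bin/sh
# module_calls: read `nm -u` output on stdin and print the symbols that look
# like gfortran module procedures (names containing _MOD_).
module_calls() {
    awk '/_MOD_/ { print $NF }'
}

# Inspect main.o only if it exists and nm is installed.
if [ -f main.o ] && command -v nm >/dev/null 2>&1; then
    nm -u main.o | module_calls
fi
```

For the separately compiled build one would expect the three DNAD operators to show up here, along with TIC and TOC from CPUTime. With -flto the inlining happens at link time, so main.o itself may still list them.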

-- glen
From: steve on
On Jun 18, 8:22 pm, Hifi-Comp <wenbinyu.hea...(a)gmail.com> wrote:
> This revisits a problem I posted some time ago. However, the problem
> is not completely resolved.
> I have the following code to test the efficiency of OO.

Oddly, you haven't shown a -O0 result.

(Code elided)

> When I put all the code in one single file named test.f90 and compile
> it using gfortran -O3 -ffast-math -march=native -fwhole-file
> test.f90, I obtain excellent efficiency for operator overloading:
> Analysis runs for 0.141 sec and DNAD runs for 0.062 sec.

> However, when I split it into three separate files (put
> program test in main.f90, the CPUTime module in CPUtime.f90, and the DNAD
> module in DNAD.f90), and use the following series of commands:
> gfortran -c -O3 -ffast-math -march=native -fwhole-file  CPUtime.f90
> gfortran -c -O3 -ffast-math -march=native -fwhole-file  DNAD.f90
> gfortran -c -O3 -ffast-math -march=native -fwhole-file  main.f90
>
> gfortran -o -O3 -ffast-math -march=native -fwhole-file  CPUtime.o
> DNAD.o main.o
>
> I lost much of the efficiency; now Analysis runs for 0.156 sec and
> DNAD runs for 1.25 sec.
>
> Any hints on how to optimize the multiple-file code are greatly
> appreciated.

Read the documentation?

Hopefully, Google Groups doesn't mess up the formatting.

% cat run
#! /bin/csh

echo "Case 1"
gfc4x -O3 -march=native -fwhole-program -ffast-math \
  -funroll-loops -ftree-vectorize -o z a.f90
./z
echo "Case 2"
gfc4x -O3 -march=native -fwhole-file -ffast-math \
  -funroll-loops -ftree-vectorize -c c.f90
gfc4x -O3 -march=native -fwhole-file -ffast-math \
  -funroll-loops -ftree-vectorize -c d.f90
gfc4x -O3 -march=native -fwhole-program -ffast-math \
  -funroll-loops -ftree-vectorize -o z b.f90 d.o c.o
./z

echo "Case 3"
gfc4x -flto -O3 -march=native -fwhole-file -ffast-math \
  -funroll-loops -ftree-vectorize -c c.f90
gfc4x -flto -O3 -march=native -fwhole-file -ffast-math \
  -funroll-loops -ftree-vectorize -c d.f90
gfc4x -flto -O3 -march=native -fwhole-program -ffast-math \
  -funroll-loops -ftree-vectorize -o z b.f90 d.o c.o
./z

% ./run
Case 1
Analysis Runs for 7.50000030E-02 Seconds.
-19999999.999023605
DNAD Runs for 7.59999976E-02 Seconds.
-19999999.999023605 24999999.999999996
Case 2
Analysis Runs for 7.50000030E-02 Seconds.
-19999999.999023605
DNAD Runs for 1.5360000 Seconds.
-19999999.999023605 24999999.999999996
Case 3
Analysis Runs for 7.59999976E-02 Seconds.
-19999999.999023605
DNAD Runs for 7.50000030E-02 Seconds.
-19999999.999023605 24999999.999999996

--
steve