From: alex-lurk on
I have written a simple Fortran test program to check the
performance of OpenMP.
Below is the source code without OpenMP.
This serial version takes about 35 seconds.
----------------------------------------
C
PROGRAM OPENMP
C
IMPLICIT NONE

INTEGER TICK, STARTTIME, STOPTIME, TIME

INTEGER N, I,
1I1, I2, I3, I4,
2J1, J2, J3, J4
PARAMETER (N=10000000)
REAL A1(N), A2(N), A3(N)
REAL B1(N), B2(N), B3(N)
REAL C1(N), C2(N), C3(N)
REAL D1(N), D2(N), D3(N)
REAL parallel_time_begin, parallel_time_end
REAL section1_time_begin, section1_time_end
REAL section2_time_begin, section2_time_end
REAL section3_time_begin, section3_time_end
REAL section4_time_begin, section4_time_end
real sum

PRINT *, '----- Serial Start -----'

CALL SYSTEM_CLOCK(COUNT_RATE = TICK)
CALL SYSTEM_CLOCK (COUNT = STARTTIME)

! Some initializations

DO I = 1, N
A1(I) = I + 1.5
A2(I) = I + 22.35
B1(I) = I + 1.5
B2(I) = I + 22.35
C1(I) = I + 1.5
C2(I) = I + 22.35
D1(I) = I + 1.5
D2(I) = I + 22.35
ENDDO

PRINT *, '***** Serial Start *****'

PRINT *, '***** 1. Section Start'
DO J1 = 1, 400
DO I1 = 1, N
A3(I1) = A1(I1) + A2(I1)
ENDDO
ENDDO
PRINT *, '***** 1. Section End'

PRINT *, '***** 2. Section Start'
DO J2 = 1, 400
DO I2 = 1, N
B3(I2) = B1(I2) + B2(I2)
ENDDO
ENDDO
PRINT *, '***** 2. Section End'

PRINT *, '***** 3. Section Start'
DO J3 = 1, 400
DO I3 = 1, N
C3(I3) = C1(I3) + C2(I3)
ENDDO
ENDDO
PRINT *, '***** 3. Section End'

PRINT *, '***** 4. Section Start'
DO J4 = 1, 400
DO I4 = 1, N
D3(I4) = D1(I4) + D2(I4)
ENDDO
ENDDO
PRINT *, '***** 4. Section End'

sum = 0
do i4 = 1,N
sum = sum + a3(i4) + b3(i4) + c3(i4) + d3(i4)
enddo
print*,'Sum = ',sum

PRINT *, '***** Serial End *****'

CALL SYSTEM_CLOCK (COUNT = STOPTIME)
TIME = REAL(STOPTIME-STARTTIME) / REAL(TICK)
PRINT *, '>>>>> time of Serial was ',
1TIME, ' seconds <<<<<'
PRINT *, '----- Serial End -----'

END
----------------------------------------
----------------------------------------


Now I have parallelized this test program with OpenMP in 3 ways.

1.) With the directive SECTIONS
I have divided the program into 4 sections.
The number of threads is 4.
This test program takes about 34 seconds.
The code is as follows:
----------------------------------------
C
PROGRAM OPENMP
C
IMPLICIT NONE

DOUBLE PRECISION omp_get_wtime

INTEGER TICK, STARTTIME, STOPTIME, TIME

INTEGER N, I,
1I1, I2, I3, I4,
2J1, J2, J3, J4
PARAMETER (N=10000000)
REAL A1(N), A2(N), A3(N)
REAL B1(N), B2(N), B3(N)
REAL C1(N), C2(N), C3(N)
REAL D1(N), D2(N), D3(N)
DOUBLE PRECISION parallel_time_begin, parallel_time_end
DOUBLE PRECISION section1_time_begin, section1_time_end
DOUBLE PRECISION section2_time_begin, section2_time_end
DOUBLE PRECISION section3_time_begin, section3_time_end
DOUBLE PRECISION section4_time_begin, section4_time_end
INTEGER NTHREADS
real sum

PRINT *, '----- Parallel start -----'
CALL SYSTEM_CLOCK(COUNT_RATE = TICK)
CALL SYSTEM_CLOCK (COUNT = STARTTIME)

! Some initializations

DO I = 1, N
A1(I) = I + 1.5
A2(I) = I + 22.35
B1(I) = I + 1.5
B2(I) = I + 22.35
C1(I) = I + 1.5
C2(I) = I + 22.35
D1(I) = I + 1.5
D2(I) = I + 22.35
ENDDO
NTHREADS = 4

PRINT *, '***** Parallel Start *****'
parallel_time_begin = omp_get_wtime()

CALL omp_set_num_threads(NTHREADS)

C$OMP SECTIONS

C$OMP SECTION
PRINT *, '***** 1. Section Start'
section1_time_begin = omp_get_wtime()
DO J1 = 1, 400
DO I1 = 1, N
A3(I1) = A1(I1) + A2(I1)
ENDDO
ENDDO
section1_time_end = omp_get_wtime()
PRINT *, '====> Time of 1. Section was ',
1section1_time_end - section1_time_begin, ' seconds <===='
PRINT *, '***** 1. Section End'

C$OMP SECTION
PRINT *, '***** 2. Section Start'
section2_time_begin = omp_get_wtime()
DO J2 = 1, 400
DO I2 = 1, N
B3(I2) = B1(I2) + B2(I2)
ENDDO
ENDDO
section2_time_end = omp_get_wtime()
PRINT *, '====> Time of 2. Section was ',
1section2_time_end - section2_time_begin, ' seconds <===='
PRINT *, '***** 2. Section End'

C$OMP SECTION
PRINT *, '***** 3. Section Start'
section3_time_begin = omp_get_wtime()
DO J3 = 1, 400
DO I3 = 1, N
C3(I3) = C1(I3) + C2(I3)
ENDDO
ENDDO
section3_time_end = omp_get_wtime()
PRINT *, '====> Time of 3. Section was ',
1section3_time_end - section3_time_begin, ' seconds <===='
PRINT *, '***** 3. Section End'

C$OMP SECTION
PRINT *, '***** 4. Section Start'
section4_time_begin = omp_get_wtime()
DO J4 = 1, 400
DO I4 = 1, N
D3(I4) = D1(I4) + D2(I4)
ENDDO
ENDDO
section4_time_end = omp_get_wtime()
PRINT *, '====> Time of 4. Section was ',
1section4_time_end - section4_time_begin, ' seconds <===='
PRINT *, '***** 4. Section End'

C$OMP END SECTIONS NOWAIT

sum = 0
do i4 = 1,n
sum = sum + A3(i4) + B3(i4) + C3(i4) + D3(i4)
enddo
print*,'Sum = ',sum

parallel_time_end = omp_get_wtime()
PRINT *, '====> Time of Parallel was ',
1parallel_time_end - parallel_time_begin, ' seconds <===='

PRINT *, '***** Parallel end *****'

CALL SYSTEM_CLOCK (COUNT = STOPTIME)
TIME = REAL(STOPTIME-STARTTIME) / REAL(TICK)
PRINT *, '>>>>> time of Parallel was ',
1TIME, ' seconds <<<<<'
PRINT *, '----- Parallel End -----'

END
----------------------------------------
----------------------------------------


2.) With the directive PARALLEL SECTIONS
I have divided the program into 4 sections here too, but I have also
used the directive PARALLEL.
The number of threads is 4.
This test program takes about 12 seconds.
The code is as follows:
----------------------------------------
C
PROGRAM OPENMP
C
IMPLICIT NONE

DOUBLE PRECISION omp_get_wtime

INTEGER TICK, STARTTIME, STOPTIME, TIME

INTEGER N, I,
1I1, I2, I3, I4,
2J1, J2, J3, J4
PARAMETER (N=10000000)
REAL A1(N), A2(N), A3(N)
REAL B1(N), B2(N), B3(N)
REAL C1(N), C2(N), C3(N)
REAL D1(N), D2(N), D3(N)
DOUBLE PRECISION parallel_time_begin, parallel_time_end
DOUBLE PRECISION section1_time_begin, section1_time_end
DOUBLE PRECISION section2_time_begin, section2_time_end
DOUBLE PRECISION section3_time_begin, section3_time_end
DOUBLE PRECISION section4_time_begin, section4_time_end
INTEGER NTHREADS
real sum

PRINT *, '----- Parallel start -----'
CALL SYSTEM_CLOCK(COUNT_RATE = TICK)
CALL SYSTEM_CLOCK (COUNT = STARTTIME)

! Some initializations

DO I = 1, N
A1(I) = I + 1.5
A2(I) = I + 22.35
B1(I) = I + 1.5
B2(I) = I + 22.35
C1(I) = I + 1.5
C2(I) = I + 22.35
D1(I) = I + 1.5
D2(I) = I + 22.35
ENDDO
NTHREADS = 4

PRINT *, '***** Parallel Start *****'
parallel_time_begin = omp_get_wtime()

CALL omp_set_num_threads(NTHREADS)


C$OMP PARALLEL SECTIONS

C$OMP SECTION
PRINT *, '***** 1. Section Start'
section1_time_begin = omp_get_wtime()
DO J1 = 1, 400
DO I1 = 1, N
A3(I1) = A1(I1) + A2(I1)
ENDDO
ENDDO
section1_time_end = omp_get_wtime()
PRINT *, '====> Time of 1. Section was ',
1section1_time_end - section1_time_begin, ' seconds <===='
PRINT *, '***** 1. Section End'

C$OMP SECTION
PRINT *, '***** 2. Section Start'
section2_time_begin = omp_get_wtime()
DO J2 = 1, 400
DO I2 = 1, N
B3(I2) = B1(I2) + B2(I2)
ENDDO
ENDDO
section2_time_end = omp_get_wtime()
PRINT *, '====> Time of 2. Section was ',
1section2_time_end - section2_time_begin, ' seconds <===='
PRINT *, '***** 2. Section End'

C$OMP SECTION
PRINT *, '***** 3. Section Start'
section3_time_begin = omp_get_wtime()
DO J3 = 1, 400
DO I3 = 1, N
C3(I3) = C1(I3) + C2(I3)
ENDDO
ENDDO
section3_time_end = omp_get_wtime()
PRINT *, '====> Time of 3. Section was ',
1section3_time_end - section3_time_begin, ' seconds <===='
PRINT *, '***** 3. Section End'

C$OMP SECTION
PRINT *, '***** 4. Section Start'
section4_time_begin = omp_get_wtime()
DO J4 = 1, 400
DO I4 = 1, N
D3(I4) = D1(I4) + D2(I4)
ENDDO
ENDDO
section4_time_end = omp_get_wtime()
PRINT *, '====> Time of 4. Section was ',
1section4_time_end - section4_time_begin, ' seconds <===='
PRINT *, '***** 4. Section End'

C$OMP END PARALLEL SECTIONS

sum = 0
do i4 = 1,n
sum = sum + A3(i4) + B3(i4) + C3(i4) + D3(i4)
enddo
print*,'Sum = ',sum

parallel_time_end = omp_get_wtime()
PRINT *, '====> Time of Parallel was ',
1parallel_time_end - parallel_time_begin, ' seconds <===='

PRINT *, '***** Parallel end *****'

CALL SYSTEM_CLOCK (COUNT = STOPTIME)
TIME = REAL(STOPTIME-STARTTIME) / REAL(TICK)
PRINT *, '>>>>> time of Parallel was ',
1TIME, ' seconds <<<<<'
PRINT *, '----- Parallel End -----'

END
----------------------------------------
----------------------------------------

3.) With the directive PARALLEL
Here I have parallelized the 4 double DO loops.
I ran this test program with several thread counts:
- With 1 thread the test program takes about 19 seconds.
- With 2 and 3 threads it takes about 22 seconds.
- And with 4 threads it takes about 23 seconds.
The code is as follows:
----------------------------------------
C
PROGRAM OPENMP
C
IMPLICIT NONE

DOUBLE PRECISION omp_get_wtime

INTEGER TICK, STARTTIME, STOPTIME, TIME

INTEGER N, I,
1I1, I2, I3, I4,
2J1, J2, J3, J4
PARAMETER (N=10000000)
REAL A1(N), A2(N), A3(N)
REAL B1(N), B2(N), B3(N)
REAL C1(N), C2(N), C3(N)
REAL D1(N), D2(N), D3(N)
DOUBLE PRECISION parallel_time_begin, parallel_time_end
DOUBLE PRECISION section1_time_begin, section1_time_end
DOUBLE PRECISION section2_time_begin, section2_time_end
DOUBLE PRECISION section3_time_begin, section3_time_end
DOUBLE PRECISION section4_time_begin, section4_time_end
INTEGER NTHREADS
real sum

PRINT *, '----- Parallel start -----'
CALL SYSTEM_CLOCK(COUNT_RATE = TICK)
CALL SYSTEM_CLOCK (COUNT = STARTTIME)

! Some initializations

DO I = 1, N
A1(I) = I + 1.5
A2(I) = I + 22.35
B1(I) = I + 1.5
B2(I) = I + 22.35
C1(I) = I + 1.5
C2(I) = I + 22.35
D1(I) = I + 1.5
D2(I) = I + 22.35
ENDDO
NTHREADS = 1
C NTHREADS = 2
C NTHREADS = 3
C NTHREADS = 4

PRINT *, '***** Parallel Start *****'
parallel_time_begin = omp_get_wtime()

C CALL omp_set_num_threads(NTHREADS)


PRINT *, '***** 1. Section Start'
section1_time_begin = omp_get_wtime()
C$OMP PARALLEL
DO J1 = 1, 400
DO I1 = 1, N
A3(I1) = A1(I1) + A2(I1)
ENDDO
ENDDO
C$OMP END PARALLEL
section1_time_end = omp_get_wtime()
PRINT *, '====> Time of 1. Section was ',
1section1_time_end - section1_time_begin, ' seconds <===='
PRINT *, '***** 1. Section End'

PRINT *, '***** 2. Section Start'
section2_time_begin = omp_get_wtime()
C$OMP PARALLEL
DO J2 = 1, 400
DO I2 = 1, N
B3(I2) = B1(I2) + B2(I2)
ENDDO
ENDDO
C$OMP END PARALLEL
section2_time_end = omp_get_wtime()
PRINT *, '====> Time of 2. Section was ',
1section2_time_end - section2_time_begin, ' seconds <===='
PRINT *, '***** 2. Section End'

PRINT *, '***** 3. Section Start'
section3_time_begin = omp_get_wtime()
C$OMP PARALLEL
DO J3 = 1, 400
DO I3 = 1, N
C3(I3) = C1(I3) + C2(I3)
ENDDO
ENDDO
C$OMP END PARALLEL
section3_time_end = omp_get_wtime()
PRINT *, '====> Time of 3. Section was ',
1section3_time_end - section3_time_begin, ' seconds <===='
PRINT *, '***** 3. Section End'

PRINT *, '***** 4. Section Start'
section4_time_begin = omp_get_wtime()
C$OMP PARALLEL
DO J4 = 1, 400
DO I4 = 1, N
D3(I4) = D1(I4) + D2(I4)
ENDDO
ENDDO
C$OMP END PARALLEL

section4_time_end = omp_get_wtime()
PRINT *, '====> Time of 4. Section was ',
1section4_time_end - section4_time_begin, ' seconds <===='
PRINT *, '***** 4. Section End'

sum = 0
do i4 = 1,n
sum = sum + A3(i4) + B3(i4) + C3(i4) + D3(i4)
enddo
print*,'Sum = ',sum

parallel_time_end = omp_get_wtime()
PRINT *, '====> Time of Parallel was ',
1parallel_time_end - parallel_time_begin, ' seconds <===='

PRINT *, '***** Parallel end *****'

CALL SYSTEM_CLOCK (COUNT = STOPTIME)
TIME = REAL(STOPTIME-STARTTIME) / REAL(TICK)
PRINT *, '>>>>> time of Parallel was ',
1TIME, ' seconds <<<<<'
PRINT *, '----- Parallel End -----'

END
----------------------------------------
----------------------------------------


Here is some basic information:
- Fortran Compiler: pgf95 9.0-4 64-bit target on x86-64 Linux -tp nehalem-64
- OS: Suse Linux
- 4 CPUs


Now my questions:

a.) For test program 1.) (see above)
Here it is interesting that the parallel version takes
nearly the same time as the version without OpenMP.
Does anyone have an idea why?
I had hoped the parallel version would be much quicker.

b.) For test program 2.) (see above)
Is it correct that with the directive "PARALLEL SECTIONS" the 4 sections
are parallelized, meaning that every section runs on its own CPU (as one
thread), and that in addition the DO loops within the 4 sections are
parallelized too?

c.) For test program 3.) (see above)
Here I think the runs with several threads (2, 3 and 4) are slower
than the run with only one thread because the overhead of OpenMP for
parallelizing the DO loops is too big. Is this correct?

Thanks a lot for your help,
Alex
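
On questions a.) and b.), a minimal sketch may help. It is an
illustration only, not part of the test programs above; the program and
variable names are assumptions, and it is written in free-form Fortran.
A SECTIONS construct that is not enclosed in a PARALLEL region starts no
additional threads, so its sections run one after another on the calling
thread. PARALLEL SECTIONS creates a team and hands each section to one
thread; the DO loops inside a section are not split up any further.

```fortran
program sections_demo
  use omp_lib                  ! gives omp_get_num_threads its proper interface
  implicit none
  integer :: team_orphaned, team_parallel

  team_orphaned = 0
  team_parallel = 0

  ! SECTIONS without an enclosing PARALLEL: no new threads are started
  !$omp sections
  !$omp section
  team_orphaned = omp_get_num_threads()
  !$omp end sections

  ! PARALLEL SECTIONS: a team is created and the sections are shared out
  !$omp parallel sections
  !$omp section
  team_parallel = omp_get_num_threads()
  !$omp end parallel sections

  print *, 'team size, SECTIONS only:     ', team_orphaned
  print *, 'team size, PARALLEL SECTIONS: ', team_parallel
end program sections_demo
```

Compiled with OpenMP enabled (-mp for pgf95, as used in this thread),
the first number should be 1, while the second depends on how many
threads the runtime provides.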
From: alex-lurk on
Hi Tim,

thanks a lot for your hint, but I don't understand why I can't learn
much from my example.
Could you explain your hint in more detail?

I forgot to say that I compiled all 4 examples (without and with
OpenMP) with the compiler optimization "-O3" like in the following:
Without OpenMP: CFLAGS=-c -O3
With OpenMP: CFLAGS=-c -O3 -mp

Thanks a lot,
Alex

On 11 Dec., 23:05, Tim Prince <TimothyPri...(a)sbcglobal.net> wrote:
> Depending on your compiler, it may be capable of optimizing away the
> loops in the non-OpenMP case.  You can't learn much from this example.

From: Mark Morss on
On Dec 12, 12:44 pm, alex-lurk <alex.l...(a)googlemail.com> wrote:
> Hi Tim,
>
> thanks a lot for your hint, but I don't understand why I can't learn
> much from my example.
> Could you explain your hint in more detail?
>
> I forgot to say that I compiled all 4 examples (without and with
> OpenMP) with the compiler optimization "-O3" like in the following:
> Without OpenMP: CFLAGS=-c -O3
> With OpenMP: CFLAGS=-c -O3 -mp
>
> Thanks a lot,
> Alex
>
> On 11 Dec., 23:05, Tim Prince <TimothyPri...(a)sbcglobal.net> wrote:
>
> > Depending on your compiler, it may be capable of optimizing away the
> > loops in the non-OpenMP case.  You can't learn much from this example.

I've been doing a lot of parallel processing on an AIX 5.3 server with
20 ppc processors and the xlf compiler, using the openmp directives.
An example that works is of course useful for learning how to use
openmp. The reason I would have said that you'll learn little, beyond
that, from any simple example is that there is always a tradeoff
between the overhead necessary to manage multiple threads and the
direct gain from using them. Whether this works out in your favor
depends on the specifics of your case, and may vary even for a given
application as your input data varies. To find out whether it's worth
parallelizing code there really is no substitute for just doing it and
comparing the difference between what you get and what happens with
highly optimized but not parallelized code. You have to pay attention
to the structure of your problem and be alert for the possibility that
with some data your application may run slower because you've
parallelized it.
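
As a minimal sketch of such a comparison (not taken from the thread; the
program name, the array names and the loop are illustrative assumptions,
in free-form Fortran), one can time the same loop once serially and once
with a PARALLEL DO directive. Printing a few results after each loop
makes it less likely that the optimizer discards the work.

```fortran
program compare_timing
  use omp_lib                  ! omp_get_wtime is double precision
  implicit none
  integer, parameter :: n = 10000000
  real, allocatable :: a(:), b(:), c(:)
  double precision :: t0, t_serial, t_parallel
  integer :: i

  allocate(a(n), b(n), c(n))
  do i = 1, n
     a(i) = i + 1.5
     b(i) = i + 22.35
  end do

  ! Serial version of the loop
  t0 = omp_get_wtime()
  do i = 1, n
     c(i) = a(i) + b(i)
  end do
  t_serial = omp_get_wtime() - t0
  print *, 'check (serial):   ', c(1), c(n)

  ! Same loop, with its iterations divided among the threads
  t0 = omp_get_wtime()
  !$omp parallel do
  do i = 1, n
     c(i) = a(i) + b(i)
  end do
  !$omp end parallel do
  t_parallel = omp_get_wtime() - t0
  print *, 'check (parallel): ', c(1), c(n)

  print *, 'serial   seconds: ', t_serial
  print *, 'parallel seconds: ', t_parallel
end program compare_timing
```

A loop this simple is limited mostly by memory traffic, so the parallel
time may not be much better than the serial one, which is the tradeoff
described above.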

If I may digress into the realm of general advice, it's quite
important to specify as private all the variables that you actually
want to be private within given threads. Failure to do this will
produce totally fouled up results. Further, with xlf, my experience
has been that if you have an allocatable array which is allocated
before a parallel code block and then declared private, >>this array
will nevertheless be treated as shared<<. I had to discover this by
experience, though perhaps the xlf manual has something about it. You
have to allocate the private object in each thread, making sure of
course to deallocate it also in each thread.
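
A minimal sketch of that pattern, under stated assumptions: the program,
the scratch variable tmp and the work array are hypothetical names, the
code is free-form Fortran, and it only shows the per-thread allocation
described above, not the xlf behaviour itself.

```fortran
program private_demo
  implicit none
  integer, parameter :: n = 1000, m = 8
  real :: a(n), tmp, total
  real, allocatable :: work(:)
  integer :: i, k

  total = 0.0
  ! tmp, work and k are per-thread; total is combined across threads
  !$omp parallel private(tmp, work, k) reduction(+:total)
  allocate(work(m))            ! each thread allocates its own copy
  !$omp do
  do i = 1, n
     do k = 1, m
        work(k) = real(i)
     end do
     tmp = sum(work) / m       ! per-thread scratch value
     a(i) = tmp
     total = total + tmp
  end do
  !$omp end do
  deallocate(work)             ! released by the same thread that allocated it
  !$omp end parallel

  print *, 'a(1), a(n) = ', a(1), a(n)
  print *, 'total      = ', total
end program private_demo
```

If tmp were left shared, threads could overwrite each other's value
between the assignment and its use, which is the kind of fouled up
result meant above.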

Also unless you're working on some sort of mega-computer, I don't
think you'll miss the absence of nested openmp functionality very
much. In general there is scant gain from having more active threads
than the number of processors on your machine.
From: alex-lurk on
On 13 Dec., 23:02, Tim Prince <tpri...(a)computer.org> wrote:
> DO loops will not be parallelized without the OMP DO directive.

Hi Tim,

thanks for your hint.
I thought using the PARALLEL directive alone was enough.
Now I have added the DO directive and it works.
Below is the source code (as an example, only the
first section):
----------------------------------------
....
....
....
PRINT *, '***** 1. Section Start'
!$OMP PARALLEL
!$OMP DO
DO K1 = 1, O1
DO J1 = 1, N1
DO I1 = 1, M1
IF ((A1(I1,J1,K1).GT.0.0).AND.(A2(I1,J1,K1).GT.0.0)) THEN
A3(I1,J1,K1) = (SQRT((A1(I1,J1,K1)/A2(I1,J1,K1))))
1 * (SQRT((A2(I1,J1,K1)/A1(I1,J1,K1))))
ELSE
A3(I1,J1,K1) = (SQRT((A1(I1,J1,K1)*A2(I1,J1,K1))))
1 * (SQRT((A1(I1,J1,K1)*A2(I1,J1,K1))))
ENDIF
ENDDO
ENDDO
ENDDO
!$OMP END DO
!$OMP END PARALLEL
PRINT *, '***** 1. Section End'
....
....
....
----------------------------------------
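
To see why the DO directive matters, here is a minimal sketch (an
illustration only; the program and counter names are assumptions, and it
is written in free-form Fortran). It counts how many loop iterations are
executed in total. With PARALLEL alone every thread runs the complete
loop, so the work is repeated rather than divided; with PARALLEL plus DO
the iterations are shared among the threads.

```fortran
program parallel_vs_do
  use omp_lib                  ! for omp_get_max_threads
  implicit none
  integer, parameter :: n = 1000
  integer :: i, count_parallel, count_do

  count_parallel = 0
  ! PARALLEL alone: every thread runs the complete loop
  !$omp parallel private(i) reduction(+:count_parallel)
  do i = 1, n
     count_parallel = count_parallel + 1
  end do
  !$omp end parallel

  count_do = 0
  ! PARALLEL plus DO: the iterations are divided among the threads
  !$omp parallel reduction(+:count_do)
  !$omp do
  do i = 1, n
     count_do = count_do + 1
  end do
  !$omp end do
  !$omp end parallel

  print *, 'threads available:         ', omp_get_max_threads()
  print *, 'iterations, PARALLEL only: ', count_parallel
  print *, 'iterations, PARALLEL + DO: ', count_do
end program parallel_vs_do
```

If the runtime provides 4 threads, the first count should be 4 times n
and the second exactly n. That repeated work would also be consistent
with the timings of test program 3.), which did not improve with more
threads.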
From: alex-lurk on
On 14 Dec., 19:36, Mark Morss <mfmo...(a)aep.com> wrote:
> An example that works is of course useful for learning how to use
> openmp.  The reason I would have said that you'll learn little, beyond
> that, from any simple example is that there is always a tradeoff
> between the overhead necessary to manage multiple threads and the
> direct gain from using them.  

Dear Mark,

thanks a lot for your hints.
Yes, I started with an easy example to gain my first experience with
OpenMP.
The Fortran program which I have to parallelize is very complicated.
It is a very old Fortran module.
In the next few days I will start to parallelize it with OpenMP.
I will keep you informed.

Many greetings
Alex