From: Tobias Burnus on
On 14.07.2010 22:23, nmm1(a)cam.ac.uk wrote:
> In article <1jlmjr4.1h9muj0pihqb1N%see(a)sig.for.address>,
> Victor Eijkhout <see(a)sig.for.address> wrote:
>>
>>>>> OpenMP uses a
>>>>> "shared memory" model, which is harder to implement on a cluster
>>>>> architecture. But it has been done too.
>>>>
>>>> What are you thinking of?
>>>
>>> Intel Cluster Tools.
>>
>> Hm. I watched a video on the Intel site, and there is no hint of
>> distributed shared memory. (MPI, Tracers, math libraries, but nothing
>> deeper.)
>>
>> http://software.intel.com/en-us/intel-cluster-toolkit/
>>
>> Can you give me a more specific pointer?

See:
http://software.intel.com/en-us/articles/cluster-openmp-for-intel-compilers/

"Cluster OpenMP is now included with version 11 of the Intel compilers."
- However, one seems to need a special licence.

Tobias
From: gmail-unlp on
Just a few thoughts:
1) Making parallel programs while forgetting (parallel) performance
issues is a problem, and OpenMP makes it easy, in some ways, to forget
important performance details such as pipelining, the memory
hierarchy, cache coherence, etc. However, if you remember that you are
parallelizing to improve performance, I think you will not forget the
performance penalties and will, implicitly or explicitly, optimize
data traffic, for example.
2) If you have a legacy application with more than a few thousand
lines, you will probably start with OpenMP because of the large amount
of work required to re-code the same application with MPI (see the
sketch after this list for how little the code has to change). However,
it is likely you will have to learn MPI anyway, or at least keep
distributed-memory parallel architectures in mind, if you need to
process more data, or more accurately, or more...
3) Extending shared memory across a distributed-memory architecture
would help in keeping the shared-memory model (i.e. keeping OpenMP),
but I think the risk of hiding severe performance penalties is too
high... I don't know much about products like that from Intel; is
there some help (a tool or a methodology) for analyzing and solving
performance issues?
4) I'm rather convinced that MPI is "the best" in the long term when
performance is the focus, but I'm working with a legacy application of
about 100k lines of (sequential) code, so I understand those
suggesting OpenMP; I'm using OpenMP myself and looking for ways to
distribute the data so the computation can move to a distributed-memory
architecture. If I had to program from scratch, I would use MPI from
the beginning.
5) I suggest learning both OpenMP and MPI, and not only so that you
have all the options when making (the best) choices: both are simple
enough to learn, at least well enough to know the focus/ideas/etc. of
each one. I suggest looking for tutorials (maybe two or three for
each) and following them carefully. Again: you will not learn all of
the details, but you will learn the interesting ones and can make your
own choices. Neither OpenMP nor MPI is a big deal for scientific
programmers.
6) An alternative way: just identify the BLAS and/or LAPACK
subroutines/functions in your code and let a parallel library do the
computing. Both libraries have implementations for shared-memory as
well as distributed-memory parallel computing.
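
To make point 2) concrete, here is a minimal sketch in C (just an
illustration of mine, not code from any real application) of how
little an existing loop has to change for OpenMP; the MPI version of
the same loop would need the arrays split across processes and the
communication written by hand:

/* A serial loop made parallel with a single OpenMP directive; the
 * rest of the legacy code is untouched.
 * Compile e.g.:  cc -O2 -fopenmp axpy.c */
#include <stdio.h>

#define N 10000000

static double x[N], y[N];

int main(void)
{
    const double a = 2.0;
    long i;

    for (i = 0; i < N; i++) {           /* some test data */
        x[i] = 1.0;
        y[i] = 3.0;
    }

    /* The only change to the legacy loop is this directive.  With MPI,
     * x and y would have to be distributed across processes and the
     * results gathered by hand. */
#pragma omp parallel for
    for (i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    printf("y[0] = %g\n", y[0]);        /* expect 5.0 */
    return 0;
}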

Hope this helps,

Fernando.
From: sturlamolden on
On 17 Jul, 03:08, gmail-unlp <ftine...(a)gmail.com> wrote:

> 1) Making parallel programs while forgetting (parallel) performance
> issues is a problem, and OpenMP makes it easy, in some ways, to forget
> important performance details such as pipelining, the memory
> hierarchy, cache coherence, etc. However, if you remember that you are
> parallelizing to improve performance, I think you will not forget the
> performance penalties and will, implicitly or explicitly, optimize
> data traffic, for example.

We should not forget that OpenMP is often used on "multi-core
processors". These are rather primitive parallel devices; they have,
for example, a shared cache. Data traffic due to OpenMP can therefore
be minimal, because a dirty cache line need not be communicated. So if
the target is a common desktop computer with a quad-core Intel or AMD
CPU, OpenMP can be perfectly fine, and that is the common desktop
computer these days. For small-scale parallelization on modern desktop
computers, OpenMP can be very good. But on large servers with multiple
processors, OpenMP can generate excessive data traffic and scale very
badly.
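
To illustrate the kind of coherence traffic I mean, here is a small
sketch (a toy example of mine, nothing more): each thread increments
its own counter, but packed counters share a cache line, so the line
ping-pongs between cores; padding each counter onto its own (assumed
64-byte) cache line removes the traffic. On a shared-cache quad-core
the difference is modest, across sockets it is not:

/* A toy false-sharing demo.  Compile e.g.:  cc -O2 -fopenmp fs.c
 * (Assumes a 64-byte cache line and at most MAXT threads.) */
#include <omp.h>
#include <stdio.h>

#define REPS 100000000L
#define MAXT 64

struct padded_counter { volatile long count; char pad[64 - sizeof(long)]; };

static volatile long packed[MAXT];          /* adjacent counters: false sharing */
static struct padded_counter padded[MAXT];  /* one cache line per counter       */

int main(void)
{
    double t0, t1, t2;

    t0 = omp_get_wtime();
#pragma omp parallel
    {
        int id = omp_get_thread_num();
        long i;
        for (i = 0; i < REPS; i++) packed[id]++;       /* cache line bounces */
    }
    t1 = omp_get_wtime();
#pragma omp parallel
    {
        int id = omp_get_thread_num();
        long i;
        for (i = 0; i < REPS; i++) padded[id].count++; /* no sharing */
    }
    t2 = omp_get_wtime();

    printf("packed: %.2f s   padded: %.2f s\n", t1 - t0, t2 - t1);
    return 0;
}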


> 6) An alternative way: just identify BLAS and/or LAPACK subroutines/
> functions and use shared and/or distributed memory libraries to call
> for computing.

This is very important. GotoBLAS and Intel MKL have BLAS and LAPACK
optimized for SMP servers. FFTW and MKL have parallel FFTs. But look
at the majority of today's 'system developers': they hardly know any
math, neither linear algebra nor calculus. They would not recognize a
linear system of equations or a convolution if they saw one. So why
would they use LAPACK or an FFT? A website for Norwegian IT
specialists (digi.no) once had a quiz that claimed LAPACK is a program
for "testing the speed of computers". They are on a different planet.
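
For those who do know what a matrix product is, this is all it takes
to use such a library (a sketch of mine, not anyone's production
code): call dgemm through the standard CBLAS interface; link against a
threaded BLAS like MKL or GotoBLAS and it runs in parallel without a
single OpenMP directive or MPI call:

/* A sketch: C = A*B through the standard CBLAS interface.  Linked
 * against a threaded BLAS (MKL, GotoBLAS, ...), it runs in parallel
 * with no explicit threading in the program itself.
 * Build: cc -O2 gemm.c  plus the link flags for your CBLAS/BLAS. */
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main(void)
{
    const int n = 1000;                       /* n x n matrices */
    double *A = malloc((size_t)n * n * sizeof *A);
    double *B = malloc((size_t)n * n * sizeof *B);
    double *C = malloc((size_t)n * n * sizeof *C);
    int i;

    if (!A || !B || !C) return 1;
    for (i = 0; i < n * n; i++) {             /* arbitrary test data */
        A[i] = 1.0;
        B[i] = 2.0;
        C[i] = 0.0;
    }

    /* C = 1.0*A*B + 0.0*C, row-major storage */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    printf("C[0][0] = %g (expected %g)\n", C[0], 2.0 * n);
    free(A); free(B); free(C);
    return 0;
}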

The sad part is that if we scientists want programs that run fast, we
have to write them ourselves. Those educated to do so cannot do the
math, even with a computer in front of them, nor do they understand
the problems they are asked to solve. But scientists who write
computer programs are not educated to do so, and it is not the main
focus of our jobs.

P.S. It is a common misconception, particularly among computer
science scholars, that "shared memory" means no data traffic, and that
threads are therefore better than processes. That is, they can see
that IPC has a cost, and conclude that threads must be more efficient
and scale better. The lack of a native fork() on Windows has also
taught many of them to think in terms of threads rather than
processes. The use of MPI seems to be limited to scientists and
engineers; the majority of computer scientists don't even know what it
is. Concurrency to them means threads, and particularly C++ classes
that wrap threads. Most of them expect I/O-bound programs that use
threads to be faster on multi-core computers, and then they wonder why
parallel programming is so hard.

From: sturlamolden on
On 17 Jul, 20:01, sturlamolden <sturlamol...(a)yahoo.no> wrote:

> Concurrency to them means threads, and particularly C++ classes
> that wrap threads. Most of them expect I/O-bound programs that use
> threads to be faster on multi-core computers, and then they wonder
> why parallel programming is so hard.

This is e.g. a common complaint about Python's GIL (global interpreter
lock) on comp.lang.python:

- As Python's interpreter has a global lock, Python programs
cannot exploit multicore computers.

The common answer to this is:

- You don't get a faster network connection by using multiple
processors.

This is too hard for most IT developers to understand. But if they do
understand this, we can ask them this instead:

- Why do you accept the 100x speed penalty from using Python, but
complain about not being allowed to use more than one core?

If they have a reasonable answer to this as well, such as hating C++
immensely, we can tell them the real story:

- Any mutex (like Python's GIL) can be released. Python threads not
using the interpreter can run simultaneously (e.g. they might be
waiting for i/o or a library call to return). Libraries can use as
many threads as they want internally. And processes can of course be
spawned and forked.
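
For the record, here is roughly what that looks like in a C extension
(a sketch with a made-up module name "gilfree" and function
"heavy_sum", not code from any real library): the GIL is dropped with
the standard Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS macros
while the pure-C loop runs, so other Python threads keep running on
other cores:

/* gilfree.c -- a made-up example module, Python 2.x style.
 * heavy_sum() releases the GIL around a pure-C loop, so other Python
 * threads can run on other cores while it computes.
 * Build it as a normal extension module (e.g. with distutils). */
#include <Python.h>

static PyObject *
heavy_sum(PyObject *self, PyObject *args)
{
    Py_buffer buf;
    double total = 0.0;

    /* Any object exposing a buffer of C doubles, e.g. array('d', ...). */
    if (!PyArg_ParseTuple(args, "s*", &buf))
        return NULL;

    Py_BEGIN_ALLOW_THREADS          /* drop the GIL: no Python API below */
    {
        const double *x = (const double *) buf.buf;
        Py_ssize_t i, nitems = buf.len / (Py_ssize_t) sizeof(double);
        for (i = 0; i < nitems; i++)
            total += x[i];
    }
    Py_END_ALLOW_THREADS            /* take the GIL back before touching objects */

    PyBuffer_Release(&buf);
    return PyFloat_FromDouble(total);
}

static PyMethodDef gilfree_methods[] = {
    {"heavy_sum", heavy_sum, METH_VARARGS,
     "Sum a buffer of C doubles with the GIL released."},
    {NULL, NULL, 0, NULL}
};

PyMODINIT_FUNC
initgilfree(void)
{
    Py_InitModule("gilfree", gilfree_methods);
}

Two Python threads calling gilfree.heavy_sum() at the same time really
do use two cores, because neither holds the GIL while summing.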

It is really sad to see how badly educated many so-called "IT
specialists" actually are. If we ask them to solve a problem, chances
are they will spend all their time writing yet another web XML
framework in C#, without even touching the real problem.