From: GaryScott on
Comments Please:

"A group at Rice University is pursuing an alternate vision of coarray
extensions for the Fortran language. Their perspective is that the
Fortran 2008 standards committee's design choices were more shaped
more by the desire to introduce as few modifications to the language
as possible than to assemble the best set of extensions to support
parallel programming. They don't believe that the set of extensions
agreed upon by the committee are the right ones. In their view, both
Numrich and Reid's original design and the coarray extensions proposed
for Fortran 2008, suffer from the following shortcomings:

- There is no support for processor subsets; for instance, coarrays
  must be allocated over all images.
- Coarrays must be declared as global variables; one cannot
  dynamically allocate a coarray into a locally scoped variable.
- The coarray extensions lack any notion of global pointers, which
  are essential for creating and manipulating any kind of linked data
  structure.
- Reliance on named critical sections for mutual exclusion hinders
  scalable parallelism by associating mutual exclusion with code
  regions rather than data objects.
- Fortran 2008's sync images statement doesn't provide a safe
  synchronization space. As a result, synchronization operations in a
  user's code that are pending when a library call is made can
  interfere with synchronization in the library call.
- There are no mechanisms to avoid or tolerate latency when
  manipulating data on remote images.
- There is no support for collective communication.
To address these shortcomings, Rice University is developing a clean-
slate redesign of the Coarray Fortran programming model. Rice's new
design for Coarray Fortran, which they call Coarray Fortran 2.0, is an
expressive set of coarray-based extensions to Fortran designed to
provide a productive parallel programming model. Compared to the
emerging Fortran 2008, Rice's new coarray-based language extensions
include some additional features:

- process subsets known as teams, which support coarrays, collective
  communication, and relative indexing of process images for pairwise
  operations,
- topologies, which augment teams with a logical communication
  structure,
- dynamic allocation/deallocation of coarrays and other shared data,
- local variables within subroutines: declaration and allocation of
  coarrays within the scope of a procedure is critical for
  library-based code,
- team-based coarray allocation and deallocation,
- global pointers in support of dynamic data structures,
- enhanced support for synchronization for fine control over program
  execution,
- safe and scalable support for mutual exclusion, including locks and
  lock sets, and
- events, which provide a safe space for point-to-point
  synchronization."
From: nmm1 on
In article <d7fe8a45-0cc4-4231-8af0-6ffb690a2ac0(a)x27g2000yqb.googlegroups.com>,
Ian Bush <ianbush.throwaway.account(a)googlemail.com> wrote:
>>
>> > for instance, coarrays must
>> > be allocated over all images.
>>
>> This is intentional, and motivated by performance.  Team-based coarrays
>> are either very hard to implement (requiring significant system
>> changes), or will perform no better than allocatable components of
>> coarrays which we already support.  Perhaps performance is not a primary
>> requirement for the Rice proposal, but it was for J3.  The feedback from
>> users is that coarray performance has to be competitive with MPI.  If
>> not, many people will not use it in their codes.  A lot of effort was
>> put into the Fortran 2008 spec to avoid features that forced reduced
>> performance.
>>
>I'm sorry, I don't get this. Why are team based co-arrays so
>difficult? Their MPI equivalent, use of multiple MPI communicators,
>are the basis of many large scale applications today, so why the
>problem?

Er, no. Far fewer than you might think. Of the dozen MPI applications
I looked at, only two used them (and one was an MPI tester of mine).
On this matter, I should be very interested to know the MPI calls
that CRYSTAL (and DL_POLY_3 and GAMESS-UK) makes, so that I can
update my table and potentially modify my course. I have a script
for source scanning.

There are two problems that I know of:

1) Specification. That's soluble, but making the standardese
watertight is not easy. MPI put a lot of effort into that, and had
a much simpler task than Fortran does. As others have said, it's
being tackled.

2) Implementation. It's NOT easy to implement such things either
reliably or efficiently - gang scheduling for all processes is one
thing, and multiple, potentially interacting gangs is another. In
MPI, it is quite common for the insertion of barriers on COMM_WORLD
to improve performance considerably.
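
For concreteness, the "multiple MPI communicators" in the quoted
question means splitting MPI_COMM_WORLD with MPI_COMM_SPLIT, which is
the closest MPI analogue of the proposed teams. A minimal sketch (the
parity-based colouring and the name team_comm are purely
illustrative):

  program comm_split_sketch
    use mpi
    implicit none
    integer :: ierr, world_rank, color, team_comm, team_rank

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, world_rank, ierr)

    ! Split COMM_WORLD into two sub-communicators ("teams") by rank parity.
    color = mod(world_rank, 2)
    call MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, team_comm, ierr)
    call MPI_Comm_rank(team_comm, team_rank, ierr)

    ! Collectives and point-to-point traffic now stay within the team.
    call MPI_Barrier(team_comm, ierr)

    call MPI_Comm_free(team_comm, ierr)
    call MPI_Finalize(ierr)
  end program comm_split_sketch

Whether equivalent teams can be specified and implemented in the
language as efficiently as this library mechanism is, of course, the
point under debate.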


Regards,
Nick Maclaren.
From: nmm1 on
In article <1653902b-410c-4e6a-ada1-c950c87599c9(a)d16g2000yqb.googlegroups.com>,
GaryScott <garylscott(a)sbcglobal.net> wrote:
>Comments Please:

They should learn from the experiences of the past. Hoare found out
just how hard it was to teach and use general parallelism, so he
developed BSP. All right, he did throw the baby out with the
bathwater, and BSP has never taken off, but all experience is that
it is easy to teach, use and implement - and very efficient if your
problem fits its model.

What they are proposing is FAR too complicated for use by almost
all programmers. The vast majority of MPI users use only a small
subset of MPI, which they can get their head around, and OpenMP in
all its glory has probably never been implemented reliably enough
to use - almost everyone uses a small subset.

I could go on, and go further, but shall remain polite.  Perhaps I
should point out that all of WG5 are people with a lot of experience
in using, implementing and supporting Fortran for scientific and
other purposes.  Whether the Rice team has comparable experience is
less clear, given that it seems to be composed of computer scientists.


Regards,
Nick Maclaren.
From: Ian Bush on

Hi Nick,

On 12 July, 12:37, n...(a)cam.ac.uk wrote:
> In article <d7fe8a45-0cc4-4231-8af0-6ffb690a2...(a)x27g2000yqb.googlegroups.com>,
> Ian Bush  <ianbush.throwaway.acco...(a)googlemail.com> wrote:
>
> >> > for instance, coarrays must
> >> > be allocated over all images.
>
> >> This is intentional, and motivated by performance.  Team-based coarrays
> >> are either very hard to implement (requiring significant system
> >> changes), or will perform no better than allocatable components of
> >> coarrays which we already support.  Perhaps performance is not a primary
> >> requirement for the Rice proposal, but it was for J3.  The feedback from
> >> users is that coarray performance has to be competitive with MPI.  If
> >> not, many people will not use it in their codes.  A lot of effort was
> >> put into the Fortran 2008 spec to avoid features that forced reduced
> >> performance.
>
> >I'm sorry, I don't get this. Why are team based co-arrays so
> >difficult? Their MPI equivalent, use of multiple MPI communicators,
> >are the basis of many large scale applications today, so why the
> >problem?
>
> Er, no.  Far fewer than you might think.  Of the dozen MPI applications
> I looked at, only two used them (and one was an MPI tester of mine).
> On this matter, I should be very interested to know the MPI calls
> that CRYSTAL (and DL_POLY_3 and GAMESS-UK) makes, so that I can
> update my table and potentially modify my course.  I have a script
> for source scanning.
>

Here's the list for DL_POLY_3, CRYSTAL and CASTEP. I don't work on
GAMESS-UK anymore and don't have the code easily to hand, but I can
dig it out if you are interested; the list will be similar to CRYSTAL.
VASP is another example: I don't have the code here, but I would be
surprised if it is markedly different from CASTEP.

DL_POLY_3:

MPI_ABORT
MPI_ALLGATHER
MPI_ALLREDUCE
MPI_ALLTOALL
MPI_ALLTOALLV
MPI_BARRIER
MPI_BCAST
MPI_COMM_DUP
MPI_COMM_FREE
MPI_COMM_RANK
MPI_COMM_SIZE
MPI_COMM_SPLIT
MPI_FILE_CLOSE
MPI_FILE_GET_VIEW
MPI_FILE_OPEN
MPI_FILE_READ_AT
MPI_FILE_SET_VIEW
MPI_FILE_WRITE_AT
MPI_FINALIZE
MPI_GATHERV
MPI_GET_COUNT
MPI_INIT
MPI_IRECV
MPI_ISEND
MPI_RECV
MPI_SCATTER
MPI_SCATTERV
MPI_SEND
MPI_TYPE_COMMIT
MPI_TYPE_CONTIGUOUS
mpi_type_create_f90_real
MPI_TYPE_FREE
MPI_WAIT

CRYSTAL:

MPI_Abort
MPI_ALLREDUCE
mpi_alltoall
mpi_barrier
MPI_BCAST
mpi_comm_free
MPI_COMM_RANK
MPI_COMM_SIZE
MPI_COMM_SPLIT
MPI_Finalize
mpi_gatherv
MPI_INIT
mpi_irecv
MPI_RECV
MPI_REDUCE
MPI_SEND
mpi_wait

CASTEP

MPI_abort
MPI_allgather
MPI_allreduce
MPI_AllToAll
MPI_AllToAllV
MPI_barrier
MPI_bcast
MPI_comm_free
MPI_comm_rank
MPI_comm_size
MPI_comm_split
MPI_finalize
MPI_gather
MPI_gatherv
MPI_init
MPI_recv
MPI_scatter
MPI_scatterv
MPI_send

Ian
From: nmm1 on
In article <6956b18f-159d-469c-948c-eb62fc79b051(a)d16g2000yqb.googlegroups.com>,
Ian Bush <ianbush.throwaway.account(a)googlemail.com> wrote:
>
>Here's the list for DL_POLY_3, CRYSTAL and CASTEP. I don't work on
>GAMESS-UK anymore and don't have the code easily to hand, but I can
>dig it out if you are interested; the list will be similar to CRYSTAL.
>VASP is another example: I don't have the code here, but I would be
>surprised if it is markedly different from CASTEP.

Thanks very much.  Upon updating the table, I realise that I misspoke
earlier - there were in fact FOUR applications that used multiple
communicators, two of which used groups rather than MPI_Comm_split.
I also seem to have had bad data for CASTEP, so my earlier remark
about how often the major applications use them was a bit off-beam.
Sorry about that ....

The list won't post, as it is too wide, but please Email me if you
want to see it.


Regards,
Nick Maclaren.