From: jonathan on
On Feb 2, 8:52 am, Jean-Pierre Rosen <ro...(a)adalog.fr> wrote:
> Jerry wrote:
> > I've never understood why Ada does not allow slicing in
> > multidimensional arrays. What are the safety issues involved? And how
> > is it safe to force the programmer into ad hoc methods?
>
> One-dimensional slices are simple and efficient. Multidimensional slices
> are a can of worms.
>
> I guess you are thinking about rectangular slices. But why stop there? A
> slice comprising the main diagonal and the diagonal above and below can
> be very useful for some calculations. Or a slice which is the triangular
> part of the upper half...
>
> AFAICT, Fortran-99 does provide this - and the syntax is so complicated
> that nobody uses it. And implementation is also a nightmare.
>
> When designing a programming language, you have to stop at some point.
> The ratio (cost of implementation) / usefulness is a good measure for
> this. I think the ratio was simply too high for this feature.
>
> --
> ---------------------------------------------------------
>            J-P. Rosen (ro...(a)adalog.fr)
> Visit Adalog's web site at http://www.adalog.fr


I've never felt the need for two-dimensional (or higher-dimensional)
slicing. It's partly a performance issue: if you make the data
storage matrix as small as possible (i.e. exactly the same size as
the matrix transformation you are doing), then you sometimes get a
disastrous loss of efficiency. If on the other hand you always make
the data storage matrix larger than necessary (so that the
transformation is performed on a sub-matrix of the data storage
matrix), then you always have the option of avoiding these efficiency
losses. Once the transformation routine is written to operate on
sub-matrices of the data storage matrix, you usually don't need to
slice or copy a sub-matrix out of the data storage matrix in order to
transform it.
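
Here is a minimal sketch of what I mean (made-up names, not lifted
from the routines above; the scaling is just a stand-in for the real
transformation):

   with Ada.Text_IO;

   procedure Sub_Matrix_Demo is

      type Real_Matrix is
         array (Integer range <>, Integer range <>) of Long_Float;

      --  Transform only the block Data (Start .. Final, Start .. Final)
      --  of the (usually larger) data storage matrix.
      procedure Transform_Block
        (Data  : in out Real_Matrix;
         Start : in     Integer;
         Final : in     Integer) is
      begin
         for Row in Start .. Final loop
            for Col in Start .. Final loop
               Data (Row, Col) := 2.0 * Data (Row, Col);
            end loop;
         end loop;
      end Transform_Block;

      N       : constant := 8;   --  size of the matrix we actually use
      Padding : constant := 3;   --  extra storage, never touched
      Data    : Real_Matrix (0 .. N - 1 + Padding, 0 .. N - 1 + Padding) :=
                  (others => (others => 1.0));

   begin
      Transform_Block (Data, Start => 0, Final => N - 1);
      Ada.Text_IO.Put_Line (Long_Float'Image (Data (0, 0)));
   end Sub_Matrix_Demo;

The data storage matrix can then be declared with whatever padding
gives the best cache behaviour, and the transformation routine never
needs to know or care.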

Here are some Ada and Fortran examples of the
problem on variously sized data storage arrays.
I used some bits and pieces of routines from
http://web.am.qub.ac.uk/users/j.parker/miscellany/

First example: we eigen-decompose an N x N = 2048 x 2048 matrix.
The data storage matrix is M x M = (1024+Padding) x (1024+Padding)
Here is the running time in seconds of an iterative Jacobi
eigen-decomposition:

2048x2048: 322 sec, gnat (Padding=24)
2048x2048: 1646 sec, gnat (Padding=0)
2048x2048: 1632 sec, gfortran (Padding=0)
2048x2048: 1492 sec, ifort (Padding=0)

The observed 500% slowdown in the absence of padded arrays is
unacceptable, even if it is a rare event (occurring only on 2**p-sized
data arrays). In fact it's not all that rare ... more comments on
that below. (BTW, ifort is the Intel Fortran compiler, with all
optimizations at maximum; gfortran is the GCC variant.)
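
In Ada the padded storage amounts to nothing more than something like
this (illustrative names, using the Real_Matrix type from the sketch
above; not the actual declarations from the routine I timed):

   N       : constant := 2048;
   Padding : constant := 24;

   --  Storage is (N + Padding) x (N + Padding); only the leading
   --  N x N block is eigen-decomposed.  The padding is never touched,
   --  it exists purely to break the power-of-2 row stride.
   A : Real_Matrix (0 .. N + Padding - 1, 0 .. N + Padding - 1);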

By the way, I never bothered to write a paddable Fortran routine,
but here are some more timings near power-of-2 array bounds:

1023x1023: 33.5 sec, gnat (Padding=9)
1023x1023: 42.4 sec, gfortran (Padding=0)
1023x1023: 37.3 sec, ifort (Padding=0)

1022x1022: 33.5 sec, gnat (Padding=10)
1022x1022: 30.2 sec, gfortran (Padding=0)
1022x1022: 28.3 sec, ifort (Padding=0)

1024x1024: 33.2 sec, gnat (Padding=8)
1024x1024: 96.0 sec, gnat (Padding=0)
1024x1024: 116 sec, gfortran (Padding=0)
1024x1024: 43 sec, ifort (Padding=0)

There is one puzzle here I don't have time to solve. Normally, a good
Fortran compiler will automatically pad the array for you ... I recall
that happening in the past. This time it seems to have slipped its
Fortran mind. The ifort man pages say:

-O3 enables "Padding the size of certain power-of-two
arrays to allow more efficient cache use."

But I used -O3 and also -O3 -fast ... maybe I did something wrong,
but the important lesson is: compiler optimization policies change
with time, and of course vary from compiler to compiler. You can't
rely on them or even, amazingly, on the man pages. It's much better
to write the program in a way that is insensitive to changing
optimization policies.

A web search on "cache thrashing" will reveal much depressing
detail on the subject. The efficiency problems discussed above
occur as our arrays become large and spill out of the L3 cache
(6 MB in the present case).
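
A rough back-of-the-envelope picture of what goes wrong (assuming
8-byte reals and an ordinary set-associative cache): a 2048-element
row occupies exactly 2048 * 8 = 16384 bytes, so elements in the same
column of successive rows sit exactly 16 KiB apart. Cache sets are
selected from the low-order address bits, so those column elements
all compete for the same handful of sets and keep evicting one
another. With Padding = 24 the row stride becomes 2072 * 8 = 16576
bytes, which spreads a column traversal across many different sets.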

Just to demonstrate that these problems show up on all sorts
of arrays, I did some runs in the 2000 to 3000 range, this time
using a Householder decomposition scavenged from the Golub
singular value decomposition. You can still find plenty of 500%
slowdowns:

2102x2102, 3.93 sec, gnat (Padding = 0)
2102x2102, 3.19 sec, gnat (Padding = 8)

2176x2176, 5.03 sec, gnat (Padding = 0)
2176x2176, 3.42 sec, gnat (Padding = 8)

2304x2304, 8.47 sec, gnat (Padding = 0)
2304x2304, 4.52 sec, gnat (Padding = 8)

2560x2560, 24.1 sec, gnat (Padding = 0)
2560x2560, 5.42 sec, gnat (Padding = 8)

3072x3072, 38.9 sec, gnat (Padding = 0)
3072x3072, 7.76 sec, gnat (Padding = 8)

3584x3584, 53.2 sec, gnat (Padding = 0)
3584x3584, 11.5 sec, gnat (Padding = 8)

Finally, an example on 1-dimensional arrays, using fast Fourier
transforms (FFTs). The standard, and most common, FFT is
computed on a power-of-2 length data set, indexed 0 .. 2**p - 1. I
timed computation of these FFTs on arrays of length 2**p, and
I compared this with computation on arrays of
length 2**p + Padding, where Padding = 24. The computation
on the padded arrays was faster. The ratios of running times
(padded / unpadded) are:

p = 10 ratio = .93
p = 11 ratio = .88
p = 12 ratio = .76
p = 13 ratio = .79
p = 14 ratio = .76
p = 15 ratio = .75
p = 16 ratio = .77
p = 17 ratio = .84
p = 18 ratio = .75
p = 19 ratio = .45
p = 20 ratio = .63
p = 21 ratio = .69
p = 22 ratio = .69
p = 23 ratio = .67
p = 24 ratio = .62
p = 25 ratio = .62

So the problem is more common here, but smaller. (I still
consider these efficiency losses unacceptable, especially
in a routine whose reason for existence is efficiency.)
The problem is worse still when you take FFTs of
two-dimensional arrays.
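
The 1-dim version of the padded-storage idea looks something like
this (again only a sketch with made-up names; the FFT call itself is
only indicated in a comment):

   with Ada.Numerics.Complex_Types;  use Ada.Numerics.Complex_Types;
   with Ada.Text_IO;

   procedure FFT_Padding_Demo is

      p       : constant := 12;
      N       : constant := 2**p;   --  transform length, a power of 2
      Padding : constant := 24;     --  extra storage, never transformed

      type Complex_Vector is array (Natural range <>) of Complex;

      --  The storage array is longer than the transform.  An FFT
      --  routine that takes the transform bounds as parameters
      --  (here 0 .. N - 1) works inside the padded array without
      --  any copying.
      Data : Complex_Vector (0 .. N - 1 + Padding) :=
               (others => (Re => 1.0, Im => 0.0));

   begin
      --  Stand-in for the real call, which would look something like
      --  FFT (Data, First => 0, Last => N - 1);
      Ada.Text_IO.Put_Line (Integer'Image (Data'Length));
   end FFT_Padding_Demo;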

There is another (and entirely independent) reason
I prefer routines that perform their transformations on
arbitrary sub-matrices (or on arbitrary diagonal blocks)
of the data storage matrix. After writing my first linear
algebra routine, I was very pleased with myself, but it
didn't really do what I wanted. I realized I needed to
transform the diagonal sub-blocks of a large matrix,
and do it iteratively on arbitrarily sized diagonal
sub-blocks. It was a very simple matter to modify the code to
do this, and the result was so convenient that I've never
considered doing it otherwise.
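
All it takes is something like this (a sketch with made-up names and
a trivial stand-in for the transformation of each block):

   with Ada.Text_IO;

   procedure Diagonal_Blocks_Demo is

      type Real_Matrix is
         array (Integer range <>, Integer range <>) of Long_Float;

      Data : Real_Matrix (0 .. 255, 0 .. 255) :=
               (others => (others => 1.0));

      --  Block sizes chosen freely; in practice they change at run time.
      Sizes : constant array (1 .. 3) of Positive := (100, 100, 56);

      First : Integer := Data'First (1);

   begin
      --  Transform each diagonal sub-block in turn.
      for K in Sizes'Range loop
         declare
            Last : constant Integer := First + Sizes (K) - 1;
         begin
            for Row in First .. Last loop
               for Col in First .. Last loop
                  Data (Row, Col) := 2.0 * Data (Row, Col);
               end loop;
            end loop;
            First := Last + 1;
         end;
      end loop;

      Ada.Text_IO.Put_Line (Long_Float'Image (Data (0, 0)));
   end Diagonal_Blocks_Demo;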

Jonathan
From: Hibou57 (Yannick Duchêne) on
On Sun, 14 Feb 2010 01:42:13 +0100, jonathan <johnscpg(a)googlemail.com>
wrote:
> First example: we eigen-decompose an N x N = 2048 x 2048 matrix.
> The data storage matrix is M x M = (1024+Padding) x (1024+Padding)
> Here is the running time in seconds of an iterative jacobi
> eigen-decomposition:
>
> 2048x2048: 322 sec, gnat (Padding=24)
> 2048x2048: 1646 sec, gnat (Padding=0)
> 2048x2048: 1632 sec, gfortran (Padding=0)
> 2048x2048: 1492 sec, ifort (Padding=0)
>
> The observed 500% slowdown in the absence of padded arrays is
> unacceptable, even if it is a rare event (occurring only on 2**p sized
> data arrays). In fact it's not all that rare ... more comments on
> that below. (BTW, ifort is the INTEL fortran compiler, all
> optimizations at Max. gfortran is the gcc variant.)
So this is mostly about representation clauses, in the end. Is that it?

I don't know whether you already know this document (as I remember, I
picked it up from some thread on comp.lang.ada); I've talked about it
on fr.c.l.a as well:
http://research.scee.net/files/presentations/gcapaustralia09/Pitfalls_of_Object_Oriented_Programming_GCAP_09.pdf
I had pointed at frames #17, #18, #19 and #20, which contain a good
source of inspiration. Hope this can help you figure out a path.

You've posted a long list of benchmarks and observations. I haven't
looked at everything yet, but I hope to have a closer look at it later.

--
No-no, this isn't an oops ...or I hope (TM) - Don't blame me... I'm just
not lucky
From: jonathan on
On Feb 14, 1:54 am, Hibou57 (Yannick Duchêne)
<yannick_duch...(a)yahoo.fr> wrote:
> On Sun, 14 Feb 2010 01:42:13 +0100, jonathan <johns...(a)googlemail.com>
> wrote:

First example: we eigen-decompose an N x N = 2048 x 2048 matrix.
The data storage matrix is M x M = (1024+Padding) x (1024+Padding)

should be of course:

First example: we eigen-decompose an N x N = 2048 x 2048 matrix.
The data storage matrix is M x M = (2048+Padding) x (2048+Padding)

> I don't know whether you already know this document (as I remember, I
> picked it up from some thread on comp.lang.ada); I've talked about it
> on fr.c.l.a as well:
> http://research.scee.net/files/presentations/gcapaustralia09/Pitfalls...
> I had pointed at frames #17, #18, #19 and #20, which contain a good
> source of inspiration. Hope this can help you figure out a path.

Yes, I remembered this, probably from an old post of yours. I wanted
to cite it when I posted earlier, but I could not find the site. This
is not something you forget quickly (frames 17 and 18):

1980: RAM latency ~ 1 cycle
2009: RAM latency ~ 400+ cycles

It's the heart of the matter, and it is just getting worse. It helps
convince me, anyway, that I did not waste time on an unimportant
matter! In numerical linear algebra the usual solution is to
restructure matrices as a collection of blocks. That has a few
problems of its own, though. Minor footnote: I did some tests on
Intel's new Nehalem CPUs. Vastly improved performance on these
multi-megabyte arrays, but the problem is not cured. I don't know
enough to say more about it.
Thanks for the reminder.

Jonathan

From: David Thompson on
On Tue, 02 Feb 2010 09:52:31 +0100, Jean-Pierre Rosen
<rosen(a)adalog.fr> wrote:

> Jerry wrote:
> > I've never understood why Ada does not allow slicing in
> > multidimensional arrays. What are the safety issues involved? And how
> > is it safe to force the programmer into ad hoc methods?
> >
> One-dimensional slices are simple and efficient. Multidimensional slices
> are a can of worms.
>
> I guess you are thinking about rectangular slices. But why stop there? A
> slice comprising the main diagonal and the diagonal above and below can
> be very useful for some calculations. Or a slice which is the triangular
> part of the upper half...
>
> AFAICT, Fortran-99 does provide this - and the syntax is so complicated
> that nobody uses it. And implementation is also a nightmare.
>
Fortran 90 (and later) has rectangular but optionally non-unit-stride
slices; X(1:5:4,2:6:2) is X(1,2) X(5,2) X(1,4) X(5,4) X(1,6) X(5,6)
(in memory order; Fortran arrays are column-major). (And although it
treats a string -- fixed length only -- as a different type from an
array of characters, you can use corresponding substrings of the
elements of an array of strings in a way that is effectively also
rectangular.)

It also has 'vector subscripting': X(Y) accesses X(Y(1)) X(Y(2)) ...
X(Y(N)) -- but this cannot be used as a modifiable actual argument or
POINTER target, making it not really a first-class object/variable.

A major new 'coarray' feature, which essentially allows distribution
across parallel processing systems, vaguely like an in-language OpenMP
(although I hear the standardese must be rather contorted to avoid
impermissibly specifying the implementation), was proposed for what
was scheduled to be F08, but it proved contentious enough that AFAIK
it still hasn't been finalized.

*PL/I* has the X DEFINED Y (iSUB) syntax which allows things like a
diagonal (but AFAICS not more than one at a time).

> When designing a programming language, you have to stop at some point.
> The ratio (cost of implementation) / usefulness is a good measure for
> this. I think the ratio was simply too high for this feature.

The F90 features in this area, at least, weren't too much of a
problem, judging from the reports of visibly intelligent and
apparently informed people in c.l.f. Those implementors did have the
advantage that F90 was originally scheduled for, I believe, '87, and
even before that there had been experience with the nonstandard but
fairly widespread HPF ("High Performance Fortran") extensions.

In contrast F03, which adds features mostly in the areas of OO and
'parameterized' types (somewhat like Ada's discriminated types), has
taken longer, although most vendors are now reportedly getting close.