From: Nick Maclaren on

In article <1161856490.438501.274930(a)b28g2000cwb.googlegroups.com>,
"ranjit_mathews(a)yahoo.com" <ranjit_mathews(a)yahoo.com> writes:
|>
|> For n<=4, a matrix will typically fit into a cache line. Often, the
|> code can be rejiggered to cache align the matrix and prefetch cache
|> lines, although how far in advance of using a matrix its prefetch can
|> be done varies by the problem.

Not quite. A 4x4 matrix of 64-bit reals is 128 bytes. Aligning the
matrix gives at most a factor of two gain. Prefetching is independent
of whether the data are scalars or small matrices.
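
As a quick check of that arithmetic (4 rows x 4 columns x 8 bytes each), a
one-line C++ assertion confirms the size; the static_assert is purely
illustrative and not part of the original post:

// 4 * 4 * sizeof(double) = 4 * 4 * 8 = 128 bytes: one 128-byte cache
// line, or two 64-byte lines.
static_assert(sizeof(double[4][4]) == 128,
              "a 4x4 matrix of 64-bit reals is 128 bytes");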


Regards,
Nick Maclaren.
From: ranjit_mathews@yahoo.com on
Nick Maclaren wrote:
> In article <1161856490.438501.274930(a)b28g2000cwb.googlegroups.com>,
> "ranjit_mathews(a)yahoo.com" <ranjit_mathews(a)yahoo.com> writes:
> |>
> |> For n<=4, a matrix will typically fit into a cache line. Often, the
> |> code can be rejiggered to cache align the matrix and prefetch cache
> |> lines, although how far in advance of using a matrix its prefetch can
> |> be done varies by the problem.
>
> Not quite. A 4x4 matrix of 64-bit reals is 128 bytes.

IA-64, the MIPS R4x00, POWER4, and the Sun Fireplane interconnect have
128-byte cache lines.

> Aligning the
> matrix gives at most a factor of two gain. Prefetching is independent
> of whether the data are scalars or small matrices.

An operation on a pair of matrices takes longer than an operation on
scalars. So, prefetching before the former operation is prefetching
well in advance of use whereas prefetching before the latter operation
is prefetching little in advance of use.
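
To put rough numbers on that: a 4x4 multiply does 4*4*4 = 64 multiplications
and 4*4*3 = 48 additions per pair of matrices, against a single
multiplication for a pair of scalars, so a prefetch issued just before the
matrix operation has on the order of a hundred floating-point operations in
which to complete, rather than one.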

From: Nick Maclaren on

In article <1161874112.453687.208280(a)m7g2000cwm.googlegroups.com>,
"ranjit_mathews(a)yahoo.com" <ranjit_mathews(a)yahoo.com> writes:
|>
|> An operation on a pair of matrices takes longer than an operation on
|> scalars. So, prefetching before the former operation is prefetching
|> well in advance of use whereas prefetching before the latter operation
|> is prefetching little in advance of use.

Er, no.

On a modern system, the time taken to perform even a 4x4 matrix
multiply is much less than the time spent fetching the data from
memory, so that the time is dominated by the delay in fetching the
first values. Whether the matrix is in one cache line or four, the
other values will be there when needed.

The prefetch has to be the same amount ahead in both cases to
eliminate the delay due to the fetch.
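
A minimal sketch of what "the same amount ahead" means in practice, assuming
an array a of operands, a common operand b, a result array c, and a count n;
the distance D and the names are illustrative, not taken from either post:

// Prefetching D iterations ahead hides the fetch only if the work done in
// those D iterations covers the memory latency.  The latency to be covered
// is the same whether each element is a scalar or a 4x4 matrix; only the
// number of iterations needed to cover it changes.
const int D = 8;                        // illustrative distance
for (int i = 0; i < n; i++) {
    if (i + D < n)
        __builtin_prefetch(&a[i + D]);  // GCC/Clang built-in
    c[i] = a[i] * b;
}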


Regards,
Nick Maclaren.
From: ranjit_mathews@yahoo.com on

Nick Maclaren wrote:
> In article <1161874112.453687.208280(a)m7g2000cwm.googlegroups.com>,
> "ranjit_mathews(a)yahoo.com" <ranjit_mathews(a)yahoo.com> writes:
> |>
> |> An operation on a pair of matrices takes longer than an operation on
> |> scalars. So, prefetching before the former operation is prefetching
> |> well in advance of use whereas prefetching before the latter operation
> |> is prefetching little in advance of use.
>
> Er, no.
>
> On a modern system, the time taken to perform even a 4x4 matrix
> multiply is much less than the time spent fetching the data from
> memory, so that the time is dominated by the delay in fetching the
> first values. Whether the matrix is in one cache line or four, the
> other values will be there when needed.

Measure the timing of this C++ code on x86-64 with
(1) a prefetch member routine that prefetches, and
(2) a prefetch member routine that does nothing.

template <int R, int C> class matrix {
    // fill in the details
    // ...
};

typedef matrix<4,4> mat4;
#define N (1024*1024)

// ...
mat4 a[N], b, c[N];
// fill in the matrices with random 64-bit floats
// ...
for (int i = 0; i < N; i++) {
    if (i < N-1) a[i+1].prefetch();
    c[i] = a[i]*b;
}
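
One possible shape for the elided prefetch member, assuming aligned storage
and 64-byte lines as on typical x86-64 parts; the class name, the
_MM_HINT_T0 hint, and the USE_PREFETCH switch are illustrative, not
something the post specifies:

#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0

class mat4_sketch {                      // hypothetical stand-in for mat4
public:
    void prefetch() const {
#ifdef USE_PREFETCH                      // build (1): really prefetch
        // 128 bytes of aligned storage spans two 64-byte lines.
        _mm_prefetch(reinterpret_cast<const char*>(m),      _MM_HINT_T0);
        _mm_prefetch(reinterpret_cast<const char*>(m) + 64, _MM_HINT_T0);
#endif                                   // build (2): does nothing
    }
private:
    double m[4][4];
};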

> The prefetch has to be the same amount ahead in both cases to
> eliminate the delay due to the fetch.

The latency might not be eliminated, but it would be reduced to a greater
extent in the case of matrix multiplication than in the case of scalar
multiplication. Try the above example with scalars: there would be
little difference with and without prefetch. In the case of the
matrices, prefetching would make some difference.
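
For comparison, the scalar version of the same loop might look like this
(again only a sketch, with illustrative names):

double x[N], y[N], k = 3.14;
// fill x with random 64-bit floats
for (int i = 0; i < N; i++) {
    if (i < N-1) __builtin_prefetch(&x[i+1]);  // one element ahead, as in the matrix loop;
                                               // x[i+1] is usually in the same line as x[i]
    y[i] = x[i] * k;                           // a single multiply leaves little work to overlap the fetch
}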

> Regards,
> Nick Maclaren.

From: ranjit_mathews@yahoo.com on

ranjit_mathews(a)yahoo.com wrote:
> Nick Maclaren wrote:
> > In article <1161874112.453687.208280(a)m7g2000cwm.googlegroups.com>,
> > "ranjit_mathews(a)yahoo.com" <ranjit_mathews(a)yahoo.com> writes:
> > |>
> > |> An operation on a pair of matrices takes longer than an operation on
> > |> scalars. So, prefetching before the former operation is prefetching
> > |> well in advance of use whereas prefetching before the latter operation
> > |> is prefetching little in advance of use.
> >
> > Er, no.
> >
> > On a modern system, the time taken to perform even a 4x4 matrix
> > multiply is much less than the time spent fetching the data from
> > memory, so that the time is dominated by the delay in fetching the
> > first values. Whether the matrix is in one cache line or four, the
> > other values will be there when needed.
>
> Measure the timing of this C++ code on x86-64 with
> (1) a prefetch member routine that prefetches, and
> (2) a prefetch member routine that does nothing.

On x86-64, with an unaligned implementation, the prefetch routine would
have to prefetch the first, middle, and last 32-bit words of the storage
used for the matrix, so that every 64-byte cache line the 128-byte matrix
straddles (up to three) is fetched.
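
A sketch of such a routine, assuming 64-byte lines; the name
prefetch_unaligned and the T0 hint are illustrative:

#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0

// An unaligned 128-byte matrix can straddle up to three 64-byte cache
// lines, so touch one address in each: the first word, the middle, and the
// last 32-bit word (sufficient while the storage is at least 4-byte aligned).
inline void prefetch_unaligned(const void* p)
{
    const char* q = static_cast<const char*>(p);
    _mm_prefetch(q,       _MM_HINT_T0);   // line holding the first word
    _mm_prefetch(q + 64,  _MM_HINT_T0);   // line holding the middle of the storage
    _mm_prefetch(q + 124, _MM_HINT_T0);   // line holding the last 32-bit word
}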

> template <int R, int C> class matrix {
>     // fill in the details
>     // ...
> };
>
> typedef matrix<4,4> mat4;
> #define N (1024*1024)
>
> // ...
> mat4 a[N], b, c[N];
> // fill in the matrices with random 64-bit floats
> // ...
> for (int i = 0; i < N; i++) {
>     if (i < N-1) a[i+1].prefetch();
>     c[i] = a[i]*b;
> }
>
> > The prefetch has to be the same amount ahead in both cases to
> > eliminate the delay due to the fetch.
>
> The latency might not be eliminated, but it would be reduced to a greater
> extent in the case of matrix multiplication than in the case of scalar
> multiplication. Try the above example with scalars: there would be
> little difference with and without prefetch. In the case of the
> matrices, prefetching would make some difference.
>
> > Regards,
> > Nick Maclaren.