From: Nick Maclaren on

In article <1161848990.855342.57090(a)m73g2000cwd.googlegroups.com>,
"BDH" <bhauth(a)gmail.com> writes:
|> > >There are only three important expensive operations: sorting, matrix
|> > >multiplication, and Fourier transforms. Current architectures fail
|> > >miserably at all three.
|>
|> > Wow, now that's a sweeping over-generalization.
|>
|> You're right. Let me correct it. Prefix array calculation, vector
|> permutation, matrix arithmetic, and Fourier and number-theoretic
|> transforms are the only important expensive computations, and current
|> architectures generally do very badly with all of these.

That over-generalisation should be swept into the bin, and the lid
closed on it.

Other important expensive computations include keyed data access
(including sorting) and many graph-theoretic analyses (including linear
and quadratic programming).

|> > Would you care to give some examples? For example, floating point
|> > matrix-matrix multiply gets close to peak for most array sizes on most
|> > current architectures.
|>
|> Leaving aside algorithms with order less than n^3 (matrix
|> multiplication is probably n^2 or n^2 log n) - does matrix
|> multiplication of arrays of size n take more than O(n) time with O(n^2)
|> computational elements on your computer?

1) I await reading your paper showing how matrix multiplication can be
done in O(n^2 log n) with great interest, but expect that it will use
a completely unrealistic model.

2) Multiplication of arrays of size n with O(n^2) computational elements
is generally useful only for n <= 4, and many architectures have done it
very efficiently. In fact, MOST modern CPUs will spend MUCH less time
doing the computation than in loading the data from memory.

The issue is data access - see my other posting today.
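
A minimal sketch of (2) in C, assuming a 64-byte cache line and taking
the cycle figures as merely indicative:

    #include <stdio.h>

    /* Sketch only: a 4x4 double multiply.  The arithmetic is
       2*4^3 = 128 flops, a few tens of cycles on a modern FPU; if
       A, B and C are not already in cache, fetching their
       3*16*8 = 384 bytes from memory costs far more than the
       flops do. */
    static void matmul4(const double A[4][4], const double B[4][4],
                        double C[4][4])
    {
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++) {
                double s = 0.0;
                for (int k = 0; k < 4; k++)
                    s += A[i][k] * B[k][j];
                C[i][j] = s;
            }
    }

    int main(void)
    {
        double A[4][4] = {{1,0,0,0},{0,1,0,0},{0,0,1,0},{0,0,0,1}};
        double B[4][4] = {{1,2,3,4},{5,6,7,8},{9,10,11,12},{13,14,15,16}};
        double C[4][4];
        matmul4(A, B, C);
        printf("%g\n", C[2][3]);   /* 12, since A is the identity */
        return 0;
    }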


Regards,
Nick Maclaren.
From: Nick Maclaren on

In article <ehpt1f$ask$1(a)gemini.csx.cam.ac.uk>,
nmm1(a)cus.cam.ac.uk (Nick Maclaren) writes:
|>
|> That is, indeed, true, and the reason the misapprehension is widespread
|> is the sorry state of "computer science" "complexity theory".
|> Real mathematical complexity theory is fine, but does not pretend to
|> match computer hardware; it is the part that claims to be applied that
|> I am damning.

And, since someone will flame me for this, it is possible to damn an
area without damning everyone in it. There are, naturally, worthy
exceptions, but not that many and they don't dominate the area.

|> So most of the books on algorithms are misleading to the point of being
|> actually incorrect, which is almost certainly where the original poster
|> got his misapprehensions from.

Note that I said MOST.


Regards,
Nick Maclaren.
From: BDH on
> Other important expensive computations include keyed data access

...a sorting problem...

> and many graph-theoretic analyses (including linear
> and quadratic programming).

...usually a combination of matrix arithmetic and sorting...

> 1) I await reading your paper showing how matrix multiplication can be
> done in O(n^2 log n) with great interest, but expect that it will use
> a completely unrealistic model.

I doubt I'll be the one to show this, but I believe I'll live to see
it. There is progress to be made in group theory.

From: BDH on
> Back in the late 1960s and all the 1970s, a few of us were saying that
> we should NOT be measuring KFlops but KMaccs (memory accesses); even
> now, that mindset is not universal. And 90% of the double-quoted
> nonsense uses completely unrealistic and often inconsistent models.
> If you do an analysis using a simple, but more realistic model[*], you
> get VERY different complexities for most such operations. But such
> approaches are VERY hard to get published, as the establishment is not
> keen to have people point out that it has been bullshitting for many
> decades ....

You need some memory while gates wait for their inputs or for a clock cycle,
but why would you send your data to some sort of separate RAM and back?
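
For concreteness, a minimal sketch of the counting the quoted paragraph
asks for (model counts only, not measurements of any machine): in the
textbook multiply the access tally comes out essentially equal to the
flop tally, so the two measures only diverge once reuse and blocking
enter the model.

    #include <stdio.h>
    #include <stdlib.h>

    /* Instrument the textbook n^3 multiply to count element
       accesses alongside flops.  Error handling omitted; this is
       a counting sketch, not a kernel. */
    static long accesses, flops;

    static double get(const double *a, long i) { accesses++; return a[i]; }
    static void   put(double *a, long i, double v) { accesses++; a[i] = v; }

    int main(void)
    {
        long n = 64;
        double *A = malloc(n*n*sizeof *A), *B = malloc(n*n*sizeof *B),
               *C = malloc(n*n*sizeof *C);

        for (long i = 0; i < n*n; i++) { A[i] = 1.0; B[i] = 2.0; }

        for (long i = 0; i < n; i++)
            for (long j = 0; j < n; j++) {
                double s = 0.0;
                for (long k = 0; k < n; k++) {
                    s += get(A, i*n + k) * get(B, k*n + j);
                    flops += 2;
                }
                put(C, i*n + j, s);
            }

        printf("n=%ld: flops %ld, element accesses %ld\n", n, flops, accesses);
        free(A); free(B); free(C);
        return 0;
    }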

From: ranjit_mathews@yahoo.com on
Nick Maclaren wrote:
> "BDH" <bhauth(a)gmail.com> writes:
> |> > >There are only three important expensive operations: sorting, matrix
> |> > >multiplication, and fourier transforms. Current architectures fail
> |> > >miserably at all three.
> |>
> |> > Wow, now that's a sweeping over-generalization.
> |>
> |> You're right. Let me correct it. Prefix array calculation, vector
> |> permutation, matrix arithmetic, and fourier and number theoretic
> |> transforms are the only important expensive computations, and current
> |> architectures generally do very badly with all of these.
>
> That over-generalisation should be swept into the bin, and the lid
> closed on it.
>
> Other important expensive computations include keyed data access
> (including sorting) and many graph-theoretic analyses (including linear
> and quadratic programming).
>
> |> > Would you care to give some examples? For example, floating point
> |> > matrix-matrix multiply gets close to peak for most array sizes on most
> |> > current architectures.
> |>
> |> Leaving aside algorithms with order less than n^3 (matrix
> |> multiplication is probably n^2 or n^2 log n) - does matrix
> |> multiplication of arrays of size n take more than O(n) time with O(n^2)
> |> computational elements on your computer?
>
> 1) I await reading your paper showing how matrix multiplication can be
> done in O(n^2 log n) with great interest, but expect that it will use
> a complete unrealistic model.
>
> 2) Multiplication of arrays of size n with O(n^2) computational elements
> is generally useful only for n <= 4, and many architectures have done it
> very efficiently. In fact, MOST modern CPUs will spend MUCH less time
> doing the computation than in loading the data from memory.

For n <= 4, a matrix will typically fit into a cache line. Often, the
code can be rejiggered to cache-align the matrix and prefetch its cache
lines, although how far in advance of its use a matrix can usefully be
prefetched varies with the problem.
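
A minimal sketch of that rejiggering, assuming GCC or Clang
(__builtin_prefetch and the aligned attribute are compiler extensions,
not standard C) and single precision, so that a 4x4 matrix occupies
exactly one 64-byte line:

    #include <stdio.h>

    /* A 4x4 float matrix is 64 bytes; aligning it to a 64-byte
       boundary keeps it within exactly one cache line. */
    struct mat4 {
        float m[4][4];
    } __attribute__((aligned(64)));

    static float trace(const struct mat4 *a)
    {
        float t = 0.0f;
        for (int i = 0; i < 4; i++)
            t += a->m[i][i];
        return t;
    }

    int main(void)
    {
        enum { N = 1024, AHEAD = 8 };   /* prefetch distance: problem-dependent */
        static struct mat4 ms[N];
        float sum = 0.0f;

        for (int i = 0; i < N; i++)
            ms[i].m[0][0] = (float)i;

        for (int i = 0; i < N; i++) {
            if (i + AHEAD < N)
                __builtin_prefetch(&ms[i + AHEAD], 0, 1);  /* hint: read, modest reuse */
            sum += trace(&ms[i]);
        }
        printf("%g\n", sum);
        return 0;
    }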
