From: Robert Myers on
On Jul 28, 12:10 pm, Jason Riedy <ja...(a)acm.org> wrote:
> And Andy Glew writes:
> > c) this is complex.  More complex than simply supporting sequential bursts.
>
> > But I'm not afraid of complexity.  I try to avoid complexity, when
> > there are simpler ways of solving a problem.  But it appears that this
> > random access problem is a problem that (a) is solvable (with a bit of
> > complexity), (b) has customers (Robert, and some other supercomputing
> > customers I have met, some very important), and (c) isn't getting
> > solved any other way.
>
> There are customers who evaluate systems using the GUPS benchmark[1],
> some vendors are trying to address it, and some contract RFPs require
> considering the issue (DARPA UHPC).  A dual-mode system supporting
> full-bandwidth streams (possibly along affine ("crystiline"?) patterns
> of limited dimension) and, say, half-bandwidth word access would permit
> balancing the better bandwidth and power efficiency of streams with
> scatter/gather/GUPS accesses that currently are bottlenecks.

Had I thought to use the word "affine," which I associate with mind-
bending texts involving Cristoffel symbols, I would have made a much
less memorable post and not gotten my point across nearly as
clearly. ;-)

Affine might be too general, though, since it includes general
rotation.

Robert.
From: Jason Riedy on
And Robert Myers writes:
> Affine might be too general, though, since it includes general
> rotation.

You're right. I was digging in my head for a term... Oh! The
polyhedral model[1] for loops. The affine part comes from
transformations between polyhedra. I'm imagining the memory actions
following the final, transformed polyhedron. Each can be described
by a base address and a vector of strides, etc. Some dimensions
will cross node boundaries.

I suspect (with no data) that supporting two dimensions covers most
uses that can be reduced to streaming-ish loads, and three
dimensions would cover almost all uses short of real random access.
IIRC, some prefetch engines handle the 2d case for same-node memory.
The IPDPS presentation implied (to me) that Blue Waters folks have
thought about striding across nodes for PGAS support and have some
support in the little communication processors.

I don't know if anyone has thought hard about system-wide
coordination of the stride patterns on modern (less synchronized)
machines, though, and that could prove critical for nontrivial
streaming access in "exascale" memories.

The MTA/XMT has one "solution" for global coordination of random
access patterns: randomize (hash) all addresses. The implications
for reliability are, um, open for interpretation. Forget about
direct I/O. And you still must avoid hot-spots. I'm dubious that
the randomization buys anything but pain.

Jason

Footnotes:
[1] http://en.wikipedia.org/wiki/Polytope_model

From: =?ISO-8859-1?Q?Niels_J=F8rgen_Kruse?= on
Andy Glew <"newsgroup at comp-arch.net"> wrote:

> I have not been able to read the redbook yet (Google Chrome and Adobe
> Reader were conspiring to hang, and could not view/download the
> document; I had to fall back to Internet Explorer).

Perhaps it is easier to download from the parent page:
<http://www.redbooks.ibm.com/Redbooks.nsf/RedpieceAbstracts/sg247833.htm
l>

--
Mvh./Regards, Niels J�rgen Kruse, Vanl�se, Denmark
From: Andy Glew "newsgroup at on
On 7/28/2010 12:46 PM, Jason Riedy wrote:
> And Robert Myers writes:
>> Affine might be too general, though, since it includes general
>> rotation.
>
> I don't know if anyone has thought hard about system-wide
> coordination of the stride patterns on modern (less synchronized)
> machines, though, and that could prove critical for nontrivial
> streaming access in "exascale" memories.
>
> The MTA/XMT has one "solution" for global coordination of random
> access patterns: randomize (hash) all addresses. The implications
> for reliability are, um, open for interpretation. Forget about
> direct I/O. And you still must avoid hot-spots. I'm dubious that
> the randomization buys anything but pain.

"Affine" is a good word. I think including general rotations is good.

But "crystalline" sparkles. :-)

Thanks Jason for reminding me about GUPS. Plus, if you look at some of
the recent GUPS champions, they have been doing in software what I
propose doing in hardware, along the lines of scatter/gather bursting.

I suspect that we will end up in a 3-way discussion:

1) sequential

2) crystalline or affine - regular access patterns, just not supported
well by block sequential

3) random access patterns

My user community is more in the random access patterns than the
crystalline or affine. E.g. I am into pointers, linked data
structures, irregular sparse arrays. Not regularity. More AI than FFT.

At the moment 2) and 3) are allied. A good solution to 3) will also
help 2) a lot, but not vice versa. I am a little bit afraid of
proposals that seem to hardwire support for certain access patterns into
the hardware, at the expense of others not anticipated by the
anticipator of little foresight.

--

I grew up admiring the stunt box scheduling to coordinate stride
patterns on old machines. But every time I looked closely, it was not
really all that fancy - not as fancy as people in this group suggested.
They weren't solving diophantine equations for access pattern
optimizations in real time; they were applying simple heuristics,
usually variants of greedy, that delivered reasonable performance.
(Now, the *compiler* might be solving diophantine equations - but
usually not for scheduling.)

--

I'm symapthetic to the randomization, but not certain about it.



From: Andy Glew "newsgroup at on
On 7/28/2010 9:10 AM, Jason Riedy wrote:
> There are customers who evaluate systems using the GUPS benchmark[1],


Jason, can you explain why GUPS is so update heavy?

Sure, a workload of random updates seesm important. But similarly a
workload of random reads also seems important.

I think that we need two randoom read components:

(1) independent random reads

(2) dependent random reads, e.g. pointer chasing:

choose a random hash table

choose a random element in that hash table

and repeat.



Probably with the above compined with updatesin some sort of weighted mix.

Probably also witgh sequential accesses at the pointer dereferences.