From: David Kanter on
On Jul 27, 9:37 am, nos...(a)ab-katrinedal.dk (Niels Jørgen Kruse)
wrote:
> Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>
> > Andy Glew wrote:
> > > On 7/27/2010 6:16 AM, Niels Jørgen Kruse wrote:
> > >> 24 MB L3 per 4 cores
> > >> up to 768 MB L4
> > >> 256 byte line sizes at all levels.
>
> > > 256 *BYTE*?
>
> > Yes, that one rather screamed at me as well.
>
> Another surprising thing I spotted browsing through the redbook, is the
> claim of single cycle L1D access. That must be array access only, so
> there are at least address generation and format cycles before and
> after. Still, 3 cycle loads from a 128 KB L1D at 5.2 GHz must show up on
> the power budget.

It's definitely array access only.

Honestly, they've got an awful lot to get done in 2-3 cycles: TLB
lookup (which should be in parallel), tag and parity check, array access,
data formatting, and sending the data somewhere. The data formatting and
transfer might be pipelined, but even so... that's a lot of activity.

What is interesting is that they indicated the L1 and L2 are both
write-thru/store-thru designs, while the L3 and L4 are write-back/
store-in. Write thru should be a transitive quality, and if that's
correct, then the latency for a store to retire is going to be pretty
high, requiring an L1 write, an L2 write and an L3 write.
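
A back-of-envelope sketch in Python (the per-level write latencies are
made-up placeholders, not actual z-series numbers) of how that stacks up
if a store only retires once it has completed at each write-thru level:

    # Hypothetical per-level write latencies, in cycles -- placeholders only.
    write_latency = {"L1": 3, "L2": 10, "L3": 40}

    # If write-thru is transitive, the store is not complete until it has
    # been written at every write-thru level, so the latencies add serially.
    retire_latency = sum(write_latency[level] for level in ("L1", "L2", "L3"))
    print(f"store retire latency ~ {retire_latency} cycles")  # 53 with these numbers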

DK
From: Andy Glew <"newsgroup at comp-arch.net"> on
On 7/27/2010 2:29 PM, David Kanter wrote:
> On Jul 27, 9:37 am, nos...(a)ab-katrinedal.dk (Niels Jørgen Kruse)
> wrote:

> What is interesting is that they indicated the L1 and L2 are both
> write-thru/store-thru designs, while the L3 and L4 are write-back/
> store-in. Write thru should be a transitive quality, and if that's
> correct, then the latency for a store to retire is going to be pretty
> high, requiring an L1 write, an L2 write and an L3 write.

But the latency for a store to retire doesn't matter that much, so long
as it can be pipelined.

E.g. if you have stores queued up

a) you can have "obtained ownership", i.e. ensured that all other copies
of the line have been invalidated in all other peer store-thru caches,
before the store starts to retire. (IBM does this; Intel did NOT do
this, up until Nehalem and QPI. I.e. IBM does "invalidate before
store-thru", whereas older Intel machines did
"write-through-invalidates", because they had a different, less
constrained, memory model and a more constrained system architecture.)

b) assuming store1 and store2 are queued up:

cycle 1: store1 L1 write

cycle 2: store2 L1 write; store1 L2 write

cycle 3: store2 L2 write; store1 L3 write

cycle 4: store2 L3 write

You have consistent state at all points.
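
A tiny Python sketch of that schedule (cycle counts illustrative only),
just to make the throughput point concrete: retire latency is three
cycles, but one store completes its L1 write every cycle.

    # Each store writes L1, then L2, then L3 on successive cycles.
    levels = ["L1", "L2", "L3"]
    n_stores = 4

    schedule = {}  # cycle -> list of write events
    for s in range(n_stores):
        for stage, level in enumerate(levels):
            cycle = s + stage + 1
            schedule.setdefault(cycle, []).append(f"store{s+1} {level} write")

    for cycle in sorted(schedule):
        print(f"cycle {cycle}: " + "; ".join(schedule[cycle]))
    # Pipelined: the last store retires at cycle n_stores + 2, but the
    # sustained throughput is still one store per cycle.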

Also, you can store combine, at least into same line (and, aggressively,
into different lines).

From: Andy Glew <"newsgroup at comp-arch.net"> on
On 7/27/2010 8:08 AM, Terje Mathisen wrote:
> Andy Glew wrote:
>> On 7/27/2010 6:16 AM, Niels Jørgen Kruse wrote:
>>> 24 MB L3 per 4 cores
>>> up to 768 MB L4
>>> 256 byte line sizes at all levels.
>>
>> 256 *BYTE*? [cache line size on new IBM z-Series]
>
> Yes, that one rather screamed at me as well.
>>
>> 2048 bits?
>>
>> Line sizes 4X the typical 64B line size of x86?
>>
>> These aren't cache lines. They are disk blocks.
>
> Yes. So what?
>
> I (and Nick, and you afair) have talked for years about how current CPUs
> are just like mainframes of old:
>
> new      old
> DISK  -> TAPE : Sequential access only
> RAM   -> DISK : HW-controlled, block-based transfer
> CACHE -> RAM  : Actual random access, but blocks are still faster
>
>>
>> Won't make Robert Myers happy.

Yes, I know. Many of my responses to Robert Myers have been
explanations of this, the state of the world.

However, the reason that I am willing to cheer Robert on as he tilts at
his windmill, and even to try to help out a bit, is that this trend is
not a fundamental limit. I.e. there is no fundamental reason that we
have to be hurting random accesses as memory systems evolve.

People seem to act as if there are only two design points:

* low latency, small random accesses
* long latency, burst accesses

But it is possible to build a system that supports

* small random accesses with long latencies

By the way, it is more appropriate to say that the current trend is towards

* long latency, random long sequential burst accesses.

(Noting that you can have random non-sequential burst accesses, as I
have recently posted about.)

The things that seem to be driving the evolution towards long sequential
bursts are

a) tags in caches - the smaller the cache objects, the more area wasted
on tags. But if you don't care about tags for your small random accesses...

b) signalling overhead - long sequential bursts have a ratio of address
bits to data bits of, say, 64:512 = 1:8 for Intel's 64 byte cache lines,
and 64:2048 = 1:32 for IBM's 256 byte cache lines. Whereas scatter/gather
has a signalling ratio of more like 1:1.
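
The same arithmetic in Python (assuming 64 address bits per request and
64-bit words for the scatter/gather case):

    ADDR_BITS = 64  # assumed address width sent with each request

    def signalling_ratio(data_bits):
        # Address bits sent per data bit transferred.
        return ADDR_BITS / data_bits

    print(signalling_ratio(64 * 8))   # 64B line:   0.125   -> 1:8
    print(signalling_ratio(256 * 8))  # 256B line:  0.03125 -> 1:32
    print(signalling_ratio(64))       # 64b scatter/gather word: 1.0 -> 1:1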

Signalling overhead manifests both in bandwidth and power.

One can imagine an interconnect that handles both sequential bursts and
scatter/gather random accesses - so that you don't pay a penalty for
sequential access patterns, but you still support small random access
patterns well, albeit with long latencies. But...

c) this is complex. More complex than simply supporting sequential bursts.

But I'm not afraid of complexity. I try to avoid complexity, when there
are simpler ways of solving a problem. But it appears that this random
access problem is a problem that (a) is solvable (with a bit of
complexity), (b) has customers (Robert, and some other supercomputing
customers I have met, some very important), and (c) isn't getting solved
any other way.

For all that we talk about persuading programmers that DRAM is the new disk.



> 768 MB of L4 means your problem size is limited to a little less than
> that, otherwise random access is out.

It may be worse than you think.

I have not been able to read the redbook yet (Google Chrome and Adobe
Reader were conspiring to hang, and could not view/download the
document; I had to fall back to Internet Explorer).

But I wonder what the cache line size is in the interior caches, the L1,
L2, L3?

With the IBM heritage, it may be a small, sectored cache line. Say 64 bytes.

But, I also recall seeing IBM machines that could transfer a full 2048
bits between cache and registers in a single cycle. Something which I
conjecture is good for context switches on mainframe workloads.

If the 256B cache line is used in the inside caches, then it might be
that only the L1 is really capable of random access.

Or, rather: there is no absolute "capable of random access". Instead,
there are penalties for random access.

I suggest that the main penalty should be measured as the ratio of

   (time for small random accesses to transfer N bytes)
to
   (time for one long burst sequential transfer of N bytes)

i.e. the bandwidth advantage of long sequential bursts over small random
accesses.

Let us talk about 64-bit random accesses.

Inside the L1 cache at Intel, with 64 byte cache lines, this ratio is
close to 1:1.

Accessing data that fits in the L2, this ratio is circa 8:1 - i.e. long
burst sequential is 8X faster, higher bandwidth, than 64b random accesses.

From main memory the 8:1 ratio still approximately holds wire-wise, but
buffering effects tend to crop up which inflate it further.

With 256B cache lines, the wire contribution to this ratio is 32:1 -
i.e. long burst sequential is 32X faster, higher bandwidth, than 64b
random accesses. Probably with additional slowdowns on top of that.
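
A crude Python sketch of that wire-level contribution (assuming each
random 64-bit access drags a whole line across the interface, and
ignoring buffering and banking effects):

    WORD_BYTES = 8  # 64-bit random accesses

    def burst_vs_random(line_bytes):
        # Bytes moved per useful byte when a random 64-bit access fetches a
        # whole line, vs. a fully used sequential burst of the same line.
        return line_bytes / WORD_BYTES

    print(burst_vs_random(64))   # 8.0  -> sequential ~8X the useful bandwidth
    print(burst_vs_random(256))  # 32.0 -> sequential ~32X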

---


What I am concerned about is that it may not be that "DRAM is the new disk".

It may be that "L2 cache is the new disk". More likely "L4 cache is the
new disk".


---


By the way, this is the first post I am making to Robert Myers'

high-bandwidth-computing(a)googlegroups.com

mailing list


Robert: is this an appropriate topic?

From: Jason Riedy on
And Andy Glew writes:
> c) this is complex. More complex than simply supporting sequential bursts.
>
> But I'm not afraid of complexity. I try to avoid complexity, when
> there are simpler ways of solving a problem. But it appears that this
> random access problem is a problem that (a) is solvable (with a bit of
> complexity), (b) has customers (Robert, and some other supercomputing
> customers I have met, some very important), and (c) isn't getting
> solved any other way.

There are customers who evaluate systems using the GUPS benchmark[1],
some vendors are trying to address it, and some contract RFPs require
considering the issue (DARPA UHPC). A dual-mode system supporting
full-bandwidth streams (possibly along affine ("crystalline"?) patterns
of limited dimension) and, say, half-bandwidth word access would permit
balancing the better bandwidth and power efficiency of streams with
scatter/gather/GUPS accesses that currently are bottlenecks. Those
bottlenecks also waste power, so having both could be a win from the
system perspective even if a single component might draw more power.
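
For reference, the access pattern GUPS-style benchmarks stress looks
roughly like this (a toy Python version, not the official HPCC
RandomAccess code, which uses a specific 64-bit pseudo-random stream and
a table sized to a large fraction of memory):

    import random

    TABLE_BITS = 20              # toy size for illustration
    TABLE_SIZE = 1 << TABLE_BITS
    table = list(range(TABLE_SIZE))

    def gups_like_updates(n_updates):
        # Each update reads and writes one 64-bit word at an effectively
        # random address -- exactly what long-burst-optimized memory
        # systems handle worst.
        for _ in range(n_updates):
            r = random.getrandbits(64)
            table[r & (TABLE_SIZE - 1)] ^= r

    gups_like_updates(1 << 16)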

The Blue Waters slides presented at IPDPS'10 make me believe IBM's going
that route with a specialized interconnect controller per board, but I
don't remember/know the details. Another vendor also understands this
split and wants to support both access patterns. Again, I don't know
the details, but I'm pretty sure they're going in this dual-mode
direction.

Considering people have dropped things like networked file systems and
IP routing protocols into FPGAs and silicon, I can't believe supporting
two modes would be much more of a technical challenge. And it looks
like there may finally be money attached to tackling that challenge.

Jason

Footnotes:
[1] http://icl.cs.utk.edu/projectsfiles/hpcc/RandomAccess/

From: Robert Myers on
On Jul 28, 11:05 am, Andy Glew <"newsgroup at comp-arch.net"> wrote:

>
> By the way, this is the first post I am making to Robert Myers'
>
> high-bandwidth-computing(a)googlegroups.com
>
> mailing list
>
> Robert: is this an appropriate topic?

I'm happy to let the discussion go whatever way it wants to. Using
available bandwidth more effectively is the same as having more
bandwidth. You could say that reducing the issue to "more bandwidth!"
is as bad as reducing all of HPC to "more flops!"

Robert.