From: "Andy "Krazy" Glew" on
Andy "Krazy" Glew
> Hetero doesn't impact this - unless you are tempted to do things like
> track, say, only one outstanding transaction per small core, and not to
> allocate memory controller buffers for small core requests.
>
> Just say no.

Of course, simple cores that block on cache misses or remote accesses are
the ultimate MIMD, and may be the endpoint of computer architecture as
transistors grow cheap and power expensive. In many ways, this is what
Shekhar Borkar, the mouthpiece of Intel's CRL (Circuit Research Labs),
advocates. I've posted about my interest in coherent threaded GPUs that
interleave and/or switch to other threads. But threads waste power with
their large register files.

However, when I do the math, we aren't there yet.

- - -

This I am writing on a plane, using my tablet PC. I have to use it
rotated sideways, but at least I can use it, whereas I cannot type. Darn
compressed seating! Also, I usually get an aisle.

From: "Andy "Krazy" Glew" on
Terje Mathisen wrote:
> Andy "Krazy" Glew wrote:
> [interesting spectrum of distributed memory models snipped]
>> I think SMB-WB-bitmask is more likely to be important than the weak
>> models 0.7 and 0.9,
>> in part because I am in love with new ideas
>> but also because I think it scales better.
>>
>> It provides the performance of conventional PGAS, but supports cache
>> locality when it is present. And poses none of the semantic challenges
>> of software managed cache coherency, although it has all of the same
>> performance issues.
>>
>> Of course, it needs roughly 64 bits per cache line. Which may be enough to
>> kill it in its tracks.
>
> Isn't this _exactly_ the same as the current setup on some chips that
> use 128-byte cache lines, split into two sectors of 64 bytes each?
>
> I.e. an effective cache line size that is smaller than the "real" line
> size, taken to its logical end point.
>
> I would suggest that (as you note) register size words is the smallest
> item you might need to care about and track, so 8 bits for a 64-bit
> platform with 64-byte cache lines, but most likely you'll have to
> support semi-atomic 32-bit operations, so 16 bits which is a 3% overhead.
>
> Terje

Well, it's not *exactly* like sectored cache lines. You typically
need the sector size to be a multiple of the DRAM burst transfer size,
what Jim Goodman called the 'transfer block size' in his paper that I
thought defined the only really good terminology.

The 'cache line size', what Jim Goodman called the 'address block size',
is a multiple, usually the usual power-of-two-aligned multiple, of the
transfer block size. Indeed, the address block may consist of several
sub-blocks that, forgetting Jim's notation, I will call residency
blocks, AKA sectors. Those in turn may each consist of several transfer
blocks.

Whereas the write bitmasks, whether at byte or word granularity, are
finer grain than the transfer block size.
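
To make the size hierarchy concrete, here is a minimal C sketch; the
sizes are purely illustrative assumptions, not any particular machine.
One address block (cache line, one tag) holds several residency blocks
(sectors); each sector is a multiple of the transfer block (DRAM burst);
and the write bitmask sits below all of them, at one bit per byte.

#include <stdint.h>

/* Illustrative sizes: assumptions for this sketch, not any real machine. */
#define TRANSFER_BLOCK_BYTES   16    /* DRAM burst: unit moved per transfer    */
#define RESIDENCY_BLOCK_BYTES  64    /* sector: unit of residency in the cache */
#define ADDRESS_BLOCK_BYTES   128    /* cache line: unit covered by one tag    */

struct residency_block {
    uint8_t  data[RESIDENCY_BLOCK_BYTES];
    uint64_t write_bitmask;   /* one bit per byte: finer than a transfer block */
    uint8_t  present;         /* is this sector resident? */
};

struct address_block {
    uint64_t tag;             /* one address tag covers the whole line */
    struct residency_block sector[ADDRESS_BLOCK_BYTES / RESIDENCY_BLOCK_BYTES];
};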

Byte granularity is motivated because it is the smallest granularity
that you can usually write into some memories without having to do a
read-modify-write. Almost nobody allows you to write at bit
granularity. Sure, some systems do not allow you to write at byte
granularity, and they may even require you to write at word or cache
line granularity. But byte granularity is very widespread.

If you track this at word granularity but allow the user to write at
byte granularity, because that's what his instruction set has, then you
run the risk of losing writes.

Example: the original memory location value is ABCD.

Two processors, P1 and P2, both read the memory location.

P1 writes X into the first byte, yielding XBCD.

P2 writes Y into the last byte, yielding ABCY.

Let's assume that both of these values are resident in their respective
processors' caches, but the caches are not cache coherent. If P1 evicts
first, then main memory and other processors will see XBCD; if P2 then
evicts, P1's write of X will disappear and memory will contain ABCY.

Writes can be lost in this way whenever the bitmasks used to merge the
evicted cache lines are of coarser granularity than the minimum write
size in the instruction set.
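
A minimal C sketch of that merge-on-eviction, assuming a hypothetical
4-byte location and simulating the two eviction orders in one process,
shows the difference: with per-byte dirty masks both writes survive;
with a whole-word dirty flag, whichever processor evicts last wins.

#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define WORD 4

/* Merge an evicted cached copy back to memory under a per-byte dirty mask. */
static void evict_bytemask(char mem[WORD], const char cached[WORD], uint8_t mask)
{
    for (int i = 0; i < WORD; i++)
        if (mask & (1u << i))
            mem[i] = cached[i];
}

/* Word-granularity "mask": if anything in the word is dirty, the whole
 * word is written back, clobbering other processors' byte writes. */
static void evict_wordmask(char mem[WORD], const char cached[WORD], int dirty)
{
    if (dirty)
        memcpy(mem, cached, WORD);
}

int main(void)
{
    char mem[WORD] = { 'A', 'B', 'C', 'D' };

    /* P1 and P2 each read the line, then write one byte locally. */
    char p1[WORD], p2[WORD];
    memcpy(p1, mem, WORD); p1[0] = 'X';   /* P1 writes the first byte */
    memcpy(p2, mem, WORD); p2[3] = 'Y';   /* P2 writes the last byte  */

    /* Byte-granularity masks: both writes survive, in either eviction order. */
    char m1[WORD]; memcpy(m1, mem, WORD);
    evict_bytemask(m1, p1, 0x1);
    evict_bytemask(m1, p2, 0x8);
    printf("byte masks: %.4s\n", m1);     /* prints XBCY */

    /* Word-granularity flag: the second eviction wins, P1's X is lost. */
    char m2[WORD]; memcpy(m2, mem, WORD);
    evict_wordmask(m2, p1, 1);
    evict_wordmask(m2, p2, 1);
    printf("word masks: %.4s\n", m2);     /* prints ABCY */
    return 0;
}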

I *think* that this may be important. People complain about non cache
coherent systems. But if you think about it, non cache coherent systems
really have several different behaviors that a naïve programmer might
find surprising:

A) different processors may see different values in the same memory
location at the same time. Sure, this is confusing, but it is rather
inherent in non cache coherent systems; preventing it is the whole point
of cache coherent systems.

B) non cache coherent systems usually have weak memory ordering.

C) writes get lost, as I describe above.

Write-back coherence protocols often solve all of these problems at the
same time.

Write-through cache protocols may solve them all, but often solve only A
and C, leaving B, weak memory ordering. (The presenters of the memory
tutorial at ISCA earlier this year defined it succinctly: on strongly
ordered IBM systems with write-through caches, you must ensure that all
other copies of the cache line are invalidated before the write-through
is performed. To which I add: on a weakly ordered write-through system,
you perform the write-through and perform the invalidations as a side
effect of snooping the write-through. I.e., strongly ordered
write-through systems essentially perform a read-for-ownership before
the write-through.)
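
Restated as a sketch (the helper names are hypothetical, not any real
protocol's interface): the strongly ordered variant makes the stale
copies disappear before the new value can be observed; the weakly
ordered variant makes the new value observable first and lets the
invalidations trail behind the snoop.

#include <stdio.h>

/* Toy event traces for the two write-through orderings described above. */
static void invalidate_other_copies(void) { puts("  invalidate other copies, wait for acks"); }
static void write_through(void)           { puts("  write-through to memory"); }
static void snoop_invalidate(void)        { puts("  other caches invalidate on snooping the write-through"); }

int main(void)
{
    puts("strongly ordered (read-for-ownership first):");
    invalidate_other_copies();   /* stale copies are gone ...           */
    write_through();             /* ... before the new value is visible */

    puts("weakly ordered:");
    write_through();             /* new value visible immediately       */
    snoop_invalidate();          /* stale copies linger until the snoop */
    return 0;
}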

The whole point of this exercise is to try to reduce the overhead of
cache coherency, but people have demonstrated that they don't like the
semantic consequences. So I am trying a different combination: allow A,
multiple values; allow B, weak ordering; but disallow C, losing writes.

I suspect that this may be more acceptable and may lead to fewer bugs.

I.e. I am suspecting that full cache coherency is overkill, but that
completely eliminating cache coherency is underkill.

- - -

*This* post, by the way, is composed almost exclusively by speech
recognition, using the pen for certain trivial edits. It's nice to find
a way that I can actually compose stuff on a plane again.

From: "Andy "Krazy" Glew" on
Terje Mathisen wrote:
> Andy "Krazy" Glew wrote:
>> I think SMB-WB-bitmask is more likely...

> Isn't this _exactly_ the same as the current setup on some chips that
> use 128-byte cache lines, split into two sectors of 64 bytes each?
>
> I.e. an effective cache line size that is smaller than the "real" line
> size, taken to its logical end point.
>
> I would suggest that (as you note) register size words is the smallest
> item you might need to care about and track, so 8 bits for a 64-bit
> platform with 64-byte cache lines, but most likely you'll have to
> support semi-atomic 32-bit operations, so 16 bits which is a 3% overhead.
>
> Terje

I know that I meant to reply to this on the airplane going to my
parents' wedding anniversary. I just dug the post out of my drafts folder.

Briefly:

Sectors are usually the burst size of memory.

Anything coarser-grained than byte granularity gives rise to the
possibility of losing writes. How bad is that? We already have the
possibility of losing writes when we write to individual bits within a
word or byte. Maybe we can increase the granularity?
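
For concreteness, the tracking overhead on a 64-byte (512-bit) line at
each candidate granularity (simple arithmetic, assuming one mask bit per
tracked unit):

  per-byte mask:        64 bits / 512 data bits = 12.5% overhead
  per-32-bit-word mask: 16 bits / 512 data bits =  3.1% overhead
  per-64-bit-word mask:  8 bits / 512 data bits =  1.6% overhead
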
From: "Andy "Krazy" Glew" on
nmm1(a)cam.ac.uk wrote:
>> I think SMB-WB-bitmask is more likely to be important than the weak
>> models 0.7 and 0.9,
>> in part because I am in love with new ideas
>> but also because I think it scales better.
>
> It also matches language specifications much better than most of the
> others, which is not a minor advantage. That could well be the
> factor that gets it accepted, if it is.

This is my thinking. Language specifications, sure, but I think they're
mainly important because they indicate what the programmer expects: not
losing writes.

Note: languages that allow bit fields to be specified, such as int a:1,
suffer the same lossage, losing writes, for such sub-byte accesses even
on cache coherent shared memory subsystems. Unless, that is, they
generate interlocked RMWs such as LOCK BTS for all such bit accesses.
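
A small, deterministic C illustration of that lossage, modeling the
interleaving by hand rather than with real threads: each "thread" does
the plain load/modify/store of the containing word that a compiler
typically emits for a non-interlocked bit-field update, and the first
update is lost even though the memory system itself did nothing weak.

#include <stdio.h>

/* Two single-bit fields that share a byte, as in:  int a:1, b:1  */
struct flags { unsigned a:1, b:1; };

int main(void)
{
    struct flags shared = { 0, 0 };

    struct flags t1 = shared;   /* thread 1 loads the containing word  */
    struct flags t2 = shared;   /* thread 2 loads the same word        */

    t1.a = 1;                   /* thread 1 sets its field             */
    t2.b = 1;                   /* thread 2 sets its field             */

    shared = t1;                /* thread 1 stores the whole word back */
    shared = t2;                /* thread 2 stores the whole word back:
                                   thread 1's update of a is lost      */

    printf("a=%d b=%d\n", shared.a, shared.b);   /* prints a=0 b=1 */
    return 0;
}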

Hmm... Here's an idea: in the bad old days you would never want to
generate locked instructions if you could avoid them. Bus locks are
really slow. But the trend has been to make cache locks really, really
cheap. They are on the verge of being as cheap as unlocked operations
if they hit in the cache, or if they miss but are uncontended.

Perhaps, if something like SMB-WB-bitmask is implemented at word
granularity rather than byte granularity, we should implement operations
that allow bytes not to be lost in much the same way that LOCK BTS
prevents bits from being lost, with unlocked BTS as a possible
optimization.

Such an instruction is: LOCK write bytes under mask.
I.e. LOCK mem := (mem & mask) | (stdata & ~mask)

Or even LOCK write bits under mask.

.... Or perhaps we would just want to use hardware that did this to
implement byte writes.
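
As a sketch of what that instruction would do, here is a hypothetical
C11 emulation, using a compare-and-swap loop on the containing 64-bit
word in place of the locked operation; it follows the mask convention of
the formula above, where clear mask bits select bytes from stdata.

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

/* Emulation of "LOCK write bytes under mask" per the formula above:
 *     mem := (mem & mask) | (stdata & ~mask)
 * i.e. bytes whose mask bits are clear are taken from stdata.
 * A compare-and-swap loop stands in for the hypothetical locked op. */
static void lock_write_under_mask(_Atomic uint64_t *mem,
                                  uint64_t stdata, uint64_t mask)
{
    uint64_t old = atomic_load(mem);
    uint64_t desired;
    do {
        desired = (old & mask) | (stdata & ~mask);
    } while (!atomic_compare_exchange_weak(mem, &old, desired));
}

int main(void)
{
    _Atomic uint64_t word = 0x4444333322221111ull;

    /* Write only the low 16 bits (clear mask bits select stdata). */
    lock_write_under_mask(&word, 0xAAAA, ~0xFFFFull);

    printf("%016llx\n", (unsigned long long)atomic_load(&word));
    /* prints 444433332222aaaa */
    return 0;
}
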
From: "Andy "Krazy" Glew" on
Andy "Krazy" Glew wrote:
> Hmm... Here's an idea: in the bad old days you would never want to
> generate locked instructions if you could avoid them. Bus locks are
> really slow. But the trend is being to make cache locks really really
> cheap. They are on the verge of being as cheap as unlocked operations
> if they hit in the cache, or if they miss but are uncontended.
>
> .... Or perhaps we would just want to use hardware that did this to
> implement byte writes.

Urg. But of course, the way we get cheap LOCKs is cache coherency. And
we are trying to avoid cache coherency.

Cache coherency only for byte accesses? :>-(