From: Alexander Terekhov on

Alexander Terekhov wrote:
> Ricardo Bugalho wrote:
> >
> > On Wed, 31 Aug 2005 21:57:58 +0000, Seongbae Park wrote:
> >
> > > I didn't bother to look at IA64 manual - anybody care to comment on this ?
> > > but I suspect that IA64 is RCpc and the manual is exactly correct after
> > > all.
> >
> > It's RCpc indeed.
> Not quite. Release stores to *WB* memory are constrained to ensure
> "remote write atomicity". Classic RCpc is weaker in this respect
> (and that's what makes RC != TSO). You better not rely on this
PC, not RC. -------------+

> property because emulating it on CELLs (for example) will make your
> ports run really slow. ;-)

From: Joe Seigh on
Ricardo Bugalho wrote:
> On Wed, 31 Aug 2005 21:57:58 +0000, Seongbae Park wrote:
>>I didn't bother to look at IA64 manual - anybody care to comment on this ?
>>but I suspect that IA64 is RCpc and the manual is exactly correct after
> It's RCpc indeed.

So what does "manual is exactly correct" in this case mean? Are
IA-32 loads equivalent to IA64 ld.acq and they are not equivalent
to IA64 ld? I.e. the latter can't emulate an IA-32 load in all cases.

Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

From: Alexander Terekhov on

Joe Seigh wrote:
> Are IA-32 loads equivalent to IA64 ld.acq and they are not equivalent
> to IA64 ld?

The ordering constraints are equivalent for IA32 loads and IA64 acquire
loads. But IA64 release stores to WB memory are more constrained than PC
stores, and IA32-under-IA64 effectively runs in TSO for WB memory, not

From: Eric P. on
Ricardo Bugalho wrote:
> On Wed, 31 Aug 2005 18:02:34 -0400, Eric P. wrote:
> >
> > I think the underlying question you asked about the x86 is:
> >
> > Does the Intel Processor Consistency model require processors to wait
> > for all other processors to acknowledge receipt of their invalidates
> > before any are allowed to use the new value?
> >
> It does not.
> The most straightforward example is buffered store forwarding: when a CPU
> writes a value into memory, it can read it again directly from the store
> buffer, even before it tries to make it visible to other processors.

I meant with regard to other processors not to itself.

Within a processor, yes, the docs explicitly state that
data from buffered writes can be forwarded to waiting reads.
As I understand it, while such local forwarding can have consequences
for consistency models, presumably because it allows subsequent
instructions to complete earlier than they otherwise would have,
it should not have an effect on remote data update ordering.

In short, store to load forwarding, in and of itself, would not
allow a new value of Y to arrive at P3 before the new value of X.

It seems to me that for this to occur would require both of:
(a) the cache protocol to distribute updates in a non atomic manner by
allowing a new value to be available before all acks are received.
(b) the bus topology and protocol to somehow allow a message to get
from P1 to P2 then P2 to P3 passing the one from P1 to P3,
possibly due to an error and retransmit.
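
The two conditions above can be played out in a toy model. What follows is only a sketch in Python, not a description of any real cache protocol or bus: the `System` class, its per-pair FIFO channels, and `run_example` are all invented for illustration. Each processor has a private view of memory, a store is applied locally at once and then queued separately to every other processor (condition a), and the delivery order between different channels is unconstrained (condition b).

```python
from collections import deque


class System:
    """Toy model: private memory views plus one FIFO channel per (src, dst) pair."""

    def __init__(self, names, variables):
        self.views = {n: {v: 0 for v in variables} for n in names}
        self.channels = {(s, d): deque() for s in names for d in names if s != d}

    def store(self, proc, var, value):
        # The storing processor sees its own store immediately.
        self.views[proc][var] = value
        # The update is queued to each other processor independently,
        # so it can become visible to them at different times.
        for (src, _dst), queue in self.channels.items():
            if src == proc:
                queue.append((var, value))

    def deliver(self, src, dst):
        # Apply the oldest undelivered store from src to dst's view.
        var, value = self.channels[(src, dst)].popleft()
        self.views[dst][var] = value

    def load(self, proc, var):
        return self.views[proc][var]


def run_example():
    system = System(["P1", "P2", "P3"], ["X", "Y"])
    system.store("P1", "X", 1)
    system.deliver("P1", "P2")      # P2 receives X=1; P3's copy is still in flight
    if system.load("P2", "X") == 1:
        system.store("P2", "Y", 1)  # P2 stores Y only after seeing the new X
    system.deliver("P2", "P3")      # Y=1 reaches P3 ahead of X=1
    y = system.load("P3", "Y")
    x = system.load("P3", "X")
    system.deliver("P1", "P3")      # X=1 finally arrives at P3
    return y, x, system.load("P3", "X")


print(run_example())
```

In this model P3 loads the new Y and then the old X, even though each channel is FIFO and every processor executes in order. Store to load forwarding plays no part in the outcome; only the non-atomic, unordered delivery does.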


From: Andy Glew on

Bottom quoting: asbestos donned!

I think that Joe Seigh has incorrectly assumed that processor
consistency implies (a) a global ordering of all loads, and (b) causal

This is not true. At least, I am fairly certain that there is a
causal ordering memory model that is intermediate in semantics between
processor consistency and sequential consistency. (Google finds lots of
papers; I specifically recall Mosberger's survey.) And I do not
believe that I have ever seen a proof that processor consistency
implies a global ordering of all loads; I don't think such a proof
exists; I would be interested to see it if it does; and I strongly
suspect that there is a proof that orderings consistent with processor
consistency may violate causal ordering. Indeed, Joe may have
provided one.

(I do confess that I have occasionally wanted to move from processor
consistency to causal consistency, mainly because causal consistency
sounds like it should be easier to make proofs for; but I am not sure
if causal consistency is any easier to implement than sequential
consistency. Since sequential consistency is easy enough to
implement, I suspect that if we tighten up the memory model we will go
all the way.)

Nearly all statements in processor consistency are local.
For processors Pi, i = ...

Each Pi has a set of instructions Pi.Ij, some of which are loads, some
of which are stores. Notationally Pi.Lj and Pi.Sj, where the index
sets for Lj and Sj are not necessarily contiguous.

Each Pi also sees external stores in some order Pi.Xk.

The sequence of external stores seen by Pi, Pi.Xk, can be formed out
of an interleaving of the set of stores from all other processors Pm.Sj,
m!=i. The only real constraint is that in this interleaving all of
the stores from a particular processor Pm.Sj appear in the order in
which they occurred on that processor; stores from a given processor
are not reordered in the sequence.

The sequence of external stores Pi.Xk is not necessarily equal to
Pj.Xk, for different processors i and j. I.e. although stores from
any single processor are performed in order at any other processor,
other processors do not necessarily see stores from different
processors interleaved in the same order. I.e. there is no single
global store order.
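
A small sketch may make the constraint concrete. This is Python written for illustration only: the function name `respects_program_order`, the processor names, and the store labels are assumptions of this example, and the check captures just the one rule stated above (stores from any single processor stay in order), not a complete definition of processor consistency.

```python
def respects_program_order(observed, program_orders):
    """Return True if `observed` is an interleaving of all stores in
    `program_orders` that keeps each source processor's stores in order.

    observed:       list of (processor, store) pairs as seen by one observer
    program_orders: dict mapping processor -> list of its stores, in order
    """
    # How many stores from each source processor have been seen so far.
    position = {proc: 0 for proc in program_orders}
    for proc, store in observed:
        if proc not in program_orders:
            return False
        stores = program_orders[proc]
        idx = position[proc]
        # The next store seen from `proc` must be its next store in program order.
        if idx >= len(stores) or stores[idx] != store:
            return False
        position[proc] = idx + 1
    # Every store must eventually be observed.
    return all(position[p] == len(program_orders[p]) for p in program_orders)


program_orders = {"P1": ["A=1", "B=1"], "P2": ["C=1"]}

# Two observers see different interleavings; both are permitted.
seen_by_p3 = [("P1", "A=1"), ("P2", "C=1"), ("P1", "B=1")]
seen_by_p4 = [("P2", "C=1"), ("P1", "A=1"), ("P1", "B=1")]

# This sequence reorders P1's own stores, which is not permitted.
reordered = [("P1", "B=1"), ("P1", "A=1"), ("P2", "C=1")]

print(respects_program_order(seen_by_p3, program_orders))
print(respects_program_order(seen_by_p4, program_orders))
print(respects_program_order(reordered, program_orders))
```

The first two sequences differ from each other yet both pass the check, which is the sense in which there is no single global store order.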

Instruction execution at a single Pi proceeds as if one instruction at
a time were executed, with some interleaving of the external stores
Pi.Xk. I.e. from the point of view of the local processor, its loads
Pi.Lj are performed in order, and in order with the local stores
Pi.Sj. More specifically, there can be constructed an ordering Pi.Mx
which is an interleaving of Pi.Ij (and hence Pi.Lj and Pi.Sj) and
Pi.Xk, and local processor execution is consistent with such an
ordering Pi.Mx.

Note: we say "there can be constructed an ordering". But, so far as I
know, there is no easy way to construct such an ordering for a
particular processor. We know that one could be constructed, but we
don't know what it is. And there is certainly no easy way to construct
it in an online manner.
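
For a case as small as the three-processor example from this thread (P1 stores X, P2 sees that and stores Y, P3 loads Y and then X), brute force is enough to list every candidate ordering. The Python below is a sketch under assumptions: `interleavings` and `load_outcomes` are names invented here, the observer has only loads and no local stores, and the set of permitted external store orders is passed in by hand rather than derived from any formal model.

```python
from itertools import combinations


def interleavings(a, b):
    """Yield every merge of lists a and b that keeps each list's internal order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        slot_set = set(slots)
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slot_set else next(ib) for i in range(n)]


def load_outcomes(local_loads, external_store_orders, initial):
    """Collect the load results for every ordering Mx of the observer's
    in-order loads merged with each permitted external store sequence Xk."""
    outcomes = set()
    for xk in external_store_orders:
        stores = [("store", var, val) for var, val in xk]
        loads = [("load", var, None) for var in local_loads]
        for mx in interleavings(loads, stores):
            memory = dict(initial)
            results = []
            for kind, var, val in mx:
                if kind == "store":
                    memory[var] = val
                else:
                    results.append(memory[var])
            outcomes.add(tuple(results))
    return outcomes


initial = {"X": 0, "Y": 0}
# If P3 had to see P1's store to X before P2's store to Y:
causal_only = [[("X", 1), ("Y", 1)]]
# If P3 may see stores from different processors in either order:
either_order = causal_only + [[("Y", 1), ("X", 1)]]

# P3 loads Y, then X; each result tuple is (value of Y, value of X).
print(sorted(load_outcomes(["Y", "X"], causal_only, initial)))
print(sorted(load_outcomes(["Y", "X"], either_order, initial)))
```

The result (1, 0), new Y with old X, shows up only when the second external store order is allowed. P3's loads are in order in every ordering the sketch enumerates, so in-order loads alone do not rule that result out.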

And, again: there need not be a global ordering of stores from all
processors. And nor need there be a global ordering of loads.

A formal model must make a few more statements about the limited forms
of causality that are maintained in a processor consistent system.
(E.g. two party causality; three party causality is not maintained, to
the best of my knowledge.) And, to be perfectly honest, I forget what
statements need to be made to differentiate between the two sub-types
of processor consistency: Gharachorloo type I and type II, where in
the latter you can forward from a store buffer (an implementation


As Mitch says, the above can be briefly stated: WB memory is processor
consistent, type II. Describing the interaction of other memory types
is more complicated.


I do not know or care very much what the Itanium processor manual says
about x86 memory ordering. I wouldn't be surprised if they got it
wrong; or, as in the examples Joe provided, described a mapping which
has explanatory value, but not definitional value.


Joe Seigh <jseigh_01(a)> writes:

> MitchAlsup(a) wrote:
> > I didn't find it in the Intel book I have (Pentium Pro)
> > But chapter 7 in Volume 2 of AMD x86-64 Architecture Programmer's
> > Manual (System Programming) describes AMD's side of the situation,
> > starting on page 191 of the Purple Volume.
> > The problem is when you consider the number of memory modes {UC, CD,
> > WC, WP, WT and WB} that no simplistic statement can fully address what
> > the programmer can assume about memory and its ordering properties.
> > WriteBack (cacheable) memory is, however, Processor Consistent.
> >
> The argument being presented in c.p.t. is that processor consistency
> implies loads are in order, perhaps instigated by something Andy Glew
> said about this here
> AFAICT, this is not true for 3 or more processors. E.g.
> processor 1 stores into X
> processor 2 sees the store by 1 into X and stores into Y
> So, by causal reasoning, the store into Y occurred after the store into X.
> processor 3 loads from Y
> processor 3 loads from X
> If loads were in order you could infer that if processor 3
> sees the new value of Y then it will see the new value of X.
> But the rules for processor consistency *clearly* state that
> you will not necessarily see stores by different processors in
> order.
> While there are still ordering constraints on the loads they
> don't have to be strictly in order as Andy incorrectly infers.