From: Alexander Terekhov on

Andy Glew wrote:
[...]
> briefly stated: WB memory is processor consistent, type II.

Would you please confirm that in order to get SC semantics for x86 WB
memory, I just need to replace all loads by lock-cmpxchg with 42 in
accumulator and simply use resulting value in accumulator after cmpxchg
as load operation result... which would also provide store-load fencing
inside cmpxchg with respect to load from DEST?

TIA.

regards,
alexander.

From: Eric P. on
Alexander Terekhov wrote:
>
> "Eric P." wrote:
> [...]
> > I was wondering that myself. How about:
> > P3:
> > LD X
> > LFENCE
> > LD Y
> > LFENCE
> > LD X
>
> That won't change anything. For causality, you need to CAS X on P3.

Does the following basically reflect your reasoning:

Scenario:
processor 1 stores into X
processor 2 sees the store by processor 1 into X and stores into Y
processor 3 loads from Y
processor 3 loads from X

1) Processor Consistency intrinsically allows P3 to have a new
value for Y and a stale value for X. This can be accomplished,
for example, by allowing P1 to hand out new values for X to
some peers before ensuring all old values of X are invalid.

There may be an invalidate for X winging its way from P1 to P3,
but there is no guarantee when it will arrive (other than that it
will do so before the next store by P1 arrives at P3).
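
A toy model of point 1, as a sketch only: the names ToyPC, store, load and deliver are hypothetical, and the per-pair FIFO channels are an assumption made for illustration, not a description of any real x86 implementation. Each processor holds its own copy of memory and a store travels to each peer separately, so P3 can see the new Y while the update to X is still in flight.

```cpp
#include <array>
#include <deque>
#include <map>
#include <string>

// Hypothetical toy model: one FIFO channel per (source, destination) pair.
// FIFO per channel keeps one processor's stores in order for every observer,
// but channels are drained independently, so a store is NOT atomically
// visible to all peers.
struct ToyPC {
    static constexpr int kProcs = 3;
    struct Msg { std::string var; int value; };
    std::array<std::map<std::string, int>, kProcs> view;           // local copies
    std::array<std::array<std::deque<Msg>, kProcs>, kProcs> chan;  // chan[src][dst]

    void store(int p, const std::string& var, int value) {
        view[p][var] = value;  // a processor sees its own store at once
        for (int q = 0; q < kProcs; ++q)
            if (q != p) chan[p][q].push_back({var, value});
    }
    // Variables never written read as 0.
    int load(int p, const std::string& var) { return view[p][var]; }
    // Deliver the oldest pending store from src to dst, if there is one.
    void deliver(int src, int dst) {
        auto& c = chan[src][dst];
        if (c.empty()) return;
        view[dst][c.front().var] = c.front().value;
        c.pop_front();
    }
};

struct Observed { int y; int x; };

// One schedule the model allows: P3 ends up with new Y and stale X.
Observed run_stale_x_schedule() {
    ToyPC m;
    const int P1 = 0, P2 = 1, P3 = 2;
    m.store(P1, "X", 1);
    m.deliver(P1, P2);  // P2 sees X=1; the copy bound for P3 is still in flight
    if (m.load(P2, "X") == 1) m.store(P2, "Y", 1);
    m.deliver(P2, P3);  // Y=1 reaches P3 before X=1 does
    Observed r;
    r.y = m.load(P3, "Y");
    r.x = m.load(P3, "X");
    return r;
}
```

Under this schedule P3 observes Y = 1 and X = 0, which is the non-causal outcome under discussion.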

2) SFENCE "guarantees that the results of every store instruction
that precedes the store fence in program order is globally visible
before any store instruction that follows the fence."
This is intended for use with weak ordered memory types.

The guarantee is that the value will be 'globally visible' at
some time in the future and before the next store, NOT that it
will be globally visible at the end of the SFENCE.

When used with normal write-back (WB) memory under Processor
Consistency, this is exactly the guarantee that PC already provides,
so the SFENCE does nothing to change invalidate delivery.

3) LFENCE does not explicitly guarantee to drain all pending
invalidates for a processor. However even assuming that was
just a documentation oversight and that it really does drain them,
since there is no guarantee that P3 will have received its
invalidate, an LFENCE on P3 does not guarantee X is not stale.
P3 can still receive the new Y, LFENCE to drain the invalidates
and read the old X.
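
Point 3 can be sketched the same way. This is a hypothetical model, not documented LFENCE behaviour: it assumes, as the paragraph above does for the sake of argument, that a fence applies every update that has already arrived at P3. Observer, send, arrive and drain are illustrative names.

```cpp
#include <deque>
#include <map>
#include <string>

// Hypothetical model of a single observer (P3). An update is first in flight
// from its sender, then arrived at P3, and affects loads only once applied.
struct Observer {
    struct Update { std::string var; int value; };
    std::map<int, std::deque<Update>> in_flight;  // keyed by sending processor
    std::deque<Update> arrived;                   // at P3, not yet applied
    std::map<std::string, int> cache;             // what P3's loads return

    void send(int src, const std::string& var, int value) {
        in_flight[src].push_back({var, value});
    }
    // Move the oldest in-flight update from src into P3's arrived queue.
    void arrive(int src) {
        auto& q = in_flight[src];
        if (q.empty()) return;
        arrived.push_back(q.front());
        q.pop_front();
    }
    // ASSUMED fence behaviour: apply everything that has already arrived.
    void drain() {
        for (const Update& u : arrived) cache[u.var] = u.value;
        arrived.clear();
    }
    int load(const std::string& var) { return cache[var]; }  // 0 if unseen
};

struct FenceResult { int y; int x_after_fence; int x_eventually; };

FenceResult run_fence_schedule() {
    Observer p3;
    p3.send(1, "X", 1);  // P1 stores X; the update for P3 is in flight
    p3.send(2, "Y", 1);  // P2 saw X=1 and stored Y
    p3.arrive(2);        // only P2's update has reached P3 so far
    p3.drain();
    FenceResult r;
    r.y = p3.load("Y");
    p3.drain();          // the "LFENCE": nothing from P1 has arrived to drain
    r.x_after_fence = p3.load("X");
    p3.arrive(1);        // the update for X finally arrives
    p3.drain();
    r.x_eventually = p3.load("X");
    return r;
}
```

The drain between the two loads has nothing to act on, so in this model the fence leaves X stale until the in-flight update actually arrives.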

(I considered whether LFENCE might perform a 'global sync' by
communicating with all peers and ensure there were no outstanding
invalidates/updates in flight to itself before the drain in order
to ensure X was up to date. However I don't believe this would
work unless the global sync was itself atomic.)

4) The only way to guarantee that a processor has the most recent
value of a location is to take ownership of the variable,
and that requires a write. Since we actually want to read X,
we use CAS (x86 LOCK CMPXCHG) to read the most recent value.

So in the presence of Processor Consistency, with its lack of
Atomic Visibility, the causally consistent sequence is:

P3:
LD Y, r1
Loop:
LD X, r2
CAS X, r2, r2
BEZ Loop
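
The loop above might look as follows in C++. This is only a sketch: load_via_cas_loop is a hypothetical name, and the assumption that the compiler turns compare_exchange_weak into LOCK CMPXCHG is typical for x86 toolchains but is not something the C++ standard promises.

```cpp
#include <atomic>

// Read x by way of a read-modify-write instead of a plain load.
int load_via_cas_loop(std::atomic<int>& x) {
    int seen = x.load(std::memory_order_relaxed);  // LD X, r2
    // CAS X, r2, r2: try to replace the value just read with itself.
    // On failure compare_exchange_weak refreshes `seen` with the current
    // value and the loop retries (the BEZ Loop branch).
    while (!x.compare_exchange_weak(seen, seen)) {
    }
    return seen;  // value observed by the successful CAS
}
```

The stored value never changes; the point of the CAS is only that the read is performed as part of an atomic read-modify-write on X.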

Eric

From: Alexander Terekhov on

"Eric P." wrote:
[...]
> Does the following basically reflect your reasoning:

[... 1 - 3 ...]

Yes.

> 4) The only way to guarantee that a processor has the most recent
> value of a location is to take ownership of the variable,
> and that requires a write. Since we actually want to read X,
^^^^^^^^^^^^^^^^^^^^^^^^^

That's the key.

> we use CAS (x86 LOCK CMPXCHG) to read the most recent value.
>
> So in the presence of Processor Consistency, with its lack of
> Atomic Visibility, the causally consistent sequence is:
>
> P3:
> LD Y, r1
> Loop:
> LD X, r2
> CAS X, r2, r2
> BEZ Loop

That will work too, but you don't really need to LD X and loop on
CAS compare failure given that x86's cmpxchg always makes a write.
"The destination operand is written back if the comparison fails;
otherwise, the source operand is written into the destination. (The
processor never produces a locked read without also producing a
locked write.)"

So just do cmpxchg(&X, 42, 42) which will perform locked read-write
(with its read part store-load fenced from prior writes, I infer).
You'll get classic SC if you replace all loads with cmpxchg(&X, 42,
42). That's my understanding, and I'm eagerly awaiting confirmation
from Andy Glew and/or someone from Intel hanging out on the C++ memory
model mailing list.
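
A sketch of the cmpxchg(&X, 42, 42) idea in portable C++, with caveats: load_via_cas is a hypothetical helper, and whether the call becomes a single LOCK CMPXCHG is up to the compiler. In the C++ abstract machine a failed compare-exchange performs no store, so the "always writes" property quoted above belongs to the x86 instruction, not to std::atomic.

```cpp
#include <atomic>

// Return the current value of x using one compare-exchange and no plain load.
int load_via_cas(std::atomic<int>& x) {
    int expected = 42;  // arbitrary guess, playing the role of the accumulator
    // If x == 42 the CAS succeeds and stores 42 again; expected stays 42.
    // Otherwise it fails and expected is overwritten with x's current value.
    // Either way expected now holds the value that was observed.
    x.compare_exchange_strong(expected, 42);
    return expected;
}
```

No retry loop is needed because both outcomes leave the observed value in `expected`, and in neither case does the value of x change.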

http://tinyurl.com/aqgjj

regards,
alexander.

From: David Hopwood on
Eric P. wrote:
> Joe Seigh wrote:
>>Alexander Terekhov wrote:
>>
>>>Neither will give you "global ordering of loads". Loads on ia32 are
>>>in-order with respect to other loads and subsequent stores (by the
>>>same processor). The only thing that differentiates PC from TSO is
>>>the lack of remote write atomicity (in IA64 formal memory model
>>>speak). Implementations (e.g. SPO) of course can do all sorts of
>>>tricks to improve performance, but that doesn't change the memory
>>>model. You're in denial.
>>
>>Whatever. I'm going to use LFENCE for situations where I'd use
>>#LoadLoad on sparc (generic, not assuming TSO). And it's not
>>because I'm in denial. It's because nothing you say is
>>comprehensible. It's possible you are making some kind of
>>valid technical point but I have no way of telling.
>
> As I understand it, the key to causal ordering is Atomic Visibility
> whereby a write becomes visible simultaneously to all processors
> other than the one that issued the write. According to Gharachorloo,
> Processor Consistency does not require updates to be Atomically Visible
> and, in theory, allows non-causal ordering of the kind in your
> example. TSO does require Atomic Visibility.

Right.

[...]
> The text of LFENCE instruction in the Intel instruction manual says
> "Performs a serializing operation on all load-from-memory instructions
> that were issued prior the LFENCE instruction. This serializing
> operation guarantees that every load instruction that precedes in
> program order the LFENCE instruction is globally visible before any
> load instruction that follows the LFENCE instruction is globally
> visible. The LFENCE instruction is ordered with respect to load
> instructions, other LFENCE instructions,"...
>
> seems to provide the guarantees for global visibility and
> therefore causality that you are looking for.

It's not entirely clear what "globally visible" in the Intel manual
is supposed to mean in the terminology of
<http://research.compaq.com/wrl/people/kourosh/papers/1993-tr-68.pdf>,
but I think it means just "performed" (with respect to all processors),
*not* "globally performed".

--
David Hopwood <david.nospam.hopwood(a)blueyonder.co.uk>

From: David Hopwood on
Joe Seigh wrote:
> Alexander Terekhov wrote:
>
>> So where do you put the fence, then?
>>
>> : processor 1 stores into X
>> : processor 2 sees the store by processor 1 into X and stores into Y
>> : processor 3 loads from Y
>> : processor 3 loads from X
>
> Since this was my example I should clarify. It was meant to
> show that PC alone wasn't sufficient to guarantee that if processor
> 3 saw the store into Y by processor 2 that it would see the
> store into X by processor 1.
>
> My understanding of the ia32 memory model is that you
> need a fence instruction between the loads by processor 3
> and a fence between the load and store by processor 2 to
> make the guarantee work.

My understanding is that if the claimed problem exists at all, adding
these fences won't fix it (as far as the model is concerned, possibly
as opposed to implementation details of specific chips).

--
David Hopwood <david.nospam.hopwood(a)blueyonder.co.uk>