From: Eric P. on
Alexander Terekhov wrote:
>
> "Eric P." wrote:
> [...]
> > I was wondering that myself. How about:
> > P3:
> > LD X
> > LFENCE
> > LD Y
> > LFENCE
> > LD X
>
> That won't change anything. For causality, you need to CAS X on P3.

Does the following basically reflect your reasoning:

Scenario:
processor 1 stores into X
processor 2 sees the store by 1 into X and stores into Y
processor 3 loads from Y
processor 3 loads from X

1) Processor Consistency intrinsically allows P3 to have a new
value for Y and a stale value for X. This can be accomplished,
for example, by allowing P1 to hand out new values for X to
some peers before ensuring all old values of X are invalid.

There may be an invalidate for X winging its way from P1 to P3,
but there is no guarantee when it will arrive (other than that it
will do so before the next store by P1 arrives at P3).
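
To make point 1 concrete, here is a minimal toy model in Python. It is
not real hardware and not anything taken from the Intel manuals. It
assumes each processor keeps its own copy of memory and that a store is
delivered to each peer separately, in FIFO order per sender. The names
ToyPC, store, deliver and load are made up for illustration.

```python
from collections import deque

class ToyPC:
    """Toy model: per-processor copies of memory, per-peer FIFO delivery."""

    def __init__(self, n_procs):
        self.view = [{"X": 0, "Y": 0} for _ in range(n_procs)]
        # inflight[src][dst] holds (var, value) updates travelling src -> dst
        self.inflight = [[deque() for _ in range(n_procs)]
                         for _ in range(n_procs)]

    def store(self, src, var, value):
        # The storing processor sees its own store at once; peers do not.
        self.view[src][var] = value
        for dst, queue in enumerate(self.inflight[src]):
            if dst != src:
                queue.append((var, value))

    def deliver(self, src, dst):
        # Deliver the oldest pending update from src to dst.
        var, value = self.inflight[src][dst].popleft()
        self.view[dst][var] = value

    def load(self, proc, var):
        return self.view[proc][var]

P1, P2, P3 = 0, 1, 2
m = ToyPC(3)
m.store(P1, "X", 1)      # processor 1 stores into X
m.deliver(P1, P2)        # the update reaches P2 first
m.store(P2, "Y", 1)      # processor 2 has seen X == 1 and stores into Y
m.deliver(P2, P3)        # P2's update to Y reaches P3
y = m.load(P3, "Y")      # processor 3 loads from Y
x = m.load(P3, "X")      # P1's update to X is still in flight to P3
print(y, x)              # prints "1 0": new Y, stale X
```

Nothing in the model breaks per-sender ordering, yet P3 ends up with
the non-causal result, because P1's store became visible to P2 before
it became visible to P3.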

2) SFENCE "guarantees that the results of every store instruction
that precedes the store fence in program order is globally visible
before any store instruction that follows the fence."
This is intended for use with weak ordered memory types.

The guarantee is that the value will be 'globally visible' at
some time in the future and before the next store, NOT that it
will be globally visible at the end of the SFENCE.

When used with normal Write Back cached memory under Processor
Consistency, this is exactly the same guarantee that PC already
provides, so the SFENCE does nothing to change invalidate delivery.

3) LFENCE does not explicitly guarantee to drain all pending
invalidates for a processor. However even assuming that was
just a documentation oversight and that it really does drain them,
since there is no guarantee that P3 will have received its
invalidate, an LFENCE on P3 does not guarantee X is not stale.
P3 can still receive the new Y, LFENCE to drain the invalidates
and read the old X.

(I considered whether LFENCE might perform a 'global sync' by
communicating with all peers and ensuring there were no outstanding
invalidates/updates in flight to itself before the drain, in order
to ensure X was up to date. However I don't believe this would
work unless the global sync was itself atomic.)
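
A rough Python sketch of point 3, under the generous assumption that
LFENCE really does drain pending invalidates. The two queues and the
helper names (network_to_p3, inbox_p3, arrive, lfence_drain) are
hypothetical; they only separate "still travelling" from "arrived at
P3 but not yet applied".

```python
from collections import deque

network_to_p3 = deque()   # updates still travelling towards P3
inbox_p3 = deque()        # updates that arrived at P3 but are not yet applied
cache_p3 = {"X": 0, "Y": 0}

def arrive():
    """One update finishes its trip and lands in P3's inbox."""
    inbox_p3.append(network_to_p3.popleft())

def lfence_drain():
    """Assumed LFENCE behaviour: apply everything already in the inbox."""
    while inbox_p3:
        var, value = inbox_p3.popleft()
        cache_p3[var] = value

network_to_p3.append(("X", 1))   # P1's update to X, still in transit
inbox_p3.append(("Y", 1))        # P2's update to Y, already arrived

lfence_drain()                   # P3 applies the pending update to Y
y = cache_p3["Y"]                # LD Y sees the new value
lfence_drain()                   # LFENCE: the inbox is empty, nothing to do
x_stale = cache_p3["X"]          # LD X still returns the old value

arrive()                         # only now does P1's update reach P3
lfence_drain()
x_later = cache_p3["X"]
print(y, x_stale, x_later)       # prints "1 0 1"
```

The drain can only act on what has already arrived, so it cannot help
with an invalidate that is still on its way.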

4) The only way to guarantee that a processor has the most recent
value of a location is to take ownership of the variable,
and that requires a write. Since we actually want to read X,
we use CAS (x86 LOCK CMPXCHG) to read the most recent value.

So in the presence of Processor Consistency, with its lack of
Atomic Visibility, the causally consistent sequence is:

P3:
LD Y, r1
Loop:
LD X, r2
CAS X, r2, r2
BEZ Loop
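
The same loop as a toy Python sketch. It encodes the assumption from
point 4 that a CAS has to take ownership of the line and therefore
acts on the most recent value, while a plain load may be served from a
stale local copy. OwnedCell, remote_store and read_latest are
illustrative names, not a real API.

```python
class OwnedCell:
    """Toy cell: 'home' is the latest value, 'stale_copy' is P3's local copy."""

    def __init__(self, value):
        self.home = value
        self.stale_copy = value

    def remote_store(self, value):
        # Another processor writes; P3's invalidate is still in flight.
        self.home = value

    def load(self):
        # Plain LD: may return the stale local copy.
        return self.stale_copy

    def cas(self, expected, new):
        # Assumed: CAS takes ownership, so it compares against 'home'.
        ok = self.home == expected
        if ok:
            self.home = new
        self.stale_copy = self.home   # ownership refreshes the local copy
        return ok

def read_latest(cell):
    """LD X, r2 / CAS X, r2, r2 / retry on failure."""
    attempts = 0
    while True:
        attempts += 1
        r2 = cell.load()
        if cell.cas(r2, r2):          # writes back the same value on success
            return r2, attempts

X = OwnedCell(0)
X.remote_store(1)                     # P1's store, not yet visible to P3
plain = X.load()
value, attempts = read_latest(X)
print(plain, value, attempts)         # prints "0 1 2"
```

The first CAS fails because the stale 0 no longer matches, and the
retry then reads and confirms the current value.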

Eric

From: David Hopwood on
Eric P. wrote:
> Joe Seigh wrote:
>>Alexander Terekhov wrote:
>>
>>>Neither will give you "global ordering of loads". Loads on ia32 are
>>>in-order with respect to other loads and subsequent stores (by the
>>>same processor). The only thing that differentiates PC from TSO is
>>>the lack of remote write atomicity (in IA64 formal memory model
>>>speak). Implementations (e.g. SPO) of course can do all sorts of
>>>tricks to improve performance, but that doesn't change the memory
>>>model. You're in denial.
>>
>>Whatever. I'm going to use LFENCE for situations where I'd use
>>#LoadLoad on sparc (generic, not assuming TSO). And it's not
>>because I'm in denial. It's because nothing you say is
>>comprehensible. It's possible you are making some kind of
>>valid technical point but I have no way of telling.
>
> As I understand it, the key to causal ordering is Atomic Visibility
> whereby a write becomes visible simultaneously to all processors
> other than the one that issued the write. According to Gharachorloo,
> Processor Consistency does not require updates be Atomically Visible
> and, in theory, allows non-causal ordering of the kind in your
> example. TSO does require Atomic Visibility.

Right.

[...]
> The text of LFENCE instruction in the Intel instruction manual says
> "Performs a serializing operation on all load-from-memory instructions
> that were issued prior the LFENCE instruction. This serializing
> operation guarantees that every load instruction that precedes in
> program order the LFENCE instruction is globally visible before any
> load instruction that follows the LFENCE instruction is globally
> visible. The LFENCE instruction is ordered with respect to load
> instructions, other LFENCE instructions,"...
>
> seems to provide the guarantees for global visibility and
> therefore causality that you are looking for.

It's not entirely clear what "globally visible" in the Intel manual
is supposed to mean in the terminology of
<http://research.compaq.com/wrl/people/kourosh/papers/1993-tr-68.pdf>,
but I think it means just "performed" (with respect to all processors),
*not* "globally performed".

--
David Hopwood <david.nospam.hopwood(a)blueyonder.co.uk>

From: David Hopwood on
Joe Seigh wrote:
> Alexander Terekhov wrote:
>
>> So where do you put the fence, then?
>>
>> : processor 1 stores into X
>> : processor 2 sees the store by 1 into X and stores into Y
>> : processor 3 loads from Y
>> : processor 3 loads from X
>
> Since this was my example I should clarify. It was meant to
> show that PC alone wasn't sufficient to guarantee that if processor
> 3 saw the store into Y by processor 2 that it would see the
> store into X by processor 1.
>
> My understanding of the ia32 memory model is that you
> need a fence instruction between the loads by processor 3
> and a fence between the load and store by processor 2 to
> make the guarantee work.

My understanding is that if the claimed problem exists at all, adding
these fences won't fix it (as far as the model is concerned, possibly
as opposed to implementation details of specific chips).

--
David Hopwood <david.nospam.hopwood(a)blueyonder.co.uk>

From: Alexander Terekhov on

David Hopwood wrote:

[... SSE2 LFENCE ...]

> It's not entirely clear what "globally visible" in the Intel manual

It's just copy&paste leftover from SSE1 SFENCE description.

regards,
alexander.

From: Joe Seigh on
David Hopwood wrote:
> Joe Seigh wrote:
>
>> Alexander Terekhov wrote:
>>
>>> So where do you put the fence, then?
>>>
>>> : processor 1 stores into X
>>> : processor 2 sees the store by 1 into X and stores into Y
>>> : processor 3 loads from Y
>>> : processor 3 loads from X
>>
>>
>> Since this was my example I should clarify. It was meant to
>> show that PC alone wasn't sufficient to guarantee that if processor
>> 3 saw the store into Y by processor 2 that it would see the
>> store into X by processor 1.
>>
>> My understanding of the ia32 memory model is that you
>> need a fence instruction between the loads by processor 3
>> and a fence between the load and store by processor 2 to
>> make the guarantee work.
>
>
> My understanding is that if the claimed problem exists at all, adding
> these fences won't fix it (as far as the model is concerned, possibly
> as opposed to implementation details of specific chips).
>

The architected memory model as opposed to the implemented one?

"Despite the fact that Pentium 4, Intel Xeon, and P6 family
processors support processor ordering, Intel does not guarantee that
future processors will support this model. To make software portable
to future processors, it is recommended that operating systems provide
critical region and resource control constructs and API's (application
program interfaces) based on I/O, locking, and/or serializing
instructions be used to synchronize access to shared areas of memory
in multiple-processor systems."

That one? And what do people think the memory model is, if only
"I/O, locking, and/or serializing instructions" can synchronize it?

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.