From: Joe Seigh on
Alexander Terekhov wrote:
> So where do you put the fence, then?
>
> : processor 1 stores into X
> : processor 2 sees the store by 1 into X and stores into Y
> : processor 3 loads from Y
> : processor 3 loads from X
>

Since this was my example, I should clarify. It was meant to show
that processor consistency (PC) alone wasn't sufficient to guarantee
that if processor 3 saw the store into Y by processor 2, it would
also see the store into X by processor 1.

My understanding of the ia32 memory model is that you
need a fence instruction between the loads by processor 3
and a fence between the load and store by processor 2 to
make the guarantee work.


--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.
From: Alexander Terekhov on

Joe Seigh wrote:
[...]
> My understanding of the ia32 memory model is that you
> need a fence instruction between the loads by processor 3

And what are you going to do on a (hypothetical) quad 486 (or
some other old ia32) box without SSE fences? ;-)

regards,
alexander.
From: Alexander Terekhov on

Joe Seigh wrote:
[...]
> My understanding of the ia32 memory model is that you
> need a fence instruction between the loads by processor 3

LFENCE/#LoadLoad is implied by processor consistency.

> and a fence between the load and store by processor 2 to
> make the guarantee work.

#LoadStore fence for P2 (load X ... store Y) is also implied by
processor consistency.

So what's the point?

regards,
alexander.
From: Eric P. on
Alexander Terekhov wrote:
>
> "Eric P." wrote:
> [...]
> > I was wondering that myself. How about:
> > P3:
> > LD X
> > LFENCE
> > LD Y
> > LFENCE
> > LD X
>
> That won't change anything. For causality, you need to CAS X on P3.

Yeah. X could change again after the first fence. Silly me. :-)
I was trying to work around the fact that the LFENCE definition does
NOT require all queued invalidates to be delivered before proceeding.
That might allow the update to X to remain outstanding.
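
For the record, here is a sketch (mine, in C with gcc-style inline
asm, so take it with a grain of salt) of what the CAS'd version of P3
might look like. The lock'd cmpxchg replaces the plain load of X, and
since the comparand and the "new" value are the same arbitrary
number, X itself is never modified:

  volatile int X, Y;

  int p3(void)
  {
      int y    = Y;             /* plain load from Y */
      int x    = 42;            /* arbitrary comparand, preloaded into eax */
      int same = 42;            /* "new" value == comparand, so X never changes */

      __asm__ __volatile__ (
          "lock; cmpxchgl %2, %1"   /* locked RMW read of X */
          : "+a" (x), "+m" (X)
          : "r" (same)
          : "memory", "cc");

      /* eax (now in x) holds X's current value; the idea is that the
         locked RMW is ordered like a full fence, which the LFENCE'd
         plain load is not. */
      return (y == 1 && x == 1);
  }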

It would be simpler if they had used definitions like those for the
Alpha Memory Barrier (MB) instruction:

"MB and CALL_PAL IMB force all preceding writes to at least reach
their respective coherency points. This does not mean that main-memory
writes have been done, just that the order of the eventual writes is
committed.

MB and CALL_PAL IMB also force all queued cache invalidates to be
delivered to the local caches before starting any subsequent reads
(that may otherwise cache hit on stale data) or writes (that may
otherwise write the cache, only to have the write effectively
overwritten by a late-delivered invalidate)."

> Power architecture also doesn't guarantee atomic visibility.

Not that it is relevant to the x86, but a PowerPC 750 manual that
I have from 1999 says

"3.3.5.1 Performed Loads and Stores
The PowerPC architecture defines a performed load operation as one
that has the addressed memory location bound to the target register
of the load instruction. The architecture defines a performed store
operation as one where the stored value is the value that any other
processor will receive when executing a load operation."

This would seem to indicate that, at least for that model, store
visibility is atomic. It still needs sync instructions to prevent
load & store reordering or bypassing.

Eric
From: Alexander Terekhov on

Andy Glew wrote:
[...]
> briefly stated: WB memory is processor consistent, type II.

Would you please confirm that in order to get SC semantics for x86 WB
memory, I just need to replace all loads with lock-cmpxchg (with 42 in
the accumulator) and simply use the value left in the accumulator
after the cmpxchg as the load result... which would also provide
store-load fencing inside the cmpxchg with respect to the load from
DEST?

TIA.

regards,
alexander.