From: Alexander Terekhov on

Andy Glew wrote:
[...]
> I think that the overall intention is that placing MFENCE before and
> after every memory reference is supposed to get you SC semantics.

But without remote write atomicity, I suppose. And, BTW, that's what
the revised Java volatiles do. I mean the JSR-133 memory model.

> However, MFENCE, LFENCE, and SFENCE were defined after my time, and I
> suspect that their definitions are not quite complete enough for what
> you want. In particular, *FENCE really only work wrt WB cacheable
> memory, and do not drain external buffers such as may occur in bus
> bridges.

My reading of the specs is that MFENCE is guaranteed to provide
a store-load barrier.

P1: X = 1; R1 = Y;
P2: Y = 1; R2 = X;

(R1, R2) = (0, 0) is allowed under pure PC (processor consistency), but

P1: X = 1; MFENCE; R1 = Y;
P2: Y = 1; MFENCE; R2 = X;

(R1, R2) = (0, 0) is NOT allowed.
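
In code, that second variant is essentially the store-buffering litmus
test below (a sketch of mine, assuming GCC-style inline asm on x86-64 and
pthreads; a real test would loop and randomize timing, which I omit):

#include <pthread.h>
#include <stdio.h>

volatile int X = 0, Y = 0;
int r1, r2;

#define MFENCE() __asm__ __volatile__("mfence" ::: "memory")

void *p1(void *arg) { (void)arg; X = 1; MFENCE(); r1 = Y; return NULL; }
void *p2(void *arg) { (void)arg; Y = 1; MFENCE(); r2 = X; return NULL; }

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* with the MFENCEs, (r1, r2) == (0, 0) must never be printed;
       remove them and it becomes observable */
    printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}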

> In general, the P6 and Wmt families' mechanism for ensuring
> ordering, waiting for global observability, only works for perfectly
> vanilla WB cacheable memory, and is frequently violated wrt other
> memory types. So I do not want to guarantee that it will work for
> things like WC cached memory that is private to a graphics
> accelerator.

I want to know whether MFENCE provides a store-load barrier for WB
memory.

>
> You may be right that using the cmpxchg as you describe achieves SC on
> x86. However, I need to think about it a bit more, since the
> reasoning you provide is implementation specific, not architectural.

I'm just reading the specs.

CMPXCHG on x86 always performs a (hopefully StoreLoad+LoadLoad fenced)
load followed by a (LoadStore+StoreStore fenced) store (plus trailing
MFENCE, so to speak). Locked CMPXCHG is supposed to be "fully fenced".

Regarding the safety net for remote write atomicity, I rely on the
following CMPXCHG wording:

"The destination operand is written back if the comparison fails;
otherwise, the source operand is written into the destination.
(The processor never produces a locked read without also
producing a locked write.)"

I suspect that (locked) XADD(addr, 0) will also work... but I'm
somewhat missing equally strong language about a mandatory write, as in CMPXCHG.

[... cmpxchg could well be implemented without any fencing ...]

"Locked operations are atomic with respect to all other memory
operations and all externally visible events. Only instruction
fetch and page table accesses can pass locked instructions. Locked
instructions can be used to synchronize data written by one
processor and read by another processor.

For the P6 family processors, locked operations serialize all
outstanding load and store operations (that is, wait for them to
complete). This rule is also true for the Pentium 4 and Intel Xeon
processors, with one exception: load operations that reference
weakly ordered memory types (such as the WC memory type) may not
be serialized."
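
In other words, something along these lines ought to behave as a full
barrier on WB memory (a sketch only; locked_rmw_fence is my own name, and
I'm assuming the GCC/Clang __sync builtin, which compiles to a locked RMW):

#include <stdint.h>

static inline void locked_rmw_fence(volatile int32_t *addr) {
    /* lock xadd/add with 0: a locked read always paired with a locked
       write, which per the quoted text serializes outstanding loads and
       stores to WB memory */
    (void)__sync_fetch_and_add(addr, 0);
}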

> You are confusing implementation with semantics.

Fix the specs, then.

And explain how one can achieve classic SC semantics for WB memory.
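
My own reading of the recipe for WB memory: leave loads alone, and make
every store either "store; MFENCE" or an implicitly locked XCHG. As a
sketch (GCC-style inline asm, my names, not anything blessed by Intel):

static inline void sc_store_mfence(volatile int *p, int v) {
    *p = v;
    __asm__ __volatile__("mfence" ::: "memory");
}

static inline void sc_store_xchg(volatile int *p, int v) {
    /* xchg with a memory operand is implicitly locked */
    __asm__ __volatile__("xchgl %0, %1" : "+r"(v), "+m"(*p) :: "memory");
}

static inline int sc_load(const volatile int *p) {
    return *p;   /* plain load; WB loads are not reordered with older loads */
}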

regards,
alexander.
From: David Hopwood on
Joe Seigh wrote:
> David Hopwood wrote:
>> Joe Seigh wrote:
>>
>>> I'm not sure what you're saying here. That all future processors
>>> from Intel that don't have processor ordering won't be x86?
>>
>> Well, they won't be x86-as-we-know-it. OSes, compilers, etc. will
>> have to be changed to run on or generate code for this new x86-like
>> thing, and changes in the memory model will probably be only one issue
>> they need to deal with.
>>
> >>> And that the synchronization instructions in these future processors
> >>> won't be similar to the ones in x86? That Intel is telling people
>>> in an x86 manual to start writing portable code not now but when
>>> they get to the future processor?
>>
>> Of course not. Read what they actually wrote.
>
> I did. It sounded to me like they said if you want to write
> portable code, don't assume processor ordering but use the
> locking and serializing instructions instead on the current
> processors.

But OSes, thread libraries and language implementations *aren't* portable
code.

--
David Hopwood <david.nospam.hopwood(a)blueyonder.co.uk>
From: Joe Seigh on
David Hopwood wrote:
> Joe Seigh wrote:
>
>> David Hopwood wrote:
>>>
>>> Of course not. Read what they actually wrote.
>>
>>
>> I did. It sounded to me like they said if you want to write
>> portable code, don't assume processor ordering but use the
>> locking and serializing instructions instead on the current
>> processors.
>
>
> But OSes, thread libraries and language implementations *aren't* portable
> code.
>

I do not think that word means what you think it means.

Note that I am an ex-kernel developer and have created enough synchronization
APIs that run on totally different platforms. I've created an atomically
thread-safe reference-counted smart pointer that has two totally different
implementations on two different architectures. Given that Sun Microsystems'
research division couldn't manage to do this and could only do it on an
obsolete architecture, I'd say I have a pretty good idea what portability is
and what its issues are.


--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.
From: Joe Seigh on
Alexander Terekhov wrote:
> Andy Glew wrote:
>
>
>>You are confusing implementation with semantics.
>
>
> Fix the specs, then.

I think you can assume that the serializing stuff does the right thing.
If not and you have strong reason to believe otherwise, then you should
short Intel stock as you'd stand a pretty good chance of making a fortune.
Basically, no OS would work correctly on an Intel-based multiprocessor
server and Intel would be out of that business. Also Intel would be
screwed in the multi-core workstation and desktop market as it would be
too late to fix the current processors going into production.

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.
From: David Hopwood on
Joe Seigh wrote:
> David Hopwood wrote:
>> Joe Seigh wrote:
>>> David Hopwood wrote:
>>>
>>>> Of course not. Read what they actually wrote.
>>>
>>> I did. It sounded to me like they said if you want to write
>>> portable code, don't assume processor ordering but use the
>>> locking and serializing instructions instead on the current
>>> processors.
>>
>> But OSes, thread libraries and language implementations *aren't* portable
>> code.
>
> I do not think that word means what you think it means.
>
> Note that I am an ex-kernel developer and have created enough
> synchronization APIs that run on totally different platforms.

You are totally missing the point. OSes, thread libraries and language
implementations have some code that needs to be adapted to each hardware
architecture. If the memory model were to change in future processors
that are otherwise x86-like, this code would have to change. It's not a
big deal, because this platform-specific code is maintained by people who
know how to change it, and because there are few enough OSes, thread
libraries, and language implementations for the total effort involved
not to be very great. It would, however, be a big deal if existing x86
*applications* stopped working on an otherwise x86-compatible processor.

--
David Hopwood <david.nospam.hopwood(a)blueyonder.co.uk>