From: Andy Glew on
Alexander Terekhov <terekhov(a)web.de> writes:

> So just do cmpxchg(&X, 42, 42) which will perform locked read-write
> (with its read part store-load fenced from prior writes, I infer).
> You'll get classic SC if you replace all loads with cmpxchg(&X, 42,
> 42). That's my understanding, and I'm eagerly awaiting confirmation
> from Andy Glew and/or someone from Intel hanging at C++ memory model
> mailing list.
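
For readers following along, the cmpxchg(&X, 42, 42) idiom quoted above can be sketched in portable C++ roughly as below. The helper name load_via_cas is made up for illustration, and the mapping of compare_exchange_strong to LOCK CMPXCHG is an assumption about typical x86 compilers, not something the language guarantees. The sketch only shows how a CAS can stand in for a load; whether that yields SC on x86 is exactly what this thread is arguing about.

```cpp
#include <atomic>

// Hypothetical helper (not from the thread): read `x` through an atomic
// read-modify-write instead of a plain load.
//
// If x == 42 the CAS succeeds and stores 42 back, leaving x unchanged.
// Otherwise the CAS fails and copies the current value of x into `expected`.
// Either way `expected` ends up holding the value that was observed.
inline int load_via_cas(std::atomic<int>& x) {
    int expected = 42;
    x.compare_exchange_strong(expected, 42, std::memory_order_seq_cst);
    return expected;
}
```

The comparand 42 is arbitrary: any value works, because the stored value equals the comparand and so the variable is never changed.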

42, eh? Sounds like a joke: Goodbye, and thanks for all the thrash...

I think that the overall intention is that placing MFENCE before and
after every memory reference is supposed to get you SC semantics.
However, MFENCE, LFENCE, and SFENCE were defined after my time, and I
suspect that their definitions are not quite complete enough for what
you want. In particular, *FENCE really only work wrt WB cacheable
memory, and do not drain external buffers such as may occur in bus
bridges. In general, the P6 and Wmt families' mechanism for ensuring
ordering, waiting for global observability, only works for perfectly
vanilla WB cacheable memory, and is frequently violated wrt other
memory types. So I do not want to guarantee that it will work for
things like WC cached memory that is private to a graphics
accelerator.

You may be right that using the cmpxchg as you describe achieves SC on
x86. However, I need to think about it a bit more, since the
reasoning you provide is implementation specific, not architectural.

(Note that an atomic RMW like cmpxchg could well be implemented
without any fencing semantics. I.e. atomic RMWs and memory
ordering/fencing are independent concepts. I argued for this in
Itanium; I am trying to remember if x86 required that the two be mixed
up together. I can't see why it should have... I.e. I am sure that
using cmpxchg as you describe need not provide SC on a reasonable
computer architecture. I just need to find out if x86 mixed the two up
for some legacy reasons. In the meantime: use the fences would be my
recommendation.)


> > 4) The only way to guarantee that a processor has the most recent
> > value of a location is to take ownership of the variable,
> > and that requires a write. Since we actually want to read X,
> ^^^^^^^^^^^^^^^^^^^^^^^^^
>
> That's the key.
>
> > we use CAS (x86 LOCK CMPXCHG) to read the most recent value.

Flawed argument.

It is entirely possible to imagine implementations of CAS that do not
write the variable if the value is unchanged.

> That will work too, but you don't really need to LD X and loop on
> CAS compare failure given that x86's cmpxchg always makes a write.
> "The destination operand is written back if the comparison fails;
> otherwise, the source operand is written into the destination. (The
> processor never produces a locked read without also producing a
> locked write.)"

You are confusing implementation with semantics.
From: Joe Seigh on
David Hopwood wrote:
> Joe Seigh wrote:
>
>> David Hopwood wrote:
>>>
>>>> That one? And what do people think the memory model that only
>>>> "I/O, locking, and/or serializing instructions" can synchronize is?
>>>
>>>
>>> You're overanalysing a fairly loosely worded recommendation.
>>
>>
>> I'm not sure what you're saying here. That all future processors
>> from Intel that don't have processor ordering won't be x86?
>
>
> Well, they won't be x86-as-we-know-it. OSes, compilers, etc. will
> have to be changed to run on or generate code for this new x86-like
> thing, and changes in the memory model will probably be only one issue
> they need to deal with.
>
>> And that the synchronization instructions in these future processors
>> won't be similar to the ones in x86? That Intel is telling people
>> in an x86 manual to start writing portable code not now but when
>> they get to the future processor?
>
>
> Of course not. Read what they actually wrote.
>

I did. It sounded to me like they said if you want to write
portable code, don't assume processor ordering but use the
locking and serializing instructions instead on the current
processors.

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.
From: Alexander Terekhov on

Andy Glew wrote:
[...]
> I think that the overall intention is that placing MFENCE before and
> after every memory reference is supposed to get you SC semantics.

But without remote write atomicity, I suppose. And, BTW, that's what
revised Java volatiles do. I mean JSR-133 memory model.

> However, MFENCE, LFENCE, and SFENCE were defined after my time, and I
> suspect that their definitions are not quite complete enough for what
> you want. In particular, *FENCE really only work wrt WB cacheable
> memory, and do not drain external buffers such as may occur in bus
> bridges.

My reading of the specs is that MFENCE is guaranteed to provide a
store-load barrier.

P1: X = 1; R1 = Y;
P2: Y = 1; R2 = X;

(R1, R2) = (0, 0) is allowed under pure PC, but

P1: X = 1; MFENCE; R1 = Y;
P2: Y = 1; MFENCE; R2 = X;

(R1, R2) = (0, 0) is NOT allowed.
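
The fenced version of this litmus test can be written as a small C++ sketch. The names below (SbResult, run_store_buffer_once, count_forbidden) are invented for illustration, and std::atomic_thread_fence(std::memory_order_seq_cst) is only assumed to compile to MFENCE or an equivalent locked instruction on x86. The C++ memory model itself forbids the (0, 0) outcome here, so the sketch shows the shape of the test rather than proving anything about a particular processor.

```cpp
#include <atomic>
#include <thread>

struct SbResult { int r1; int r2; };

// One run of the store-buffering test with a full fence between the
// store and the load on each side (the "MFENCE" variant above).
inline SbResult run_store_buffer_once() {
    std::atomic<int> X{0}, Y{0};
    int r1 = -1, r2 = -1;

    std::thread p1([&] {
        X.store(1, std::memory_order_relaxed);                // X = 1
        std::atomic_thread_fence(std::memory_order_seq_cst);  // store-load barrier
        r1 = Y.load(std::memory_order_relaxed);               // R1 = Y
    });
    std::thread p2([&] {
        Y.store(1, std::memory_order_relaxed);                // Y = 1
        std::atomic_thread_fence(std::memory_order_seq_cst);  // store-load barrier
        r2 = X.load(std::memory_order_relaxed);               // R2 = X
    });

    p1.join();
    p2.join();
    return {r1, r2};
}

// Repeats the test and counts how often the forbidden (0, 0) result shows up.
inline int count_forbidden(int iterations) {
    int forbidden = 0;
    for (int i = 0; i < iterations; ++i) {
        SbResult r = run_store_buffer_once();
        if (r.r1 == 0 && r.r2 == 0) ++forbidden;
    }
    return forbidden;
}
```

Removing the two fences gives the unfenced variant, where (0, 0) becomes a legal result. Seeing it in practice depends on timing, so a run that never shows it proves nothing.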

> In general, the P6 and Wmt families' mechanism for ensuring
> ordering, waiting for global observability, only works for perfectly
> vanilla WB cacheable memory, and is frequently violated wrt other
> memory types. So I do not want to guarantee that it will work for
> things like WC cached memory that is private to a graphics
> accelerator.

I want to know whether MFENCE provides a store-load barrier for WB
memory.

>
> You may be right that using the cmpxchg as you describe achieves SC on
> x86. However, I need to think about it a bit more, since the
> reasoning you provide is implementation specific, not architectural.

I'm just reading the specs.

CMPXCHG on x86 always performs a (hopefully StoreLoad+LoadLoad fenced)
load followed by a (LoadStore+StoreStore fenced) store (plus trailing
MFENCE, so to speak). Locked CMPXCHG is supposed to be "fully fenced".

Regarding safety net for remote write atomicity, I rely on the
following CMPXCHG wording:

"The destination operand is written back if the comparison fails;
otherwise, the source operand is written into the destination.
(The processor never produces a locked read without also
producing a locked write.)"

I suspect that (locked) XADD(addr, 0) will also work... but I'm
somewhat missing strong language about mandatory write as in CMPXCHG.
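
The XADD(addr, 0) variant has an equally short C++ sketch. The helper name load_via_xadd is hypothetical, and fetch_add is only assumed to become LOCK XADD on x86. A compiler may lower an add of zero differently, which matches the point above that nothing spells out a mandatory write here.

```cpp
#include <atomic>

// Hypothetical helper (not from the thread): read `x` with an atomic
// fetch-and-add of zero. The value of x is left unchanged, and the
// previous (i.e. current) value is returned.
inline int load_via_xadd(std::atomic<int>& x) {
    return x.fetch_add(0, std::memory_order_seq_cst);
}
```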

[... cmpxchg could well be implemented without any fencing ...]

"Locked operations are atomic with respect to all other memory
operations and all externally visible events. Only instruction
fetch and page table accesses can pass locked instructions. Locked
instructions can be used to synchronize data written by one
processor and read by another processor.

For the P6 family processors, locked operations serialize all
outstanding load and store operations (that is, wait for them to
complete). This rule is also true for the Pentium 4 and Intel Xeon
processors, with one exception: load operations that reference
weakly ordered memory types (such as the WC memory type) may not
be serialized."

> You are confusing implementation with semantics.

Fix the specs, then.

And explain how one can achieve classic SC semantics for WB memory.

regards,
alexander.
From: David Hopwood on
Joe Seigh wrote:
> David Hopwood wrote:
>> Joe Seigh wrote:
>>
>>> I'm not sure what you're saying here. That all future processors
>>> from Intel that don't have processor ordering won't be x86?
>>
>> Well, they won't be x86-as-we-know-it. OSes, compilers, etc. will
>> have to be changed to run on or generate code for this new x86-like
>> thing, and changes in the memory model will probably be only one issue
>> they need to deal with.
>>
>>> And that the synchronization instructions in these future processors
>>> won't be similar to the ones in x86? That Intel is telling people
>>> in an x86 manual to start writing portable code not now but when
>>> they get to the future processor?
>>
>> Of course not. Read what they actually wrote.
>
> I did. It sounded to me like they said if you want to write
> portable code, don't assume processor ordering but use the
> locking and serializing instructions instead on the current
> processors.

But OSes, thread libraries and language implementations *aren't* portable
code.

--
David Hopwood <david.nospam.hopwood(a)blueyonder.co.uk>
From: Joe Seigh on
David Hopwood wrote:
> Joe Seigh wrote:
>
>> David Hopwood wrote:
>>>
>>> Of course not. Read what they actually wrote.
>>
>>
>> I did. It sounded to me like they said if you want to write
>> portable code, don't assume processor ordering but use the
>> locking and serializing instructions instead on the current
>> processors.
>
>
> But OSes, thread libraries and language implementations *aren't* portable
> code.
>

I do not think that word means what you think it means.

Note that I am an ex-kernel developer and have created enough synchronization
APIs that run on totally different platforms. I've created an atomically
threadsafe reference counted smart pointer that has two totally different
implementations on two different architectures. Given that Sun Microsystems'
research division couldn't manage to do this and could only do it on an
obsolete architecture, I'd say I have a pretty good idea what portability is
and what its issues are.


--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.