From: Andy Glew on 5 Sep 2005 20:22

Alexander Terekhov <terekhov(a)web.de> writes:

> So just do cmpxchg(&X, 42, 42) which will perform locked read-write
> (with its read part store-load fenced from prior writes, I infer).
> You'll get classic SC if you replace all loads with cmpxchg(&X, 42,
> 42). That's my understanding, and I'm eagerly awaiting confirmation
> from Andy Glew and/or someone from Intel hanging at C++ memory model
> mailing list.

42, eh? Sounds like a joke: Goodbye, and thanks for all the thrash...

I think that the overall intention is that placing MFENCE before and
after every memory reference is supposed to get you SC semantics.

However, MFENCE, LFENCE, and SFENCE were defined after my time, and I
suspect that their definitions are not quite complete enough for what
you want. In particular, *FENCE really only work wrt WC cacheable
memory, and do not drain external buffers such as may occur in bus
bridges.

In general, the P6 and Wmt families' mechanism for ensuring ordering,
waiting for global observability, only works for perfectly vanilla WC
cacheable memory, and is frequently violated wrt other memory types.
So I do not want to guarantee that it will work for things like WC
cached memory that is private to a graphics accelerator.

You may be right that using the cmpxchg as you describe achieves SC on
x86. However, I need to think about it a bit more, since the reasoning
you provide is implementation specific, not architectural.

(Note that an atomic RMW like cmpxchg could well be implemented
without any fencing semantics. I.e. atomic RMWs and memory
ordering/fencing are independent concepts. I argued for this in
Itanium; I am trying to remember if x86 required that the two be mixed
up together. I can't see why it should have... I.e. I am sure that
using cmpxchg as you describe need not provide SC on a reasonable
computer architecture. I just need to find out if x86 mixed the two up
for some legacy reasons. In the meantime, my recommendation would be
to use the fences.)

> > 4) The only way to guarantee that a processor has the most recent
> >    value of a location is to take ownership of the variable,
> >    and that requires a write. Since we actually want to read X,
>      ^^^^^^^^^^^^^^^^^^^^^^^^^
>
> That's the key.
>
> >    we use CAS (x86 LOCK CMPXCHG) to read the most recent value.

Flawed argument. It is entirely possible to imagine implementations of
CAS that do not write the variable if the value is unchanged.

> That will work too, but you don't really need to LD X and loop on
> CAS compare failure given that x86's cmpxchg always makes a write.
> "The destination operand is written back if the comparison fails;
> otherwise, the source operand is written into the destination. (The
> processor never produces a locked read without also producing a
> locked write.)"

You are confusing implementation with semantics.

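The idea debated in the message above, reading X through cmpxchg(&X, 42, 42) instead of a plain load, can be sketched in C++ roughly as follows. This is only an illustration: the helper name cas_read and its default guess value are hypothetical, and the sketch assumes a C++11 compiler that lowers compare_exchange_strong to a locked CMPXCHG on x86, which is typical but not something the language promises.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper (not from the thread): read `x` by issuing a
// compare-and-swap whose expected and desired values are both `guess`.
inline int cas_read(std::atomic<int>& x, int guess = 42) {
    int expected = guess;
    // On success x held `guess` and is rewritten with the same value.
    // On failure the value actually observed is copied into `expected`.
    x.compare_exchange_strong(expected, guess, std::memory_order_seq_cst);
    return expected;  // the observed value in both cases
}

// Small self-check: the read never changes the stored value.
inline void cas_read_demo() {
    std::atomic<int> x{7};
    assert(cas_read(x) == 7);
    assert(x.load() == 7);
}
```

Note that ISO C++ says nothing about a failed compare-exchange performing a store, which is close to the implementation-versus-semantics distinction drawn above. Whether the locked write happens is a property of the x86 instruction as described in the manual text quoted in the thread, not of this source code.
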
From: Joe Seigh on 5 Sep 2005 21:13

David Hopwood wrote:
> Joe Seigh wrote:
>
>> David Hopwood wrote:
>>>
>>>> That one? And what do people think the memory model that only
>>>> "I/O, locking, and/or serializing instructions" can synchronize is?
>>>
>>> You're overanalysing a fairly loosely worded recommendation.
>>
>> I'm not sure what you're saying here. That all future processors
>> from Intel that don't have processor ordering won't be x86?
>
> Well, they won't be x86-as-we-know-it. OSes, compilers, etc. will
> have to be changed to run on or generate code for this new x86-like
> thing, and changes in the memory model will probably be only one issue
> they need to deal with.
>
>> And that the synchronization instructions in these future processors
>> won't be similar to the ones in x86? That Intel is telling people
>> in an x86 manual to start writing portable code not now but when
>> they get to the future processor?
>
> Of course not. Read what they actually wrote.
>

I did. It sounded to me like they said if you want to write
portable code, don't assume processor ordering but use the
locking and serializing instructions instead on the current
processors.

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

From: Alexander Terekhov on 6 Sep 2005 05:01

Andy Glew wrote:
[...]
> I think that the overall intention is that placing MFENCE before and
> after every memory reference is supposed to get you SC semantics.

But without remote write atomicity, I suppose. And, BTW, that's what
revised Java volatiles do. I mean the JSR-133 memory model.

> However, MFENCE, LFENCE, and SFENCE were defined after my time, and I
> suspect that their definitions are not quite complete enough for what
> you want. In particular, *FENCE really only work wrt WC cacheable
> memory, and do not drain external buffers such as may occur in bus
> bridges.

My reading of the specs is that MFENCE is guaranteed to provide a
store-load barrier.

P1: X = 1; R1 = Y;
P2: Y = 1; R2 = X;

(R1, R2) = (0, 0) is allowed under pure PC, but

P1: X = 1; MFENCE; R1 = Y;
P2: Y = 1; MFENCE; R2 = X;

(R1, R2) = (0, 0) is NOT allowed.

> In general, the P6 and Wmt families' mechanism for ensuring
> ordering, waiting for global observability, only works for perfectly
> vanilla WC cacheable memory, and is frequently violated wrt other
> memory types. So I do not want to guarantee that it will work for
> things like WC cached memory that is private to a graphics
> accelerator.

I want to know whether MFENCE provides a store-load barrier for WB
memory.

>
> You may be right that using the cmpxchg as you describe achieves SC on
> x86. However, I need to think about it a bit more, since the
> reasoning you provide is implementation specific, not architectural.

I'm just reading the specs. CMPXCHG on x86 always performs a
(hopefully StoreLoad+LoadLoad fenced) load followed by a
(LoadStore+StoreStore fenced) store (plus trailing MFENCE, so to
speak). Locked CMPXCHG is supposed to be "fully fenced". Regarding a
safety net for remote write atomicity, I rely on the following CMPXCHG
wording:

"The destination operand is written back if the comparison fails;
otherwise, the source operand is written into the destination. (The
processor never produces a locked read without also producing a
locked write.)"

I suspect that (locked) XADD(addr, 0) will also work... but I'm
somewhat missing strong language about a mandatory write as in CMPXCHG.

[... cmpxchg could well be implemented without any fencing ...]

"Locked operations are atomic with respect to all other memory
operations and all externally visible events. Only instruction fetch
and page table accesses can pass locked instructions. Locked
instructions can be used to synchronize data written by one processor
and read by another processor. For the P6 family processors, locked
operations serialize all outstanding load and store operations (that
is, wait for them to complete). This rule is also true for the
Pentium 4 and Intel Xeon processors, with one exception: load
operations that reference weakly ordered memory types (such as the WC
memory type) may not be serialized."

> You are confusing implementation with semantics.

Fix the specs, then. And explain how one can achieve classic SC
semantics for WB memory.

regards,
alexander.

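The fenced P1/P2 example in the message above is the classic store-buffering litmus test, and it can be approximated in portable C++ along these lines. This is a sketch under stated assumptions: the function names run_sb_once and count_forbidden are hypothetical, std::atomic_thread_fence(std::memory_order_seq_cst) stands in for MFENCE (x86 compilers typically emit MFENCE or a locked instruction for it), and the variables live in ordinary write-back memory.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// One run of the store-buffering test with a full fence between each
// thread's store and its subsequent load. Returns (R1, R2).
inline std::pair<int, int> run_sb_once() {
    std::atomic<int> X{0}, Y{0};
    int r1 = -1, r2 = -1;

    std::thread p1([&] {
        X.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // MFENCE stand-in
        r1 = Y.load(std::memory_order_relaxed);
    });
    std::thread p2([&] {
        Y.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // MFENCE stand-in
        r2 = X.load(std::memory_order_relaxed);
    });
    p1.join();
    p2.join();

    assert(r1 != -1 && r2 != -1);  // both threads ran to completion
    return {r1, r2};
}

// Repeat the test and count how often the forbidden (0, 0) outcome shows up.
inline int count_forbidden(int iterations) {
    int forbidden = 0;
    for (int i = 0; i < iterations; ++i) {
        std::pair<int, int> r = run_sb_once();
        if (r.first == 0 && r.second == 0) ++forbidden;
    }
    return forbidden;
}
```

Under the C++ memory model the two sequentially consistent fences rule out (0, 0), so count_forbidden should return 0 for any iteration count. Removing both fences makes (0, 0) permitted, although a short run may still not observe it. The sketch says nothing about WC or device memory, which is the case the earlier message raises doubts about.
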
From: David Hopwood on 6 Sep 2005 07:26

Joe Seigh wrote:
> David Hopwood wrote:
>> Joe Seigh wrote:
>>
>>> I'm not sure what you're saying here. That all future processors
>>> from Intel that don't have processor ordering won't be x86?
>>
>> Well, they won't be x86-as-we-know-it. OSes, compilers, etc. will
>> have to be changed to run on or generate code for this new x86-like
>> thing, and changes in the memory model will probably be only one issue
>> they need to deal with.
>>
>>> And that the synchronization instructions in these future processors
>>> won't be similar to the ones in x86? That Intel is telling people
>>> in an x86 manual to start writing portable code not now but when
>>> they get to the future processor?
>>
>> Of course not. Read what they actually wrote.
>
> I did. It sounded to me like they said if you want to write
> portable code, don't assume processor ordering but use the
> locking and serializing instructions instead on the current
> processors.

But OSes, thread libraries and language implementations *aren't*
portable code.

--
David Hopwood <david.nospam.hopwood(a)blueyonder.co.uk>

From: Joe Seigh on 6 Sep 2005 07:53

David Hopwood wrote:
> Joe Seigh wrote:
>
>> David Hopwood wrote:
>>>
>>> Of course not. Read what they actually wrote.
>>
>> I did. It sounded to me like they said if you want to write
>> portable code, don't assume processor ordering but use the
>> locking and serializing instructions instead on the current
>> processors.
>
> But OSes, thread libraries and language implementations *aren't*
> portable code.
>

I do not think that word means what you think it means.

Note that I am an ex-kernel developer and have created enough
synchronization APIs that run on totally different platforms.
I've created an atomically thread-safe reference-counted smart
pointer that has two totally different implementations on two
different architectures. Given that Sun Microsystems' research
division couldn't manage to do this and could only do it on
an obsolete architecture, I'd say I have a pretty good idea of
what portability is and what its issues are.

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.