From: James Kanze on
On Mar 30, 12:04 pm, Andy Venikov <swojchelo...(a)gmail.com> wrote:
> James Kanze wrote:
>> On Mar 28, 10:05 pm, George Neuner <gneun...(a)comcast.net> wrote:
> <snip>
>>> All the fence does is create a checkpoint in the
>>> instruction sequence at which relevant load or store
>>> instructions dispatched prior to dispatch of the fence
>>> instruction will have completed execution.

>> That's not true for the two architectures whose
>> documentation I've studied, Intel and Sparc. To quote the
>> Intel documentation of MFENCE:

>> Performs a serializing operation on all load and store
>> instructions that were issued prior the MFENCE
>> instruction. This serializing operation guarantees that
>> every load and store instruction that precedes in
>> program order the MFENCE instruction is globally visible
>> before any load or store instruction that follows the
>> MFENCE instruction is globally visible.

>> Note the "globally visible".

> If you read the whole sentence, you have: <1> is globally
> visible before <2> is globally visible. That doesn't sound to
> me as saying that <1> is globally visible.

It guarantees that if <1> is not globally visible, then neither
will <2> be. I don't know what more you might want.

> I don't think that the above says that instructions that
> precede MFENCE are guaranteed to be visible after the MFENCE
> instruction completes.

I don't think that "instruction completes" has any real meaning
in a modern processor. Arguably, a store instruction doesn't
complete until the data is stable in main memory. But that
doesn't stop the processor from going on and doing other things.
In practice, the ordering is all that is relevant anyway.

> It does guarantee, however, that the earlier instructions are
> visible before ANY of the following instructions are visible.

Which is all that is required.
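Concretely (a minimal sketch, assuming x86 and GCC-style inline
assembly; the variable names and layout are purely illustrative,
not from the documentation):

    #include <cassert>

    int payload = 0;    // <1>: written before the fence
    int ready   = 0;    // <2>: written after the fence

    void producer()
    {
        payload = 42;
        asm volatile ("mfence" ::: "memory");   // <1> globally
        ready = 1;                              // visible before <2>
    }

    void consumer()
    {
        if (ready != 0)             // if it sees <2>...
            assert(payload == 42);  // ...it must also see <1>
        // (on x86 the consumer's loads are already ordered; on a
        // weaker architecture it would need its own barrier)
    }

If the consumer sees the flag, it is guaranteed to see the
payload; whether either value has reached main memory yet is
irrelevant.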

--
James Kanze

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: James Kanze on
On Mar 29, 11:55 pm, "Leigh Johnston" <le...(a)i42.co.uk> wrote:
> "James Kanze" <james.ka...(a)gmail.com> wrote in message

> news:36f7e40e-4584-430d-980e-5f7478728d16(a)z3g2000yqz.googlegroups.com...
> <snip>

>>> Performance is often cited as another reason to not use
>>> volatile however the use of volatile can actually help with
>>> multi-threading performance as you can perform a safe
>>> lock-free check before performing a more expensive lock.

>> Again, I'd like to see how. This sounds like the
>> double-checked locking idiom, and that's been proven not to
>> work.

> IMO for an OoO CPU the double checked locking pattern can be
> made to work with volatile if fences are also used or the lock
> also acts as a fence (as is the case with VC++/x86).

Double checked locking can be made to work if you introduce
inline assembler or use some other technique to insert a fence
or a membar instruction in the appropriate places. But of
course, then, the volatile becomes superfluous.
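Roughly (a sketch only, pre-C++11; FullFence(), Mutex and Lock
are placeholders to be mapped onto whatever the platform
provides -- they aren't from any particular library):

    Widget* instance_ = 0;
    Mutex   mutex_;

    Widget* getInstance()
    {
        Widget* p = instance_;
        FullFence();                // conservative: a full fence
                                    // where an acquire would do
        if (p == 0) {
            Lock guard(mutex_);     // the lock itself also fences
            p = instance_;
            if (p == 0) {
                Widget* tmp = new Widget;
                FullFence();        // construct before publishing
                instance_ = tmp;
                p = tmp;
            }
        }
        return p;
    }

Note that volatile appears nowhere; the fences and the lock do
all of the work.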

> This is also the counter-example you are looking for, it
> should work on some implementations.

It's certainly not an example of a sensible use of volatile,
since without the membar/fence, the algorithm doesn't work (at
least on most modern processors, which are multicore). And with
the membar/fence, the volatile is superfluous, and not needed.

> FWIW VC++ is clever enough to make the volatile redundant for
> this example however adding volatile makes no difference to
> the generated code (read: no performance penalty) and I like
> making such things explicit similar to how one uses const
> (doesn't affect the generated output but documents the
> programmer's intentions).

The use of a fence or membar (or some system specific "atomic"
access) would make the intent explicit. The use of volatile
suggests something completely different (memory mapped IO, or
some such).
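With g++ (4.1 and later), for example, the __sync builtins make
both the intent and the full barrier explicit (a sketch; the
flag and the function names are invented):

    long done = 0;

    void set_done()
    {
        __sync_synchronize();   // full barrier: everything written
        done = 1;               // so far is visible before the flag
    }

    bool is_done()
    {
        // atomic read with a full barrier
        return __sync_fetch_and_add(&done, 0) != 0;
    }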

> Which is better: use volatile if there is no noticeable
> performance penalty or constantly check your compiler's
> generated assembler to check the optimizer is not breaking
> things?

The reason there is no performance penalty is because volatile
doesn't do the job. And you don't have to check the generated
assembler for anything (unless you suspect a compiler error);
you check the guarantees given by the compiler.

OK: I'll admit that finding such specifications is very, very
difficult. But they should exist, and they hold for future
releases as well. And there are some
guarantees that are expressed indirectly: if a compiler claims
Posix conformance, and supports multithreading, then you get the
guarantees from the Posix standard; the issue is a bit less
clear under Windows, but if a compiler claims to support
multithreading, then it should conform to the Windows
conventions about this.

> The only volatile in my entire codebase is for the "status" of
> my "threadable" base class and I don't always acquire a lock
> before checking this status and I don't fully trust that the
> optimizer won't cache it for all cases that might crop up as I
> develop code.

I'd have to see the exact code to be sure, but I'd guess that
without an mfence somewhere in there, the code won't work on a
multicore machine (which is just about everything today), and
with the mfence, the volatile isn't necessary.

Also, at least under Solaris, if there is no contention, the
execution time of pthread_mutex_lock is practically the same as
that of membar. Although I've never actually measured it, I
suspect that the same is true if you use CriticalSection (and
not Mutex) under Windows.
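So the simple, portable alternative is just to take the lock (a
sketch under Posix; the class and member names are invented, not
the poster's actual code):

    #include <pthread.h>

    class Threadable
    {
        int             status_;   // guarded by mutex_
        pthread_mutex_t mutex_;
    public:
        Threadable() : status_(0)
        {
            pthread_mutex_init(&mutex_, 0);
        }
        int status()
        {
            pthread_mutex_lock(&mutex_);    // uncontended: costs
            int s = status_;                // little more than a
            pthread_mutex_unlock(&mutex_);  // membar
            return s;
        }
        void setStatus(int s)
        {
            pthread_mutex_lock(&mutex_);
            status_ = s;
            pthread_mutex_unlock(&mutex_);
        }
    };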

> BTW I try and avoid singletons too so I haven't found the need
> to use the double checked locking pattern AFAICR.

Double checked locking is a pattern which can be applied to many
things, not just to singletons.

--
James Kanze

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: James Kanze on
On Mar 30, 12:05 pm, George Neuner <gneun...(a)comcast.net> wrote:
> On Mon, 29 Mar 2010 16:53:44 CST, James Kanze <james.ka...(a)gmail.com>
> wrote:
>> On Mar 28, 10:05 pm, George Neuner <gneun...(a)comcast.net> wrote:

[...]
>>> The purpose of the fence is to sequence memory accesses.

>> For a much more rigorous definition of "access" than that used
>> by the C++ standard.

> Not exactly. I agree that the processor's memory model
> guarantees stronger ordering than that of the C++ standard (or
> almost any language standard, for that matter), but you are
> attributing semantics to "fence" that aren't there.

I'm not attributing anything. I'm just quoting the
documentation.

>>> All the fence does is create a checkpoint in the instruction
>>> sequence at which relevant load or store instructions
>>> dispatched prior to dispatch of the fence instruction will
>>> have completed execution.

>> That's not true for the two architectures whose documentation
>> I've studied, Intel and Sparc.

> Then you'd better go back and study 8)

I've quoted Intel. Regretfully, the Sparc site was down when I
tried to access it, but I've studied their documentation fairly
intensely, and it basically guarantees the same thing.

>> To quote the Intel documentation of MFENCE:

>> Performs a serializing operation on all load and store
>> instructions that were issued prior the MFENCE
>> instruction. This serializing operation guarantees that
>> every load and store instruction that precedes in
>> program order the MFENCE instruction is globally visible
>> before any load or store instruction that follows the
>> MFENCE instruction is globally visible.

> Now look at LFENCE, SFENCE and CLFUSH and think about why they
> are provided separately. Also look at PREFETCH and see what
> it says about fences.

There are many types of fences. Obviously, in any given case,
you should use the one which provides the guarantees you need.

> Intel provides MFENCE as a heavyweight combination of LFENCE,
> SFENCE and CLFLUSH. MFENCE does propagate to memory *because*
> it flushes the cache. However the primitive, SFENCE, ensures
> propagation of writes only to L2 cache.

So what use is it, then?

> Sparc has no single instruction that both fences and flushes
> the cache. MEMBAR ensures propagation only to L2 cache. A
> separate FLUSH instruction is necessary to ensure propagation
> to memory.

That's not what it says in the Sparc Architecture Specification.
(Levels of cache are never mentioned; the architecture allows
an implementation with any number of levels.)

> Sparc also does not have separate load and store fences, but
> it offers two variants of MEMBAR which provide differing
> consistency guarantees.

There is only one Membar instruction, with a 4 bit mask to
control the barriers: LOADLOAD, LOADSTORE, STORELOAD and
STORESTORE. (There are some other bits to control other
functionality, but they are irrelevant with regards to memory
synchronization in a multithreaded environment.)
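With g++ on a Sparc, for example, the barriers are typically
wrapped something like this (illustration only; which bits of
the mask you actually need depends on the algorithm):

    #define MEMBAR_STORELOAD() \
        __asm__ __volatile__ ("membar #StoreLoad" ::: "memory")

    #define MEMBAR_FULL() \
        __asm__ __volatile__ \
            ("membar #LoadLoad | #LoadStore | #StoreLoad | #StoreStore" \
             ::: "memory")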

>> Note the "globally visible". Both Intel and Sparc guarantee
>> strong ordering within a single core (i.e. a single thread);
>> mfence or membar (Sparc) are only necessary if the memory will
>> also be "accessed" from a separate unit: a thread running on a
>> different core, or memory mapped IO.

> Again, you're attributing semantics that aren't there.

I just quoted the documentation. What part of "globally
visible" don't you understand.

> For a store to be "globally visible" means that the value must
> be visible from outside the core. This requires the value be
> in *some* externally visible memory - not *main* memory in
> particular. For both x86 and Sparc, this means L2 cache - the
> first level that can be snooped off-chip.

That's an original definition of "global".

> For a load "globally visible" means that the value is present
> at all levels of the memory hierarchy and cannot be seen
> differently by an external observer. This simply follows from
> the normal operation of the read pipeline - the value is
> written into all levels of cache (more or less) at the same
> time it is loaded into the core register.

> Note also that some CPUs can prefetch data in ways that bypass
> externally visible levels of cache. Sparc and x86 (at least
> since Pentium III) do not permit this.

Sparc certainly does allow it (at least according to the Sparc
Architecture Specification), and I believe some of the newer
Intel do as well.

>>> However, in a memory hierarchy with caching, a store
>>> instruction does not guarantee a write to memory but only that
>>> one or more write cycles is executed on the core's memory
>>> connection bus.

>> On Intel and Sparc architectures, a store instruction doesn't
>> even guarantee that. All it guarantees is that the necessary
>> information is somehow passed to the write pipeline. What
>> happens after that is anybody's guess.

> No. On both of those architectures a store instruction will
> eventually cause the value to be written out of the core
> (except maybe if a hardware exception occurs).

Not on a Sparc. At least not according to the Sparc
Architecture Specification. Practically speaking I doubt that
this is guaranteed for any modern architecture, given the
performance implications.

> Additionally the source register may renamed or the stored
> value may be forwarded within the core to rendezvous with a
> subsequent read of the same location already in the pipeline
> ... but these internal flow optimizations don't affect the
> externally visible operation of the store instruction.

As long as there is only a single store instruction to a given
location, that store will eventually percolate out to the main
memory. If there are several, it's quite possible that some of
them will never appear outside the processor.
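A trivial illustration (nothing platform specific intended):

    int counter = 0;

    void work()
    {
        // The write buffers (or the compiler) may collapse these
        // stores; most of the 1000 intermediate values need never
        // reach shared memory.
        for (int i = 0; i != 1000; ++i)
            ++counter;
    }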

>>> For another thread (or core or CPU) to perceive a change a
>>> value must be propagated into shared memory. For all
>>> multi-core processors I am aware of, the first shared level of
>>> memory is cache - not main memory. Cores on the same die
>>> snoop each other's primary caches and share higher level
>>> caches. Cores on separate dies in the same package share
>>> cache at the secondary or tertiary level.

>> And on more advanced architectures, there are core's which
>> don't share any cache. All of which is irrelevant, since
>> simply issuing a store instruction doesn't even guarantee a
>> write to the highest level cache, and a membar or a fence
>> instruction guarantees access all the way down to the main,
>> shared memory.

> Sorry, but no. Even the architectures we've discussed here, x86 and
> Sparc, do not satisfy your statement.

I quoted the specification from Intel for the x86. The Sparc
site was down, and my copy of the Sparc Architecture
Specification is on a machine in France, so I'm sorry, I can't
quote it here. But I do know what it says. And a membar
instruction does guarantee strong ordering.

> There might be architectures I'm unaware of which can elide an
> off-core write entirely by rendezvous forwarding and register
> renaming, but you haven't named one. I would consider eliding
> the store to be a dangerous interpretation of memory semantics
> and I suspect I would not be alone.

Dangerous or not, no processor can afford to neglect this
important optimization opportunity. And it causes no problems
in single threaded programs, nor in multithreaded programs which
use proper synchronization methods.

> I'm not familiar with any cached architecture for which
> fencing alone guarantees that a store writes all the way to
> main memory - I know some that don't even have/need fencing
> because their on-chip caches are write-through.

I just pointed one out. By quoting the manufacturer's
specifications for the mfence instruction. If I were on my Unix
machine in Paris, I could equally quote similar text for the
Sparc.

>>> The upshot is this:
>>> - "volatile" is required for any CPU.

>> I'm afraid that doesn't follow from anything you've said.
>> Particularly because the volatile is largely a no-op on most
>> current compilers---it inhibits compiler optimizations, but the
>> generated code does nothing to prevent the reordering that
>> occurs at the hardware level.

> "volatile" is required because the compiler must not reorder
> or optimize away the loads or stores.

Which loads and stores? The presence of a fence (or inline
assembler, or specific system or library calls) guarantees that
the compiler cannot reorder around it. And whether the compiler
reorders or suppresses elsewhere is irrelevant, since the
hardware can do it regardless of the code the compiler
generates.
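To make the two levels explicit (GCC-style syntax, purely as an
example):

    // Stops the compiler from moving loads/stores across it, but
    // emits no instruction: the hardware is still free to reorder.
    #define COMPILER_BARRIER()  asm volatile ("" ::: "memory")

    // The "memory" clobber stops the compiler, and the mfence
    // stops the processor (x86 shown; membar on a Sparc).
    #define HARDWARE_FENCE()    asm volatile ("mfence" ::: "memory")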

>>> - fences are required for an OoO CPU.

>> By OoO, I presume you mean "out of order". That's not the only
>> source of the problems.

> OoO is not the *only* source of the problem. The compiler has
> little control over hardware reordering ... fences are blunt
> instruments that impact all loads or stores ... not just those
> to language level "volatiles".

Agreed. Volatile has different semantics (at least that was the
intent). See Herb Sutter's comments else thread.

>>> - cache control is required for a write-back cache between
>>> CPU and main memory.

>> The cache is largely irrelevent on Sparc or Intel. The
>> processor architectures are designed in a way to make it
>> irrelevant. All of the problems would be there even in the
>> absence of caching. They're determined by the implementation of
>> the write and read pipelines.

> That's a naive point of view. For a cached processor, the
> operation of the cache and its impact on real programs is
> *never* "irrelevant".

I was speaking uniquely in the context of threading. The
operation of the cache is very relevant with regards to
performance, for example.

--
James Kanze

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: George Neuner on
On Tue, 30 Mar 2010 05:04:48 CST, Andy Venikov
<swojchelowek(a)gmail.com> wrote:

> James Kanze wrote:
>
>> To quote the Intel documentation of MFENCE:
>>
>> Performs a serializing operation on all load and store
>> instructions that were issued prior the MFENCE
>> instruction. This serializing operation guarantees that
>> every load and store instruction that precedes in
>> program order the MFENCE instruction is globally visible
>> before any load or store instruction that follows the
>> MFENCE instruction is globally visible.
>>
>> Note the "globally visible".
>
> If you read the whole sentence, you have:
> <1> is globally visible before <2> is globally visible. That doesn't
> sound to me as saying that <1> is globally visible.
>
> I don't think that the above says that instructions that precede MFENCE
> are guaranteed to be visible after the MFENCE instruction completes. It
> does guarantee, however, that the earlier instructions are visible
> before ANY of the following instructions are visible.

It reads a bit funny, but it is correct.

MFENCE divides the total set of loads and stores in progress into the
group dispatched before the fence instruction and the group dispatched
after. At completion of the fence instruction, it is guaranteed that
all of the before group will have been completed AND none of the after
group will be completed.

See LFENCE and SFENCE for more details. MFENCE combines their effects
and adds a cache flush as well.

George

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: James Kanze on
On Mar 30, 12:03 pm, Andy Venikov <swojchelo...(a)gmail.com> wrote:
> Herb Sutter wrote:

[...]
> So, with the above said, here's a concrete example of how I'd
> use volatile without an access to a ready-made library. Let's
> take Maged Michael's lock-free queue ("Simple, Fast, and
> Practical Non-Blocking and Blocking Concurrent Queue
> Algorithms", Maged Michael & Michael Scott; 1996). It uses a
> technique similar to DCL to verify the validity of a read.
> Look into its dequeue() method.
> I'll provide the pseudo code here:

> dequeue(Q: pointer to queue t, pvalue: pointer to data type): boolean
> D1: loop # Keep trying until Dequeue is done
> D2: head = Q->Head # Read Head
> D3: tail = Q->Tail # Read Tail
> D4: next = head->next # Read Head.ptr->next
> D5: if head == Q->Head # Are head, tail, and next consistent?
> D6: if head.ptr == tail.ptr # Is queue empty or Tail falling behind?
> D7: if next.ptr == NULL # Is queue empty?
> D8: return FALSE # Queue is empty, couldn't dequeue
> D9: endif
> # Tail is falling behind. Try to advance it
> D10: CAS(&Q->Tail, tail, <next.ptr, tail.count+1>)
> D11: else # No need to deal with Tail
> # Read value before CAS, otherwise another dequeue
> # might free the next node
> D12: *pvalue = next.ptr->value
> # Try to swing Head to the next node
> D13: if CAS(&Q->Head, head, <next.ptr, head.count+1>)
> D14: break # Dequeue is done. Exit loop
> D15: endif
> D16: endif
> D17: endif
> D18: endloop
> D19: free(head.ptr) # It is safe now to free the old dummy node
> D20: return TRUE # Queue was not empty, dequeue succeeded

> Look at line D5: it needs to check if Q->Head is still the
> same as what we read from it before. Otherwise two
> possibilities for breaking the correctness arise: 1) it would
> be possible for the element pointed to by Q->Head to be
> re-inserted back into the queue with NULL in the "next" field,
> so we could report the queue as empty even though the queue
> was never empty in any given moment; or 2) The first element
> was removed after we've read Q->Head and before we've read
> next thus there could be garbage in head->next by the time we
> read it and we'd try to access garbage on line D12.

> This piece of pseudo code could be naively translated to a
> following c++ code:

> while (true)
> {
> Node * localHead = head_;
> Node * localTail = tail_;
> Node * localNext = localHead->next;
> if (localHead == head_)
> {
> ...
> }

> But it wouldn't work for the obvious reasons. One needs to
> insert MemoryFences in the right places. Memory fences is
> something that is highly platform-specific, so one would
> define macros for them that would expand to different
> instructions on different platforms. Here's the code with
> memory fences inserted:

> while (true)
> {
> Node * localHead = head_;
> Node * localTail = tail_;
> DataDependencyBarrier(); //All the systems that I know of will do
> //this sort of barrier automatically, so
> //this macro will expand to nothing

> Node * localNext = localHead->next;
> LoadLoadBarrier(); //on x86 this will expand to nothing

> if (localHead == head_)
> {
> ...
>
> }

> This is much better, but it still got problems: first, on x86,
> the LoadLoadBarrier() will expand to nothing and there will be
> no indication to the compiler not to re-order different loads;
> and second (and I think it's the crux of my argument) that an
> optimizing compiler will dispose of the "if" statement even in
> the face of memory barriers.

First, I very much doubt that the LoadLoad barrier can expand to
nothing if the code is to work. It certainly cannot on a Sparc,
and I rather doubt that it can on an Intel; I'd have to see the
guarantees in writing to believe otherwise. And second, if a
compiler moves code across a barrier, it is broken, and there's
not much you can do about it.

> No matter how many or what type of memory barriers you insert,
> the compiler will be allowed to omit the if statement.

Really? According to who? Obviously, the ISO standard says
nothing about this case; the presence of the barrier introduces
undefined behavior, and the compiler might do anything, as far
as ISO C++ is concerned. But you're obviously depending on
other standards as well. Standards which specify what the
barrier guarantees, for example. And these standards apply to
the compiler as well.

> The ONLY way to force the compiler (any compiler for that
> matter) to generate it is to declare head_ as volatile.

No. The way to force the compiler to generate the necessary
accesses is to use the same implementation specific guarantees
you count on for the barrier to work.

> Here's the final code:
> struct Node
> {
> <unspecified> data;
> Node volatile * next;
> };

> Node volatile * volatile head_;
> Node volatile * volatile tail_;

> dequeue()
> {
> while (true)
> {
> Node volatile * localHead = head_;
> Node volatile * localTail = tail_;
> DataDependencyBarrier();
> Node volatile * localNext = localHead->next;

> if (localHead == head_)
> {
> ...
> }
> ....
>
> }

> Now this code will produce the intended correct object code on
> all the compilers I've listed above and on at least these
> CPUs: x86, itanium, mips, PowerPC (assuming that all the
> MemoryBarriers have been defined for all the platforms). And
> without any modifications to the above code. How's that for
> portability?

Yes, but it's just as portable without the volatile. If the
barriers are defined correctly, the compiler will not move code
over them.
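Something along these lines (a sketch, assuming the barrier
macros are defined to contain at least a compiler barrier -- a
"memory" clobber with g++, _ReadWriteBarrier() with VC++ -- plus
whatever hardware fence the platform requires):

    while (true)
    {
        Node* localHead = head_;
        Node* localTail = tail_;
        DataDependencyBarrier();    // compiler barrier at minimum
        Node* localNext = localHead->next;
        LoadLoadBarrier();          // compiler barrier + lfence or
                                    // membar #LoadLoad where needed
        if (localHead == head_)     // cannot be optimized away: the
        {                           // barriers tell the compiler
            // ...                  // that head_ may have changed
        }
        // ...
    }

No volatile anywhere, and the compiler must still re-read head_
for the comparison.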

--
James Kanze


--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]