From: Andy Venikov on
Herb Sutter wrote:
> Please remember this: Standard ISO C/C++ volatile is useless for
> multithreaded programming. No argument otherwise holds water; at best
> the code may appear to work on some compilers/platforms, including all
> attempted counterexamples I've seen on this thread.

You have enormous clout with C++ professionals, myself included, so
before permanently accepting such an all-encompassing statement, allow me
to step back a little and look at what is really at the core of this
argument. Maybe we're arguing the same point. Or maybe I'm missing
something big, in which case I'll be doubly glad to have my wrong
assumptions pointed out.

I understand that volatile was never supposed to be of any help for
multithreaded programming. I don't expect it to issue any memory fences,
nor to make any guarantees whatsoever about anything thread-related...
Yet on all the compilers I know of (gcc, mingw, MSVC, LLVM, Intel) it
produces just the code I need for my multithreaded programs. And I really
don't see how it couldn't, given a common-sense understanding of what it
should do in single-threaded programs. And I'm pretty sure that's not
going to change in the foreseeable future.

So my use of volatile may not be standard-portable, but it sure is
real-life portable.

Here's the point of view I'm coming from.
Imagine that someone needs to implement a library that provides certain
multithreading (multiprogramming) tools like atomic access,
synchronization primitives and some lock-free algorithms that will be
used by other developers so that they don't have to worry about things
like volatile. (Now that boost.atomic is almost out, I'll happily use it.
But Helge Bahmann (the author of the library) didn't have that luxury, so
to make his higher-level APIs work he had to resort internally to
low-level tools like volatile where appropriate.)

So, with the above said, here's a concrete example of how I'd use
volatile without access to a ready-made library. Let's take Maged
Michael's lock-free queue ("Simple, Fast, and Practical Non-Blocking and
Blocking Concurrent Queue Algorithms", Maged Michael & Michael Scott;
1996). It uses a technique similar to DCL to verify the validity of a
read. Look at its dequeue() method.
I'll provide the pseudocode here:

dequeue(Q: pointer to queue_t, pvalue: pointer to data type): boolean
D1:  loop                             # Keep trying until Dequeue is done
D2:    head = Q->Head                 # Read Head
D3:    tail = Q->Tail                 # Read Tail
D4:    next = head->next              # Read Head.ptr->next
D5:    if head == Q->Head             # Are head, tail, and next consistent?
D6:      if head.ptr == tail.ptr      # Is queue empty or Tail falling behind?
D7:        if next.ptr == NULL        # Is queue empty?
D8:          return FALSE             # Queue is empty, couldn't dequeue
D9:        endif
           # Tail is falling behind. Try to advance it
D10:       CAS(&Q->Tail, tail, <next.ptr, tail.count+1>)
D11:     else                         # No need to deal with Tail
           # Read value before CAS, otherwise another dequeue
           # might free the next node
D12:       *pvalue = next.ptr->value
           # Try to swing Head to the next node
D13:       if CAS(&Q->Head, head, <next.ptr, head.count+1>)
D14:         break                    # Dequeue is done. Exit loop
D15:       endif
D16:     endif
D17:   endif
D18: endloop
D19: free(head.ptr)                   # It is safe now to free the old dummy node
D20: return TRUE                      # Queue was not empty, dequeue succeeded

Look at line D5: it needs to check whether Q->Head is still the same as
what we read from it before. Otherwise, there are two ways correctness
could break: 1) the element pointed to by Q->Head could be re-inserted
back into the queue with NULL in its "next" field, and dequeue would then
return "empty" even though the queue was never actually empty at any
point; or 2) the first element could be removed after we've read Q->Head
but before we've read next, in which case head->next could contain
garbage by the time we read it, and we'd then access that garbage on
line D12.

This piece of pseudocode could be naively translated into the following
C++ code:

while (true)
{
    Node * localHead = head_;
    Node * localTail = tail_;
    Node * localNext = localHead->next;

    if (localHead == head_)
    {
        ...
    }

But it wouldn't work, for obvious reasons.
One needs to insert memory fences in the right places. Memory fences are
highly platform-specific, so one would define macros for them that expand
to different instructions on different platforms.
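
For illustration, here's a minimal sketch of what such macro definitions
might look like under gcc (these are my own plausible definitions, not
taken from any particular library):

#if defined(__i386__) || defined(__x86_64__)
  // x86 honours data dependencies and does not reorder loads with other
  // loads at the hardware level, so both macros emit no instruction here.
  #define DataDependencyBarrier()
  #define LoadLoadBarrier()
#elif defined(__sparc__)
  #define DataDependencyBarrier()
  #define LoadLoadBarrier() __asm__ __volatile__ ("membar #LoadLoad" ::: "memory")
#endif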
Here's the code with memory fences inserted:

while (true)
{
    Node * localHead = head_;
    Node * localTail = tail_;

    DataDependencyBarrier(); // All the systems that I know of will do this
                             // sort of barrier automatically, so this
                             // macro will expand to nothing

    Node * localNext = localHead->next;

    LoadLoadBarrier();       // on x86 this will expand to nothing

    if (localHead == head_)
    {
        ...
    }

This fenced version is much better, but it still has problems: first, on
x86 the LoadLoadBarrier() will expand to nothing, so there is no
indication to the compiler not to reorder the loads; and second (and I
think this is the crux of my argument), an optimizing compiler may
dispose of the "if" statement even in the face of memory barriers. No
matter how many or what type of memory barriers you insert, the compiler
is allowed to omit the if statement. The ONLY way to force the compiler
(any compiler, for that matter) to generate it is to declare head_ as
volatile.

Here's the final code:

struct Node
{
    <unspecified> data;
    Node volatile * next;
};

Node volatile * volatile head_;
Node volatile * volatile tail_;

dequeue()
{
    while (true)
    {
        Node volatile * localHead = head_;
        Node volatile * localTail = tail_;

        DataDependencyBarrier();
        Node volatile * localNext = localHead->next;

        if (localHead == head_)
        {
            ...
        }
        ...
    }
}


Now this code will produce the intended, correct object code on all the
compilers I've listed above and on at least these CPUs: x86, Itanium,
MIPS, PowerPC (assuming that all the barrier macros have been defined for
all these platforms), and without any modifications to the above code.
How's that for portability?

I think my fault was that in my previous posts I was pushing more
heavily on volatile's ability to tell the compiler not to reorder the
instructions it generates (which is still useful), rather than
emphasizing that I want volatile to tell the compiler not to optimize
away certain instructions. The reordering problem can be circumvented by
using inline asm statements (and then again, on x86, LoadLoadBarrier
would expand to nothing, so we'd be forced to use a bogus inline asm
statement - I'd rather choose to use volatile), but I don't see how the
optimizing-away problem can be circumvented without the use of volatile.
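
(Concretely, the "bogus inline asm statement" I mean would be a pure
compiler barrier; a minimal sketch for gcc:

#define CompilerBarrier() __asm__ __volatile__ ("" ::: "memory")

It emits no instruction at all, but it tells the compiler not to reorder
or cache memory accesses across that point.)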

Now, after writing all this, I realize that I could've used a simpler
example - a simple Peterson's algorithm for two threads wouldn't work
without the use of volatile: the "turn" variable is assigned the same
value that it's later compared against, so the compiler will omit the
"turn == x" part of the if statement.

<snip>

I hope this clears matters up - I'm sorry if I wasn't clear before.



> ---
> Herb Sutter (herbsutter.wordpress.com) (www.gotw.ca)
>
> Convener, SC22/WG21 (C++) (www.gotw.ca/iso)
> Architect, Visual C++ (www.gotw.ca/microsoft)
>


Andy.



From: Andy Venikov on
James Kanze wrote:
> On Mar 26, 12:05 pm, Andy Venikov <swojchelo...(a)gmail.com> wrote:
>> James Kanze wrote:
>>> If that's the case, then the fence instruction is seriously
>>> broken. The whole purpose of a fence instruction is to
>>> guarantee that another CPU (with another thread) can see the
>>> changes. (Of course, the other thread also needs a fence.)
>
>> Hmm, the way I understand fences is that they introduce
>> ordering and not necessarily guarantee visibility. For
>> example:
>
>> 1. Store to location 1
>> 2. StoreStore fence
>> 3. Store to location 2
>
>> will guarantee only that if store to location 2 is visible to
>> some thread, then the store to location 1 is guaranteed to be
>> visible to the same thread as well.
>
> A StoreStore fence guarantees that all stores issued before the
> fence are visible in main memory, and that none issued after the
> fence are visible (at the time the StoreStore fence is
> executed).
>
> Of course, for another thread to be guaranteed to see the results
> of any store, it has to use a load fence, to ensure that the
> values it sees are those after the load fence, and not some
> value that it happened to pick up earlier.

What I meant was that a memory fence doesn't mean that the effects of a
write will be immediately flushed to main memory, or that the effects of
a read will be immediately read from main memory. Generally, a memory
fence is merely a checkpoint telling the processor not to reorder
instructions around the fence. I don't remember which processor docs I've
read (I believe it was Itanium), but here, for example, is what the docs
said about a store fence: a store barrier makes sure that all the stores
appearing before the fence are placed in the write queue before any of
the stores that follow the fence. In no way are you guaranteed that any
of the stores are in main memory after the fence instruction has
executed. For that you'd have to use a flush instruction.

Andy.

>
> --
> James Kanze
>


From: Andy Venikov on
James Kanze wrote:
> On Mar 28, 10:05 pm, George Neuner <gneun...(a)comcast.net> wrote:
<snip>
>> All the fence does is create a checkpoint in the instruction
>> sequence at which relevant load or store instructions
>> dispatched prior to dispatch of the fence instruction will
>> have completed execution.
>
> That's not true for the two architectures whose documentation
> I've studied, Intel and Sparc. To quote the Intel documentation
> of MFENCE:
>
> Performs a serializing operation on all load and store
> instructions that were issued prior the MFENCE
> instruction. This serializing operation guarantees that
> every load and store instruction that precedes in
> program order the MFENCE instruction is globally visible
> before any load or store instruction that follows the
> MFENCE instruction is globally visible.
>
> Note the "globally visible".

If you read the whole sentence, you have:
<1> is globally visible before <2> is globally visible. That doesn't
sound to me like it's saying that <1> is globally visible.

I don't think the above says that instructions that precede the MFENCE
are guaranteed to be visible once the MFENCE instruction completes. It
does guarantee, however, that the earlier instructions are visible
before ANY of the following instructions are visible.


<snip>

Andy.



From: George Neuner on
On Mon, 29 Mar 2010 16:53:44 CST, James Kanze <james.kanze(a)gmail.com>
wrote:

>On Mar 28, 10:05 pm, George Neuner <gneun...(a)comcast.net> wrote:
>> On Thu, 25 Mar 2010 17:31:25 CST, James Kanze <james.ka...(a)gmail.com>
>> wrote:
>
>> >On Mar 25, 7:10 pm, George Neuner <gneun...(a)comcast.net> wrote:
>>
>> >> As you noted, 'volatile' does not guarantee that an OoO CPU will
>> >> execute the stores in program order ...
>
>> >Arguably, the original intent was that it should. But it
>> >doesn't, and of course, the ordering guarantee only applies to
>> >variables actually declared volatile.
>
>> "volatile" is quite old ... I'm pretty sure the "intent" was defined
>> before there were OoO CPUs (in de facto use if not in standard
>> document). Regardless, "volatile" only constrains the behavior of the
>> *compiler*.
>
>More or less. Volatile requires the compiler to issue code
>which conforms to what the documentation says it does. It
>requires all accesses to take place after the preceding sequence
>point, and the results of those accesses to be stable before the
>following sequence point. But it leaves it up to the
>implementation to define what is meant by "access", and most
>take a very, very liberal view of it.

Agreed, with the caveat that some ISAs do not give the compiler the
tools to achieve this.


>> The purpose of the fence is to sequence memory accesses.
>
>For a much more rigorous definition of "access" than that used
>by the C++ standard.

Not exactly. I agree that the processor's memory model guarantees
stronger ordering than that of the C++ standard (or almost any
language standard, for that matter), but you are attributing semantics
to "fence" that aren't there.


>> All the fence does is create a checkpoint in the instruction
>> sequence at which relevant load or store instructions
>> dispatched prior to dispatch of the fence instruction will
>> have completed execution.
>
>That's not true for the two architectures whose documentation
>I've studied, Intel and Sparc.

Then you'd better go back and study 8)


>To quote the Intel documentation of MFENCE:
>
> Performs a serializing operation on all load and store
> instructions that were issued prior the MFENCE
> instruction. This serializing operation guarantees that
> every load and store instruction that precedes in
> program order the MFENCE instruction is globally visible
> before any load or store instruction that follows the
> MFENCE instruction is globally visible.

Now look at LFENCE, SFENCE and CLFLUSH and think about why they are
provided separately. Also look at PREFETCH and see what it says about
fences.


Intel provides MFENCE as a heavyweight combination of LFENCE, SFENCE
and CLFLUSH. MFENCE does propagate to memory *because* it flushes the
cache. However the primitive, SFENCE, ensures propagation of writes
only to L2 cache.

Sparc has no single instruction that both fences and flushes the
cache. MEMBAR ensures propagation only to L2 cache. A separate FLUSH
instruction is necessary to ensure propagation to memory.

Sparc also does not have separate load and store fences, but it offers
two variants of MEMBAR which provide differing consistency guarantees.


>Note the "globally visible". Both Intel and Sparc guarantee
>strong ordering within a single core (i.e. a single thread);
>mfence or membar (Sparc) are only necessary if the memory will
>also be "accessed" from a separate unit: a thread running on a
>different core, or memory mapped IO.

Again, you're attributing semantics that aren't there.

For a store to be "globally visible" means that the value must be
visible from outside the core. This requires the value be in *some*
externally visible memory - not *main* memory in particular.
For both x86 and Sparc, this means L2 cache - the first level that can
be snooped off-chip.

For a load "globally visible" means that the value is present at all
levels of the memory hierarchy and cannot be seen differently by an
external observer. This simply follows from the normal operation of
the read pipeline - the value is written into all levels of cache
(more or less) at the same time it is loaded into the core register.

Note also that some CPUs can prefetch data in ways that bypass
externally visible levels of cache. Sparc and x86 (at least since
Pentium III) do not permit this.


>> However, in a memory hierarchy with caching, a store
>> instruction does not guarantee a write to memory but only that
>> one or more write cycles is executed on the core's memory
>> connection bus.
>
>On Intel and Sparc architectures, a store instruction doesn't
>even guarantee that. All it guarantees is that the necessary
>information is somehow passed to the write pipeline. What
>happens after that is anybody's guess.

No. On both of those architectures a store instruction will
eventually cause the value to be written out of the core (except maybe
if a hardware exception occurs). Additionally, the source register may
be renamed or the stored value may be forwarded within the core to
rendezvous with a subsequent read of the same location already in the
pipeline ... but these internal flow optimizations don't affect the
externally visible operation of the store instruction.


>> For another thread (or core or CPU) to perceive a change a
>> value must be propagated into shared memory. For all
>> multi-core processors I am aware of, the first shared level of
>> memory is cache - not main memory. Cores on the same die
>> snoop each other's primary caches and share higher level
>> caches. Cores on separate dies in the same package share
>> cache at the secondary or tertiary level.
>
>And on more advanced architectures, there are core's which don't
>share any cache. All of which is irrelevant, since simply
>issuing a store instruction doesn't even guarantee a write to
>the highest level cache, and a membar or a fence instruction
>guarantees access all the way down to the main, shared memory.

Sorry, but no. Even the architectures we've discussed here, x86 and
Sparc, do not satisfy your statement.


There might be architectures I'm unaware of which can elide an
off-core write entirely by rendezvous forwarding and register
renaming, but you haven't named one. I would consider eliding the
store to be a dangerous interpretation of memory semantics and I
suspect I would not be alone.

I'm not familiar with any cached architecture for which fencing alone
guarantees that a store writes all the way to main memory - I know
some that don't even have/need fencing because their on-chip caches
are write-through.


>> The upshot is this:
>> - "volatile" is required for any CPU.
>
>I'm afraid that doesn't follow from anything you've said.
>Particularly because the volatile is largely a no-op on most
>current compilers---it inhibits compiler optimizations, but the
>generated code does nothing to prevent the reordering that
>occurs at the hardware level.

"volatile" is required because the compiler must not reorder or
optimize away the loads or stores.
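
(For example, here's a minimal sketch of the kind of transformation a
compiler may legally make on a plain variable but not on a volatile one;
the names are purely illustrative:

bool stop = false;               // plain: the load may be hoisted
volatile bool stop_v = false;    // volatile: must be re-read each time

// The compiler may legally turn this into 'if (!stop) for (;;);'
void wait_plain()    { while (!stop)   { /* spin */ } }

// Here every iteration must actually load stop_v again.
void wait_volatile() { while (!stop_v) { /* spin */ } }
)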


>> - fences are required for an OoO CPU.
>
>By OoO, I presume you mean "out of order". That's not the only
>source of the problems.

OoO is not the *only* source of the problem. The compiler has little
control over hardware reordering ... fences are blunt instruments that
impact all loads or stores ... not just those to language level
"volatiles".


>> - cache control is required for a write-back cache between
>> CPU and main memory.
>
>The cache is largely irrelevent on Sparc or Intel. The
>processor architectures are designed in a way to make it
>irrelevant. All of the problems would be there even in the
>absence of caching. They're determined by the implementation of
>the write and read pipelines.

That's a naive point of view. For a cached processor, the operation
of the cache and its impact on real programs is *never* "irrelevant".


>James Kanze

George


From: Herb Sutter on
On Mon, 29 Mar 2010 16:55:44 CST, "Leigh Johnston" <leigh(a)i42.co.uk>
wrote:
>"James Kanze" <james.kanze(a)gmail.com> wrote in message
>news:36f7e40e-4584-430d-980e-5f7478728d16(a)z3g2000yqz.googlegroups.com...
>>> Performance is often cited as another reason to not use
>>> volatile however the use of volatile can actually help with
>>> multi-threading performance as you can perform a safe
>>> lock-free check before performing a more expensive lock.
>>
>> Again, I'd like to see how. This sounds like the double-checked
>> locking idiom, and that's been proven not to work.
>
>IMO for an OoO CPU the double checked locking pattern can be made to work
>with volatile if fences are also used or the lock also acts as a fence (as
>is the case with VC++/x86). This is also the counter-example you are
>looking for, it should work on some implementations. FWIW VC++ is clever
>enough to make the volatile redundant for this example however adding
>volatile makes no difference to the generated code (read: no performance
>penalty)

Are you sure? On x86 a VC++ volatile write is supposed to be emitted
as xchg, whereas an ordinary write is usually emitted as mov. If the
DCL control variable write is emitted as mov on x86 then DCL won't
work correctly (well, it'll appear to work...).
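
For reference, here's a minimal sketch of the pattern in question
(Widget and g_cs are illustrative stand-ins, not code from Leigh's post;
assume InitializeCriticalSection(&g_cs) was called at start-up):

#include <windows.h>

struct Widget { int value; };

CRITICAL_SECTION g_cs;
Widget * volatile g_instance = 0;

Widget * GetInstance()
{
    if (g_instance == 0)                // first, lock-free check
    {
        EnterCriticalSection(&g_cs);
        if (g_instance == 0)            // second check, under the lock
            g_instance = new Widget();  // the store whose encoding matters
        LeaveCriticalSection(&g_cs);
    }
    return g_instance;
}

Whether that last store is emitted as a plain mov or as a locked/xchg
write is exactly the question above.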

Herb


---
Herb Sutter (herbsutter.wordpress.com) (www.gotw.ca)

Convener, SC22/WG21 (C++) (www.gotw.ca/iso)
Architect, Visual C++ (www.gotw.ca/microsoft)
