From: Leigh Johnston on


"James Kanze" <james.kanze(a)gmail.com> wrote in message news:da63ca83-4d6e-416a-9825-c24deed3e49f(a)10g2000yqq.googlegroups.com...

<snip>

> Double checked locking can be made to work if you introduce
> inline assembler or use some other technique to insert a fence
> or a membar instruction in the appropriate places. But of
> course, then, the volatile becomes superfluous.

It is only superfluous if there is a compiler guarantee that a load/store for a non-volatile variable is still emitted in the presence of a fence, which sounds like a dubious guarantee to me. Which compilers stop performing optimizations in the presence of
a fence, and how does the compiler know which variable accesses may not be optimized across one?

>
>> This is also the counter-example you are looking for, it
>> should work on some implementations.
>
> It's certainly not an example of a sensible use of volatile,
> since without the membar/fence, the algorithm doesn't work (at
> least on most modern processors, which are multicore). And with
> the membar/fence, the volatile is superfluous, and not needed.

Read what I said above.

>> FWIW VC++ is clever enough to make the volatile redundant for
>> this example however adding volatile makes no difference to
>> the generated code (read: no performance penalty) and I like
>> making such things explicit similar to how one uses const
>> (doesn't affect the generated output but documents the
>> programmer's intentions).
>
> The use of a fence or membar (or some system specific "atomic"
> access) would make the intent explicit. The use of volatile
> suggests something completely different (memory mapped IO, or
> some such).

Obviously we disagree on this point, hence the argument we are having.

<snip>

>> The only volatile in my entire codebase is for the "status" of
>> my "threadable" base class and I don't always acquire a lock
>> before checking this status and I don't fully trust that the
>> optimizer won't cache it for all cases that might crop up as I
>> develop code.
>
> I'd have to see the exact code to be sure, but I'd guess that
> without an mfence somewhere in there, the code won't work on a
> multicore machine (which is just about everything today), and
> with the mfence, the volatile isn't necessary.

The code does work on a multi-core machine, and I am confident it will continue to work as I write new code, precisely because I am using volatile and am therefore guaranteed that a load will be emitted rather than optimized away.

> Also, at least under Solaris, if there is no contention, the
> execution time of pthread_mutex_lock is practically the same as
> that of membar. Although I've never actually measured it, I
> suspect that the same is true if you use CriticalSection (and
> not Mutex) under Windows.

Critical sections are expensive compared to a simple load, which is guaranteed by using volatile. It is not always necessary to use a fence: all a fence does is guarantee ordering, so it depends on the use-case.

>
>> BTW I try and avoid singletons too so I haven't found the need
>> to use the double checked locking pattern AFAICR.
>
> Double checked locking is a pattern which can be applied to many
> things, not just to singletons.
>

I never said otherwise; singletons are an obvious example use, though, and this thread was originally about singletons.

/Leigh

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Herb Sutter on
On Mon, 29 Mar 2010 20:22:54 CST, "Leigh Johnston" <leigh(a)i42.co.uk>
wrote:
>"Herb Sutter" <herb.sutter(a)gmail.com> wrote in message news:j462r55lr984u1vg0jttl47upeutfcardg(a)4ax.com...
><snip>
>>>On Mar 29, 7:45 am, "Leigh Johnston" <le...(a)i42.co.uk> wrote:
>>>> Performance is often cited as another reason to not use
>>>> volatile however
>>
>> No "however" needed, that cited reason is correct. Volatile disables
>> optimizations that would be legal for an atomic<>. A quick example is
>> combining/eliding writes (e.g., v = 1; v = 2; can't be transformed to
>> v = 2;, but a = 1; a = 2; can be transformed to a = 2;). Another is
>> combining/eliding reads.
>
>If atomic reads can be elided, won't that be problematic for using atomics with the double-checked locking pattern (so we are back to using volatile atomics)?

No, because it's only adjacent atomic reads (i.e., separated only by
ordinary memory operations) that can be combined. In DCL there's a
lock acquire operation in between, and the atomic reads can't be
reordered across that to make them adjacent, so they can't be
combined.

Herb


---
Herb Sutter (herbsutter.wordpress.com) (www.gotw.ca)

Convener, SC22/WG21 (C++) (www.gotw.ca/iso)
Architect, Visual C++ (www.gotw.ca/microsoft)


From: Herb Sutter on
On Tue, 30 Mar 2010 05:03:11 CST, Andy Venikov
<swojchelowek(a)gmail.com> wrote:
>Herb Sutter wrote:
>> Please remember this: Standard ISO C/C++ volatile is useless for
>> multithreaded programming. No argument otherwise holds water; at best
>> the code may appear to work on some compilers/platforms, including all
>> attempted counterexamples I've seen on this thread.
>
>You have an enormous clout on C++ professionals, including myself, so
>before permanently agreeing to such an all-encompassing statement allow
>me to maybe step back a little and see what it is that's at the core of
>this argument. Maybe we're arguing the same point. Or maybe I'm missing
>something big in which case I'll be doubly glad to have been shown my
>wrong assumptions.

Short answer: Note I deliberately said "Standard" above -- the above
statement is completely true for portable usage. You may get away with
it on some platforms today, but it's nonportable and even the
getting-away won't last.

Slightly longer answer follows:

>I understand that volatile never was supposed to be of any help for
>multithreaded programming. I don't expect it to issue any memory fences
> nor make any guarantees whatsoever about anything thread-related...

Yes, and that's why it can't reliably be used for inter-thread
communication == synchronization.

>Yet, on all the compilers I know of (gcc, mingw, MSVC, LLVM, Intel) it
>produces just the code I need for my multithreaded programs. And I
>really don't see how it wouldn't, given common-sense understanding of
>what it should do in single-threaded programs. And I'm pretty sure that
>it's not going to change in a foreseeable future.
>
>So my use of volatile maybe not standard-portable, but it sure is
>real-life portable.

It's like relying on undefined behavior. UB may happen to do what you
expected, most of the time, on your current compiler and platform.
That doesn't mean it's correct or portable, and it will be less and
less real-life portable on multi-core systems.

Because there was no better hook, volatile was strengthened (in
non-standard ways) on various systems. For example, on MS VC++ prior
to VC++ 2005 (I think), volatile had no ordering semantics at all, but
people thought it was used for inter-thread communications because the
Windows InterlockedXxxx APIs happened to take a volatile variable. But
that was just using volatile as a type-system tag to help you not
accidentally pass a plain variable, and a little bit to leverage the
lack of optimizations on volatile -- the real reason it worked was
that you were calling the InterlockedXxx APIs, and *those* are
correctly synchronized for lock-free coding.

Even now in VC++ 2005 and later, when volatile was strengthened so
that reads and writes are (almost) SC, to get fully SC lock-free code
in all cases you still have to use the InterlockedXxx APIs rather than
direct reads and writes of the volatile variable. The strengthened
volatile semantics mean that, on that compiler and when targeting
x86/x64, direct reads and writes are enough to make most examples
like DCL work, but not enough to make examples like Dekker's work --
for Dekker's to work correctly you still have to use the
InterlockedXxx APIs.


>Here's the point of view I'm coming from.
>Imagine that someone needs to implement a library that provides certain
>multithreading (multiprogramming) tools like atomic access,
>synchronization primitives and some lock-free algorithms that will be
>used by other developers so that they wouldn't have to worry about
>things like volatile. (Now that boost.atomic is almost out, I'll happily
>use it.

Important note: Using std::atomic<> is exactly the correct answer!

The only caveat is that it's not yet widely available, but this year
we're getting over the hump of wide availability thanks to Boost and
others.

>But Helge Bahmann (the author of the library) didn't have such a

Isn't it Anthony Williams who's doing Boost's atomic<> implementation?
Hmm.

>luxury, so to make his higher-level APIs work he had to internally
>resort to low-level tools like volatiles where appropriate.)

Of course, sure. The implementation of std::atomic<> on any given
platform needs to use platform-specific tools, including things like
explicit fences/membars (e.g., mf+st.rel on IA64), ordered APIs (e.g,.
InterlockedIncrement on Windows), and/or other nonstandard and
nonportable goo (e.g., platform-specific variants of volatile).

The implementation of any standard feature typically will internally
use nonstandard system-specific features. That's the standard
feature's purpose: to shield users from those details and make each
particular system do the right thing.

[...]
>Look at line D5: it needs to check if Q->Head is still the same as what
>we read from it before. Otherwise two possibilities for breaking the
>correctness arise: 1) it would be possible for the element pointed to by
[...]
>This piece of pseudo-code could be naively translated to the following C++
>code:
>
>while (true)
>{
>    Node * localHead = head_;
>    Node * localTail = tail_;
>    Node * localNext = localHead->next;
>    if (localHead == head_)
>    {
>        ...
>    }
>}
>
>But it wouldn't work, for the obvious reasons.
>One needs to insert MemoryFences in the right places.
[...]

Fences are evil. Nearly nobody can use them consistently correctly,
including people who have years of experience with them. Those people
(write once, and from then on) use the Linux atomics package or C++0x
std::atomic.

Every mutable shared object should be protected by a mutex (99.9%
case) or be atomic (0.1% case).

If you're going to write lock-free code, it's really, really, really
important to just make the shared variables be C++0x std::atomic<> (or
equivalently Java or .NET volatile, which isn't the same thing as ISO
C and C++ volatile). If you do, you won't have to reason about where
the fences need to go. Reasoning about where the fences need to go is
such a futile and error-prone job that most lock-free papers don't
even try to say where to put them and just assume SC execution.


>Here's the final code:

I apologize for not having time to read your transformations of
Maged's code closely, but in all of the following, why is the volatile
on the Node, not on the pointer? Even if volatile did all the magic
you want it to do (like Java/.NET volatile), that's broken because
it's in the wrong place, isn't it? Of course, the usual manifestation
of the problem is that the code will compile, run, and appear to
work...

>struct Node
>{
>    <unspecified> data;
>    Node volatile * pNext;
>};
>Node volatile * volatile head_;
>Node volatile * volatile tail_;
>
>dequeue()
>{
>    while (true)
>    {
>        Node volatile * localHead = head_;
>        Node volatile * localTail = tail_;
>        DataDependencyBarrier();
>        Node volatile * localNext = localHead->pNext;
>
>        if (localHead == head_)
>        {
>            ...
>        }
>        ...
>    }
>}
>
>
>Now this code will produce the intended correct object code on all the
>compilers I've listed above and on at least these CPUs: x86, itanium,
>mips, PowerPC (assuming that all the MemoryBarriers have been defined
>for all the platforms). And without any modifications to the above code.
>How's that for portability?

Without even reading the code logic and looking for races, I doubt it.

For a detailed analysis of multiple lock-free implementations of a
similar queue example, including an exceedingly rare race that even
under sustained heavy stress on a 24-core system only manifested once
every tens of millions of insertions, see:

Measuring Parallel Performance: Optimizing a Concurrent Queue
http://www.drdobbs.com/high-performance-computing/212201163


>Now, after writing all this, I realize that I could've used a simpler
>example - a simple Peterson's algorithm for two threads wouldn't work
>without a use of a volatile: the "turn" variable is assigned the same
>value as it's being compared to later, so the compiler will omit the "if
>turn == x" part in the if statement.

Actually, Dekker's/Peterson's is broken even with VC++ 2008
heavily-strengthened volatile. (Sorry.) To make it correct you have to
store to the flag variable using InterlockedExchange() or similar, not
using a simple write to the flag variable.

Herb



From: Herb Sutter on
On Tue, 30 Mar 2010 09:47:55 CST, "Leigh Johnston" <leigh(a)i42.co.uk>
wrote:
>>>IMO for an OoO CPU the double checked locking pattern can be made to work
>>>with volatile if fences are also used or the lock also acts as a fence (as
>>>is the case with VC++/x86). This is also the counter-example you are
>>>looking for, it should work on some implementations. FWIW VC++ is clever
>>>enough to make the volatile redundant for this example however adding
>>>volatile makes no difference to the generated code (read: no performance
>>>penalty)
>>
>> Are you sure? On x86 a VC++ volatile write is supposed to be emitted
>> as xchg, whereas an ordinary write is usually emitted as mov. If the
>> DCL control variable write is emitted as mov on x86 then DCL won't
>> work correctly (well, it'll appear to work...).
>
>Yes, on x86 VC++ (VC9) emits a MOV for a volatile write; however, entering
>the critical section in the DCL should act as a fence, so it should work. I
>asked this question (about VC++ volatile not emitting fences) in
>microsoft.public.vc.language but didn't get a satisfactory reply.

Ah yes, my thinko. I was momentarily rusty on the outcome of internal
design discussions a year or three ago.

Let me try again:

Yes, we (VC++ since VC++ 2005, I think) do emit a plain MOV for a
volatile write. DCL works with a plain MOV on x86/x64 because
x86/x64's strong memory model combined with VC++ 2005+'s restrictions
on reorderings around volatile reads/writes is just enough to do the
trick.

However, that's not enough to make Dekker's work. To make Dekker's
work correctly today on our platform, you have to perform the write to
the flag variable using an InterlockedXxx() call of some sort, not as
a direct write to a volatile flag variable.

If we wanted native volatile to be suitable for general lock-free
coding including Dekker's using only direct reads and writes of the
volatile variable (i.e., without InterlockedXxx API calls), however,
we would have to emit writes as XCHG, not MOV (and some other things).
We don't do that today; we discussed doing it, and the current
decision is that we do not intend to -- it would penalize existing
uses of volatile, and the right tool for lock-free coding is
atomic<>. (Which we also need to ship and isn't in VC 2010, sorry, but
we're working on it next.)

FWIW, the issue is that VC++ volatile is load/acquire and
store/release, but that still allows one reordering -- store-load
reordering, which kills Dekker's. Emitting the store as xchg would
prevent store-load reordering, hence would fix Dekker's. But the right
way to fix Dekker's on our platform, again, is: for now, use
InterlockedXxx APIs for lock-free code; and when it's available, use
atomic<> (Boost's or eventually ours) for lock-free code.

Herb



From: Herb Sutter on
On Tue, 30 Mar 2010 16:37:59 CST, James Kanze <james.kanze(a)gmail.com>
wrote:
>(I keep seeing mention here of instruction reordering. In the
>end, instruction reordering is irrelevant. It's only one thing
>that may lead to reads and writes being reordered.

Yes, but: any reordering at any level can be treated as an instruction
reordering -- actually, as a source-code reordering. That's why
language-level memory-model discussions only bother to talk about
source-level reorderings: any CPU or cache transformation ends up
having the same effect as some corresponding source-level reordering.

Herb

