From: Joshua Maurice on
On Mar 30, 8:14 pm, Herb Sutter <herb.sut...(a)gmail.com> wrote:
> On Tue, 30 Mar 2010 16:37:59 CST, James Kanze <james.ka...(a)gmail.com>
> wrote:
>
> >(I keep seeing mention here of instruction reordering. In the
> >end, instruction reordering is irrelevant. It's only one thing
> >that may lead to reads and writes being reordered.
>
> Yes, but: Any reordering at any level can be treated as an instruction
> reordering -- actually, as a source code reordering. That's why all
> language-level MM discussions only bother to talk about source-level
> reorderings, because any CPU or cache transformations end up having
> the same effect as some corresponding source-level reordering.

Not quite, no. On "weaker guarantee" processors, let's take the
following example:

/*
start pseudo code example. Forgive me for any "typos". This is off the
top of my head and I haven't really used lambda functions. Note that
each reader loads b first and then a, so a reader that sees the new b
but the old a has genuinely observed the stores out of order.
*/
#include <iostream>
#include <thread>
using namespace std;

int main()
{
    int a = 0;
    int b = 0;
    int c[4];
    int d[4];
    thread t0([&]() { d[0] = b; c[0] = a; });
    thread t1([&]() { d[1] = b; c[1] = a; });
    thread t2([&]() { d[2] = b; c[2] = a; });
    thread t3([&]() { d[3] = b; c[3] = a; });
    a = 1;
    b = 2;
    t0.join(); t1.join(); t2.join(); t3.join();
    cout << c[0] << " " << d[0] << '\n'
         << c[1] << " " << d[1] << '\n'
         << c[2] << " " << d[2] << '\n'
         << c[3] << " " << d[3] << endl;
}
//end pseudo code example

On some modern processors, most (in)famously the DEC Alpha with its
awesome split cache, this program in the real world (or something very
much like it) can print:
0 0
0 2
1 0
1 2

Specifically, this is a single execution of the program. In this
single execution, the writes "a = 1; b = 2;" are seen to happen in two
different orders; the exact same "store instructions" become visible
to other cores in different orders. There is no (sane) source-code
level reordering that can achieve this. I tried to emphasize this else-
thread: you cannot think about threading in terms of "possible
interleavings of instructions". That model does not work portably.
Absent synchronization, on some processors, there is no global order
of instructions.


--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: James Kanze on
On 31 Mar, 04:14, "Leigh Johnston" <le...(a)i42.co.uk> wrote:
> "James Kanze" <james.ka...(a)gmail.com> wrote in
> messagenews:da63ca83-4d6e-416a-9825-c24deed3e49f(a)10g2000yqq.googlegroups.com...

> <snip>

> > Double checked locking can be made to work if you introduce
> > inline assembler or use some other technique to insert a
> > fence or a membar instruction in the appropriate places.
> > But of course, then, the volatile becomes superficial.

> It is only superficial if there is a compiler guarantee that a
> load/store for a non-volatile variable is emitted in the
> presence of a fence, which sounds like a dubious guarantee to
> me. Which compilers stop performing optimizations in the
> presence of a fence, and/or how does the compiler know which
> variable accesses can be optimized in the presence of a
> fence?

All of the compilers I know either treat inline assembler or an
external function call to a function written in assembler as a
worst case with regards to optimizing, and do not move code
across it, or they provide a means of specifying to the
compiler which variables, etc. are affected by the assembler.

> >> This is also the counter-example you are looking for, it
> >> should work on some implementations.

> > It's certainly not an example of a sensible use of volatile,
> > since without the membar/fence, the algorithm doesn't work
> > (at least on most modern processors, which are multicore).
> > And with the membar/fence, the volatile is superfluous, and
> > not needed.

> Read what I said above.

I have. But it doesn't hold water.

> >> FWIW VC++ is clever enough to make the volatile redundant
> >> for this example however adding volatile makes no
> >> difference to the generated code (read: no performance
> >> penalty) and I like making such things explicit similar to
> >> how one uses const (doesn't affect the generated output but
> >> documents the programmer's intentions).

> > The use of a fence or membar (or some system specific
> > "atomic" access) would make the intent explicit. The use of
> > volatile suggests something completely different (memory
> > mapped IO, or some such).

> Obviously we disagree on this point hence the reason for the
> existence of this argument we are having.

Yes. Theoretically, I suppose, you could find a compiler which
documented that it would move code across a fence or a membar
instruction. In practice: either the compiler treats assembler
as a black box, and supposes that it might do anything, or it
analyses the assembler, and takes the assembler into account
when optimizing. In the first case, the compiler must
synchronize its view of the memory, because it must suppose
that the assembler reads and writes arbitrary values from
memory. And in the second (which is fairly rare), it recognizes
the fence, and adjusts its optimization accordingly.

Your argument is basically that the compiler writers are either
completely incompetent, or that they are intentionally out to
make your life difficult. In either case, there are a lot more
things that they can do to make your life difficult. I wouldn't
use such a compiler, because it would be, in effect, unusable.

> <snip>
> >> The only volatile in my entire codebase is for the "status" of
> >> my "threadable" base class and I don't always acquire a lock
> >> before checking this status and I don't fully trust that the
> >> optimizer won't cache it for all cases that might crop up as I
> >> develop code.

> > I'd have to see the exact code to be sure, but I'd guess that
> > without an mfence somewhere in there, the code won't work on a
> > multicore machine (which is just about everything today), and
> > with the mfence, the volatile isn't necessary.

> The code does work on a multi-core machine and I am confident
> it will continue to work when I write new code precisely
> because I am using volatile and therefore guaranteed a load
> will be emitted not optimized away.

If you have the fence in the proper place, you're guaranteed
that it will work, even without volatile. If you don't, you're
not guaranteed anything.

> > Also, at least under Solaris, if there is no contention, the
> > execution time of pthread_mutex_lock is practically the same
> > as that of membar. Although I've never actually measured
> > it, I suspect that the same is true if you use
> > CriticalSection (and not Mutex) under Windows.

> Critical sections are expensive when compared to a simple load
> that is guaranteed by using volatile. It is not always
> necessary to use a fence as all a fence is doing is
> guaranteeing order so it all depends on the use-case.

I'm not sure I follow. Basically, the fence guarantees that the
hardware can't do specific optimizations. The same
optimizations that the software can't do in the case of
volatile. If you think you need volatile, then you certainly
need a fence. (And if you have the fence, you no longer need
the volatile.)

--
James Kanze


From: Anthony Williams on
Herb Sutter <herb.sutter(a)gmail.com> writes:

>>But Helge Bahmann (the author of the library) didn't have such a
>
> Isn't it Anthony Williams who's doing Boost's atomic<> implementation?
> Hmm.

No. Helge's implementation covers more platforms than I have access to
or know how to write atomics for.

Anthony
--
Author of C++ Concurrency in Action http://www.stdthread.co.uk/book/
just::thread C++0x thread library http://www.stdthread.co.uk
Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk
15 Carrallack Mews, St Just, Cornwall, TR19 7UL, UK. Company No. 5478976


From: Leigh Johnston on
"James Kanze" <james.kanze(a)gmail.com> wrote in message
news:bbd4bca1-2c16-489b-b814-98db0aafb492(a)z4g2000yqa.googlegroups.com...
> On 31 Mar, 04:14, "Leigh Johnston" <le...(a)i42.co.uk> wrote:
>> "James Kanze" <james.ka...(a)gmail.com> wrote in
>> messagenews:da63ca83-4d6e-416a-9825-c24deed3e49f(a)10g2000yqq.googlegroups.com...
>
>> <snip>
>
>> > Double checked locking can be made to work if you introduce
>> > inline assembler or use some other technique to insert a
>> > fence or a membar instruction in the appropriate places.
>> > But of course, then, the volatile becomes superficial.
>
>> It is only superficial if there is a compiler guarantee that a
>> load/store for a non-volatile variable is emitted in the
>> presence of a fence, which sounds like a dubious guarantee to
>> me. Which compilers stop performing optimizations in the
>> presence of a fence, and/or how does the compiler know which
>> variable accesses can be optimized in the presence of a
>> fence?
>
> All of the compilers I know either treat inline assembler or an
> external function call to a function written in assembler as a
> worst case with regards to optimizing, and do not move code
> across it, or they provide a means of specifying to the
> compiler which variables, etc. are affected by the assembler.
>

Yes, I realized that after posting, but as this newsgroup is moderated,
posting an immediate retraction reply is not possible. :)

{ An immediate retraction may be possible. Just write to the moderators (see the
link in the banner at the end of this article) including the article's tracking
number. If not yet approved the article is then rejected per request. -mod }


<snip>

>> The code does work on a multi-core machine and I am confident
>> it will continue to work when I write new code precisely
>> because I am using volatile and therefore guaranteed a load
>> will be emitted not optimized away.
>
> If you have the fence in the proper place, you're guaranteed
> that it will work, even without volatile. If you don't, you're
> not guaranteed anything.

It is guaranteed to work on the platform I am implementing for, and I
find it hard to believe that it wouldn't work on other
platforms/compilers which have similar semantics for volatile (which
you already agreed was a fair assumption).

>
>> > Also, at least under Solaris, if there is no contention, the
>> > execution time of pthread_mutex_lock is practically the same
>> > as that of membar. Although I've never actually measured
>> > it, I suspect that the same is true if you use
>> > CriticalSection (and not Mutex) under Windows.
>
>> Critical sections are expensive when compared to a simple load
>> that is guaranteed by using volatile. It is not always
>> necessary to use a fence as all a fence is doing is
>> guaranteeing order so it all depends on the use-case.
>
> I'm not sure I follow. Basically, the fence guarantees that the
> hardware can't do specific optimizations. The same
> optimizations that the software can't do in the case of
> volatile. If you think you need volatile, then you certainly
> need a fence. (And if you have the fence, you no longer need
> the volatile.)
>

My point is that it is possible to write a piece of multi-threaded code
which does not use a fence or a mutex/critical section and just reads a
single shared variable in isolation (ordering not important, and the
read is atomic on the platform in question), and for this *particular*
case volatile can be useful. I find it hard to believe that there are
no cases at all where this applies.

/Leigh


--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Andy Venikov on
James Kanze wrote:
<snip>
> I'm not sure I follow. Basically, the fence guarantees that the
> hardware can't do specific optimizations. The same
> optimizations that the software can't do in the case of
> volatile. If you think you need volatile, then you certainly
> need a fence. (And if you have the fence, you no longer need
> the volatile.)
>

Ah, finally I think I see where you are coming from: you think that if
you have the fence, you no longer need volatile.

I think you assume too much about how a fence is really implemented.
Since the standard says nothing about fences, you have to rely on a
library that provides them, and if you don't have such a library,
you'll have to implement one yourself. A reasonable way to implement a
barrier is to use macros that, depending on the platform, expand to
inline assembly containing the right instruction. In this case the
inline asm will make sure that the compiler won't reorder the emitted
instructions, but it won't make sure that the optimizer will not throw
away some needed instructions.

For example, following my post where I described Maged Michael's
algorithm, here's how the relevant excerpt would look without volatiles:

//x86-related defines:
#define LoadLoadBarrier() asm volatile ("mfence")

//Common code
struct Node
{
    Node * pNext;
};
Node * head_;

void f()
{
    Node * pLocalHead = head_;
    Node * pLocalNext = pLocalHead->pNext;

    LoadLoadBarrier();

    if (pLocalHead == head_)
    {
        printf("pNext = %p\n", pLocalNext);
    }
}

Just to make you happy I defined LoadLoadBarrier as a full mfence
instruction, even though on x86 there is no need for a barrier here,
even on a multicore/multiprocessor.

And here's how gcc 4.3.2 on Linux/x86-64 generated object code:

0000000000400630 <_Z1fv>:
  400630: 0f ae f0              mfence
  400633: 48 8b 05 fe 09 20 00  mov 0x2009fe(%rip),%rax  # 601038 <head_>
  40063a: bf 5c 07 40 00        mov $0x40075c,%edi
  40063f: 48 8b 30              mov (%rax),%rsi
  400642: 31 c0                 xor %eax,%eax
  400644: e9 bf fe ff ff        jmpq 400508 <printf(a)plt>
  400649: 0f 1f 80 00 00 00 00  nopl 0x0(%rax)

As you can see, it uselessly put mfence right at the beginning of
function f() and threw away the second read of head_ and the whole if
statement altogether.

Naively, you could say that we could put "memory" clobber in the inline
assembly clobber list like this:
#define LoadLoadBarrier() asm volatile ("mfence" : : : "memory")

This will work, but it is huge overkill, because the compiler will then
need to re-read all variables, even unrelated ones. And when f() gets
inlined, you take a huge performance hit.

Volatile saves the day nicely and beautifully, albeit not "standards"
portably. But as I said elsewhere, this will work on most compilers
and hardware. Of course I'd need to test it on the compiler/hardware
combination that the client is going to run it on, but such is the
peril of trying to provide a portable interface with a non-portable
implementation. So far I haven't found a single combination that
wouldn't correctly compile the code with volatiles. And of course I'll
gladly embrace C++0x atomic<>... when it becomes available. Right now,
though, I'm slowly migrating to boost::atomic (which, again, internally
HAS TO use and IS using volatiles).


Thanks,
Andy.
