From: Herb Sutter on
On Wed, 31 Mar 2010 02:56:56 CST, Joshua Maurice
<joshuamaurice(a)gmail.com> wrote:
>On Mar 30, 8:14 pm, Herb Sutter <herb.sut...(a)gmail.com> wrote:
>> On Tue, 30 Mar 2010 16:37:59 CST, James Kanze <james.ka...(a)gmail.com>
>> wrote:
>> >(I keep seeing mention here of instruction reordering. In the
>> >end, instruction reordering is irrelevant. It's only one thing
>> >that may lead to reads and writes being reordered.
>>
>> Yes, but: Any reordering at any level can be treated as an instruction
>> reordering -- actually, as a source code reordering. That's why all
>> language-level MM discussions only bother to talk about source-level
>> reorderings, because any CPU or cache transformations end up having
>> the same effect as some corresponding source-level reordering.
>
>Not quite, no.

You mistake what we mean by "reordering" -- see note at bottom.

>On "weaker guarantee" processors, let's take the
>following example:
>
>/*
>start pseudo code example. Forgive me for any "typos". This is off the
>top of my head and I haven't really used lambda functions.
>*/
>int main()
>{ int a = 0;
> int b = 0;
> int c[4];
> int d[4];
> start_thread([&]() -> void { c[0] = a; d[0] = b; });
> start_thread([&]() -> void { c[1] = a; d[1] = b; });
> start_thread([&]() -> void { c[2] = a; d[2] = b; });
> start_thread([&]() -> void { c[3] = a; d[3] = b; });
> a = 1;
> b = 2;
> cout << c[0] << " " << d[0] << '\n'
> << c[1] << " " << d[1] << '\n'
> << c[2] << " " << d[2] << '\n'
> << c[3] << " " << d[3] << endl;
>}
>//end pseudo code example
>
>On some modern processors, most (in)famously the DEC Alpha with its
>awesome split cache, this program in the real world (or something very
>much like it) can print:
>0 0
>0 2
>1 0
>1 2
>
>Specifically, this is a single execution of the program. In this
>single execution, the writes "a = 1; b = 2;" are seen to happen in two
>different orders, the exact same "store instructions" become visible
>to other cores in different orders. There is no (sane) source code
>level reordering that can achieve this.

Sure there is, in fact a single reordering of two statements will do:
Reorder *only* the second thread (to make it be "d[1] = b; c[1] =
a;"), then the above occurs in several possible interleavings.

>I tried to emphasize this else-
>thread: you cannot think about threading in terms of "possible
>interleavings of instructions".
^^^^^^^^^^^^^

Aha, I see our disconnect. My point wasn't about "interleavings" but
rather "reorderings." Here, "reorderings" doesn't mean possible
interleavings of the source code as written, it's about actually
permuting the source code to execute operations (notably, memory
operations) out of order (as compilers, processors, and caches are all
wont to do).

Herb


---
Herb Sutter (herbsutter.wordpress.com) (www.gotw.ca)

Convener, SC22/WG21 (C++) (www.gotw.ca/iso)
Architect, Visual C++ (www.gotw.ca/microsoft)

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Anthony Williams on
Andy Venikov <swojchelowek(a)gmail.com> writes:

> And of course I'll gladly embrace C++0x
> atomic<>... when it becomes available.

std::atomic<> is available now for some platforms: my just::thread C++0x
thread library provides std::atomic<> for MSVC 2008, MSVC2010 on Windows
and g++4.3, g++4.4 on Ubuntu linux.

http://www.stdthread.co.uk

Anthony
--
Author of C++ Concurrency in Action http://www.stdthread.co.uk/book/
just::thread C++0x thread library http://www.stdthread.co.uk
Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk
15 Carrallack Mews, St Just, Cornwall, TR19 7UL, UK. Company No. 5478976

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Bart van Ingen Schenau on
On Mar 31, 5:14 am, "Leigh Johnston" <le...(a)i42.co.uk> wrote:
> "James Kanze" <james.ka...(a)gmail.com> wrote in messagenews:da63ca83-4d6e-416a-9825-c24deed3e49f(a)10g2000yqq.googlegroups.com...
>
> <snip>
>
> > Double checked locking can be made to work if you introduce
> > inline assembler or use some other technique to insert a fence
> > or a membar instruction in the appropriate places. But of
> > course, then, the volatile becomes superficial.
>
> It is only superficial if there is a compiler guarantee that a load/store for a non-volatile variable is emitted in the presence
> of a fence which sounds like a dubious guarantee to me. What compilers stop performing optimizations in the presence of a
> fence and/or how does the compiler know which variables accesses can be optimized in the presence of a fence?

To extend that question: what guarantees do you have that the compiler
will not optimise the entire program to a single NOOP?

Either the compiler understands the fence instruction and it
understands the implications for possible optimisations on objects
that are (potentially) accessible from outside the current function.
Or the compiler does not understand the fence instruction and sees
just a bit of black magic, which must be assumed to change all
(potentially) externally accessible objects.
Or the compiler is broken beyond recognition.

I don't see any options there where adding volatile qualifications
would make any difference.

>
> /Leigh
>
Bart v Ingen Schenau


--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Joshua Maurice on
On Mar 31, 4:49 pm, Herb Sutter <herb.sut...(a)gmail.com> wrote:
> On Wed, 31 Mar 2010 02:56:56 CST, Joshua Maurice
> >On Mar 30, 8:14 pm, Herb Sutter <herb.sut...(a)gmail.com> wrote:
> >> Yes, but: Any reordering at any level can be treated as an instruction
> >> reordering -- actually, as a source code reordering. That's why all
> >> language-level MM discussions only bother to talk about source-level
> >> reorderings, because any CPU or cache transformations end up having
> >> the same effect as some corresponding source-level reordering.
>
> >Not quite, no.
>
> You mistake what we mean by "reordering" -- see note at bottom.
[snip]
> >I tried to emphasize this else-
> >thread: you cannot think about threading in terms of "possible
> >interleavings of instructions".
>
> Aha, I see our disconnect. My point wasn't about "interleavings" but
> rather "reorderings." Here, "reorderings" doesn't mean possible
> interleavings of the source code as written, it's about actually
> permuting the source code to execute operations (notably, memory
> operations) out of order (as compilers, processors, and caches are all
> wont to do).

Yes; it is just an argument over semantics. You said "Any reordering at
any level can be treated as an instruction reordering -- actually, as a
source code reordering". I interpreted "reordering" and "source code"
as actual compilable source code for a sequentially consistent single
core processor with preemptive threading. Take my example from my
previous post, except with each thread running the same function, but
passed different "c" and "d" arguments. There is no (not malicious)
source code transformation from that example to compilable source code
for that processor which could produce the result "00 10 02 12".
Instead, you are using a model of a processor which has cores which
each take "source code", possibly reorder the "source code" in
isolation from other cores, and execute the "source code". Under this
model, I agree. However, I think this is a misnomer of "source code"
especially in context of your post and this thread. Your post suggests
to some (including myself) that hardware "reordering" is equivalent to
compiler reordering, which it is not. Hardware reordering can result
in the same writes becoming visible to other threads in different
orders. Optimizing compilers on sequentially consistent hardware
cannot replicate all possible hardware reorderings of not-sequentially
consistent processors, such as my example, without being malicious.

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Joshua Maurice on
On Mar 31, 3:35 pm, Andy Venikov <swojchelo...(a)gmail.com> wrote:
> James Kanze wrote:
>
> <snip>
>
> > I'm not sure I follow. Basically, the fence guarantees that the
> > hardware can't do specific optimizations. The same
> > optimizations that the software can't do in the case of
> > volatile. If you think you need volatile, then you certainly
> > need a fence. (And if you have the fence, you no longer need
> > the volatile.)
>
> Ah, finally I think I see where you are coming from. You think that if
> you have the fence you no longer need a volatile.
>
> I think you assume too much about how fence is really implemented. Since
> the standard says nothing about fences you have to rely on a library
> that provides them and if you don't have such a library, you'll have to
> implement one yourself. A reasonable way to implement a barrier would be
> to use macros that, depending on a platform you run, expand to inline
> assembly containing the right instruction. In this case the inline asm
> will make sure that the compiler won't reorder the emitted instructions,
> but it won't make sure that the optimizer will not throw away some
> needed instructions.
>
> For example, following my post where I described Magued Michael's
> algorithm, here's how relevant excerpt without volatiles would look like:
>
> //x86-related defines:
> #define LoadLoadBarrier() asm volatile ("mfence")
>
> //Common code
> struct Node
> {
> Node * pNext;};
>
> Node * head_;
>
> void f()
> {
> Node * pLocalHead = head_;
> Node * pLocalNext = pLocalHead->pNext;
>
> LoadLoadBarrier();
>
> if (pLocalHead == head_)
> {
> printf("pNext = %p\n", pLocalNext);
> }
>
> }
>
> Just to make you happy I defined LoadLoadBarrier as a full mfence
> instruction, even though on x86 there is no need for a barrier here,
> even on a multicore/multiprocessor.
>
> And here's how gcc 4.3.2 on Linux/x86-64 generated object code:
>
> 0000000000400630 <_Z1fv>:
> 400630: 0f ae f0 mfence
> 400633: 48 8b 05 fe 09 20 00 mov 0x2009fe(%rip),%rax #
> 601038 <head_>
> 40063a: bf 5c 07 40 00 mov $0x40075c,%edi
> 40063f: 48 8b 30 mov (%rax),%rsi
> 400642: 31 c0 xor %eax,%eax
> 400644: e9 bf fe ff ff jmpq 400508 <printf(a)plt>
> 400649: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
>
> As you can see, it uselessly put mfence right at the beginning of
> function f() and threw away the second read of head_ and the whole if
> statement altogether.
>
> Naively, you could say that we could put "memory" clobber in the inline
> assembly clobber list like this:
> #define LoadLoadBarrier() asm volatile ("mfence" : : : "memory")
>
> This will work, but it will be a huge overkill, because after this the
> compiler will need to re-read all variables, even unrelated ones. And
> when f() gets inlined, you get a huge performance hit.
>
> Volatile saves the day nicely and beautifully, albeit not "standards"
> portably. But as I said elsewhere, this will work on most compilers and
> hardware. Of course I'd need to test it on the compiler/hardware
> combination that client is going to run it on, but such is the peril of
> trying to provide portable interface with non-portable implementation.
> But so far I haven't found a single combination that wouldn't correctly
> compile the code with volatiles. And of course I'll gladly embrace C++0x
> atomic<>... when it becomes available. Right now though, I'm slowly
> migrating to boost::atomic (which again, internally HAS TO and IS using
> volatiles).

I was hoping someone more knowledgeable would reply, but I guess it's
up to me. My experience has been mostly limited to POSIX, WIN32, and
Java threading, so what I'm about to say I say without the highest
level of confidence.

I understand your desire to use volatile in this way, and I think it's
a reasonable use case and desire. You are assuming that all relevant
shared state between the two threads is accessible only through head_.
You issue the hardware only fence instruction with "asm volatile" to
make sure the compiler does not remove it. You do not put the "memory"
clobber, and thus the compiler does not understand what it does.
Finally, you use "volatile" to make sure that the compiler will put
another load instruction after the fence in the compiled machine
code.

My replies are thus:

First, you don't want the "memory" clobber because you know that the
(relevant) shared state is only accessible through head. If the
compiler will load in data dependency order and the volatile read will
do a read after the hardware fence, then everything should work out.
I'm not sure if you could find such guarantee published for any
compiler, though. However, I have a hard time thinking of a compiler
which would not do this, but I am not sure, and I would not rely upon
it without checking the compiled output.

This is partly an argument over the definition of fence. One might say
that when one talks about a portable fence, it applies to all memory,
not just a single load specified by the coder and all data dependent
loads. The "general" definition of fence demands the "memory" clobber
(absent the existence of a less draconian clobber). However, this is
irrelevant to the discussion of volatile and threading.

Finally, it comes down to whether volatile will force a load
instruction to be emitted from the compiler. At the very least, this
seems to be included in the intent and spirit of volatile, and I would
hazard a guess that all compilers would emit a load (but not
necessarily anything more [not that anything more would be needed in
this case], and ignoring volatile bugs, which greatly abound across
compilers). However, you're already at the assembly level. Would it be
that much harder to use an "asm volatile" to do a load instead of a
volatile qualified load? You're already doing assembly hackery, so at
least use something which is "more guaranteed" to work like an "asm
volatile" load and not volatile which was never intended to be a
useful threading primitive. Perhaps supply a primitive like
DataDependencyFenceForSingleLoad? I don't know enough about hardware
to even hazard a guess if such a thing is portably efficient. I do
know enough that what you're doing is not portable as is.

All in all though, this does not change the fact that volatile is not
a useful, correct, portable threading primitive. All you've
demonstrated is that volatile in conjunction with assembly (not
portable) can be a useful, correct, non-portable threading primitive,
though I would argue that the code has poor "style" and should not be
using volatile.


--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]