From: James Kanze on
On Mar 20, 8:22 am, Tony Jorgenson <tonytinker2...(a)yahoo.com> wrote:

[...]
> I understand that volatile does not guarantee that the order
> of memory writes performed by one thread are seen in the same
> order by another thread doing memory reads of the same
> locations. I do understand the need for memory barriers
> (mutexes, atomic variables, etc) to guarantee order, but there
> are still 2 questions that have never been completely
> answered, at least to my satisfaction, in all of the
> discussion I have read on this group (and the non moderated
> group) on these issues.

> First of all, I believe that volatile is supposed to guarantee the
> following:

> Volatile forces the compiler to generate code that performs
> actual memory reads and writes rather than caching values in
> processor registers. In other words, I believe that there is a
> one-to-one correspondence between volatile variable reads and
> writes in the source code and actual memory read and write
> instructions executed by the generated code. Is this correct?

Sort of. The standard uses a lot of weasel words (for good
reasons) with regards to volatile, and in particular, leaves it
up to the implementation to define exactly what it means by
"access". Still, it's hard to imagine an interpretation that
doesn't imply a machine instruction which loads or stores.

Of course, on modern machines, a store instruction doesn't
necessarily result in a write to physical memory; you typically
need additional instructions to ensure that. And on the
compilers I know (g++, Sun CC and VC++), volatile doesn't cause
them to be generated. (My most concrete experience is with Sun
CC on a Sparc, where volatile doesn't ensure that memory mapped
I/O works correctly.)

> Question 1:
> My first question is with regard to using volatile instead of
> memory barriers in some restricted multi-threaded cases. If my
> above statements are correct, is it possible to use _only_
> volatile with no memory barriers to signal between threads in
> a reliable way if only a single word (perhaps a single byte)
> is written by one thread and read by another?

No. Storing a byte (at the machine code level) on one processor
or core doesn't mean that the results of the store will be seen
on another processor. Modern processors reorder memory writes
in hardware, so given the sequence:

volatile int a = 0, b = 0; // suppose int atomic

void f()
{
    a = 1;
    b = 1;
}

another thread may still see b == 1 and a == 0.
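(A sketch of the portable alternative, using the atomics of the
upcoming C++0x standard; with the default sequentially consistent
ordering, a thread which sees b == 1 is guaranteed to see a == 1 as
well:)

    #include <atomic>            // C++0x

    std::atomic<int> a(0), b(0);

    void f()
    {
        a.store(1);              // sequentially consistent by default;
        b.store(1);              // implies the necessary fences
    }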

> Question 1a:
> First of all, please correct me if I am wrong, but I believe
> volatile _must_always_ work as described above on any single
> core CPU. One CPU means one cache (or one hierarchy of caches)
> meaning one view of actual memory through the cache(s) that
> the CPU sees, regardless of which thread is running. Is this
> much correct for any CPU in existence? If not please mention a
> situation where this is not true (for single core).

The standard doesn't make any guarantees, but all of the
processor architectures I know do guarantee coherence within a
single core.

The real question here is rather: who has a single core machine
anymore? The last Sparc I worked on had 32 cores, and I got it
because it was deemed too slow for production work (where we had
128 cores). And even my small laptop is a dual core.

> Question 1b:
> Secondly, the only way I could see this not working on a
> multi-core CPU, with individual caches for each core, is if a
> memory write performed by one CPU is allowed to never be
> updated in the caches of other CPU cores. Is this possible?
> Are there any multi-core CPUs that allow this? Doesn't the
> MESI protocol guarantee that eventually memory cached in one
> CPU core is seen by all others? I know that there may be
> delays in the propagation from one CPU cache to the others,
> but doesn't it eventually have to be propagated? Can it be
> delayed indefinitely due to activity in the cores involved?

The problem occurs upstream of the cache. Modern processors
access memory through a pipeline, and optimize the accesses in
hardware, reading and writing a cache line at a time. So if
you read a, then b, but the hardware finds that b is already in
the read pipeline (because you've recently accessed something
near it), then the hardware won't issue a new bus access for b;
it will simply use the value already in the pipeline, which may
be older than the value of a, if the hardware does have to go to
memory for a.

All processors have instructions to force ordering: fence on an
Intel (and IIRC, a lock prefix creates an implicit fence),
membar on a Sparc. But the compilers I know don't issue these
instructions in the case of volatile access. So the hardware
still remains free to do the optimizations that volatile has
forbidden the compiler to do.
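So to get the guarantee, you end up writing the fence yourself; a
sketch, using gcc-style inline assembler for an x86 (the exact fence
needed depends on the processor's memory model, and a Sparc would use
membar instead):

    volatile int a = 0, b = 0;

    void f()
    {
        a = 1;
        asm volatile ("mfence" ::: "memory");  // hardware fence (x86); also
                                               // a compiler barrier, because
                                               // of the "memory" clobber
        b = 1;
    }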

> Question 2:
> My second question is with regard to if volatile is necessary
> for multi-threaded code in addition to memory barriers. I know
> that it has been stated that volatile is not necessary in this
> case, and I do believe this, but I don't completely understand
> why. The issue as I see it is that using memory barriers,
> perhaps through use of mutex OS calls, does not in itself
> prevent the compiler from generating code that caches
> non-volatile variable writes in registers.

Whether it prevents it or not is implementation defined. As
soon as you start doing this, you're formally in undefined
behavior as far as C or C++ are concerned. Posix and Windows,
however, make additional guarantees, and if the compiler is
Posix compliant or Windows compliant, you're safe with regards
to code movement across any of the APIs which forbid it.

If you're using things like inline assembler, or functions
written in assembler, you'll have to check your compiler
documentation, but in practice, the compiler will assume that
the inline code modifies all visible variables (and so ensure
that they are correctly written and read with regards to it)
unless it has some means to know better, and those means will
also allow it to take a possible fence or membar instruction
into account.
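With g++, for example, that "means to know better" is the clobber
list: an empty asm with a "memory" clobber is the classical
compiler-level barrier (a sketch; it generates no machine instruction
itself, it only stops the compiler from caching values across it):

    // g++ specific: forces the compiler to flush register copies of
    // globals before this point and to reload them afterwards.
    #define COMPILER_BARRIER() asm volatile ("" ::: "memory")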

> I have heard it written in this group that posix, for example,
> supports additional guarantees that make mutex lock/unlock
> (for example) sufficient for correct inter-thread
> communication through memory without the use of volatile. I
> believe I read here once (from James Kanze I believe) that
> "volatile is neither sufficient nor necessary for proper
> multi-threaded code" (quote from memory). This seems to imply
> that posix is in cahoots with the compiler to make sure that
> this works.

Posix imposes additional constraints on C compilers, in addition
to what the C standard does. Technically, Posix doesn't know
that C++ exists (and vice versa); practically, C++ compilers do
claim Posix compliance, and extrapolate the C guarantees in a
logical fashion. (Given that they generally concern basic types
like int, this really isn't too difficult.)

I've seen less formal specification with regards to Windows (and
heaven knows, I'm looking, now that I'm working in an almost
exclusively Windows environment). But practically speaking,
VC++ behaves under Windows like Posix compliant compilers under
Posix, and you won't find any other compiler breaking things
that work with VC++.

> If you add mutex locks and unlocks (I know RAII, so please
> don't derail my question) around some variable reads and
> writes, how do the mutex calls force the compiler to generate
> actual memory reads and writes in the generated code rather
> than register reads and writes?

That's the problem of the compiler implementor. Posix
(explicitly) and Windows (implicitly, at least) say that it has
to work, so it's up to the compiler implementor to make it work.
(In practice, most won't look into a function for which they
don't have the source code, and won't move code across a
function whose semantics they don't know.)

> I understand that compilation optimization affects these
> issues, but if I optimize the hell out of my code, how do
> posix calls (or any other OS threading calls) force the
> compiler to do the right thing? My only conjecture is that
> this is just an accident of the fact that the compiler can't
> really know what the mutex calls do and therefore the compiler
> must make sure that all globally accessible variables are
> pushed to memory (if they are in registers) in case _any_
> called function might access them. Is this what makes it work?

In practice, in a lot of cases, yes:-). It's an easy and safe
solution for the implementor, and it really doesn't affect
optimization that much---critical zones which include system
calls or other functions for which the compiler doesn't have the
source code aren't that common. In theory, however, a compiler
could know the list of system requests which guarantee memory
synchronization, and disassemble the object files of any
functions for which it didn't have the sources, to see if they
made any such requests. I just don't know of any compilers
which do this.

> If not, then how do mutex calls guarantee the compiler doesn't
> cache data in registers, because this would surely make the
> mutexes worthless without volatile (which I know from
> experience that they are not).

The system API says that they have to work. It's up to the
compiler implementor to ensure that they do. Most adopt the
simple solution: I don't know what this function does, so I'll
assume the worst. But at least in theory, more elaborate
strategies are possible.
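Which is why the usual pattern needs no volatile at all; a Posix
sketch, where the lock and unlock calls act both as the hardware
barriers and, for the reasons above, as compiler barriers:

    #include <pthread.h>

    int sharedCount = 0;                  // note: not volatile
    pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

    void increment()
    {
        pthread_mutex_lock(&mutex);       // the compiler cannot cache
        ++sharedCount;                    // sharedCount in a register
        pthread_mutex_unlock(&mutex);     // across these opaque calls
    }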

--
James Kanze



From: Michael Doubez on
On 24 mar, 12:33, James Kanze <james.ka...(a)gmail.com> wrote:
> On Mar 23, 1:42 pm, Michael Doubez <michael.dou...(a)free.fr> wrote:
>
> > On 23 mar, 00:22, "Bo Persson" <b...(a)gmb.dk> wrote:
>
> [...]
>
> > Still it does say something of the semantic of the memory
> > location. In practice the compiler will cut the optimizations
> > regarding the volatile location; I don't see a compiler
> > ignoring this kind of notification.
>
> Not really. It makes some vague statements concerning "access",
> while not defining what it really means by access. And "memory
> location", without further qualifiers, has no real meaning on
> modern processors, with their five or six levels of memory---is
> the memory the core specific cache, the memory shared by all the
> cores, or the virtual backup store (which maintains its values
> even after the machine has been shut down)?
>
> And of course, what really counts is what the compilers
> implement: neither g++, nor Sun CC, nor VC++ (at least through
> 8.0) give volatile any more semantics than issuing a load or
> store instruction---which the hardware will execute when it gets
> around to it. Maybe.
>
> > Which means that the memory value will eventually (after an
> > undetermined amount of time) be flushed to the location and
> > not kept around in the stack or somewhere else for
> > optimization reasons.
>
> Sorry, but executing a store instruction (or a mov with a
> destination in memory) does NOT guarantee that there will be a
> write cycle in main memory, ever. At least not on modern Sparc
> and Intel architectures. (I'm less familiar with others, but
> from what I've heard, Sparc and Intel are among the most strict
> in this regard.)

I am surprised. I would have expected cache lines to be flushed after
a given amount of time in order to avoid coherency issues, with
'volatile' making it even stronger by *forcing* a flush per
modification (although without guaranteeing ordering with other
non-volatile memory accesses).

[snip]

--
Michael



From: James Kanze on
On Mar 24, 7:12 pm, Michael Doubez <michael.dou...(a)free.fr> wrote:
> On 24 mar, 12:33, James Kanze <james.ka...(a)gmail.com> wrote:

[...]
> > Sorry, but executing a store instruction (or a mov with a
> > destination in memory) does NOT guarantee that there will be
> > a write cycle in main memory, ever. At least not on modern
> > Sparc and Intel architectures. (I'm less familiar with
> > others, but from what I've heard, Sparc and Intel are among
> > the most strict in this regard.)

> I am surprised. I would have expected cache lines to be
> flushed after a given amount of time in order to avoid
> coherency issues. 'volatile' making it worse by *forcing* a
> flush per modification (although without guaranteeing ordering
> with other non-volatile memory access).

Cache lines are only part of the picture, but similar concerns
apply to them. All of the coherency issues are addressed by
considering values, not store instructions. So if you modify
the same value several times before it makes it out of the
processor, some of those "writes" are lost. (This is generally
not an issue for threading, but it definitely affects things
like memory mapped I/O.) And for better or for worse, volatile
doesn't force any flushing on any of the compilers I know; all
it does is ensure that a store instruction is executed. So that
given something like:
    int volatile a;
    int volatile b;

    // ...
    a = 1;
    b = 2;

the compiler will ensure that the store instruction to a is
executed before the store instruction to b, but the hardware
(write pipeline, typically) may reorder the modifications to
main memory, or even in some extreme cases suppress one of them.

--
James Kanze



From: Andy Venikov on
Joshua Maurice wrote:
> On Mar 21, 2:32 pm, Andy Venikov <swojchelo...(a)gmail.com> wrote:
<snip>
>>
>> The standard places a requirement on conforming implementations that:
>>
>> 1.9.6
>> The observable behavior of the abstract machine is its sequence of reads
>> and writes to volatile data and calls to library I/O functions
>>
>> 1.9.7
>> Accessing an object designated by a volatile lvalue (3.10), modifying an
>> object, calling a library I/O function, or calling a function that does
>> any of those operations are all side effects, which are changes in the
>> state of the execution environment. Evaluation of an expression might
>> produce side effects. At certain specified points in the execution
>> sequence called sequence points, all side effects of previous
>> evaluations shall be complete and no side effects of subsequent
>> evaluations shall have taken place
>>
>> 1.9.11
>> The least requirements on a conforming implementation are:
>> - At sequence points, volatile objects are stable in the sense that
>> previous evaluations are complete and
>> subsequent evaluations have not yet occurred.
>>
>> That to me sounds like a complete enough requirement that compilers
>> don't perform optimizations that produce "surprising" results insofar
>> as observable behavior in an abstract (single-threaded) machine is
>> concerned. This requirement happens to be very useful for multi-threaded
>> programs that can augment volatile with hardware fences to produce
>> meaningful results.

> That is one interpretation. Unfortunately / fortunately (?), that
> interpretation is not the prevailing interpretation. Thus far in this
> thread, we have members of the C++ standards committee or its
> affiliates explicitly disagreeing on the committee's website with that
> interpretation (linked else-thread). The POSIX standard explicitly
> disagrees with your interpretation (see google). The
> comp.programming.threads FAQ explicitly disagrees with you several
> times (linked else-thread). We have gcc docs and implementation
> disagreeing with your interpretation (see google). We have an official
> blog from intel, the biggest maker of chips in the world, and a major
> compiler writer, explicitly disagreeing with your interpretation
> (linked else-thread). We have experts in the C++ community explicitly
> disagreeing with your interpretation.


All the sources that you listed were saying that volatile isn't
sufficient. And some went so far as to say that it's "mostly"
useless. That "mostly", however, covers an area that is real and I was
talking about that area. None of them disagreed with what I said.

Here's a brief example that I hope will put this issue to rest:


volatile int n;

n = 5;
n = 6;


volatile guarantees (note: no interpretation here, it's just what it
says) that the compiler will issue two store instructions in the correct
order (5 then 6). And that is a very useful quality for multi-threaded
programs that choose not to use synchronization primitives like mutexes
and such. Of course it doesn't mean that the processor executes them in
that order, that's why we'd use memory fences. But to stop the
compiler from messing around with these sequences, the volatile is
necessary.
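(A sketch of what I mean by the combination; the sfence is
x86-specific and stands in for whatever the platform requires, and the
point is only that the fence constrains the processor while volatile
constrains the compiler:)

    volatile int data = 0;        // both volatile: the compiler must emit
    volatile int dataReady = 0;   // the two stores in program order

    void publish()
    {
        data = 42;
        asm volatile ("sfence" ::: "memory");  // hardware store fence (x86);
                                               // a Sparc would use membar
        dataReady = 1;
    }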

> (Thanks Andrei, and his paper "C++ And The Perils Of Double
> Checked Locking".
>
> Andy, have you even read it?


Of course I have. It's no secret that I admire the works of both
authors. I have read a lot of other papers as well. Maged Michael (who
co-authored an article on lock-free algorithms with Andrei) and Tim
Harris in particular are my favorites. But it wasn't the point of the
discussion, was it?
It's a great article. Among other things, it talks about the
non-portability of a solution that relies solely on volatile. How is it
different from what I have said in my earlier post? Quoting:

"Is volatile sufficient - absolutely not.
Portable - hardly.
Necessary in certain conditions - absolutely."


<snip>


Thanks,
Andy.



From: George Neuner on
On Thu, 25 Mar 2010 00:20:43 CST, Andy Venikov
<swojchelowek(a)gmail.com> wrote:

>
>All the sources that [Joshua Maurice] listed were saying that volatile
>isn't sufficient. And some went on as far as to say that it's "mostly"
>useless. That "mostly", however, covers an area that is real and I was
>talking about that area. None of them disagreed with what I said.
>
>Here's a brief example that I hope will put this issue to rest:
>
>
>volatile int n;
>
>n = 5;
>n = 6;
>
>
>volatile guarantees (note: no interpretation here, it's just what it
>says) that the compiler will issue two store instructions in the correct
>order (5 then 6). And that is a very useful quality for multi-threaded
>programs that chose not to use synchronization primitives like mutexes
>and such. Of course it doesn't mean that the processor executes them in
>that order, that's why we'd use memory fences. But to stop the
>compiler from messing around with these sequences, the volatile is
>necessary.

Not exactly. 'volatile' is necessary to force the compiler to
actually emit store instructions, else optimization would elide the
useless first assignment and simply set n = 6. Beyond that, constant
propagation and/or value tracking might also eliminate the remaining
assignment and the variable altogether.
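(Roughly what happens without volatile:)

    int n;                 // not volatile

    void f()
    {
        n = 5;             // dead store: a typical optimizer removes it
        n = 6;             // only this store survives (or none at all,
    }                      // if n is provably never read)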

As you noted, 'volatile' does not guarantee that an OoO CPU will
execute the stores in program order ... for that you need to add a
write fence between them. However, neither 'volatile' nor a write
fence guarantees that any written value will be flushed all the way
to memory - depending on other factors (cache snooping by another
CPU/core, cache write-back policies and/or delays, the span to the
next use of the variable, etc.), the value may only reach some
level of cache before the variable is referenced again. The value may
never reach memory at all.

OoO execution and cache behavior are the reasons 'volatile' doesn't
work as intended for many systems even in single-threaded use with
memory-mapped peripherals. A shared (atomically writable)
communication channel in the case of interrupts or concurrent threads
is actually a safer, more predictable use of 'volatile' because, in
general, it does not require values to be written all the way to main
memory.
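(The kind of use I have in mind: a sketch of the classic single-writer
flag, where it doesn't matter how far down the memory hierarchy the
value gets, as long as both sides eventually see it:)

    #include <csignal>

    volatile std::sig_atomic_t stopRequested = 0;

    extern "C" void onSignal(int)     // hypothetical signal handler
    {
        stopRequested = 1;            // volatile: the store is really emitted
    }

    void mainLoop()
    {
        while (!stopRequested)        // volatile: re-read on every pass,
        {                             // never cached in a register
            // ... do the real work (omitted)
        }
    }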


>It's a great article. Among other things, it talks about the
>non-portability of a solution that relies solely on volatile. How is it
>different from what I have said in my earlier post? Quoting:
>
>"Is volatile sufficient - absolutely not.
>Portable - hardly.
>Necessary in certain conditions - absolutely."

I haven't seen the whole thread and I'm not sure of the post to which
you are referring. I think you might not be giving enough thought to
the way cache behavior can complicate the standard's simple memory
model. But it's possible that you have considered this and simply
have not explained yourself thoroughly enough for [me and others] to
see it.

'volatile' is necessary for certain uses but is not sufficient for
(al)most (all) uses. I would say that for expert uses, some are
portable and some are not. For non-expert uses ... I would say that
most uses contemplated by non-experts will be neither portable nor
sound.


> Andy.

George
