From: Andy Venikov on
James Kanze wrote:
> On Mar 25, 7:10 pm, George Neuner <gneun...(a)comcast.net> wrote:
>> On Thu, 25 Mar 2010 00:20:43 CST, Andy Venikov
>
> [...]
>> As you noted, 'volatile' does not guarantee that an OoO CPU will
>> execute the stores in program order ...
>
> Arguably, the original intent was that it should. But it
> doesn't, and of course, the ordering guarantee only applies to
> variables actually declared volatile.
>
>> for that you need to add a write fence between them. However,
>> neither 'volatile' nor write fence guarantees that any written
>> value will be flushed all the way to memory - depending on
>> other factors - cache snooping by another CPU/core, cache
>> write back policies and/or delays, the span to the next use of
>> the variable, etc. - the value may only reach to some level of
>> cache before the variable is referenced again. The value may
>> never reach memory at all.
>
> If that's the case, then the fence instruction is seriously
> broken. The whole purpose of a fence instruction is to
> guarantee that another CPU (with another thread) can see the
> changes. (Of course, the other thread also needs a fence.)

Hmm, the way I understand fences is that they introduce ordering but
don't necessarily guarantee visibility. For example:

1. Store to location 1
2. StoreStore fence
3. Store to location 2

will guarantee only that if the store to location 2 is visible to some
thread, then the store to location 1 is guaranteed to be visible to
that thread as well. But it doesn't necessarily guarantee that the
stores will ever become visible to any other thread. Yes, on certain
CPUs fences are implemented as "flushes", but they don't need to be.
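
To make that concrete, here is a sketch of how a StoreStore fence
might be spelled on two architectures (GCC-style inline asm; this is
illustrative, not a portable header):

#if defined(__sparc__)
/* SPARC has a dedicated StoreStore barrier instruction. */
#define STORE_STORE_FENCE() \
    __asm__ __volatile__ ("membar #StoreStore" ::: "memory")
#elif defined(__i386__) || defined(__x86_64__)
/* x86 already keeps stores ordered relative to other stores, so only
   the compiler needs restraining here. */
#define STORE_STORE_FENCE() \
    __asm__ __volatile__ ("" ::: "memory")
#endif

Note that neither version "flushes" anything; both merely constrain
the order in which the two stores can become visible.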


Thanks,
Andy.

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Joshua Maurice on
On Mar 26, 4:05 am, Andy Venikov <swojchelo...(a)gmail.com> wrote:
> James Kanze wrote:
> > On Mar 25, 7:10 pm, George Neuner <gneun...(a)comcast.net> wrote:
> >> On Thu, 25 Mar 2010 00:20:43 CST, Andy Venikov
>
> > [...]
> >> As you noted, 'volatile' does not guarantee that an OoO CPU will
> >> execute the stores in program order ...
>
> > Arguably, the original intent was that it should. But it
> > doesn't, and of course, the ordering guarantee only applies to
> > variables actually declared volatile.
>
> >> for that you need to add a write fence between them. However,
> >> neither 'volatile' nor write fence guarantees that any written
> >> value will be flushed all the way to memory - depending on
> >> other factors - cache snooping by another CPU/core, cache
> >> write back policies and/or delays, the span to the next use of
> >> the variable, etc. - the value may only reach to some level of
> >> cache before the variable is referenced again. The value may
> >> never reach memory at all.
>
> > If that's the case, then the fence instruction is seriously
> > broken. The whole purpose of a fence instruction is to
> > guarantee that another CPU (with another thread) can see the
> > changes. (Of course, the other thread also needs a fence.)
>
> Hmm, the way I understand fences is that they introduce ordering but
> don't necessarily guarantee visibility. For example:
>
> 1. Store to location 1
> 2. StoreStore fence
> 3. Store to location 2
>
> will guarantee only that if the store to location 2 is visible to some
> thread, then the store to location 1 is guaranteed to be visible to
> that thread as well. But it doesn't necessarily guarantee that the
> stores will ever become visible to any other thread. Yes, on certain
> CPUs fences are implemented as "flushes", but they don't need to be.

Well yes. Volatile does not change that though. Most of my
understanding comes from
http://www.mjmwired.net/kernel/Documentation/memory-barriers.txt
and
The JSR-133 Cookbook for Compiler Writers
http://g.oswego.edu/dl/jmm/cookbook.html
(Note that the discussion of volatile in the above link is for Java
volatile 1.5+, not C and C++ volatile.)

I'm not the most versed on this, so please correct me if I'm wrong. As
an example:

main thread:
    a = 0
    b = 0
    start thread 2
    a = 1
    write barrier
    b = 2

thread 2:
    print b
    read barrier
    print a

Without the read and write memory barriers, this will print any of the
4 possible combinations:
0 0, 2 0, 0 1, 2 1

With the barriers, one combination (2 0) becomes impossible, leaving:
0 0, 0 1, 2 1
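
To make the example concrete, here is roughly the same thing as a
complete C++ program, using C++0x-style atomics and fences in place of
the barrier pseudo-ops (the <atomic> and <thread> headers are assumed
to be available, which few compilers provide today):

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> a(0);
std::atomic<int> b(0);

void thread2()
{
    int rb = b.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire); // read barrier
    int ra = a.load(std::memory_order_relaxed);
    std::printf("%d %d\n", rb, ra); // "2 0" should never appear
}

int main()
{
    std::thread t(thread2);
    a.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release); // write barrier
    b.store(2, std::memory_order_relaxed);
    t.join();
}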

As I understand "read" and "write" barriers (which are a subset of
"store/store, store/load, load/store, load/load", the semantics are:
"If a read before the read barrier sees a write after the write
barrier, then all reads after the read barrier will see all writes
before the write barrier." Yes, the semantics are conditional. It does
not guarantee that a write will ever become visible. However, volatile
will not change that. If thread 2 prints b == 2, then thread 2 will
print a == 1, volatile or no volatile. If thread 2 prints b == 0, then
thread 2 can print a == 0 or a == 1, volatile or no volatile. For some
lock free algorithms, these guarantees are very useful, such as making
double checked locking correct. Ex:

singleton_t * get_singleton()
{
    //all static storage is zero initialized before runtime
    static singleton_t * p;

    if (0 != p) //check 1, outside the lock
    {
        READ_BARRIER();
        return p;
    }
    Lock lock; //scoped lock: taken here, released at end of function
    if (0 != p) //check 2, under the lock
        return p;
    singleton_t * tmp = new singleton_t;
    WRITE_BARRIER();
    p = tmp; //publish only after the constructor's writes are fenced
    return p;
}

If a thread reads p != 0 at check 1, which is before the read barrier,
then it has seen the write "p = tmp", which is after the write barrier,
and it is thus guaranteed that all subsequent reads after the read
barrier (in the caller code) will see all writes before the write
barrier (from the singleton_t constructor). This conditional
visibility is exactly what we need here; it's what DCLP really wants.
If the read at
check 1 gives us 0, then we do have to use a mutex to force
visibility, but most of the time it will read p as nonzero at check 1,
and the barriers will guarantee correct semantics. Also, from what I
remember, the read barrier is quite cheap on most systems, possibly
free on the x86 (?). (See the JSR-133 Cookbook linked above.) I don't
grasp the nuances well enough yet to say anything more concrete at
this time.

Again, I'm coding this up from memory, so please correct if any
mistakes.
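
For what it's worth, here is a sketch of how the same DCLP might look
with C++0x atomics once they are widely available (std::mutex standing
in for the Lock above; singleton_t left as a stub):

#include <atomic>
#include <mutex>

struct singleton_t { /* ... */ };

std::atomic<singleton_t*> p(0);
std::mutex m;

singleton_t * get_singleton()
{
    singleton_t * tmp = p.load(std::memory_order_acquire); //check 1 + read barrier
    if (0 == tmp)
    {
        std::lock_guard<std::mutex> lock(m);
        tmp = p.load(std::memory_order_relaxed); //check 2, under the lock
        if (0 == tmp)
        {
            tmp = new singleton_t;
            p.store(tmp, std::memory_order_release); //write barrier + publish
        }
    }
    return tmp;
}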



From: George Neuner on
On Thu, 25 Mar 2010 17:31:25 CST, James Kanze <james.kanze(a)gmail.com>
wrote:

>On Mar 25, 7:10 pm, George Neuner <gneun...(a)comcast.net> wrote:
>> On Thu, 25 Mar 2010 00:20:43 CST, Andy Venikov
>
> [...]
>> As you noted, 'volatile' does not guarantee that an OoO CPU will
>> execute the stores in program order ...
>
>Arguably, the original intent was that it should. But it
>doesn't, and of course, the ordering guarantee only applies to
>variables actually declared volatile.

"volatile" is quite old ... I'm pretty sure the "intent" was defined
before there were OoO CPUs (in de facto use if not in standard
document). Regardless, "volatile" only constrains the behavior of the
*compiler*.


>> for that you need to add a write fence between them. However,
>> neither 'volatile' nor write fence guarantees that any written
>> value will be flushed all the way to memory - depending on
>> other factors - cache snooping by another CPU/core, cache
>> write back policies and/or delays, the span to the next use of
>> the variable, etc. - the value may only reach to some level of
>> cache before the variable is referenced again. The value may
>> never reach memory at all.
>
>If that's the case, then the fence instruction is seriously
>broken. The whole purpose of a fence instruction is to
>guarantee that another CPU (with another thread) can see the
>changes.

The purpose of the fence is to sequence memory accesses. All the
fence does is create a checkpoint in the instruction sequence at which
relevant load or store instructions dispatched prior to dispatch of
the fence instruction will have completed execution. There may be
separate load and store fence instructions and/or they may be combined
in a so-called "full fence" instruction.
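
On x86, for example, all three flavors exist and are exposed as
compiler intrinsics. A sketch, assuming an SSE2-capable compiler:

#include <emmintrin.h>  /* _mm_lfence, _mm_mfence */
#include <xmmintrin.h>  /* _mm_sfence */

void fence_examples(void)
{
    _mm_sfence(); /* store fence: prior stores complete before later ones */
    _mm_lfence(); /* load fence: prior loads complete before later ones */
    _mm_mfence(); /* full fence: orders all prior loads and stores */
}

What each one costs, and whether your compiler provides the
intrinsics at all, is implementation-specific.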

However, in a memory hierarchy with caching, a store instruction does
not guarantee a write to memory but only that one or more write cycles
is executed on the core's memory connection bus. Where that write
goes is up to the cache/memory controller and the policies of the
particular cache levels involved. For example, many CPUs have
write-thru primary caches while higher levels are write-back with
delay (an arrangement that allows snooping of either the primary or
secondary cache with identical results).

For another thread (or core or CPU) to perceive a change, a value must
be propagated into shared memory. For all multi-core processors I am
aware of, the first shared level of memory is cache - not main memory.
Cores on the same die snoop each other's primary caches and share
higher level caches. Cores on separate dies in the same package share
cache at the secondary or tertiary level.

The same holds true for all separate-CPU shared memory multiprocessors
I am aware of ... they are connected so that they can snoop each
other's caches at some level, or an additional level of shared cache
is placed between the CPUs and memory, or both.


>>(Of course, the other thread also needs a fence.)

Not necessarily.


>> OoO execution and cache behavior are the reasons 'volatile'
>> doesn't work as intended for many systems even in
>> single-threaded use with memory-mapped peripherals.
>
>The reason volatile doesn't work with memory-mapped peripherals
>is because the compilers don't issue the necessary fence or
>membar instruction, even if a variable is volatile.

It still wouldn't matter if they did. Let's take a simple case of one
thread and two memory mapped registers:

volatile unsigned *regA = 0x...;
volatile unsigned *regB = 0x...;
unsigned oldval, retval;

*regA = SOME_OP;
*regA = SOME_OP;

oldval = *regB;
do {
    retval = *regB;
} while ( retval == oldval );

Let's suppose that writing a value twice to regA initiates some
operation that returns a value in regB. Will the above code work?

No. The processor will execute both writes, but the cache will
combine them so the device will see only a single write. The cache
needs to be flushed between writes to regA.

Ok, let's assume there is a flush API and add some flushes:

*regA = SOME_OP;
FLUSH *regA;
*regA = SOME_OP;
FLUSH *regA;

oldval = *regB;
do {
    retval = *regB;
} while ( retval == oldval );

Does this now work?

Maybe. It will work if the flush operation includes a fence;
otherwise you can't know whether the write has occurred before the
cache line is flushed.

Ok, let's assume there is a fence API and add fences:

*regA = SOME_OP;
SFENCE;
FLUSH *regA;
*regA = SOME_OP;
SFENCE;
FLUSH *regA;

oldval = *regB;
do {
    retval = *regB;
} while ( retval == oldval );

Does this now work?

Yes. Now I am guaranteed that the first value will be written all the
way to memory (and to my device) before the second value is written.


Now the question is whether a cache flush includes a fence operation
(or vice versa). The answer is "it depends". On many architectures,
the ISA has no cache control instructions - the cache controller is
mapped to reserved memory addresses or I/O ports. Some cache
controllers permit only programming replacement policy and do not
allow programs to manipulate the entries. Some controllers flush
everything rather than allowing individual lines to be flushed. It
depends.

If there is a language level API for cache control or for fencing, it
may or may not include the other operation depending on the whim of
the developer.
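
To give one concrete instance: on x86 a single line can be written
back with the CLFLUSH instruction, exposed as the _mm_clflush
intrinsic. A sketch - and note that it says nothing about fencing,
which is exactly the kind of detail that varies:

#include <emmintrin.h>

void flush_line(volatile void *p)
{
    /* write back and invalidate the cache line containing p */
    _mm_clflush(const_cast<void *>(p));
}

On an architecture without such an instruction you would be poking
the cache controller through its own registers instead.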


The upshot is this:
- "volatile" is required for any CPU.
- fences are required for an OoO CPU.
- cache control is required for a write-back cache between
  CPU and main memory.


>James Kanze

George


From: James Kanze on
On Mar 26, 12:33 am, Herb Sutter <herb.sut...(a)gmail.com> wrote:
> Please remember this: Standard ISO C/C++ volatile is useless
> for multithreaded programming. No argument otherwise holds
> water; at best the code may appear to work on some
> compilers/platforms, including all attempted counterexamples
> I've seen on this thread.

I agree with you in principle, but do be careful as to how you
formulate this. Standard ISO C/C++ is useless for multithreaded
programming, at least today. With or without volatile. And in
Standard ISO C/C++, volatile is useless for just about anything;
it was always intended to be mainly a hook for implementation-
defined behavior, i.e. to allow things like memory-mapped IO
while not imposing an excessive loss of optimization possibilities
everywhere.

In theory, an implementation could define volatile in a way that
would make it useful in multithreading---I think Microsoft once
proposed doing so in the standard. In my opinion, this sort of
violates the original intention behind volatile, which was that
volatile is applied to a single object, and doesn't affect other
objects in the code. But it's certainly something you could
argue both ways.

[...]
> No. The reason that can't use volatiles for synchronization is that
> they aren't synchronized (QED).

:-). And the reason they're not synchronized is that
synchronization involves more than one variable, and that it was
never the intent of volatile to involve more than one variable.
(On a lot of modern processors, however, it would be impossible
to fully implement the original intent of volatile without
synchronization. The only instruction available on a Sparc,
for example, to ensure that a store instruction actually results
in a write to an external device is a membar. And that
synchronizes *all* accesses of the given type.)

[...]
> (and it was a mistake to try to add those
> guarantees to volatile in VC++).

Just curious: is that Microsoft talking, or Herb Sutter (or
both)?

--
James Kanze


From: James Kanze on
On Mar 26, 12:05 pm, Andy Venikov <swojchelo...(a)gmail.com> wrote:
> James Kanze wrote:
>> If that's the case, then the fence instruction is seriously
>> broken. The whole purpose of a fence instruction is to
>> guarantee that another CPU (with another thread) can see the
>> changes. (Of course, the other thread also needs a fence.)

> Hmm, the way I understand fences is that they introduce
> ordering but don't necessarily guarantee visibility. For
> example:

> 1. Store to location 1
> 2. StoreStore fence
> 3. Store to location 2

> will guarantee only that if the store to location 2 is visible
> to some thread, then the store to location 1 is guaranteed to
> be visible to that thread as well.

A StoreStore fence guarantees that all stores issued before the
fence are visible in main memory, and that none issued after the
fence are visible (at the time the StoreStore fence is
executed).

Of course, for another thread to be guaranteed to see the results
of any store, it has to use a load fence, to ensure that the
values it sees are those after the load fence, and not some
value that it happened to pick up earlier.

> But it doesn't necessarily guarantee that the stores will ever
> become visible to any other thread. Yes, on certain CPUs fences
> are implemented as "flushes", but they don't need to be.

If you redefine fence to mean something different than it
normally means, then who knows. The normal definition requires
all writes to have propagated to main memory (supposing it is a
store fence) before the instruction proceeds. This is why they
can be so slow. (And all of the processors I know guaranteed
coherence within a single core; you never need a fence if you're
single threaded.)

--
James Kanze
