From: Del Cecchi on
Nick Maclaren wrote:
> In article <taudnWiDALAjfKHYnZ2dnUVZ_vednZ2d(a)comcast.com>,
> "Chris Thomasson" <cristom(a)comcast.net> writes:
> |>
> |> Well, at least I think I got a direct answer from a Sun Architect about the
> |> implied barriers wrt lock-free reader patterns on the UltraSPARC T1; I made
> |> sure to explicitly state #LoadLoad and/or Data-Dependency Hints in my
> |> post...
>
> I suspect that the question you asked and the one he answered were not
> quite the same ....
>
> But I can easily believe that there is such a barrier at the hardware
> level for that CPU. Let's assume that, until and unless we see evidence
> to the contrary.
>
> |> Humm... Okay... Let's say the language is strictly Assembly, and were
> |> working with something that provides atomic stores/loads to/from properly
> |> aligned pointer sized variables... How many current/well-known architectures
> |> do you think may be lacking an implied data-dependant load barrier for
> |> atomic loads?
> |>
> |> http://groups.google.com/group/comp.programming.threads/msg/8a55976d126906a8
>
> Define 'properly' - which comes back to the language question. In the
> sense that you and Joe Seigh mean it, and at the STRICT hardware level,
> quite possibly none.
>
> But I have seen two stores by CPU A 'reverse' as seen by CPU B on many
> architectures, including when I don't think that it was allowed. What
> I suspect is that most CPU/boards have some CPU configuration state to
> allow or block store/load reordering, and not all systems were setting
> that correctly. But it could also have been associated with interrupt
> handling (including TLB miss, ECC and other), and I am only 99% sure
> that it was a genuine memory reversal, anyway.
>
> Unfortunately, for obvious reasons, it is the worst sort of problem
> imaginable to tie down precisely :-(
>
> |> If a particular architecture does not have the correct barrier for this kind
> |> of stuff... Well, I bet the Linux Kernel developers would not enjoy working
> |> that architecture very much at all; RCU performance would tank with explicit
> |> #LoadLoad barriers...! :O
>
> Quite. But, if the cost of not doing that were one non-repeatable
> failure every few months, who is going to notice? I was playing with
> SMP applications code, and so was hammering this aspect, but even then
> I rarely saw an anomaly that could plausibly be assigned to such a
> cause.
>
> |> However, I believe that there is a possible solution which involves
> |> amortizing the cost of the explicit #LoadLoad:
> |>
> |> Any thoughts on this?
>
> SOP for more decades than I care to recall.
>
> |> Well, I try to stick to the following logic for C/Assembly Language:
>
> Grin :-)
>
> Yes, I agree. My position is that I usually write portable code (over
> time AND space) and so can rely on only high-level guarantees - which,
> in my case, I have not got[*].
>
> [*] With apologies to Henry Reed ("The Naming of Parts").
>
>
> Regards,
> Nick Maclaren.

If I interpret this article http://www.linuxjournal.com/article/8211
correctly, expecting those stores to not be reordered as seen from
another cpu is unrealistic unless steps are taken in the software to
make it so. It is realistic to expect them to occur in order as seen
from the cpu where they originate.



--
Del Cecchi
"This post is my own and doesn't necessarily represent IBM's positions,
strategies or opinions."
From: Chris Thomasson on
"Nick Maclaren" <nmm1(a)cus.cam.ac.uk> wrote in message
news:ehitag$mss$1(a)gemini.csx.cam.ac.uk...
> In article <taudnWiDALAjfKHYnZ2dnUVZ_vednZ2d(a)comcast.com>,
> "Chris Thomasson" <cristom(a)comcast.net> writes:
> |> Well, at least I think I got a direct answer from a Sun Architect about
> the
> |> implied barriers wrt lock-free reader patterns on the UltraSPARC T1; I
> made
> |> sure to explicitly state #LoadLoad and/or Data-Dependency Hints in my
> |> post...
>
> I suspect that the question you asked and the one he answered were not
> quite the same ....

Humm... I hope it was! Well, I was very nervous and paranoid, and came up
with a half-assed test to try to see if I could wreak a little havoc;
simple pseudo-code at the end of the post...




> But I can easily believe that there is such a barrier at the hardware
> level for that CPU. Let's assume that, until and unless we see evidence
> to the contrary.

So far, so good... However, IMHO, it seems that this kind of stuff could be
fairly easily documented in an explicit fashion... Humm...




> |> Humm... Okay... Let's say the language is strictly Assembly, and were
> |> working with something that provides atomic stores/loads to/from
> properly
> |> aligned pointer sized variables... How many current/well-known
> architectures
> |> do you think may be lacking an implied data-dependant load barrier for
> |> atomic loads?
> |>
> |>
> http://groups.google.com/group/comp.programming.threads/msg/8a55976d126906a8
>
> Define 'properly'

Well, okay... Let's say I read the arch docs for NewCPUFoo and they happen
to explicitly and clearly state that if you want loads/stores to be atomic,
you simply have to ensure that the variable you are loading from or storing
to is exactly the size of a system pointer, and is aligned on a boundary
that is a multiple of that size. For example, with a 32-bit pointer the
variable would have to be exactly 32 bits wide and sit all by itself on a
32-bit-aligned boundary...




> - which comes back to the language question. In the
> sense that you and Joe Seigh mean it, and at the STRICT hardware level,
> quite possibly none.
>
> But I have seen two stores by CPU A 'reverse' as seen by CPU B on many
> architectures, including when I don't think that it was allowed. What
> I suspect is that most CPU/boards have some CPU configuration state to
> allow or block store/load reordering, and not all systems were setting
> that correctly. But it could also have been associated with interrupt
> handling (including TLB miss, ECC and other), and I am only 99% sure
> that it was a genuine memory reversal, anyway.
>
> Unfortunately, for obvious reasons, it is the worst sort of problem
> imaginable to tie down precisely :-(

Indeed!




> |> If a particular architecture does not have the correct barrier for this
> kind
> |> of stuff... Well, I bet the Linux Kernel developers would not enjoy
> working
> |> that architecture very much at all; RCU performance would tank with
> explicit
> |> #LoadLoad barriers...! :O
>
> Quite. But, if the cost of not doing that were one non-repeatable
> failure every few months, who is going to notice?

lol! :^)




> I was playing with
> SMP applications code, and so was hammering this aspect, but even then
> I rarely saw an anomaly that could plausibly be assigned to such a
> cause.
>
> |> However, I believe that there is a possible solution which involves
> |> amortizing the cost of the explicit #LoadLoad:
> |>
> |> Any thoughts on this?
>
> SOP for more decades than I care to recall.

Thought so...




> |> Well, I try to stick to the following logic for C/Assembly Language:
>
> Grin :-)
>
> Yes, I agree. My position is that I usually write portable code (over
> time AND space) and so can rely on only high-level guarantees - which,
> in my case, I have not got[*].

Here is a little pseudo-code that should eventually fail if dependent load
ordering is not supported by the architecture:




#define DEPTH() 999999


typedef struct test_s test_t;


struct test_s {
    void *ptrs[DEPTH()]; /* init all to 0 */
};


static test_t *p_store = /* malloc pad and align on proper boundary */
static test_t *p_load = /* align on proper boundary, and set to 0 */


void single_writer_thread() {
    for (;;) {
        if (! ATOMIC_LOAD(&p_load)) {
            membar #LoadStore; /* order the p_load check before the refill stores */

            int i;

            for (i = 0; i < DEPTH(); ++i) {
                p_store->ptrs[i] = 0;
                p_store->ptrs[i] = (void*)(i + 1);
            }

            membar #LoadStore | #StoreStore;
            ATOMIC_STORE(&p_load, p_store);

        } else { sched_yield(); }
    }
}


void single_reader_thread() {
    for (;;) {
        test_t *local = ATOMIC_LOAD(&p_load);
        /* membar #LoadLoad w/ Data-Dependency Hint Hopefully Implied! */

        if (local) {
            int i;

            for (i = 0; i < DEPTH(); ++i) {
                void *val = local->ptrs[i];
                assert(val && val == (void*)(i + 1));
            }

            membar #LoadStore; /* finish the reads before handing the buffer back */
            ATOMIC_STORE(&p_load, 0);
        }

        else { sched_yield(); }
    }
}




Humm...


From: ranjit_mathews@yahoo.com on

Nick Maclaren wrote:

> In fact, that works even for threaded code run on a single CPU, but
> it DOESN'T for threaded code run on multiple CPUs (including cores),
> as the object may span a cache line. So, it's nobody's fault, because
> all of the hardware, compiler and program are following the relevant
> rules, but the resulting program doesn't work ....

www.cs.utk.edu/~rich/publications/power4.jour.ps

From: Chris Thomasson on
"Del Cecchi" <cecchinospam(a)us.ibm.com> wrote in message
news:4q4ledFlff5uU1(a)individual.net...
> Nick Maclaren wrote:
>> In article <taudnWiDALAjfKHYnZ2dnUVZ_vednZ2d(a)comcast.com>,
>> "Chris Thomasson" <cristom(a)comcast.net> writes:

[...]

>>
>> Regards,
>> Nick Maclaren.
>
> If I interpret this article http://www.linuxjournal.com/article/8211
> correctly, expecting those stores to not be reordered as seen from another
> cpu is unrealistic unless steps are taken in the software to make it so.
> It is realistic to expect them to occur in order as seen from the cpu
> where they originate.

The origin CPU must execute a #LoadStore | #StoreStore style membar (i.e.,
store-release semantics) before it makes a pointer visible to another CPU.
The other CPU must execute a #LoadLoad-with-data-dependency-hint style
membar after it loads the pointer. I believe that the latter membar is
implied on every arch except Alpha...

For instance, I think that current x86 and UltraSPARC T1 both use this
barrier for every atomic load; every explicit membar except #StoreLoad is a
nop...:

http://groups.google.com/group/comp.programming.threads/msg/b0ab2c4405d1c2c6

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/6715c3e5a73c4016


Atomic loads on x86 and UltraSPARC T1 imply a #LoadStore | #LoadLoad style
membar (i.e., load-acquire semantics), so that works well with Linux and
RCU; ala read_barrier_depends()...


Humm...



From: Nick Maclaren on

In article <4q4ledFlff5uU1(a)individual.net>,
Del Cecchi <cecchinospam(a)us.ibm.com> writes:
|>
|> If I interpret this article http://www.linuxjournal.com/article/8211
|> correctly, expecting those stores to not be reordered as seen from
|> another cpu is unrealistic unless steps are taken in the software to
|> make it so. It is realistic to expect them to occur in order as seen
|> from the cpu where they originate.

Thanks for finding that; at a quick glance, I agree, and it justifies
a more thorough perusal. The performance issue is why I have always
been somewhat suspicious of salesmen and others who claim complete
sequential consistency on the memory of a modern SMP system. I can't
think of how it can be done, efficiently, even in theory ....


Regards,
Nick Maclaren.