From: Chris Thomasson on
"Nick Maclaren" <nmm1(a)cus.cam.ac.uk> wrote in message
news:eh345j$1nh$1(a)gemini.csx.cam.ac.uk...
> In article <AKOdndCG_uFZkqjYnZ2dnUVZ8tSdnZ2d(a)pipex.net>,
> kenney(a)cix.compulink.co.uk writes:
> |>
> |> > Much of the load/store atomicity is unspecified, especially
> |> > with regard to multiple cores and data objects which need a
> |> > lower alignment than their size. That is a common source of
> |> > problems.
> |>
> |> Are you talking solely about factors that affect program
> |> optimisation? I ask this because most high level languages
> |> isolate the programmer from the architecture anyway. Come to that
> |> so do operating systems with the HAL. Presumably this is most
> |> important for compiler writers.
>
> No. And I am afraid that they don't. Let's take a specific example
> that has been discussed at length in the context of SC22WG21 (C++).
>
> A threaded program has a global object that is read frequently
> and updated (by simply storing into it) rarely. Under what
> circumstances is it permissible to put a lock around the update
> and leave the reads unlocked?

https://coolthreads.dev.java.net/servlets/ProjectForumMessageView?forumID=1797&messageID=11068

You need #LoadLoad w/ Data-Dependency Hint for readers, and mutex or
lock-free writer algorithm for the writers...

It's a general PDR pattern:

http://groups.google.com/group/comp.arch/msg/2a0f4163f8e13f1e
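
Roughly, the shape is something like this (a minimal, hypothetical C
sketch only; the MEMBAR_* macros are placeholders for whatever your
platform actually provides, e.g. membar #StoreStore / the dependent-load
barrier, and are not real library calls):

#include <pthread.h>
#include <stdlib.h>

typedef struct node { int value; } node_t;

/* Read frequently, written rarely. */
static node_t *volatile g_shared = NULL;
static pthread_mutex_t g_write_lock = PTHREAD_MUTEX_INITIALIZER;

/* Placeholder barriers -- on a real system these must expand to the
   architecture's instructions.  MEMBAR_PRODUCER() has to order the
   node's initialization before the pointer store (#StoreStore);
   MEMBAR_DEPENDS() is the data-dependent load barrier, a no-op on
   virtually everything except the Alpha. */
#define MEMBAR_PRODUCER() ((void)0)
#define MEMBAR_DEPENDS()  ((void)0)

void writer_update(int value)
{
    node_t *n = malloc(sizeof *n);
    if (n == NULL) return;
    n->value = value;

    pthread_mutex_lock(&g_write_lock);
    MEMBAR_PRODUCER();       /* publish the contents before the pointer */
    g_shared = n;            /* single aligned pointer store            */
    pthread_mutex_unlock(&g_write_lock);

    /* Safely reclaiming the old node is the PDR problem; omitted here. */
}

int reader_peek(void)
{
    node_t *n = g_shared;    /* single aligned pointer load              */
    MEMBAR_DEPENDS();        /* dependent-load ordering before the deref */
    return n != NULL ? n->value : 0;
}

The writers serialize against each other through the mutex and order
their stores before publishing; the readers never take a lock and rely
only on the dependency from the pointer load to the dereference.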


From: Chris Thomasson on
> You need #LoadLoad w/ Data-Dependency Hint for readers

It is my understanding that this barrier is implied on virtually every arch
out there, except the Alpha...


From: Nick Maclaren on

In article <0NWdnRtwppdCXqbYnZ2dnUVZ_qOdnZ2d(a)comcast.com>,
"Chris Thomasson" <cristom(a)comcast.net> writes:
|> > |>
|> > |> > Much of the load/store atomicity is unspecified, especially
|> > |> > with regard to multiple cores and data objects which need a
|> > |> > lower alignment than their size. That is a common source of
|> > |> > problems.
|> > |>
|> > |> Are you talking solely about factors that affect program
|> > |> optimisation? I ask this because most high level languages
|> > |> isolate the programmer from the architecture anyway. Come to that
|> > |> so do operating systems with the HAL. Presumably this is most
|> > |> important for compiler writers.
|> >
|> > No. And I am afraid that they don't. Let's take a specific example
|> > that has been discussed at length in the context of SC22WG21 (C++).
|> >
|> > A threaded program has a global object that is read frequently
|> > and updated (by simply storing into it) rarely. Under what
|> > circumstances is it permissible to put a lock around the update
|> > and leave the reads unlocked?
|>
|> https://coolthreads.dev.java.net/servlets/ProjectForumMessageView?forumID=1797&messageID=11068
|>
|> You need #LoadLoad w/ Data-Dependency Hint for readers, and mutex or
|> lock-free writer algorithm for the writers...

Subject to niggling about the details and which solutions are best (or
even good), yes, precisely.

|> It is my understanding that this barrier is implied on virtually every arch
|> out there, except the Alpha...

And, to my certain knowledge, it is not.

Every architecture implies some sort of a barrier, but the exact details
are often critical and hard to deduce from the documentation. It is
common for the question that the PROGRAM needs answering to be left
unspecified and the answer that the ARCHITECTURE gives to be inadequate
for the algorithm of interest.

But the posting to which I responded was talking about the languages,
and you don't need to get down to that level to hit problems. All
languages (and many hardware architectures) have basic objects that
need an alignment of X but are of size 2X or more. The language
standards define the semantics only for serial execution, and so
allocating such things at an alignment of X is fine, and that is what
the languages and libraries do.

In fact, that works even for threaded code run on a single CPU, but
it DOESN'T for threaded code run on multiple CPUs (including cores),
as the object may span a cache line. So, it's nobody's fault, because
all of the hardware, compiler and program are following the relevant
rules, but the resulting program doesn't work ....
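
To pin that down with numbers -- a hypothetical but typical layout, say
an 8-byte object that the rules only require to be 4-byte aligned
(X = 4, size = 2X), and 64-byte cache lines:

#include <stdio.h>
#include <stddef.h>

/* Offset 60 of a line-aligned region satisfies the 4-byte alignment
   rule, so serial code is perfectly happy; but bytes 0-3 of the object
   sit in one cache line and bytes 4-7 in the next, so a store by CPU A
   can become visible to CPU B one half at a time -- a torn value that
   neither of them ever wrote. */

static char region[128] __attribute__((aligned(64)));   /* GCC syntax */

int main(void)
{
    char *obj = region + 60;                 /* 4-byte aligned: legal  */
    ptrdiff_t off = obj - region;

    printf("8-byte object at offset %td spans cache lines %td and %td\n",
           off, off / 64, (off + 7) / 64);
    return 0;
}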


Regards,
Nick Maclaren.
From: Chris Thomasson on
"Nick Maclaren" <nmm1(a)cus.cam.ac.uk> wrote in message
news:ehi0rn$mta$1(a)gemini.csx.cam.ac.uk...
>
> In article <0NWdnRtwppdCXqbYnZ2dnUVZ_qOdnZ2d(a)comcast.com>,
> "Chris Thomasson" <cristom(a)comcast.net> writes:
> |> > |>
> |> > |> > Much of the load/store atomicity is unspecified

[...]

> |> > A threaded program has a global object that is read frequently
> |> > and updated (by simply storing into it) rarely. Under what
> |> > circumstances is it permissible to put a lock around the update
> |> > and leave the reads unlocked?
> |>
> |> https://coolthreads.dev.java.net/servlets/ProjectForumMessageView?forumID=1797&messageID=11068
> |>
> |> You need #LoadLoad w/ Data-Dependency Hint for readers, and mutex or
> |> lock-free writer algorithm for the writers...
>
> Subject to niggling about the details and which solutions are best (or
> even good), yes, precisely.

;^)




> |> It is my understanding that this barrier is implied on virtually every
> |> arch out there, except the Alpha...
>
> And, to my certain knowledge, it is not.

Well, at least I think I got a direct answer from a Sun Architect about the
implied barriers wrt lock-free reader patterns on the UltraSPARC T1; I made
sure to explicitly state #LoadLoad and/or Data-Dependency Hints in my
post...


Humm... Okay... Let's say the language is strictly Assembly, and we're
working with something that provides atomic stores/loads to/from properly
aligned pointer-sized variables... How many current/well-known architectures
do you think may be lacking an implied data-dependent load barrier for
atomic loads?

http://groups.google.com/group/comp.programming.threads/msg/8a55976d126906a8


If a particular architecture does not have the correct barrier for this kind
of stuff... Well, I bet the Linux kernel developers would not enjoy working
on that architecture very much at all; RCU performance would tank with
explicit #LoadLoad barriers...! :O

However, I believe that there is a possible solution which involves
amortizing the cost of the explicit #LoadLoad:

http://groups.google.com/group/comp.programming.threads/msg/7b427bceff6f75da
http://groups.google.com/group/comp.programming.threads/msg/e05c889c00bc0902

http://groups.google.com/group/comp.programming.threads/msg/e7b68f55d3c87152
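
One way to picture the amortization (my own sketch, nothing more):
gather a whole batch of pointer loads, issue a single explicit #LoadLoad
for the batch, and only then do the dereferences -- the one barrier
orders every load before it against every load after it, so it covers
all of the dereferences at once, even on an Alpha:

#include <stddef.h>

#define BATCH 64

/* Placeholder for the platform's #LoadLoad instruction; the point is
   that it executes once per BATCH dereferences rather than once per
   dereference. */
#define MEMBAR_LOADLOAD() ((void)0)

typedef struct node { int value; } node_t;

/* Hypothetical table of pointers published by the writers. */
extern node_t *volatile g_table[BATCH];

long reader_sum_batch(void)
{
    node_t *local[BATCH];
    long sum = 0;
    size_t i;

    for (i = 0; i < BATCH; i++)
        local[i] = g_table[i];       /* plain aligned pointer loads   */

    MEMBAR_LOADLOAD();               /* one barrier covers them all   */

    for (i = 0; i < BATCH; i++)
        if (local[i] != NULL)
            sum += local[i]->value;  /* dereference after the barrier */

    return sum;
}

Whether that buys anything in practice obviously depends on how stale
the readers are allowed to be.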


Any thoughts on this?




> Every architecture implies some sort of a barrier, but the exact details
> are often critical and hard to deduce from the documentation. It is
> common for the question that the PROGRAM needs answering to be left
> unspecified and the answer that the ARCHITECTURE gives to be inadequate
> for the algorithm of interest.

Murphy's Law Strikes Again!

;^)




> But the posting to which I responded was talking about the languages,
> and you don't need to get down to that level to hit problems. All
> languages (and many hardware architectures) have basic objects that
> need an alignment of X but are of size 2X or more. The language
> standards define the semantics only for serial execution, and so
> allocating such things at an alignment of X is fine, and that is what
> the languages and libraries do.
>
> In fact, that works even for threaded code run on a single CPU, but
> it DOESN'T for threaded code run on multiple CPUs (including cores),
> as the object may span a cache line. So, it's nobody's fault, because
> all of the hardware, compiler and program are following the relevant
> rules, but the resulting program doesn't work ....

Well, I try to stick to the following logic for C/Assembly Language:

http://groups.google.com/group/comp.programming.threads/msg/a8d7067bc1425ae1

http://groups.google.com/group/comp.programming.threads/msg/0afc1109d18c2991

http://groups.google.com/group/comp.programming.threads/msg/fd3b4b5a5cd7841e


My open-source libraries make use of it:

http://appcore.home.comcast.net/
http://appcore.home.comcast.net/vzoom/refcount/


Any advice, or a better technique, would be greatly appreciated! That said,
I have not had any real problems with the way I have been doing things so
far...

:O lol...


Thank You.


From: Nick Maclaren on

In article <taudnWiDALAjfKHYnZ2dnUVZ_vednZ2d(a)comcast.com>,
"Chris Thomasson" <cristom(a)comcast.net> writes:
|>
|> Well, at least I think I got a direct answer from a Sun Architect about the
|> implied barriers wrt lock-free reader patterns on the UltraSPARC T1; I made
|> sure to explicitly state #LoadLoad and/or Data-Dependency Hints in my
|> post...

I suspect that the question you asked and the one he answered were not
quite the same ....

But I can easily believe that there is such a barrier at the hardware
level for that CPU. Let's assume that, until and unless we see evidence
to the contrary.

|> Humm... Okay... Let's say the language is strictly Assembly, and we're
|> working with something that provides atomic stores/loads to/from properly
|> aligned pointer-sized variables... How many current/well-known architectures
|> do you think may be lacking an implied data-dependent load barrier for
|> atomic loads?
|>
|> http://groups.google.com/group/comp.programming.threads/msg/8a55976d126906a8

Define 'properly' - which comes back to the language question. In the
sense that you and Joe Seigh mean it, and at the STRICT hardware level,
quite possibly none.

But I have seen two stores by CPU A 'reverse' as seen by CPU B on many
architectures, including when I don't think that it was allowed. What
I suspect is that most CPU/boards have some CPU configuration state to
allow or block store/load reordering, and not all systems were setting
that correctly. But it could also have been associated with interrupt
handling (including TLB misses, ECC and others), and I am only 99% sure
that it was a genuine memory reversal, anyway.

Unfortunately, for obvious reasons, it is the worst sort of problem
imaginable to tie down precisely :-(
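
For concreteness, the shape of the test is the usual two-location one;
a minimal sketch (it has to be run in a tight loop an enormous number
of times before anything interesting shows up):

#include <pthread.h>
#include <stdio.h>

/* CPU A stores data then flag; CPU B loads flag then data.  If B ever
   sees flag == 1 but data == 0, the two stores (or the two loads) have
   effectively been reordered.  Plain volatile accesses are deliberate:
   the point is to expose what the hardware does, not to write correct
   portable code. */

static volatile int data_word = 0;
static volatile int flag_word = 0;

static void *cpu_a(void *arg)
{
    (void)arg;
    data_word = 1;                   /* first store  */
    flag_word = 1;                   /* second store */
    return NULL;
}

static void *cpu_b(void *arg)
{
    int f, d;
    (void)arg;
    f = flag_word;                   /* first load   */
    d = data_word;                   /* second load  */
    if (f == 1 && d == 0)
        puts("reordering observed");
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, cpu_a, NULL);
    pthread_create(&b, NULL, cpu_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}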

|> If a particular architecture does not have the correct barrier for this kind
|> of stuff... Well, I bet the Linux kernel developers would not enjoy working
|> on that architecture very much at all; RCU performance would tank with
|> explicit #LoadLoad barriers...! :O

Quite. But, if the cost of not doing that were one non-repeatable
failure every few months, who is going to notice? I was playing with
SMP applications code, and so was hammering this aspect, but even then
I rarely saw an anomaly that could plausibly be assigned to such a
cause.

|> However, I believe that there is a possible solution which involves
|> amortizing the cost of the explicit #LoadLoad:
|>
|> Any thoughts on this?

SOP for more decades than I care to recall.

|> Well, I try to stick to the following logic for C/Assembly Language:

Grin :-)

Yes, I agree. My position is that I usually write portable code (over
time AND space) and so can rely on only high-level guarantees - which,
in my case, I have not got[*].

[*] With apologies to Henry Reed ("The Naming of Parts").


Regards,
Nick Maclaren.