From: Chris Thomasson on
"Eric P." <eric_pattison(a)sympaticoREMOVE.ca> wrote in message
news:453eeacf$0$1353$834e42db(a)reader.greatnowhere.com...
> Chris Thomasson wrote:
>> "Eric P." <eric_pattison(a)sympaticoREMOVE.ca> wrote in message
>> news:453ec796$0$1355$834e42db(a)reader.greatnowhere.com...
>> > Chris Thomasson wrote:
>> >> "Eric P." <eric_pattison(a)sympaticoREMOVE.ca> wrote in message
>> >> news:453e3ee4$0$1351$834e42db(a)reader.greatnowhere.com...
>> >> > Del Cecchi wrote:
>> >> > 2) It states that the x86 allows "Loads Reordered After Stores".
>> >> > He does not define this non standard terminology, but if it means
>> >> > what it sounds like then he is claiming that the x86 allows
>> >> > a later store to bypass outstanding earlier loads.
>> >> > That is wrong.
>> >>
>> >> On current x86 loads and stores are as follows:
>> >>
>> >> void* load(void **p) {
>> >>     void *v = *p;
>> >>     membar #LoadStore | #LoadLoad;
>> >>     return v;
>> >> }

[...]


> According to your load definition, there is a membar between the
> two loads which *appears* to prevent the load &q from bypassing
> the load &p. If that is in fact what your terminology means
> then it directly contradicts the manual which states that
> "reads can be performed in any order". (7.2.2, item #1).

Okay... Well, I was under the impression that atomic loads on current x86
have load-acquire membar semantics... I know that you need an explicit
membar to handle StoreLoad dependencies... However, I thought that LoadStore
dependencies were honored by atomic loads on current x86...


From: Eric P. on
Alexander Terekhov wrote:
>
> "Eric P." wrote:
> [...]
> > If by "remote write atomicity" you mean atomic global visibility
> > (all processors agree that each memory location has a single same
> > value), we discussed that here and it was determined (based on
> > 'knowledgeable sources') that x86 does have atomic global visibility.
>
> Really? IIRC, Glew went on record*** claiming that it is not true.
>
> See also
>
> http://www.decadentplace.org.uk/pipermail/cpp-threads/2006-September/001141.html
>
> ***) "WB memory is processor consistent, type II."
>
> With "type II" he meant "Extension to Dubois' Abstraction", I gather.
>
> regards,
> alexander.

(I don't know what "type II" and "Extension to Dubois' Abstraction"
mean. I can't find reference to either in Gharachorloo.)

Hmmm.... I thought it was resolved.

Joe Seigh said on x86 memory model:
http://groups.google.ca/group/comp.arch/msg/6af78be87ca29f31?hl=en&

"It turns out the x86 memory model is defined, it's just not defined
in the IA-32 manuals which is where you would expect it to be defined.
It's defined in the Itanium manuals and is equivalent to Sparc TSO
memory model."

At a moral level, with all due respect to Gharachorloo, if a cache
protocol allows processors, other than the most recent writer, to see
different values for the same memory location, then I don't think
anyone would consider that anything but broken, no matter what
kind of consistency label was attached.

I remember your comment that if it were PC, then your
AtomicCmpXchg(&x, 42, val) trick would be required in order to
guarantee reading the most recent value.
Obviously that would be a silly thing to require programmers to do,
so I really can't see anyone designing a cache that requires it.

I also just came across this doc when searching for 'global visibility'

Fast and Generalized Polynomial Time Memory Consistency Verification
Amitabha Roy, Stephan Zeisset, Charles J. Fleckenstein, John C. Huang
Intel Corporation
http://arxiv.org/pdf/cs.AR/0605039.pdf

makes multiple references to TSO and the following statements:

"The algorithm we have developed is currently implemented in Intel's
in house random test generator and is used by both the IA-32 and
Itanium verification teams."

"A load is considered performed (or executed) if no subsequent store
to that location (on any processor) can change the load return value.
A store is considered performed (or executed) if any subsequent
load to that location (on any processor) returns its value."

"Axiom 2 (Value Coherence)
The value returned by a read is from either the most recent store
in program order or the most recent store in global order."

These are TSO rules, not PC rules.
It seems to me that they would only develop a test program for
TSO on IA-32 if it actually worked that way.

Eric


From: Alexander Terekhov on

"Eric P." wrote:
[...]
> (I don't know what "type II" and "Extension to Dubois' Abstraction"
> mean. I can't find reference to either in Gharachorloo.)

-------
A load by Pi is considered performed at a point in time when the
issuing of a store to the same address by any P cannot affect the value
returned by the load

A store by Pi is considered performed with respect to Pk (i and k
different) before a point in time when issuing a load to the same
address by Pk returns the value defined by this store or a subsequent
store to the same address that has been performed with respect to Pk

A store by Pi eventually performs with respect to Pi.

If a load by Pi performs before the last store (in program order) to
the same address by Pi performs with respect to Pi, then the load
returns the value defined by that store. Otherwise, the load returns
the value defined by the last store to the same address (by any P)
that performed with respect to Pi (before the load performs).

A store is performed when it is performed with respect to all
processors

Conditions for Processor Consistency

before a LOAD is allowed to perform with respect to any other
processor, all previous LOAD accesses must be performed

before a STORE is allowed to perform with respect to any other
processor, all previous accesses (LOADs and STOREs) must be
performed
------

>
> Hmmm.... I thought it was resolved.
>
> Joe Seigh said on x86 memory model:
> http://groups.google.ca/group/comp.arch/msg/6af78be87ca29f31?hl=en&
>
> "It turns out the x86 memory model is defined, it's just not defined
> in the IA-32 manuals which is where you would expect it to be defined.
> It's defined in the Itanium manuals and is equivalent to Sparc TSO
> memory model."

The Itanium x86 mapping being a bit more strongly ordered than native
x86 won't break anything. It will just make it slow. ;-)

[... snip moral level ...]

> I also just came across this doc when searching for 'global visibility'
>
> Fast and Generalized Polynomial Time Memory Consistency Verification
> Amitabha Roy, Stephan Zeisset, Charles J. Fleckenstein, John C. Huang
> Intel Corporation
> http://arxiv.org/pdf/cs.AR/0605039.pdf
>
> makes multiple references to TSO and the following statements:
>
> "The algorithm we have developed is currently implemented in Intel's
> in house random test generator and is used by both the IA-32 and
> Itanium verification teams."
>
> "A load is considered performed (or executed) if no subsequent store
> to that location (on any processor) can change the load return value.
> A store is considered performed (or executed) if any subsequent
> load to that location (on any processor) returns its value."
>
> "Axiom 2 (Value Coherence)
> The value returned by a read is from either the most recent store
> in program order or the most recent store in global order."
>
> These are TSO rules, not PC rules.
> It seems to me that they would only develop a test program for
> TSO on IA-32 if it actually worked that way.

The actual hardware implementation may well do TSO. So what?

"The algorithm assumes store atomicity, which is necessary for Axiom 3.
However it supports slightly relaxed consistency models which allow a
load to observe a local store which precedes it in program order,
before it is globally observed. Thus we cover all coherence protocols
that support the notion of relaxed write atomicity which can be defined
as: No store is visible to any other processor before the execution
point of the store. Based on our discussion with Intel microarchitects
we determined that all IA-32 and current generations of Itanium
microprocessors support this due to identifiable and atomic global
observation points for any store. This is mostly due to the shared bus
and single chipset."

Not very promising.

regards,
alexander.
From: Chris Thomasson on
"Eric P." <eric_pattison(a)sympaticoREMOVE.ca> wrote in message
news:4540b737$0$1355$834e42db(a)reader.greatnowhere.com...
> Chris Thomasson wrote:
>> "Eric P." <eric_pattison(a)sympaticoREMOVE.ca> wrote in message
>> news:453f7657$0$1353$834e42db(a)reader.greatnowhere.com...
>> > If you are concerened about read bypassing side effects then
>> > add an LFENCE or MFENCE.

[...]

> Also Andy Glew had some comments on load & store ordering
> http://groups.google.com/group/comp.arch/msg/96ec4a9fb75389a2

Right... Basically, something like this:

http://groups.google.com/group/comp.programming.threads/msg/68ba70e66d6b6ee9


From: Brian Hurt on
nmm1(a)cus.cam.ac.uk (Nick Maclaren) writes:


>That experience debunked the claims of the
>functional programming brigade that such methodology gave automatic
>parallelisation.

Automatic parallelization, no. You're looking for a silver bullet that
probably doesn't exist. On the other hand, functional programming makes
writing parallel code much easier to do.

The biggest problem with parallelized code is the race condition, which
arises from mutable data. Every piece of mutable data is a race condition
waiting to happen. Mutable data needs to be kept to an absolute minimum,
and then handled in such a way that it's correct in the presence of
threads.

I've come to the conclusion that functional programming is necessary-
just not sufficient. There are two languages I know of in which it may
be possible to write non-trivial parallel programs correctly and
maintainably- Concurrent Haskell with STM, and Erlang- and both are, at
their core, purely functional languages.

Brian