From: Alexander Terekhov on

Joe Seigh wrote:
[...]
> We're assuming weakly ordered memory I think, whatever the typical multiprocessor
> Intel box meant to run Linux or windows uses. Whatever "write-back cacheable"
> is.

It means PC (apart from the non-temporal weakly ordered stuff) under x86
native (not Itanicized x86, i.e. TSO for WB instead of PC), and you don't
need LFENCE under PC.

regards,
alexander.
From: Eric P. on
Joe Seigh wrote:
>
> Alexander Terekhov wrote:
> >
> > Neither will give you "global ordering of loads". Loads on ia32 are
> > in-order with respect to other loads and subsequent stores (by the
> > same processor). The only thing that differentiates PC from TSO is
> > the lack of remote write atomicity (in IA64 formal memory model
> > speak). Implementations (e.g. SPO) of course can do all sorts of
> > tricks to improve performance, but that doesn't change the memory
> > model. You're in denial.
> >
>
> Whatever. I'm going to use LFENCE for situations where I'd use
> #LoadLoad on sparc (generic, not assuming TSO). And it's not
> because I'm in denial. It's because nothing you say is
> comprehensible. It's possible you are making some kind of
> valid technical point but I have no way of telling.

As I understand it, the key to causal ordering is Atomic Visibility,
whereby a write becomes visible simultaneously to all processors
other than the one that issued it. According to Gharachorloo,
Processor Consistency does not require that updates be Atomically
Visible and, in theory, allows non-causal orderings of the kind in
your example. TSO does require Atomic Visibility.

The reason PC allows this rather dubious ordering appears to be to
avoid ruling out caches that use a Write Update (as opposed to Write
Invalidate) coherency protocol. Imposing Atomic Visibility on a
Write Update cache would be very difficult because each cache would
receive the updated value but would then have to prevent that value
from being used until all peers had acked. Imposing Atomic
Visibility on a Write Invalidate cache is much easier - just don't
give out the new value until all invalidate acks are received.
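That last rule can be shown with a toy model (entirely illustrative;
`ToyCacheLine` and its members are made-up names, and a real coherence
protocol tracks far more state): the writing cache withholds the new
value until every peer's invalidate ack has arrived.

```cpp
// Toy model of the write-invalidate rule described above: the new
// value is not handed out until all invalidate acks are in.
struct ToyCacheLine {
    int value = 0;          // the globally visible value
    int pending_value = 0;  // value waiting on invalidate acks
    int acks_needed = 0;    // outstanding invalidate acks

    // Begin a write: invalidates go out to `peers` other caches.
    void start_write(int v, int peers) {
        pending_value = v;
        acks_needed = peers;
    }

    // A peer acknowledges its invalidate.
    void receive_ack() {
        if (acks_needed > 0 && --acks_needed == 0)
            value = pending_value;  // only now is the write visible
    }

    // Reads keep returning the old value until all acks arrive.
    int read() const { return value; }
};
```

Until the last ack comes back, every reader still sees the old value,
which is what makes the write look atomic to the rest of the machine.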

(Others have pointed out, however, that Write Update caches are
undesirable for other reasons, so PC appears to give up atomicity in
order to gain the ability to use a cache design that no one wants to
use anyway. Go figure.)
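The ordering question being debated can be sketched with C++11 atomics
(a hypothetical mapping, not x86 assembly; `seq_cst` here stands in for
"TSO plus remote write atomicity", and `run_trial` is a made-up test
harness). With sequentially consistent operations the non-causal
outcome {u == 1, v == 0} is forbidden; weaken the loads and stores to
`relaxed` and the C++ model, like PC, permits it in theory.

```cpp
// Sketch of the three-processor causality example from this thread.
#include <atomic>
#include <thread>
#include <utility>

std::atomic<int> X(0), Y(0);

// One trial: P1 stores X, P2 sees it and stores Y, P3 loads Y then X.
// Returns {u, v} as observed by P3.
std::pair<int, int> run_trial() {
    X.store(0); Y.store(0);
    int u = 0, v = 0;
    std::thread p1([] { X.store(1, std::memory_order_seq_cst); });
    std::thread p2([] {
        while (X.load(std::memory_order_seq_cst) != 1) { }  // see P1's store
        Y.store(1, std::memory_order_seq_cst);              // then store Y
    });
    std::thread p3([&] {
        u = Y.load(std::memory_order_seq_cst);
        v = X.load(std::memory_order_seq_cst);
    });
    p1.join(); p2.join(); p3.join();
    return std::make_pair(u, v);
}
```

With seq_cst, u == 1 implies v == 1 on every trial; that implication is
exactly what PC, lacking Atomic Visibility, declines to promise.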

The text of the LFENCE instruction in the Intel instruction manual says
"Performs a serializing operation on all load-from-memory instructions
that were issued prior the LFENCE instruction. This serializing
operation guarantees that every load instruction that precedes in
program order the LFENCE instruction is globally visible before any
load instruction that follows the LFENCE instruction is globally
visible. The LFENCE instruction is ordered with respect to load
instructions, other LFENCE instructions,"...

seems to provide the guarantees of global visibility, and therefore
causality, that you are looking for.

Eric

From: Alexander Terekhov on

"Eric P." wrote:
[...]
> The text of LFENCE instruction in the Intel instruction manual says
> "Performs a serializing operation on all load-from-memory instructions
> that were issued prior the LFENCE instruction. This serializing
> operation guarantees that every load instruction that precedes in
> program order the LFENCE instruction is globally visible before any
> load instruction that follows the LFENCE instruction is globally
> visible. The LFENCE instruction is ordered with respect to load
> instructions, other LFENCE instructions,"...
>
> seems to provide the guarantees of global visibility and

What does "global visibility" mean for loads under PC?

> therefore causality that you are looking for.

So where do you put the fence, then?

: processor 1 stores into X
: processor 2 see the store by 1 into X and stores into Y
: processor 3 loads from Y
: processor 3 loads from X

regards,
alexander.
From: Alexander Terekhov on

Joe Seigh wrote:

[... filters ...]

< Forward Quoted >

Newsgroups: comp.programming.threads
Subject: Re: Memory visibility and MS Interlocked instructions
From: David Hopwood <david.nospam.hopwood(a)blueyonder.co.uk>

-------- Original Message --------

David Hopwood wrote:
>
> Alexander Terekhov wrote:
> > Andy Glew of Intel (sorta) confirmed that x86 is classic PC.
> >
> > http://groups.google.de/group/comp.arch/msg/7200ec152c8cca0c
>
> Joe Seigh wrote:
> > The argument being presented in c.p.t. is that processor consistency
> > implies loads are in order, perhaps instigated by something Andy Glew
> > said about this here
> > http://groups.google.com/group/comp.arch/msg/96ec4a9fb75389a2
>
> and in another post:
> | "loads in order" means #LoadLoad between loads.
>
> > AFAICT, this is not true for 3 or more processors. E.g.
> >
> > processor 1 stores into X
> > processor 2 see the store by 1 into X and stores into Y
> >
> > So the store into Y occurred after causal reasoning.
>
> Processor consistency is weaker than causal consistency, remember.
>
> > processor 3 loads from Y
> > processor 3 loads from X
> >
> > If loads were in order you could infer that if processor 3
> > sees the new value of Y then it will see the new value of X.
>
> No.
>
> Start with X == Y == 0.
>
> P1: X := 1
>
> P2: t := X;
> if (t == 1) Y := 1
>
> P3: u := Y
> #LoadLoad // or acquire
> v := X
>
> {u == 1, v == 0} is possible. This is because P2 and P3 might see
> the stores to X and Y in a different order, because they are made
> by different processors. The #LoadLoad does not prevent this.
>
> > But the rules for processor consistency *clearly* state that
> > you will [not] necessarily see stores by different processors in
> > order.
> >
> > While there are still ordering constraints on the loads they
> > don't have to be strictly in order as Andy incorrectly infers.
>
> #LoadLoad between loads does not imply that you will necessarily
> see stores by different processors in a single global order. That
> is what you appear to be misunderstanding. In other words, there
> is nothing inconsistent between what Andy Glew's post said, and
> Alexander's assertion that load on x86 implies load.acq.
>
> --
> David Hopwood <david.nospam.hopwood(a)blueyonder.co.uk>

regards,
alexander.
From: Eric P. on
Alexander Terekhov wrote:
>
> "Eric P." wrote:
> [...]
> > The text of LFENCE instruction in the Intel instruction manual says
> > "Performs a serializing operation on all load-from-memory instructions
> > that were issued prior the LFENCE instruction. This serializing
> > operation guarantees that every load instruction that precedes in
> > program order the LFENCE instruction is globally visible before any
> > load instruction that follows the LFENCE instruction is globally
> > visible. The LFENCE instruction is ordered with respect to load
> > instructions, other LFENCE instructions,"...
> >
> > seems to provide the guarantees of global visibility and
>
> What does "global visibility" mean for loads under PC?

Point taken.

> > therefore causality that you are looking for.
>
> So where do you put the fence, then?
>
> : processor 1 stores into X
> : processor 2 see the store by 1 into X and stores into Y
> : processor 3 loads from Y
> : processor 3 loads from X
>
> regards,
> alexander.

I was wondering that myself. How about:
P3:
LD X
LFENCE
LD Y
LFENCE
LD X

This does seem a terrible price to pay for the 'advantages' one
gets from giving up Atomic Visibility.
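For comparison, the C++11 spelling of a #LoadLoad-style barrier between
two loads looks roughly like this (a sketch only; the acquire fence is
an analogy for LFENCE, not a claim they are equivalent instructions,
and as discussed above ordering the loads does not by itself restore
remote write atomicity):

```cpp
// P3's read side with a load-load barrier between the loads.
#include <atomic>
#include <utility>

std::atomic<int> X(0), Y(0);

std::pair<int, int> reader() {
    int u = Y.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);  // ~ #LoadLoad
    int v = X.load(std::memory_order_relaxed);
    return std::make_pair(u, v);
}
```

The fence keeps the two loads from being reordered with each other on
P3; whether the stores from P1 and P2 reach P3 in a single global order
is a separate question, which is the whole point of this thread.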

In practice I would be surprised if this could ever really occur.
When Joe posted the example, I thought it was impossible, and was
surprised to find that it is, in theory, possible, at least
according to Gharachorloo's definition of PC. I would be more
surprised if there were even one programmer in a million who did not
consider this a hardware bug, or who wrote code that took it into
account. I'd bet people code to TSO.

Eric