Intel x86 memory model question [Computer Architecture]

From: Seongbae Park on 31 Aug 2005 17:57

Seongbae Park <Seongbae.Park(a)Sun.COM> wrote:
> Joe Seigh <jseigh_01(a)xemaps.com> wrote:
> ...
>> It turns out the x86 memory model is defined, it's just not defined in the
>> IA-32 manuals which is where you would expect it to be defined. It's defined
>> in the Itanium manuals and is equivalent to Sparc TSO memory model.
>>
>> 2.1.2 Loads and Stores
>> In the Itanium architecture, a load instruction has either unordered or acquire semantics while a
>> store instruction has either unordered or release semantics. By using acquire loads (ld.acq) and
>> release stores (st.rel), the memory reference stream of an Itanium-based program can be made to
>> operate according to the IA-32 ordering model. The Itanium architecture uses this behavior to
>> provide IA-32 compatibility. That is, an Itanium acquire load is equivalent to an IA-32 load and an
>> Itanium release store is equivalent to an IA-32 store, from a memory ordering perspective.
>
> I suspect the above paragraph is stronger than what it really wanted to say.
> It seems that the intention was to say
> that Itanium can correctly emulate x86 by running effectively in a TSO mode,
> since x86's memory model is not stronger than TSO.

I take this back.
Actually the above statement depends on whether IA64 is RCsc or RCpc.
If it is RCpc, then by definition all special accesses are PC in RCpc,
and turning every access into a special access just turns it into PC.
If it is RCsc, then it is not really a TSO but SC which is stronger than PC
and hence can run the program correctly.
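
As a rough illustration of why the RCpc reading matters, the sketch below encodes only the pairwise program-order rules for accesses to different addresses: nothing later may move above an acquire load, and nothing earlier may move below a release store. The function name `may_reorder` and the string labels are hypothetical, and the sketch assumes the RCpc reading, in which a later acquire load may still pass an earlier release store. It says nothing about write atomicity.

```python
def may_reorder(first, second):
    """Return True if `second` may be performed before `first`.

    Both accesses are assumed to target different addresses. Each one is
    labelled "ld", "ld.acq", "st" or "st.rel" (labels are illustrative).
    """
    if first == "ld.acq":
        # Acquire: no later access may move above the load.
        return False
    if second == "st.rel":
        # Release: no earlier access may move below the store.
        return False
    return True


# Assumption: an IA-32-style stream uses only acquire loads and release stores.
ia32_like = ["ld.acq", "st.rel"]
allowed = [(a, b) for a in ia32_like for b in ia32_like if may_reorder(a, b)]
print(allowed)  # [('st.rel', 'ld.acq')]
```

If that assumption holds, store-then-load is the only program-order relaxation left once every access is special, which is the same relaxation PC and TSO allow.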

I didn't bother to look at the IA64 manual - anybody care to comment on this?
But I suspect that IA64 is RCpc and the manual is exactly correct after all.
--
#pragma ident "Seongbae Park, compiler, http://blogs.sun.com/seongbae/"

From: Eric P. on 31 Aug 2005 18:02

Joe Seigh wrote:
>
> Eric P. wrote:
> > Joe Seigh wrote:
> >
> >>Joe Seigh wrote:
> >>
> >>> processor 1 stores into X
> >>> processor 2 sees the store by 1 into X and stores into Y
> >>>
> >>>So the store into Y occurred after causal reasoning.
> >>>
> >>> processor 3 loads from Y
> >>> processor 3 loads from X
> >>>
> >>>If loads were in order you could infer that if processor 3
> >>>sees the new value of Y then it will see the new value of X.
> >>>But the rules for processor consistency *clearly* state that
> >>>you will necessarily see stores by different processors in
> >>>order.
> >>
> >>that should be
> >>
> >>But the rules for processor consistency *clearly* state that
> >>you will not necessarily see stores by different processors in
> >>order.
> >
> >
> > I see what you are getting at, but for this to occur the new value
> > of Y would have to arrive at P3 before the new value of X from P1,
> > implying the msg from P2 to P3 somehow passed the msg from P1 to P3.
> > This would mean that no update order at all could be concluded
> > and the whole system would break.
> >
> > Since they clearly do function, this is obviously not how they work :-)
> >
>
> It turns out the x86 memory model is defined, it's just not defined in the
> IA-32 manuals which is where you would expect it to be defined. It's defined
> in the Itanium manuals and is equivalent to Sparc TSO memory model.
>
> 2.1.2 Loads and Stores
> In the Itanium architecture, a load instruction has either unordered or acquire semantics while a
> store instruction has either unordered or release semantics. By using acquire loads (ld.acq) and
> release stores (st.rel), the memory reference stream of an Itanium-based program can be made to
> operate according to the IA-32 ordering model. The Itanium architecture uses this behavior to
> provide IA-32 compatibility. That is, an Itanium acquire load is equivalent to an IA-32 load and an
> Itanium release store is equivalent to an IA-32 store, from a memory ordering perspective.

I think the underlying question you asked about the x86 is:

Does the Intel Processor Consistency model require processors
to wait for all other processors to acknowledge receipt of their
invalidates before any are allowed to use the new value?

The memory ordering information in section 7.2.2 does not give an answer.

This would likely depend on the bus protocol details.
It might be implemented by having P1 send an invalidate X to P2
and not reply to a request from P2 for a read of the new value of
X until it had received the invalidate acknowledgment from P3.

I haven't paid any attention to the IA64 acquire/release mechanism
as I figure I'll never run into it, so I'm not sure if that is
the same as a release.

Eric

From: Ricardo Bugalho on 1 Sep 2005 06:22

On Wed, 31 Aug 2005 21:57:58 +0000, Seongbae Park wrote:

> I didn't bother to look at IA64 manual - anybody care to comment on this ?
> but I suspect that IA64 is RCpc and the manual is exactly correct after
> all.

It's RCpc indeed.

From: Ricardo Bugalho on 1 Sep 2005 06:36

On Wed, 31 Aug 2005 18:02:34 -0400, Eric P. wrote:

>
> I think the underlying question you asked about the x86 is:
>
> Does the Intel Processor Consistency model require processors to wait
> for all other processors to acknowledge receipt of their invalidates
> before any are allowed to use the new value?
>

It does not.
The most straightforward example is buffered store forwarding: when a CPU
writes a value into memory, it can read it again directly from the store
buffer, even before it tries to make it visible to other processors.
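
A minimal sketch of that behaviour, assuming a toy core with a FIFO store buffer in front of a shared memory. The `Cpu` class and its methods are hypothetical and do not model any particular Intel implementation:

```python
from collections import deque


class Cpu:
    """Toy core: stores wait in a private FIFO buffer before reaching memory."""

    def __init__(self, memory):
        self.memory = memory          # shared dict: address -> value
        self.store_buffer = deque()   # pending (address, value), oldest first

    def store(self, addr, value):
        # The store is buffered; other cores cannot see it yet.
        self.store_buffer.append((addr, value))

    def load(self, addr):
        # Store forwarding: the newest buffered store to this address wins.
        for buffered_addr, value in reversed(self.store_buffer):
            if buffered_addr == addr:
                return value
        return self.memory.get(addr, 0)

    def drain(self):
        # Make buffered stores globally visible, in program order.
        while self.store_buffer:
            addr, value = self.store_buffer.popleft()
            self.memory[addr] = value


memory = {}
p1, p2 = Cpu(memory), Cpu(memory)

p1.store("X", 1)
own_view = p1.load("X")      # forwarded from P1's own store buffer
other_view = p2.load("X")    # P2 still reads the old value from memory
p1.drain()
after_drain = p2.load("X")   # now visible to P2

print(own_view, other_view, after_drain)  # 1 0 1
```

In this model P1 uses the new value before any other processor could have acknowledged, or even seen, the store.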

From: Alexander Terekhov on 1 Sep 2005 07:23

Ricardo Bugalho wrote:
>
> On Wed, 31 Aug 2005 21:57:58 +0000, Seongbae Park wrote:
>
> > I didn't bother to look at IA64 manual - anybody care to comment on this ?
> > but I suspect that IA64 is RCpc and the manual is exactly correct after
> > all.
>
> It's RCpc indeed.

Not quite. Release stores to *WB* memory are constrained to ensure
"remote write atomicity". Classic RCpc is weaker in this respect
(and that's what makes RC != TSO). You'd better not rely on this
property because emulating it on CELLs (for example) will make your
ports run really slow. ;-)
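
The difference can be seen on the three-processor example quoted earlier in the thread (P1 stores X, P2 sees it and stores Y, P3 loads Y and then X). The sketch below is a toy enumeration, not the Itanium or IA-32 specification: every processor has its own view of memory, and `atomic_writes` (a hypothetical flag) decides whether a store reaches all other processors in one step or reaches each of them independently.

```python
X, Y = 0, 1            # addresses (indexes into a per-processor view)
R1, R2, R3 = 0, 1, 2   # registers

PROGRAMS = (
    [("st_const", X, 1)],                # P1: X = 1
    [("ld", X, R1), ("st_reg", Y, R1)],  # P2: r1 = X; Y = r1
    [("ld", Y, R2), ("ld", X, R3)],      # P3: r2 = Y; r3 = X (in program order)
)


def wrc_outcomes(atomic_writes):
    """Return every reachable (r2, r3) pair for P3 under the toy model."""
    outcomes, seen = set(), set()
    n = len(PROGRAMS)

    def explore(pcs, regs, views, pending):
        state = (pcs, regs, views, pending)
        if state in seen:
            return
        seen.add(state)
        if all(pcs[p] == len(PROGRAMS[p]) for p in range(n)):
            outcomes.add((regs[R2], regs[R3]))
            return
        # Choice 1: some processor executes its next instruction.
        for p in range(n):
            if pcs[p] == len(PROGRAMS[p]):
                continue
            op, addr, arg = PROGRAMS[p][pcs[p]]
            new_pcs = pcs[:p] + (pcs[p] + 1,) + pcs[p + 1:]
            if op == "ld":
                new_regs = regs[:arg] + (views[p][addr],) + regs[arg + 1:]
                explore(new_pcs, new_regs, views, pending)
            else:
                value = arg if op == "st_const" else regs[arg]
                own = views[p][:addr] + (value,) + views[p][addr + 1:]
                new_views = views[:p] + (own,) + views[p + 1:]
                others = tuple(q for q in range(n) if q != p)
                if atomic_writes:
                    # One message that updates every other processor at once.
                    msgs = ((p, others, addr, value),)
                else:
                    # One independent message per destination.
                    msgs = tuple((p, (q,), addr, value) for q in others)
                explore(new_pcs, regs, new_views, pending + msgs)
        # Choice 2: deliver a pending store (FIFO per source/destination set).
        for i, (src, dsts, addr, value) in enumerate(pending):
            if any(m[0] == src and m[1] == dsts for m in pending[:i]):
                continue
            new_views = tuple(
                v[:addr] + (value,) + v[addr + 1:] if q in dsts else v
                for q, v in enumerate(views)
            )
            explore(pcs, regs, new_views, pending[:i] + pending[i + 1:])

    explore((0,) * n, (0, 0, 0), ((0, 0),) * n, ())
    return outcomes


print(sorted(wrc_outcomes(atomic_writes=True)))   # [(0, 0), (0, 1), (1, 1)]
print(sorted(wrc_outcomes(atomic_writes=False)))  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

Only the non-atomic variant reaches (r2, r3) = (1, 0), where P3 sees the new Y but the old X. With atomic remote writes that outcome disappears even though each processor may still read its own store early.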

regards,
alexander.
