From: Andy 'Krazy' Glew on
On 6/13/2010 4:31 AM, Terje Mathisen wrote:
> Andy 'Krazy' Glew wrote:
>> Now, there are de-facto instruction fetch atomicity properties. But
>> nothing official. (E.g. on Intel (since P6), and I believe AMD, any
>> ifetch entirely within an aligned 16 bytes is atomic. And Intel (since
>> P6) will clear when the first byte is written; i.e. Intel recognizes SMC
>> immediately (as, I think, does AMD).) So I believe that the algorithm I
>> describe will work on Intel since P6, for WB memory. I think that it also
>> will work for UC memory.

Correction: the algorithm I described should work for coherent UC memory. I.e. UC memory where writes are exposed to
other processors for snooping.

It will not work for non-coherent write-through memory. In that case, the scenario Mitch described will cause the
protocol to fail, and a shootdown will be necessary.
From: Andy 'Krazy' Glew on
On 6/13/2010 4:31 AM, Terje Mathisen wrote:
> Andy 'Krazy' Glew wrote:
>> Now, there are de-facto instruction fetch atomicity properties. But
>> nothing official. (E.g. on Intel (since P6), and I believe AMD, any
>> ifetch entirely within an aligned 16 bytes is atomic. And Intel (since
>> P6) will clear when the first byte is written; i.e. Intel recognizes SMC
>
> Intel have done so since the 486 I believe, definitely since the Pentium!

Are you sure? I have a pretty clear recollection that there were codes that ran on P5 but failed on P6, because P6
caused the SMC to take effect immediately, at the next instruction boundary. I think that on P5, if an instruction in
the U pipe overwrote the very next instruction, and that instruction was already in the V pipe, the effect was not
immediately seen.
From: Andy 'Krazy' Glew on
On 6/13/2010 6:53 AM, MitchAlsup wrote:
> On Jun 12, 7:17 pm, Andy 'Krazy' Glew<ag-n...(a)patten-glew.net> wrote:
>> (Glew's rule: any architectural property should work when caches are disabled.)
>
> This rule will prevent bundling more than one instruction into a
> single atomic sequence, should you ever want such a longer atomic
> operation with multiple independent memory references. Thus something
> like ASF can only work when the instructions and data are both
> cacheable and both caches are turned on.
>
> {Note: I am not disagreeing with the general principle involved, but
> there are times.....}


That's been one of the problems I have had with ASF, and Transactional Memory, and any of the dozens of other proposals
I have seen that use similar mechanisms. (E.g. many flavors of load-linked/store-conditional. N-way LLSC. etc.)

If you disable caches for debugging, they break.

However, you can fix this. As I said in my reply to Terje, you can have coherent (and ordered) UC memory.

From: Andy 'Krazy' Glew on
On 6/13/2010 6:53 AM, MitchAlsup wrote:
> On Jun 12, 7:17 pm, Andy 'Krazy' Glew<ag-n...(a)patten-glew.net> wrote:
>> Now, there are de-facto instruction fetch atomicity properties. But nothing official. (E.g. on Intel (since P6), and I
>> believe AMD, any ifetch entirely within an aligned 16 bytes is atomic. And Intel (since P6) will clear when the first
>> byte is written; i.e. Intel recognizes SMC immediately (as, I think, does AMD).)<snip>
>>
>> But there is nothing official.
>
> And there IS the problem. AMD Athlons and Opterons obey the self
> modifying code checks wrt instruction fetch, but do not on third party
> modifications of code simply because there is no spec as to what is
> required (and the addresses to be checked are quite distant from the
> addresses that need checking). {This was circa '07 and may have been
> changed.} When we investigated this issue (circa '05) there was no
> code sequence that would successfully do this and work on both
> Intel and AMD machines. {The IBM Java JIT compiler triggered the
> investigation.} So, the compiler had to do a CPUID early and use
> this result to pick a code sequence later, as needed.
>
>> (Glew's rule: any architectural property should work when caches are disabled.)
>
> This rule will prevent bundling more than one instruction into a
> single atomic sequence, should you ever want such a longer atomic
> operation with multiple independent memory references. Thus something
> like ASF can only work when the instructions and data are both
> cacheable and both caches are turned on.
>
> {Note: I am not disagreeing with the general principle involved, but
> there are times.....}


I just added this to

http://semipublic.comp-arch.net/wiki/Design_Principles_and_Rules_of_Thumb

If you want to add some...
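The CPUID-early-then-dispatch workaround Mitch describes can be sketched as follows. This assumes GCC/Clang on x86 (`__get_cpuid` from `<cpuid.h>`); the printed strategy names are placeholders for whatever vendor-specific cross-modifying-code sequences the JIT would actually emit.

```c
#include <stdio.h>
#include <string.h>
#include <cpuid.h>

int main(void) {
    unsigned eax, ebx, ecx, edx;
    char vendor[13] = {0};

    /* CPUID leaf 0: vendor string comes back in EBX, EDX, ECX order. */
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return 1;
    memcpy(vendor + 0, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);

    /* Pick the code-modification strategy once, at startup. */
    if (strcmp(vendor, "GenuineIntel") == 0)
        puts("use Intel cross-modifying-code sequence");
    else if (strcmp(vendor, "AuthenticAMD") == 0)
        puts("use AMD cross-modifying-code sequence");
    else
        puts("use conservative fallback sequence");
    return 0;
}
</imports></imports>

The point is exactly Mitch's: with no architectural spec for third-party code modification, the dispatch has to happen in software, per vendor.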
From: MitchAlsup on
On Jun 13, 3:08 pm, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:
> That's been one of the problems I have had with ASF, and Transactional Memory, and any of the dozens of other proposals
> I have seen that use similar mechanisms.  (E.g. many flavors of load-linked/store-conditional.  N-way LLSC. etc.)
>
> If you disable caches for debugging, they break.
>
> However, you can fix this.  As I said in my reply to Terje, you can have coherent (and ordered) UC memory.

So, what is your solution for building atomic primitives that
necessarily need multiple independent memory locations (up to 5)?

{When I looked at this, I could have stopped with DCASQ. But I
suspected that every year another atomic primitive would be requested.
So, it was either be in a position to continuously add instructions,
or deliver the cook book to SW and let them figure out what they need,
so HW developers could get on with designing other things.}

But I have an even stiffer question: Why should the debugger allow
single stepping through a series of instructions that must appear to
be atomic?

{After all, once you are single stepping, other CPUs in the MP may
interact with something that needs to appear atomic at the other end.
Arguing that the debugger can catch all the other interacting
threads, shut them down for a while, single step, and then finally
wake up the other threads, will not uncover the actual cases
you will want to be debugging--the hard ones.

With ASF, by defining those memory locations that were participating
and those that were not, one could dump intermediate state to a memory
buffer that can be printed after the atomic event has transpired. It
seems to me there is no way to correctly give the illusion of
atomicity and allow single stepping through an atomic event.}

Mitch