From: Andy 'Krazy' Glew on
On 6/13/2010 2:26 PM, MitchAlsup wrote:
> On Jun 13, 3:08 pm, Andy 'Krazy' Glew<ag-n...(a)patten-glew.net> wrote:
>> That's been one of the problems I have had with ASF, and Transactional Memory, and any of the dozens of other proposals
>> I have seen that use similar mechanisms. (E.g. many flavors of load-linked/store-conditional. N-way LLSC. etc.)
>>
>> If you disable caches for debugging, they break.
>>
>> However, you can fix this. As I said in my reply to Terje, you can have coherent (and ordered) UC memory.
>
> So, what is your solution for building atomic primitives that
> necessarily need multiple independent memory locations (up to 5)?

Why stop at 5?



>
> {When I looked at this, I could have stopped with DCASQ. But I
> suspected that every year another atomic primitive would be requested.
> So, it was either be in a position to continuously add instructions,
> or deliver the cookbook to SW and let them figure out what they need.
> So HW developers could get on with designing other things.}
>
> But I have an even stiffer question: Why should the debugger allow
> single stepping through a series of instructions that must appear to
> be atomic?

I don't think it necessarily should.

Similar issues arise with transactional memory. Should a transaction be atomic wrt debugging and single stepping, or not?

Choices include

0) aborting the transaction - going back to the top of the txn (doesn't work for single stepping)

1) single stepping, skipping over the whole atomic region.

2) single stepping in the atomic region, with provision to track the state involved, as you have described, and/or to shoot the transaction down



>
> {After all, once you are single stepping, other CPUs in the MP may
> interact with something that needs to appear atomic at the other end.
> Arguing that the debugger can catch all the other interacting
> threads, shut them down for a while, single step, and then finally
> wake up the other threads will not uncover the actual cases
> you will want to be debugging--the hard ones.
>
> With ASF, by defining those memory locations that were participating
> and those that were not, one could dump intermediate state to a memory
> buffer that can be printed after the atomic event has transpired. It
> seems to me there is no way to correctly give the illusion of
> atomicity and allow single stepping through an atomic event.}
>
> Mitch

From: Terje Mathisen "terje.mathisen at on
Andy 'Krazy' Glew wrote:
> On 6/13/2010 4:31 AM, Terje Mathisen wrote:
>> Andy 'Krazy' Glew wrote:
>>> Now, there are de-facto instruction fetch atomicity properties. But
>>> nothing official. (E.g. on Intel (since P6), and I believe AMD, any
>>> ifetch entirely within an aligned 16 bytes is atomic. And Intel (since
>>> P6) will clear when the first byte is written; i.e. Intel recognizes SMC
>>
>> Intel have done so since the 486 I believe, definitely since the Pentium!
>
> Are you sure? I have pretty clear recollection that there were codes
> that ran on P5 that failed on P6, because P6 caused the SMC to take
> effect immediately, at the next instruction boundary. I think on P5, if
> an instruction in the U pipe overwrote the very next instruction, and
> that instruction was already in the V pipe, the effect was not
> immediately seen.

That is possible, but quite hard to achieve, since it would require
you to set up a loop in such a way that on the first N (N can be 1)
iterations it would _not_ overwrite (or modify in any other way) the
following opcode, but on the next iteration it would.

This is due to the P5 quirk where the first iteration of any
particular piece of code would bring it into the I-cache while marking
the instruction boundaries; only on the second and later iterations,
while executing out of the I-cache, would both pipes be working.

It wouldn't be too hard to set up code to do just this, but it would
definitely never happen by accident:

mov al,'E' ; INC EBP opcode
mov ecx,2
mov edi, offset buffer + 4096 ; Start in next code page...
xor ebp,ebp
next:
mov [edi],al
buffer:
nop ; The second iteration overwrites this NOP

sub edi,4096
nop

dec ecx
jnz next

jmp done
nop
nop
nop
.... 4080 NOPs skipped
nop
nop ; The first write lands around here...
nop
nop
nop
done:
test ebp,ebp ; Did it get updated or not?

Since the (next:) target is the top of the loop, it will always run in
the U pipe, while the following NOP can run in V.

The SUB EDI,4096 and NOP can pair during the next iteration, while
DEC ECX and JNZ NEXT match up for the last.

By placing the first write 4K beyond the current code, I'm pretty
certain that no current CPU will take this as a reason to shoot down
any I-cache data for the little loop I'm running twice.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Terje Mathisen "terje.mathisen at on
MitchAlsup wrote:
> buffer that can be printed after the atomic event has transpired. It
> seems to me there is no way to correctly give the illusion of
> atomicity and allow single stepping through an atomic event.}

Please excuse me, but DUH!

Are you saying that there are people who demand/require this ability?

I thought the definition of "atomic" was "indivisible", so obviously
even single-stepping code would have to treat such a block as a single
instruction, right?

Just like in the very old days, where the single-step interrupt would
be delayed for an extra instruction if the first one was among those
defined to only occur as the first of a pair, e.g. an update to the
stack segment register.

Until you wrote the paragraph above, I really didn't consider this to be
a problem. :-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: nmm1 on
In article <fftge7-5gh.ln1(a)ntp.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>MitchAlsup wrote:
>> buffer that can be printed after the atomic event has transpired. It
>> seems to me there is no way to correctly give the illusion of
>> atomicity and allow single stepping through an atomic event.}
>
>Please excuse me, but DUH!
>
>Are you saying that there are people who demand/require this ability?
>
>I thought the definition of "atomic" was "indivisible", so obviously
>even single-stepping code would have to treat such a block as a single
>instruction, right?

Think about it. There are millions of script-kiddies out there who
think they are programmers, and who can't debug even the simplest
code without a debugger to help. Some of those will want (and even
need) to write atomic sections. So how are they going to debug them?

Notice "they". I know how I would, and I can guess how you would.

>Just like in the very old days where the single-step interrupt would
>delay for an extra instruction if the first was one of those that were
>defined to only occur as the first in a pair, i.e. like an update to the
>stack segment register.

That's NOT the very old days! In those, the debugger would misbehave
horribly. Yes, I know that we seem to have warped back there :-(

As an aside, I have been writing some revoltingly messy (inherently
so) code, and tried using a debugger again, as I do at intervals.
Bloody useless, as usual - not even a backtrace as soon as you make
an actual SIGSEGV mistake - dunno why, as some were read-only. So
back to the techniques I used in the 1960s and early 1970s ....


Regards,
Nick Maclaren.
From: MitchAlsup on
On Jun 13, 6:04 pm, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:
> On 6/13/2010 2:26 PM, MitchAlsup wrote:
>
> > On Jun 13, 3:08 pm, Andy 'Krazy' Glew<ag-n...(a)patten-glew.net>  wrote:
> >> That's been one of the problems I have had with ASF, and Transactional Memory, and any of the dozens of other proposals
> >> I have seen that use similar mechanisms.  (E.g. many flavors of load-linked/store-conditional.  N-way LLSC. etc.)
>
> >> If you disable caches for debugging, they break.
>
> >> However, you can fix this.  As I said in my reply to Terje, you can have coherent (and ordered) UC memory.
>
> > So, what is your solution for building atomic primitives that
> > necessarily need multiple independent memory locations (up to 5)?
>
> Why stop at 5?

We actually stopped at 7 -- the number of free cache miss buffers that
still allows forward progress on the other stuff the machine might be
up to -- but the longest sequence of code that proved useful in large
synchronization events used 5 independent memory locations. With that
we could move an element in a concurrent data structure from one
location to another in a single atomic event.
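
As a concrete illustration of where 5 comes from -- a sketch only, not
ASF code: it borrows GCC's __transaction_atomic block (compile with
-fgnu-tm) as a stand-in for a multi-location atomic, the doubly linked
list is just one example of such a structure, and the names are
illustrative -- moving one node between doubly linked lists touches
exactly five independent memory locations:

struct node {
    struct node *prev;
    struct node *next;
};

/* Unlink n from its current list and re-insert it after dst.
   Independent locations written: the node itself (1), its old
   predecessor (2) and successor (3), dst (4), and dst's old
   successor (5). */
static void move_node(struct node *n, struct node *dst)
{
    __transaction_atomic {
        n->prev->next = n->next;      /* (2) old predecessor         */
        n->next->prev = n->prev;      /* (3) old successor           */

        n->next = dst->next;          /* (1) the node itself         */
        n->prev = dst;
        dst->next->prev = n;          /* (5) dst's old successor     */
        dst->next = n;                /* (4) dst, the new predecessor */
    }
}
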
>
>
>
> > {When I looked at this, I could have stopped with DCASQ. But I
> > suspected that every year another atomic primitive would be requested.
> > So, it was either be in a position to continuously add instructions,
> > or deliver the cookbook to SW and let them figure out what they need.
> > So HW developers could get on with designing other things.}
>
> > But I have an even stiffer question: Why should the debugger allow
> > single stepping through a series of instructions that must appear to
> > be atomic?
>
> I don't think it necessarily should.
>
> Similar issues arise with transactional memory.  Should a transaction be atomic wrt debugging and single stepping, or not?
>
> Choices include
>
> 0) aborting the transaction - going back to the top of the txn (doesn't work for single stepping)
>
> 1) single stepping, skipping over the whole atomic region.
>
> 2) single stepping in the atomic region, with provision to track the state involved, as you have described, and/or to shoot the transaction down

ASF used the general philosophy of 1) above. By making a distinction
between memory locations participating in the atomic event and other
memory locations not participating, we got 1) and enough of 2) to
allow for print-level debugging: after-the-fact observation of what
happened in the atomic event.
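
A rough sketch of that print-level pattern (again leaning on GCC's
__transaction_atomic as a stand-in rather than real ASF mnemonics,
with the caveat that GCC's TM instruments every store, whereas in ASF
the trace buffer would simply be a non-participating location written
with ordinary stores): gather the interesting intermediate values in a
side buffer inside the atomic region, then print them only after the
event has transpired.

#include <stdio.h>

struct node { struct node *prev, *next; };

/* Side buffer standing in for the non-participating locations; it is
   only examined after the atomic event has committed.
   (One buffer per thread in real code.) */
static struct {
    void *n, *old_prev, *old_next, *dst;
} trace;

static void move_node_traced(struct node *n, struct node *dst)
{
    __transaction_atomic {
        trace.n        = n;            /* dump intermediate state ...  */
        trace.old_prev = n->prev;      /* ... for printing afterwards  */
        trace.old_next = n->next;
        trace.dst      = dst;

        n->prev->next = n->next;       /* the 5-location move itself   */
        n->next->prev = n->prev;
        n->next = dst->next;
        n->prev = dst;
        dst->next->prev = n;
        dst->next = n;
    }

    /* After the fact: the atomic event has transpired, now print. */
    printf("moved %p (was between %p and %p) to after %p\n",
           trace.n, trace.old_prev, trace.old_next, trace.dst);
}
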

Mitch