From: MitchAlsup on
On Jun 12, 2:33 pm, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:
> If you have kernel access, and so can hook INT3, what's wrong with
>
> FOR i FROM lowest byte of instruction TO highest byte DO
>      *i = INT3 (single byte trap instruction)
>
> FOR i FROM highest byte of instruction TO lowest byte of instruction DO
>      *i = appropriate byte of new instruction

Consider the case where an interested CPU has already fetched the
first byte (or first several bytes) of said instruction and one of
these fetched bytes happens to be a major opcode byte, but the rest of
the instruction fetch gets delayed by this or that. There is no
architectural specification that requires the fetch process to back up
when an instruction cache line is stolen.

Now your function comes in and writes INT 3 over the rest of the
instruction, snatching the cache line from the interested CPU.

Finally the delayed CPU finishes fetching and executes the
instruction with all of its minor opcodes, ModRM, SIB, and constants
containing INT 3 byte patterns.

Nothing good will come of this.

You have to prevent the "interested" CPU from fetching the first byte
of the instruction before smearing INT 3's over the opcode space. The
only chance you have of making this work is to align this particular
instruction on a cache line boundary... which one cannot do for a
random instruction.

Mitch
From: Andy 'Krazy' Glew on
On 6/12/2010 4:28 PM, MitchAlsup wrote:
> On Jun 12, 2:33 pm, Andy 'Krazy' Glew<ag-n...(a)patten-glew.net> wrote:
>> If you have kernel access, and so can hook INT3, what's wrong with
>>
>> FOR i FROM lowest byte of instruction TO highest byte DO
>> *i = INT3 (single byte trap instruction)
>>
>> FOR i FROM highest byte of instruction TO lowest byte of instruction DO
>> *i = appropriate byte of new instruction
>
> Consider the case where an interested CPU has already fetched the
> first byte (or first several bytes) of said instruction and one of
> these fetched bytes happens to be a major opcode byte, but the rest of
> the instruction fetch gets delayed by this or that. There is no
> architectural specification that requires the fetch process to back up
> when an instruction cache line is stolen.
>
> Now your function comes in and writes INT 3 over the rest of the
> instruction, snatching the cache line from the interested CPU.
>
> Finally the delayed CPU finishes fetching and executes the
> instruction with all of its minor opcodes, ModRM, SIB, and constants
> containing INT 3 byte patterns.
>
> Nothing good will come of this.
>
> You have to prevent the "interested" CPU from fetching the first byte
> of the instruction before smearing INT 3's over the opcode space. The
> only chance you have of making this work is to align this particular
> instruction on a cache line boundary... which one cannot do for a
> random instruction.


Fair enough.

There are no atomicity properties defined for instruction fetch.

Perhaps there should be.

Lacking this, the only safe way is to do a shootdown: stop all CPUs, write the new instruction bytes, perform a
serializing instruction on each CPU to flush all instruction caches and prefetch queues (that part is architecturally
defined), and then restart.
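
A rough C sketch of that kind of shootdown is below. The IPI helper and the CPU count are hypothetical
placeholders, not any particular kernel's API, and the serializing step uses CPUID via inline asm.

#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

extern void send_ipi_to_all_others(void);   /* hypothetical kernel helper */

static atomic_int cpus_parked;   /* CPUs currently spinning in the IPI handler */
static atomic_int patch_done;    /* set once the new bytes are in place        */

static inline void serialize(void)
{
    /* CPUID is a serializing instruction; it discards any prefetched bytes. */
    uint32_t eax = 0, ebx, ecx, edx;
    __asm__ volatile("cpuid"
                     : "+a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                     :: "memory");
}

/* Runs on every other CPU, out of the IPI handler. */
void rendezvous_slave(void)
{
    atomic_fetch_add(&cpus_parked, 1);
    while (!atomic_load(&patch_done))
        ;                          /* spin; do not fetch the code being patched */
    serialize();                   /* flush any stale prefetch                  */
    atomic_fetch_sub(&cpus_parked, 1);
}

/* Runs on the patching CPU. */
void patch_instruction(void *target, const void *new_bytes, size_t len, int ncpus)
{
    atomic_store(&patch_done, 0);
    send_ipi_to_all_others();      /* park everyone else in rendezvous_slave() */
    while (atomic_load(&cpus_parked) != ncpus - 1)
        ;
    memcpy(target, new_bytes, len);   /* safe: nobody else can be fetching it */
    serialize();
    atomic_store(&patch_done, 1);
    while (atomic_load(&cpus_parked) != 0)
        ;                          /* wait for the others to resume */
}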

Thanks, Mitch. I had gone over this with the Pin people, but had forgotten.

Now, there are de-facto instruction fetch atomicity properties. But nothing official. (E.g. on Intel (since P6), and I
believe AMD, any ifetch entirely within an aligned 16 bytes is atomic. And Intel (since P6) will clear when the first
byte is written; i.e. Intel recognizes SMC immediately (as, I think, does AMD).) So I believe that the algorithm I
describe will work on Intel since P6, for WB memory. I think that it also will work for UC memory. (Glew's rule: any
architectural property should work when caches are disabled.)
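
For concreteness, the two passes above come out roughly like this in C. Just a sketch; the names are
illustrative, and, per Mitch's point, it leans on those de-facto properties rather than anything architectural.

#include <stdint.h>
#include <stddef.h>

#define INT3 0xCC   /* the single-byte x86 breakpoint opcode */

/* Assumes the #BP (INT3) handler has already been hooked, so a thread that
 * lands on a half-patched instruction traps instead of decoding garbage. */
void patch_with_int3_fence(volatile uint8_t *target,
                           const uint8_t *new_insn, size_t len)
{
    /* Pass 1: lowest byte to highest, smear INT3 over the old instruction. */
    for (size_t i = 0; i < len; i++)
        target[i] = INT3;

    /* Pass 2: highest byte down to lowest, so the first byte only stops
     * being INT3 once the rest of the new instruction is already in place. */
    for (size_t i = len; i-- > 0; )
        target[i] = new_insn[i];
}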

But there is nothing official.

---

By the way, this is an example of where making self-modifying code recognition immediate, as part of the memory
ordering model, simplifies things.
From: Terje Mathisen "terje.mathisen at tmsw.no" on
Andy 'Krazy' Glew wrote:
> Now, there are de-facto instruction fetch atomicity properties. But
> nothing official. (E.g. on Intel (since P6), and I believe AMD, any
> ifetch entirely within an aligned 16 bytes is atomic. And Intel (since
> P6) will clear when the first byte is written; i.e. Intel recognizes SMC

Intel have done so since the 486 I believe, definitely since the Pentium!

From the 8088 and up to the 386 you could measure the size of the
instruction prefetch buffer by first executing a very long-running
instruction that did not touch memory (e.g. DIV), then a REP STOSB which
would overwrite the immediately following instructions with NOP bytes.
Those following bytes would all start out as single-byte INC reg opcodes,
so the number of INCs that still got executed was a pretty good indication
of the size of the prefetch buffer. (I believe this all ran within a
CLI/STI block, i.e. with interrupts disabled...)

> immediately (as, I think, does AMD).) So I believe that the algorithm I
> describe will work on Intel since P6, for WB memory. I think that it also
> will work for UC memory. (Glew's rule: any architectural property should
> work when caches are disabled.)

That's a _very_ good rule, since the opposite could make debugging the
cpu _far_ worse.
>
> But there is nothing official.
:-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: nmm1 on
In article <55see7-4re.ln1(a)ntp.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>Andy 'Krazy' Glew wrote:
>> Now, there are de-facto instruction fetch atomicity properties. But
>> nothing official. ...
>
>> (Glew's rule: any architectural property should
>> work when caches are disabled.)
>
>That's a _very_ good rule, since the opposite could make debugging the
>cpu _far_ worse.

All rules have exceptions, but they should be protected by barriers
proportional to the loss of sanity involved in breaking them. That
one should definitely be safe outside a maximum security enclosure
(e.g. features provided for use only in machine-check handlers may
need to break it, but those should NOT be used outside such code).

>> But there is nothing official.
>:-)

Personally, I think the author of any normal code (including kernel)
who relies on instruction fetch atomicity needs reeducation. Yes,
I have done that, but it was a long time ago, the constraints were
those of the 1970s, and I wouldn't do it again!


Regards,
Nick Maclaren.
From: MitchAlsup on
On Jun 12, 7:17 pm, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:
> Now, there are de-facto instruction fetch atomicity properties.  But nothing official.  (E.g. on Intel (since P6), and I
> believe AMD, any ifetch entirely within an aligned 16 bytes is atomic.  And Intel (since P6) will clear when the first
> byte is written; i.e. Intel recognizes SMC immediately (as, I think, does AMD).)  <snip>
>
> But there is nothing official.

And there IS the problem. AMD Athlons and Opterons obey the
self-modifying-code checks wrt instruction fetch, but do not on
third-party modifications of code, simply because there is no spec as
to what is required (and the addresses to be checked are quite distant
from the addresses that need checking). {This was circa '07 and may
have been changed.} When we investigated this issue (circa '05) there
was no code sequence that would successfully do this and work on both
Intel and AMD machines {the IBM Java JIT compiler triggered the
investigation.} So, the compiler had to do a CPUID early and use the
result to pick a code sequence later, as needed.
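
Roughly, that probe-once, dispatch-later idea looks like this in C. The strategy names are just
illustrative labels, not vendor-blessed sequences; __get_cpuid is the GCC/Clang helper for CPUID.

#include <cpuid.h>    /* GCC/Clang wrapper for the CPUID instruction */
#include <string.h>

enum patch_strategy { PATCH_SEQ_INTEL, PATCH_SEQ_AMD, PATCH_SEQ_CONSERVATIVE };

enum patch_strategy pick_patch_strategy(void)
{
    unsigned int eax, ebx, ecx, edx;
    char vendor[13] = {0};

    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return PATCH_SEQ_CONSERVATIVE;

    /* The vendor string is laid out across EBX, EDX, ECX, in that order. */
    memcpy(vendor + 0, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);

    if (strcmp(vendor, "GenuineIntel") == 0)
        return PATCH_SEQ_INTEL;   /* can lean on P6-style SMC snooping      */
    if (strcmp(vendor, "AuthenticAMD") == 0)
        return PATCH_SEQ_AMD;     /* use a full stop-and-serialize sequence */
    return PATCH_SEQ_CONSERVATIVE;
}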

>(Glew's rule: any architectural property should work when caches are disabled.)

This rule will prevent bundling more than one instruction into a
single atomic sequence, should you ever want such a longer atomic
operation with multiple independent memory references. Thus something
like ASF can only work when the instructions and data are both
cacheable and both caches are turned on.

{Note: I am not disagreeing with the general principle involved, but
there are times.....}

Mitch