From: Alexei A. Frounze on
On Jun 11, 4:11 am, Terje Mathisen <"terje.mathisen at tmsw.no">
wrote:
> Alexei A. Frounze wrote:
> > Apparently, both Windows and ReactOS patch the code containing the
> > prefetchnta instruction, e.g. RtlPrefetchMemoryNonTemporal(),
> > depending on whether or not the instruction is supported by the CPU:
> > http://www.computer.org/portal/web/csdl/doi/10.1109/HICSS.2010.182
> > (click on the PDF link, search for this fxn)
> > http://www.koders.com/c/fid4D23409E3EA6032D618D125732B4AC17A3E773DA.aspx
> > (search the code for this fxn)
> > There's a similar thing in Linux with the option of just skipping the
> > instruction (see handle_prefetch()):
> > http://lwn.net/Articles/8634/
>
> I checked the relevant link and code, and I really don't think the author
> understands just how hard it will be to get it right.
>
> He does note that there are multiple problem areas related to SMP
> systems, and that these make the solution much more complicated than it
> would be for a single-core setup.
>
> Anyway, his SMP hack to allow fixup of large instructions is to make
> sure that all the opcodes used, that could fault, will do so based on
> the first 4 opcode bytes only, i.e. independently of any following bytes.
>
> With this restriction he can first use simple store instructions to
> overwrite the tail, if any, and then use a locked update to fix the
> first four bytes.
>
> (He doesn't state it explicitly, but I assume he fixes 1-3 byte opcodes
> by always writing 4 bytes, rewriting the current values into the
> following bytes.)
>
> The potential problem I noted here is that afaik, many systems only
> guarantee the atomicity of locked writes if they are properly aligned,
> and 75% of all opcodes will not start on a 4-byte boundary.

Validating that code or suggesting its use wasn't my intention. :) I
just shared on the topic of instruction emulation, which is an
interesting thing.

Alex
From: Terje Mathisen on
Alexei A. Frounze wrote:
> On Jun 11, 4:11 am, Terje Mathisen<"terje.mathisen at tmsw.no">
>> The potential problem I noted here is that afaik, many systems only
>> guarantee the atomicity of locked writes if they are properly aligned,
>> and 75% of all opcodes will not start on a 4-byte boundary.
>
> Validating that code or suggesting its use wasn't my intention. :) I
> just shared on the topic of instruction emulation, which is an
> interesting thing.

Sure, I understood that!

It was indeed interesting, and it got me to think about how to solve
that particular problem (fixing up instructions without using a global
lock on an SMP machine).

I think the simplest (for some version of "simplest") method might be to
temporarily replace the debug interrupt handler with a function that
verifies the source of the interrupt, then either emulates the current
instruction or chains to the previous handler (so debuggers will keep
working etc.)

As soon as this is in place you can safely overwrite the first opcode
byte with an "INT3", then fixup the rest of the instruction and finally
set the first byte to the proper value.

With a fence between the INT3 write and the rest of the opcode bytes and
another before the final update of the first byte, this should be
perfectly safe even on SMP kernels, and the overhead would be very low
indeed.
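
A minimal C sketch of the three steps above, for illustration only. The
function name patch_insn_int3 is hypothetical, the "code" here is just a
byte buffer, and the sketch assumes that an INT3 handler is already
installed and that the target bytes are writable; neither is shown. C11
fences stand in for the two fences mentioned above, and any further
requirements of real cross-modifying code are ignored.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define INT3_OPCODE 0xCC  /* x86 single-byte breakpoint instruction */

/* Hypothetical helper: replace the len-byte instruction at `code` with
 * `new_insn`. Assumes an INT3 trap handler is already in place and that
 * `code` is writable; both are outside the scope of this sketch. */
static void patch_insn_int3(volatile uint8_t *code,
                            const uint8_t *new_insn, size_t len)
{
    assert(len >= 1);

    /* Step 1: make the first byte trap, so that nothing entering the
     * instruction at its start sees a half-written encoding. */
    code[0] = INT3_OPCODE;
    atomic_thread_fence(memory_order_seq_cst);

    /* Step 2: rewrite the tail (bytes 1..len-1) with plain stores. */
    for (size_t i = 1; i < len; i++)
        code[i] = new_insn[i];
    atomic_thread_fence(memory_order_seq_cst);

    /* Step 3: one byte store replaces the INT3 with the new first byte. */
    code[0] = new_insn[0];
}
```

Calling patch_insn_int3(buf, replacement, n) on an n-byte buffer leaves
buf holding the n replacement bytes; the interesting part is the order in
which other CPUs could observe the intermediate states.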

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Andy 'Krazy' Glew on
On 6/12/2010 1:41 AM, Terje Mathisen wrote:
> Alexei A. Frounze wrote:
>> On Jun 11, 4:11 am, Terje Mathisen<"terje.mathisen at tmsw.no">
>>> The potential problem I noted here is that afaik, many systems only
>>> guarantee the atomicity of locked writes if they are properly aligned,
>>> and 75% of all opcodes will not start on a 4-byte boundary.
>>
>> Validating that code or suggesting its use wasn't my intention. :) I
>> just shared on the topic of instruction emulation, which is an
>> interesting thing.
>
> Sure, I understood that!
>
> It was indeed interesting, and it got me to think about how to solve
> that particular problem (fixing up instructions without using a global
> lock on an SMP machine).

If you have kernel access, and so can hook INT3, what's wrong with

FOR i FROM lowest byte of instruction TO highest byte DO
    b[i] = INT3 (single byte trap instruction)

FOR i FROM highest byte of instruction TO lowest byte of instruction DO
    b[i] = appropriate byte of new instruction


?


(Assuming that you don't have an issue with an instruction crossing a page boundary, with the pages (in particular, the
lowest addressed pages) being aliased. But that's just a fairly simple extension.)
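
For comparison, the two loops above might look like this in C. This is
only a sketch on a plain byte buffer: the name patch_insn_all_int3 is
hypothetical, the fence between the two passes is an added assumption
(the pseudocode does not mention one), and the INT3 handler, page
permissions and the page-crossing case are not handled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define INT3_OPCODE 0xCC  /* x86 single-byte breakpoint instruction */

/* Hypothetical helper: overwrite every byte of the old instruction with
 * INT3 (lowest address first), then write the new bytes from the highest
 * address down, so the first byte is the last one to become valid. */
static void patch_insn_all_int3(volatile uint8_t *code,
                                const uint8_t *new_insn, size_t len)
{
    /* Pass 1: lowest byte to highest byte, everything becomes a trap. */
    for (size_t i = 0; i < len; i++)
        code[i] = INT3_OPCODE;

    /* Assumption: order the two passes with a full fence. */
    atomic_thread_fence(memory_order_seq_cst);

    /* Pass 2: highest byte to lowest byte, install the new encoding. */
    for (size_t i = len; i-- > 0; )
        code[i] = new_insn[i];
}
```

The cost relative to trapping only the first byte is len - 1 extra byte
stores; what it buys is that a jump into any byte of the old instruction
also hits an INT3 while the patch is in progress.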



From: Terje Mathisen on
Andy 'Krazy' Glew wrote:
> On 6/12/2010 1:41 AM, Terje Mathisen wrote:
>> Alexei A. Frounze wrote:
>>> On Jun 11, 4:11 am, Terje Mathisen<"terje.mathisen at tmsw.no">
>>>> The potential problem I noted here is that afaik, many systems only
>>>> guarantee the atomicity of locked writes if they are properly aligned,
>>>> and 75% of all opcodes will not start on a 4-byte boundary.
>>>
>>> Validating that code or suggesting its use wasn't my intention. :) I
>>> just shared on the topic of instruction emulation, which is an
>>> interesting thing.
>>
>> Sure, I understood that!
>>
>> It was indeed interesting, and it got me to think about how to solve
>> that particular problem (fixing up instructions without using a global
>> lock on an SMP machine).
>
> If you have kernel access so can hook INT3, what's wrong with
>
> FOR i FROM lowest byte of instruction TO highest byte DO
>     b[i] = INT3 (single byte trap instruction)
>
> FOR i FROM highest byte of instruction TO lowest byte of instruction DO
>     b[i] = appropriate byte of new instruction
>
> ?

Nothing at all, it is just a lot more work than really needed:

Only the first (lowest) byte needs to be replaced with an INT3, then you
can write all the subsequent bytes with the new opcode bytes, and then
finally fix the INT3, i.e. just a single extra byte write.
>
>
> (Assuming that you don't have an issue with an instruction crossing a
> page boundary, with the pages (in particular, the lowest addressed
> pages) being aliased. But that's just a fairly simple extension.)

I am assuming that you're not trying to fixup an instruction which also
contains a jump target inside it, like the following code which I wrote
for the 486 once upon a time:

int func(...)
{
asm {
... prolog
; the optimal inner loop is skewed so we need to skip the top
; of the loop on the first iteration:
; JMP FIRST
; To avoid the JMP I use a dummy opcode which skips the top:

cmp ax, 1234h ;; 1-byte opcode + 2-byte immediate
ORG $-2 ;; return to just before the 2-byte immediate
next_iter:
inc si
inc di
FIRST: ;; First iteration starts here...
...

This code was optimal on the 486, but blew away the Icache instruction
boundaries on a Pentium.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Andy 'Krazy' Glew on
On 6/12/2010 2:56 PM, Terje Mathisen wrote:
> Andy 'Krazy' Glew wrote:
>> On 6/12/2010 1:41 AM, Terje Mathisen wrote:
>>> Alexei A. Frounze wrote:
>>> It was indeed interesting, and it got me to think about how to solve
>>> that particular problem (fixing up instructions without using a global
>>> lock on an SMP machine).
>>
>> If you have kernel access so can hook INT3, what's wrong with
>>
>> FOR i FROM lowest byte of instruction TO highest byte DO
>>     b[i] = INT3 (single byte trap instruction)
>>
>> FOR i FROM highest byte of instruction TO lowest byte of instruction DO
>>     b[i] = appropriate byte of new instruction
>>
>> ?
>
> Nothing at all, it is just a lot more work than really needed:
>
> Only the first (lowest) byte needs to be replaced with an INT3, then you
> can write all the subsequent bytes with the new opcode bytes, and then
> finally fix the INT3, i.e. just a single extra byte write.

Not if there is a possibility that code might branch into the middle of the instruction, not just the first byte.

You can, of course, write combine an aligned 2, 4, or even 8 bytes' worth of INT3s - I believe the new memory model says
that 64 bit aligned writes are atomic. You can't depend on split order, however.
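
To make the alignment condition concrete, here is a small C sketch of the
check a patcher might make before relying on a single aligned store. The
name fits_aligned_window is hypothetical, and whether such a store is
really atomic with respect to instruction fetch on a given CPU is an
assumption the code does not verify; it only does the address arithmetic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if the len bytes starting at addr all lie
 * inside one naturally aligned window of `window` bytes (2, 4 or 8),
 * i.e. one aligned store of that size could cover them without a split. */
static bool fits_aligned_window(uintptr_t addr, size_t len, size_t window)
{
    /* window must be a nonzero power of two */
    assert(window != 0 && (window & (window - 1)) == 0);

    if (len == 0 || len > window)
        return false;

    uintptr_t mask = ~(uintptr_t)(window - 1);
    /* First and last byte must round down to the same window base. */
    return (addr & mask) == ((addr + len - 1) & mask);
}
```

For instance, a 2-byte span at an address ending in 6 fits an 8-byte
window, while the same span at an address ending in 7 straddles two
windows and would be a split write.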

Ooops, you talk about jumping into the middle of instructions below.