From: Frank Kotler on
Nimai wrote:
> I'm learning to program in straight machine code, and I just finished
> reading the Intel manuals.
>
> I have a burning question that the books haven't answered, maybe I'm
> just stupid and I missed it.
>
> If I do a JMP to a bunch of garbled data, how does the prefetching
> process know where the "instruction boundaries" are? Where will EIP
> be when the inevitable invalid opcode exception is triggered?
>
> In other words, if the instructions are garbage, how much garbage is
> taken in? What are the rules?
>
> My guess is, each possible opcode byte has something like a lookup
> table entry, and after parsing a byte, the prefetcher either adds
> another byte to the instruction, adds a modr/m byte to the instruction
> and grabs displacement and immediate bytes, or ends the instruction
> and sends it to the pipeline. This is entirely based on inference, I
> can't find anything in the manuals to confirm or deny this.
>
> Whatever process it uses, it MUST be entirely deterministic, or code
> can't be. So where is it documented?

I haven't a clue. I'm with Bob Masta - try it and see! ("one test is
worth a thousand expert opinions") But I observe that the guys who design
the chips hang out on comp.arch, so I'll cross-post this there, in hopes
that it may get you a definitive answer (which may be "it's proprietary,
we can't tell ya"). Good luck!
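
In the meantime, something like this might do for a first "try it and
see" (just my sketch - assumes Linux/x86 and that the kernel reports the
faulting instruction's address in si_addr):

    #include <signal.h>
    #include <setjmp.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    static sigjmp_buf env;
    static void *fault_ip;

    static void on_sigill(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        fault_ip = si->si_addr;        /* address of the bad instruction */
        siglongjmp(env, 1);
    }

    int main(void)
    {
        /* one harmless nop (90), then ud2 (0f 0b), the opcode defined
           to raise the invalid-opcode exception */
        unsigned char bytes[] = { 0x90, 0x0f, 0x0b };
        unsigned char *buf = mmap(NULL, 4096,
                                  PROT_READ | PROT_WRITE | PROT_EXEC,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        struct sigaction sa = { 0 };

        memcpy(buf, bytes, sizeof bytes);
        sa.sa_sigaction = on_sigill;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGILL, &sa, NULL);

        if (sigsetjmp(env, 1) == 0)
            ((void (*)(void))buf)();   /* jump into the buffer */
        printf("SIGILL at buf+%td\n", (unsigned char *)fault_ip - buf);
        return 0;
    }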

Best,
Frank
From: Joe Pfeiffer on
> Nimai wrote:
>> I'm learning to program in straight machine code, and I just finished
>> reading the Intel manuals.
>>
>> I have a burning question that the books haven't answered, maybe I'm
>> just stupid and I missed it.
>>
>> If I do a JMP to a bunch of garbled data, how does the prefetching
>> process know where the "instruction boundaries" are? Where will EIP
>> be when the inevitable invalid opcode exception is triggered?
>>
>> In other words, if the instructions are garbage, how much garbage is
>> taken in? What are the rules?
>>
>> My guess is, each possible opcode byte has something like a lookup
>> table entry, and after parsing a byte, the prefetcher either adds
>> another byte to the instruction, adds a modr/m byte to the instruction
>> and grabs displacement and immediate bytes, or ends the instruction
>> and sends it to the pipeline. This is entirely based on inference, I
>> can't find anything in the manuals to confirm or deny this.
>>
>> Whatever process it uses, it MUST be entirely deterministic, or code
>> can't be. So where is it documented?

Why should it be documented? What you've described is conceptually how
it works; all that's left that matters to the programmer is how many
instructions of what type can be decoded simultaneously (since that can
affect optimization).

As for when you get a fault, that depends on just what the garbling is.
NX bit set? Immediately.

Bad opcode? Immediately.

Ends up trying to read/write data from invalid address? Immediately,
but it'll be a protection fault on the data address.

Made it past the first "instruction"? On to the second...
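
To watch that last case happen from user space, a harness like the one
sketched in Frank's post can be fed a sequence that starts out valid and
goes bad (my example - swap this in for the bytes[] array there):

    /* inc eax (ff c0) decodes and retires fine; ud2 (0f 0b) then faults,
       so the SIGILL report lands at buf+2, after one real instruction. */
    unsigned char bytes[] = { 0xff, 0xc0, 0x0f, 0x0b };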
--
As we enjoy great advantages from the inventions of others, we should
be glad of an opportunity to serve others by any invention of ours;
and this we should do freely and generously. (Benjamin Franklin)
From: nedbrek on
Hello,
Welcome, comp.lang.asm.x86!

"Joe Pfeiffer" <pfeiffer(a)nospicedham.cs.nmsu.edu> wrote in message
news:1br5jg214l.fsf(a)snowball.wb.pfeifferfamily.net...
>> Nimai wrote:
>>> If I do a JMP to a bunch of garbled data, how does the prefetching
>>> process know where the "instruction boundaries" are? Where will EIP
>>> be when the inevitable invalid opcode exception is triggered?
>>>
>>> In other words, if the instructions are garbage, how much garbage is
>>> taken in? What are the rules?
>>>
>>> My guess is, each possible opcode byte has something like a lookup
>>> table entry, and after parsing a byte, the prefetcher either adds
>>> another byte to the instruction, adds a modr/m byte to the instruction
>>> and grabs displacement and immediate bytes, or ends the instruction
>>> and sends it to the pipeline. This is entirely based on inference, I
>>> can't find anything in the manuals to confirm or deny this.
>>>
>>> Whatever process it uses, it MUST be entirely deterministic, or code
>>> can't be. So where is it documented?
>
> As for when you get a fault, that depends on just what the garbling is.
> NX bit set? Immediately.
>
> Bad opcode? Immediately.
>
> Ends up trying to read/write data from invalid address? Immediately,
> but it'll be a protection fault on the data address.
>
> Made it past the first "instruction"? On to the second...

That about sums it up!


There are two aspects: architectural (what software sees) and hardware
(what actually happens).

The hardware is just going to shovel bits into the execution engine. An
advanced machine doesn't even look at the bits at first. Hardware further
down the line interprets the bits into instructions.

This part of the machine is very speculative, so it can never be sure that
a mispredicted branch somewhere earlier won't steer execution around the
garbage entirely. The machine won't flag any bad decode until it is sure
that the architectural path really goes that way.
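
For instance (my sketch, GNU C inline asm on x86): garbage sitting right
after a taken branch may well be fetched and even speculatively decoded,
but it never faults, because the architectural path goes around it:

    void skip_garbage(void)
    {
        __asm__ volatile (
            "jmp 1f\n\t"            /* architectural path jumps over... */
            ".byte 0x0f, 0x0b\n\t"  /* ...this ud2 - it never raises #UD */
            "1:\n\t");
    }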

Any machine has to come to the same result as a simple, one
instruction-at-a-time machine would (maintaining the architectural
illusion). There are all sorts of nifty tricks to make this happen, but
rest assured the fault will be deterministic.

However, architecturally, there is only one instruction defined to stay
invalid (ud2, 0f 0b), so anything else might run for a while. Also, new
instructions get added - so what happens to be invalid today might be a
real instruction tomorrow.

You might even manage to fall into an infinite loop! (jmp byte -2, eb fe)
Hope your environment has preemptive multitasking!

Hope that helps!
Ned


From: Andy 'Krazy' Glew on
On 7/6/2010 6:07 AM, Frank Kotler wrote:
> Nimai wrote:
>> I'm learning to program in straight machine code, and I just finished
>> reading the Intel manuals.
>>
>> I have a burning question that the books haven't answered, maybe I'm
>> just stupid and I missed it.
>>
>> If I do a JMP to a bunch of garbled data, how does the prefetching
>> process know where the "instruction boundaries" are? Where will EIP
>> be when the inevitable invalid opcode exception is triggered?
>>
>> In other words, if the instructions are garbage, how much garbage is
>> taken in? What are the rules?
>>
>> My guess is, each possible opcode byte has something like a lookup
>> table entry, and after parsing a byte, the prefetcher either adds
>> another byte to the instruction, adds a modr/m byte to the instruction
>> and grabs displacement and immediate bytes, or ends the instruction
>> and sends it to the pipeline. This is entirely based on inference, I
>> can't find anything in the manuals to confirm or deny this.
>>
>> Whatever process it uses, it MUST be entirely deterministic, or code
>> can't be. So where is it documented?

Nimai's guess is a fairly accurate description of what is treated as the
de facto architectural definition.

The actual hardware is more like:
   fetch 1 or 2 blocks of instructions (typically 16-byte aligned) containing the branch target
   decode, in parallel, several instructions in those blocks, starting at the branch target

i.e. it is done in parallel. Although there have been machines that could
only decode one instruction at a time if it had never been seen before;
typically those machines have instruction predecode bits in the instruction
cache, maybe even the L2, and have rather poor performance on code they
haven't seen before.

But most modern machines can at least decode multiple bytes of a given instruction within a cycle. Typically via

Option 1:
   assume the first byte is an opcode byte
   assume the second is a modrm
   assume the 3rd-6th are an offset
Option 2:
   assume the first byte is a REX prefix or some other prefix
   assume the second byte is an opcode byte
   assume the third is a modrm
   assume the 4th-7th are an offset
...

and so on, in parallel, using whichever option matches.


But the semantics are as if the bytes were looked at one at a time.
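
To make that concrete, here is a toy byte-at-a-time length decoder in C,
in the spirit of Nimai's lookup-table guess (my sketch only: a handful of
32-bit opcodes, lock/rep/segment prefixes only - size-override prefixes
are omitted because they change immediate widths - real x86 has vastly
more cases):

    #include <stdint.h>
    #include <stddef.h>

    static int is_prefix(uint8_t b)  /* lock/rep and segment overrides */
    {
        switch (b) {
        case 0xF0: case 0xF2: case 0xF3:
        case 0x2E: case 0x36: case 0x3E: case 0x26: case 0x64: case 0x65:
            return 1;
        default:
            return 0;
        }
    }

    /* Length of the instruction starting at p, or -1 for "not in this
       toy subset" (where a real decoder might eventually raise #UD). */
    int insn_len(const uint8_t *p)
    {
        const uint8_t *start = p;
        int has_modrm, imm;

        while (is_prefix(*p)) p++;           /* 1: consume prefixes */

        switch (*p++) {                      /* 2: main opcode byte */
        case 0x90:            has_modrm = 0; imm = 0; break; /* nop */
        case 0x01: case 0x03: has_modrm = 1; imm = 0; break; /* add r/m,r */
        case 0x05:            has_modrm = 0; imm = 4; break; /* add eax,imm32 */
        case 0xB8:            has_modrm = 0; imm = 4; break; /* mov eax,imm32 */
        case 0xEB:            has_modrm = 0; imm = 1; break; /* jmp rel8 */
        default:              return -1;
        }

        if (has_modrm) {                     /* 3: modrm, SIB, disp */
            uint8_t modrm = *p++;
            uint8_t mod = modrm >> 6, rm = modrm & 7;
            if (mod != 3 && rm == 4) {       /* SIB byte present */
                uint8_t sib = *p++;
                if (mod == 0 && (sib & 7) == 5) p += 4; /* base=101: disp32 */
            } else if (mod == 0 && rm == 5) {
                p += 4;                      /* disp32, no base */
            }
            if (mod == 1) p += 1;            /* disp8 */
            else if (mod == 2) p += 4;       /* disp32 */
        }

        return (int)(p - start) + imm;       /* 4: immediate bytes */
    }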
From: MitchAlsup on
On Jul 6, 8:07 am, Frank Kotler <fbkot...(a)nospicedham.myfairpoint.net>
wrote:
> Nimai wrote:
> > I'm learning to program in straight machine code, and I just finished
> > reading the Intel manuals.
>
> > I have a burning question that the books haven't answered, maybe I'm
> > just stupid and I missed it.
>
> > If I do a JMP to a bunch of garbled data, how does the prefetching
> > process know where the "instruction boundaries" are?  Where will EIP
> > be when the inevitable invalid opcode exception is triggered?

The EIP will point to the first instruction that has detectable
garbage. The key word, here, is detectable, as so very many byte
sequences are legal (if not very usable) opcodes.

> > In other words, if the instructions are garbage, how much garbage is
> > taken in?  What are the rules?

It is wise to assume that at least 3 cache lines of garbage are
fetched before garbage is decoded.

> > My guess is, each possible opcode byte has something like a lookup
> > table entry, and after parsing a byte, the prefetcher either adds
> > another byte to the instruction, adds a modr/m byte to the instruction
> > and grabs displacement and immediate bytes, or ends the instruction
> > and sends it to the pipeline.  This is entirely based on inference, I
> > can't find anything in the manuals to confirm or deny this.
>
> > Whatever process it uses, it MUST be entirely deterministic, or code
> > can't be.  So where is it documented?

It ends up different on different architectures.

But your logic is sound, you are just not thinking in parallel. What
generally happens is that at least 4 bytes are fully decoded into 256
signals per byte. Then various logic condenses the 256 signals (times
the number of bytes) to 50-ish, especially ferreting out prefixes (with
respect to operating mode). Then another layer of logic identifies the
major opcode byte. And the rest is simply a cascade of multiplexers.
One end result of all this multiplexing is the start pointer for the
next instruction.

The major opcode byte specifies whether there are opcode bits in the
minor opcode byte (if present) and modr/m and SIB encodings. Knowing
whether a minor opcode, modr/m, or SIB is present, and whether an
immediate is present, gives you all that is necessary (prior to SSE4) to
determine the subsequent instruction boundary.
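
In software terms, a caricature of that parallel scheme might look like
this (my sketch - reusing the toy insn_len() from Andy's subthread; real
hardware does the chain as a mux cascade inside a stage, not as a loop):

    #include <stddef.h>
    #include <stdint.h>

    int insn_len(const uint8_t *p);  /* the toy decoder from upthread */

    /* Compute a candidate length at EVERY byte offset as if an
       instruction started there (all independent, i.e. "in parallel"),
       then chain the start pointers to mark the real boundaries. */
    void mark_starts(const uint8_t *buf, size_t n, uint8_t *is_start)
    {
        enum { MAX = 64 };
        int len[MAX];
        size_t i, pc;

        for (i = 0; i < n && i < MAX; i++) {
            len[i] = insn_len(buf + i);  /* speculative decode at offset i */
            is_start[i] = 0;
        }
        pc = 0;                          /* the mux cascade, as a loop */
        while (pc < n && pc < MAX && len[pc] > 0) {
            is_start[pc] = 1;
            pc += (size_t)len[pc];       /* start pointer for next insn */
        }
    }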

Bad opcodes are generally detected about another whole pipe stage down
the pipe from instruction parsing. There is no reason to clutter up a
hard problem with an intractable problem in a gate-limited and
fan-limited pipe stage. You still have at least 5 pipe stages before any
damage is done to machine state. Plenty of time to stumble across the
myriad of subtle invalid opcodes due to improper use of modr/m or SIBs
or prefix violations. And NO reason to get cute and try to do them
earlier.

{I happen to know how to do this in 12 gate delays from raw bytes, and
3 instructions at a time in 8 gate delays with end-pointer bits.}

All of this is also dependent on some sequencing decisions made in the
pipeline.

Mitch