From: Andy Glew on
On 7/10/2010 8:05 AM, Owen Shepherd wrote:
> Andy Glew wrote:

>> Finally, to avoid memory SIEBK, to have it in registers, some code will
>> have to be responsible for swapping it to memory.
>>
>> I think this has to be PALcode.
>>
>> It cannot be OS code.
>>
>> There used to be a fad in RISCs to take interrupts in a FLIH, first
>> level interrupt handler, that was responsible e.g. for saving register
>> state to memory if needed.
>
> Many architectures have extra registers for use in the various levels of
> interrupt/exception handlers. A prime example of this is ARM, which provides
> separate modes for each exception type and avoids this issue. Instead, for
> each exception/interrupt mode it provides an extra link register (into which
> the exception return address is stored), and an SPSR (saved program status
> register, into which it stores the interrupted program's CPSR). Most
> exceptions quickly save state and then switch to system mode, which has no
> special registers.
>
> (OK, I'm ignoring that they also provide a stack pointer per exception type;
> this is because most operating systems ignore it these days! Some don't, of
> course, particularly for FIQs which add a bunch of extra GPRs too)
>
> So a register interface is certainly possible - you just need to add more
> hardware registers (which would be used in the guest version of each of the
> normal modes) and teach the OS to save them.
>
> In fact there is good reason to avoid the processor having to store this
> state to the SIEBK on interrupt: some applications can't tolerate the
> latencies that a cache miss on SDRAM while taking an interrupt would cause.
> These real-time applications tend to be run out of on chip SRAM, while the
> majority of the system uses off chip memory.

I knew we'd go here. (Why did this conversation have to start on the
last few days of my vacation, when I am frantically trying to wrap up my
sysadminning attempts?)

Extra register sets for interrupts - good idea.

Blocking interrupts on entry to interrupt handler - good idea.

Assuming that because interrupts are blocked, you have atomic,
exclusive, access to certain I/O devices and processor registers - bad idea.

One of the lessons of x86 virtual machines (VMX) and SMM/SMI has been
that, whatever privilege level you are at, there will always,
eventually, arise a level more privileged than you are. So you can't
necessarily assume that, just because interrupts are blocked in your
FLIH, no other code will run on the same processor.

I.e. the interrupts you know about may be blocked. But there may arise
other interrupts and other events that you don't know about.

E.g. OS code can get interrupted by SMIs. And OS code can get
interrupted by VM interrupts. And SMIs can get interrupted by VM
interrupts. Or is it vice versa?

Now, many OSes used to have to special case NMI. They could assume "my
FLIH is started with all interrupts blocked except NMI, which I can't
block. But, because I have control of the NMI code, I can ensure that
the NMI handler won't do anything long-running that will interfere with
the programming of this I/O device." (Like, imagine that the I/O
device gives an error if too much time elapses between writing IO reg A
and IO reg B.)

Or if the OS doesn't control NMI, because by convention the BIOS does,
at least the contract between OS and BIOS says that the BIOS doesn't
interfere.

Worse still, the OS may set a flag that the NMI handler checks. This
only works if the NMI handler is cooperating.

Introducing new privilege levels means that there are new NMI-like
things that the OS doesn't know about. Things break. But IMHO they
break only because of a bad design, where the OS assumes that it is the
most privileged possible thing.
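
To make the failure mode concrete, here is a toy model in Python. It is only a sketch under my own assumptions: the device, its tick-based timing window, and the names TimingSensitiveDevice and program_device are all made up for illustration, not taken from any real hardware.

```python
class TimingSensitiveDevice:
    """Hypothetical device: reg B must be written within max_gap ticks of reg A."""

    def __init__(self, max_gap):
        self.max_gap = max_gap
        self.a_written_at = None   # tick at which reg A was last written
        self.error = False

    def write_a(self, now):
        self.a_written_at = now

    def write_b(self, now):
        # The device flags an error if B arrives without A, or too late after A.
        if self.a_written_at is None or now - self.a_written_at > self.max_gap:
            self.error = True
        self.a_written_at = None


def program_device(device, hidden_event_ticks=0):
    """Model a FLIH that programs the device with "all interrupts blocked".

    hidden_event_ticks is time stolen between the two writes by something the
    handler cannot block or even see (an SMI, a VMM intercept, ...).
    Returns True if the device accepted the programming sequence.
    """
    now = 0
    device.write_a(now)
    now += 1                     # the handler's own work between the two writes
    now += hidden_event_ticks    # time taken by the more privileged level
    device.write_b(now)
    return not device.error
```

With no hidden event the sequence fits inside the window; with a long enough one it fails, even though the handler itself did nothing wrong.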



From: Owen Shepherd on
Andy Glew wrote:

> On 7/10/2010 8:05 AM, Owen Shepherd wrote:
>
> I knew we'd go here. (Why did this conversation have to start on the
> last few days of my vacation, when I am frantically trying to wrap up my
> sysadminning attempts?)
>
> Extra register sets for interrupts - good idea.
>
> Blocking interrupts on entry to interrupt handler - good idea.
>
> Assuming that because interrupts are blocked, you have atomic,
> exclusive, access to certain I/O devices and processor registers - bad
> idea.
>
> One of the lessons of x86 virtual machines (VMX) and SMM/SMI has been
> that, whatever privilege level you are at, there will always,
> eventually, arise a level more privileged than you are. So you can't
> necessarily assume that, just because interrupts are blocked in your
> FLIH, no other code will run on the same processor.

This has already occurred once on ARM - there is now a "secure mode" which I
haven't looked deeply into, which slots in above the normal system modes.
It's optional and really quite rare.

> I.e. the interrupts you know about may be blocked. But there may arise
> other interrupts and other events that you don't know about.
>
> E.g. OS code can get interrupted by SMIs. And OS code can get
> interrupted by VM interrupts. And SMIs can get interrupted by VM
> interrupts. Or is it vice versa?

From my understanding of AMD's documentation (I haven't looked at Intel's),
it's either and/or both. This strikes me as both convoluted and broken (isn't
SMI supposed to be essentially transparent to the OS developer?!)

(Which, unfortunately it isn't. Buggy BIOSes are the rule, not the
exception...)

> Now, many OSes used to have to special case NMI. They could assume "my
> FLIH is started with all interrupts blocked except NMI, which I can't
> block. But, because I have control of the NMI code, I can ensure that
> the NMI handler won't do anything long-running that will interfere with
> the programming of this I/O device." (Like, imagine that the I/O
> device gives an error if too much time elapses between writing IO reg A
> and IO reg B.)
>
> Or if the OS doesn't control NMI, because by convention the BIOS does,
> at least the contract between OS and BIOS says that the BIOS doesn't
> interfere.
>
> Worse still, the OS may set a flag that the NMI handler checks. This
> only works if the NMI handler is cooperating.
>
> Introducing new privilege levels means that there are new NMI-like
> things that the OS doesn't know about. Things break. But IMHO they
> break only because of a bad design, where the OS assumes that it is the
> most privileged possible thing.

On the other hand, by and large this is the appearance that these mechanisms
are trying to maintain.
From: Andy Glew on
On 7/10/2010 5:43 PM, Owen Shepherd wrote:
> Andy Glew wrote:

>> E.g. OS code can get interrupted by SMIs. And OS code can get
>> interrupted by VM interrupts. And SMIs can get interrupted by VM
>> interrupts. Or is it vice versa?
>
> From my understanding of AMD's documentation (I haven't looked at Intel's),
> it's either and/or both. This strikes me as both convoluted and broken (isn't
> SMI supposed to be essentially transparent to the OS developer?!)
>
> (Which, unfortunately it isn't. Buggy BIOSes are the rule, not the
> exception...)

My somewhat ironic comment about VMX and SMI is based on how much
thrashing occurred in this part of the design.

SMM (System Management Mode) with its all powerful SMI is basically an
early, leaky, non-transparent, virtual machine layer.

The problems with SMM and VMX interaction arise because people do not
recognize this. Either you need to provide support for multiple
virtual machine layers - of which SMM is a special case. Or, you need
to eliminate SMM.

Not providing support for multiple virtual machine layers is stupid and
shortsighted.

Even if you have support, you have to decide if SMM is below the VMM or
above it.


>> Introducing new privilege levels means that there are new NMI-like
>> things that the OS doesn't know about. Things break. But IMHO they
>> break only because of a bad design, where the OS assumes that it is the
>> most privileged possible thing.
>
> On the other hand, by and large this is the appearance that these mechanisms
> are trying to maintain.

Yes. But...

There is no problem when the OS is working on its own behalf.

The problem arises when the OS is interacting directly with something
that the VMM doesn't want to assume control over. E.g. some external
hardware that will produce errors if a control register programming
sequence is interrupted or not completed in a timely manner.

This may be an I/O hardware interface design problem. Along the line of
"all the chips that fit". The I/O device should not be designed on the
assumption that the guy programming it is the most privileged guy in the
system, able to block all other events.

If not - then the VMM needs to know about such broken devices.

Now, the whole story of modern VMMs is incomplete virtualization.
Paravirtualization. I argue that it should be possible to provide
complete virtualization. But not necessarily to every device. Many
people argue that the ability to provide complete virtualization is not
necessary any more.
From: Andy Glew on
On 7/10/2010 8:32 AM, Andy Glew wrote:
> On 7/10/2010 8:05 AM, Owen Shepherd wrote:
>> Andy Glew wrote:
>
>>> Finally, to avoid memory SIEBK, to have it in registers, some code will
>>> have to be responsible for swapping it to memory.
>>>
>>> I think this has to be PALcode.
>>>
>>> It cannot be OS code.
>>>
>>> There used to be a fad in RISCs to take interrupts in a FLIH, first
>>> level interrupt handler, that was responsible e.g. for saving register
>>> state to memory if needed.
>>
>> Many architectures have extra registers for use in the various levels of
>> interrupt/exception handlers.
>>
>> So a register interface is certainly possible - you just need to add more
>> hardware registers (which would be used in the guest version of each of the
>> normal modes) and teach the OS to save them.

>
> I knew we'd go here. (Why did this conversation have to start on the
> last few days of my vacation, when I am frantically trying to wrap up my
> sysadminning attempts?)
>
> Extra register sets for interrupts - good idea.
>
> Blocking interrupts on entry to interrupt handler - good idea.
>
> Assuming that because interrupts are blocked, you have atomic,
> exclusive, access to certain I/O devices and processor registers - bad
> idea.
>
> One of the lessons of x86 virtual machines (VMX) and SMM/SMI has been
> that, whatever privilege level you are at, there will always,
> eventually, arise a level more privileged than you are. So you can't
> necessarily assume that, just because interrupts are blocked in your
> FLIH, no other code will run on the same processor.


Let me try to be more coherent:

If you have a "save state in registers" interface, such as switching a
set of registers, for an interrupt whose handler is entered at privilege
level P,

Then, even though P was at one time the most privileged level in the
machine, you cannot assume that it always will be.

Eventually there may be added a privilege level Q that is more
privileged than P.

Now, where should Q save its interruptee's state (the state of P) and
get Q's run state?

You *could* do yet another register switch. Which amounts to providing
a register (sub) set per privilege level.

Trouble is, that doesn't scale well as the number of privilege levels
grows. And it will grow.

Or, Q could use memory to save its interruptee's state (the state of P).
This scales. You don't even need to have a base register for the P-Q
save area: you only need one base register that points to a
datastructure of all the state save areas. Said datastructure being
maintained by the true most privileged level.
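
A minimal sketch of that memory-based scheme, as a toy Python model rather than real hardware. Everything in it is an assumption made for illustration: the names SaveAreaTable and Cpu, the three-register state, and the convention that a higher number means more privilege.

```python
class SaveAreaTable:
    """Memory data structure owned by the true most privileged level.

    It holds one save slot per privilege level that can take an interrupt.
    """

    def __init__(self):
        self.areas = {}   # level -> (interrupted level, its saved registers)

    def add_level(self, level):
        # A new privilege level costs one more slot in memory,
        # not another hardware register bank.
        self.areas[level] = None


class Cpu:
    def __init__(self, table):
        self.save_base = table   # the single base register
        self.level = 0           # current privilege level (higher = more privileged)
        self.regs = {"pc": 0, "sp": 0, "r0": 0}

    def interrupt(self, target_level, handler_pc):
        """Enter a handler at target_level, saving the interruptee to memory."""
        if target_level <= self.level:
            raise ValueError("handler must be more privileged than the interruptee")
        if target_level not in self.save_base.areas:
            raise ValueError("no save area for this level")
        self.save_base.areas[target_level] = (self.level, dict(self.regs))
        self.level = target_level
        self.regs["pc"] = handler_pc

    def return_from_interrupt(self):
        """Restore whatever the current level interrupted."""
        prev_level, saved = self.save_base.areas[self.level]
        self.save_base.areas[self.level] = None
        self.level = prev_level
        self.regs = saved
```

Adding a level Q above P later is just table.add_level(Q); neither the Cpu model nor the code running at P has to change.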

Note: Swapping register sets by changing a pointer doesn't scale at all
well: the register file is the sum of all the sub register files in
size. So as to not penalize ordinary operation, you need to truly copy
registers in and out of the fast register file. Possibly to memory. Or,
possibly to a set of slow registers.
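
The sizing argument as back-of-the-envelope arithmetic, in Python. The function names and the per-level register counts are hypothetical; this counts register file entries only and says nothing about ports or access time.

```python
def banked_entries(regs_per_level):
    """Pointer-swapped banks: every level's registers live in the fast file."""
    return sum(regs_per_level)


def copied_entries(regs_per_level):
    """Copy in/out: the fast file only needs to hold the largest single set;
    the other sets sit in memory or in slow registers."""
    return max(regs_per_level)
```

For example, four levels with 16 registers each need 64 fast entries when banked and 16 when copied, and the banked figure keeps growing with every level added.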

There is an intermediate point: copy to a region of locked down memory.
On chip SRAM, embedded DRAM, or even external DRAM. Locked into the
cache and TLB.


---

What I want: an architecture that supports scalability - which I think
means, as I argue above, a memory datastructure.

But which also supports efficiently locking down pieces of memory for
such save/restore, for the embedded guys who want to take no cache misses.

Which similarly supports copy-in/out to a set of slow registers (which
won't be much faster than a suitably locked down memory region).
And possibly even register bank swapping, if you are willing to make the
rest of the processor slower to make interrupts faster.

I.e. allow register interrupt tricks to be used for important stuff.
But allow a scalable interface as well.


