From: "Andy "Krazy" Glew" on
I wrote the following for my wiki,
http://semipublic.comp-arch.net/wiki/SYSENTER/SYSEXIT_vs._SYSCALL/SYSRET
and thought that Usenet comp.arch might be interested:



My bad.

I defined the Intel P6's SYSENTER/SYSEXIT instructions. As I will
explain below, they went a bit too far. AMD's more conventional
SYSCALL/SYSRET instructions were more successful.

No secrets here: the instructions have obviously been published, and the
motivations have been documented in the patents.

SYSENTER/SYSEXIT were motivated by the following:

System calls are really just function calls. With security. They have
to switch stacks, etc.

Call instructions are really quite CISCy. They save the current
instruction pointer, and then transfer. That's at least two operations.

Many RISC instruction sets have given up CALL instructions that use a
stack in favor of [[branch-and-link with call hint]] - an instruction
that saves the current program counter in a register, and then branches.
It carries a hint indicating that execution is likely to return to the
point just after the call.

On the P6 microarchitecture, even branch-and-link really wanted to be
two uops:
  1. reg := current_IP
  2. IP := target



In my previous job at Gould I had maintained and optimized the low-level
code - the user-level library stubs (which I had worked on inlining),
and the code they transferred to in the kernel.

The target for a system call is not specified by the user. Instead, it
tends to be a single hardwired entry point. Sometimes the user
specifies a vector number, i.e. a system call number, which might be
used to dispatch to different code - but usually most of the different
system calls share common code at the beginning, the system call
prologue, so direct vectoring is not a win. Indeed, one often sees
directly vectored system calls immediately save the vector number into
a register, branch to a common routine, and only diverge again much
later.
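
A minimal sketch of that funneling as it looks today, assuming Linux
and glibc's syscall(2) wrapper (the vector numbers quoted are x86-64
Linux's):

#define _GNU_SOURCE
#include <unistd.h>      /* syscall() */
#include <sys/syscall.h> /* SYS_getpid, SYS_getppid */
#include <stdio.h>

int main(void)
{
    /* Different vector numbers, same shared kernel entry point;
       dispatch on the number happens inside the kernel, after the
       common prologue. */
    long pid  = syscall(SYS_getpid);   /* vector 39 on x86-64 */
    long ppid = syscall(SYS_getppid);  /* vector 110 on x86-64 */
    printf("pid=%ld ppid=%ld\n", pid, ppid);
    return 0;
}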

At Gould, at least, there was only a single library that was called by
the user to transfer to the kernel. It was always mapped at the same
address, whether it was shared or process private. In this situation I
observed that it was redundant to have the system call instruction save
the instruction pointer. The instruction pointer of the user code that
was calling the system call was already saved on the user stack,
and the system call stub libraries were always at the same address.
Similarly, it was also known what address to return to.

Therefore, I decided to create, not SYSCALL/SYSRET, but
SYSENTER/SYSEXIT. I considered something almost if not exactly the
same as AMD's SYSCALL/SYSRET, but decided to be more aggressive, more
RISCy, and defined SYSENTER/SYSEXIT. SYSENTER just changed privilege
levels, transferring to an address specified in a register. SYSEXIT
did the reverse. Because x86 required the stack to be changed,
SYSENTER/SYSEXIT had to define SS:ESP as well as CS:EIP. Hardwired
values were used whenever possible.
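
For concreteness, here is roughly the shape a SYSENTER user stub later
took in Linux's 32-bit vDSO - a simplified sketch from memory, not the
actual kernel source. Note what SYSENTER does not save: no return EIP,
no user ESP. The return address is a fixed landing point that user and
kernel agree on in advance - the same observation as at Gould:

/* Simplified sketch in the spirit of Linux's 32-bit __kernel_vsyscall
   (illustrative only; on a real kernel the landing point must live in
   the vDSO, since SYSEXIT returns to the one address the kernel has
   been told about). Build with gcc -m32. */
__asm__(
    ".globl my_sysenter_stub\n"
    "my_sysenter_stub:\n"
    "    push %ecx\n"        /* SYSEXIT reloads ESP from ECX...      */
    "    push %edx\n"        /* ...and EIP from EDX, so save both    */
    "    push %ebp\n"
    "    mov  %esp, %ebp\n"  /* convention: kernel recovers user ESP
                                from EBP, since SYSENTER didn't save it */
    "    sysenter\n"
    /* SYSEXIT lands here - a hardwired, mutually known address, which
       is why SYSENTER never needed to save a return EIP. */
    "    pop  %ebp\n"
    "    pop  %edx\n"
    "    pop  %ecx\n"
    "    ret\n"
);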

Observe: the original idea was to change just the program counter. Of
course, in x86 this becomes CS:EIP, which is reasonable. But we start
sliding down the slippery slope when we have to change SS:ESP as well.
We aren't allowed to just block interrupts while kernel code loads a new
SS:ESP. It turns out that there are consistency checks; e.g. an NMI
handler might panic if the NMI occurred when there was a privileged
CS:EIP and an unprivileged SS:ESP.

I have observed that this concern about interrupts that cannot be
blocked is a key source of complexity in system architecture. The RISC
approach may be to assume that all interrupts, even NMIs, can be
blocked, briefly, as the syscall code sets things up. But the advent of
things like virtual machines, SMIs, etc., means that you can't make this
assumption.

So, the original concept for SYSENTER was
CS:EIP := some-register + some-hardwired-values

which you might have been able to do in a single uop on a P6 that had
[[segment-register-renaming]] and renamed CS:EIP as a single unit.

But this became
CS:EIP := some-register + some-hardwired-values
SS:ESP := some-register + some-hardwired-values

Now, this is unlikely to ever be a single uop on a reasonable
micro-dataflow OOO machine
that is limited to writing only one destination at a time. But it is
still reasonably fast.
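
In the SYSENTER that finally shipped, "some-register" means a trio of
model-specific registers, and the hardwiring shows up in details like
SS being derived from CS. A sketch of the ring-0 setup side, using the
architectural IA32_SYSENTER_* MSR numbers (WRMSR faults at user level,
so this only runs inside a kernel):

#include <stdint.h>

/* Architectural MSR numbers for SYSENTER (Intel SDM). */
#define MSR_IA32_SYSENTER_CS  0x174 /* kernel CS; SS is hardwired to CS+8 */
#define MSR_IA32_SYSENTER_ESP 0x175 /* kernel ESP loaded by SYSENTER */
#define MSR_IA32_SYSENTER_EIP 0x176 /* kernel EIP loaded by SYSENTER */

static inline void wrmsr(uint32_t msr, uint64_t val)
{
    /* WRMSR: ECX selects the MSR, EDX:EAX supply the value. Ring 0 only. */
    asm volatile("wrmsr"
                 :: "c"(msr), "a"((uint32_t)val), "d"((uint32_t)(val >> 32)));
}

/* SYSENTER then does, in effect:
     CS:EIP := IA32_SYSENTER_CS : IA32_SYSENTER_EIP
     SS:ESP := (IA32_SYSENTER_CS + 8) : IA32_SYSENTER_ESP
   with CPL forced to 0 -- "some-register + some-hardwired-values". */
void setup_sysenter(uint16_t kcs, uint32_t kesp, uint32_t keip)
{
    wrmsr(MSR_IA32_SYSENTER_CS,  kcs);
    wrmsr(MSR_IA32_SYSENTER_ESP, kesp);
    wrmsr(MSR_IA32_SYSENTER_EIP, keip);
}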

But then we go down the slippery slope:
segment-register renaming was die-dieted out of the original P6 (cut to
save die area), and the above had to be expressed using the existing P6
segment microcode. P6 decided not to rename the privilege level, or the
interrupt blocking flag,
so pipeline draining was required.
So, what I had hoped could be a single-uop, possibly single-cycle,
SYSENTER instruction
became first 2 uops, then ... many more. I think the fastest it could
have been on P6 was 15 cycles.

Faced with these lossages, the extra overhead of SYSCALL is negligible.
ecx := EIP
CS:EIP := some-register + some-hardwired-values
SS:ESP := some-register + some-hardwired-values

In the best case, 3 single-cycle uops rather than 2. At one point in
the design of P6, this would have restricted SYSCALL quite a bit, since
the original plan was for P6 to have a 4-2-2 decoder template - and a
3-uop SYSCALL could only have been decoded on decoder 0, whereas a
2-uop SYSENTER could have been decoded on any decoder.
But when P6 adopted a 4-1-1 decoder template, this putative advantage for
SYSENTER was lost.
And when SYSENTER and SYSCALL both turned into microcoded monstrosities...

While SYSENTER might have had a performance advantage over SYSCALL in
some reasonably aggressive implementations, in the actual implementation
the advantage was negligible.
And SYSCALL, saving the user IP, was just plain more familiar.
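
That saved user IP is still visible today: in 64-bit mode, SYSCALL
stashes the return RIP in RCX (and RFLAGS in R11), so a raw invocation
has to declare those clobbers. A minimal sketch, assuming x86-64 Linux
and GNU-style inline asm:

#include <stdio.h>

static long raw_getpid(void)
{
    long ret;
    /* Number in RAX, result back in RAX. The "rcx" and "r11" clobbers
       are SYSCALL itself at work: it saves the return RIP in RCX and
       RFLAGS in R11 -- exactly the "ecx := EIP" uop above. */
    asm volatile("syscall"
                 : "=a"(ret)
                 : "a"(39L)            /* __NR_getpid on x86-64 */
                 : "rcx", "r11", "memory");
    return ret;
}

int main(void)
{
    printf("pid = %ld\n", raw_getpid());
    return 0;
}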

Conclusion: there were reasons for SYSENTER,
but it was probably a step too far.

--

I still think that both Intel and AMD missed a big opportunity, to make
system calls truly
as fast as function calls. Chicken and egg.
Nobody wants to make the investment in hardware without a proven
software benefit,
but existing software is optimized to avoid expensive system call
privilege level changes.

--

See the AMD SYSCALL/SYSRET definition:
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21086.pdf
From: Anton Ertl on
"Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> writes:
>I still think that both Intel and AMD missed a big opportunity, to make
>system calls truly
>as fast as function calls. Chicken and egg.
>Nobody wants to make the investment in hardware without a proven
>software benefit,
>but existing software is optimized to avoid expensive system call
>privilege level changes.

But given that system calls have to do much more sanity checking on
their arguments, and there is the common prelude that you mentioned
(what is it for?), I don't see system calls ever becoming as fast as
function calls, even with fast system call and system return
instructions.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
From: nmm1 on
In article <2010Jan18.105904(a)mips.complang.tuwien.ac.at>,
Anton Ertl <anton(a)mips.complang.tuwien.ac.at> wrote:
>"Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> writes:
>>I still think that both Intel and AMD missed a big opportunity, to make
>>system calls truly
>>as fast as function calls. Chicken and egg.
>>Nobody wants to make the investment in hardware without a proven
>>software benefit,
>>but existing software is optimized to avoid expensive system call
>>privilege level changes.
>
>But given that system calls have to do much more sanity checking on
>their arguments, and there is the common prelude that you mentioned
>(what is it for?), I don't see system calls ever becoming as fast as
>function calls, even with fast system call and system return
>instructions.

It's been done, and the gains can be fairly high - unfortunately,
more in maintainability than performance, so benchmarketing classifies
such changes as undesirable :-(

The key is to have a clean system design, so the amount of sanity
checking and the size of a standard prelude are minimal. For example,
a high proportion of system calls in many applications can be very
simple, 'unprivileged' ones like reading the clock or debugger hooks.
There is no reason that the former shouldn't be as fast as a function
call! Whether you can get there starting from here (i.e. the x86) is
another matter ....
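
For what it's worth, Linux's vDSO does exactly this on x86: clock reads
are exported to applications as ordinary user-mode functions. A minimal
sketch, assuming Linux/glibc, where the "system call" typically involves
no privilege change at all:

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec ts;
    /* On Linux/x86 this usually resolves to the vDSO's user-mode
       implementation: the clock is read without any privilege-level
       change, i.e. at function-call cost. */
    if (clock_gettime(CLOCK_MONOTONIC, &ts) == 0)
        printf("%ld.%09ld\n", (long)ts.tv_sec, ts.tv_nsec);
    return 0;
}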

Nobody cares much about the cost of the very heavyweight ones, because
any application that uses them much is broken by design.


Regards,
Nick Maclaren.
From: Terje Mathisen <terje.mathisen at tmsw.no> on
Anton Ertl wrote:
> "Andy \"Krazy\" Glew"<ag-news(a)patten-glew.net> writes:
>> I still think that both Intel and AMD missed a big opportunity, to make
>> system calls truly
>> as fast as function calls. Chicken and egg.
>> Nobody wants to make the investment in hardware without a proven
>> software benefit,
>> but existing software is optimized to avoid expensive system call
>> privilege level changes.
>
> But given that system calls have to do much more sanity checking on
> their arguments, and there is the common prelude that you mentioned
> (what is it for?), I don't see system calls ever becoming as fast as
> function calls, even with fast system call and system return
> instructions.

_Some_ system calls don't need that checking code!

I.e. using a very fast syscall(), you can return an OS timestamp within
a few nanoseconds, totally obviating the need for application code to
develop its own timers based on RDTSC (single-core/single-CPU systems
only), ACPI timers, or whatever else is available.

Even if this is only possible for system calls that deliver a very
simple result, and where the checking code is negligible, this is still
an important subset.

The best solution today is to take away all attempts at security and
move all those calls into a user-level library, right?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: mac on

> I have observed that this concern about interrupts that cannot be
> blocked is a key source of complexity in system architecture. The RISC
> approach may be to assume that all interrupts, even NMIs, can be
> blocked, briefly, as the syscall code sets things up. But the advent of
> things like virtual machines, SMIs, etc., means that you can't make this
> assumption.


Didn't Alpha PALcode have something like this? Special execution
environment, no interrupts, privileged register access?
I don't know much about it, but it looked like a clever hook for CISC
operations.