SYSENTER/SYSEXIT vs. SYSCALL/SYSRET [Computer Architecture]

From: robertwessel2 on 19 Jan 2010 16:36

On Jan 19, 2:33 pm, Terje Mathisen <"terje.mathisen at tmsw.no">
wrote:
> Stephen Fuld wrote:
> > In this particular common case, isn't a better solution to have a
> > hardware "clock" register, one per "system" that is readable by a user
> > mode instruction? With multi core processors, one register per chip, and
>
> Oh, absolutely!
>
> IBM mainframes have had this since pretty much forever, afaik, in the
> form of a global counter running at something like 1MHz.

The S/370 clock increments at a rate proportional to the instruction
execution rate of the machine, but the format is fixed. More
precisely, bit position 51 (counting from bit zero at the high end),
effectively increments every microsecond. The *actual* increment
could be either left or right of bit 51, but on a 1000MIPS machine,
the TOD clock will increment approximately a billion times per second
(perhaps by incrementing at bit position 61 every ns).
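As a minimal sketch of the bit arithmetic described above: with bit 0 at the high end of a 64-bit value and bit 51 worth one microsecond, an increment at bit n is worth 2^(51-n) microseconds, so bit 61 comes out at roughly 0.98 ns. The function names are illustrative only and are not part of any IBM interface.

```python
TOD_MICROSECOND_BIT = 51  # bit 0 is the most significant bit of the 64-bit value


def bit_period_us(bit_position: int) -> float:
    """Time represented by one increment at the given bit position, in microseconds."""
    return 2.0 ** (TOD_MICROSECOND_BIT - bit_position)


def split_tod(tod: int) -> tuple[int, int]:
    """Split a 64-bit TOD value into whole microseconds and the sub-microsecond bits."""
    low_bits = 63 - TOD_MICROSECOND_BIT  # 12 bits lie to the right of bit 51
    return tod >> low_bits, tod & ((1 << low_bits) - 1)


if __name__ == "__main__":
    print(bit_period_us(61) * 1000, "ns per increment at bit 61")
    print(split_tod((5 << 12) | 3))
```

A faster machine simply increments further to the right of bit 51, so code that only looks at the microsecond field keeps working unchanged.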

So the clock resolution scales with the CPU speed, and can be used for
timing both long and short intervals. Note that applications needing
date/time values usually use system services, which can take things
like time zones and whatnot into account. Not surprisingly, those
front-end the hardware TOD clock.

A complication is that CPU speeds have increased enough that IBM has
defined a 128 bit format (which extends the old 64 bit format on both
ends).

Other attributes of the TOD clock are that the values are global,
unique, and monotonically increasing as viewed by *all* CPUs in the
system. That allows timing to happen across CPUs, things to be given
globally unique timestamps, etc. The TOD clock also provides the
basis for timer interrupts on the CPU.

It's very handy.

From: robertwessel2 on 19 Jan 2010 17:09

On Jan 19, 2:57 pm, n...(a)cam.ac.uk wrote:
> In article <9hhh27-1af....(a)ntp.tmsw.no>,
> Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>
> >Stephen Fuld wrote:
> >> In this particular common case, isn't a better solution to have a
> >> hardware "clock" register, one per "system" that is readable by a user
> >> mode instruction? With multi core processors, one register per chip, and
>
> >Oh, absolutely!
>
> >IBM mainframes have had this since pretty much forever, afaik, in the
> >form of a global counter running at something like 1MHz.
>
> Actually, I strongly disagree.
>
> While that works for clocks, it's not generalisable and not scalable.
> What I would do is to have a kernel page (or pages) that are readable
> to all processes, and one field would be the current time. The
> normal memory mechanisms would be used to keep it in sync. Depending
> on other details, those pages might or might not be cachable.
>
> Yes, there would be a hardware clock, but that would not be directly
> visible to ordinary code, and might not run at a precisely known rate.
> The mapping between that would be done by NTP-like code somewhere in
> the system.
>
> On a machine with a LOT of cores, you could update it directly.
> On one without, you would want a special loop which would take the
> hardware clock and the constants maintained by the NTP-like code,
> and update the clock field in memory once every microsecond. That
> would behave exactly like a separate core. And, because updating
> the memory field is a kernel operation, the implementation could be
> changed transparently.

Since the user visible TOD clocks are high resolution (defined to be
comparable to the instruction execution rate of the machine), *and* of
a fixed format (bit position 51 effectively increments every
microsecond), and are synchronized system-wide, all you need is the
constants that define the current offsets from the hardware time to
the wall clock time, which will change only rarely. Or just ask the
OS for the cooked values as is traditional.
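A rough sketch of that "hardware time plus rarely changing constants" conversion follows. It assumes the standard TOD epoch of 1900-01-01 00:00 UTC, which is not stated in this post, ignores leap seconds, and uses a hypothetical offset_us parameter to stand in for whatever correction constants the OS would publish.

```python
from datetime import datetime, timedelta, timezone

# Assumed epoch for the 64-bit TOD format (not stated in the post).
TOD_EPOCH = datetime(1900, 1, 1, tzinfo=timezone.utc)


def tod_to_wall_clock(tod: int, offset_us: int = 0) -> datetime:
    """Convert a 64-bit TOD value to a UTC datetime.

    offset_us is a hypothetical, rarely changing correction (for example a
    time zone or steering adjustment) supplied by the operating system.
    """
    micros = tod >> 12  # bit 51 is the microsecond bit, 12 bits lie below it
    return TOD_EPOCH + timedelta(microseconds=micros + offset_us)


if __name__ == "__main__":
    # This TOD value corresponds to 1970-01-01 00:00 UTC under the assumed epoch.
    print(tod_to_wall_clock(0x7D91048BCA000000))
```

The user-mode cost is one clock read, a shift and an add; only the constants need to come from the OS.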

From: Stephen Fuld on 19 Jan 2010 20:18

On 1/19/2010 12:57 PM, nmm1(a)cam.ac.uk wrote:
> In article<9hhh27-1af.ln1(a)ntp.tmsw.no>,
> Terje Mathisen<"terje.mathisen at tmsw.no"> wrote:
>> Stephen Fuld wrote:
>>> In this particular common case, isn't a better solution to have a
>>> hardware "clock" register, one per "system" that is readable by a user
>>> mode instruction? With multi core processors, one register per chip, and
>>
>> Oh, absolutely!
>>
>> IBM mainframes have had this since pretty much forever, afaik, in the
>> form of a global counter running at something like 1MHz.
>
> Actually, I strongly disagree.
>
> While that works for clocks, it's not generalisable

While it might not be generalizable, it is such a frequent case that it
may be worth a specialized solution.

> and not scalable.

Why not? What I am proposing and what Terje mentioned are both at
least as scalable as what you proposed. All have a single location to
be read: mine is in a CPU but available to all CPUs, Terje's is in a
separate chip available to all CPUs, and yours is also in a separate
chip, a memory chip, available to all CPUs.

> What I would do is to have a kernel page (or pages) that are readable
> to all processes, and one field would be the current time.

I understand that. But what would you use the rest of the page for? If
there are good uses, it might make sense, but without that, it just
wastes some resources. Also, I would rather have the register
implemented in CPU circuitry, which scales better over time than memory
circuitry.

> The
> normal memory mechanisms would be used to keep it in sync. Depending
> on other details, those pages might or might not be cachable.

You have added a lot to the memory coherency traffic. The other
proposals only require traffic when the value is read, not every time it
is updated.

> Yes, there would be a hardware clock, but that would not be directly
> visible to ordinary code, and might not run at a precisely known rate.
> The mapping between that would be done by NTP-like code somewhere in
> the system.
>
> On a machine with a LOT of cores, you could update it directly.

That code does extra work that is not required in the other proposals.

> On one without, you would want a special loop which would take the
> hardware clock and the constants maintained by the NTP-like code,
> and update the clock field in memory once every microsecond.

Is this on some specialized core hardware? If not, how is it different
from the above? If a specialized core, why not just implement the
algorithm in hardware once and be done with it?

> That
> would behave exactly like a separate core. And, because updating
> the memory field is a kernel operation, the implementation could be
> changed transparently.

While that is a potential advantage, it seems to come with a large cost.
I hope that the implementation of a basic time clock shouldn't need to
be changed often :-)

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

From: "Andy "Krazy" Glew" on 19 Jan 2010 21:57

Tim McCaffrey wrote:
> In article <4B540900.4060107(a)patten-glew.net>, ag-news(a)patten-glew.net says...
>> I wrote the following for my wiki,
>> http://semipublic.comp-arch.net/wiki/SYSENTER/SYSEXIT_vs._SYSCALL/SYSRET
>> and thought that Usenet comp.arch might be interested:
>>
>>
>>
>>
>
> I liked the 68000 model better, with a System and a User stack pointer. The
> mechanism allowed nested system calls and/or interrupts without all the busy
> work the x86 does.
>
> It seems like you could have optimized out the security checks when you load
> the system registers with the SysEnter CS & SS values, instead of when the
> values were actually sent to the segment registers.

I suspect that is done. There are patents which go into detail.

From: "Andy "Krazy" Glew" on 19 Jan 2010 23:40

> ...lots of discussion about system call alternatives

> lots of discussion, which I will excerpt below my post
> (ooo, bottom quoting, kill me) and summarize briefly
> (ooo, hole in the middle quoting, worst of all!)

> ... lots of discussion about system call alternatives
> like, avoiding lots of overhead for certain special cases

> ... lots of discussion about alternatives to syscalls
> like, arranging so that many "big" syscalls can get done in user libs
> like, getpid and others reading from a mapped page
> like, gettime in a user library
> like, gettime instructions.

I'd love to write an essay on the topic, and put it on my wiki.

But I'm tired, and have to make my nightly phone call to my family.

So, briefly: consider compatibility, and atomicity.

If you do lots of stuff in user mode libraries, then there are
implications for compatibility.

You may not be supporting binary compatibility. You may not be ensuring
that all old binaries continue to work - since some of those binaries
may have inlined the user library. (Heck, a JIT may have inlined it for
them.)

Basically, your interface to the OS becomes the data structures that the
user code that is accomplishing the "syscall like" behavior expects. Or
else you say "if you inline this stuff, you are on your own."

Atomicity. Consider the timer. Say you have an instruction like RDTSC,
but you want to add an offset that is in a mapped page. Or say that you
have a 64 bit machine, and that the timer is two 64 bit words, 128 bits
total. Now say you want to read the time. But you can only read 64
bits at a time, not 128 bits. So now the user code must somehow handle
the possibility of being context switched or interrupted between reading
the first part and the second.

Usually we don't allow users to block interrupts.

There are ways of coding this, e.g. read-high, read-low, read-high,
assuming the reads are ordered or serialized.
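The read-high, read-low, read-high pattern can be sketched as below. This is a toy single-threaded model, not real timer access: WideTimer and its on_read hook are made up for illustration, with the hook standing in for time passing (or a context switch) between the two 64-bit reads.

```python
class WideTimer:
    """Toy 128-bit timer that can only be read 64 bits at a time."""

    MASK64 = (1 << 64) - 1

    def __init__(self, value=0, on_read=None):
        self.value = value
        self.on_read = on_read  # hypothetical hook, called after every partial read

    def _after_read(self):
        if self.on_read:
            self.on_read(self)

    def read_high(self):
        result = self.value >> 64
        self._after_read()
        return result

    def read_low(self):
        result = self.value & self.MASK64
        self._after_read()
        return result


def read_timer(timer):
    """Read high, low, high; retry if the high word changed in between."""
    while True:
        high = timer.read_high()
        low = timer.read_low()
        if timer.read_high() == high:
            return (high << 64) | low
```

If the low word carries into the high word between the first two reads, the second high read differs and the loop retries, so the caller never sees a value stitched together from two different moments.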

But my point is that when you do stuff in a user level page, anything
that consists of more than one word, then you must handle interrupts or
context switches. And probably other atomicity violations. Stuff like
this works well on "embedded" machines, like supercomputers, where you
are not running general purpose multitasking OSes, and where you are not
migrating processes between machines.
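One common way to handle multi-word data in a shared page is a sequence counter (seqlock style). The post does not name this technique, so treat the following as an illustrative assumption: SharedTimePage, write_time and read_time are hypothetical names, the model is single-threaded, and real code would also need memory barriers around the counter accesses.

```python
class SharedTimePage:
    """Toy stand-in for a kernel-written, user-readable page with a two-word time."""

    def __init__(self):
        self.seq = 0   # even: stable, odd: update in progress
        self.high = 0
        self.low = 0


def write_time(page, high, low):
    """Writer side (the kernel in this model)."""
    page.seq += 1                    # now odd: readers must not trust the words
    page.high, page.low = high, low
    page.seq += 1                    # even again: update complete


def read_time(page, between_reads=None):
    """Reader side: retry until both words come from the same update."""
    while True:
        start = page.seq
        high = page.high
        if between_reads:
            between_reads()          # test hook: pretend we were preempted here
        low = page.low
        if start % 2 == 0 and page.seq == start:
            return high, low
```

The reader never blocks interrupts; it just notices afterwards that an update happened and tries again. The cost is that the counter and its protocol become part of the interface that old binaries depend on.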

One poster said that the timer instructions should not be changing that
quickly. ... Well, that certainly has not been true at Intel and
AMD!!!! It remains to be seen whether the most recent attempts to
define a new timer instruction - what is it called, RDTSC2 - will be the
last word. I doubt it.

Anyway: yes, there are system call alternatives and alternatives to
system calls. I have implemented some of them, and advocated others.
But, if system calls could be made fast, then many kluges might not be
necessary.

======================================================
=== Excerpted stuff begins ==========================
======================================================

> ...moving trivial system calls to user li

----> Anton
But given that system calls have to do much more sanity checking on
their arguments, and there is the common prelude that you mentioned
(what is it for?), I don't see system calls ever becoming as fast as
function calls, even with fast system call and system return
instructions.

----> Nick
It's been done, and the gains can be fairly high - unfortunately,
more in maintainability than performance, so benchmarketing classifies
such changes as undesirable :-(
- anton

---->Bernd Paysan
Actually, with a clean system design, many of those unprivileged ones
can be simple unprivileged library calls. rdtsc is unprivileged, all you
need is a factor (clocks per second) and a global offset - then you can
do your gettimeofday() completely in userland (AFAIK, people have
already done that).
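The arithmetic for such a userland gettimeofday() might look like the sketch below. Python cannot execute rdtsc, so the tick count is passed in as a parameter; ticks_per_second and offset_us are the "factor" and "global offset" from the text, and how they would be published to user code is left open here.

```python
def gettimeofday_from_ticks(ticks, ticks_per_second, offset_us):
    """Return (seconds, microseconds) in the style of gettimeofday().

    ticks            -- raw counter value, e.g. what rdtsc would return
    ticks_per_second -- the conversion factor (clocks per second)
    offset_us        -- global offset: wall-clock microseconds at tick zero
    """
    elapsed_us = ticks * 1_000_000 // ticks_per_second  # integer maths, no rounding drift
    return divmod(offset_us + elapsed_us, 1_000_000)


if __name__ == "__main__":
    # 3e9 ticks on a 2 GHz counter is 1.5 s after the offset.
    print(gettimeofday_from_ticks(3_000_000_000, 2_000_000_000, 1_000_000_500_000))
```

This only holds if the counter runs at a constant, known rate and is consistent across CPUs, which is exactly the caveat raised elsewhere in the thread.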

This can go a lot further. In effect, you can do most system stuff in
userland, including even reading and writing file data, and schedule
file metadata for changes ("schedule" means that finally, when
committed, the data is sanity checked by the kernel - but that doesn't
need to be too frequently). All the system needs to do for you is to
map those parts of the disk which you can read or write into your
memory map - read only stuff read-only, read-write data stuff on the
disk read-write. The actual reads and writes from and to the disk still
happen in kernel land, but as long as the program works from cache, no
OS intervention necessary.

---->
_Some_ system calls don't need that checking code!

I.e. using a very fast syscall(), you can return an OS timestamp within
a few nanoseconds, totally obviating the need for application code to
develop their own timers, based on RDTSC() (single-core/single-cpu
systems only), ACPI timers or whatever else is available.

Even if this is only possible for system calls that deliver very simple
results, and where the checking code is negligible, this is still an
important subset.

The best solution today is to take away all attempts on security and
move all those calls into a user-level library, right?

---->Stephen Fuld
In this particular common case, isn't a better solution to have a
hardware "clock" register, one per "system" that is readable by a user
mode instruction?

----> Nick
What I would do is to have a kernel page (or pages) that are readable
to all processes, and one field would be the current time. The
normal memory mechanisms would be used to keep it in sync. Depending
on other details, those pages might or might not be cachable.
