From: nmm1 on
In article <hj5ll0$1hm$1(a)news.eternal-september.org>,
Stephen Fuld <SFuld(a)Alumni.cmu.edu.invalid> wrote:
>
>>>> In this particular common case, isn't a better solution to have a
>>>> hardware "clock" register, one per "system" that is readable by a user
>>>> mode instruction? With multi core processors, one register per chip, and
>>>
>>> Oh, absolutely!
>>>
>>> IBM mainframes have had this since pretty much forever, afaik, in the
>>> form of a global counter running at something like 1MHz.
>>
>> Actually, I strongly disagree.
>>
>> While that works for clocks, it's not generalisable
>
>While it might not be generalizable, it is such a frequent case that it
>may be worth a specialized solution.

Agreed. I don't think that it is, but it depends on the rest of the
system's design.

> > and not scalable.
>
>Why not? What I am proposing, and what Terje mentioned, are both at
>least as scalable as what you proposed. Both have a single location to
>be read, but mine is in a CPU and available to all CPUs, Terje's is in a
>separate chip available to all CPUs, and yours is also in a separate
>chip, a memory chip, available to all CPUs.

Consider a machine with a thousand cores and a five level hierarchy.
A single location is a serious bottleneck, so you need a distribution
mechanism. Easy enough, but yet more clutter. I am proposing that
the existing memory system be used for distribution.

>> What I would do is to have a kernel page (or pages) that are readable
>> to all processes, and one field would be the current time.
>
>I understand that. But what would you use the rest of the page for? If
>there are good uses, it might make sense, but without that, it just
>wastes some resources. Also, I would rather have the register
>implemented in CPU circuitry, which scales better over time than memory
>circuitry.

There are quite a lot. Most global quantities are immutable (machine
identification, memory and page sizes, etc. etc.), but a few are
mutable but read-only to applications (time, load levels, system
state, etc.). Also, you want to keep the time in a cache line on its
own, as it is the most performance-critical of those. There are also
per-process quantities, mutable and immutable.
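
As a concrete illustration, such a page might be laid out like this;
a minimal C sketch, where every name and field is an assumption rather
than any real system's ABI:

#include <stdint.h>

/* Sketch of a kernel info page, mapped read-only into every process.
   Only the kernel (or a dedicated updater) ever writes to it. */
struct kernel_info_page {
    /* Immutable after boot. */
    uint64_t machine_id;
    uint64_t page_size;
    uint64_t memory_size;

    /* Mutable, but read-only to applications. */
    uint64_t load_level;
    uint64_t system_state;

    /* The time lives in a cache line of its own, so that its frequent
       updates do not invalidate the lines holding everything else. */
    _Alignas(64) volatile uint64_t current_time_ns;
};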

A more subtle issue is virtualisation. Let's say that you want to
do GOOD virtualisation, so that you can simulate a system running at
another time, or so that it sees time pass at the normal rate (even
though some operations take a lot longer, as they have to be emulated).
With my proposal, that needs just a page table tweak - with yours,
extra hardware mechanisms are needed.
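
For instance, the hypervisor can point the guest's page-table entry at
a private copy of the info page and maintain that copy with a per-guest
offset. A hedged sketch (names assumed, building on the illustrative
struct above):

#include <stdint.h>

/* Time virtualisation as a page-table tweak: the host maps a
   guest-private copy of the info page and writes guest time as host
   time plus a per-guest offset.  Nothing traps, no extra hardware. */
void update_guest_time(volatile uint64_t *guest_time_field,
                       uint64_t host_time_ns, int64_t guest_offset_ns)
{
    *guest_time_field = host_time_ns + (uint64_t)guest_offset_ns;
}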

> > The
>> normal memory mechanisms would be used to keep it in sync. Depending
>> on other details, those pages might or might not be cachable.
>
>You have added a lot to the memory coherency traffic. The other
>proposals only require traffic when the value is read, not every time it
>is updated.

Not necessarily. With the advent of scalable shared memory threading,
that issue is going to have to be faced, anyway. Most of the time,
the location will be owned by the updating 'thread' - exactly the
same is true for 'local' memory in a thread, and POSIX/C++/etc.
allow other threads to read those.

>> Yes, there would be a hardware clock, but that would not be directly
>> visible to ordinary code, and might not run at a precisely known rate.
>> The mapping between that would be done by NTP-like code somewhere in
>> the system.
>>
>> On a machine with a LOT of cores, you could update it directly.
>
>That code does extra work that is not required in the other proposals.

Oh, yes, it is. Have you ANY idea how inaccurate the system clocks
shipped with most computers are? Worse than el cheapo mechanical
alarm clocks :-( Even with a good clock, some machines run for
years, continually, and so will need corrections.

>> On one without, you would want a special loop which would take the
>> hardware clock and the constants maintained by the NTP-like code,
>> and update the clock field in memory once every microsecond.
>
>Is this on some specialized core hardware? If not, how is it different
>from the above? If a specialized core, why not just implement the
>algorithm in hardware once and be done with it?

It doesn't change my point, which is about the distribution, and not
about the clocking algorithm.
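
For concreteness, the special loop quoted above might look roughly like
this; a sketch only, in which raw_clock(), pause_briefly() and the ntp_*
calibration values are all assumed names rather than a real API:

#include <stdint.h>

extern uint64_t raw_clock(void);            /* free-running hardware counter */
extern void     pause_briefly(void);        /* roughly one microsecond wait  */
extern volatile uint64_t *current_time_ns;  /* the shared clock field        */

extern uint64_t ntp_base, ntp_offset_ns, ntp_scale;
extern unsigned ntp_shift;

void clock_update_loop(void)
{
    for (;;) {
        uint64_t raw = raw_clock();
        /* Convert the raw count using the constants the NTP-like code
           maintains, then publish through the ordinary memory system. */
        *current_time_ns =
            ntp_offset_ns + (((raw - ntp_base) * ntp_scale) >> ntp_shift);
        pause_briefly();
    }
}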

>> That
>> would behave exactly like a separate core. And, because updating
>> the memory field is a kernel operation, the implementation could be
>> changed transparently.
>
>While that is a potential advantage, it seems to come with a large cost.
> I hope that the implementation of a basic time clock shouldn't need to
>be changed often :-)

Hope springs infernal in the human breast :-)


Regards,
Nick Maclaren.
From: Terje Mathisen "terje.mathisen at tmsw.no" on
robertwessel2(a)yahoo.com wrote:
> Other attributes of the TOD clock are that the values are global,
> unique, and monotonically increasing as viewed by *all* CPUs in the
> system. That allows timing to happen across CPUs, things to be given
> globally unique timestamps, etc. The TOD clock also provides the
> basis for timer interrupts on the CPU.
>
> It's very handy.

It is also the "Right Stuff", i.e. as I wrote earlier the correct way to
handle this particular problem.

The only real remaining problem is related to NTP, i.e. when you want to
sync this system-global TOD clock to UTC/TAI.

Afair IBM does have a (very expensive!) hw solution for this, instead of
the trivial sw needed for a RDTSC-based clock which I outlined earlier.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: nmm1 on
In article <ds3j27-tsm.ln1(a)ntp.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>robertwessel2(a)yahoo.com wrote:
>> Other attributes of the TOD clock are that the values are global,
>> unique, and monotonically increasing as viewed by *all* CPUs in the
>> system. That allows timing to happen across CPUs, things to be given
>> globally unique timestamps, etc. The TOD clock also provides the
>> basis for timer interrupts on the CPU.
>>
>> It's very handy.
>
>It is also the "Right Stuff", i.e. as I wrote earlier the correct way to
>handle this particular problem.

Yes. But see below.

>The only real remaining problem is related to NTP, i.e. when you want to
>sync this system-global TOD clock to UTC/TAI.

No, not at all. There are two problems. That's one.

The other is maintaining global uniqueness and monotonicity while
increasing the precision to nanoseconds and the number of cores
to thousands. All of those properties are needed, but it is probably
infeasible to deliver them all simultaneously :-(
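
To make the tension concrete, here is one textbook approach (core ID in
the low-order bits for uniqueness, a global high-water mark for
monotonicity), as a hedged sketch; read_shared_time_ns() and
CORE_ID_BITS are illustrative assumptions, and the single shared
counter it relies on is exactly the kind of bottleneck that does not
scale to thousands of cores:

#include <stdatomic.h>
#include <stdint.h>

#define CORE_ID_BITS 12                     /* room for a few thousand cores */

static _Atomic uint64_t last_stamp;         /* global high-water mark */

extern uint64_t read_shared_time_ns(void);  /* the shared clock, assumed */

uint64_t unique_monotonic_stamp(unsigned core_id)
{
    /* Core ID in the low bits makes concurrent stamps unique... */
    uint64_t t = (read_shared_time_ns() << CORE_ID_BITS) | core_id;
    uint64_t prev = atomic_load(&last_stamp);
    /* ...and the contended high-water mark enforces monotonicity. */
    do {
        if (t <= prev)
            t = prev + 1;                   /* never let time go backwards */
    } while (!atomic_compare_exchange_weak(&last_stamp, &prev, t));
    return t;
}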


Regards,
Nick Maclaren.
From: Terje Mathisen "terje.mathisen at tmsw.no" on
Andy "Krazy" Glew wrote:
> So, briefly: consider compatibility, and atomicity.
>
> If you do lots of stuff in user mode libraries, then there are
> implications for compatibility.
>
> You may not be supporting binary compatibility. You may not be ensuring
> that all old binaries continue to work - since some of those binaries
> may have inlined the user library. (Heck, a JIT may have inlined it for
> them.)
>
> Basically, your interface to the OS becomes the data structures that the
> user code that is accomplishing the "syscall like" behavior expects. Or
> else you say "if you inline this stuff, you are on your own."

This is perfectly OK, as long as those data structures are general
enough to work well even 25 years from now.

If you know that you are that prescient, you should have worked for the
Alpha team when they wrote that original 25-year road map. :-(

> Atomicity. Consider the timer. Say you have an instruction like RDTSC,
> but you want to add an offset that is in a mapped page. Or say that you
> have a 64 bit machine, and that the timer is 2 64 bit words, 128 bit
> total. Now say you want to read the time. But you can only read 64 bits
> at a time, not 128 bits. So now the user code must somehow handle the
> possibility of being context switched or interrupted between reading the
> first part and the second.
>
> Usually we don't allow users to block interrupts.
>
> There are ways of coding this. E.g. read-high, read-low, read-high.
> Assuming ordered or serialized.

This is exactly how my sample code did it.
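
Spelled out, that pattern for a two-word timer looks something like the
following sketch, assuming the loads stay ordered as Andy says;
timer_hi and timer_lo are illustrative stand-ins for the mapped words:

#include <stdint.h>

extern volatile uint64_t timer_hi, timer_lo;   /* the two mapped words */

void read_timer128(uint64_t *hi, uint64_t *lo)
{
    uint64_t h1, l, h2;
    do {
        h1 = timer_hi;   /* first read of the high word */
        l  = timer_lo;
        h2 = timer_hi;   /* re-read: if it changed, the high word rolled
                            over or we were interrupted in between */
    } while (h1 != h2);
    *hi = h1;
    *lo = l;
}
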
>
> But my point is that when you do stuff in a user level page, anything
> that consists of more than one word, then you must handle interrupts or
> context switches. And probably other atomicity violations. Stuff like
> this works well on "embedded" machines, like supercomputers, where you
> are not running general purpose multitasking OSes, and where you are not
> migrating processes between machines.

The only reasonable way to keep multiple independent motherboards in
sync is to use a dedicated process to handle it, i.e. something like
NTP, with or without hw assists (Ethernet hw timers, shared PPS signal).

The real problem in such an environment is that you might have to handle
the situation where a user process can migrate from one box to another
between any two instructions!

At that point a setup based on an OS-maintained clock counter could be
quite problematical, i.e. you'd need to protect the code that loads the
various variables and reads the cpu-local tick counter in such a way
that a process migration would cause a reload:

do {
    ost   = os_tick_counter_from_last_interrupt;   /* seqlock-style guard   */
    tsc   = os_tsc_count_from_last_interrupt;
    scale = os_tsc_scale_multiplier;
    shift = os_tsc_shift_count;
    t = rdtsc();                                   /* cpu-local cycle count */
} while (ost != os_tick_counter_from_last_interrupt);  /* retry if updated */

return ost + (((t - tsc) * scale) >> shift);

The total running time of such a call should be about 20 clock cycles if
rdtsc takes 10.

If the fast system call can handle the user->kernel->user transitions in
the same or less time, then there's really no need for funky unprotected
code to be visible in user-mode libraries!

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Terje Mathisen "terje.mathisen at tmsw.no" on
nmm1(a)cam.ac.uk wrote:
> On a machine with a LOT of cores, you could update it directly.
> On one without, you would want a special loop which would take the
> hardware clock and the constants maintained by the NTP-like code,
> and update the clock field in memory once every microsecond. That
> would behave exactly like a separate core. And, because updating
> the memory field is a kernel operation, the implementation could be
> changed transparently.

It could not:

Anything that updates a real memory location every microsecond is a
performance bug!

If you instead use a memory-mapped timer chip register, then you've
still got the cost of a real bus transaction instead of a couple of
core-local instructions.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"