From: Tim McCaffrey on
In article <43ck27-mao.ln1(a)ntp.tmsw.no>, "terje.mathisenattmsw.no" says...
>
>nmm1(a)cam.ac.uk wrote:
>> In article<b6gj27-5bn.ln1(a)ntp.tmsw.no>,
>> Terje Mathisen<"terje.mathisen at tmsw.no"> wrote:
>>> You and I have both written NTP-type code, so as I wrote in another
>>> message: Separate motherboards should use NTP to stay in sync, with or
>>> without hw assists like ethernet timing hw and/or a global PPS source.
>>
>> Yes, but I am thinking of a motherboard with a thousand cores on it.
>> While it could use NTP-like protocols between cores, and for each
>> core to maintain its own clock, that's a fairly crazy approach.
>>
>> All right, realistically, it would be 64 groups of 16 cores, or
>> whatever, but the point stands. Having to use TWO separate
>> protocols on a single board isn't nice.
>
>I agree.
>
>Anything located on a single board should be able to share a common
>timing reference, i.e. core crystal.
>
>That only leaves the OS with the task of syncing up the base counter
>values during startup.
>
Add a distributed TSC reset signal, and the sync-up is simple as well.
(When I mention this to the hardware guys, they say "well, they could still be
a couple of clocks off", to which my reply is "still closer than any other
approach", including, if I'm not mistaken, an NTP-like sync-up method.)

RDTSC is nice, but it tells you the wall clock time (more or less), I would
like a TSC that is part of the thread state. I would also like one that
tracks time in user and kernel modes, but that isn't as necessary.


- Tim

From: Stephen Fuld on
On 1/20/2010 2:43 AM, nmm1(a)cam.ac.uk wrote:
> In article<hj5ll0$1hm$1(a)news.eternal-september.org>,
> Stephen Fuld<SFuld(a)Alumni.cmu.edu.invalid> wrote:
>>
>>>>> In this particular common case, isn't a better solution to have a
>>>>> hardware "clock" register, one per "system" that is readable by a user
>>>>> mode instruction? With multi core processors, one register per chip, and
>>>>
>>>> Oh, absolutely!
>>>>
>>>> IBM mainframes have had this since pretty much forever, afaik, in the
>>>> form of a global counter running at something like 1MHz.
>>>
>>> Actually, I strongly disagree.
>>>
>>> While that works for clocks, it's not generalisable
>>
>> While it might not be generalizable, it is such a frequent case that it
>> may be worth a specialized solution.
>
> Agreed. I don't think that it is, but it depends on the rest of the
> system's design.
>
>>> and not scalable.
>>
>> Why not? What I am proposing, and what Terje mentioned are both at
>> least as scalable as what you proposed. Both have a single location to
>> be read but mine is in a CPU but available to all CPUs, Terje's is in a
>> separate chip available to all CPUs and yours is also in a separate
>> chip, a memory chip, available to all CPUs.
>
> Consider a machine with a thousand cores and a five level hierarchy.

I'm not sure whether you are talking about a single chip with 1,000
cores or multiple chips on a single board, each with multiple cores that
add up to 1,000, so I will deal with both cases. Note that I explicitly
excluded multiple board systems, i.e. clusters, as they require an extra
layer of solution.

> A single location is a serious bottleneck,

I don't think so. You admit in another post that the actual
implementation could be SRAM within a chip, i.e. a register. Such a
register could easily sustain a read every nanosecond (probably many
more), so throughput shouldn't be a problem. Yes, there would be
latency issues with any one request, but if you implement the uniqueness
algorithm I described in another post, this shouldn't be a problem.

With a single 1,000 core chip, the routing is already on the chip to
allow all those cores to get to the memory interface, and the register
read is certainly much faster than the external memory, so the read
traffic shouldn't hurt the memory traffic much. Note that the updates
do not require any bandwidth as they are entirely local to the one register.

With multiple chips, there is already some mechanism to allow the cores
on one chip to access memory that is physically on another chip. We
just piggy back on this mechanism for the register reads to the one chip
that has the active register. Again, no bandwidth is required for the
updates. Yes, the time stamp reads will be faster for the cores on the
chip with the active register, but again, with uniqueness guaranteed,
and a granularity of one microsecond, we should be well within
reasonable timing.


> so you need a distribution
> mechanism.

No; see above.

> Easy enough, but yet more clutter. I am proposing that
> the existing memory system be used for distribution.

My proposal uses the "access to memory" system for off chip reads, but
doesn't need coherency mechanisms, etc. as there is no distributed
implementation.

>>> What I would do is to have a kernel page (or pages) that are readable
>>> to all processes, and one field would be the current time.
>>
>> I understand that. But what would you use the rest of the page for? If
>> there are good uses, it might make sense, but without that, it just
>> wastes some resources. Also, I would rather have the register
>> implemented in CPU circuitry, which scales better over time than memory
>> circuitry.
>
> There are quite a lot. Most global quantities are immutable (machine
> identification, memory and page sizes, etc. etc.),

I like what has been done in the past: have these "preloaded" by the
kernel into the initial register contents whenever a task is started.
Then, if the application thinks it might need them, it can store them
wherever it wants. Very low overhead, easy, etc.

> but a few are
> mutable but read-only to applications (time, load levels, system
> state etc.) Also, you want to keep the time in a cache line on its
> own, as it is the most performance-critical of those.

Again, you already said your implementation would not use actual memory.
You certainly don't want to cache the time stamp. :-)


> There are also
> per-process quantities, mutable and immutable.
>
> A more subtle issue is virtualisation.

I agree. This is something that I didn't consider, and don't know
enough about to comment further.

snip

> Oh, yes, it is. Have you ANY idea how inaccurate the system clocks
> shipped with most computers are?

They are bad. But that doesn't mean they have to be bad. Non-PCs have
had accurate TOD clocks for decades.

> Worse than el cheapo mechanical
> alarm clocks :-( Even with a good clock, some machines run for
> years, continually, and so will need corrections.

Again, a solved problem in mainframes for decades: basically,
instructions to slow down or speed up the clock by a very small amount
to allow correction over a reasonable time.
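For what it's worth, that style of steering is easy to model in software: trim the clock's rate by a few parts per million instead of stepping its value, so the error drains away gradually. A sketch in C (names and numbers are illustrative, not any particular machine's interface):

```c
/* Software model of mainframe-style clock steering: rather than
 * stepping the clock, trim its rate by a few parts per million. */
#include <stdint.h>

struct steered_clock {
    uint64_t ticks;   /* corrected ticks counted so far */
    int64_t  ppm;     /* current rate trim, parts per million */
    int64_t  frac;    /* accumulated correction, in millionths of a tick */
};

/* Advance the clock by `raw` oscillator ticks, applying the trim. */
uint64_t clock_advance(struct steered_clock *c, uint64_t raw)
{
    c->frac += (int64_t)raw * c->ppm;      /* correction in 1e-6 ticks */
    int64_t whole = c->frac / 1000000;     /* whole corrected ticks */
    c->frac -= whole * 1000000;            /* keep the remainder */
    c->ticks += raw + (uint64_t)whole;     /* sped up or slowed down */
    return c->ticks;
}
```

Over one million raw ticks, a +100 ppm trim yields exactly 100 extra ticks; run the trim long enough and any startup offset is absorbed without the clock ever jumping.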

snip

>> I hope that the implementation of a basic time clock shouldn't need to
>> be changed often :-)
>
> Hope springs infernal in the human breast :-)

Remember, my proposed implementation is a simple counter of say elapsed
microseconds (with a pre-loadable start value to correlate it to real
time). Anything further is in software. I think the definition of a
second is pretty stable, and the definition of 1,000,000 to get to
microseconds is stable, even for large values of 1,000,000. :-)


--
- Stephen Fuld
(e-mail address disguised to prevent spam)
From: "Andy "Krazy" Glew" on
Terje Mathisen wrote:

> If the fast system call can handle the user->kernel->user transitions in
> the same or less time, then there's really no need for funky unprotected
> code to be visible in user-mode libraries!

We want some operations (many operations, most operations) to be as fast as possible.

There are often several different ways of implementing such operations:

Doing a System Call
    "Naturally" atomic wrt context switches, easy to make atomic against interrupts.
    Syscall overhead.
    Can be adapted to changing implementations without affecting user code.

User Level Code, Library or Inlined
    Probably less overhead.
    More atomicity issues.
    Compatibility issues.

Instructions
    "Naturally" atomic wrt context switches and interrupts.
    Overhead of transferring to microcode.
    Easy to adapt to changing microarchitectures.
From: "Andy "Krazy" Glew" on
Tim McCaffrey wrote:
> RDTSC is nice, but it tells you the wall clock time (more or less), I would
> like a TSC that is part of the thread state. I would also like one that
> tracks time in user and kernel modes, but that isn't as necessary.

How about a single timer, but maintain offset registers?

You can always read the "absolute" TSC (if you are allowed to do so by security - timing channels).

Whenever you go into the kernel, save the current offset/time, and on return adjust the offset, so that the adjusted
user time sees no time in the kernel.

Ditto on interrupts.

Ditto on virtual machines.

Ditto by threads of an application. (Note that hardware threads are another matter.)

Ditto on interrupt blocked time.

....

I.e. you can multiplex a single timer by adjusting offsets. I.e. you can virtualize timers.

No need to keep adding more and more timers for dedicated purposes.
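A sketch of the offset trick in C, with all names invented and the raw counter simulated by a variable (the real one would be the hardware TSC): each context carries an offset, and everything that elapses between kernel entry and exit is folded into it, so the context's virtual TSC never sees kernel time.

```c
/* Virtualizing one free-running counter into per-context "virtual
 * TSCs" by maintaining an offset per context. */
#include <stdint.h>

static uint64_t raw_tsc;                  /* stands in for the hw counter */
static uint64_t rdtsc_raw(void) { return raw_tsc; }

struct vtsc {
    uint64_t offset;        /* raw ticks hidden from this context */
    uint64_t kernel_entry;  /* raw time when we entered the kernel */
};

/* The per-thread user-time TSC Tim asked for. */
uint64_t vtsc_read(const struct vtsc *v)
{
    return rdtsc_raw() - v->offset;
}

void vtsc_enter_kernel(struct vtsc *v)
{
    v->kernel_entry = rdtsc_raw();        /* note when we left user mode */
}

void vtsc_exit_kernel(struct vtsc *v)
{
    v->offset += rdtsc_raw() - v->kernel_entry;   /* hide kernel time */
}
```

The same enter/exit pair, applied at interrupt and VM boundaries, gives the "ditto" cases above; the hardware needs only the one counter.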

Heck, you can generate timer traps the same way. You can get an interrupt every 1s of user time, 1s of real time, etc., etc.

But, if you do this as a user library, the data structure of offsets will change over time. Some will get added, some
removed. Worse, as your CPU microarchitecture changes, sometimes you won't be able to use the offset trick. (SMT.)

If you do this as an instruction - well, it is unlikely that hardware guys will ever define all the sorts of virtual
timers that software will want.

If you do this in a system call.... ahh. But the syscall better be fast.

One can imagine a hybrid: let the OS know about the actual hardware timer structure, which timers are dedicated, which
are virtualized by offsets. Let the OS manipulate the timer offsets. But let the timer offsets be placed in a memory
data structure for an instruction to use.

One can imagine a hybrid... but maybe it is just plain simpler to make syscalls fast.
From: "Andy "Krazy" Glew" on
Terje Mathisen wrote:
> nmm1(a)cam.ac.uk wrote:
>> In article<1531844.zBA62FjkXi(a)elfi.zetex.de>,
>> Bernd Paysan<bernd.paysan(a)gmx.de> wrote:
>>> It's not so bad as you think. As long as your uncertainty of time is
>>> smaller than the communication delay between the nodes, you are fine,
>>> i.e.
>>> your values are unique - you only have to make sure that the adjustments
>>> propagate through the shortest path.
>>
>> Er, no. How do you stop two threads delivering the same timestamp
>> if they execute a 'call' at the same time without having a single
>> time server? Ensuring global uniqueness is the problem.
>
> No!
>
> Global uniqueness is a separate, but also quite important problem.
>
> It is NOT fair to saddle every single timestamp call with the overhead
> required for a globally unique value!


Amen, brother!

Too many timestamp and time-related functions are rolled into one.

There are TIMESTAMPS, e.g. for databases, which want global uniqueness and monotonicity. They do not necessarily even
need time, although it is pleasant to be able to compute timestamp deltas and calculate elapsed time.

And then there are TIMERS for performance measurement. And certain protocol tunings. And real-time. Here you want low
overhead, and don't really care about uniqueness.

TIMERS subdivide. Sometimes you want timers where it is guaranteed that all earlier instructions have finished
execution, and no later instructions have started execution. I.e. you want serializing timers. But serializing has
overhead. At other times you want timers that are semi-serialized: all earlier instructions have retired, and no later
instructions have retired, although they may have begun execution. And still faster timers may be fully out-of-order:
they just execute at some point in the pipeline. Not necessarily even monotonic according to von Neumann order. But very
low overhead. Sufficient for program tracing.
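On x86 these three flavours roughly correspond to rdtsc, rdtscp, and a fence before rdtsc (cpuid+rdtsc for a fully serializing read). A sketch with gcc/clang inline asm, falling back to a plain counter on other targets so it still compiles:

```c
/* Three timer flavours: out-of-order, semi-serialized, serialized.
 * On x86-64 these map to rdtsc / rdtscp / lfence+rdtsc; lfence is the
 * cheap approximation of a serializing read, cpuid the heavyweight one. */
#include <stdint.h>

#if defined(__x86_64__)
static uint64_t rd(void)          /* out-of-order: bare rdtsc */
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}
static uint64_t rd_semi(void)     /* semi-serialized: rdtscp waits for
                                     prior instructions to retire */
{
    uint32_t lo, hi, aux;
    __asm__ __volatile__("rdtscp" : "=a"(lo), "=d"(hi), "=c"(aux));
    (void)aux;                    /* IA32_TSC_AUX: cpu id, unused here */
    return ((uint64_t)hi << 32) | lo;
}
static uint64_t rd_serial(void)   /* serialized: fence execution first */
{
    uint32_t lo, hi;
    __asm__ __volatile__("lfence\n\trdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}
#else  /* portable stand-in so the sketch compiles elsewhere */
static uint64_t fake_clock;
static uint64_t rd(void)        { return ++fake_clock; }
static uint64_t rd_semi(void)   { return ++fake_clock; }
static uint64_t rd_serial(void) { return ++fake_clock; }
#endif
```

The cost ordering is the point: rd is cheapest and may be reordered past neighbouring instructions; rd_serial is the one you want around a code region being measured, and you pay for it.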

But by mixing up and rolling together several different TIMESTAMP and TIMER functions, you obtain timers that do not
satisfy everyone.

--

Further subdivisions:

Single processor timers have different constraints than timers used for cross-processor performance measurements. E.g.
if measuring time to make a store visible from one processor to another:

P1:
    mem := 0
loop:
    time1.1 := rdtsc
    reg := read mem
    time1.2 := rdtsc
    if reg = 0 goto loop

P2:
    time2 := rdtsc
    mem := 1

where
    time1.2 - time2   = time to send write from P2 to P1
    time1.2 - time1.1 = related to time necessary to receive write
                        (e.g. time to lose an S state line, and then get it back)

For this to be meaningful, the timers on P1 and P2 should be in synch.

Unless synchronized structurally, it is often necessary to "time warp" in an NTP-like algorithm.

But, such time warping reduces accuracy for single processor measurements.

--

Posted to http://semipublic.comp-arch.net/wiki/The_difference_between_Timestamps_and_Timers