From: Stephen Fuld on 19 Jan 2010 11:19
On 1/18/2010 4:13 AM, Terje Mathisen wrote:
> Anton Ertl wrote:
>> But given that system calls have to do much more sanity checking on
>> their arguments, and there is the common prelude that you mentioned
>> (what is it for?), I don't see system calls ever becoming as fast as
>> function calls, even with fast system call and system return
> _Some_ system calls don't need that checking code!
> I.e. using a very fast syscall(), you can return an OS timestamp within
> a few nanoseconds, totally obviating the need for application code to
> develop their own timers, based on RDTSC() (single-core/single-cpu
> systems only), ACPI timers or whatever else is available.
> Even if this is only possible for system calls that deliver very simple
> results, and where the checking code is negligible, this is still an
> important subset.
> The best solution today is to take away all attempts at security and
> move all those calls into a user-level library, right?
In this particular common case, isn't a better solution to have a
hardware "clock" register, one per "system" that is readable by a user
mode instruction? With multi core processors, one register per chip,
and, if you have multiple chips per board, a hardware mechanism to
ensure all reads come from a single one of these (to prevent the RDTSC
problems). There should be a privileged instruction to set the initial
value, and it should be incremented by the hardware based on real time.
There are issues about what granularity, whether to allow duplicate
return values, how to keep it synchronized to real time, etc. but these
can be worked out. Even if it took several CPU cycles to execute the
instruction it would be better than a system call, and more useful and
simpler than a core local cycle counter.
- Stephen Fuld
(e-mail address disguised to prevent spam)
From: Tim McCaffrey on 19 Jan 2010 11:33
In article <4B540900.4060107(a)patten-glew.net>, ag-news(a)patten-glew.net says...
>I wrote the following for my wiki,
>and thought that Usenet comp.arch might be interested:
I liked the 68000 model better, with a System and a User stack pointer. The
mechanism allowed nested system calls and/or interrupts without all the busy
work the x86 does.
It seems like you could have optimized out the security checks when you load
the system registers with the SysEnter CS & SS values, instead of when the
values were actually sent to the segment registers.
From: MitchAlsup on 19 Jan 2010 14:30
On Jan 18, 10:17 pm, EricP <ThatWouldBeTell...(a)thevillage.com> wrote:
> Andy "Krazy" Glew wrote:
> > EricP wrote:
> >> but the x86
> >> provides no easy method to load an arbitrary offset of the
> >> current EIP into EDX except that kludgey call +0, pop edx method.
> >> So to use SysEnter you have to preload EDX with a constant restart
> >> EIP and that presumes the entry sequence is at a predefined
> >> location and that limits the utility of the SysEnter somewhat.
> > Agreed.
> > Note that x86 eventually got around to adding READ_EIP instruction.
> Where is that? I find no reference to such an instruction.
LEA EDX,[RIP+offset] // in long mode
The PDP-11 taught us well how useful it was to have the ability to get
at the program counter, and how terrible it was to have the ability to
write random computations into the program counter.
From: Terje Mathisen "terje.mathisen at tmsw.no" on 19 Jan 2010 15:33
Stephen Fuld wrote:
> In this particular common case, isn't a better solution to have a
> hardware "clock" register, one per "system" that is readable by a user
> mode instruction? With multi core processors, one register per chip, and
IBM mainframes have had this since pretty much forever, afaik, in the
form of a global counter running at something like 1MHz.
> and if you have multiple chips per board, a hardware mechanism to insure
> all reads come from a single one of these (to prevent the RDTSC
The problem is that there are currently far too many different solutions
to this issue, with the best being the suggested (but so far rarely
implemented) HPET timers.
Next in line is an x86 where there's a RDTSC counter which runs at
constant speed, independent of all the power-related speed stepping of
each individual core.
With such a counter, and an OS driver which makes sure that all cpus
share a common starting point (i.e. designate a master cpu and sync the
others to this one), user processes can just do a regular RDTSC and
depend on the result.
At the same time a library function can grab the global OS tick value,
the RDTSC saved during the last interrupt and the current RDTSC value:
/* These two counters are updated by the OS on every timer tick: */
extern volatile int64_t os_tick;
extern volatile int64_t os_tsc;

/* The following two values are initialized by the OS so that the
   multiplication will never overflow, even in the case of N lost timer
   interrupts in a row:

   I.e. with a 3 GHz core clock and 10 ms since the last update, the
   maximum RDTSC delta value would be 3e7, so the multiplier can be up to
   1.4e11. Using a maximum scale value that's less than 2^32 would allow a
   32-bit multiplier which might make it faster.

   In case of updates to the OS tick rate, i.e. due to NTP clock tweaking,
   the multiplier should ideally also be updated at the same time! */
extern int64_t os_tsc_scale;
extern int8_t os_tsc_shift;

extern int64_t rdtsc(void); // wrapper around the RDTSC instruction

int64_t get_timestamp(void) // return sec.frac in 32:32 NTP format
{
    int64_t tick, tsc, t;
    do {
        tick = os_tick;
        tsc = os_tsc;
        t = rdtsc();
        // Need to verify that the compiler actually reloads the os_tick value!
    } while (tick != os_tick); // In case of updates!
    t -= tsc;            // Delta count since last interrupt
    t *= os_tsc_scale;
    t >>= os_tsc_shift;
    tick += t;
    return tick;
}
> problems). There should be a privileged instruction to set the initial
> value, and it should be incremented by the hardware based on real time.
See above, this is what is supposed to happen.
> There are issues about what granularity, whether to allow duplicate
Duplicate values are impossible if the timer updates at least once per
RDTSC instruction; on all existing x86 CPUs this opcode takes something
like 10-30 cycles.
> return values, how to keep it synchronized to real time, etc. but these
> can be worked out. Even if it took several CPU cycles to execute the
> instruction it would be better than a system call, and more useful and
> simpler than a core local cycle counter.
Absolutely, hopefully we'll get there soon. :-)
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: nmm1 on 19 Jan 2010 15:57
In article <9hhh27-1af.ln1(a)ntp.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>Stephen Fuld wrote:
>> In this particular common case, isn't a better solution to have a
>> hardware "clock" register, one per "system" that is readable by a user
>> mode instruction? With multi core processors, one register per chip, and
>IBM mainframes have had this since pretty much forever, afaik, in the
>form of a global counter running at something like 1MHz.
Actually, I strongly disagree.
While that works for clocks, it's not generalisable and not scalable.
What I would do is to have a kernel page (or pages) readable by all
processes, in which one field would be the current time. The
normal memory mechanisms would be used to keep it in sync. Depending
on other details, those pages might or might not be cachable.
Yes, there would be a hardware clock, but that would not be directly
visible to ordinary code, and might not run at a precisely known rate.
The mapping between that and real time would be done by NTP-like code
somewhere in the system.
On a machine with a LOT of cores, you could update it directly.
On one without, you would want a special loop which would take the
hardware clock and the constants maintained by the NTP-like code,
and update the clock field in memory once every microsecond. That
would behave exactly like a separate core. And, because updating
the memory field is a kernel operation, the implementation could be