x86/pvclock: add vsyscall implementation [Kernel]


From: Dan Magenheimer on 7 Oct 2009 17:00

> We can support them by falling back to the kernel. I'm a bit worried
> about the kernel playing with the hypervisor's version field. It's
> better to introduce yet a new version for the kernel, and check both.

On Nehalem, apps that need timestamp information at a high
frequency will likely use rdtsc/rdtscp directly.

I very much support Jeremy's efforts to make vsyscall+pvclock
work fast on processors other than the very newest ones.

Dan

> -----Original Message-----
> From: Avi Kivity [mailto:avi(a)redhat.com]
> Sent: Wednesday, October 07, 2009 4:26 AM
> To: Jeremy Fitzhardinge
> Cc: Jeremy Fitzhardinge; Dan Magenheimer; Xen-devel; Kurt Hackel; the
> arch/x86 maintainers; Linux Kernel Mailing List; Glauber de Oliveira
> Costa; Keir Fraser; Zach Brown; Chris Mason
> Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall
> implementation
>
>
> On 10/06/2009 08:46 PM, Jeremy Fitzhardinge wrote:
> >
> >> Instead of using vgetcpu() and rdtsc() independently, you can use
> >> rdtscp to read both atomically. This removes the need for the
> >> preempt notifier.
> >>
> > rdtscp first appeared on Intel with Nehalem, so we need to support
> > older Intel chips.
> >
>
> We can support them by falling back to the kernel. I'm a bit worried
> about the kernel playing with the hypervisor's version field. It's
> better to introduce yet a new version for the kernel, and check both.
>
> > You could use rdtscp to get (tsc,cpu) atomically, but that's not
> > sufficient to be able to get a consistent snapshot of (tsc, time_info)
> > because it doesn't give you the pvclock_vcpu_time_info version number.
> > If TSC_AUX contained that too, it might be possible. Alternatively you
> > could compare the tsc with pvclock.tsc_timestamp, but unfortunately the
> > ABI doesn't specify that tsc_timestamp is updated in any particular
> > order compared to the rest of the fields, so you still can't use that to
> > get a consistent snapshot (we can revise the ABI, of course).
> >
> > So either way it doesn't avoid the need to iterate. vgetcpu will use
> > rdtscp if available, but I agree it is unfortunate we need to do a
> > redundant rdtsc in that case.
> >
> >
>
> def try_pvclock_vtime():
>     tsc, p0 = rdtscp()
>     v0 = pvclock[p0].version
>     tsc, p = rdtscp()
>     t = pvclock_time(pvclock[p], tsc)
>     if p != p0 or pvclock[p].version != v0:
>         raise Exception("Processor or timebase change under our feet")
>     return t
>
> # Named pvclock_vtime so it does not shadow the pvclock_time() helper
> # called above.
> def pvclock_vtime():
>     while True:
>         try:
>             return try_pvclock_vtime()
>         except Exception:
>             pass  # raced with a migration or an update: retry
>
> So, two rdtscps and two compares.
>
> >>> +	for (cpu = 0; cpu < nr_cpu_ids; cpu++)
> >>> +		pvclock_vsyscall_time_info[cpu].version = ~0;
> >>> +
> >>> +	__set_fixmap(FIX_PVCLOCK_TIME_INFO, __pa(pvclock_vsyscall_time_info),
> >>> +		     PAGE_KERNEL_VSYSCALL);
> >>> +
> >>> +	preempt_notifier_init(&pvclock_vsyscall_notifier,
> >>> +			      &pvclock_vsyscall_preempt_ops);
> >>> +	preempt_notifier_register(&pvclock_vsyscall_notifier);
> >>> +
> >>>
> >> preempt notifiers are per-thread, not global, and will upset the
> >> cycle counters.
> >>
> > Ah, so I need to register it on every new thread? That's a bit
> > awkward.
> >
>
> It's used to manage processor registers, much like the fpu. If a thread
> uses a register that's not saved and restored by the normal context
> switch code, it can register a preempt notifier to do that instead.
>
> > This is intended to satisfy the cycle-counters who want to do
> > gettimeofday a million times a second, where I guess the tradeoff of
> > avoiding a pile of syscalls is worth a bit of context-switch overhead.
> >
>
> It's sufficient to increment a version counter on thread migration, no
> need to do it on context switch.
>
> --
> Do not meddle in the internals of kernels, for they are
> subtle and quick to panic.
>
>
>
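
Editor's sketch of the read loop discussed above, as a small Python model rather than real vsyscall code. The field names follow pvclock_vcpu_time_info; the rdtscp argument is a stand-in callable returning (tsc, cpu), and the convention that an odd version means "update in progress" is stated here as an assumption of the model.

```python
from dataclasses import dataclass

MASK64 = (1 << 64) - 1


@dataclass
class TimeInfo:
    """Simplified model of one per-CPU pvclock_vcpu_time_info record."""
    version: int            # assumed odd while an update is in progress
    tsc_timestamp: int      # TSC value when system_time was sampled
    system_time: int        # nanoseconds at tsc_timestamp
    tsc_to_system_mul: int  # 32-bit fractional multiplier
    tsc_shift: int          # signed shift applied to the TSC delta first


def scale_delta(delta, mul, shift):
    """Convert a TSC delta to nanoseconds: shift, multiply, drop 32 bits."""
    delta = (delta << shift) if shift >= 0 else (delta >> -shift)
    return ((delta & MASK64) * mul) >> 32


def read_time(rdtscp, infos):
    """Retry until the CPU number and version are unchanged across the read."""
    while True:
        _, cpu0 = rdtscp()
        v0 = infos[cpu0].version
        tsc, cpu = rdtscp()
        info = infos[cpu]
        ns = info.system_time + scale_delta(
            tsc - info.tsc_timestamp, info.tsc_to_system_mul, info.tsc_shift)
        if cpu == cpu0 and (v0 & 1) == 0 and info.version == v0:
            return ns
```

The cost matches the count given in the thread: two rdtscp reads and two compares on the fast path, plus the parity check this model adds.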
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Dan Magenheimer on 7 Oct 2009 19:00

> Then they will get incorrect timing once they are live migrated.

I've posted a proposed (OS-independent) solution for that and
am (slowly) in the process of implementing it.

> -----Original Message-----
> From: Avi Kivity [mailto:avi(a)redhat.com]
> Sent: Wednesday, October 07, 2009 3:08 PM
> To: Dan Magenheimer
> Cc: Jeremy Fitzhardinge; Jeremy Fitzhardinge; Xen-devel; Kurt Hackel;
> the arch/x86 maintainers; Linux Kernel Mailing List; Glauber
> de Oliveira
> Costa; Keir Fraser; Zach Brown; Chris Mason
> Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall
> implementation
>
>
> On 10/07/2009 10:48 PM, Dan Magenheimer wrote:
> >> We can support them by falling back to the kernel. I'm a bit worried
> >> about the kernel playing with the hypervisor's version field. It's
> >> better to introduce yet a new version for the kernel, and check both.
> >>
> > On Nehalem, apps that need timestamp information at a high
> > frequency will likely use rdtsc/rdtscp directly.
> >
> >
>
> Then they will get incorrect timing once they are live migrated.
>
> --
> I have a truly marvellous patch that fixes the bug which this
> signature is too narrow to contain.
>
>
>
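
Editor's sketch of the failure mode named above, as plain arithmetic. The 2.0 GHz and 3.0 GHz frequencies and the TSC values are made-up numbers, and the model assumes the hypervisor neither scales nor offsets the guest's TSC after migration.

```python
def tsc_to_ns(tsc, base_tsc, base_ns, tsc_hz):
    """Convert a raw TSC reading to ns using a calibration taken earlier."""
    return base_ns + (tsc - base_tsc) * 1_000_000_000 // tsc_hz


# Calibration done by the application on the source host (hypothetical).
CAL_HZ = 2_000_000_000

# One real second on the source host adds 2e9 ticks.
before = tsc_to_ns(2_000_000_000, 0, 0, CAL_HZ)

# After migration, one more real second on a 3.0 GHz host adds 3e9
# ticks, but the application still divides by the stale calibration.
after = tsc_to_ns(2_000_000_000 + 3_000_000_000, 0, 0, CAL_HZ)
elapsed_ns = after - before  # 1.5e9 ns reported for 1.0 s of real time

# If the target host's TSC happens to be lower than the last value
# seen on the source host, the computed time goes backwards.
backwards = tsc_to_ns(500_000_000, 0, 0, CAL_HZ)
```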

From: Chris Mason on 29 Oct 2009 09:10

On Thu, Oct 29, 2009 at 02:13:50PM +0200, Avi Kivity wrote:
> On 10/28/2009 07:47 PM, Jeremy Fitzhardinge wrote:
> >>Much better to have an API for this. Life is hacky enough already.
> >My point is that if an app cares about property X then it should just
> >measure property X. The fact that gettimeofday is a vsyscall is just an
> >implementation detail that apps don't really care about. What they care
> >about is whether gettimeofday is fast or not.
>
> But we can not make a reliable measurement.

I can't imagine how we'd decide what fast is? Please don't make the
applications guess.

-chris
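
Editor's sketch of what "just measure it" might look like from the application side. It uses Python's time module only as a stand-in: interpreter overhead dominates the result, and the 1000 ns threshold is an arbitrary guess, which is the objection raised above.

```python
import time


def ns_per_call(clock, calls=100_000):
    """Average wall-clock cost of one clock() call, in nanoseconds."""
    start = time.perf_counter_ns()
    for _ in range(calls):
        clock()
    return (time.perf_counter_ns() - start) / calls


cost = ns_per_call(time.monotonic_ns)

# "Fast" needs a threshold, and nothing tells the app what it should be.
looks_fast = cost < 1000
```

The number also depends on load, frequency scaling and whether the guest was descheduled during the loop, so a single run is not a reliable signal.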

From: Dan Magenheimer on 29 Oct 2009 12:20

On a related note, though some topic drift, many of
the problems that occur in virtualization due to migration
could be better addressed if Linux had an architected
interface to allow it to be signaled if a migration
occurred, and if Linux could signal applications of
the same. I don't have any cycles (pun intended) to
think about this right now, but if anyone else starts
looking at it, I'd love to be cc'ed.

Thanks,
Dan

> -----Original Message-----
> From: Dan Magenheimer
> Sent: Thursday, October 29, 2009 9:56 AM
> To: Avi Kivity
> Cc: Jeremy Fitzhardinge; Jeremy Fitzhardinge; Kurt Hackel; Glauber
> Costa; the arch/x86 maintainers; Linux Kernel Mailing List; Glauber de
> Oliveira Costa; Xen-devel; Keir Fraser; Zach Brown; Ingo Molnar; Chris
> Mason
> Subject: RE: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall
> implementation
>
>
> > From: Avi Kivity [mailto:avi(a)redhat.com]
> > Sent: Thursday, October 29, 2009 9:07 AM
> > To: Dan Magenheimer
> > Cc: Jeremy Fitzhardinge; Glauber Costa; Jeremy Fitzhardinge; Kurt
> > Hackel; the arch/x86 maintainers; Linux Kernel Mailing List;
> > Glauber de
> > Oliveira Costa; Xen-devel; Keir Fraser; Zach Brown; Chris
> Mason; Ingo
> > Molnar
> > Subject: Re: [Xen-devel] Re: [PATCH 3/5] x86/pvclock: add vsyscall
> > implementation
> >
> >
> > On 10/29/2009 04:46 PM, Dan Magenheimer wrote:
> > > No, the apps I'm familiar with (a DB and a JVM) need a timestamp
> > > not a monotonic counter. The timestamps must be relatively
> > > accurate (e.g. we've been talking about gettimeofday generically,
> > > but these apps would use clock_gettime for nsec resolution),
> > > monotonically increasing, and work properly across a VM
> > > migration. The timestamps are taken at up to 100K/sec or
> > > more so the apps need to ensure they are using the fastest
> > > mechanism available that meets those requirements.
> >
> > Out of interest, do you know (and can you relate) why those apps need
> > 100k/sec monotonically increasing timestamps?
>
> I don't have any public data available for this DB usage, but basically
> assume it is measuring transactions at a very high throughput, some
> of which are to a memory-resident portion of the DB. Anecdotally,
> I'm told the difference between non-vsyscall gettimeofday
> and native rdtsc (on a machine with Invariant TSC support) can
> affect overall DB performance by as much as 10-20%.
>
> I did find the following public link for the JVM:
>
> http://download.oracle.com/docs/cd/E13188_01/jrockit/tools/intro/jmc3.html

Search for "flight recorder". This feature is intended to
be enabled all the time, but with non-vsyscall gettimeofday
the performance impact is unacceptably high, so they are using
rdtscp instead (on those machines where it is available). With
rdtscp, the performance impact is not measurable.

Though the processor/server vendors have finally fixed the
"unsynced TSC" problem on recent x86 platforms, thus allowing
enterprise software to obtain timestamps at rdtsc performance,
the problem comes back all over again with virtualization
because of migration. Jeremy's vsyscall+pvclock is a great
solution if the app can ensure that it is present; if not,
the apps will instead continue to use rdtsc as even emulated
rdtsc is 2-3x faster than non-vsyscall gettimeofday.

Does that help?
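
Editor's sketch of one way an application could enforce the "never goes backwards" requirement on top of whatever raw source it ends up using. The class name and the raw_ns callable are hypothetical, and the clamp only guarantees non-decreasing values; it does nothing for accuracy after a migration.

```python
import threading


class MonotonicClock:
    """Wrap a raw nanosecond source so returned values never decrease."""

    def __init__(self, raw_ns):
        self._raw_ns = raw_ns          # callable returning nanoseconds
        self._last = 0
        self._lock = threading.Lock()

    def now_ns(self):
        with self._lock:
            t = self._raw_ns()
            if t < self._last:
                t = self._last         # source stepped backwards: hold
            self._last = t
            return t
```

For example, raw readings of 100, 250, 90, 300 come out as 100, 250, 250, 300. The lock is the obvious cost: at 100K timestamps per second across many threads it may matter, and a per-thread variant would trade global ordering for speed.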

_______________________________________________
Xen-devel mailing list
Xen-devel(a)lists.xensource.com
http://lists.xensource.com/xen-devel

From: Dan Magenheimer on 2 Nov 2009 10:40

> From: Avi Kivity [mailto:avi(a)redhat.com]
>
> On 10/29/2009 06:15 PM, Dan Magenheimer wrote:
> > On a related note, though some topic drift, many of
> > the problems that occur in virtualization due to migration
> > could be better addressed if Linux had an architected
> > interface to allow it to be signaled if a migration
> > occurred, and if Linux could signal applications of
> > the same. I don't have any cycles (pun intended) to
> > think about this right now, but if anyone else starts
> > looking at it, I'd love to be cc'ed.
>
> IMO that's not a good direction. The hypervisor should not depend on
> the guest for migration (the guest may be broken, or malicious, or being
> debugged, or slow). So the notification must be asynchronous, which
> means that it will only be delivered to applications after migration has
> completed.

I definitely agree that the hypervisor can't wait for a guest
to respond.

You've likely thought through this a lot more than I have,
but I was thinking that if the kernel received the notification
as some form of interrupt, it could determine immediately
if any running threads had registered for "SIG_MIGRATE"
and deliver the signal synchronously.

> Instead of a "migration has occurred, run for the hills" signal we're
> better off finding out why applications want to know about this event and
> addressing specific needs.

Perhaps. It certainly isn't warranted for this one
special case of timestamp handling. But I'll bet 5-10 years
from now, after we've handled a few special cases, we'll
wish that we would have handled it more generically.

Dan
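
Editor's sketch of an asynchronous alternative to a signal: the application polls a migration generation counter and rebuilds cached state when it changes. No such counter is defined in this thread; read_generation and rebuild are hypothetical callables used only to show the pattern.

```python
class MigrationAwareCache:
    """Cache a value and rebuild it when a generation counter changes."""

    def __init__(self, read_generation, rebuild):
        self._read_generation = read_generation
        self._rebuild = rebuild
        self._gen = read_generation()
        self._value = rebuild()

    def get(self):
        gen = self._read_generation()
        if gen != self._gen:
            # A migration was reported since the value was cached.
            # Store the generation read before the rebuild, so a
            # migration during the rebuild forces another one later.
            self._value = self._rebuild()
            self._gen = gen
        return self._value
```

This fits the constraint that the notification arrives only after migration has completed: a stale value can still be used between the migration and the next get().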
