From: Glauber Costa
On Tue, Apr 20, 2010 at 12:42:17PM -0700, Jeremy Fitzhardinge wrote:
> On 04/20/2010 11:54 AM, Avi Kivity wrote:
> > On 04/20/2010 09:23 PM, Jeremy Fitzhardinge wrote:
> >> On 04/20/2010 02:31 AM, Avi Kivity wrote:
> >>
> >>> btw, do you want this code in pvclock.c, or shall we keep it kvmclock
> >>> specific?
> >>>
> >> I think it's a pvclock-level fix. I'd been hoping to avoid having
> >> something like this, but I think it's ultimately necessary.
> >>
> >
> > Did you observe drift on Xen, or is this "ultimately" pointing at the
> > future?
>
> People are reporting weirdnesses that "clocksource=jiffies" apparently
> resolves. Xen and KVM are faced with the same hardware constraints, and
> it wouldn't surprise me if there were small measurable
> non-monotonicities in the PV clock under Xen. May as well be safe.
>
> Of course, it kills any possibility of being able to usefully export
> this interface down to usermode.
>
> My main concern about this kind of simple fix is that if there's a long
> term systematic drift between different CPUs' TSCs, then this will
> somewhat mask the problem while giving really awful time measurement on
> the "slow" CPU(s). In that case it really needs to adjust the scaling
> factor to correct for the drift (*not* update the offset). But if we're
> definitely only talking about fixed, relatively small time offsets then
> it is fine.
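The offset clamp Jeremy is weighing against scale correction can be sketched in user-space C11. This is an illustrative stand-in, not the actual pvclock code; the function and variable names are made up:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative sketch of the monotonicity clamp under discussion:
 * every reader publishes the timestamp it saw, and no reader ever
 * returns less than the globally last-seen value. Small backward
 * steps are hidden; a systematic drift would make the "slow" CPU
 * keep returning the clamped value, which is Jeremy's concern. */
static _Atomic uint64_t last_value;

uint64_t monotonic_read(uint64_t raw /* per-CPU pvclock reading */)
{
    uint64_t last = atomic_load(&last_value);

    for (;;) {
        if (raw < last)
            return last;          /* clamp: hide the backward step */
        /* try to publish our (newer) reading */
        if (atomic_compare_exchange_weak(&last_value, &last, raw))
            return raw;
        /* someone else published first; 'last' now holds their
         * value, so re-check against it */
    }
}
```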
Can you by any chance run Ingo's time warp test on those machines?

You need to define TOD to 1, and leave out the TSC test.

For me, warps exist on every machine out there except the Nehalems, so far
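The TOD mode being described can be sketched as below. This is a simplified stand-in for illustration, not Ingo's actual time-warp-test.c: each thread compares its gettimeofday() reading against a shared last-seen stamp and counts apparent backward steps.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/time.h>

static _Atomic uint64_t last_stamp;
static _Atomic long warps;

static uint64_t tod_usec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (uint64_t)tv.tv_sec * 1000000ull + tv.tv_usec;
}

static void *worker(void *arg)
{
    long iters = (long)(intptr_t)arg;
    for (long i = 0; i < iters; i++) {
        uint64_t prev = atomic_load(&last_stamp);
        uint64_t now = tod_usec();   /* read AFTER loading prev */
        if (now < prev)
            atomic_fetch_add(&warps, 1);   /* time went backwards */
        else
            /* best-effort publish of the newer stamp */
            atomic_compare_exchange_weak(&last_stamp, &prev, now);
    }
    return NULL;
}

/* returns the number of observed TOD warps across nthreads threads */
long run_warp_check(int nthreads, long iters)
{
    pthread_t t[16];
    if (nthreads > 16)
        nthreads = 16;
    atomic_store(&warps, 0);
    for (int i = 0; i < nthreads; i++)
        pthread_create(&t[i], NULL, worker, (void *)(intptr_t)iters);
    for (int i = 0; i < nthreads; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&warps);
}
```

On a host with a sane clocksource the count stays at zero; on the machines being discussed it would not.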
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Jeremy Fitzhardinge
On 04/23/2010 02:34 AM, Avi Kivity wrote:
>>
>> diff -rup a/time-warp-test.c b/time-warp-test.c
>> --- a/time-warp-test.c 2010-04-15 16:30:13.955981607 -1000
>> +++ b/time-warp-test.c 2010-04-15 16:35:37.777982377 -1000
>> @@ -91,7 +91,7 @@ static inline unsigned long long __rdtsc
>> {
>> DECLARE_ARGS(val, low, high);
>>
>> - asm volatile("cpuid; rdtsc" : EAX_EDX_RET(val, low, high));
> >> + asm volatile("cpuid; rdtsc" : EAX_EDX_RET(val, low, high) :: "ebx", "ecx");
>>
>>
>
>
> Plus, replace cpuid by lfence/mfence. cpuid will trap.

I presume the clobbers are needed for cpuid; if you use [lm]fence then
they shouldn't be needed, right?

J

From: Avi Kivity
On 04/23/2010 10:22 PM, Jeremy Fitzhardinge wrote:
> On 04/23/2010 02:34 AM, Avi Kivity wrote:
>
>>> diff -rup a/time-warp-test.c b/time-warp-test.c
>>> --- a/time-warp-test.c 2010-04-15 16:30:13.955981607 -1000
>>> +++ b/time-warp-test.c 2010-04-15 16:35:37.777982377 -1000
>>> @@ -91,7 +91,7 @@ static inline unsigned long long __rdtsc
>>> {
>>> DECLARE_ARGS(val, low, high);
>>>
>>> - asm volatile("cpuid; rdtsc" : EAX_EDX_RET(val, low, high));
>>> + asm volatile("cpuid; rdtsc" : EAX_EDX_RET(val, low, high) :: "ebx", "ecx");
>>>
>>>
>>>
>>
>> Plus, replace cpuid by lfence/mfence. cpuid will trap.
>>
> I presume the clobbers are needed for cpuid; if you use [lm]fence then
> they shouldn't be needed, right?
>
>

Right. But I'm not sure lfence/mfence are sufficient - from what I
understand rdtsc can pass right through them.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

From: Zachary Amsden
On 04/22/2010 03:11 AM, Glauber Costa wrote:
> On Tue, Apr 20, 2010 at 12:42:17PM -0700, Jeremy Fitzhardinge wrote:
>
>> On 04/20/2010 11:54 AM, Avi Kivity wrote:
>>
>>> On 04/20/2010 09:23 PM, Jeremy Fitzhardinge wrote:
>>>
>>>> On 04/20/2010 02:31 AM, Avi Kivity wrote:
>>>>
>>>>
>>>>> btw, do you want this code in pvclock.c, or shall we keep it kvmclock
>>>>> specific?
>>>>>
>>>>>
>>>> I think it's a pvclock-level fix. I'd been hoping to avoid having
>>>> something like this, but I think it's ultimately necessary.
>>>>
>>>>
>>> Did you observe drift on Xen, or is this "ultimately" pointing at the
>>> future?
>>>
>> People are reporting weirdnesses that "clocksource=jiffies" apparently
>> resolves. Xen and KVM are faced with the same hardware constraints, and
>> it wouldn't surprise me if there were small measurable
>> non-monotonicities in the PV clock under Xen. May as well be safe.
>>
>> Of course, it kills any possibility of being able to usefully export
>> this interface down to usermode.
>>
>> My main concern about this kind of simple fix is that if there's a long
>> term systematic drift between different CPUs' TSCs, then this will
>> somewhat mask the problem while giving really awful time measurement on
>> the "slow" CPU(s). In that case it really needs to adjust the scaling
>> factor to correct for the drift (*not* update the offset). But if we're
>> definitely only talking about fixed, relatively small time offsets then
>> it is fine.
>>
> Can you by any chance run Ingo's time warp test on those machines?
>
> You need to define TOD to 1, and leave out the TSC test.
>
> For me, warps exist on every machine out there except the Nehalems, so far
>
Or apply this patch.
From: Avi Kivity
On 04/21/2010 03:01 AM, Zachary Amsden wrote:
>>> on this machine Glauber mentioned, or even on a multi-core Core 2 Duo),
>>> but the delta calculation is very hard (if not impossible) to get
>>> right.
>>>
>>> The timewarps i've seen were in the 0-200ns range, and very rare (once
>>> every 10 minutes or so).
>>
>> Might be due to NMIs or SMIs interrupting the rdtsc(); ktime_get()
>> operation which establishes the timeline. We could limit it by
>> having a loop doing rdtsc(); ktime_get(); rdtsc(); and checking for
>> some bound, but it isn't worthwhile (and will break nested
>> virtualization for sure). Better to have the option to calibrate
>> kvmclock just once on machines with
>> X86_FEATURE_NONSTOP_TRULY_RELIABLE_TSC_HONESTLY.
> Yes. So it's not as if the guest-visible TSCs go out of sync (they don't).
>
> There's a perfect way to do this and it still fails to stop
> timewarps. You can set the performance counters to overflow if more
> instructions are issued than your code path, run an assembly
> instruction stream and if the performance interrupt hits, restart the
> calibration.

It's completely impractical. The PMU is a global resource that is
already shared among users and the host; programming and restoring it is
expensive; and in a virtualized environment the whole scheme may fail.

>
> The calibration happens not just once, but on every migration, and
> currently, I believe, on every VCPU switch. Even if we reduce the
> number of calibrations to the bare minimum and rule out SMIs and NMIs,
> there will still be variation due to factors beyond our control
> because of the unpredictable nature of cache and instruction issue.

Right.

>
> However, X86_FEATURE_NONSTOP_TRULY_RELIABLE_TSC_HONESTLY does imply
> one key feature which the code is missing today: on SMP VMs, the
> calibration of kvmclock needs to be done only once, and the clock can
> then be used for all VCPUs. That, I think, stops Glauber's bug from
> appearing on the server side.

That's the plan.
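The "calibrate once" plan amounts to computing one master parameter set and copying it to every VCPU, rather than recalibrating on each VCPU switch. A sketch with illustrative field names (the real structure is pvclock_vcpu_time_info, whose fields differ slightly):

```c
#include <stdint.h>

struct pvclock_params {
    uint32_t tsc_to_system_mul;   /* scale factor: TSC ticks -> ns */
    int8_t   tsc_shift;
    uint64_t tsc_base;            /* TSC value at calibration time */
    uint64_t system_time_base;    /* system time (ns) at calibration */
};

/* With a truly constant, cross-CPU-synchronized TSC, one calibration
 * is valid everywhere: giving every VCPU identical parameters removes
 * the cross-CPU warps that per-VCPU calibration jitter introduces. */
void share_calibration(const struct pvclock_params *master,
                       struct pvclock_params *per_vcpu, int nr_vcpus)
{
    for (int i = 0; i < nr_vcpus; i++)
        per_vcpu[i] = *master;
}
```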


--
error compiling committee.c: too many arguments to function
