From: Avi Kivity on
On 04/19/2010 02:19 PM, Peter Zijlstra wrote:
>
>> Still have two cmpxchgs in the common case. The first iteration will
>> fail, fetching last_value, the second will work.
>>
>> It will be better when we have contention, though, so it's worthwhile.
>>
> Right, another option is to put the initial read outside of the loop,
> that way you'll have the best of all cases, a single LOCK'ed op in the
> loop, and only a single LOCK'ed op for the fast path on sensible
> architectures ;-)
>
> last = atomic64_read(&last_value);
> do {
>         if (ret < last)
>                 return last;
>         last = atomic64_cmpxchg(&last_value, last, ret);
> } while (unlikely(last != ret));
>
> Or so.
>
>

Yes, looks optimal when !NONSTOP_TSC.
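
A self-contained sketch of the clamp Peter describes, with C11 atomics standing in for the kernel's atomic64_* helpers (the function name is illustrative, not the merged code); the try-cmpxchg form keeps the uncontended fast path down to the single LOCK'ed op:

#include <stdatomic.h>
#include <stdint.h>

/* Global accumulator: the latest time value any CPU has returned. */
static _Atomic uint64_t last_value;

/* Clamp this CPU's reading 'ret' so callers never see time go backwards. */
static uint64_t clamp_monotonic(uint64_t ret)
{
        uint64_t last = atomic_load(&last_value);

        do {
                /* Another CPU already published a later time: return that
                 * instead of our stale reading. */
                if (ret < last)
                        return last;
                /* On failure, 'last' is refreshed with the current value
                 * and we retry; on success we have published 'ret'. */
        } while (!atomic_compare_exchange_weak(&last_value, &last, ret));

        return ret;
}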

--
error compiling committee.c: too many arguments to function

From: Avi Kivity on
On 04/19/2010 09:35 PM, Zachary Amsden wrote:
>>>> Sockets and boards too? (IOW, how reliable is TSC_RELIABLE)?
>>> Not sure, IIRC we clear that when the TSC sync test fails, e.g. when we
>>> mark the tsc clocksource unusable.
>>
>> Worrying. By the time we detect this the guest may already have
>> gotten confused by clocks going backwards.
>
>
> Upstream, we are marking the TSC unstable preemptively when hardware
> which will eventually fail the sync test is detected, so this should be fine.

ENOPARSE?

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

From: Avi Kivity on
On 04/19/2010 07:18 PM, Jeremy Fitzhardinge wrote:
> On 04/19/2010 07:46 AM, Peter Zijlstra wrote:
>
>> What avi says! :-)
>>
>> On a 32bit machine a 64bit read is two 32bit reads, so
>>
>> last = last_value;
>>
>> becomes:
>>
>> last.high = last_value.high;
>> last.low = last_value.low;
>>
>> (or the reverse of course)
>>
>> Now imagine a write getting interleaved with that ;-)
>>
>>
> You could explicitly do:
>
> do {
>         h = last.high;
>         barrier();
>         l = last.low;
>         barrier();
> } while (last.high != h);
>
>
> This works because we expect last to be always increasing, so the only
> worry is low wrapping and incrementing high. It's more efficient than
> making the read fully atomic (the write is still cmpxchg64), but it's
> pretty ugly to open-code just for 32b architectures; it's something that
> might be useful to turn into a general abstraction (monotonic_read_64
> FTW!). I already have code like this in the Xen time code, so I could
> make immediate use of it.
>

I don't think this is worthwhile - the cmpxchg is not that expensive on
most kvm-capable hosts (the exception is the Pentium D).

btw, do you want this code in pvclock.c, or shall we keep it
kvmclock-specific?
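
For reference, a sketch of the monotonic_read_64() abstraction Jeremy describes, assuming a little-endian 32-bit split and a writer that only ever moves the value forward (the names are illustrative):

#include <stdint.h>

#define barrier() __asm__ __volatile__("" ::: "memory")

/* 64-bit value as two 32-bit halves, as a 32-bit reader sees it. */
struct u64_split {
        volatile uint32_t low;
        volatile uint32_t high;
};

/* Read high, then low, then re-check high: if low wrapped in between,
 * high has changed and we retry. Safe only for monotonic values. */
static uint64_t monotonic_read_64(const struct u64_split *v)
{
        uint32_t h, l;

        do {
                h = v->high;
                barrier();
                l = v->low;
                barrier();
        } while (v->high != h);

        return ((uint64_t)h << 32) | l;
}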

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

From: Glauber Costa on
On Tue, Apr 20, 2010 at 12:35:19PM +0300, Avi Kivity wrote:
> On 04/20/2010 04:57 AM, Marcelo Tosatti wrote:
> >
> >>Marcelo can probably confirm it, but he has a Nehalem with an apparently
> >>very good TSC source. Even this machine warps.
> >>
> >>It stops warping if we only write the pvclock data structure once and forget
> >>it (which updates tsc_timestamp only once), according to him.
> >Yes. So it's not as if the guest-visible TSCs go out of sync (they don't
> >on this machine Glauber mentioned, or even on a multi-core Core 2 Duo),
> >but the delta calculation is very hard (if not impossible) to get right.
> >
> >The timewarps i've seen were in the 0-200ns range, and very rare (once
> >every 10 minutes or so).
>
> Might be due to NMIs or SMIs interrupting the rdtsc(); ktime_get()
> operation which establishes the timeline. We could limit it by
> having a loop doing rdtsc(); ktime_get(); rdtsc(); and checking for
> some bound, but it isn't worthwhile (and will break nested
> virtualization for sure). Better to have the option to calibrate
> kvmclock just once on machines with
> X86_FEATURE_NONSTOP_TRULY_RELIABLE_TSC_HONESTLY.
For the record, we can only even do that on those machines. If we try to update
time structures only once on machines with the
X86_FEATURE_TSC_SAYS_IT_IS_OKAY_BUT_IN_REALITY_IS_NOT_OKAY feature flag, guests
won't even boot.
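
For concreteness, the bracketing loop Avi mentions might look like this sketch; rdtsc(), now_ns() and the cycle bound are illustrative stand-ins, not an existing API:

#include <stdint.h>

extern uint64_t rdtsc(void);    /* stand-in for the TSC read */
extern uint64_t now_ns(void);   /* stand-in for ktime_get() */

/* Accept a (tsc, ns) calibration pair only when two TSC reads bracket
 * the clock read tightly, so an NMI/SMI in between cannot skew it. */
static int calibrate_pair(uint64_t *tsc_out, uint64_t *ns_out)
{
        int tries;

        for (tries = 0; tries < 10; tries++) {
                uint64_t t1 = rdtsc();
                uint64_t ns = now_ns();
                uint64_t t2 = rdtsc();

                if (t2 - t1 < 500) {            /* illustrative bound */
                        *tsc_out = t1 + (t2 - t1) / 2;
                        *ns_out = ns;
                        return 0;
                }
        }
        return -1;      /* persistently interrupted: give up */
}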

We can detect that, and besides doing the calculation only once, also export some
bit indicating that to the guest. Hmm... I'm thinking now that, because of
migration, we should check this bit every time, because we might have changed hosts.
So instead of using an expensive cpuid check, we should probably use some bit in
the vcpu_time_info structure, and use a cpuid bit just to say it is enabled.

Jeremy,

are you okay with turning one of the pad fields in the structure into a flags field?
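
For illustration, the layout such a change could give, modeled on pvclock_vcpu_time_info with one former pad byte turned into flags; the flag name follows the PVCLOCK_TSC_STABLE_BIT that eventually went upstream, but treat the details here as a sketch:

#include <stdint.h>

/* Shared guest/host time structure, modeled on pvclock_vcpu_time_info. */
struct pvclock_vcpu_time_info {
        uint32_t version;
        uint32_t pad0;
        uint64_t tsc_timestamp;
        uint64_t system_time;
        uint32_t tsc_to_system_mul;
        int8_t   tsc_shift;
        uint8_t  flags;         /* was part of the pad */
        uint8_t  pad[2];
} __attribute__((__packed__));

#define PVCLOCK_TSC_STABLE_BIT  (1 << 0)

/* Guest side: the host refreshes flags on every update, so checking it
 * on each read also catches a migration to a different host. */
static inline int pvclock_tsc_stable(const struct pvclock_vcpu_time_info *src)
{
        return src->flags & PVCLOCK_TSC_STABLE_BIT;
}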
From: Glauber Costa on
On Fri, Apr 16, 2010 at 01:36:34PM -0700, Jeremy Fitzhardinge wrote:
> On 04/15/2010 11:37 AM, Glauber Costa wrote:
> > In recent stress tests, it was found that pvclock-based systems
> > could seriously warp in SMP systems. Using Ingo's time-warp-test.c,
> > I could trigger a scenario as bad as 1.5mi warps a minute in some systems.
> >
>
> Is that "1.5 million"?
Yes it is.

But as I said, this seems to be a very deep worst-case scenario. Most boxes
are not even close to being that bad.

>
> > (to be fair, it wasn't that bad in most of them). Investigating further, I
> > found out that such warps were caused by the very offset-based calculation
> > pvclock is based on.
> >
>
> Is the problem that the tscs are starting out of sync, or that they're
> drifting relative to each other over time? Do the problems become worse
> the longer the uptime? How large are the offsets we're talking about here?
The offsets usually seem pretty small, under a microsecond. So I don't think
it has anything to do with tscs starting out of sync, especially because the
delta-based calculation has the exact purpose of handling that case.
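
For context, the offset-based calculation in question, roughly as pvclock computes it (simplified: the kernel uses wider intermediate arithmetic and a version/seqcount retry loop around the read):

#include <stdint.h>

/* Per-CPU time = snapshot ns + scaled TSC delta since the snapshot.
 * Warps appear when two CPUs' snapshots were taken at slightly
 * different moments, so their deltas disagree by a few ns. */
static uint64_t pvclock_delta_read(uint64_t system_time,      /* ns at snapshot */
                                   uint64_t tsc_timestamp,    /* TSC at snapshot */
                                   uint32_t tsc_to_system_mul,
                                   int8_t tsc_shift,
                                   uint64_t tsc_now)
{
        uint64_t delta = tsc_now - tsc_timestamp;

        if (tsc_shift >= 0)
                delta <<= tsc_shift;
        else
                delta >>= -tsc_shift;

        /* 32.32 fixed-point multiply; the kernel does this in 128 bits
         * to avoid overflow on large deltas. */
        return system_time + ((delta * tsc_to_system_mul) >> 32);
}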

