Robust TSC compensation [Kernel]


From: Rik van Riel on 14 Jul 2010 18:40

On 07/12/2010 10:25 PM, Zachary Amsden wrote:
> Make the match of TSC find TSC writes that are close to each other
> instead of perfectly identical; this allows the compensator to also
> work in migration / suspend scenarios.
>
> Signed-off-by: Zachary Amsden <zamsden(a)redhat.com>

I don't see a real alternative, so ...

Reviewed-by: Rik van Riel <riel(a)redhat.com>

--
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Avi Kivity on 18 Jul 2010 11:00

On 07/13/2010 05:25 AM, Zachary Amsden wrote:
> Make the match of TSC find TSC writes that are close to each other
> instead of perfectly identical; this allows the compensator to also
> work in migration / suspend scenarios.
>
>

What scenario exactly?

> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -926,21 +926,27 @@ void guest_write_tsc(struct kvm_vcpu *vcpu, u64 data)
>          struct kvm *kvm = vcpu->kvm;
>          u64 offset, ns, elapsed;
>          struct timespec ts;
> +        s64 sdiff;
>
>          spin_lock(&kvm->arch.tsc_write_lock);
>          offset = data - native_read_tsc();
>          ns = get_kernel_ns();
>          elapsed = ns - kvm->arch.last_tsc_nsec;
> +        sdiff = data - kvm->arch.last_tsc_write;
> +        if (sdiff < 0)
> +                sdiff = -sdiff;
>
>          /*
> -         * Special case: identical write to TSC within 5 seconds of
> +         * Special case: close write to TSC within 5 seconds of
>           * another CPU is interpreted as an attempt to synchronize
> -         * (the 5 seconds is to accomodate host load / swapping).
> +         * The 5 seconds is to accomodate host load / swapping as
> +         * well as any reset of TSC during the boot process.
>           *
>           * In that case, for a reliable TSC, we can match TSC offsets,
> -         * or make a best guest using kernel_ns value.
> +         * or make a best guest using elapsed value.
>           */
> -        if (data == kvm->arch.last_tsc_write && elapsed < 5ULL * NSEC_PER_SEC) {
> +        if (sdiff < nsec_to_cycles(5ULL * NSEC_PER_SEC) &&
> +            elapsed < 5ULL * NSEC_PER_SEC) {
>                  if (!check_tsc_unstable()) {
>                          offset = kvm->arch.last_tsc_offset;
>                          pr_debug("kvm: matched tsc offset for %llu\n", data);
>

Don't we have to adjust the offset by the required difference between the
TSCs? Or do we assume that if the guest wrote close enough values, it is
trying to cleverly compensate for IPI latency?

--
error compiling committee.c: too many arguments to function


From: Zachary Amsden on 19 Jul 2010 16:50

On 07/18/2010 04:52 AM, Avi Kivity wrote:
> On 07/13/2010 05:25 AM, Zachary Amsden wrote:
>> Make the match of TSC find TSC writes that are close to each other
>> instead of perfectly identical; this allows the compensator to also
>> work in migration / suspend scenarios.
>>
>
> What scenario exactly?

After migration, qemu will write back MSRs, including TSC, to the VCPUs.
They won't have exactly matching values, because they get read out at
different times (actually, because the TSC for the VCPUs never stops,
they can have wildly different values if there was some host overload /
swap / suspend event).

When restarting the CPUs, qemu will try to write back the TSC and then
we end up desynchronizing the system.

It's an ugly problem, and this is an ugly solution.

Better would be to "stop" the VCPUs (requires some kernel
synchronization to determine TSC stop point), or to simply take the
maximum TSC in qemu and write that to all of the CPUs (this assumes the
guest wants to have TSCs in sync at all).

Both methods have to assume small deltas in TSC are unintentional
effects in order to correctly resynchronize.
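
To illustrate the second option, here is a minimal sketch of "take the
maximum and write it everywhere". It is not code from the patch or from
qemu: the function names and the plain array of saved per-VCPU TSC values
are hypothetical, and a real implementation would go through qemu's MSR
save/restore path. Picking the maximum presumably keeps any VCPU from
seeing its TSC move backwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: largest saved TSC value among n VCPUs (n > 0). */
static uint64_t max_saved_tsc(const uint64_t *saved, size_t n)
{
    assert(saved != NULL && n > 0);

    uint64_t max = saved[0];
    for (size_t i = 1; i < n; i++) {
        if (saved[i] > max)
            max = saved[i];
    }
    return max;
}

/*
 * Hypothetical helper: overwrite every saved value with the maximum, so
 * that all VCPUs are restored with the same TSC.  This assumes the guest
 * wants its TSCs in sync at all.
 */
static void unify_saved_tsc(uint64_t *saved, size_t n)
{
    uint64_t max = max_saved_tsc(saved, n);

    for (size_t i = 0; i < n; i++)
        saved[i] = max;
}
```

With saved values {100, 250, 175}, every entry becomes 250 before the
values are written back.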

>
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -926,21 +926,27 @@ void guest_write_tsc(struct kvm_vcpu *vcpu, u64 data)
>>          struct kvm *kvm = vcpu->kvm;
>>          u64 offset, ns, elapsed;
>>          struct timespec ts;
>> +        s64 sdiff;
>>
>>          spin_lock(&kvm->arch.tsc_write_lock);
>>          offset = data - native_read_tsc();
>>          ns = get_kernel_ns();
>>          elapsed = ns - kvm->arch.last_tsc_nsec;
>> +        sdiff = data - kvm->arch.last_tsc_write;
>> +        if (sdiff < 0)
>> +                sdiff = -sdiff;
>>
>>          /*
>> -         * Special case: identical write to TSC within 5 seconds of
>> +         * Special case: close write to TSC within 5 seconds of
>>           * another CPU is interpreted as an attempt to synchronize
>> -         * (the 5 seconds is to accomodate host load / swapping).
>> +         * The 5 seconds is to accomodate host load / swapping as
>> +         * well as any reset of TSC during the boot process.
>>           *
>>           * In that case, for a reliable TSC, we can match TSC offsets,
>> -         * or make a best guest using kernel_ns value.
>> +         * or make a best guest using elapsed value.
>>           */
>> -        if (data == kvm->arch.last_tsc_write && elapsed < 5ULL * NSEC_PER_SEC) {
>> +        if (sdiff < nsec_to_cycles(5ULL * NSEC_PER_SEC) &&
>> +            elapsed < 5ULL * NSEC_PER_SEC) {
>>                  if (!check_tsc_unstable()) {
>>                          offset = kvm->arch.last_tsc_offset;
>>                          pr_debug("kvm: matched tsc offset for %llu\n", data);
>
> Don't we have to adjust the offset by the required difference between the
> TSCs? Or do we assume that if the guest wrote close enough values, it is
> trying to cleverly compensate for IPI latency?
>

No, we have to assume that any small difference (small being defined as
less than 5 seconds' worth of cycles) is unintentional. It's not perfect
and is certainly error-prone (without one of the two assists from qemu
that I mention above).
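
For reference, the matching rule can be modelled outside the kernel
roughly as below. This is a standalone sketch, not the patch itself:
ns_to_cycles() is a hypothetical stand-in for nsec_to_cycles() that
assumes a fixed TSC rate given in kHz, tsc_write_matches() is an invented
name, and the absolute difference is computed on unsigned values rather
than by negating an s64 as the patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC    1000000000ULL
#define MATCH_WINDOW_NS (5ULL * NSEC_PER_SEC)

/*
 * Hypothetical stand-in for nsec_to_cycles(): assumes a constant TSC
 * frequency in kHz.  cycles = ns * (khz * 1000) / 1e9 = ns * khz / 1e6.
 */
static uint64_t ns_to_cycles(uint64_t ns, uint64_t tsc_khz)
{
    assert(tsc_khz > 0);
    return ns * tsc_khz / 1000000ULL;
}

/*
 * A write is treated as a synchronization attempt when it lands within
 * 5 seconds' worth of cycles of the previous write and less than
 * 5 seconds of wall-clock time have elapsed since that write.
 */
static bool tsc_write_matches(uint64_t data, uint64_t last_write,
                              uint64_t elapsed_ns, uint64_t tsc_khz)
{
    /* Absolute distance between the two written values. */
    uint64_t diff = data > last_write ? data - last_write
                                      : last_write - data;

    return diff < ns_to_cycles(MATCH_WINDOW_NS, tsc_khz) &&
           elapsed_ns < MATCH_WINDOW_NS;
}
```

At an assumed 1 GHz TSC the cycle window is 5,000,000,000: two writes
1000 cycles apart and one second apart match, while a write exactly
5,000,000,000 cycles away does not.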

I think qemu should probably take the maximum TSC and apply it to all VCPUs.

Zach
