From: Valdis.Kletnieks on
On Tue, 01 Jun 2010 19:52:28 +0300, Avi Kivity said:
> On 06/01/2010 07:38 PM, Andi Kleen wrote:
> >>> Your new code would starve again, right?
> > Try it on a NUMA system with unfair memory.

> We are running everything on NUMA (since all modern machines are now
> NUMA). At what scale do the issues become observable?

My 6-month-old laptop is NUMA? Comes as a surprise to me, and to the
perfectly-running NUMA=n kernel I'm running.

Or did you mean a less broad phrase than "all modern machines"?

From: john cooper on
Avi Kivity wrote:
> On 06/01/2010 07:38 PM, Andi Kleen wrote:
>>>> Your new code would starve again, right?
>>>>
>>>>
>>> Yes, of course it may starve with an unfair spinlock. Since vcpus are
>>> not always running, there is a much smaller chance that a vcpu on a
>>> remote memory node will starve forever. Old kernels with unfair
>>> spinlocks run fine in VMs on NUMA machines under various loads.
>>>
>> Try it on a NUMA system with unfair memory.
>>
>
> We are running everything on NUMA (since all modern machines are now
> NUMA). At what scale do the issues become observable?
>
>>> I understand that reason and do not propose going back to the old
>>> spinlock on physical HW! But with virtualization the performance hit
>>> is unbearable.
>>>
>> Extreme unfairness can be unbearable too.
>>
>
> Well, the question is what happens first. In our experience, vcpu
> overcommit is a lot more painful. People will never see the NUMA
> unfairness issue if they can't use kvm due to the vcpu overcommit problem.

Gleb's observed performance hit seems to be a rather mild
throughput depression compared with creating a worst case by
enforcing vcpu overcommit. Running a single guest with 2:1
overcommit on a 4 core machine I saw over an order of magnitude
slowdown vs. 1:1 commit with the same kernel build test.
Others have reported similar results.

How close you'll get to that scenario depends on host scheduling
dynamics and, statistically, on how many lock-holding paths are open
and stalled with waiters contending behind them. So I'd expect to see
quite variable numbers for guest-to-guest aggravation of this problem.

> What I'd like to see eventually is a short-term-unfair, long-term-fair
> spinlock. Might make sense for bare metal as well. But it won't be
> easy to write.

Collecting the contention/usage statistics on a per-spinlock
basis seems complex. I believe a practical approximation to
this is the adaptive mutex: upon hitting a spin-time threshold,
punt and let the scheduler reconcile fairness.
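
A toy sketch of that shape (kernel-style, but the lock layout,
SPIN_BUDGET, and adaptive_lock_acquire() are all invented here for
illustration -- this is not an existing kernel API):

#define SPIN_BUDGET	1024

struct adaptive_lock {
	atomic_t locked;	/* 0 = free, 1 = held */
};

static void adaptive_lock_acquire(struct adaptive_lock *l)
{
	int spins = 0;

	/* a 0 -> 1 transition means we now own the lock */
	while (atomic_cmpxchg(&l->locked, 0, 1) != 0) {
		if (++spins < SPIN_BUDGET) {
			cpu_relax();	/* short term: spin as usual */
		} else {
			spins = 0;
			yield();	/* threshold hit: punt to the scheduler */
		}
	}
}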

-john

--
john.cooper(a)third-harmonic.com
From: Andi Kleen on
> Collecting the contention/usage statistics on a per-spinlock
> basis seems complex. I believe a practical approximation to
> this is the adaptive mutex: upon hitting a spin-time threshold,
> punt and let the scheduler reconcile fairness.

That would probably work, except: how do you get the
adaptive spinlock into a paravirt op without slowing
down a standard kernel?
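
The usual shape of the tension, as a sketch (all names invented for
illustration; this is not the real paravirt interface): an ops table
gives the hypervisor a hook, but it turns every acquisition into an
indirect call, even on bare metal, unless the call sites get patched
at boot.

struct my_spinlock;

void native_spin_lock(struct my_spinlock *l);
void native_spin_unlock(struct my_spinlock *l);
void adaptive_spin_lock(struct my_spinlock *l);
void adaptive_spin_unlock(struct my_spinlock *l);

struct my_lock_ops {
	void (*lock)(struct my_spinlock *l);
	void (*unlock)(struct my_spinlock *l);
};

/* bare-metal default: the plain ticket lock, now behind an
 * indirect call -- which is exactly the overhead in question */
static struct my_lock_ops lock_ops = {
	.lock	= native_spin_lock,
	.unlock	= native_spin_unlock,
};

/* a guest kernel would swap in the adaptive variants at boot */
static void setup_adaptive_locks(void)
{
	lock_ops.lock	= adaptive_spin_lock;
	lock_ops.unlock	= adaptive_spin_unlock;
}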

-Andi
--
ak(a)linux.intel.com -- Speaking for myself only.
From: Eric Dumazet on
On Tuesday, 01 June 2010 at 19:52 +0300, Avi Kivity wrote:

> What I'd like to see eventually is a short-term-unfair, long-term-fair
> spinlock. Might make sense for bare metal as well. But it won't be
> easy to write.
>

This thread rings a bell here :)

Yes, ticket spinlocks are sometimes slower, especially in workloads where
a spinlock needs to be taken several times to handle one unit of work and
many cpus are competing.

We currently have a similar problem in the network stack, and we have a
patch that speeds up the xmit path by an order of magnitude by letting
one cpu (the consumer cpu) get unfair access to the (ticket) spinlock.
(It competes with no more than one other cpu.)

Boost from ~50,000 to ~600,000 pps on a dual quad core machine (E5450
@3.00GHz) on a particular workload (many cpus wanting to xmit their
packets).

( patch : http://patchwork.ozlabs.org/patch/53163/ )


Would it be possible to write such a generic beast with a cascade of
regular ticket spinlocks?

One ticket spinlock at the first stage (taken only if some conditions are
met, aka the slow path), then a 'primary' spinlock at the second stage.


// generic implementation
// (x86 could use 16bit fields for users_in & users_out)
struct cascade_lock {
	atomic_t	users_in;
	int		users_out;
	spinlock_t	primlock;
	spinlock_t	slowpathlock; // could be outside of this structure, shared by many 'cascade_locks'
};

/*
 * In the kvm case, you might call the hypervisor when slowpathlock is
 * about to be taken ?
 * When a cascade lock is unlocked and relocked right after, this cpu has
 * unfair priority and could get the lock before cpus blocked in
 * slowpathlock (especially if a hypervisor call was done).
 *
 * In the network xmit path, the dequeue thread would use highprio_user=true
 * mode. The 'contended' enqueueing thread would set a negative threshold,
 * to force a 'lowprio_user' mode.
 */
void cascade_lock(struct cascade_lock *l, bool highprio_user, int threshold)
{
	bool slowpath = false;

	atomic_inc(&l->users_in); // no real need for atomic_inc_return()
	if (atomic_read(&l->users_in) - l->users_out > threshold && !highprio_user) {
		spin_lock(&l->slowpathlock);
		slowpath = true;
	}
	spin_lock(&l->primlock);
	if (slowpath)
		spin_unlock(&l->slowpathlock);
}

void cascade_unlock(struct cascade_lock *l)
{
	l->users_out++;
	spin_unlock(&l->primlock);
}
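
Usage in the xmit case would then look something like this (txq->clock
is a hypothetical embedding, just to show the two modes from the
comment block above):

	/* dequeue (consumer) thread: unfair, high-priority access */
	cascade_lock(&txq->clock, true, 0);
	/* ... drain the queue ... */
	cascade_unlock(&txq->clock);

	/* contended enqueue path: a negative threshold forces the
	 * slow path ('lowprio_user' mode) */
	cascade_lock(&txq->clock, false, -1);
	/* ... queue the packet ... */
	cascade_unlock(&txq->clock);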



From: Avi Kivity on
On 06/01/2010 08:39 PM, Valdis.Kletnieks(a)vt.edu wrote:
>> We are running everything on NUMA (since all modern machines are now
>> NUMA). At what scale do the issues become observable?
>>
> My 6-month-old laptop is NUMA? Comes as a surprise to me, and to the
> perfectly-running NUMA=n kernel I'm running.
>
> Or did you mean a less broad phrase than "all modern machines"?
>
>

All modern two socket and above boards, sorry.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
