From: Alan Cox
> That works for the uncontended case. For the contended case, the waiter
> and the owner have to go into the kernel and back out to transfer
> ownership.

If you insist on doing it that way, yes. But knowing the lock owner is
likely to be away for a while also lets you punt the work: either pick
up another work package in the meantime, or queue the work you can't do
on a list that the pre-empted lock owner will review before dropping
the lock.
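
To make the second option concrete, here is a minimal C11 sketch (not
from the thread) of such a hand-off. The owner_on_cpu hint, the type
names, and the drain-on-unlock protocol are all assumptions made for
illustration: a contender that finds the owner off the CPU queues its
work item instead of waiting, and the holder drains the queue before
(and re-checks it after) releasing the lock so nothing is stranded.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct work {
        struct work *next;
        void (*fn)(void *);
        void *arg;
};

struct punt_lock {
        atomic_bool locked;              /* simple test-and-set lock word */
        atomic_bool owner_on_cpu;        /* hypothetical preemption hint  */
        _Atomic(struct work *) punted;   /* LIFO of handed-off work       */
};

static bool pl_trylock(struct punt_lock *l)
{
        return !atomic_exchange(&l->locked, true);
}

static void pl_drain(struct punt_lock *l)
{
        struct work *w = atomic_exchange(&l->punted, NULL);

        while (w) {
                struct work *next = w->next;

                w->fn(w->arg);
                w = next;
        }
}

static void pl_unlock(struct punt_lock *l)
{
        for (;;) {
                pl_drain(l);                      /* run punted work under the lock */
                atomic_store(&l->locked, false);  /* drop the lock */
                if (!atomic_load(&l->punted))
                        return;
                if (!pl_trylock(l))
                        return;                   /* the new holder will drain it */
        }
}

/* Run w under the lock, or hand it off if the owner is preempted. */
static void pl_run_or_punt(struct punt_lock *l, struct work *w)
{
        if (pl_trylock(l)) {
                w->fn(w->arg);
                pl_unlock(l);
                return;
        }

        if (!atomic_load(&l->owner_on_cpu)) {
                /* Owner is off the CPU: queue the work and go do
                 * something else instead of waiting for it. */
                w->next = atomic_load(&l->punted);
                while (!atomic_compare_exchange_weak(&l->punted, &w->next, w))
                        ;
                /* If the lock was released meanwhile, drain it ourselves
                 * so the item cannot be stranded. */
                if (pl_trylock(l))
                        pl_unlock(l);
                return;
        }

        /* Owner is running: spin or block as usual (omitted). */
}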

From: Thomas Gleixner
On Tue, 6 Apr 2010, Alan Cox wrote:

> > > IMO the best solution is to spin in userspace while the lock holder is
> > > running, fall into the kernel when it is scheduled out.
> >
> > That's just not realistic: without a syscall, user space has no idea
> > whether the lock holder is running or when it gets scheduled out :)
>
> Which is the real problem that wants addressing and can be addressed very
> cheaply. That would bring us up to par with 1970s RTOS environments ;)

Well, 1970s RTOSes had other features as well, like preemption-disable
mechanisms and other interesting stuff. I hope you're not going to
propose that next to bring us up to par with those :)

Thanks,

tglx
From: Thomas Gleixner
On Tue, 6 Apr 2010, Alan Cox wrote:
> > Do you feel some of these situations would also benefit from some kernel
> > assistance to stop spinning when the owner schedules out? Or are you
> > saying that there are situations where pure userspace spinlocks will
> > always be the best option?
>
> There are cases where it's the best option - you are assuming, for
> example, that the owner can get scheduled out. E.g. nailing one thread
> per CPU in some specialist high-performance situations means they can't.

Fair enough, but that's not the problem Darren is targeting.

> > If the latter, I'd think that they would also be situations where
> > sched_yield() is not used as part of the spin loop. If so, then these
> > are not our target situations for FUTEX_LOCK_ADAPTIVE, which hopes to
> > provide a better informed mechanism for making spin or sleep decisions.
> > If sleeping isn't part of the locking construct implementation, then
> > FUTEX_LOCK_ADAPTIVE doesn't have much to offer.
>
> I am unsure about the approach. As Avi says knowing that the lock owner is
> scheduled out allows for far better behaviour. It doesn't need complex
> per lock stuff or per lock notifier entries on pre-empt either.
>
> A given task is either pre-empted or not, and in the normal case you
> need this within a process, so you've got shared pages anyway. So you
> only need one instance of the 'is thread X pre-empted' bit somewhere in
> a non-swappable page.

I fear we might end up with a pinned page per thread to get this
working properly, and it restricts the mechanism to process-private
locks.

> That gives you something along the lines of
>
> runaddr = find_run_flag(lock);
> do {
>         while (*runaddr == RUNNING) {
>                 if (trylock(lock))
>                         return WHOOPEE;
>                 cpu_relax();
>         }
>         yield_on(thread);

That would require a new yield_to_target() syscall, which either
blocks the caller when the target thread is not runnable or returns an
error code which signals to go into the slow path.

> } while (*runaddr != DEAD);
>
>
> which, unlike blindly spinning, can avoid the worst of any hit on CPU
> power and would be a bit more guided?

I doubt that the syscall overhead per se is large enough to justify
all of the above horror. We need to figure out a more efficient way to
do the spinning in the kernel where we have all the necessary
information already. Darren's implementation is suboptimal AFAICT.

Thanks,

tglx
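
For reference, a compilable rendering of the loop Alan sketches above
could look like the following. The shared run flag (and whatever keeps
it up to date) is assumed to exist, the names are invented, and
sched_yield() merely stands in for the yield_to_target()-style facility
that is being debated:

#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

enum run_state { RUNNING, PREEMPTED, DEAD };

struct ulock {
        atomic_flag held;               /* the lock word itself (ATOMIC_FLAG_INIT) */
        atomic_int *owner_state;        /* hypothetical shared run flag */
};

static bool trylock(struct ulock *l)
{
        return !atomic_flag_test_and_set_explicit(&l->held,
                                                  memory_order_acquire);
}

static void unlock(struct ulock *l)
{
        atomic_flag_clear_explicit(&l->held, memory_order_release);
}

static void cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
        __builtin_ia32_pause();
#else
        __asm__ __volatile__("" ::: "memory");
#endif
}

/* Returns true once the lock is taken, false if the owner died. */
static bool adaptive_lock(struct ulock *l)
{
        atomic_int *runaddr = l->owner_state;   /* find_run_flag(lock) */

        do {
                while (atomic_load(runaddr) == RUNNING) {
                        if (trylock(l))
                                return true;    /* WHOOPEE */
                        cpu_relax();
                }
                /*
                 * Owner is off the CPU: stop burning cycles.  This is
                 * where a yield_to_target() syscall or a futex wait
                 * would go; plain sched_yield() is only a placeholder.
                 */
                sched_yield();
        } while (atomic_load(runaddr) != DEAD);

        return false;
}
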
From: Darren Hart
Darren Hart wrote:
> Avi Kivity wrote:
>
>>>> At 10%
>>>> duty cycle you have 25 waiters behind the lock on average. I don't
>>>> think this is realistic, and it means that spinning is invoked only
>>>> rarely.
>>>
>>> Perhaps some instrumentation is in order; it seems to get invoked
>>> often enough to achieve some 20% increase in lock/unlock iterations.
>>> Perhaps another metric would be of more value - such as average wait
>>> time?
>>
>> Why measure an unrealistic workload?
>
> No argument there, thus my proposal for an alternate configuration below.
>
>>>> I'd be interested in seeing runs where the average number of waiters
>>>> is 0.2, 0.5, 1, and 2, corresponding to moderate-to-bad contention.
>>>> 25 average waiters on compute bound code means the application needs
>>>> to be rewritten, no amount of mutex tweaking will help it.
>>>
>>> Perhaps something with NR_CPUS threads would be of more interest?
>>
>> That seems artificial.
>
> How so? Several real world applications use one thread per CPU to
> dispatch work to, wait for events, etc.
>
>>
>>> At 10% that's about 0.8, and at 25% it's the 2 at your upper limit. I
>>> could add a few more duty-cycle points and make 25% the max. I'll kick
>>> that off and post the results... probably tomorrow; 10M iterations
>>> takes a while, but it makes the results relatively stable.
>>
>> Thanks. But why not vary the number of threads as well?
>
> Absolutely, I don't disagree that all the variables should vary in order
> to get a complete picture. I'm starting with 8 - it takes several hours
> to collect the data.
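
(For reference, the waiter counts being traded back and forth appear to
follow from a simple expectation: with N threads each holding the lock
for a fraction d of their time, roughly N * d of them contend at any
instant, so

    256 threads * 0.10 ~= 25.6   (presumably the "25 waiters" case above)
      8 threads * 0.10  = 0.8
      8 threads * 0.25  = 2.0

which is where the 0.8 and 2 figures for the 8-thread runs come from.)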

While this might be of less interest after today's discussion, I
promised to share the results of a run with 8 threads with a wider
selection of lower duty-cycles. The results are very poor for adaptive
and worse for aas (multiple spinners) compared to normal FUTEX_LOCK. As
Thomas and Peter have pointed out, the implementation is sub-optimal.
Before abandoning this approach I will see if I can find the bottlenecks
and simplify the kernel side of things. My impression is that I am doing
a lot more work in the kernel, especially in the adaptive loop, than is
really necessary.

Both the 8 and 256 Thread plots can be viewed here:

http://www.kernel.org/pub/linux/kernel/people/dvhart/adaptive_futex/v4/

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
From: Thomas Gleixner
On Tue, 6 Apr 2010, Ulrich Drepper wrote:

> On Tue, Apr 6, 2010 at 12:31, Thomas Gleixner <tglx@linutronix.de> wrote:
> > We need to figure out a more efficient way to
> > do the spinning in the kernel where we have all the necessary
> > information already.
>
> Really? The owner information isn't in general available in the
> kernel. Futex operation doesn't require the value used to be the PID
> (or the negative of the PID). Requiring that would be a dramatic
> limitation of the usefulness of futexes.

I know that you can do any weird stuff with the futex value, but I
don't see the "dramatic" limitation. Care to elaborate ?
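
For background, the owner information Thomas refers to is available
when the futex word follows the PI-futex convention, i.e. it holds the
owner's TID. A rough sketch of that fast-path/slow-path split (error
handling and waiter-bit details omitted):

#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Futex word: 0 = unlocked, otherwise the owner's TID, possibly with
 * FUTEX_WAITERS (and FUTEX_OWNER_DIED) ORed in by the kernel. */

static void lock_pi(atomic_uint *uaddr)
{
        unsigned int tid = (unsigned int)syscall(SYS_gettid);
        unsigned int expected = 0;

        /* Fast path: 0 -> TID in user space, no syscall needed. */
        if (atomic_compare_exchange_strong(uaddr, &expected, tid))
                return;

        /* Contended: the kernel can identify (and boost) the owner
         * precisely because the word contains the owner's TID. */
        syscall(SYS_futex, uaddr, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
}

static void unlock_pi(atomic_uint *uaddr)
{
        unsigned int expected = (unsigned int)syscall(SYS_gettid);

        /* Fast path: TID -> 0 only works if no waiter bits are set. */
        if (atomic_compare_exchange_strong(uaddr, &expected, 0))
                return;

        /* Waiters present: let the kernel hand the lock over. */
        syscall(SYS_futex, uaddr, FUTEX_UNLOCK_PI, 0, NULL, NULL, 0);
}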

> At userlevel there is access to other fields of the data structure
> which can contain the owner information.
>
> I would like to see the method using a per-thread pinned page and an
> update of a memory location on scheduling. For benchmarking at least.

The per thread pinned page would be unconditional, right ?

I agree that benchmarking would be interesting, but OTOH I fear that
we open up a huge can of worms by exposing scheduler details and the
related necessary syscalls like sys_yield_to: user-space thread
management/scheduling comes to mind, and I hope we agree that we do
not want to revisit that.

> I agree that a sys_yield_to() syscall would be at the very least
> useful as well. But it's useful for other things already.

Useful for what ?

What are the exact semantics of such a syscall ?

How does that fit into the various scheduling constraints ?

Thanks,

tglx