From: Darren Hart on
Thomas Gleixner wrote:
> On Tue, 6 Apr 2010, Ulrich Drepper wrote:
>
>> On Tue, Apr 6, 2010 at 12:31, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> We need to figure out a more efficient way to
>>> do the spinning in the kernel where we have all the necessary
>>> information already.
>> Really? The owner information isn't in general available in the
>> kernel. Futex operation doesn't require the value used to be the PID
>> (or negative of the PID). That is a dramatic limitation of the
>> usefulness of futexes.
>
> I know that you can do any weird stuff with the futex value, but I
> don't see the "dramatic" limitation. Care to elaborate ?
>
>> At userlevel there is access to other fields of the data structure
>> which can contain the owner information.
>>
>> I would like to see the method using a per-thread pinned page and an
>> update of a memory location on scheduling. For benchmarking at least.
>
> The per thread pinned page would be unconditional, right ?
>
> I agree that benchmarking would be interesting, but OTOH I fear that
> we open up a huge can of worms with exposing scheduler details and the
> related necessary syscalls like sys_yield_to: User space thread
> management/scheduling comes to my mind and I hope we agree that we do
> not want to revisit that.
>
>> I agree that a sys_yield_to() syscall would be at the very least
>> useful as well. But it's useful for other things already.
>
> Useful for what ?
>
> What are the exact semantics of such a syscall ?
>
> How does that fit into the various scheduling constraints ?


I believe this comes back to the discussions of a directed yield. The
idea being that a thread yields its remaining timeslice to a thread of
its choosing - usually because the target thread holds a resource the
yielding thread needs access to. This makes the yield more explicit, so
the yielding thread is more likely to get some benefit out of yielding.

I believe the arguments would be either a TID or a thread group -
however that is specified. I believe the KVM guys would like to see
something like this as well - which might be the "other things" referred
to above.
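
A very rough sketch of what such a call might look like from a waiter's
point of view (entirely hypothetical: no such syscall exists, and the
syscall number below is made up purely for illustration):

#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>

#define __NR_yield_to	4242	/* made-up syscall number, illustration only */

/* Donate the rest of our timeslice to the given thread. */
static int yield_to(pid_t target_tid)
{
	return syscall(__NR_yield_to, target_tid);
}

/*
 * A contended waiter that knows the owner's TID (e.g. from a PI-style
 * futex value) could then do something along these lines:
 */
static void boost_owner(pid_t owner_tid)
{
	if (yield_to(owner_tid) < 0) {
		/*
		 * Target not runnable (or gone): fall back to the normal
		 * blocking slow path, e.g. FUTEX_WAIT.
		 */
	}
}

Whether such a call should block when the target is not runnable, or
return an error so the caller can fall into the normal slow path, is one
of the semantic questions still open in this thread.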

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
From: Avi Kivity on
On 04/06/2010 10:31 PM, Thomas Gleixner wrote:
>
>> That gives you something along the lines of
>>
>>     runaddr = find_run_flag(lock);
>>     do {
>>         while(*runaddr == RUNNING) {
>>             if (trylock(lock))
>>                 return WHOOPEE;
>>             cpu relax
>>         }
>>         yield (_on(thread));
>>
> That would require a new yield_to_target() syscall, which either
> blocks the caller when the target thread is not runnable or returns an
> error code which signals to go into the slow path.
>

Note, directed yield is something that kvm wants for its own ends, as
an internal kernel API rather than a syscall, but it's apparently useful.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

From: john cooper on
Avi Kivity wrote:
> On 04/06/2010 07:14 PM, Thomas Gleixner wrote:
>>
>>> IMO the best solution is to spin in userspace while the lock holder is
>>> running, fall into the kernel when it is scheduled out.
>>>
>> That's just not realistic as user space has no idea whether the lock
>> holder is running or not and when it's scheduled out without a syscall :)
>>
>
> The kernel could easily expose this information by writing into the
> thread's TLS area.
>
> So:
>
> - the kernel maintains a current_cpu field in a thread's tls
> - lock() atomically writes a pointer to the current thread's current_cpu
> when acquiring
> - the kernel writes an invalid value to current_cpu when switching out
> - a contended lock() retrieves the current_cpu pointer, and spins as
> long as it is a valid cpu

There are certainly details to sort through in the packaging
of the mechanism but conceptually that should do the job.
So here the application has chosen a blocking lock as being
the optimal synchronization operation and we're detecting a
scenario where we can factor out the aggregate overhead of two
context switch operations.
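
A minimal userspace-side sketch of that scheme, assuming a
kernel-maintained current_cpu word in the thread's TLS area (which does
not exist today; every name below is illustrative), might look like:

#include <stdatomic.h>
#include <stdint.h>

#define CPU_INVALID	(-1)	/* assumed to be written by the kernel on switch-out */

extern __thread int32_t current_cpu;		/* assumed kernel-maintained TLS field */
extern void futex_wait_slowpath(void *uaddr);	/* stand-in for the blocking path */

struct adaptive_mutex {
	atomic_int		value;		/* 0 = unlocked, 1 = locked */
	_Atomic(int32_t *)	owner_cpu;	/* points at the owner's current_cpu */
};

static int try_lock(struct adaptive_mutex *m)
{
	int expected = 0;

	if (atomic_compare_exchange_strong(&m->value, &expected, 1)) {
		/* Publish where the owner runs so waiters can spin on it. */
		atomic_store(&m->owner_cpu, &current_cpu);
		return 1;
	}
	return 0;
}

static void lock(struct adaptive_mutex *m)
{
	while (!try_lock(m)) {
		int32_t *cpu = atomic_load(&m->owner_cpu);

		/* Spin only while the owner is actually on a CPU ... */
		if (cpu && *cpu != CPU_INVALID)
			continue;

		/* ... otherwise block instead of burning cycles. */
		futex_wait_slowpath(&m->value);
	}
}

The races around publishing owner_cpu, and what to do for shared
(cross-process) locks, are among the packaging details mentioned above.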

There is also the case where the application requires a
polled lock, with the rationale being that the assumed lock
hold/wait time is substantially less than the above context
switch overhead. But here we're otherwise completely
open to indiscriminate scheduling preemption even though
we may be holding a userland lock.

The adaptive mutex above is an optimization beyond what
is normally expected for the associated model. The preemption
of a polled lock OTOH can easily inflict latency several orders
of magnitude beyond what is expected in that model. Two use
cases exist here which IMO aren't related except for the latter
unintentionally degenerating into the former.

-john

--
john.cooper@third-harmonic.com
From: Darren Hart on
john cooper wrote:
> Avi Kivity wrote:
>> On 04/06/2010 07:14 PM, Thomas Gleixner wrote:
>>>> IMO the best solution is to spin in userspace while the lock holder is
>>>> running, fall into the kernel when it is scheduled out.
>>>>
>>> That's just not realistic as user space has no idea whether the lock
>>> holder is running or not and when it's scheduled out without a syscall :)
>>>
>> The kernel could easily expose this information by writing into the
>> thread's TLS area.
>>
>> So:
>>
>> - the kernel maintains a current_cpu field in a thread's tls
>> - lock() atomically writes a pointer to the current thread's current_cpu
>> when acquiring
>> - the kernel writes an invalid value to current_cpu when switching out
>> - a contended lock() retrieves the current_cpu pointer, and spins as
>> long as it is a valid cpu
>
> There are certainly details to sort through in the packaging
> of the mechanism but conceptually that should do the job.
> So here the application has chosen a blocking lock as being
> the optimal synchronization operation and we're detecting a
> scenario where we can factor out the aggregate overhead of two
> context switch operations.

I didn't intend to change the behavior of an existing blocking call by
adding adaptive spinning, if that is what you are getting at here. Initially
there would be a new futex op, something like FUTEX_LOCK_ADAPTIVE or
maybe just FUTEX_WAIT_ADAPTIVE. Applications can use this directly to
implement adaptive spinlocks. Ideally glibc would make use of this via
either the existing adaptive spinning NP API or via a new one. Before we
even go there, we need to see if this can provide a real benefit.
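
As a purely illustrative sketch (FUTEX_LOCK_ADAPTIVE does not exist and
the opcode value below is made up), an application might call such an op
along these lines:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdatomic.h>

#define FUTEX_LOCK_ADAPTIVE	32	/* hypothetical opcode, illustration only */

static void adaptive_lock(atomic_int *uaddr)
{
	pid_t tid = syscall(SYS_gettid);
	int expected = 0;

	/* Fast path: uncontended acquisition, cmpxchg our TID into the futex. */
	if (atomic_compare_exchange_strong(uaddr, &expected, tid))
		return;

	/* Contended: let the kernel spin adaptively and/or block for us. */
	syscall(SYS_futex, uaddr, FUTEX_LOCK_ADAPTIVE, 0, NULL, NULL, 0);
}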

>
> There is also the case where the application requires a
> polled lock, with the rationale being that the assumed lock
> hold/wait time is substantially less than the above context
> switch overhead.

Polled lock == userspace spinlock?

> But here we're otherwise completely
> open to indiscriminate scheduling preemption even though
> we may be holding a userland lock.

That's true with any userland lock.

> The adaptive mutex above is an optimization beyond what
> is normally expected for the associated model. The preemption
> of a polled lock OTOH can easily inflict latency several orders
> of magnitude beyond what is expected in that model. Two use
> cases exist here which IMO aren't related except for the latter
> unintentionally degenerating into the former.

Again, my intention is not to replace any existing functionality, so
applications would have to explicitly request this behavior.

If I'm missing your point, please elaborate.

Thanks,

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
From: Darren Hart on
drepper@gmail.com wrote:
> On Tue, Apr 6, 2010 at 16:16, Thomas Gleixner <tglx@linutronix.de> wrote:
>> I know that you can do any weird stuff with the futex value, but I
>> don't see the "dramatic" limitation. Care to elaborate ?
>
> If we have to fill in the PID we can represent only three states in a
> futex: 0, PID, -PID. Today we can represent 2^32 states. Quite a
> difference.

For general futexes sure, but not for robust or PI mutexes. Having
adaptive futexes be based on the TID|WAITERS_FLAG policy certainly isn't
breaking new ground, and is consistent with the other kernel-side futex
locking implementations.
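
For reference, the value policy used by the PI and robust futexes looks
roughly like this (the constants are real and come from <linux/futex.h>;
the helper is only a sketch of how such a value is decoded):

#include <linux/futex.h>	/* FUTEX_TID_MASK, FUTEX_WAITERS, FUTEX_OWNER_DIED */
#include <stdint.h>
#include <stdio.h>

/* Decode a TID-based futex value as used by PI/robust futexes. */
static void describe_futex_value(uint32_t val)
{
	if (val == 0) {
		printf("unlocked\n");
		return;
	}
	printf("owned by TID %u%s%s\n",
	       val & FUTEX_TID_MASK,
	       (val & FUTEX_WAITERS)    ? ", waiters pending" : "",
	       (val & FUTEX_OWNER_DIED) ? ", owner died"      : "");
}

An adaptive op could keep that same encoding, so a spinning waiter knows
whose run state to check.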

That said, I agree that a FUTEX_SET_WAIT_ADAPTIVE sort of call might be
very powerful. However, that might be just academic until I can show an
improvement with adaptive futexes.

>> The per thread pinned page would be unconditional, right ?
>
> Only if the process would be using these adaptive mutexes. It could be
> conditional.

What about the concern that this TLS approach only works for
process-private locks? I would very much like to make this work for both shared
and private locks.

Thanks,

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team