From: H. Peter Anvin on
On 02/17/2010 03:42 AM, Luca Barbieri wrote:
> This patch uses SSE movlps to perform 64-bit atomic reads and writes.
>
> According to the Intel manuals, all aligned 64-bit reads and writes are
> atomic, which should include movlps.
>
> To do this, we need to disable preempt, clts if TS was set, and
> restore TS.
>
> If we don't need to change TS, using SSE is much faster.
>
> Otherwise, it should be essentially even, with the fastest method
> depending on the specific architecture.
>
> Another important point is that with SSE atomic64_read can keep the
> cacheline in shared state.
>
> If we could keep TS off and reenable it when returning to userspace,
> this would be even faster, but this is left for a later patch.
>
> We use SSE because we can save just the low part of %xmm0, whereas using
> the FPU or MMX requires at least saving the FPU environment, and seems
> impossible to do fast.
>
> Signed-off-by: Luca Barbieri <luca(a)luca-barbieri.com>

I'm a bit unhappy about this patch. It seems to violate the assumption
that we only ever use the FPU state guarded by
kernel_fpu_begin()..kernel_fpu_end() and instead it uses a local hack,
which seems like a recipe for all kinds of very subtle problems down the
line.

Unless the performance advantage is provably very compelling, I'm
inclined to say that this is not worth it.

-hpa
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Luca Barbieri on
> I'm a bit unhappy about this patch. It seems to violate the assumption
> that we only ever use the FPU state guarded by
> kernel_fpu_begin()..kernel_fpu_end() and instead it uses a local hack,
> which seems like a recipe for all kinds of very subtle problems down the
> line.

kernel_fpu_begin saves the whole FPU state, but to use SSE we don't
really need that, since we can just save the %xmm registers we need,
which is much faster.
This is why SSE is used instead of just using an FPU double read.
We could however add a kernel_sse_begin_nosave/kernel_sse_end_nosave to do this.

> Unless the performance advantage is provably very compelling, I'm
> inclined to say that this is not worth it.
There is the advantage of not taking the cacheline for writing in atomic64_read.
Also, locked cmpxchg8b is slow, and if we were to restore the TS flag
lazily on return to userspace, it would significantly improve the
function in all cases (with the current code, it depends on how fast
the architecture does clts/stts vs. lock cmpxchg8b).
Of course the big-picture impact depends on the users of the interface.

Anyway, feel free to ignore this patch for now (and the next one as well).
From: H. Peter Anvin on
On 02/17/2010 04:41 PM, Luca Barbieri wrote:
>> I'm a bit unhappy about this patch. It seems to violate the assumption
>> that we only ever use the FPU state guarded by
>> kernel_fpu_begin()..kernel_fpu_end() and instead it uses a local hack,
>> which seems like a recipe for all kinds of very subtle problems down the
>> line.
>
> kernel_fpu_begin saves the whole FPU state, but to use SSE we don't
> really need that, since we can just save the %xmm registers we need,
> which is much faster.
> This is why SSE is used instead of just using an FPU double read.
> We could however add a kernel_sse_begin_nosave/kernel_sse_end_nosave to do this.
>

We could, and that would definitely be better than open-coding the operation.

>> Unless the performance advantage is provably very compelling, I'm
>> inclined to say that this is not worth it.
> There is the advantage of not taking the cacheline for writing in atomic64_read.
> Also, locked cmpxchg8b is slow, and if we were to restore the TS flag
> lazily on return to userspace, it would significantly improve the
> function in all cases (with the current code, it depends on how fast
> the architecture does clts/stts vs. lock cmpxchg8b).
> Of course the big-picture impact depends on the users of the interface.

It does, and I would prefer to not take it until there is a user of the
interface which motivates the performance. Ingo, do you have a feel for
how performance-critical this actually is?

-hpa

From: Andi Kleen on
Luca Barbieri <luca(a)luca-barbieri.com> writes:

> This patch uses SSE movlps to perform 64-bit atomic reads and writes.

You seem to have forgotten to add benchmark results that show this is
actually worthwhile. And is there really any user on 32-bit
that needs a 64-bit atomic_t?

I have to agree with Peter that it is very likely a bad idea
to mess with the FPU state like this.

I'm also suspicious of your use of global register variables.
This means they won't be saved on entry/exit of the functions.
Does that really work?

It's also not a widely used gcc feature, and those are always
somewhat risky.

-Andi
--
ak(a)linux.intel.com -- Speaking for myself only.
From: Luca Barbieri on
Anyway, how about the previous patches?

The SSE part is not essential and can be delayed.