x86-32: use SSE for atomic64_read/set if available [Kernel]

Prev: boot times, not mdadm (was: Linux mdadm superblock question.)
Next: [PATCH 1/8] PCI PM: Add function for checking PME status of devices

From: Avi Kivity on 18 Feb 2010 05:00

On 02/18/2010 02:47 AM, H. Peter Anvin wrote:
>
>>> Unless the performance advantage is provably very compelling, I'm
>>> inclined to say that this is not worth it.
>>>
>> There is the advantage of not taking the cacheline for writing in atomic64_read.
>> Also locked cmpxchg8b is slow and if we were to restore the TS flag
>> lazily on userspace return, it would significantly improve the
>> function in all cases (with the current code, it depends on how fast
>> the architecture does clts/stts vs lock cmpxchg8b).
>> Of course the big-picture impact depends on the users of the interface.
>>
> It does, and I would prefer to not take it until there is a user of the
> interface which motivates the performance. Ingo, do you have a feel for
> how performance-critical this actually is?
>

One heavy user is set_64() in the pagetable code. That's already in an
expensive operation due to the page fault so the impact will be quite
low, probably.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Luca Barbieri on 18 Feb 2010 05:10

> One heavy user is set_64() in the pagetable code. �That's already in an
> expensive operation due to the page fault so the impact will be quite low,
> probably.
It currently does not use the atomic64_t infrastructure and thus won't
be affected currently, but can very easily be converted to cast the
pointer to atomic64_t* and use atomic64_set.

I think we set ptes in other places than the page fault handler.
Is any of them performance critical?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Luca Barbieri on 18 Feb 2010 05:30

> CR changes are slow and synchronize the CPU. The later is always slow.
>
> It sounds like you didn't time it?
I didn't, because I think it strongly depends on the microarchitecture
and I don't have a comprehensive set of machines to test on, so it
would just be a single data point.

The lock prefix on cmpxchg8b is also serializing so it might be as bad.

Anyway, if we use this, we should keep TS cleared in kernel mode and
lazily restore it on return to userspace.
This would make clts/stts performance mostly moot.

I agree that this feature would need to added too before putting the
SSE atomic64 code in a released kernel.

> It'll generate worse code because gcc can't use these registers
> at all in the C code. Some gcc versions also tend to give up when they run
> out of registers too badly.
Yes, but the C implementations are small and simple, and are only used
on 386/486.
Furthermore, the data in the global register variables is the main
input to the computation.

> So why don't you simply use normal asm inputs/outputs?
I do, on the caller side.

In the callee, I don't see any other robust way to implement parameter
passing in ebx/esi other than global register variables (without
resorting to pure assembly, which would prevent reusing the generic
atomic64 implementation).
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Luca Barbieri on 18 Feb 2010 06:00

On Thu, Feb 18, 2010 at 11:25 AM, Peter Zijlstra <peterz(a)infradead.org> wrote:
> On Wed, 2010-02-17 at 12:42 +0100, Luca Barbieri wrote:
>> +DEFINE_PER_CPU_ALIGNED(struct sse_atomic64_percpu, sse_atomic64_percpu);
>> +
>> +/* using the fpu/mmx looks infeasible due to the need to save the FPU environment, which is very slow
>> + * SSE2 is slightly slower on Core 2 and less compatible, so avoid it for now
>> + */
>> +long long sse_atomic64_read_cx8call(long long dummy, const atomic64_t *v)
>> +{
>> + � � � long long res;
>> + � � � unsigned long cr0 = 0;
>> + � � � struct thread_info *me = current_thread_info();
>> + � � � preempt_disable();
>> + � � � if (!(me->status & TS_USEDFPU)) {
>> + � � � � � � � cr0 = read_cr0();
>> + � � � � � � � if (cr0 & X86_CR0_TS)
>> + � � � � � � � � � � � clts();
>> + � � � }
>> + � � � asm volatile(
>> + � � � � � � � � � � � "movlps %%xmm0, " __percpu_arg(0) "\n\t"
>> + � � � � � � � � � � � "movlps %3, %%xmm0\n\t"
>> + � � � � � � � � � � � "movlps %%xmm0, " __percpu_arg(1) "\n\t"
>> + � � � � � � � � � � � "movlps " __percpu_arg(0) ", %%xmm0\n\t"
>> + � � � � � � � � � � � � � : "+m" (per_cpu__sse_atomic64_percpu.xmm0_low), "=m" (per_cpu__sse_atomic64_percpu.low), "=m" (per_cpu__sse_atomic64_percpu.high)
>> + � � � � � � � � � � � � � : "m" (v->counter));
>> + � � � if (cr0 & X86_CR0_TS)
>> + � � � � � � � write_cr0(cr0);
>> + � � � res = (long long)(unsigned)percpu_read(sse_atomic64_percpu.low) | ((long long)(unsigned)percpu_read(sse_atomic64_percpu.high) << 32);
>> + � � � preempt_enable();
>> + � � � return res;
>> +}
>> +EXPORT_SYMBOL(sse_atomic64_read_cx8call);
>
> Care to explain how this is IRQ and NMI safe?

Unfortunately it isn't, due to the per-CPU variables, and thus needs
to be fixed to align the stack and use it instead
(__attribute__((force_align_arg_pointer)) should do the job).
Sorry for this, I initially used the stack and later changed it to
guarantee alignment without rechecking the IRQ/NMI safety.

If we use the stack instead of per-CPU variables, all IRQs and NMIs
preserve CR0 and the SSE registers, so this would be safe, right?

kernel_fpu_begin/end cannot be used in interrupts, so that shouldn't
be a concern.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Luca Barbieri on 18 Feb 2010 07:40

>> If we use the stack instead of per-CPU variables, all IRQs and NMIs
>> preserve CR0 and the SSE registers, so this would be safe, right?
>
> You'd have to take special care to deal with nested IRQs I think.

Could you elaborate on that?
Which issue could there be with nested IRQs?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

First | Prev | Next | Last
Pages: 1 2 3 4 5
Prev: boot times, not mdadm (was: Linux mdadm superblock question.)
Next: [PATCH 1/8] PCI PM: Add function for checking PME status of devices