From: H. Peter Anvin on
On 11/10/2009 12:16 PM, Willy Tarreau wrote:
>
> Indeed, but there is a difference between [cmpxchg, bswap, cmov, nopl]
> on one side and [sse*] on the other : distros are built assuming the
> former are always available while they are not always. And the distros
> which make the difference have to provide a dedicated build for earlier
> systems just for compatibility. SSE*, 3dnow* etc... are only used by a
> handful of media players/converters/encoders which are able to detect
> themselves what to use and already have the necessary fallbacks because
> these instruction sets vary too much between processors and vendors.
>

That is increasingly not true since gcc is now doing autovectorization.

> One could argue that cmpxchg/bswap/xadd are supported by 486 and that
> implementing them for 386 is almost useless now (though it costs almost
> nothing to provide them, I did a few years ago).
>
> CMOV/NOPL are rarely used, thus have no reason to cause a massive
> performance drop, but are frequent enough (at least cmov) for almost
> any program to have at least one or two inside, making it incompatible
> with a given processor, and are almost obvious to implement too.

I count 970 cmovs in libc out of 322660 instructions. That is one in
333 instructions. In other words, a trap-and-emulate of some 500 cycles
would add some two cycles *per instruction* during execution -- hardly
an insignificant number. All in all, any of this is really only useful
as a limp.
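
The arithmetic above can be checked with a short sketch. It assumes that
the static instruction mix of libc matches the dynamic mix at run time
(a simplification) and takes the 500-cycle trap cost as the estimate
given in the message; the function name is made up for illustration.

```python
def amortized_trap_cost(trapping_insns, total_insns, trap_cycles):
    """Average extra cycles per executed instruction.

    Assumes every instruction is executed equally often, so the static
    count of trapping instructions equals their dynamic frequency.
    """
    return trap_cycles * trapping_insns / total_insns


# Figures quoted in the message: 970 cmovs out of 322660 instructions,
# and an estimated 500 cycles per trap-and-emulate.
density = 322660 / 970
extra = amortized_trap_cost(970, 322660, 500)

print(f"one cmov per {density:.0f} instructions")
print(f"{extra:.2f} extra cycles per instruction")
```

Under these assumptions the overhead comes out at about 1.5 cycles per
instruction, the same order as the "some two cycles" quoted above.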

> SSE*/3dnow* would be much much harder and would only serve very few
> programs, and serve them badly because when they're used, it would
> be intensive.
>
> I personally am not against being able to emulate every optional
> instruction, quite the opposite instead. It's just that if in order
> to do this, we add cost to the other obvious ones, we lose what we
> expected to win (simplicity and efficiency).

I don't see any particular subset as being more obvious than the other,
with the *possible* exception of NOPL, simply because NOPL was
undocumented for so long.

-hpa
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Pavel Machek on
Hi!

> Indeed, but there is a difference between [cmpxchg, bswap, cmov, nopl]
> on one side and [sse*] on the other : distros are built assuming the
> former are always available while they are not always. And the
> distros

Well, fix the distros...

> which make the difference have to provide a dedicated build for earlier
> systems just for compatibility. SSE*, 3dnow* etc... are only used by a
> handful of media players/converters/encoders which are able to detect
> themselves what to use and already have the necessary fallbacks because
> these instruction sets vary too much between processors and vendors.
>
> One could argue that cmpxchg/bswap/xadd are supported by 486 and that
> implementing them for 386 is almost useless now (though it costs almost
> nothing to provide them, I did a few years ago).
>
> CMOV/NOPL are rarely used, thus have no reason to cause a massive
> performance drop, but are frequent enough (at least cmov) for almost

*One* CMOV in the inner loop will make your performance go down 20x.
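
A rough model of this claim, assuming the 500-cycle trap estimate quoted
earlier in this thread and a hypothetical loop body whose native cost is
given in cycles. The function names and the loop sizes are illustrative
only, not measurements.

```python
def slowdown(native_loop_cycles, traps_per_iteration, trap_cycles):
    """Factor by which one loop iteration slows down when some of its
    instructions have to be trapped and emulated."""
    emulated = native_loop_cycles + traps_per_iteration * trap_cycles
    return emulated / native_loop_cycles


def native_cycles_for_slowdown(target, trap_cycles, traps_per_iteration=1):
    """Native loop cost (in cycles) at which the slowdown equals target."""
    return traps_per_iteration * trap_cycles / (target - 1)


# One trapped CMOV per iteration at an assumed 500 cycles per trap.
print(slowdown(25, 1, 500))
print(round(native_cycles_for_slowdown(20, 500), 1))
```

In this model a 20x slowdown corresponds to a loop body of roughly 26
native cycles; shorter loops fare worse, longer ones better.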

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
From: Willy Tarreau on
On Tue, Nov 10, 2009 at 09:54:45PM +0100, Pavel Machek wrote:
> Hi!
>
> > Indeed, but there is a difference between [cmpxchg, bswap, cmov, nopl]
> > on one side and [sse*] on the other : distros are built assuming the
> > former are always available while they are not always. And the
> > distros
>
> Well, fix the distros...

You know as well as I do that pointing the finger at distros is as easy
as it is useless, because people running on low end want something that
works and people running on high end want something that runs fast. In
order to satisfy everyone, you would have to build with optimizations
for every CPU around, which does not make sense. Simply count the number
of CPU variants in the kernel, and imagine that many CDs/DVDs for a
single platform distro.

However, targeting the most common denominator of high end machines
(basically i686) and having the lower end systems experience a tiny
slowdown is not stupid at all since performance is not what matters
the most there. The higher end systems will simply be able to run
CPU-specific optimizations per-program as they already do right now.

(...)
> > CMOV/NOPL are rarely used, thus have no reason to cause a massive
> > performance drop, but are frequent enough (at least cmov) for almost
>
> *One* CMOV in the inner loop will make your performance go down 20x.

yes, just like with emulated FPU or trapped unaligned accesses. It's
just like flying fishes. They exist but they aren't the most common
ones. If people encounter these cases on a specific program, then
they just have to recompile it if it is a problem. At least they
don't rebuild the whole distro. And once again, I've been using
cmpxchg/bswap emulation for years on my i386 without feeling any
need for a rebuild, and CMOV emulation for years now on my mini-itx
C3 without any problem either. These are real experiences, not just
fears of imaginary problems. Yes I can design a program to run 400
times slower on these machines if I want. I just don't feel the need
to do so and apparently existing programs' authors didn't either.
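
For readers unfamiliar with what a CMOV emulation has to reproduce, here
is a minimal sketch of the instruction's register-to-register semantics.
It is not the emulation code mentioned above and not kernel code: it is
a Python model, and the function names are hypothetical. The condition
encoding (low four bits of the second opcode byte, 0F 40 through 0F 4F)
and the EFLAGS bit positions follow the x86 architecture. A real trap
handler would also have to decode ModRM, the operand size and memory
operands.

```python
# EFLAGS bit masks used by the x86 condition codes.
CF, PF, ZF, SF, OF = 1 << 0, 1 << 2, 1 << 6, 1 << 7, 1 << 11


def condition_holds(cc, eflags):
    """Evaluate the 4-bit x86 condition code cc against eflags."""
    cf = bool(eflags & CF)
    pf = bool(eflags & PF)
    zf = bool(eflags & ZF)
    sf = bool(eflags & SF)
    of = bool(eflags & OF)
    # The upper three bits select the base test; bit 0 negates it.
    base = (
        of,                # O  / NO
        cf,                # B  / AE
        zf,                # E  / NE
        cf or zf,          # BE / A
        sf,                # S  / NS
        pf,                # P  / NP
        sf != of,          # L  / GE
        zf or (sf != of),  # LE / G
    )[(cc >> 1) & 0x7]
    return base != bool(cc & 1)


def emulate_cmov(opcode2, dst, src, eflags):
    """Return the new destination value for CMOVcc (opcode 0F opcode2)."""
    if not 0x40 <= opcode2 <= 0x4F:
        raise ValueError("not a CMOVcc opcode")
    return src if condition_holds(opcode2 & 0xF, eflags) else dst


# CMOVE (0F 44) with ZF set copies the source into the destination.
print(hex(emulate_cmov(0x44, 0x1111, 0x2222, ZF)))
```

The semantics themselves are small; the cost discussed in this thread
comes from the trap and the instruction decoding around them.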

Regards,
Willy

From: H. Peter Anvin on
On 11/10/2009 01:12 PM, Willy Tarreau wrote:
> yes, just like with emulated FPU or trapped unaligned accesses. It's
> just like flying fishes. They exist but they aren't the most common
> ones. If people encounter these cases on a specific program, then
> they just have to recompile it if it is a problem. At least they
> don't rebuild the whole distro. And once again, I've been using
> cmpxchg/bswap emulation for years on my i386 without feeling any
> need for a rebuild, and CMOV emulation for years now on my mini-itx
> C3 without any problem either. These are real experiences, not just
> fears of imaginary problems. Yes I can design a program to run 400
> times slower on these machines if I want. I just don't feel the need
> to do so and apparently existing programs' authors didn't either.

Willy, perhaps you can come up with a list of features you think should
be emulated, together with an explanation of why you opted for that list
of features and *did not* opt for others.

Note: emulated FPU is a special subcase. The FPU operations are
heavyweight enough that the overhead of trapping versus library calls is
relatively insignificant.

-hpa

From: Matt Thrailkill on
On Tue, Nov 10, 2009 at 12:54 PM, Pavel Machek <pavel(a)ucw.cz> wrote:
> *One* CMOV in the inner loop will make your performance go down 20x.

This is 20x slower than not running at all, right?