From: Piotr Wyderski on
Hello,

how exactly do SSE2 functional units operate in mixed data type mode
on modern processors, i.e. Core2+/Phenom? For instance, it is much more
convenient to use pshufd instead of shufps to shuffle single-precision
floating-point data vectors, as it saves one movaps instruction, but
should I expect a penalty for crossing the floating-point/integer
boundary? If yes, then how big?

Best regards
Piotr Wyderski
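
For reference, the two shuffles in question can be compared with C
intrinsics. This is only a sketch, assuming an SSE2-capable compiler and
<emmintrin.h>; the names reverse_shufps, reverse_pshufd and check_shuffles
are made up for illustration. In its two-operand form shufps overwrites its
destination, so keeping the source alive costs an extra movaps, whereas
pshufd writes to a separate destination register. The casts are pure
reinterpretations and generate no instructions.

```c
#include <assert.h>
#include <string.h>
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE ones) */

/* Reverse four floats with the floating-point shuffle (shufps). */
static inline __m128 reverse_shufps(__m128 v) {
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));
}

/* Same permutation with the integer shuffle (pshufd) on float data. */
static inline __m128 reverse_pshufd(__m128 v) {
    __m128i bits = _mm_castps_si128(v);               /* no instruction */
    bits = _mm_shuffle_epi32(bits, _MM_SHUFFLE(0, 1, 2, 3));
    return _mm_castsi128_ps(bits);                    /* no instruction */
}

/* Hypothetical self-check: both variants must give bit-identical results. */
static inline void check_shuffles(void) {
    __m128 v = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
    float a[4], b[4];
    _mm_storeu_ps(a, reverse_shufps(v));
    _mm_storeu_ps(b, reverse_pshufd(v));
    assert(memcmp(a, b, sizeof a) == 0);
    assert(a[0] == 4.0f && a[3] == 1.0f);
}
```

Which instruction a compiler actually emits for either variant is up to the
compiler, and the sketch says nothing about timing on any particular core.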

From: Terje Mathisen on
Piotr Wyderski wrote:
> Hello,
>
> how exactly do SSE2 functional units operate in mixed data type mode
> on modern processors, i.e. Core2+/Phenom? For instance, it is much more
> convenient to use pshufd instead of shufps to shuffle single-precision
> floating-point data vectors, as it saves one movaps instruction, but
> should I expect a penalty for crossing the floating-point/integer
> boundary? If yes, then how big?

Afaik all implementations up to now have used the same storage for both
types, so there hasn't been any penalty, so far. (I might be wrong though.)

However, the fact that Intel/AMD have implemented separate opcodes for
these instructions, even when the effect is identical, seems to indicate
that they expect they will need to separate them at some point in the
future, even if they haven't done so by now.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Piotr Wyderski on
Terje Mathisen wrote:

> Afaik all implementations up to now have used the same storage for both

Yes, the storage is the same, but I wonder whether the XMM registers
are just dynamic aliases to internal int_XMM and float_XMM sets,
in which case mixing them would be a mistake.

> so there hasn't been any penalty

My tests confirm that, but it's always better to ask. :-)

Best regards
Piotr Wyderski

From: Niels Fröhling on
Terje Mathisen wrote:
> Piotr Wyderski wrote:
>> Hello,
>>
>> how exactly do SSE2 functional units operate in mixed data type mode
>> on modern processors, i.e. Core2+/Phenom? For instance, it is much more
>> convenient to use pshufd instead of shufps to shuffle single-precision
>> floating-point data vectors, as it saves one movaps instruction, but
>> should I expect a penalty for crossing the floating-point/integer
>> boundary? If yes, then how big?
>
> Afaik all implementations up to now have used the same storage for both
> types, so there hasn't been any penalty, so far. (I might be wrong though.)
>
> However, the fact that Intel/AMD have implemented separate opcodes for
> these instructions, even when the effect is identical, seems to indicate
> that they expect they will need to separate them at some point in the
> future, even if they haven't done so by now.

As I see it, it's the reverse. Those functions (pshufw) were MMX functions,
which were pure integer. The other function (shufps) was an XMM (SSE) function,
which was pure floating-point. There was no way to mix and match.

When AMD was offering 3DNow, mix-and-match of int/float was _intended_. Some
functions (pswapd) were introduced to help float movement but had an integer
(pswap"d") identifier.

When Intel was 5 years late to the party with SSE2, they mapped all MMX
instructions onto XMM registers, which created a multitude of identically
behaving opcodes.

Mix-and-match is intended and has severe (positive) performance implications.
Nobody will split this in the future again.

My personal opinion about the why (not mapping the pshufd mnemonic onto the
shufps opcode) is that you can make a processor which has no floating-point
support at all by simply removing all floating-point-implied functions, making
the instruction decoder simpler and so on.

Ciao
Niels
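
One common illustration of integer instructions working on float data is an
absolute value built from integer shifts, which needs no mask constant in
memory. The following is a minimal sketch, assuming SSE2 and <emmintrin.h>;
abs_ps_int and check_abs are hypothetical names, and nothing here measures
whether the trick is a win on a given processor.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* |x| for four floats: shift the sign bit out to the left and shift a
   zero back in from the left, using the integer shifts pslld/psrld. */
static inline __m128 abs_ps_int(__m128 x) {
    __m128i bits = _mm_castps_si128(x);
    bits = _mm_srli_epi32(_mm_slli_epi32(bits, 1), 1);
    return _mm_castsi128_ps(bits);
}

/* Hypothetical self-check on a few representative values. */
static inline void check_abs(void) {
    float out[4];
    _mm_storeu_ps(out, abs_ps_int(_mm_setr_ps(-1.5f, 2.0f, -0.0f, -8.0f)));
    assert(out[0] == 1.5f && out[1] == 2.0f);
    assert(out[2] == 0.0f && out[3] == 8.0f);
}
```

The floating-point alternative would be an andps with a 0x7FFFFFFF mask
loaded from memory or built in a register.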
From: Piotr Wyderski on

Terje Mathisen wrote:

> However, the fact that Intel/AMD have implemented separate opcodes for
> these instructions, even when the effect is identical, seems to indicate
> that they expect they will need to separate them at some point in the
> future, even if they haven't done so by now.

Intel seems to recommend mixed-mode calculations, as
they use it in their dot product code here:

http://www.intel.com/technology/itj/2008/v12i3/3-paper/6-examples.htm

haddps xmm0, xmm0
movaps xmm1, xmm0
psrlq xmm0, 32

So IMHO mixing can be considered blessed :-)

Best regards,
Piotr Wyderski
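
The same idea (an integer shift, psrlq, applied to float data in the final
reduction step) can be sketched in C. This is not Intel's code: their
snippet uses haddps, which is SSE3, while the sketch below swaps the halves
with pshufd instead so that it stays within SSE2. The names hsum_ps_sse2 and
dot4 are made up for illustration.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Horizontal sum of four floats using integer-domain data movement. */
static inline float hsum_ps_sse2(__m128 v) {
    /* pshufd: swap the two 64-bit halves -> [v2, v3, v0, v1] */
    __m128 swapped = _mm_castsi128_ps(
        _mm_shuffle_epi32(_mm_castps_si128(v), _MM_SHUFFLE(1, 0, 3, 2)));
    __m128 sum = _mm_add_ps(v, swapped);       /* [v0+v2, v1+v3, ...] */
    /* psrlq by 32: move lane 1 down into lane 0 of each 64-bit half */
    __m128 high = _mm_castsi128_ps(
        _mm_srli_epi64(_mm_castps_si128(sum), 32));
    return _mm_cvtss_f32(_mm_add_ss(sum, high)); /* (v0+v2) + (v1+v3) */
}

/* Four-element dot product built on the reduction above. */
static inline float dot4(__m128 a, __m128 b) {
    return hsum_ps_sse2(_mm_mul_ps(a, b));
}

/* Hypothetical self-check with exactly representable values. */
static inline void check_dot4(void) {
    __m128 a = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
    __m128 b = _mm_setr_ps(4.0f, 3.0f, 2.0f, 1.0f);
    assert(dot4(a, b) == 20.0f);
}
```

Only the data movement runs through integer instructions here; the
multiplies and adds stay in the floating-point units.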