C++: 64 bit performance vs. 32 bit [Computer Architecture]

From: Stephen Sprunk on 10 Mar 2006 12:42

"Anton Ertl" <anton(a)mips.complang.tuwien.ac.at> wrote in message
news:2006Mar5.094233(a)mips.complang.tuwien.ac.at...
> "Stephen Sprunk" <stephen(a)sprunk.org> writes:
>>There is no guarantee that any given pointer returned by malloc() or any
>>syscall will be in the 32-bit address space.
>
> Yes, one would have to extend a few system calls (those that allocate
> address space) with a 32-bit flag (or have 32-bit variants of these
> system calls); malloc() only gets its memory from system calls. As
> Greg Lindahl wrote, this has been done in many OSs (and the IA-32
> (compatibility mode) support in AMD64 long mode OSs requires much
> more effort).

If it were done in the syscalls, that would help, but I was assuming there
were no API changes, and that any reduction from 64 to 32 bits would have to
be handled by the compiler. In that case, every pointer returned from an
external function would have to be checked and possibly remapped.
Additionally, any API function that wanted/returned a 64-bit int would need
to be adjusted, which costs a few cycles. There might be some workloads
that benefit from all this work, but it seems like a high price to pay to
reduce the cache impact of 64-bit pointers.
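
To make the per-call cost concrete, here is a minimal sketch of the kind of check a compiler would have to emit around every external call that returns a pointer. The names narrow_address and widen_address are hypothetical, addresses are modeled as plain integers so the sketch runs on any platform, and the remapping step for addresses above 4 GB is not shown.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper: narrow a 64-bit address to 32 bits if it fits.
// Returns std::nullopt when the address lies above 4 GB and would
// need remapping (not shown here).
std::optional<std::uint32_t> narrow_address(std::uint64_t addr) {
    if (addr > UINT32_MAX) {
        return std::nullopt;  // outside the low 4 GB
    }
    return static_cast<std::uint32_t>(addr);
}

// Widening back is a plain zero extension.
std::uint64_t widen_address(std::uint32_t narrow) {
    return static_cast<std::uint64_t>(narrow);
}
```

The compare and branch in narrow_address is the "few cycles" paid at each API boundary; widening is essentially free.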

> But my question was this: in which cases would the compiler have to
> use more instructions for ILP32 programs in 64-bit mode than for
> I32LP64 programs?

Other than what I mentioned previously, I can't think of any. There's no
32-bit instructions that were _removed_ in the AMD64 set, though there's a
few that require a longer encoding (e.g. INC, DEC).

>>While the specific workload does cause variations, the cache effects of
>>64-bit pointers are usually more than offset by the performance gained by
>>having eight extra GPRs.
>
> In 64-bit mode, the compiler can use these GPRs (as well as the 8
> extra XMM registers).

Right; I was comparing 64-bit mode to 32-bit mode and trying to explain why
it's often worth it to use full 64-bit mode instead of 32-bit mode even if
you had no use for 64-bit pointers or ints.

Interestingly, the Linux kernel does something similar to what the OP asked.
Due to sign extension, one can count on the kernel being located at -2GB to
0 in both 32-bit and 64-bit modes. The kernel does use 64-bit pointers when
dealing with userland, but AFAIK only (negative) 32-bit pointers for its
internals. Until the kernel exceeds 2GB, this makes supporting both modes
transparently much easier. GCC even has a special compilation mode to do
things this way (but it only works for kernel code).
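
A small sketch of the sign-extension arithmetic behind this, assuming two's complement conversion (guaranteed since C++20, and what mainstream compilers did before that). The function name sign_extend_address is made up for illustration.

```cpp
#include <cstdint>

// Sign-extend a 32-bit address to 64 bits, the way 32-bit displacements
// and immediates are widened in 64-bit mode.
std::uint64_t sign_extend_address(std::uint32_t addr32) {
    // Reinterpret as signed first so the top bit is copied upward.
    auto as_signed = static_cast<std::int32_t>(addr32);
    return static_cast<std::uint64_t>(static_cast<std::int64_t>(as_signed));
}
```

With this, 0x80000000 (that is, -2GB as a 32-bit value) becomes 0xFFFFFFFF80000000, so a "negative" 32-bit pointer names the same top 2GB of the address space in either mode, while addresses below 0x80000000 are unchanged.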

>>Slower cache, but less need to use it.
>
> Sounds like a fallacy to me. The cache accesses that the additional
> registers avoid would all be cache hits. The cache misses that the
> bigger pointers cause are not reduced by having more registers.

Cache hits still cost a few cycles. If a program is constrained by having
only six or seven registers available, it's going to have many loads from
the stack; eliminating or reducing those loads improves performance.
Register spill loads/stores also take up fetch/decode slots. Beyond that,
having more registers means there should be fewer false dependencies, and
compilers are free to hoist loads further up in the block (beyond the CPU's
OOO window), reducing the impact of both cache misses and hits.

Some of my terminology might be a little off, but this is what I've gleaned
from lots of documentation and explanations of test results.

S

--
Stephen Sprunk "Stupid people surround themselves with smart
CCIE #3723 people. Smart people surround themselves with
K5SSS smart people who disagree with them." --Aaron Sorkin


From: Grumble on 12 Mar 2006 17:17

Stephen Sprunk wrote:
> There's no 32-bit instructions that were _removed_ in the AMD64 set,
> though there's a few that require a longer encoding (e.g. INC, DEC).

I don't understand this statement.

What are x86 and/or AMD64 32-bit instructions?

Several IA-32 instructions are deprecated in AMD64.

The following instructions are invalid in 64-bit mode:

AAA - ASCII Adjust After Addition
AAD - ASCII Adjust Before Division
AAM - ASCII Adjust After Multiply
AAS - ASCII Adjust After Subtraction
BOUND - Check Array Bounds
CALL (far absolute) - Procedure Call Far
DAA - Decimal Adjust after Addition
DAS - Decimal Adjust after Subtraction
INTO - Interrupt to Overflow Vector
JMP (far absolute) - Jump Far
LDS - Load DS Segment Register
LES - Load ES Segment Register
POP DS - Pop Stack into DS Segment
POP ES - Pop Stack into ES Segment
POP SS - Pop Stack into SS Segment
POPA, POPAD - Pop All to GPR Words or Doublewords
PUSH CS - Push CS Segment Selector onto Stack
PUSH DS - Push DS Segment Selector onto Stack
PUSH ES - Push ES Segment Selector onto Stack
PUSH SS - Push SS Segment Selector onto Stack
PUSHA, PUSHAD - Push All to GPR Words or Doublewords

The following instructions are invalid in long mode:

SYSENTER - System Call (use SYSCALL instead)
SYSEXIT - System Exit (use SYSRET instead)

--
Regards, Grumble

From: Anton Ertl on 14 Mar 2006 12:14

"Stephen Sprunk" <stephen(a)sprunk.org> writes:
>"Anton Ertl" <anton(a)mips.complang.tuwien.ac.at> wrote in message
>news:2006Mar5.094233(a)mips.complang.tuwien.ac.at...
>> But my question was this: in which cases would the compiler have to
>> use more instructions for ILP32 programs in 64-bit mode than for
>> I32LP64 programs?
>
>Other than what I mentioned previously, I can't think of any. There's no
>32-bit instructions that were _removed_ in the AMD64 set, though there's a
>few that require a longer encoding (e.g. INC, DEC).

There are some (listed by somebody else), but that does not matter for
my question. I was not asking about compatibility mode vs. 64-bit
mode for ILP32 programs, but about ILP32 vs. I32LP64 programs in
64-bit mode.

>>>Slower cache, but less need to use it.
>>
>> Sounds like a fallacy to me. The cache accesses that the additional
>> registers avoid would all be cache hits. The cache misses that the
>> bigger pointers cause are not reduced by having more registers.
>
>Cache hits still cost a few cycles. If a program is constrained by having
>only six or seven registers available, it's going to have many loads from
>the stack; eliminating or reducing those loads improves performance.
>Register spill loads/stores also take up fetch/decode slots.

Sure, but my point was that someone might read into your statement
that the number of cache misses may be reduced by having more
registers, but that is not the case.
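
To illustrate where the extra misses come from, here is a minimal sketch comparing a list node linked by a native pointer with one linked by a 32-bit index. The struct names, the index-into-a-pool layout, and the 64-byte cache line are all assumptions for illustration only.

```cpp
#include <cstddef>
#include <cstdint>

// List node linked by a native pointer: 8-byte pointer on I32LP64.
struct PtrNode {
    PtrNode*     next;
    std::int32_t value;
};

// Same node linked by a 32-bit index into a pool (hypothetical layout).
struct IdxNode {
    std::uint32_t next;   // index of the next node in some array
    std::int32_t  value;
};

// Number of nodes that fit in one cache line, assuming 64-byte lines.
constexpr std::size_t kCacheLine = 64;

template <typename Node>
constexpr std::size_t nodes_per_line() {
    return kCacheLine / sizeof(Node);
}
```

On a typical I32LP64 target PtrNode is 16 bytes (8-byte pointer, 4-byte int, 4 bytes of padding) against 8 bytes for IdxNode, so half as many nodes fit per line. No number of extra registers changes that footprint.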

>Beyond that,
>having more registers means there should be fewer false dependencies, and
>compilers are free to hoist loads further up in the block (beyond the CPU's
>OOO window), reducing the impact of both cache misses and hits.

Compilers usually do code motion and instruction scheduling before
register allocation, so having more registers only helps alleviate
the negative effects of such code motion; of course, with few
registers available, the compiler writer might disable or weaken such
optimizations.

Concerning the OOO window, 8 additional registers won't help much
given that the OOO window contains around 100 instructions in these
CPUs. Moreover, unless the compiler does very aggressive scheduling
(e.g., modulo scheduling), it will usually not move instructions
beyond the OOO window (and it will probably not use such optimizations
on architectures with only 16 registers).

Concerning cache misses, dealing with that by hoisting a load very far
up is a waste of a register; better use a prefetch instruction at the
place where you would put the load, and schedule the load for an L1
hit.
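
A rough sketch of that idea, assuming GCC or Clang, whose __builtin_prefetch emits a prefetch hint. The function name gather_sum and the distance of 16 iterations are made up for illustration; a real distance would have to be tuned per machine.

```cpp
#include <cstddef>
#include <vector>

// Sum the values selected by an index array, prefetching the element
// needed kDistance iterations ahead. kDistance is an assumed tuning value.
long long gather_sum(const std::vector<int>& data,
                     const std::vector<std::size_t>& idx) {
    constexpr std::size_t kDistance = 16;
    long long sum = 0;
    for (std::size_t i = 0; i < idx.size(); ++i) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + kDistance < idx.size()) {
            // Hint only: read access (0), low temporal locality (1).
            __builtin_prefetch(&data[idx[i + kDistance]], 0, 1);
        }
#endif
        sum += data[idx[i]];  // the load stays here, no register held early
    }
    return sum;
}
```

The prefetch ties up no architectural register, while the load stays next to its use and, if the distance is right, hits in L1.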

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
