C++: 64 bit performance vs. 32 bit [Computer Architecture]

Prev: interrupting for overflow and loop termination
Next: "Livermore Loops" on x86 Linux

From: Grumble on 12 Feb 2006 12:26

Niels Jørgen Kruse wrote:

> Brian Hurt <bhurt(a)AUTO> wrote:
>
>
>>On the x86, you get the advantage of 8 new registers in going to
>>64-bit. This generally increases the speed of most programs by more
>>than enough to overcome the decreased cache hit ratios 64 bits
>>induces, which means that 64-bit code is generally 10-15% faster than
>>the 32 bit code on the same hardware.
>
>
> There is a difference between AMD and Intel CPUs here. On Intel, the 8
> registers subtract from the pool of rename registers, on AMD it is use'm
> or lose'm.

Could you elaborate?

From: Andy Glew on 12 Feb 2006 16:07

nospam(a)ab-katrinedal.dk (Niels Jørgen Kruse) writes:

> Brian Hurt <bhurt(a)AUTO> wrote:
>
> > On the x86, you get the advantage of 8 new registers in going to
> > 64-bit. This generally increases the speed of most programs by more
> > than enough to overcome the decreased cache hit ratios 64 bits
> > induces, which means that 64-bit code is generally 10-15% faster than
> > the 32 bit code on the same hardware.
>
> There is a difference between AMD and Intel CPUs here. On Intel, the 8
> registers subtract from the pool of rename registers, on AMD it is use'm
> or lose'm.

It is more generic than just AMD vs. Intel.

The Intel P6 family (including the Pentium M family) has a separate
ROB and RRF. Architectural registers live in the RRF - adding
architectural registers just makes the RRF bigger. The ROB holds data,
and defines the size of the instruction window.

As you report, AMD's K7 and K8 work the same way.

The Intel Pentium 4 family (Wmt, Nwd, Psc) has a unified PRF - there
is a single pool of physical registers used to hold both architectural
and rename registers. If there is a ROB, it is dataless, just used
for bookkeeping. There is no need to copy data values from ROB to RRF.

With a unified PRF machine, if you increase the number of
architectural registers you decrease the instruction window size.
Worse, you remove rename registers even if those architectural
registers are not in use.
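
[Editorial sketch] The contrast between the two organizations described above can be put into a few lines of Python. This is only a toy model under stated assumptions: the function names are made up for illustration, the 128-entry sizes are hypothetical, and real machines have separate integer/FP files and other limits on the window.

```python
def rename_registers_split(rob_entries: int, arch_regs: int) -> int:
    """ROB+RRF organization: architectural state lives in a separate RRF,
    so every ROB entry stays available for in-flight results no matter
    how many architectural registers exist."""
    return rob_entries


def rename_registers_unified(prf_entries: int, arch_regs: int) -> int:
    """Unified PRF: one physical register holds the committed value of each
    architectural register, whether the program uses it or not. Only the
    remainder can hold in-flight (renamed) results."""
    if arch_regs > prf_entries:
        raise ValueError("PRF must be able to hold the architectural state")
    return prf_entries - arch_regs


# Hypothetical sizes: 128 ROB entries vs. a 128-entry unified PRF.
for arch in (8, 16):
    print(arch,
          rename_registers_split(128, arch),
          rename_registers_unified(128, arch))
```

In this model, doubling the architectural registers from 8 to 16 leaves the split design's window untouched but shrinks the unified design's pool of rename registers from 120 to 112.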

Many people naively say "increasing registers improves performance".
The tradeoff is not so simple when it reduces speculation.

---

Note: many people confuse this issue with the separate, albeit
related, issue of where in the pipeline you read the PRF (or ROB+RRF).
The P6 family reads the ROB+RRF before placing values into the
reservation stations, RS, and relies on capturing data values while an
operation is pending in the RS. The Wmt family reads the PRF after
dispatching from the scheduler.

From: Bernd Paysan on 12 Feb 2006 19:13

Andy Glew wrote:
> With a unified PRF machine, if you increase the number of
> architectural registers you decrease the instruction window size.
> Worse, you remove rename registers even if those architectural
> registers are not in use.

But the register window is huge compared to the number of architectural
registers.

> Many people naively say "increasing registers improves performance".
> The tradeoff is not so simple when it reduces speculation.

It also depends on how costly the additional loads and stores are (for the
reduced register version), or if they are basically "for free", since the
inherent ILP isn't that high, and the additional loads and stores just fill
up otherwise empty slots.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

From: Andy Glew on 14 Feb 2006 13:21

Bernd Paysan <bernd.paysan(a)gmx.de> writes:

> Andy Glew wrote:
> > With a unified PRF machine, if you increase the number of
> > architectural registers you decrease the instruction window size.
> > Worse, you remove rename registers even if those architectural
> > registers are not in use.
>
> But the register window is huge compared to the number of architectural
> registers.

I wish.

Consider a Willamette/Psc/K8 era machine.

With x86-64 there are roughly 64 architectural registers (exactly how
many depends on what you count).

Instruction window sizes (rob sizes) of circa 128 on current machines,
on the high end.

64 architectural registers. 128 renamed registers.
=> the architectural registers would be roughly 1/3 of a 128+64 entry PRF.
Half of a 128 entry PRF.
Half of a 256 entry PRF with threading.

(For simplicity, we will assume unified, not split into integer/FP.
If you split, pretty much the same thing works out.)
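
[Editorial sketch] The fractions above can be checked with a few lines of Python. The helper name `arch_share` is hypothetical, and the reading of the threaded case as two threads, each with its own 64 architectural registers, is an assumption; the post does not spell it out.

```python
from fractions import Fraction


def arch_share(arch_regs: int, prf_entries: int) -> Fraction:
    """Fraction of a physical register file pinned by architectural state."""
    return Fraction(arch_regs, prf_entries)


ARCH = 64     # rough x86-64 architectural register count used in the post
WINDOW = 128  # instruction window (ROB) size used in the post

cases = {
    "128 rename + 64 architectural": arch_share(ARCH, WINDOW + ARCH),
    "128-entry PRF": arch_share(ARCH, 128),
    # Assumption: "with threading" means two threads of 64 registers each.
    "256-entry PRF, two threads": arch_share(2 * ARCH, 256),
}

for name, share in cases.items():
    print(f"{name}: {share}")
```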

Some people have proposed adding more architectural registers.
I ask them: "are you happy reducing the register window by half?"

---

One of our big problems right now is that we have too many lregs - too
many architectural registers. Not only do they waste space in the
PRF, making it bigger and slower; they also waste space in the
renaming tables.

I have said, in this forum, for nigh on years now: eventually we must
go to multilevel register files. A small L1 register file that has
lots of ports. A large L2 register file that has fewer ports. Hell,
it could even be in main memory.

Same multilevel principle applies to renaming tables.

If you want to add more architectural registers, add them to the L2
PRF.

---

It's ironic: we might well have better performance - larger
instruction windows - if we had fewer architectural registers.

From: Anton Ertl on 14 Feb 2006 13:55

Andy Glew <first.last(a)employer.domain> writes:
>With x86-64 roughly 64 architectural registers

How do you compute that? I compute:

16 GPRs
16 xmm
8 387/mmx
--
40
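
[Editorial sketch] The tally above, and what it does to the earlier "roughly 1/3" estimate, can be recomputed as follows. The dictionary keys simply mirror the list above; the 128 rename registers are the figure from the post being replied to, and applying them to this count is an assumption made for illustration.

```python
from fractions import Fraction

# Register classes as counted in the list above.
arch_regs = {"GPR": 16, "xmm": 16, "387/mmx": 8}
total = sum(arch_regs.values())

# Assumption: the same 128 rename registers used in the quoted estimate.
RENAME = 128
share = Fraction(total, RENAME + total)

print(total, share)
```

With 40 instead of 64 architectural registers, the architectural share of a 128+40 entry file comes out somewhat under a quarter rather than a third.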

As for the K7/K8 having the rename registers in addition to the
architectural ones, I read somewhere that one variant added new
architectural FP registers without increasing the physical FP
registers, and only a later variant increased the number of physical
FP registers to restore the available rename registers to the original
number. I don't remember if the additional architectural registers
were the first 8 XMM registers (then the two variants would be the
Palomino and the K8), or if it was the second 8 XMM registers (then
the two variants would be the original K8 and some shrink).

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
