From: nedbrek on
Hello,

"Brett Davis" <ggtgp(a)yahoo.com> wrote in message
news:ggtgp-6E8E47.23285012082010(a)news.isp.giganews.com...
> In article <i40k1g$ker$1(a)news.eternal-september.org>,
> "nedbrek" <nedbrek(a)yahoo.com> wrote:
>> "Brett Davis" <ggtgp(a)yahoo.com> wrote in message
>> > Most RISC chips implement a register move as an ALU binary OR with
>> > immediate value zero. Even the NOP is actually "OR r0 = r0, #0"
>> > which I remember from my Moto 68k days.
>>
>> In x86, it is encoded as "mov dst = src". Internally, this can be converted
>> to an "(x)or/add/sub imm0" uop, or there might be a mov uop. Not sure what
>> the tradeoffs are...
>>
>> > An ALU is always going to read two values and write one, even x86.
>>
>> Consider the "clear" operation (XOR AX ^= AX). There is one write, the
>> value 0. Or a "mov r = imm". These are usually executed at the ALU
>> port.
>
> That is two values going in, they just happen to be the same. AX = AX ^ AX

You might convert it to a "mov imm0" uop. Depending on how your register
read/bypass logic works, you might be able to rename this value to "use the
special index of the bypass logic (which always forwards 0)".
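
A minimal sketch of that rename trick, for illustration only. It assumes a hypothetical renamer in which physical index 0 is reserved and always reads as zero; the names (ZERO_PREG, Renamer, rename) are made up for this example, and freeing of old physical registers is left out.

```python
ZERO_PREG = 0  # hypothetical index that the bypass logic always forwards as 0


class Renamer:
    def __init__(self, num_pregs):
        # Index 0 is reserved for the constant zero and is never allocated.
        self.free = list(range(1, num_pregs))
        self.map = {}  # architectural register name -> physical index

    def rename(self, op, dst, srcs):
        """Return (op, dst_preg, src_pregs), or None if the op is elided."""
        if op == "xor" and len(srcs) == 2 and srcs[0] == srcs[1]:
            # Zero idiom: point dst at the zero index, issue no uop.
            self.map[dst] = ZERO_PREG
            return None
        src_pregs = [self.map[s] for s in srcs]
        dst_preg = self.free.pop(0)
        self.map[dst] = dst_preg
        return (op, dst_preg, src_pregs)
```

Later consumers of the cleared register simply pick up index 0 as a source, so the clear never occupies an ALU port.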

> This does bring up a big point, just because you have three ALUs does
> not mean that you need six read ports on the register file.
> Most of the time you get your values from the bypasses, if you can
> predict that with some accuracy you can do with fewer ports, saving
> die size and power.
> To really make this useful an ALU would have to keep values when it
> is idle, as you have many 3 cycle stalls waiting on L1.
> Grabbing values from the bypass saves power, as it is closer, and
> involves far fewer transistors to select.

Sure, there are papers on this (sadly, I've forgotten who and when...). The
biggest problem with speculation in the scheduler (which includes register
caching, and even simple loads to some extent) is that when you are wrong,
you need to cancel all the dependent ops that have scheduled in the
meantime. If your scheduler deallocs on pick, you then need to get all
these cancelled ops re-inserted...
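
A rough sketch of the cancel-and-re-insert problem, under simplifying assumptions: ops are plain ids, `sources` maps each op to its producers, and `picked` lists ops in pick order (a consumer is never picked before its producer). The function names are hypothetical.

```python
def cancel_dependents(mispredicted, picked, sources):
    """Return the picked ops that transitively depend on `mispredicted`.

    picked:  op ids in pick order, oldest first
    sources: op id -> list of producer op ids
    """
    poisoned = {mispredicted}
    cancelled = []
    for op in picked:
        if any(src in poisoned for src in sources.get(op, [])):
            poisoned.add(op)
            cancelled.append(op)
    return cancelled


def replay(scheduler, picked, mispredicted, sources):
    # With dealloc-on-pick the scheduler entries are already gone,
    # so every cancelled op has to be put back.
    cancelled = cancel_dependents(mispredicted, picked, sources)
    scheduler.extend(cancelled)
    return cancelled
```

For example, if a load is mispredicted as a hit and "add" and "sub" were picked behind it, both come back into the scheduler while an unrelated "mul" is left alone.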

Ned


From: nedbrek on
Hello all,

"Paul A. Clayton" <paaronclayton(a)embarqmail.com> wrote in message
news:115aae10-c46c-43fc-95f9-9e3547f8645f(a)f6g2000yqa.googlegroups.com...
>On Aug 12, 7:55 am, "nedbrek" <nedb...(a)yahoo.com> wrote:
>[snip]
>> Right, the mystery is resolved using physical register numbers. The renamer
>> provides the number for each source. You would like there to be one source
>> at this point, although you could make the bypass logic execute the cmov -
>> this would require the renamer to produce 3 numbers (remember the flags have
>> a producer!).
>
> For a simple single small FLAGS register, the source operation could
> use its operation number (ROB number) plus one and the consuming
> operation its operation number. The trick is then to elide any
> intermediate names (for x86, intermediate names would probably be
> rare since FLAGS consumers usually immediately follow the source
> operation--correct?). Because FLAGS is small, replication is
> relatively inexpensive; because it is singular, special handling
> might be simpler and/or more cost-effective--such features should
> be exploitable. (For non-selected consuming operations, writing
> the FLAGS value into the operation might make sense.) One might
> choose to handle the nearby consumers differently, inserting
> FLAGS 'reassert' operations to wake-up later consumers.

This assumes that physical registers are bound to ROB entries. That is true
for some machines, but not for all... as you increase ROB size, you would like
to be able to size the physical register file separately (a lot of instructions
don't need regs - stores and jumps being the most common). Of course, there
are even more radical proposals (reference counted, shared values) that will
break this.
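
To illustrate the decoupled case, here is a small sketch with a hypothetical Allocator in which ROB entries and physical registers are separate pools. The sizes and names are assumptions for the example, not a description of any real machine.

```python
class Allocator:
    def __init__(self, rob_size, num_pregs):
        self.rob_free = rob_size                   # free ROB entries
        self.free_pregs = list(range(num_pregs))   # separate free list

    def allocate(self, has_dest):
        """Return (ok, preg); preg is None for ops with no destination."""
        if self.rob_free == 0:
            return (False, None)
        preg = None
        if has_dest:
            if not self.free_pregs:
                return (False, None)
            preg = self.free_pregs.pop(0)
        self.rob_free -= 1
        return (True, preg)
```

Stores and jumps take a ROB entry but no register, so the ROB number no longer identifies a physical register and cannot be used as the FLAGS name directly.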

Ned