From: nedbrek on
Hello all,

"Andy 'Krazy' Glew" <ag-news(a)patten-glew.net> wrote in message
news:4C2B6F29.3060908(a)patten-glew.net...
> Second, RUU is microarchitecture. A uarch feature that I never considered
> implementable myself. By the way, I can't remember timestamps in it - I
> suspect that you may be thinking of the
> SimpleScalar simulator implementation, which may have had timestamps.
>
> Whereas what I was talking about is simulator. Not microarchitecture.

Definitely talking simulator. I've never seen a RUU proposed for hardware.


> Look: there are queues, or at least latches, between your pipestages no
> matter
> what. All I am suggesting is that you configure the SIMULATOR queues so
> that you can do experiments such as saying
> "what if I combine instruction decode and in-order schedule into a single
> pipestage." Whereas, if you have a reverse
> pipeline with a cycle per loop iteration, you cannot even run that
> experiment. Glew observation: oftentimes people say
> "That's not practical", when actually something is practical in hardware,
> just not in their simulator infrastructure.

In my experience, stages from IPGEN to SCHED cost 1% per stage. Stages
from READY to READY (ignoring 1 for EXE) are 5-10% per stage. Stages in
the backend are 1% or less.

These sorts of trends are some of the first runs you do. If your model can
only sweep 10 or more pipestages (due to the stage configuration), is 3
versus 5 really going to look significantly different?
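
As a rough sketch of what such a sweep harness might look like (SimConfig,
run_simulation, and the toy cost model are invented stand-ins; the
percentages just encode the rough numbers above):

#include <cmath>
#include <cstdio>

// Hypothetical knobs; a real model would have many more.
struct SimConfig {
    unsigned frontend_stages = 8;   // IPGEN..SCHED
    unsigned sched_loop_stages = 2; // READY..READY, ignoring EXE
    unsigned backend_stages = 4;    // post-execute
};

// Toy stand-in for the real simulator: ~1% IPC cost per frontend stage,
// ~7% per scheduler-loop stage, ~0.5% per backend stage.
double run_simulation(const SimConfig& cfg) {
    double ipc = 1.0;
    ipc *= std::pow(0.99,  cfg.frontend_stages);
    ipc *= std::pow(0.93,  cfg.sched_loop_stages);
    ipc *= std::pow(0.995, cfg.backend_stages);
    return ipc;
}

int main() {
    SimConfig cfg;
    // The sweep you want to be able to run from day one.
    for (unsigned n = 2; n <= 12; ++n) {
        cfg.frontend_stages = n;
        std::printf("frontend=%u relative IPC=%.3f\n", n, run_simulation(cfg));
    }
}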

> By the way, let me roughly describe what such a queue looks like:
> * hardware granularity and alignment
>   - i.e. does hardware think of it as a fixed alignment queue, blocks
>     of 4 always moving, or as a misaligned queue, where HW can read 4
>     entries at any alignment
>   - By the way - this should be parameterizable, since the decision to
>     use an aligned or a misaligned queue is one of the basic studies
>     you will always do.
> * hardware size (minimum)
> * cycles to get across - again, good to parameterize IN THE SIMULATOR,
>   so that you can easily simulate different chip layouts, with
>   different delays for the wires

Definitely. The pipestage code we had in IPFsim had width & depth knobs,
plus a "serpentine" knob (serpentine pipes flowed freely, non-serpentine
could only advance by the full width).
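
A minimal sketch of such a queue, combining Andy's knobs (size, cycles to
get across, aligned vs. misaligned) with our serpentine knob; all names are
invented for illustration, not taken from IPFsim or any real simulator:

#include <cstdint>
#include <deque>
#include <vector>

template <typename Uop>
class PipeQueue {
public:
    PipeQueue(unsigned width, unsigned depth, unsigned latency,
              bool serpentine)
        : width_(width), depth_(depth), latency_(latency),
          serpentine_(serpentine) {}

    // Producer side: timestamp the entry with when it becomes visible.
    // Returns false if the queue is full this cycle.
    bool push(const Uop& u, uint64_t now) {
        if (q_.size() >= width_ * depth_) return false;
        q_.push_back({u, now + latency_});  // latency 0 = "free" pipestage
        return true;
    }

    // Consumer side: pop up to width_ entries whose timestamps have
    // elapsed. A non-serpentine (aligned) queue only advances by a full
    // width_-wide block, modeling fixed-alignment hardware; end-of-run
    // drain handling is omitted in this sketch.
    std::vector<Uop> pop(uint64_t now) {
        unsigned ready = 0;
        for (const auto& e : q_) {
            if (e.visible_at > now || ready == width_) break;
            ++ready;
        }
        std::vector<Uop> out;
        if (!serpentine_ && ready < width_) return out;
        for (unsigned i = 0; i < ready; ++i) {
            out.push_back(q_.front().uop);
            q_.pop_front();
        }
        return out;
    }

private:
    struct Entry { Uop uop; uint64_t visible_at; };
    unsigned width_, depth_, latency_;
    bool serpentine_;
    std::deque<Entry> q_;
};

Dialing latency to 0 also gives you the "what if the latency through this
pipestage was 0" experiment Andy describes below, without restructuring
the pipeline loop.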

> ===
> However, I also believe that there is a place for cycle accurate
> simulators that are - well, maybe not less detailed than SimpleScalar,
> but more agile. Simulators in which you can run experiments such as
> saying "What if the latency through this pipestage was 0", where you
> can quickly dial in multiple cycles of delay.
>
> In the detailed simulators, you might not allow yourself to use the
> timestamped
> queues that I talk about above. Whereas in the more agile simulator you
> might use such timestamped queues to give
> yourself the agility.
>
> I also believe that there is a place for simulators that are not cycle
> accurate. Like DFA.

Agility is mostly a measure of what assumptions went into the original code
("I'm writing a P6 model for Itanium"). As long as you don't stray too far
from that (e.g. adding a new branch predictor), you can add a lot of
features and details. After tacking on a lot of stuff, it gets harder to
change. Some stuff never really fits right (P4).

No high level simulator is ever going to (easily) be cycle accurate with
hardware. Leave that to RTL. What you want is something that trends
correctly ("these uarch changes are worth 20% over the baseline"). Also, if
you can't model everything, at least understand where the model breaks down.

I had a simulator show huge speedup from a prefetching idea. Turns out, the
model's page table walker was effectively pipelined. Once you factored that
out, the idea was useless.

> ===
> By the way: you should, of course, try to make it impossible, difficult,
> or at
> least obvious to access simulator artifact data from true hardware
> simulation. C++ encapsulation. Much harder
> to do in C; or, rather, much easier to violate encapsulation by accident.

Definitely. Like I said, I would like my next model to be in D. Pin uses
C++ linkage, so my DFA stuff will need to be C++, but I probably won't write
a lot there.
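
For the C++ case, one sketch of the kind of encapsulation Andy means is a
"passkey" that only simulator-infrastructure code can mint, so hardware
model code cannot accidentally read artifact state (all names hypothetical):

#include <cstdint>

// Only simulator-infrastructure code can construct this token.
class SimArtifactKey {
    friend class SimInfra;
    SimArtifactKey() = default;
};

class SimInfra {
public:
    static SimArtifactKey key() { return {}; }
};

template <typename Payload>
class TimestampedEntry {
public:
    TimestampedEntry(const Payload& p, uint64_t t)
        : payload_(p), visible_at_(t) {}

    // Hardware model code may read the architectural payload...
    const Payload& payload() const { return payload_; }

    // ...but touching the timestamp requires a key that hardware model
    // code cannot construct, so accidental use of simulator artifacts
    // fails to compile.
    uint64_t visible_at(SimArtifactKey) const { return visible_at_; }

private:
    Payload payload_;
    uint64_t visible_at_;  // simulator artifact, not hardware state
};

// Infrastructure code: entry.visible_at(SimInfra::key());
// Hardware model code: entry.visible_at({});  // error: ctor is private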

Ned


From: Andy 'Krazy' Glew on
On 7/1/2010 7:40 PM, nedbrek wrote:
> In my experience, stages from IPGEN to SCHED cost 1% per stage. Stages
> from READY to READY (ignoring 1 for EXE) are 5-10% per stage. Stages in
> the backend are 1% or less.
>
> These sorts of trends are some of the first runs you do. If your model can
> only sweep 10 or more pipestages (due to the stage configuration), is 3
> versus 5 really going to look significantly different?

I'm old enough to remember when it was 5% for frontend pipestages, and 20% for execution loop pipestages. I'm sure that
the reduced importance of adding pipestages is due to (a) better predictors, (b) relatively slower memory, and (c) the
fact that adding 1 pipestage on top of 5 is a big deal, but on top of 10 is not so big.

We may well be stuck in the double digits for pipestages. But I wonder if the pendulum may not want to swing the other way:

1) because of device variation. You get better yield (and performance, in terms of average latency per transistor or
gate) if you have 20 gates per pipestage rather than Cray-like 8, and even better with 40.

2) seeking to minimize overheads such as setup and skew allowances, which helps both power and perf

3) if you start using asynchronous design styles (in my current analysis,
asynchronous design styles for bandwidth may have fewer gates per "cycle"
(or whatever the equivalent term is for asynch); whereas if you are
designing for minimum latency of certain critical computations, asynch
wants fat pipestages)

4) and because I can see an asymptote where it is better to have less
pipelined logic go idle than it is to have more pipelined logic get blocked
with stuff in the pipeline that must be maintained.


> Definitely. The pipestage code we had in IPFsim had width & depth knobs,
> plus a "serpentine" knob (serpentine pipes flowed freely, non-serpentine
> could only advance by the full width).

Cool. I earlier described such "Alignment issues for queues, buffers, and pipelines" - i.e. I used the term "alignment"
or "blocking factor" for what you describe as "non-serpentine". I can even see where the term comes from.

Adding this to https://semipublic.comp-arch.net/wiki/Alignment_issues_for_queues,_buffers,_and_pipelines

From: Andy 'Krazy' Glew on
On 7/1/2010 5:42 AM, nedbrek wrote:

> Sure, we wanted an execute-at-execute model. That is what we were driving
> for.
>
> But, you have to cut us some slack! There were two of us (plus an intern
> for the summer). We came in with a blank slate for modelling out-of-order.
> We were basically trying to reproduce P6, for Itanium.

Slack given. Moreover, I don't know the history; or, rather, I know only the early history, before you were at Intel.
I don't think we overlapped much.



> We used the same mechanism as P6, loads wait for oldest STA.

Fair enough for you, but I feel obliged to mention for the record that the
P6 simulators circa 1991 were not that limited. We chose to implement only
"loads don't pass stores whose address is unknown", but we evaluated other
policies. We knew the speedups with a perfect, oracle store-to-load
dependency predictor, and also with random prediction accuracies. I know
that we had proposed various STLF predictors, such as history based. I
suspect those were in branches off the main version control trunk, if they
were implemented in the simulator. (By the way, although randomized
predictors with a dialable accuracy are easy to do, and provide some
insight, they are misleading. Real predictors are not uniform random; and
if you knew the real predictor stats...)
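
For concreteness, a dialable random predictor is a few lines of this shape
(a sketch only; the class and parameter names are invented):

#include <random>

// Given oracle knowledge of whether this load truly conflicts with an
// older store, report the truth with probability 'accuracy' and the
// opposite otherwise. Uniform random, hence the caveat above: it does not
// mispredict in the clustered way a real history-based predictor does.
class DialableStlfPredictor {
public:
    explicit DialableStlfPredictor(double accuracy, unsigned seed = 1)
        : accuracy_(accuracy), rng_(seed), coin_(0.0, 1.0) {}

    bool predict_conflict(bool oracle_conflicts) {
        bool correct = coin_(rng_) < accuracy_;
        return correct ? oracle_conflicts : !oracle_conflicts;
    }

private:
    double accuracy_;  // 1.0 == oracle, 0.5 == coin flip
    std::mt19937_64 rng_;
    std::uniform_real_distribution<double> coin_;
};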

From: MitchAlsup on
On Jul 1, 9:39 pm, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:
> On 7/1/2010 7:40 PM, nedbrek wrote:
>
> > In my experience, stages from IPGEN to SCHED cost 1% per stage. Stages
> > from READY to READY (ignoring 1 for EXE) are 5-10% per stage. Stages
> > in the backend are 1% or less.
>
> > These sorts of trends are some of the first runs you do. If your model
> > can only sweep 10 or more pipestages (due to the stage configuration),
> > is 3 versus 5 really going to look significantly different?
>
> I'm old enough to remember when it was 5% for frontend pipestages, and 20% for execution loop pipestages.  I'm sure that
> the reduced importance of adding pipestages is due to (a) better predictors, (b) relatively slower memory, and (c) the
> fact that adding 1 pipestage on top of 5 is a big deal, but on top of 10 is not so big.

For great big OoO machines, 1% per front-end pipestage is pretty standard.
We saw 9%-12%-ish for not being able to do back-to-back integer
instructions, and a 33%-50% increase in frequency by <basically> doubling
the number of pipe stages.

> We may well be stuck in the double digits for pipestages. But I wonder if the pendulum may not want to swing the other way:

The pendulum definitely wants to swing that direction, but pure market
momentum and a bit of FUD are slowing the release of said pendulum.

> 1) because of device variation.   You get better yield (and performance, in terms of average latency per transistor or
> gate) if you have 20 gates per pipestage rather than Cray-like 8, and even better with 40.

Opteron is at 16 logic gates per pipe stage, or 20-21 gates if you
include flop, jitter, and skew.

CDC 6600 was 15 gates including the clear-set flop (Thornton)
CDC 7600 was 12 gates including the flop
Cray 1 was 10 gates including the latch (not a flop)
Cray 2 was 5 gates including the latch

The Cray 1 was slowed so as to avoid the noise in the FM radio band
(80 MHz); the Y-MP hopped to the other end of the FM band (105 MHz).

Based on projects I have done in the past, going from 16 gates per pipe
stage to 20 gates per pipestage results in a 20% improvement in
architectural figure of merit. That is, the frequency loss/gain is a
complete wash. Since power has reared its ugly head, doing more per cycle
and having fewer cycles will be a win.

Not only does the pipeline have fewer stages at 20 logic gates per cycle,
one can bang on the SRAMs and register ports twice per cycle and make
other activities of instruction processing more efficient. 16 gates per
cycle is about where designers want the architects to quit using the SRAMs
twice per cycle, but by 20 gates per cycle, nobody really cares if you use
the SRAMs twice per cycle. Thus, one gains cache bandwidth by slowing down
just a bit, and this makes the pipeline shorter, especially in the stages
nobody sees (post retire).

One can design/build a 6-7 pipestage x86 that cycles as fast as an Opteron
(given access to a fab with the same transistors and metal). This will end
up being a 1-wide monoScalar machine--think 486 with the modern instruction
set extensions and floating point latencies and cache hierarchy. My
simulations show this minuscule machine can get roughly 50% of the
performance of an Opteron for 10% of the die area and less than 5% of the
power.

> 2) seeking to minimize overheads such as setup and skew allowances, which helps both power and perf

The biggest lever left in power is speculation. That is: only do
activities for those instructions that will retire, or that have a very
high probability of retiring.

Mitch
From: MitchAlsup on
On Jul 1, 9:45 pm, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:

> (By the way, although randomized predictors with a dialable accuracy are
> easy to do, and provide some insight, they are misleading. Real predictors
> are not uniform random; and if you knew the real predictor stats...)

Which is why I have never been a fan of semi-accurate simulations.
High level architectural models have their place, but what I want is a
low level architectural model that contains (basically) everything but
the scan path! I want the architects to build the control machine as a
data path and run it through a trillion simulation cycles (without
failing). This control machine would be cycle accurate to the Verilog
model.

Mitch