x86 simulator (Was Re: RISC load-store versus x86 Add from memory.) [Computer Architecture]


From: nedbrek on 28 Jun 2010 10:14

Hello all,

"Andy 'Krazy' Glew" <ag-news(a)patten-glew.net> wrote in message
news:4C2627FD.9030100(a)patten-glew.net...
> On 6/25/2010 5:51 PM, mac wrote:
>
> The Pin interface, http://www.pintool.org/, may be a good start.

I meant to post a status update on my search for x86 performance
simulators...

I looked at PTLsim (and a related project, MARSS). I then realized that
every simulator is just:
    while (1) {
        commit();
        exe();
        schedule();
        dispatch();
        fetch();
    }

The interesting points are really in the architectural model, the memory
system, and the system emulation. These are the hard part, and the part you
must be most familiar with (to understand the impact on your other ideas).

Thus, there is really no point in trying to reuse an existing infrastructure
(since you need total knowledge, you must rewrite it to understand it).

So, I started on my own arch model, with intentions of developing the system
model...

In the meantime, I'd like to do some simple studies. In this case, I think
a lighter weight system would be good.

Looking at Pin, I think I can throw together a DFA-like simulator pretty
quickly...

I should be back soon...

Ned

From: Andy 'Krazy' Glew on 28 Jun 2010 09:53

On 6/28/2010 7:14 AM, nedbrek wrote:
> I then realized that
> every simulator is just:
> while(1) {
> commit()
> exe()
> schedule()
> dispatch()
> fetch()
> }
>

Not quite.

What you have above is the so-called "reverse pipeline" model, particularly if every iteration of the outer loop
corresponds to a cycle.

If so, then such a simulator cannot model a pipeline in which any pipestage takes 0 cycles.
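
A minimal sketch of that limitation, assuming a single latch between adjacent stages. The names `Pipe`, `pipe_init`, and `pipe_cycle` are hypothetical and not taken from any simulator mentioned in this thread. Because the stages are evaluated last to first, each stage drains its input latch before the upstream stage refills it, so an instruction advances exactly one stage per call and can never pass through a stage in zero cycles.

```c
#include <assert.h>
#include <stddef.h>

#define NSTAGES 5          /* fetch, dispatch, schedule, exe, commit */
#define EMPTY   (-1)

/* latch[i] holds the instruction id waiting at the input of stage i. */
typedef struct {
    int latch[NSTAGES];
    int next_id;           /* next instruction id to fetch */
    int committed;         /* number of instructions retired so far */
} Pipe;

void pipe_init(Pipe *p) {
    assert(p != NULL);
    for (int i = 0; i < NSTAGES; i++) p->latch[i] = EMPTY;
    p->next_id = 0;
    p->committed = 0;
}

/* One simulated cycle, stages evaluated from the back of the pipe to the front. */
void pipe_cycle(Pipe *p) {
    assert(p != NULL);
    /* commit: retire whatever sits in the last latch */
    if (p->latch[NSTAGES - 1] != EMPTY) {
        p->committed++;
        p->latch[NSTAGES - 1] = EMPTY;
    }
    /* middle stages: advance one latch if the next latch is free */
    for (int i = NSTAGES - 1; i > 0; i--) {
        if (p->latch[i] == EMPTY && p->latch[i - 1] != EMPTY) {
            p->latch[i] = p->latch[i - 1];
            p->latch[i - 1] = EMPTY;
        }
    }
    /* fetch: refill the front latch */
    if (p->latch[0] == EMPTY) p->latch[0] = p->next_id++;
}
```

Calling `pipe_cycle` repeatedly on a freshly initialized `Pipe` moves each instruction one latch per call. There is no way in this structure to express a stage that costs nothing.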

Now, while at the moment we tend to assume that traditional RISC 5-stage pipelines are the shortest pipelines likely,
some of us (me, at least) like being able to model eliminating the schedule pipestage, etc. To accomplish this, we
connect the pipestages by queues or buffers (not necessarily in order), and timestamp queue entries with the earliest
possible time that an entry can be consumed.

This leads to

    for every cycle
        fetch(q1)
        dispatch(q1,q2)
        schedule(q2,q3)
        exe(q3,q4)
        commit(q4)

or

    for every cycle
        while cycle not done
            fetch(q1)
            dispatch(q1,q2)
            schedule(q2,q3)
            exe(q3,q4)
            commit(q4)

and, in general, the pipeline network is represented by a datastructure, not by code, allowing arbitrary order of
evaluation of pipestages. The better simulators sort the pipestages for efficient evaluation.
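
A rough sketch of the timestamped-queue idea, under simplifying assumptions: the queues here are plain FIFOs (the text above allows out-of-order queues), every stage is the same generic mover, and all names (`Entry`, `Queue`, `q_push`, `q_pop_ready`, `stage`, `commit_cycle`) are hypothetical. Each entry carries the earliest cycle at which it may be consumed. A stage with latency 0 stamps its output with the current cycle, so the next stage, evaluated later in the same pass, can pick it up immediately.

```c
#include <assert.h>
#include <stddef.h>

#define QCAP 16

typedef struct { int id; long ready; } Entry;   /* ready = earliest cycle it may be consumed */
typedef struct { Entry e[QCAP]; size_t head, count; } Queue;

/* Append an entry; returns 0 if the queue is full. */
int q_push(Queue *q, int id, long ready) {
    assert(q->count <= QCAP);
    if (q->count == QCAP) return 0;
    q->e[(q->head + q->count) % QCAP] = (Entry){ id, ready };
    q->count++;
    return 1;
}

/* Pop the head entry only if its timestamp has been reached. */
int q_pop_ready(Queue *q, long now, int *id) {
    if (q->count == 0 || q->e[q->head].ready > now) return 0;
    *id = q->e[q->head].id;
    q->head = (q->head + 1) % QCAP;
    q->count--;
    return 1;
}

/* Generic pipestage: move every consumable entry downstream,
   stamping it with the cycle at which the next stage may take it. */
void stage(Queue *in, Queue *out, long now, long latency) {
    int id;
    while (out->count < QCAP && q_pop_ready(in, now, &id))
        q_push(out, id, now + latency);
}

/* Cycle in which one instruction, made available by fetch in cycle 0,
   is consumed by commit. Stages are evaluated front to back.
   Returns -1 if it has not committed within 100 cycles. */
long commit_cycle(long lat_dispatch, long lat_schedule, long lat_exe) {
    Queue q1 = {0}, q2 = {0}, q3 = {0}, q4 = {0};
    int id;
    q_push(&q1, 0, 0);
    for (long now = 0; now < 100; now++) {
        stage(&q1, &q2, now, lat_dispatch);
        stage(&q2, &q3, now, lat_schedule);
        stage(&q3, &q4, now, lat_exe);
        if (q_pop_ready(&q4, now, &id)) return now;
    }
    return -1;
}
```

With these definitions, `commit_cycle(1, 1, 1)` and `commit_cycle(1, 0, 1)` differ by one cycle: setting the schedule latency to 0 removes that pipestage from the timing without touching the loop structure.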

> The interesting points are really in the architectural model, the memory
> system, and the system emulation. These are the hard part, and the part you
> must be most familiar with (to understand the impact on your other ideas).
>
> Thus, there is really no point in trying to reuse an existing infrastructure
> (since you need total knowledge, you must rewrite it to understand it).
>
> So, I started on my own arch model, with intentions of developing the system
> model...
>
> In the meantime, I'd like to do some simple studies. In this case, I think
> a lighter weight system would be good.
>
> Looking at Pin, I think I can throw together a DFA-like simulator pretty
> quickly...
>
> I should be back soon...

Amen.

You don't want a simulator.

You want a library of simulator components, and a toolbox of different simulator frameworks.

From: MitchAlsup on 28 Jun 2010 11:55

On Jun 28, 8:53 am, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:
> and, in general, the pipeline network is represented by a datastructure, not by code, allowing arbitrary order of
> evaluation of pipestages. The better simulators sort the pipestages for efficient evaluation.

I advocate actively pursuing the random ordering of pipestage
evaluation. This randomization exposes microarchitectural race
conditions.

Thus one might put the pipe stages in an array, and then randomize the
array before each clock cycle such as:

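/* A stage is a function; the CPU_t argument type is assumed here. */
typedef void (*PipeStage)( CPU_t *cpu, int stage );

/* Store the functions themselves (fetch), not their results (fetch()). */
PipeStage pipestages[] = { fetch, decode, stations,
                           execute, cache, writeback, update };
# define NUMSTAGES (sizeof pipestages / sizeof pipestages[0])
PipeStage order[ NUMSTAGES ];   /* renamed from random, which clashes with random() */

memcpy( order, pipestages, sizeof pipestages );   /* arrays cannot be assigned in C */
while( FOREVER )
{
    randomize( order, NUMSTAGES );
    for( cpu = 0; cpu < CPUs; cpu++ )
        for( stage = 0; stage < NUMSTAGES; stage++ )
            order[stage]( &CPU[cpu], stage );
}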

randomize can start out as the null randomizer and be advanced when
the rest of the simulator is ready. Just swapping two elements at a
time is completely sufficient to stumble upon these microarchitectural
race conditions as long as you do not backtrack, and you have a good
random number generator. Sometimes, you will want a nonrandom number
generator to direct the randomization goals, and you should be sure to
test the pipeline in the straight forward and straight backwards
directions. {Don't forget to also randomize the order of the memory
hierarchy and southbridge components, and any other sub-system in
different clock domains.}
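
As a toy illustration of the kind of race this exposes (a sketch only, with hypothetical names `Latches`, `stage_clean`, `stage_racy`, and `run`, and with stages reduced to "add one to the input latch"): a stage that reads only current-cycle state and writes only next-cycle state gives the same answer in any evaluation order, while a stage that updates its output latch in place gives different answers forward and backward. The `randomize` here swaps two slots per cycle, as described above, using the C library `rand()`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NUMSTAGES 4

typedef struct {
    int cur[NUMSTAGES + 1];   /* latch values visible during this cycle */
    int nxt[NUMSTAGES + 1];   /* latch values being computed for the next cycle */
} Latches;

typedef void (*StageFn)(Latches *l, int i);

/* Well-behaved stage: reads only cur[], writes only nxt[]. */
void stage_clean(Latches *l, int i) { l->nxt[i + 1] = l->cur[i] + 1; }

/* Buggy stage: updates its output latch in place, so a downstream stage
   evaluated later in the same cycle sees the new value one cycle early. */
void stage_racy(Latches *l, int i) {
    l->cur[i + 1] = l->cur[i] + 1;
    l->nxt[i + 1] = l->cur[i + 1];
}

/* Swap two randomly chosen slots of the evaluation order. */
void randomize(int *order, int n) {
    assert(n > 0);
    int a = rand() % n, b = rand() % n;
    int t = order[a]; order[a] = order[b]; order[b] = t;
}

enum { FORWARD, BACKWARD, RANDOM };

/* Run the toy pipe for `cycles` cycles and return the final output latch. */
int run(StageFn fn, int mode, int cycles, unsigned seed) {
    Latches l;
    int order[NUMSTAGES];
    memset(&l, 0, sizeof l);
    for (int i = 0; i < NUMSTAGES; i++)
        order[i] = (mode == BACKWARD) ? NUMSTAGES - 1 - i : i;
    srand(seed);
    for (int c = 0; c < cycles; c++) {
        if (mode == RANDOM) randomize(order, NUMSTAGES);
        l.nxt[0] = c;                                  /* source feeding the first stage */
        for (int k = 0; k < NUMSTAGES; k++) fn(&l, order[k]);
        memcpy(l.cur, l.nxt, sizeof l.cur);            /* clock edge */
    }
    return l.cur[NUMSTAGES];
}
```

Comparing `run(stage_racy, FORWARD, 10, 1)` with `run(stage_racy, BACKWARD, 10, 1)` shows the disagreement, whereas `stage_clean` returns the same value in all three modes.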

Mitch

From: nedbrek on 29 Jun 2010 07:11

Hello all,

"MitchAlsup" <MitchAlsup(a)aol.com> wrote in message
news:77b7636d-0ed3-4758-8ff8-d9beb1965c18(a)c33g2000yqm.googlegroups.com...
> On Jun 28, 8:53 am, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:
>> and, in general, the pipeline network is represented by a datastructure,
>> not
>> by code, allowing arbitrary order of evaluation of pipestages. The better
>> simulators sort the pipestages for efficient evaluation.
>
> I advocate actively pursuing the random ordering of pipestage
> evaluation. This randomization exposes microarchitectural race
> conditions.

That's an interesting approach. I feel it's too close to the RUU (unless I
am misunderstanding). I don't like having timestamps, except for debugging.
I am of the camp "let there be a software structure for each hardware
structure, and code for logic" (although parts of the memory and I/O system
often devolve into timestamped queues, due to the enormous latencies).

For IPFsim, we had a nice infrastructure (using factories) to instantiate
scheduler and execution frameworks. It supported in-order (for our McKinley
comparisons), P3, P4, and HSW.

Most of the debugging I've done is through testing the extremities of knobs
(open everything up and graph the performance, look for outliers), stepping
through code, and looking at execution traces (Ed Grochowski wrote a nice
tool for visualizing them, called Pipedream - it was this tool which helped
convert him to the out-of-order faith).

If there was one crazy new idea I'd want, it's the ability to run time
backwards. I can't count the number of times I was tracking down a bug, and
stepped one cycle too far!

Ned

From: Muzaffer Kal on 29 Jun 2010 10:27

On Tue, 29 Jun 2010 06:11:52 -0500, "nedbrek" <nedbrek(a)yahoo.com>
wrote:
>
>If there was one crazy new idea I'd want, it's the ability to run time
>backwards. I can't count the number of times I was tracking down a bug, and
>stepped one cycle too far!

Isn't this as easy as keeping the last N cycle/instruction states and
reloading one of them?
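
One way this could look, as a sketch under the assumption that the whole simulator state fits in a copyable struct. `SimState`, `History`, `hist_save`, `hist_undo`, and `sim_step` are hypothetical stand-ins, and `sim_step` is a toy cycle function rather than a real pipeline model. A ring buffer keeps the last `HISTORY` snapshots, and stepping backwards pops the most recent one.

```c
#include <assert.h>
#include <stddef.h>

#define HISTORY 8               /* N: how many cycles can be undone */

typedef struct { long cycle; int regs[4]; } SimState;   /* stand-in for the full state */

typedef struct {
    SimState snap[HISTORY];
    int count;                  /* number of valid snapshots, at most HISTORY */
    int top;                    /* slot where the next snapshot will be written */
} History;

void hist_init(History *h) { h->count = 0; h->top = 0; }

/* Record a snapshot, overwriting the oldest one once the ring is full. */
void hist_save(History *h, const SimState *s) {
    assert(h != NULL && s != NULL);
    h->snap[h->top] = *s;
    h->top = (h->top + 1) % HISTORY;
    if (h->count < HISTORY) h->count++;
}

/* Restore the most recent snapshot; returns 0 when no history is left. */
int hist_undo(History *h, SimState *s) {
    if (h->count == 0) return 0;
    h->top = (h->top + HISTORY - 1) % HISTORY;
    *s = h->snap[h->top];
    h->count--;
    return 1;
}

/* Toy cycle function: snapshot the state before the cycle, then advance it. */
void sim_step(SimState *s, History *h) {
    hist_save(h, s);
    s->cycle++;
    s->regs[s->cycle % 4] += (int)s->cycle;
}
```

The cost is one full state copy per cycle, so for a model with large caches or memory images a log of changes per cycle may be more practical than whole snapshots.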

--
Muzaffer Kal

DSPIA INC.
ASIC/FPGA Design Services

http://www.dspia.com
