From: Terje Mathisen "terje.mathisen at tmsw.no" on
nedbrek wrote:
> If there was one crazy new idea I'd want, it's the ability to run time
> backwards. I can't count the number of times I was tracking down a bug, and
> stepped one cycle too far!

This has been available in at least some SW debuggers for more than 10
years.

The fast way is to checkpoint the register state and use HW page level
protection to detect modifications to memory, while the slow method interprets
the code and saves a log of the previous value for all modified resources.
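
A minimal sketch of the fast method, assuming a POSIX host; the names (start_checkpoint, dirty_pages) are invented for illustration, and a real reverse-debugger would also snapshot register state and restore the dirtied pages on rewind:

#include <csignal>
#include <cstddef>
#include <cstdint>
#include <sys/mman.h>

static void*       dirty_pages[65536];     // pages written since the checkpoint
static std::size_t n_dirty = 0;
static const std::size_t PAGE = 4096;

// Write fault handler: log the dirtied page, then re-enable writes so the
// faulting instruction can proceed.
static void on_write_fault(int, siginfo_t* si, void*) {
    void* page = (void*)((std::uintptr_t)si->si_addr & ~(PAGE - 1));
    if (n_dirty < 65536) dirty_pages[n_dirty++] = page;
    mprotect(page, PAGE, PROT_READ | PROT_WRITE);
}

// Arm write detection on a page-aligned tracked region.  Register state would
// be snapshotted here as well (setjmp, ptrace, or similar).
void start_checkpoint(void* base, std::size_t bytes) {
    struct sigaction sa = {};
    sa.sa_sigaction = on_write_fault;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, nullptr);
    mprotect(base, bytes, PROT_READ);
}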

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Andy 'Krazy' Glew on
On 6/30/2010 2:21 AM, nedbrek wrote:
> Hello all,
>
> "MitchAlsup"<MitchAlsup(a)aol.com> wrote in message
> news:175c0f8b-92ed-46a6-8737-cb0db9c0b22e(a)z8g2000yqz.googlegroups.com...
> On Jun 29, 6:11 am, "nedbrek"<nedb...(a)yahoo.com> wrote:
>> "MitchAlsup"<MitchAl...(a)aol.com> wrote in message
>>>> Andy had said:
>>>>> To accomplish this, we connect the pipestages by queues or buffers (not
>>>>> necessarily in order), and timestamp queue entries with the earliest
>>>>> possible time that an entry can be consumed.
>>>>
>>>> I advocate actively pursuing the random ordering of pipestage
>>>> evaluation. This randomization exposes microarchitectural race
>>>> conditions.
>>>
>>> That's an interesting approach. I feel it's too close to the RUU (unless
>>> I
>>> am misunderstanding).
>>
>> Register Update Unit = RUU
>>
>> I suspect you mean the Register Transfer Level (RTL) model
>
> My exposure to SimpleScalar is limited, but I believe the RUU is a big queue
> with timestamps in it... my objection was to Andy's queue with timestamps.
>
> In order for a random order to work, you need queues with timestamps - or
> double buffering all the data shared between pipestages (then double pumping
> the clock, with a read/write)... right?


First off, I didn't say queue, I said queues.

Second, the RUU is microarchitecture - a uarch feature that I never considered implementable myself. By the way, I can't
remember timestamps in it - I suspect that you may be thinking of the SimpleScalar simulator implementation, which may
have had timestamps.

Whereas what I was talking about is the simulator, not the microarchitecture.

Look: there are queues, or at least latches, between your pipestages no matter what. All I am suggesting is that you
configure the SIMULATOR queues so that you can do experiments such as saying "what if I combine instruction decode and
in-order schedule into a single pipestage?" Whereas, if you evaluate the pipeline in reverse order, with a cycle per loop iteration, you
cannot even run that experiment. Glew observation: oftentimes people say "That's not practical", when actually
something is practical in hardware, just not in their simulator infrastructure.

Note also: simulator datastructures such as these timestamped queues that I am talking about are different from, and
should be distinguished from, hardware datastructures.

Let's see if I can contrive an example:

Say you have a 4 wide superscalar interface between instruction decode and renamer, D and R. You'll probably want a
hardware datastructure that corresponds to a 4-instruction-wide latch: Decoder_Output D2R_buffer[4]; with valid bits.

But if you are using a simulator infrastructure where you don't know what order pipestages are evaluated in, you might
build a queue between the D and R pipestages that can handle 8 entries: 4 entries that the R pipestage is going to
process in cycle N, as well as 4 entries that the D pipestage may be producing in cycle N to be read by R in cycle N+1.

Don't get confused: the fact that the SIMULATOR queue may have 8 entries, 2 cycles of 4 entries, does not mean that the
HARDWARE has 8 entries. You might, for example, implement the hardware with a 2 phase clocking methodology, so that
the R pipestage reads cycle N's inputs in the first, low phase, while the D pipestage writes the next cycle's inputs in
the second, high, phase.
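
To make that concrete, here is a minimal sketch of such a timestamped SIMULATOR queue in C++; SimQueue, D2R_Entry, and the fixed sizes are illustrative assumptions, not a description of any particular simulator:

#include <cstddef>
#include <cstdint>
#include <deque>

struct Decoder_Output { uint32_t uop; /* decoded fields elided */ };

// One queue entry: the payload plus the ARTIFACT timestamp that says when the
// consumer is first allowed to see it.
struct D2R_Entry {
    Decoder_Output payload;
    uint64_t ready_cycle;
};

class SimQueue {
public:
    SimQueue(std::size_t max_entries, uint64_t latency = 1)
        : max_(max_entries), latency_(latency) {}

    // Producer (D) pushes; the entry becomes visible 'latency' cycles later.
    bool push(const Decoder_Output& d, uint64_t now) {
        if (q_.size() >= max_) return false;      // models the finite HW buffer
        q_.push_back({d, now + latency_});
        return true;
    }

    // Consumer (R) pops only entries whose time has come due, so it gets the
    // same answer whether the simulator evaluated D before R or R before D.
    bool pop(Decoder_Output& out, uint64_t now) {
        if (q_.empty() || q_.front().ready_cycle > now) return false;
        out = q_.front().payload;
        q_.pop_front();
        return true;
    }

private:
    std::size_t max_;        // e.g. 8 = 2 cycles x 4 wide for the D-to-R latch
    uint64_t latency_;
    std::deque<D2R_Entry> q_;
};

The hardware analogue is still just the 4-wide D2R_buffer with valid bits; the extra entries and the timestamp exist only so that the simulator does not have to care about pipestage evaluation order or clocking methodology.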

We use these SIMULATOR datastructures to hide or abstract away such hardware details. Sure, you can build a simulator
that models two phase clocking in detail - but then if you come to a system where two phase clocking is no longer
allowed, your simulator is inaccurate. And, I think most of the systems I have worked on recently have not allowed two
phase clocking.

Basically, you use simulator datastructures whose mappings to hardware are understood. E.g. this timestamped queue
maps to 2 phase clocking trivially; in a single phase world, it maps to flip flops; or, you have to add more queue depth.

By the way, let me roughly describe what such a queue looks like:

* hardware granularity and alignment
     - i.e. does hardware think of it as a fixed-alignment queue, blocks of 4 always moving,
       or does hardware think of it as a misaligned queue, where HW can read 4 entries at any alignment
     - By the way - this should be parameterizable, since the decision to use an aligned or a misaligned
       queue is one of the basic studies you will always do
* hardware size (minimum)
* cycles to get across
again, good to parameterize IN THE SIMULATOR so that you can easily simulate different chip layouts,
with different delays for the wires

In general, the SIMULATOR queue will have # entries = the hardware size + hardware max granularity * (cycles + 1)
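
As a rough sketch (all names invented), those knobs and the sizing rule might look like:

#include <cstddef>

// Parameterization of the simulator queue per the list above.  The depth
// formula mirrors: entries = hw_size + hw_max_granularity * (cycles + 1).
struct SimQueueParams {
    bool        aligned;         // fixed-alignment blocks vs. reads at any alignment
    std::size_t hw_granularity;  // hardware max granularity, e.g. 4 wide
    std::size_t hw_size;         // hardware size (minimum)
    std::size_t wire_cycles;     // cycles to get across, layout dependent

    std::size_t simulator_entries() const {
        return hw_size + hw_granularity * (wire_cycles + 1);
    }
};

// Example: the D-to-R latch above, with same-cycle wires:
//   SimQueueParams d2r{true, 4, 4, 0};   // 4 + 4*(0+1) = 8 simulator entries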

===

Note: I do not recommend that you expose these timestamps to your simulation modules. Not unless your actual hardware
plans to do so, e.g. for scheduling (I've just run across some schedulers that work that way); and even if it does, I
strongly suggest that you make a very clear distinction between what data is REAL in HARDWARE, and what data is an
ARTIFACT of the SIMULATOR INFRASTRUCTURE. It is far too easy to cheat in a simulator, and depend on something that
can't be built in real hardware. This is something that happens all too often with academic simulators.

(Note that academic simulators have two classes of problem: (1) they depend on simulator features that can't be built
in real hardware; (2) they assume that simulator restrictions apply to real hardware.)

Like Mitch, I think that there is a place for simulators more realistic than something like SimpleScalar, but less
detailed and hence more agile than RTL. Simulators that you can't cheat in.

However, I also believe that there is a place for cycle accurate simulators that are - well, maybe not less detailed
than SimpleScalar, but more agile. Simulators in which you can run experiments such as saying "What if the latency through
this pipestage was 0?". Where you can quickly dial in multiple cycles of delay.
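
With the latency exposed as a parameter, as in the SimQueue sketch earlier, that kind of experiment becomes a one-line configuration change rather than a restructuring of the evaluation loop (names assumed from the sketch above):

// Baseline: one cycle between decode and rename.
SimQueue d2r(8, /*latency=*/1);

// Experiment: "what if the latency through this pipestage was 0?"
// i.e. decode and in-order schedule collapsed into a single pipestage.
// (A 0-cycle queue does require D to be evaluated before R within the cycle.)
SimQueue d2r_fused(8, /*latency=*/0);

// Experiment: dial in extra cycles of wire delay for a different chip layout.
SimQueue d2r_far(16, /*latency=*/3);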

In the detailed simulators, you might not allow yourself to use the timestamped queues that I talk about above. Whereas
in the more agile simulator you might use such timestamped queues to give yourself the agility.

I also believe that there is a place for simulators that are not cycle accurate. Like DFA.

===

By the way: you should, of course, try to make it impossible, difficult, or at least obvious to access simulator
artifact data from true hardware simulation. C++ encapsulation. Much harder to do in C; or, rather, much easier to
violate encapsulation by accident.
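
A sketch of one way to draw that line, building on the SimQueue sketch above; SimKernel is a hypothetical class standing in for the simulator infrastructure:

#include <cstdint>

// The timestamp is an ARTIFACT of the simulator; keep it private so model code
// can only see the payload, which is REAL in hardware.
class TimestampedEntry {
public:
    explicit TimestampedEntry(const Decoder_Output& d) : payload_(d) {}
    const Decoder_Output& payload() const { return payload_; }  // visible to models

private:
    friend class SimKernel;      // only the simulator kernel may touch this
    uint64_t ready_cycle_ = 0;   // simulator artifact, not buildable hardware state
    Decoder_Output payload_;
};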



From: Andy 'Krazy' Glew on
On 6/30/2010 2:11 AM, nedbrek wrote:
> Hello all,
>
> "Andy 'Krazy' Glew"<ag-news(a)patten-glew.net> wrote in message
> news:4C2A65F3.8080103(a)patten-glew.net...
>> On 6/29/2010 4:11 AM, nedbrek wrote:
>>
>>> For IPFsim, we had a nice infrastructure (using factories) to instantiate
>>> scheduler and execution frameworks. It supported in-order (for our
>>> McKinley
>>> comparisons), P3, P4, and HSW.
>>
>> I believe that the Itanium IPFsim infrastructure did not accurately model
>> OOO.
>> At the very least there were questions about its accuracy.
>
> We had issues with our architectural model. It was unable to provide
> stateless execution, which we needed for wrong path (I was working on one
> when we finally decided to stop work on Itanium).
>
> Wrong path issues are pretty well understood, it is unlikely we were facing
> any significant wrong path effects - because we were basically retreading
> historical OOO designs.
>
> I agree that moving forward with radical new designs would require wrong
> path.


It's not just wrong path. It's any form of data speculation. E.g. predicting loads bypassing stores that they are
actually dependent on.
From: Andy 'Krazy' Glew on
On 6/30/2010 2:57 AM, Terje Mathisen wrote:
> nedbrek wrote:
>> If there was one crazy new idea I'd want, it's the ability to run time
>> backwards. I can't count the number of times I was tracking down a
>> bug, and
>> stepped one cycle too far!
>
> This has been available in at least some SW debuggers for more than 10
> years.
>
> The fast way is to checkpoint the register state and use HW page level
> protection to detect modifications to memory, while the slow method interprets
> the code and saves a log of the previous value for all modified resources.
>
> Terje
>


I meant to add this to the discussion of simulator queues earlier: one nice thing is that
you can easily extend such queues to record cycles past as well as cycles present and future.

It's not full reversible execution. Best to get that out of a generic facility like a debugger, as Terje describes.

But it's easy and convenient, and only a slight extension, once you have SIMULATOR queues instead of just hardware
datastructures.
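
Roughly, and purely as a sketch (names invented): when the queue hands an entry to the consumer, it retires it into a bounded history keyed by the cycle it was consumed, instead of discarding it:

#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

template <typename Payload>
class QueueHistory {
public:
    explicit QueueHistory(std::size_t keep_cycles) : keep_(keep_cycles) {}

    // Called by the queue's pop() when an entry is handed to the consumer.
    void record(uint64_t cycle, const Payload& p) {
        log_.push_back({cycle, p});
        // Drop anything older than the window we promised to keep.
        while (!log_.empty() && log_.front().cycle + keep_ < cycle)
            log_.pop_front();
    }

    // "What did the consumer pipestage eat in cycle N?" - for debugging only.
    std::vector<Payload> consumed_in(uint64_t cycle) const {
        std::vector<Payload> out;
        for (const auto& e : log_)
            if (e.cycle == cycle) out.push_back(e.payload);
        return out;
    }

private:
    struct Rec { uint64_t cycle; Payload payload; };
    std::size_t keep_;
    std::deque<Rec> log_;
};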
From: nedbrek on
Hello all,

"Andy 'Krazy' Glew" <ag-news(a)patten-glew.net> wrote in message
news:4C2B6F9E.6070506(a)patten-glew.net...
> On 6/30/2010 2:11 AM, nedbrek wrote:
>> Hello all,
>>
>> "Andy 'Krazy' Glew"<ag-news(a)patten-glew.net> wrote in message
>> news:4C2A65F3.8080103(a)patten-glew.net...
>>> On 6/29/2010 4:11 AM, nedbrek wrote:
>>>
>>>> For IPFsim, we had a nice infrastructure (using factories) to
>>>> instantiate
>>>> scheduler and execution frameworks. It supported in-order (for our
>>>> McKinley
>>>> comparisons), P3, P4, and HSW.
>>>
>>> I believe that the Itanium IPFsim infrastructure did not accurately
>>> model
>>> OOO.
>>> At the very least there were questions about its accuracy.
>>
>> We had issues with our architectural model. It was unable to provide
>> stateless execution, which we needed for wrong path (I was working on one
>> when we finally decided to stop work on Itanium).
>>
>> Wrong path issues are pretty well understood, it is unlikely we were
>> facing
>> any significant wrong path effects - because we were basically retreading
>> historical OOO designs.
>>
>> I agree that moving forward with radical new designs would require wrong
>> path.
>
> It's not just wrong path. It's any form of data speculation. E.g.
> predicting
> loads bypassing stores that they are actually dependent on.

Sure, we wanted an execute-at-execute model. That is what we were driving
for.

But, you have to cut us some slack! There were two of us (plus an intern
for the summer). We came in with a blank slate for modelling out-of-order.
We were basically trying to reproduce P6, for Itanium.

We used the same mechanism as P6, loads wait for oldest STA.
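
(For readers who don't know the P6 mechanism: roughly, a load is not allowed to issue while any older store-address (STA) uop is still unexecuted, so a load can never bypass a store whose address is unknown. A sketch, with all names invented:)

#include <cstdint>
#include <vector>

struct StoreEntry {
    uint64_t seq;            // program-order sequence number
    bool     address_known;  // has the STA uop executed yet?
};

// A load may issue only when every older store has computed its address.
bool load_may_issue(uint64_t load_seq, const std::vector<StoreEntry>& store_buffer) {
    for (const auto& s : store_buffer)
        if (s.seq < load_seq && !s.address_known)
            return false;    // an older store's address is unknown: wait
    return true;
}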

Ned