From: Robert Myers on
On May 31, 4:12 pm, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:

>
> This is a better form of discussion.
>
> First, I apparently was not explicit enough in my earlier post about batch scheduling for instruction windows:
>
> It is *not* expandable split windows.   It is a single, contiguous, instruction window, that happens to be split across
> multiple clusters of execution units, or multiple processor cores.  It takes advantage of the OOO instruction window
> support that each cluster has - the scheduler or RS equivalent, etc.  But it stitches them together to form a single
> large instruction window.
>
> In my earlier post I discussed how this stitching is not so hard for instruction sets with a small number of registers.
>   Worst case, you save/restore all of the registers at batch boundaries.  That's not too hard for the historical 6
> registers of x86.  It's harder as you add more and more registers.  Eventually, you want to switch to a pull strategy,
> where you only transfer the registers across clusters when the new batch needs a value live-out of the old.
>
> You are correct to say that memory, store-to-load forwarding, is a greater challenge.   You could use the existing STLF
> buffers that most OOO systems have within the clusters.  But that leaves the problem of stitching together store buffers
> between batches, between clusters.
>
> Simplest strategy would be to make the 2 (or more) cluster STLFs look like a single one.  E.g. send a signal from the
> older cluster to the younger cluster (if the batch occupies the whole cluster) indicating that there is a store of address
> unknown.  That is sufficient to implement the P6-era strategy of blocking a load until all earlier stores have become
> address-ready.   But it is not sufficient to implement store-to-load forwarding.
>
> A load that becomes address ready is first compared to all stores that are address ready in its local cluster STLF
> buffer.  If matched, great, no traffic to the outside world.  If not matched, then you need to decide if you want to
> send the load to the earlier cluster's STLF buffer.  ...  At this point, I probably need to shut up, and write up an
> invention disclosure.  It doesn't seem all that hard to me, I daresay it seems obvious to me, but I've been told many
> times that what seems obvious to me isn't necessarily obvious to all.  Witness this discussion.
>
> Anyway, stitching together these store buffers - a pain, but I can see several, many, ways of doing so.
>
> If this stitching isn't good enough, there are several proposals for multilevel store buffers extant, both mine and
> those published in academia that I referred to in the previous post.  As I have said earlier, I like an L1 STLF
> associated with the cluster, and an L2 STLF shared between clusters.
>
> Finally, my log based verify re-execution suffices to serve as a catch-all mechanism to ensure correctness.  There are
> several other similar proposals that use verification to ensure correctness.
>
> We therefore have several different mechanisms to "stitch-together" STLF buffers.  So, it looks buildable.  Is it of any
> value?
>
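
What the stitched lookup might look like in miniature, with invented names
throughout - real hardware would be CAMs and comparators, not Python:

# Toy model of two per-cluster store-to-load forwarding (STLF) buffers
# stitched into one logical buffer.  A sketch under assumed structure,
# not any disclosed design.

class ClusterSTLF:
    def __init__(self):
        self.stores = []    # entries: [seq, addr, data]; addr None = unresolved

    def add_store(self, seq, addr=None, data=None):
        self.stores.append([seq, addr, data])

    def any_unresolved_older(self, load_seq):
        # P6-era conservatism: some older store still has an unknown address.
        return any(s[0] < load_seq and s[1] is None for s in self.stores)

    def forward(self, load_seq, addr):
        # Youngest store older than the load with a matching address.
        hits = [s for s in self.stores if s[0] < load_seq and s[1] == addr]
        return max(hits, key=lambda s: s[0])[2] if hits else None

INF = float('inf')          # "every store here is older than the load"

def stitched_load(seq, addr, young, old, memory):
    # Contiguous batches: all of 'old' is older than anything in 'young'.
    # The older cluster signals "store of unknown address" across the boundary.
    if young.any_unresolved_older(seq) or old.any_unresolved_older(INF):
        return 'BLOCKED'
    data = young.forward(seq, addr)      # local hit: no cross-cluster traffic
    if data is None:
        data = old.forward(INF, addr)    # local miss: send the load across
    return memory.get(addr, 0) if data is None else data

old, young, mem = ClusterSTLF(), ClusterSTLF(), {0x40: 7}
old.add_store(10, 0x40, 99)              # older batch stored 99 to 0x40
print(stitched_load(20, 0x40, young, old, mem))   # 99, forwarded across

Checking the local buffer first is exactly the filtering of cross-cluster
traffic: a local hit generates no traffic to the outside world at all.
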
It was the stitching together that seemed like magic to me, and your
expanded discussion helps a lot.

If a register is written to within the visible instruction stream,
then you don't need to pull it from the outside world. That seemed
beautiful right away. In the case of registers, you can tell what you
might need to wait on. Too bad memory doesn't work that way.
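
A toy model of the register "pull" makes that concrete (invented names; a
sketch of the idea, not Andy's design):

# Pull-based register stitching between batches/clusters, in miniature.

class Cluster:
    def __init__(self, regs=None):
        self.regs = dict(regs or {})     # architectural register -> value

def read_reg(new, old, reg, pulled):
    # A register written (or already pulled) in the new batch stays local;
    # only values live-out of the old batch cross the boundary, once each.
    if reg not in new.regs:
        new.regs[reg] = old.regs[reg]    # the one cross-cluster transfer
        pulled.add(reg)
    return new.regs[reg]

old = Cluster({'eax': 1, 'ebx': 2, 'ecx': 3})
new, pulled = Cluster(), set()
new.regs['eax'] = 42                     # written locally: never pulled
read_reg(new, old, 'ebx', pulled)
read_reg(new, old, 'ebx', pulled)        # second read stays local
print(pulled)                            # {'ebx'} - only the live-in moved

The push alternative - copying the whole register file at the batch
boundary - is the save/restore case: cheap for six x86 registers, much
less so for sixty-four.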

Even though you didn't explicitly address this point, the failure mode that
worried me so much--relying on incorrectly speculated addresses--
presumably won't happen because you won't rely on an address from the
outside world until it is known definitively, and the outside world
can speculate internally and still know which addresses can be passed
on as known without risk arising from speculation.
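
The log-based verify re-execution mentioned above is presumably the backstop
for whatever speculation leaks through. A guess at its shape - purely my own
sketch, not Andy's disclosed mechanism:

# Replay a log of memory operations in program order; every load must see
# the value it speculatively used, else squash and re-execute.

def verify(log, memory):
    # log entries: ('store', addr, value) or ('load', addr, value_used)
    mem = dict(memory)
    for kind, addr, value in log:
        if kind == 'store':
            mem[addr] = value
        elif mem.get(addr, 0) != value:
            return ('flush', addr)       # misspeculation detected
    return ('ok', None)

log = [('store', 0x10, 5), ('load', 0x10, 5), ('load', 0x20, 0)]
print(verify(log, {}))                   # ('ok', None)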

> ---
>
> Now, yes, I have in other posts discussed SpMT, expandable split windows.  And, yes, the original post of this thread
> started off with batched, contiguous, instruction windows, but segued into split windows.  If that's what we want to
> discuss, yes, let's do so.
>
> However, in my log based SpMT, I drain instructions out of the window, and keep them in the log. Which really amounts to
> an L2 instruction window, just one that is very cheap.
>
[Robert wrote]:
> > but that's sort of like saying, "Suppose
> > all the problems that (say) CSP was invented to address didn't
> > exist."  I'm visualizing instruction windows clogged with instructions
> > waiting to complete because some far away memory reference might
> > change everything.  If I don't understand something, it's why you
> > think it will work well enough to be useful, not how it would work in
> > a single instance.
>
> So, first, it is well-known that there is some potential in simply increasing instruction window size.  The problem is
> how costly the hardware is, esp. power-wise, to do so.
>
> E.g. the kilo-instruction papers report 3.5x for FP and 1.5x for integer, going from instruction windows of 128 to 4K
> instructions.
>
> Cristal, A., Santana, O. J., Valero, M., and Martínez, J. F. 2004. Toward kilo-instruction processors. ACM Trans.
> Archit. Code Optim. 1, 4 (Dec. 2004), 389-417. DOI=http://doi.acm.org/10.1145/1044823.1044825
>
> Now, that is not a very good payoff.  Definitely sublinear.  But the whole point is not that we are trying to build
> processors that are 4096/128 = 32X bigger.  First, the OOO hardware is only a fraction of processor area, especially if
> you have big vector units like some recent proposals.  Second, more importantly, to build big effective instruction
> windows you do not need to build all that hardware.  As the kilo-instruction papers, and my own work, show.
>
> Now, the batched instruction window proposal is on the wimpy side of this.  If it keeps the instruction window
> contiguous, it is paying the full cost for every instruction in the window.  Other proposals optimize that; batched doesn't.
>
> But, the batched proposal is easy to build.
>
> I think really what I am saying is that there are a number of techniques that help improve single thread performance.
> Batching is one of the simplest ones.  SpMT is more aggressive.  I'm collecting those techniques, and hope to pick and
> choose from such a toolkit if ever I am building an aggressive single thread microarchitecture again.
>
> But, also, I am not pushing this as an alternative to improving parallel performance, such as via message-passing
> techniques like CSP (I half expect Nick to say "CSP isn't message passing"), or shared memory parallelism.  I'm open to
> all such techniques to improve performance, and I don't denigrate any of them.
>
> What I do denigrate is people saying "The only thing that anyone nowadays should be building is the simplest possible
> in-order processors, many, many of them, to save power and ... " bet the farm on parallel programming.  IMHO that does
> not make sense for general purpose workloads.  It might make sense for some supercomputers and other HPC, and for some
> graphics.  And I'll turn your point around: if you are saying the above, then you are saying "And a miracle happens..."
> with respect to parallel programming.  However, in truth, I don't think you are saying that.
>
I'm not, for sure. I was merely invoking conservation of difficulty.
You may have misunderstood the intent of some of my previous posts, in
particular the paper that suggests there isn't much payoff from
microarchitectural innovation for OLTP workloads. I only want to
understand why we are where we are, not to insist that it is
inevitable. I'll admit, though, that I remain pessimistic, because I
can't identify a big market for significantly improved single-thread
performance, especially if it reduces performance per watt.

>
> First, out-of-order is only N^2 in simplistic algorithms.  I.e. OOO is N^2 the same way sorting is O(N^2) - i.e. it isn't.
>
> See, for example, the Ultrascalar papers, which show logarithmic gate delays and sqrt wire length, essentially by showing that
> OOO is a parallel prefix problem.
>
> Henry, D. S., Kuszmaul, B. C., and Viswanath, V. 1999. The Ultrascalar Processor-An Asymptotically Scalable Superscalar
> Microarchitecture. In Proceedings of the 20th Anniversary Conference on Advanced Research in VLSI (March 21 - 24, 1999).
> ARVLSI. IEEE Computer Society, Washington, DC, 256.
>
> Arguably their results apply only to renaming of registers and memory, but I think they can extend to all OOO execution.
>
> Now, I'm not talking ultrascalar - I don't think I could justify building a new microarchitecture - so it is fair to ask
> how I am ducking the O(N^2) of simplistic OOO.  But it is important to note the theoretical limit: we aren't violating
> theory, just using O(N^2) algorithms up until it gets painful, and then switching to more scalable approaches.
>
> Wrt the batched instruction window, above I started discussing how to stitch instruction windows together.  Sorry, I
> started discussing, but stopped.  But the basic idea is that not all traffic needs to cross the cluster boundary. You
> can filter out a lot that doesn't need to cross, make predictions for some of the rest, and you are left with a lot less.
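
To unpack the parallel-prefix point above: a rename map is a partial function
from architectural to physical registers, composing such maps is associative,
and so the map each instruction sees is a prefix - computable by a log-depth
scan network. A toy illustration (sequential Python standing in for the tree;
names invented):

# Renaming as a parallel prefix over rename maps.

def compose(older, newer):
    out = dict(older)
    out.update(newer)
    return out

def exclusive_scan(maps):
    # For instruction i: the composition of maps[0..i-1], i.e. the most
    # recent producer of each architectural register before i.  A scan
    # network computes all prefixes in O(log N) levels of compose().
    out, acc = [], {}
    for m in maps:
        out.append(dict(acc))
        acc = compose(acc, m)
    return out

# r1 = ...;  r2 = f(r1);  r1 = g(r2)  -> dests renamed to p0, p1, p2
maps = [{'r1': 'p0'}, {'r2': 'p1'}, {'r1': 'p2'}]
for i, sources in enumerate(exclusive_scan(maps)):
    print(i, sources)   # instr 1 reads r1 from p0; instr 2 reads r2 from p1
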

I suspect that you are making some kind of assumption about data
locality in the instruction stream, so that a large fraction of
address references can be identified by examining the local STLF buffer.
That assumption isn't crucial. It's just that, as it becomes less
true, you have to work a lot harder.

Robert.

From: nedbrek on
Hello all,

"Robert Myers" <rbmyersusa(a)gmail.com> wrote in message
news:72779549-89d9-4645-ae02-32995c1f1227(a)d12g2000vbr.googlegroups.com...
>
> I'll admit, though, that I remain pessimistic, because I
> can't identify a big market for significantly improved single-thread
> performance, especially if it reduces performance per watt.

For the same process and circuit tricks, performance per watt will always
decrease as performance goes up (I think some information theory guy could
prove this).

Therefore, maximizing P/W is equivalent to minimizing performance (0 perf for 0 power
FTW).

Maybe you meant P^2 (or P^3) per Watt. Sorry, it is a hang-up of mine. A
lot of people say "performance per watt" and I just want to throttle them.

Another good metric is max perf for given power envelope (desktop, laptop,
contact lens, etc.).
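
To make the hang-up concrete, a toy first-order calculation (assumed scaling:
perf ~ f, V ~ f, so dynamic power ~ C*V^2*f ~ f^3; not measured data):

for f in (1.0, 1.5, 2.0):
    perf, power = f, f ** 3
    print(f"{f:.1f}x clock: perf/W = {perf / power:.2f}, "
          f"perf^3/W = {perf ** 3 / power:.2f}")
# perf/W falls as you clock up (it rewards clocking down toward zero);
# perf^3/W is invariant under pure voltage/frequency scaling.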

Ned


From: Robert Myers on
On Jun 1, 7:16 am, "nedbrek" <nedb...(a)yahoo.com> wrote:
> Hello all,
>
> "Robert Myers" <rbmyers...(a)gmail.com> wrote in message
>
> news:72779549-89d9-4645-ae02-32995c1f1227(a)d12g2000vbr.googlegroups.com...
>
>
>
> > I'll admit, though, that I remain pessimistic, because I
> > can't identify a big market for significantly improved single-thread
> > performance, especially if it reduces performance per watt.
>
> For the same process and circuit tricks, performance per watt will always
> decrease as performance goes up (I think some information theory guy could
> prove this).
>
> Therefore, maximizing P/W is equivalent to minimizing performance (0 perf for 0 power
> FTW).
>
> Maybe you meant P^2 (or P^3) per Watt.  Sorry, it is a hang-up of mine.  A
> lot of people say "performance per watt" and I just want to throttle them.
>
> Another good metric is max perf for given power envelope (desktop, laptop,
> contact lens, etc.).
>
I was being imprecise. The metric that really matters for non-mobile
installations (or even desktop replacement notebooks) is the perceived
cost of getting things done. That's a calculation that extends well
beyond computer architecture.

One big selling point for the kinds of architectural innovations Andy
has been talking about is increased latency tolerance, and, there, I
think there is a market. Whether it's a big enough market to be
interesting is another question.

Robert.



From: Andy 'Krazy' Glew on
On 5/31/2010 6:45 PM, Robert Myers wrote:
> On May 31, 4:12 pm, Andy 'Krazy' Glew<ag-n...(a)patten-glew.net> wrote:
>> So, first, it is well-known that there is some potential in simply increasing instruction window size. The problem is
>> how costly the hardware is, esp. power-wise, to do so.
>>
>> E.g. the kilo-instruction papers report 3.5x for FP and 1.5x for integer, going from instruction windows of 128 to 4K
>> instructions.
>>
>> Cristal, A., Santana, O. J., Valero, M., and Martínez, J. F. 2004. Toward kilo-instruction processors. ACM Trans.
>> Archit. Code Optim. 1, 4 (Dec. 2004), 389-417. DOI=http://doi.acm.org/10.1145/1044823.1044825
>>
>> Now, that is not a very good payoff. Definitely sublinear. But the whole point is not that we are trying to build
>> processors that are 4096/128 = 32X bigger. First, the OOO hardware is only a fraction of processor area, especially if
>> you have big vector units like some recent proposals. Second, more importantly, to build big effective instruction
>> windows you do not need to build all that hardware. As the kilo-instruction papers, and my own work, show.
>>
>> Now, the batched instruction window proposal is on the wimpy side of this. If it keeps the instruction window
>> contiguous, it is paying the full cost for every instruction in the window. Other proposals optimize that; batched doesn't.

I forgot to say that the Kilo-instruction paper speedups I report above make several assumptions that are probably
conservative. E.g. 4-wide.



>> First, out-of-order is only N^2 in simplistic algorithms. I.e. OOO is N^2 the same way sorting is O(N^2) - i.e. it isn't.
>>
>> See, for example, the Ultrascalar papers show logarithmic gate delays and sqrt wire length, essentially by showing that
>> OOO is a parallel prefix problem.
>>
>> Henry, D. S., Kuszmaul, B. C., and Viswanath, V. 1999. The Ultrascalar Processor-An Asymptotically Scalable Superscalar
>> Microarchitecture. In Proceedings of the 20th Anniversary Conference on Advanced Research in VLSI (March 21 - 24, 1999).
>> ARVLSI. IEEE Computer Society, Washington, DC, 256.
>>
>> Arguably their results apply only to renaming of registers and memory, but I think they can extend to all OOO execution.
>>
>> Now, I'm not talking ultrascalar - I don't think I could justify building a new microarchitecture - so it is fair to ask
>> how I am ducking the O(N^2) of simplistic OOO. But it is important to note the theoretical limit: we aren't violating
>> theory, just using O(N^2) algorithms up until it gets painful, and then switching to more scalable approaches.
>>
>> Wrt the batched instruction window, above I started discussing how to stitch instruction windows together. Sorry, I
>> started discussing, but stopped. But the basic idea is that not all traffic needs to cross the cluster boundary. You
>> can filter out a lot that doesn't need to cross, make predictions for some of the rest, and you are left with a lot less.
>
> I suspect that you are making some kind of assumption about data
> locality in the instruction stream, so that a large fraction of
> address references can be identified by examining the local STLF buffer.
> That assumption isn't crucial. It's just that, as it becomes less
> true, you have to work a lot harder.

The Ultrascalar papers make no such assumptions about locality.

For the batched instruction window, I do make such assumptions. And where it is not true, we would have to fall back to
a more aggressive strategy.


--


By the way, you mentioned OLTP.

Simple, TPC-C-style OLTP - queries of the form "update this account" or "what is this account balance?" - have little
intra-query parallelism. They basically walk linked trees until they get to the end.

More complicated transaction processing, joins, etc., have oodles of parallelism. However, if you are joining on a key
that is indexed, for which the data table itself is not sorted, the parallelism is middle loop or outer loop, not inner
loop. The inner loop is the aforementioned tree or hash pointer chase.

Of course, such joins are also parallelizable.
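
In code terms, a toy illustration with an invented schema - one probe is a
serial pointer chase, while the loop over probe keys parallelizes:

from concurrent.futures import ThreadPoolExecutor

class Node:
    def __init__(self, key, value, nxt=None):
        self.key, self.value, self.nxt = key, value, nxt

def probe(head, key):
    # Inner loop: inherently sequential; each load depends on the last.
    n = head
    while n is not None and n.key != key:
        n = n.nxt
    return n.value if n else None

index = Node(3, 'c', Node(2, 'b', Node(1, 'a')))   # stand-in for a tree walk
keys = [1, 2, 3, 2]
with ThreadPoolExecutor() as pool:                 # middle-loop parallelism
    print(list(pool.map(lambda k: probe(index, k), keys)))
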
From: Andy 'Krazy' Glew on
On 6/1/2010 6:55 AM, Robert Myers wrote:

> One big selling point for the kinds of architectural innovations Andy
> has been talking about is increased latency tolerance, and, there, I
> think there is a market. Whether it's a big enough market to be
> interesting is another question.

I don't think there is, or will be, such a market until we have mined the low hanging fruit of parallelism, multi-core,
and many-core.