From: Daniel A. Jimenez on
In article <2b90$4ad7b7c8$45c49ea8$21677(a)TEKSAVVY.COM>,
EricP <ThatWouldBeTelling(a)thevillage.com> wrote:
>Daniel A. Jimenez wrote:
>> ...
>> Trace cache is another more-or-less recent microarchitectural innovation
>> that allowed Pentium 4 to get away with decoding one x86 instruction
>> per cycle and still have peak IPC greater than 1.
>
>Actually trace cache goes back to the VAX HPS, circa 1985.
>They called the decoded instruction cache a "node cache".
>As far as I know, VAX HPS was never built though.

Was it a trace cache, i.e., were decoded instructions stored in order of
execution rather than the order of the program text?

>> Cracking instructions into micro-ops, scheduling the micro-ops, then fusing
>> the micro-ops back together in a different way later in the pipeline allows
>> an effectively larger instruction window and more efficient pipeline.
>> That's a relatively recent innovation, too.
>
>Except for the fused micro-ops, this was also VAX HPS.

The fused micro-ops are the innovation. They allow an effectively larger
instruction window, so more ILP, more performance. One can argue that
micro-ops are equivalent to microcode, which predates minis.
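
To put the window argument in concrete terms, here is a minimal, purely
illustrative sketch; the micro-op names and the fusion rule are made up
for the example and are not a model of any real core:

    # Decoded micro-ops for a short dynamic instruction sequence.
    # "add [mem], reg" cracks into load/add/store; "cmp ; jne" decodes
    # into two micro-ops that a fusing machine can pair up again.
    uops = ["load  tmp, [mem]",
            "add   tmp, reg",
            "store [mem], tmp",
            "cmp   reg, imm",
            "jne   L"]

    FUSABLE = {("load", "add"), ("cmp", "jne")}   # toy fusion rule

    def window_entries(uops, fuse):
        """Count scheduler/ROB entries consumed, optionally fusing pairs."""
        entries, i = 0, 0
        while i < len(uops):
            a = uops[i].split()[0]
            b = uops[i + 1].split()[0] if i + 1 < len(uops) else ""
            if fuse and (a, b) in FUSABLE:
                i += 2        # two micro-ops share one window entry
            else:
                i += 1
            entries += 1
        return entries

    print(window_entries(uops, fuse=False))   # 5 entries
    print(window_entries(uops, fuse=True))    # 3 entries
    # With a fixed-size window, packing ~2 micro-ops per entry lets
    # roughly twice as much of the program be in flight, hence more ILP.
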
--
Daniel Jimenez djimenez(a)cs.utexas.edu
"I've so much music in my head" -- Maurice Ravel, shortly before his death.
" " -- John Cage
From: EricP on
Daniel A. Jimenez wrote:
> In article <2b90$4ad7b7c8$45c49ea8$21677(a)TEKSAVVY.COM>,
> EricP <ThatWouldBeTelling(a)thevillage.com> wrote:
>> Daniel A. Jimenez wrote:
>>> ...
>>> Trace cache is another more-or-less recent microarchitectural innovation
>>> that allowed Pentium 4 to get away with decoding one x86 instruction
>>> per cycle and still have peak IPC greater than 1.
>> Actually trace cache goes back to the VAX HPS, circa 1985.
>> They called the decoded instruction cache a "node cache".
>> As far as I know, VAX HPS was never built though.
>
> Was it a trace cache, i.e., were decoded instructions stored in order of
> execution rather than the order of the program text?

They were discussing a number of design issues and
evaluating different possible ways of implementing it,
but yes I think it is the same idea.

They state that nodes are entered into the node cache in decode order
and must be merged with certain non-microcode values, such as immediate
literal instruction constants. It also has to handle 'vax-isms': the
procedure call instructions CALLG/CALLS, for example, pick up their
register save mask from the routine entry point, and that mask must be
saved into the cache as well.
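
For concreteness, the distinction being asked about is roughly the one
sketched below; the addresses, opcodes and line-building policy are
invented for illustration and are not a description of HPS or of the
P4 trace cache:

    # Static program: address -> (decoded op, fall-through address)
    program = {
        0x100: ("cmp",       0x104),
        0x104: ("jne 0x200", 0x108),   # suppose this branch is usually taken
        0x108: ("add",       0x10C),
        0x200: ("load",      0x204),
        0x204: ("mul",       0x208),
    }

    # Decoded-instruction ("node") cache: one entry per *static* address;
    # program-text order is implicit in the addresses.
    decoded_cache = {addr: op for addr, (op, _) in program.items()}

    # Trace cache: one line per *dynamic* path, keyed by the trace start
    # address plus the predicted outcomes of the branches inside it.
    def build_trace(start, taken, length=4):
        """Follow one dynamic path, storing decoded ops in execution order."""
        line, addr = [], start
        for _ in range(length):
            op, fall_through = program[addr]
            line.append(op)
            if op.startswith("jne"):
                addr = int(op.split()[1], 16) if taken else fall_through
            else:
                addr = fall_through
        return line

    trace_cache = {(0x100, "T"): build_trace(0x100, taken=True)}
    print(trace_cache[(0x100, "T")])   # ['cmp', 'jne 0x200', 'load', 'mul']
    # The taken branch's target sits contiguously after it in the line,
    # so fetch sees execution order rather than program-text order.
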

>>> Cracking instructions into micro-ops, scheduling the micro-ops, then fusing
>>> the micro-ops back together in a different way later in the pipeline allows
>>> an effectively larger instruction window and more efficient pipeline.
>>> That's a relatively recent innovation, too.
>> Except for the fused micro-ops, this was also VAX HPS.
>
> The fused micro-ops are the innovation. They allow an effectively larger
> instruction window, so more ILP, more performance. One can argue that
> micro-ops are equivalent to microcode, which predates minis.

But for the VAX, instruction decode was very much a bottleneck
because the encoding was originally designed for an LL, sequential parse.
The node cache bypasses that bottleneck and, in theory,
allows parallel scheduling of micro-ops to their
OoO function units.

Eric


From: "Andy "Krazy" Glew" on
Jean wrote:
> In the last couple of decades the exponential increase in computer
> performance came from advances in both computer architecture and
> fabrication technology.
> What will be the case in the future? Can I say that the next major
> leap in computer performance will come not from breakthroughs in
> computer architecture but rather from new underlying technology?

I am not so sure.

There are significant improvements to be had in single thread
performance by going to really large instruction windows: multilevel
instruction windows. The key is how to do this in a smart and power
efficient manner, such as with the inventions I have patents pending on,
made outside Intel. (I owe y'all an article on this.)

I doubt that this will deliver performance improvements linear in the
number of transistors. However, all of the evidence that I have seen
indicates that it will deliver performance proportional to the square
root of the number of transistors.

By the way, some people call this - performance proportional to square
root of the number of transistors - Pollack's Law. Fred Pollack, my old
boss, presented it at some big conferences. Myself, I told Fred about
this "law", which I first encountered in Tjaden and Flynn's paper that
said performance is proportional to the square root of the number of
branches looked past. After encountering such square root laws in
several places, I conjectured the generalization, which seems to be
confirmed by many metrics. To differentiate myself, let me conjecture
further that in a space of dimension d, performance is proportional to
the number of devices raised to the (d-1)/d power. E.g. in 3D, I
conjecture that performance is proportional to the number of devices
raised to the 2/3 power.
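
In other words, evaluating the conjectured scaling at a couple of
points (purely illustrative arithmetic, nothing more):

    def perf(devices, d):
        """Conjectured performance ~ devices ** ((d - 1) / d)."""
        return devices ** ((d - 1) / d)

    for d in (2, 3):
        ratio = perf(4.0, d) / perf(1.0, d)
        print(f"d={d}: 4x the devices -> {ratio:.2f}x the performance")
    # d=2: 4x the devices -> 2.00x the performance  (Pollack's Law)
    # d=3: 4x the devices -> 2.52x the performance  (the 2/3-power conjecture)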

I suspect that there are significant improvements in parallel processing
to be made, most likely in the "how to make it easier" vein. I'm in the
many, many processors camp.

I believe that there is significant potential to apply parallelism to
improve single thread performance: speculative multithreading, SpMT. I
wish Haitham Akkary luck as he carries the torch for this research with
DMT. I wish I could do the same.

Along the lines of technology, as indicated above I suspect that 3D
integration could bring many benefits. But heat dissipation is such a
big problem that I doubt that it is reasonable to hope for this in the
next 10 years. I.e. I doubt that we will have cubes of logic intermixed
with memory 1 cm on a side. However, incremental progress will be made:
2-4 layers of transistors within 10 years.

Although smaller, faster, more power efficient devices are always a
possibility, I think that the human brain points out the capabilities of
relatively slow computation, albeit with complex elements and high
connectivity. I suppose this counts as technology, although not
necessarily on the traditional axis of evolution.
From: "Andy "Krazy" Glew" on
EricP wrote:
> Daniel A. Jimenez wrote:
>> ...
>> Trace cache is another more-or-less recent microarchitectural innovation
>> that allowed Pentium 4 to get away with decoding one x86 instruction
>> per cycle and still have peak IPC greater than 1.
>
> Actually trace cache goes back to the VAX HPS, circa 1985.
> They called the decoded instruction cache a "node cache".
> As far as I know, VAX HPS was never built though.

Sorry, no.

The HPS (and HPSm) node cache was not a trace cache. It did not have a
single entry point for a trace of instructions.

I invented the trace cache, or at least the term "trace cache", while
taking the first class Wen-Mei Hwu (the H in HPS) taught after receiving
his Ph.D. and coming to UIUC in 1986 or 1987. I invented it to solve the
problems that a decoded instruction cache had with variable length
instructions (and also to support forms of guarded execution, what would
now be called control independence or hammocks or hardware
if-conversion). Wen-Mei was my MS advisor. I am sure that he would have
informed me if the trace cache was just the node cache rehashed (and
given me a bash, and thrown it in the trash, and not given me any cash).

Alex Peleg and Uri Weiser may have preceded me in inventing the trace
cache, and certainly patented it first. (I never patented anything at
UIUC, or prior to Intel.) But so far as I know, I invented the term
"trace cache", and popularized it at Intel in 1991, before Peleg and Weiser.

From: nmm1 on
In article <hb86g3$fo6$1(a)apu.cs.utexas.edu>,
Daniel A. Jimenez <djimenez(a)cs.utexas.edu> wrote:
>
>Sorry, can't let that one go. There have been tremendous improvements in
>branch prediction accuracy from the late eighties to today. Without
>highly accurate branch prediction, the pipeline is filled with too many
>wrong path instructions so it's not worth going to deeper pipelines.
>Without deeper pipelines we don't get higher clock rates. So without
>highly accurate branch predictors, clock rates and performance would be
>much worse than they are today. If we hadn't hit the power wall in the
>early 2000s we would still be improving performance through better branch
>prediction and deeper pipelines.
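
To put rough numbers on the claim quoted above (the accuracies, branch
frequency and penalties below are made up for illustration):

    # Toy model: mispredict penalty ~ pipeline depth, one branch per
    # ~5 instructions, base IPC of 2 when nothing mispredicts.
    def ipc(accuracy, pipe_depth, base_ipc=2.0, branch_every=5):
        mispredicts_per_instr = (1.0 - accuracy) / branch_every
        cycles_per_instr = 1.0 / base_ipc + mispredicts_per_instr * pipe_depth
        return 1.0 / cycles_per_instr

    for acc in (0.90, 0.95, 0.99):
        print(f"accuracy {acc:.0%}: "
              f"10-stage {ipc(acc, 10):.2f} IPC, "
              f"30-stage {ipc(acc, 30):.2f} IPC")
    # accuracy 90%: 10-stage 1.43 IPC, 30-stage 0.91 IPC
    # accuracy 95%: 10-stage 1.67 IPC, 30-stage 1.25 IPC
    # accuracy 99%: 10-stage 1.92 IPC, 30-stage 1.79 IPC
    # The deeper pipeline presumably buys a higher clock rate; that only
    # pays off if prediction accuracy keeps the IPC loss small.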

Oh, really? I don't see it. The big difference is that the modern
processes remove the size limits that made earlier branch predictors
relatively ineffective. And that's not architecture.

>History-based memory schedulers are another recent innovation that
>promises to improve performance significantly.

Don't bet on it. A hell of a lot of the papers on the underlying
requirements are based on a gross abuse of statistics. They make
the cardinal (and capital) error of confusing decisions based on
perfect knowledge (i.e. foresight) with admissible decision rules
(which can use only history).

The point here is that, as with branch prediction, a hell of a lot
of the important problem codes are precisely those that are least
amenable to such optimisations. The ONLY solution is to get them
rewritten in a more civilised paradigm.


Regards,
Nick Maclaren.