From: nmm1 on
In article <4AD8030B.40707(a)patten-glew.net>,
Andy \"Krazy\" Glew <ag-news(a)patten-glew.net> wrote:
>
>There are significant improvements to be had in single thread
>performance by going to really large instruction windows. Multilevel
>instruction windows. The key is how to do this in a smart and power
>efficient manner. Such as I have patent pending, on inventions I made
>outside Intel. (I owe y'all an article on this.)

Isn't that all y'all?

>I doubt that this will deliver performance improvements linear in the
>number of transistors. However, all of the evidence that I have seen
>indicates that it will deliver performance proportional to the square
>root of the number of transistors.

Not what I have seen. The point here is that what you say IS true
for the better HPC and many benchmarketing codes, but isn't for
many of the problem codes. That's an old 1970s problem, when the
usual distinction was between user and supervisor mode code - the
former gained a lot of benefit from cache, but the latter didn't.

Again, that brings back my well-worn hobby-horse: NO such advance
can be of near-universal benefit unless we change to using coding
paradigms and programming languages that are better suited to such
optimisations.


Regards,
Nick Maclaren.
From: Piotr Wyderski on
ChrisQ wrote:

> In this respect, there's a basic conflict between the aims
> of science / pursuit of excellence and the aims of business.

Once upon a time there was an attempt to satisfy "the aims of science /
pursuit of excellence".
It was called Itanium. Turned out to be a multi-billion dollar flop. No more
projects like that please!

Best regards
Piotr Wyderski

From: jacko on
On 15 Oct, 19:02, Mayan Moudgill <ma...(a)bestweb.net> wrote:
> ChrisQ wrote:
> > Yes, but ultimately boring and really just rearranging the deck chairs.
> > Compared to the 70's and 80's the pace of development is essentially
> > static.
>
> Name one feature introduced in the 80s which you consider not
> "rearranging the deck chairs". For extra credit identify the
> architecture in the 50s and 60s which first used this feature (or a
> variant).

All of computing has been deck-chair Olympics since the Babbage age,
RISC being smaller chairs for easier decoding. I think the multi-core
approach is the way forward. The almost linear improvements which may
become possible are the ideal. The idea of passing register sets
between cores and splitting the cache and memory into sections related
to each core does free up some transistors, but it does mean a core
sometimes cannot execute anything ... but the cores can be small...

The basic flexibility of sequencing instructions in software to reduce
the size of the hardware is what micros were about. Speed innovations
are needed for some tasks, but these are often better done by custom
functional units. For your average DivX video user, 800MHz is OK.

I suggest DISCO-FETs for efficiency of switching, not necessarily for
extra speed. With simple instruction sets, execution in cache-line
units can be done by soft-configuring a bank of 32 ALUs and the like,
so the whole cache line can be executed in one go. The pipeline
concept inhibits some of the speed in this design, because the
execution steps take differing times: the clock has to fit the slowest
step, so Time = steps x max step instead of Time = sum of steps.
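
To put a number on that last point, here is a toy C sketch with
made-up step latencies, comparing a clocked pass through the steps
(every step charged at the slowest step's time) against letting each
step take only its own time:

#include <stdio.h>

int main(void)
{
    /* Made-up latencies (ns) for four execution steps. */
    double step[] = { 1.0, 3.0, 1.5, 2.0 };
    int n = sizeof step / sizeof step[0];

    double max = 0.0, sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (step[i] > max) max = step[i];
        sum += step[i];
    }

    /* Clocked pipeline: the cycle must fit the slowest step,
       so one pass through all the steps costs n * max. */
    printf("clocked: %.1f ns\n", n * max);

    /* Self-timed (async): each step takes only its own time,
       so one pass costs the plain sum of the steps. */
    printf("async:   %.1f ns\n", sum);
    return 0;
}

With these numbers the self-timed pass comes out noticeably faster,
which is the async argument in miniature.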

cheers jacko

p.s. I think this was called async design, but pipelines were easier
to organize.
From: Daniel A. Jimenez on
In article <hb98qs$h11$1(a)smaug.linux.pwf.cam.ac.uk>, <nmm1(a)cam.ac.uk> wrote:
>In article <hb86g3$fo6$1(a)apu.cs.utexas.edu>,
>Daniel A. Jimenez <djimenez(a)cs.utexas.edu> wrote:
>>
>>Sorry, can't let that one go. There have been tremendous improvements in
>>branch prediction accuracy from the late eighties to today. Without
>>highly accurate branch prediction, the pipeline is filled with too many
>>wrong path instructions so it's not worth going to deeper pipelines.
>>Without deeper pipelines we don't get higher clock rates. So without
>>highly accurate branch predictors, clock rates and performance would be
>>much worse than they are today. If we hadn't hit the power wall in the
>>early 2000s we would still be improving performance through better branch
>>prediction and deeper pipelines.
>
>Oh, really? I don't see it. The big difference is that the modern
>processes remove the size limits that made earlier branch predictors
>relatively ineffective. And that's not architecture.

Yes, really. Recent branch predictors have been made far more accurate by
techniques that are orthogonal to the size of the predictor. Two-level
branch prediction was introduced in 1991 and was the first big improvement
over a PC-indexed table of counters. However, even with a ridiculously huge
two-level predictor you can only go so far in terms of accuracy, and you lose
with latency. EV6 featured a tournament-style hybrid predictor choosing
between local and global history predictors; this is more accurate than a
larger two-level predictor. Intel's recent offerings have featured a loop
predictor capable of perfectly predicting loops with trip counts up to 64
(IIRC); that's impossible with two-level predictors. Intel has also recently
started putting indirect branch predictors in their products, giving an
additional performance boost for e.g. dense switch statements and virtual
dispatch; until indirect branch predictors came along, the best you could
do was hope the target is cached in the BTB, which it often isn't for
complex control flow.
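
For anyone who hasn't looked inside one of these: a minimal C sketch of
a global-history predictor in the gshare flavor (one common variant of
the two-level idea; the table size and hash here are arbitrary). A
global history register is XORed with the branch PC to index a table of
2-bit saturating counters:

#include <stdint.h>

#define HIST_BITS  12
#define TABLE_SIZE (1u << HIST_BITS)

static uint8_t  counters[TABLE_SIZE];  /* 2-bit saturating counters, 0..3 */
static uint32_t ghist;                 /* global taken/not-taken history  */

/* Predict: taken if the selected counter is weakly or strongly taken. */
int predict(uint32_t pc)
{
    uint32_t idx = (pc ^ ghist) & (TABLE_SIZE - 1);
    return counters[idx] >= 2;
}

/* Update with the actual outcome after the branch resolves. */
void update(uint32_t pc, int taken)
{
    uint32_t idx = (pc ^ ghist) & (TABLE_SIZE - 1);
    if (taken  && counters[idx] < 3) counters[idx]++;
    if (!taken && counters[idx] > 0) counters[idx]--;
    ghist = ((ghist << 1) | (taken ? 1 : 0)) & (TABLE_SIZE - 1);
}

The per-branch pattern tables of the original two-level schemes and the
local/global tournament in EV6 are elaborations on the same
counter-table idea.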

Exploiting larger tables enabled by improved process technology is not
trivial, either. The latency of a larger table means you have to play some
tricks to get a prediction in a single cycle, e.g. ahead-pipelining (not
pipelining, *ahead*-pipelining). Being able to do that without losing
accuracy is another recent innovation (i.e. how do you predict a branch
if you don't know which branch you're predicting? It's not easy.)
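
Very roughly, the flavor of the trick: start the slow table read a
cycle early using information that is already known (the previous
fetch address and the older history), read out a small row of
candidate counters, and when the actual branch address shows up, pick
among them with a couple of its low-order bits. A hedged C sketch; the
indexing details are invented for illustration and are not the
published scheme:

#include <stdint.h>

#define ROWS       4096
#define CANDIDATES 4                 /* candidates read out per row       */

static uint8_t table[ROWS][CANDIDATES];  /* 2-bit counters                */
static uint8_t row[CANDIDATES];          /* row latched a cycle early     */

/* Cycle t-1: start the (slow) table read using information that is
   already available: the previous fetch address and older history. */
void start_lookup(uint32_t prev_pc, uint32_t ghist)
{
    uint32_t idx = (prev_pc ^ ghist) % ROWS;
    for (int i = 0; i < CANDIDATES; i++)
        row[i] = table[idx][i];
}

/* Cycle t: the actual branch PC is now known; use a couple of its
   low-order bits to select among the prefetched candidates. */
int predict(uint32_t pc)
{
    return row[pc & (CANDIDATES - 1)] >= 2;
}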

In terms of research, branch prediction stagnated in the late 1990s but
in the early to mid 2000s there was a resurgence of activity exploring
all kinds of weird ideas for improving accuracy, including some very
practical ones. Seznec's skewed predictor would have been in EV8 had it
not been cancelled; the clever organization and update policy of this
predictor make it more accurate than a two-level predictor with more than
twice the area.
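
In rough outline, the skewed organization looks like this: several
counter banks, each indexed by a different hash (skew) of the branch
address and global history, combined by majority vote. The C sketch
below uses placeholder hash functions and three plain banks; Seznec's
actual skewing functions and the partial-update policy are more
involved:

#include <stdint.h>

#define BANK_SIZE 4096

static uint8_t bank[3][BANK_SIZE];   /* 2-bit counters in each bank */

/* Placeholder skewing hashes -- the real gskew functions differ. */
static uint32_t hash0(uint32_t pc, uint32_t h) { return (pc ^ h) % BANK_SIZE; }
static uint32_t hash1(uint32_t pc, uint32_t h) { return (pc ^ (h << 3) ^ (pc >> 5)) % BANK_SIZE; }
static uint32_t hash2(uint32_t pc, uint32_t h) { return ((pc >> 2) ^ (h * 2654435761u)) % BANK_SIZE; }

/* Predict by majority vote of the three banks. */
int predict(uint32_t pc, uint32_t ghist)
{
    int votes = (bank[0][hash0(pc, ghist)] >= 2)
              + (bank[1][hash1(pc, ghist)] >= 2)
              + (bank[2][hash2(pc, ghist)] >= 2);
    return votes >= 2;
}

The point of skewing the indices is that two branches which collide in
one bank usually do not collide in the others, so destructive aliasing
gets outvoted.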
--
Daniel Jimenez djimenez(a)cs.utexas.edu
"I've so much music in my head" -- Maurice Ravel, shortly before his death.
" " -- John Cage
From: EricP on
Andy "Krazy" Glew wrote:
> EricP wrote:
>> Daniel A. Jimenez wrote:
>>> ...
>>> Trace cache is another more-or-less recent microarchitectural innovation
>>> that allowed Pentium 4 to get away with decoding one x86 instruction
>>> per cycle and still have peak IPC greater than 1.
>>
>> Actually trace cache goes back to the VAX HPS, circa 1985.
>> They called the decoded instruction cache a "node cache".
>> As far as I know, VAX HPS was never built though.
>
> Sorry, no.
>
> The HPS (and HPSm) node cache was not a trace cache. It did not have a
> single entry point for a trace of instructions.
>
> I invented the trace cache, or at least the term "trace cache", while
> taking the first Wen-Mei Hwu (the H in HPS) taught after receiving his
> Ph.D. and coming to UIUC in 1986 or 1987. I invented it to solve the
> problems that a decoded instruction cache had with variable length
> instructions (and also to support forms of guarded execution, what would
> now be called control independence or hammocks or hardware if
> conversion). Wen-mei was my MS advisor. I am sure that he would have
> informed me if the trace cache was just the node cache rehashed (and
> given me a bash, and thrown it in the trash, and not given me any cash).
>
> Alex Peleg and Uri Weiser may have preceded me in inventing the trace
> cache, and certainly patented it first. (I never patented anything at
> UIUC, or prior to Intel.) But so far as I know, I invented the term
> "trace cache", and popularized it at Intel in 1991, before Peleg and
> Weiser.
>

My apologies - I did not mean to imply these were equivalent,
and the paper doesn't use the term 'trace cache', as I noted.
HPS wasn't a trace cache machine, but it is so similar that it
probably planted the seeds that became trace caching.

The functional descriptions in the paper I cited and others
in the HPS series have many similarities to the trace cache idea:
caching the risc-y micro-ops of decoded instructions,
merging of instruction literal values into the cache,
repairs (aka replay traps), register renaming,
Tomasulo OoO execution, etc.

The cache structure itself was a question because of the
one-to-many mapping of instructions to nodes (micro-ops).
They discuss the cache space allocation and fragmentation
issues but do not propose a specific solution
(two level index? LRU? garbage collect?).

The micro-ops are not physically stored together as a
basic block, so that leap has not been made yet.
However, all the pieces are in play.
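
To make the missing leap concrete: in a trace cache, a line is keyed
by the address of the first instruction of the trace and physically
holds the decoded micro-ops of several basic blocks in predicted
execution order, possibly across taken branches. A hedged C sketch of
such a line (field names and sizes are made up for illustration):

#include <stdint.h>

#define TRACE_LEN 6      /* micro-ops per trace line (made up)           */

struct uop {             /* a decoded, RISC-like micro-operation         */
    uint8_t  opcode;
    uint8_t  dst, src1, src2;
    uint32_t immediate;  /* literal merged in at decode time             */
};

/* A trace-cache line: single entry point (start_pc), micro-ops stored
   contiguously in predicted program order, spanning taken branches.
   A node / decoded-uop cache would instead map each instruction's own
   address to its micro-ops, with no notion of a trace. */
struct trace_line {
    uint32_t   start_pc;       /* the single entry point                 */
    uint8_t    valid;
    uint8_t    n_uops;
    uint32_t   branch_dirs;    /* predicted taken/not-taken path         */
    struct uop uops[TRACE_LEN];
};

/* Hit only if the fetch address matches the trace's entry point. */
int trace_hit(const struct trace_line *t, uint32_t fetch_pc)
{
    return t->valid && t->start_pc == fetch_pc;
}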

This all left me with the impression that HPS was an
early step along the design path that led to the Pentium 4.

Some of these HPS papers PDFs are available online at
http://www.zytek.com/~melvin/pubs.html
An easy read and of historical interest is:
"An HPS Implementation of VAX: Initial Design and Analysis", 1986

Eric