Big OOO, SpMT, and possible designs (Was Re: Free/Open x86 Sim) [Computer Architecture]

Prev: A post to comp.risks that everyone on comp.arch should read
Next: Call for papers : HPCS-10, USA, July 2010

From: Andy 'Krazy' Glew on 15 May 2010 13:10

On 5/15/2010 7:14 AM, Bernd Paysan wrote:
> Andy 'Krazy' Glew wrote:
>> 7) Cost of fabs. High cost => risk aversion.
>
> Moore's Law on fab costs (going up exponentially) leads to a situation
> where design houses and fabs separate. Intel is about the only company
> left where design and fab are closely tied together - the other
> companies with both design and fab capability at least offer fab service
> to third parties.
>
> So where's the risk? When you are a fabless design house, a state-of-
> the-art process project costs you a few millions external costs plus
> internal R&D. This is pretty affordable. If you are not fabless, the
> same project costs you a few billions fab costs, plus internal R&D.
> This is basically death for anyone with less market power than
> Intel, and therefore none of them do it anymore.
>
> For this situation, ARM has not only the right product (targeted at low
> power), it also has the right way to distribute that product (as IP,
> including Verilog sources for those who want to tinker with it).
>

One way to reduce risk for a new fab line is to use it for multiple different product designs: TSMC does this in spades.
Even Intel is doing this more and more.

However, there is a risk here in that TOO MANY different designs may succeed.

It is my understanding, which may be wrong, that the fab runs better with as few designs as possible: there is lossage when
switching, plus investment in debugging and yield analysis for each design. Multicore chips with several different variants
may help, since some of the learning for a quad-core may also apply to a 12-core built from the same processor core.

I wonder if the increasingly short lifetime of masks "helps" in this regard: switching designs when a mask needs to be
replaced anyway is less onerous than switching while the masks are still good. But masks do not all get replaced at the same time, do they?

From: MitchAlsup on 15 May 2010 14:11

On May 15, 11:51 am, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net>
wrote:
> On 5/15/2010 4:44 AM, nedbrek wrote:
> > Rotenberg gets credit for trace cache. I don't think we can blame the
> > failure of the P4 trace cache on him... It seemed like pretty popular stuff
> > in most circles.
>
> Uri Weiser and Alex Peleg's patent is the first publication (if you consider a patent a publication; the law does) on
> trace cache. From wikipedia:
>
> The earliest widely acknowledged academic publication of trace cache
> was by Eric Rotenberg, Steve Bennett, and Jim Smith in their 1996 paper
> "Trace Cache: a Low Latency Approach to High Bandwidth Instruction Fetching."
>
> An earlier publication is US Patent 5,381,533,
> "Dynamic flow instruction cache memory organized around trace segments independent of virtual address line",
> by Alex Peleg and Uri Weiser of Intel Corp., patent filed March 30, 1994,
> a continuation of an application filed in 1992, later abandoned.
>
> My own work in trace caches was done 1987-1990, and I brought it to Intel, where it was evaluated for the P6.
> Admittedly, I was a part-time grad student under Hwu at the University of Illinois at the time, working for some of that
> time at Gould and Motorola. Peleg and Weiser may have gotten the patent, but I am reasonably sure that I originated the
> term "trace cache".

Shebanow and Alsup used what we called a Packet cache in 1989-90. Each
packet contained 6 instructions and 2 successive target address
fragments. The instructions in each packet were not necessarily from
the same basic block and were built in observed (first time) order.
Thus, we could "take" a branch every single cycle. A full
architectural model of this machine could process GCC from SPEC 89 at
2.1-2.2 IPC.
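The packet organization described above can be sketched as a small data structure. This is a hypothetical Python model, not the Shebanow/Alsup design: the names `Packet`, `PacketCache`, `fill` and `fetch`, the fixed 4-byte instruction size, the use of full addresses instead of address fragments, and the rule that a packet closes after six instructions or a second taken branch are all assumptions. The post itself only says that each packet held 6 instructions and 2 successive target-address fragments, possibly from different basic blocks, built in first-observed order.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

PACKET_SIZE = 6   # instructions per packet (from the post)
MAX_TARGETS = 2   # target-address fields per packet (from the post)
INSN_BYTES = 4    # assumption: fixed-length 4-byte instructions


@dataclass
class Packet:
    start_pc: int
    instrs: List[Tuple[int, str]] = field(default_factory=list)  # (pc, mnemonic) in observed order
    targets: List[int] = field(default_factory=list)             # assumption: taken-branch targets inside the packet
    next_fetch: int = 0                                          # where fetch continues after this packet


class PacketCache:
    """Hypothetical packet cache keyed by the address of a packet's first instruction."""

    def __init__(self) -> None:
        self.packets: Dict[int, Packet] = {}

    def _close(self, packet: Packet, next_pc: int) -> None:
        packet.next_fetch = next_pc
        # Keep the first packet observed for a start address ("first time" order).
        self.packets.setdefault(packet.start_pc, packet)

    def fill(self, trace: List[Tuple[int, str, int]]) -> None:
        """Build packets from an executed trace of (pc, mnemonic, next_pc) tuples."""
        current: Optional[Packet] = None
        for pc, mnemonic, next_pc in trace:
            if current is None:
                current = Packet(start_pc=pc)
            current.instrs.append((pc, mnemonic))
            if next_pc != pc + INSN_BYTES:   # control left the sequential path: a taken branch
                current.targets.append(next_pc)
            if len(current.instrs) == PACKET_SIZE or len(current.targets) == MAX_TARGETS:
                self._close(current, next_pc)
                current = None
        if current is not None:              # flush a partial packet at the end of the trace
            self._close(current, trace[-1][2])

    def fetch(self, pc: int) -> Optional[Packet]:
        """One lookup delivers a whole packet, possibly spanning taken branches."""
        return self.packets.get(pc)


# Usage with a made-up trace: two taken branches, then straight-line code.
trace = [
    (0x100, "add", 0x104),
    (0x104, "beq", 0x200),   # taken
    (0x200, "sub", 0x204),
    (0x204, "jmp", 0x300),   # taken: second target closes the packet
] + [(0x300 + 4 * i, "ld", 0x304 + 4 * i) for i in range(6)]

cache = PacketCache()
cache.fill(trace)
```

In this model a single lookup at 0x100 returns four instructions from three basic blocks plus both taken-branch targets, so fetch crosses two taken branches in one access; that is the property that lets such a machine "take" a branch every cycle.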

{Aside: in this machine we could also recover from a branch
misprediction every single cycle, issuing instructions from the non-taken
path IN the recovery cycle. More performance showed up from this
part of the mechanism than from the "take" part of the mechanism
for GCC.}

Mitch

From: MitchAlsup on 15 May 2010 14:20

On May 15, 12:10 pm, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net>
wrote:
> It is my understanding, which may be wrong, that the fab runs better with as few designs as possible.

It is my understanding that the FAB runs best when the masks remain in
the steppers, and wafers are simply run down the multi-week pipeline
at an appropriate pace to keep all of the machines and processes
moving at near optimal rates. Witness DRAMs.

When a batch of wafers is stopped, an extra stage is added to the
wafers (more oxide is placed over what has been completed) so that
the wafers do not decay while sitting idle in the vault. This oxide is
stripped off (and sometimes the previous layer) before the wafers
begin processing anew. Yield is affected in a non-positive manner.

Mitch

From: Bernd Paysan on 15 May 2010 15:07

MitchAlsup wrote:

> On May 15, 12:10 pm, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net>
> wrote:
>> It is my understanding, which may be wrong, that the fab runs better
>> with as few designs as possible.
>
> It is my understanding that the FAB runs best when the masks remain in
> the steppers, and wafers are simply run down the multi-week pipeline
> at an appropriate pace to keep all of the machines and processes
> moving at near optimal rates. Witness DRAMs.

The pace is optimal when the wafers wait for the machine, not the other
way round. And apparently, the masks are one of the least important things
to worry about, at least in a fab that is optimized for having many
different mask sets, like TSMC. I have a mask from fab 2B sitting next to
my desk, in a simple plastic case with barcode and printed text on it
(a souvenir from one of my projects; it's one of those masks that were
obsoleted by metal fixes).

> When a batch of wafers is stopped, an extra stage is added to the
> wafers (more oxide is placed over what has been completed) so that
> the wafers do not decay while sitting idle in the vault. This oxide is
> stripped off (and sometimes the previous layer) before the wafers
> begin processing anew. Yield is affected in a non-positive manner.

Usually, this happens only when you hold back some wafers for metal
fixes during the prototype stage. And yes, yield is affected; this is
always difficult to explain to customers when a metal fix results in a
decreased yield, since the fab usually denies these effects.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

From: Kai Harrekilde-Petersen on 15 May 2010 16:05

"nedbrek" <nedbrek(a)yahoo.com> writes:

> Hello all,
>
> "MitchAlsup" <MitchAlsup(a)aol.com> wrote in message
> news:26c1c35a-d687-4bc7-82fd-0eef2df0f714(a)c7g2000vbc.googlegroups.com...
>> There is risk in taking a few steps backwards, but there is reward in
>> repositioning a microarchitecture so that it is tuned at the
>> performance per power end of the spectrum. And I think it is unlikely
>> that we can evolve from the Great Big OoO machines to the medium OoO
>> (or partially ordered) machines the battery powered world wants/needs.
>> If the x86 people are not going to figure out how to get into the 10mW-
>> to-50mW power envelope with decent performance, then they are leaving
>> a very big vacuum in which new architectures and microarchitectures
>> will find fruitful hunting grounds. I sure would like to be on a team
>> that did put x86s into this kind of power spectrum, and I think one
>> could get a good deal of the performance of a modern low power (5W)
>> laptop into that power range, given that as the overriding goal of the
>> project.
>
> According to
> http://arstechnica.com/gadgets/news/2010/05/intel-fires-opening-salvo-in-x86-vs-arm-smartphone-wars.ars
> Atom is now at idle 21-23 mW. Assuming idle is half of normal (total power
> is half leakage, half dynamic), that gives ~50 mW. Of course, this is
> probably some low voltage, power gated state...
>
> Power envelope can be a moving target. Batteries get better, process
> (sometimes) gets better. Give things a while, and what used to not fit,
> suddenly does.
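The figures in this exchange can be checked with simple arithmetic, using only numbers quoted above. Moving from a 5 W laptop to the 10-50 mW envelope is a 100x to 500x power reduction, and doubling the 21-23 mW Atom idle figure, under the stated assumption that idle is half of normal, gives a little under the rounded ~50 mW:

```latex
\[
\frac{5\,\mathrm{W}}{50\,\mathrm{mW}} = 100, \qquad \frac{5\,\mathrm{W}}{10\,\mathrm{mW}} = 500
\]
\[
P_{\text{normal}} \approx 2\,P_{\text{idle}} = 2 \times (21\ \text{to}\ 23)\,\mathrm{mW} = 42\ \text{to}\ 46\,\mathrm{mW} \approx 50\,\mathrm{mW}
\]
```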

Try the <1mW range for a hearing aid system, including A/D, microphones
and all. They run on ZnAir batteries, since they give the highest
energy density, but they output only 1.1-1.3V and ~1mA sustained.

Oh, and you have a limited volume available (mm^3) since you need to
be able to place it inside the ear canal, together with a battery,
microphone, and a speaker.
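As a rough check, assuming a cell voltage of about 1.1 to 1.3 V at the ~1 mA sustained current, a single cell can deliver only about a milliwatt on a continuous basis, the same order as the <1 mW system budget:

```latex
\[
P = V I \approx (1.1\ \text{to}\ 1.3)\,\mathrm{V} \times 1\,\mathrm{mA} = 1.1\ \text{to}\ 1.3\,\mathrm{mW}
\]
```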

Kai
--
Kai Harrekilde-Petersen <khp(at)harrekilde(dot)dk>
