From: Jeff Fox on
On Jan 1, 11:48 am, Bernd Paysan <bernd.pay...(a)gmx.de> wrote:
> >>>So: summarizing - I still don't think active messages is the right
> >>>name. I haven't encountered any real-life instances where people
> >>>actually send code to be executed (or even interpreted) at a
> >>>low-level inside the device driver.
>
> >> I have.  Several different ones, actually.  One guy (Heinz Schnitter)
> >> sends source code around - this is a distributed programming system.
> >> Another one (Chuck Moore) sends instructions around - this is a small
> >> chip full of tiny CPUs.  They all did not really generalize, though I
> >> know that Chuck Moore knows what Heinz did.
>
> > Got any references that are publicly available?
>
> Chuck's SeaForth:
>
> http://www.intellasys.net/index.php?option=com_content&task=view&id=3...
>
> This is a commercial product, but Intellasys has basically folded,
> since patent trolls and engineers can't work together in the long run.

Green Array Chips http://www.greenarraychips.com has continued
the development and in 2009 produced working chips in a couple of
geometries and several configurations. Designs using some number of
the 20k-transistor, $.01-manufacture-cost, 700-Forth-MIPS, 5 mW
core (in .18u) include the GA4, GA32, GA40, and GA144.

Arbitration of the contract between Chuck Moore and TPL will take
place soon. IntellaSys, a TPL Group company, mostly shut down in
January of 2009, although it did continue with the hearing enhancement
project, as reported at the Silicon Valley Forth Interest Group Forth
Day meeting in November of 2009. You can read Chuck's opinions
about the legal case at his website http://www.colorforth.com

The previous generation of full custom VLSI Forth chips included
a network router coprocessor integrated into the design for active
messaging. It routed messages, did DMA if the individual or group
address bits in a message matched the node routing the message,
and could interrupt the CPU to execute messages after they
were in RAM. The active message processor used about 300
transistors and used two $.01 pins. It ran autonomously at up to
several hundred Mbps, but sharing memory with everything else
limited usable bandwidth to 40 Mbps. Maximum CPU throughput was
220 Forth MIPS at 50 mW. The chip also had a 40 MSPS analog
coprocessor, a video I/O coprocessor/accelerator, and a manufacture
cost of about $.85 in quantity due to the size of the die and the
use of 68 pins. This was back in the early nineties.
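To make the routing rule concrete, here is a rough behavioral sketch
in C. It is purely illustrative: the message format, field widths, and
names below are invented for this sketch, not the actual F21 design.

#include <stdint.h>

/* Invented message format, for illustration only. */
typedef struct {
    uint32_t addr_bits;  /* individual or group address bits */
    uint32_t length;     /* payload words to DMA into RAM */
    uint32_t payload[];  /* code and/or data for the target node */
} msg_t;

#define NODE_BIT   0x0001u  /* this node's individual address bit */
#define GROUP_MASK 0xff00u  /* group address bits this node accepts */

/* Model of the router: DMA a matching message into RAM, then
   interrupt the CPU so it can execute the message from RAM. */
void route(const msg_t *m, uint32_t *ram, void (*interrupt)(uint32_t *))
{
    if (m->addr_bits & (NODE_BIT | GROUP_MASK)) {
        for (uint32_t i = 0; i < m->length; i++)
            ram[i] = m->payload[i];      /* the DMA step */
        interrupt(ram);                  /* execute from RAM */
    }
    /* otherwise the router simply forwards the message onward */
}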

More information about the old F21 and the history of these chips
is at my website http://www.ultratechnology.com

The lack of on-chip memory in those designs meant that each node
required some external memory and was networked with several chips
at each node. Adding internal RAM and ROM to the core made it
reasonable to put multiple cores per chip package, which led to the
SEAforth (Scalable Embedded Arrays) designs. In these designs some
cores have pins, and some are just connected to other cores. Some
packages have enough pins for flash, some for external RAM, etc.
Some cores have A/D and D/A, and some have serdes.

We took the Occam-style communication channels, implemented them
as shared registers, and added the ability to address up to four of
these ports at once. These ports require only a few transistors and
block a node until a neighbor reads or writes. A port can be read
with a pointer as data or by the program counter as instructions.
Routing is done by packets that execute a few instructions on a
port and read and write data to/from ports or memory. This allows
one one-cent processor to send a program to another one-cent
processor in a few nanoseconds and have it wake up and execute
it within a couple of hundred picoseconds.
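As a software analogy (the hardware is a shared register and a few
transistors, with no locks), the blocking behavior of a port looks
roughly like this C sketch using pthreads; the names and details are
mine, not the chip's:

#include <pthread.h>
#include <stdint.h>

/* Behavioral model of a blocking port: writer and reader each block
   until the neighbor arrives (an Occam-style rendezvous). Initialize
   with PTHREAD_MUTEX_INITIALIZER / PTHREAD_COND_INITIALIZER. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cv;
    uint32_t        word;
    int             full;  /* 1 while a word waits to be read */
} port_t;

void port_write(port_t *p, uint32_t w)
{
    pthread_mutex_lock(&p->lock);
    while (p->full) pthread_cond_wait(&p->cv, &p->lock);
    p->word = w;
    p->full = 1;
    pthread_cond_broadcast(&p->cv);
    while (p->full) pthread_cond_wait(&p->cv, &p->lock); /* rendezvous */
    pthread_mutex_unlock(&p->lock);
}

uint32_t port_read(port_t *p)
{
    pthread_mutex_lock(&p->lock);
    while (!p->full) pthread_cond_wait(&p->cv, &p->lock);
    uint32_t w = p->word;
    p->full = 0;                      /* release the blocked writer */
    pthread_cond_broadcast(&p->cv);
    pthread_mutex_unlock(&p->lock);
    return w;
}

In the real chips a port read can feed the program counter directly,
which is what lets a packet carry instructions rather than data.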

We have a similar mechanism for waking up on pin changes in a
couple of hundred picoseconds to process real-time events. There
are some ports with serializer/deserializer hardware so that
messages can go from chip to chip in the same way that they
move between cores on the same chip, except slower. The design
is Forth CSP with multiport addressing capability, which makes for
very small programs.

The design has some things in common, from a software standpoint,
with parallel designs like the CELL. The big differences are the
CELL's large RAM spaces and floating-point hardware. The result is
a 20,000-to-1 ratio between core sizes, so they are very different
in most ways: each CELL core equates to about 20,000 of these
700-integer-MIPS Forth cores. This is also why these tiny cores are
less likely to have fatal flaws and why yield has been very close
to 100%. These Forth cores have to have dense code; they only have
64 words of internal RAM and 64 words of internal ROM each.

Most Forth words, by frequency of execution, are five-bit native
opcodes that pack together to form very dense target code. The
target code, the development code, even the development tools are
remarkably small and fast. When we tell people that the boot code,
the OS, the editor, the compiler, the full custom VLSI CAD suite
with a dozen programs, target compilers, hardware and software
simulators, design rule check and GDS extract utilities, and the
source code to several chips all fit on a fraction of a floppy
disk, they are incredulous. When they see us do in a few seconds
things that take other people all day with their tools, they are
often very surprised by how our tools operate. It is also
interesting to me that SPICE-based tools claim these designs are
impossible and won't run at all.
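To illustrate the opcode packing mentioned above: assuming an 18-bit
instruction word with three full 5-bit slots plus a final 3-bit slot
(the F18-style layout; treat the details here as a sketch, not a
datasheet), packing four operations into one word looks like:

#include <stdint.h>

/* Pack four 5-bit opcodes into one 18-bit word. The last slot has
   only 3 bits, so it can hold only opcodes whose low bits are 0. */
uint32_t pack_word(uint8_t op0, uint8_t op1, uint8_t op2, uint8_t op3)
{
    return ((uint32_t)(op0 & 0x1f) << 13)
         | ((uint32_t)(op1 & 0x1f) <<  8)
         | ((uint32_t)(op2 & 0x1f) <<  3)
         |  (uint32_t)((op3 >> 2) & 0x07);  /* top 3 bits only */
}

Several operations per memory word is a large part of why 64 words
of RAM go as far as they do.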

These things are unusual and not what people are used to. I have
not worked on a target chip for which there was a C compiler in
about twenty-five years. I have seen threads about whether C is
close to the machine, but I never see people ask whether the
machine is close to C. These machines aren't, but they have so much
of Forth in hardware that much of traditional Forth isn't needed.

The tiny multicore chips are different. I noticed at one trade show
that we had the only multicore chips that didn't need fans. As they
are not supported by mainstream tools, they will most likely remain
a niche product. I don't know if 'active message' should apply to
the SEAforth design or not. I think it did apply to the F21 we did
long ago.

Best Wishes
From: Bernd Paysan on
nmm1(a)cam.ac.uk wrote:
>>Hm, the most essential usage of address-of (unary prefix &) in C is
>>to return multiple parameters. ...
>
> The mind boggles. That is so far down the list of its uses that I
> hadn't even thought of it.
>
> You could start with trying to work out how to use scanf without it,

scanf *is* precisely what I'm talking about: returning multiple values.

int a;
float b;
char c[];
(a, b, c) = tuple_fscanf(file, "%d%f%s");

The problem that this sort of "format string"-based procedure is
completely bonkers as an API isn't solved ;-). Of course you'd need
some way to accumulate an arbitrary run-time defined tuple (similar
to the problems of varargs), if you want to keep this crazy stuff.
The good news is: such a tuple as return value on the stack will not
mess around with addresses that are not there, but at worst push
more values on the stack than needed - and the stack cleanup after
calling tuple_fscanf will deal with that. Format string errors then
will still lead to wrong values in the assigned tuple, but *not* in
stuff written into the return address (code space).
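In standard C the closest approximation today is returning a struct
by value; the & still appears inside the callee, but no caller-side
address escapes. A minimal sketch (the struct name, field sizes, and
scan3_read are mine, for illustration):

#include <stdio.h>

struct scan3 { int a; float b; char c[64]; int ok; };

/* Return multiple values as one struct instead of writing through
   caller-supplied pointers. */
struct scan3 scan3_read(FILE *file)
{
    struct scan3 r = {0};
    r.ok = (fscanf(file, "%d%f%63s", &r.a, &r.b, r.c) == 3);
    return r;
}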

> and then try to pass a subsection of an array to a function (which
> then treats the subsection as a complete array).

Ah, that's easy:

int foo[10];

bar(foo+5);

No "address of" required, foo is an array object, +n is the operator to
create an array subsection. If you want to change the end, as well,
cast:

bar((int[3])(foo+5));
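In today's C that cast is not legal, so the standard-language
equivalent is to pass the subsection as a pointer/length pair. A
sketch, with an invented body for bar:

void bar(int *sub, int n)
{
    /* treat sub[0..n-1] as a complete array */
    for (int i = 0; i < n; i++)
        sub[i] = 0;
}

/* usage: int foo[10];  bar(foo + 5, 3); */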

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
From: Robert Myers on
On Jan 1, 1:59 pm, Mayan Moudgill <ma...(a)bestweb.net> wrote:
> Robert Myers wrote:
> > On Dec 31, 8:30 am, Mayan Moudgill <ma...(a)bestweb.net> wrote:
>
> >>Any problem that is not I/O bound that can be solved using a
> >>non-imperative language can be made to work faster in an imperative
> >>language with the appropriate library (i.e. if you can write it in ML,
> >>it can be re-written in C-as-high-level-assembly, and it will perform
> >>better).
>
> > If your original claim is correct even for one machine (and the truth
> > of your claim is not obvious to me, given the lengthy discussion here
> > about memory models),
>
> It's not necessarily simple, but it is definitely doable. One of the
> economic costs is being able to find the programmer(s) who can pull this
> off.
>
The closest thing to a proof I can examine that such programmers
even exist (and even then, only to some approximation) is the Linux
kernel. As far as I'm concerned, Windows and nearly every bit of
software that runs on it is a glaring counterexample. I assume that
the industrial-strength commercial *ixes might also be examples, but
I don't (for the most part) use them, and I can't examine the source.

I doubt if operating systems will ever be written in an elegant,
transparent, programmer-friendly language. For that universe (and
maybe some os-like systems like database software) your advice is, at
least in some practical sense, probably correct.

For the kind of scientific computing with which I have the most
experience, one naive calculation shows the amount of physics you
can do scaling as N^(1/4) (three space dimensions plus time, leaving
out lots of details). If your computer is 4x faster, then you can do
about 41% more physics (4^(1/4) ~ 1.41) in the same amount of time,
but with a 4x expenditure of energy (and investment in hardware, but
that's money into the pockets of hardware mfrs, which is ok by me).
The bomb labs, of course, don't have to fuss with that sort of
pedestrian consideration, but I assume that much of the rest of the
world does.

You are, I think, examining a universe in which the payoff for
performance is linear or even possibly better. I'm generally looking
at problems where the payoff for performance increases is marginal.

On the other hand, energy and computers are expensive, but scientists
and scientific programmers are also expensive, and the costs
associated with non-transparency and non-portability are very high.
Actually, I'd say those costs are unacceptable, but the world of
science has not yet advanced to my level of thinking. ;-) People are
perfectly happy to look at the end result of computations, taking it
largely on faith that the computations are correct or even make sense,
just so long as they fit "data" or prevailing prejudices. I don't
know why people bother with models that ostensibly mimic physics. In
the old days, people just fit curves, and I'm not sure how far beyond
curve-fitting we have actually advanced for pure science. I'm sure
that the picture doesn't look nearly as dismal for some kinds of
applications.

Some engineering applications make good use of large-scale
computation. One aerodynamicist I talked to who used CFD as a black
box said he was convinced there was a bug in a program that was widely
relied upon for aerodynamics. Even there, the successes may be
substantially delusional/luck.

> > does it then follow that it is also possible for
> > a second, or a third machine?  If the cost of rewrite is bounded for
> > one machine (that is to say that, you eventually succeeded in getting
> > the program to work--at least you *think* you got it to work--
> > correctly), it is bounded for a second or a third machine?
>
> Yes - generally, to get it done for one machine, one will have to have
> identified all sharing/communication between the parallel entities. This
>   is work you don't have to do again.
>
> There is a caveat - if the code (algorithms, task partitioning) has to
> be reworked because of large differences between the systems, then much
> less code/learning carries over.
>
But, for some applications, the costs may simply be unacceptable. You
can't invest the money to duplicate results? Too bad, then, I guess
you'll have to accept my results at face value, and, of course, I'm
the only one who will ever get support for working this problem,
because it's too expensive to move it anywhere. Good deal.

> > I'll accept that Nick is an expert at finding and fixing weird memory
> > problems.  From reading his posts, I'm convinced that the memory model
> > of c is not well-defined or even agreed-upon in any practical sense.
>
> Which is why it's C-as-high-level-assembly, not C-as-an-ANSI-standard,
> that will be used for getting the job done. Actually, that's not
> strictly true
> - you have to partition stuff into things that are isolated to one
> process/CPU  and code dealing with the sharing of state/synchronization
> between processes. The isolated case can generally be written in vanilla
> C (or any other language), while the sharing case has to be written
> non-portably - perhaps actually in assembly. Hopefully, a lot of the
> shared code can be hidden behind macros, function calls or code
> generators to isolate the behavior.
>
Does this "C-as-high-level-assembly" compiler exist?

> > Thus, not only are you worried about whether your program will work
> > with a second machine, but even with a second compiler with a
> > different notion of critical details of c.
>
> Again: you don't rely on the compiler to get those details correct. You
> have to ensure that it is correct. This may mean constraining the
> compiler in ways that make it no more than a high-level assembler, or
> using inline assembly, or even straight assembly where necessary.
>
Does an appropriately-constrained compiler exist? People seem to want
to add features, not remove them.

> The problem is that unless you've done it, you don't know where the
> friction points are, and you assume that it's too difficult. It isn't -
> it's just complicated engineering. I can think of lots of systems codes
> which are in many ways more complicated.
>
For the kinds of problems you are most accustomed to thinking about,
perhaps.

> > From an economic point of view, the proposed trade doesn't even seem
> > rational: trade a program with transparently correct memory semantics
> > (perhaps because there are no memory semantics to be incorrect) for
> > one that is faster but may or may not do the same thing under some
> > limited set of circumstances.
>
> Generally tasks that are not trivially parallelizable/distributable and
> are not IO bound are parallelized because the performance is inadequate.
> If the performance is inadequate, it may be because we don't have the
> best serial implementation, or because the best serial implementation is
> itself not sufficient.
>
> What is the slowdown between approach X (for you favorite value of X)
> and the best serial implementation? This slowdown matters - a lot.
>
> If, for instance, the slowdown is 4x, does that mean that we will end up
> with identical performance using 4 way parallelism? Probably not - the
> parallel inefficiencies will probably mean that we break even at 6x-8x.
>
> So: is it more economic to focus on writing a serial imperative program
> or a parallel approach X program?
>
> How about the case where we're going to *have to* parallelize - even
> with the best case serial program is just too slow. In that case, both
> the imperative approach and the alternative(s) will have to be parallel.
> What are the inefficiencies here?
>
> The hierarchical nature of communication between processors means that a
> 4-way parallel machine will have better communication properties than a
> 16-way parallel machine, which in turn will be better than a 64-way and
> so on. This means that if we can fit an imperative parallel program into
> a 4 way, and approach X is 4x slower, then approach X will be forced
> into a 16 way. But since it is now one level down the communication
> hierarchy, it is quite possible that it will be even slower, requiring,
> say, a 32 way machine to be competitive.
>
The scientists I know generally want to speed things up because they
are in a hurry.

The question is: is it better to do a bit less physics and/or let the
machine run longer, or is it better to use up expensive scientist/
scientific programmer time and, at the same time, make the code opaque
and not easily transportable?

> Also, in some programs, it is easy to extract a small amount of (task)
> parallelism, but it is not possible to extract large (or unbounded)
> parallelism.
>
If we can't do "unbounded" ("scalable") parallelism, then there is an
end of the road as far as some kinds of science are concerned, and we
may already be close to it or even there in terms of massive
parallelism (geophysical fluid dynamics would be an example). The
notion that current solutions "scale" is pure bureaucratic fraud.
Manufacturers who want to keep selling more of the same (do you know
any?) cooperate in this fraud, since the important thing is what the
customer thinks.

> It is possible that we have access to an N-way machine, there is N-way
> parallelism available in the program, the N-way solution using
> approach-X is fast enough, and we prioritize the advantages of using
> approach X (time-to-market, programmer availability, etc.) over the
> baseline, highest performance, approach. In that case, we are free to
> speculate about the various alternative programming approaches.
>
Which is mostly the kind of problem I am familiar with. Within a
constrained universe, your advice seems eminently sensible.

My bitter observation (and maybe Nick will agree) is that the world
has come to be dominated by a language (C) that is best suited for
writing operating systems, while most of us never have such a need.

Robert.
From: Del Cecchi on

"Mike" <mike(a)mike.net> wrote in message
news:v_qdnUeuT-97zKPWnZ2dnUVZ_hadnZ2d(a)earthlink.com...
>
> "Andy "Krazy" Glew" <ag-news(a)patten-glew.net> wrote in message
> news:4B3E4928.7060703(a)patten-glew.net...
> | nmm1(a)cam.ac.uk wrote:
> | > C99 allows you to encrypt addresses and/or save them on disk.
> | > Seriously.
> |
> | Which is, seriously, a good idea. For certain classes of
> | applications. Such as when you want to persist a data structure
> | to disk, that you will later load into exactly the same machine,
> | at the same locations. Like in a phone switch.
> |
> | However, for 99% of the jobs we need to do, not such a good idea.
> |
> | Except... you can do stupid persistence packages for single
> | threaded machines, on OSes that guarantee that data is always
> | allocated at the same address. Ditto simplistic checkpoint
> | recovery schemes.
> |
> | So I guess that it is not all that stupid for those apps.
> |
> | But it sure does get in the way for alias analysis.
>
>
> The IBM System i (not single threaded) places the file system in a
> single virtual address space in which all objects have a single
> constant virtual location which is never reassigned. That may
> provide a lead to a practical approach.
>
Back in the day it used to be said that system/i (os/400, s/38) didn't
really have a file system since it had a very large virtual address
space in which objects were located. But I was a hardware guy and
didn't really get the details.

del


From: Bill Todd on
Del Cecchi wrote:
> "Mike" <mike(a)mike.net> wrote in message
> news:v_qdnUeuT-97zKPWnZ2dnUVZ_hadnZ2d(a)earthlink.com...

....

>> The IBM System i (not single threaded) places the file system in a
>> single virtual address space in which all objects have a single
>> constant virtual location which is never reassigned. That may
>> provide a lead to a practical approach.
>>
> Back in the day it used to be said that system/i (os/400, s/38) didn't
> really have a file system since it had a very large virtual address
> space in which objects were located.

Well, sort of - at least in the sense that it didn't have a file system
that was exposed to applications.

But it must have had something resembling a file system internally
if it allowed objects to grow: despite having (for the time) an
effectively infinite virtual address space into which to map them,
it had decidedly finite physical storage space on disk in which to
hold them. It therefore needed a mechanism to map an arbitrarily
large expandable object onto multiple separate areas on disk while
preserving its virtual contiguity (and likely also required a means
to instantiate new objects too large to fit into any existing
physically-contiguous area of free space).

The normal way a file system (just like almost everyone else) supports
movable/expandable objects with unvarying addresses is via indirection,
substituting the unvarying address of a small pointer for that of an
awkwardly large and/or variable-size object. That unvarying address
need not be physical, of course - e.g., the i-series may have hashed the
constant virtual address to a chain address and then walked the chain
entries until it found one stamped with the desired target virtual address.
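As a sketch of that kind of indirection in C (a guess at the shape of
the mechanism, not the actual OS/400 internals; all names invented):

#include <stdint.h>
#include <stddef.h>

#define NBUCKETS 1024

/* Chain entry mapping an unvarying virtual address to the object's
   current, movable location on disk. */
struct map_entry {
    uint64_t          vaddr;       /* constant virtual address */
    uint64_t          disk_extent; /* current location on disk */
    struct map_entry *next;        /* hash-chain link */
};

static struct map_entry *buckets[NBUCKETS];

struct map_entry *lookup(uint64_t vaddr)
{
    struct map_entry *e = buckets[vaddr % NBUCKETS];
    while (e != NULL && e->vaddr != vaddr)  /* walk the chain */
        e = e->next;
    return e;  /* NULL if no entry is stamped with this address */
}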

But it's not clear how applicable this kind of solution would be to the
broader subject under discussion here.

- bill