From: Mayan Moudgill on
Bernd Paysan wrote:
> Mayan Moudgill wrote:
>
>>So: summarizing - I still don't think active messages is the right
>>name. I haven't encountered any real-life instances where people
>>actually send code to be executed (or even interpreted) at a low level
>>inside the device driver.
>
>
> I have - several different ones, in fact. One guy (Heinz Schnitter)
> sends source code around - this is a distributed programming system.
> Another one (Chuck Moore) sends instructions around - this is a small
> chip full of tiny CPUs. None of them really generalized, though I
> know that Chuck Moore knows what Heinz did.
>
>

Got any references that are publicly available?

Thanks

Mayan
From: Gavin Scott on
Bernd Paysan <bernd.paysan(a)gmx.de> wrote:
> The tragedy is that it was never designed for that job. It just
> happened. The consequence of all these accidental pieces growing
> together as the "next OS" is a nightmare.

Indeed, but that hardly makes it unique in this industry :)

G.
From: nmm1 on
In article <4B3CF140.2000504(a)patten-glew.net>,
Andy \"Krazy\" Glew <ag-news(a)patten-glew.net> wrote:
>
>> The key is that you can't do ANYTHING when handicapped by a memory
>> model like those of C, C++ or Java - once the address of a subobject
>> is required to be fixed, you are dead.
>
>Ah, now we are getting somewhere.
>
>Now, I *like* the fact that C in some ways is essentially a structured
>assembly - at a particular point in time an object or subobject has an
>address, and that can be taken and used.

If it had been left at that, it would have been fine. The disaster
was to put a layer of high-level language paint on it.

>However, I agree that it is the possibility of such an address persisting
>forever that gets in the way of so many things.
>
> Although... garbage collection systems can at least locate all such.
>However, we would not want to have to do such a scan from the GC roots
>frequently, as part of some synchronization or parallelization operations.

C99 allows you to encrypt addresses and/or save them on disk.
Seriously.
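
For instance, something like this is strictly conforming C99,
provided the object is still alive when the pointer comes back
(a minimal sketch; error checking omitted):

    #include <stdio.h>

    int main(void) {
        int obj = 42;
        int *p = &obj;

        /* Write the pointer's bytes out... */
        FILE *f = fopen("ptr.bin", "wb");
        fwrite(&p, sizeof p, 1, f);
        fclose(f);

        /* ...and read them back later in the same execution. */
        int *q = NULL;
        f = fopen("ptr.bin", "rb");
        fread(&q, sizeof q, 1, f);
        fclose(f);

        /* C99 guarantees q == p and that *q is usable, because
           obj is still live.  Any implementation that moves
           objects or recycles addresses cannot honour this. */
        printf("%d\n", *q);
        return 0;
    }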

>So, I agree that it might be nice to have a language "mode" that forbids
>or deprecates the address-of operator. E.g. one that had a special
>representation for parameters, that allowed reference, copy-in/out,
>etc. semantics. E.g. one that had a special representation for the
>pointers that so many people use to walk down an array or linked list.

You couldn't add one to C without making it unusable. Even ignoring
the fact that subscripting needs the address-of primitive, there is
just SO much else that depends on the basic static model. You can't
read data, handle strings or subsection arrays without the address-of
primitive, to name just three features. Simply forbidding the operator
isn't enough.
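
The dependence shows up in the most ordinary code. A trivial
illustration:

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        int x = 0, a[10] = {0};
        char buf[16] = "hello";

        if (scanf("%d", &x) == 1)   /* reading data needs &x          */
            a[3] = x;               /* a[3] is *(a + 3): the array    */
                                    /* decays to an address           */
        char *s = strchr(buf, 'l'); /* string handling hands back an  */
                                    /* interior pointer into buf      */
        int *sub = &a[2];           /* subsectioning an array is just */
                                    /* the address of an element      */
        printf("%c %d\n", *s, *sub);
        return 0;
    }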

Also, modes have never worked in the past, and I can't see anything
that has changed. Obviously, the language needs to be designed to
support the modes, or else they are merely 'style guidelines', which
have failed time and time again. But that's not enough. There isn't
any real likelihood of starting with a C-like language and adding
enough constraints to make it reliable or optimisable.

I looked at C# and wasn't impressed. Yes, it tackled many of the
details, but skirted around most of the underlying problems. This
problem has been tackled many times over the past 40 years, and is
well-known to be hard. It needs a more radical approach than C#.


Regards,
Nick Maclaren.
From: nmm1 on
In article <4B3D7D40.4070807(a)patten-glew.net>,
Andy \"Krazy\" Glew <ag-news(a)patten-glew.net> wrote:
>>
>>> Herlihy's non-blocking lock free algorithms using compare-and-swap
>>> depend on there being a single pointer that dominates the data structure
>>> you are changing.
>>
>> One of the many reasons that so many people feel that a double form
>> of compare-and-swap is essential.
>
>I've read Greenwald's thesis, and I am sympathetic. In fact, I have
>proposed implementations for DCAS, twice at Intel and once at AMD.
>
>However
>
>1) It doesn't meet all needs - witness the folks who ask for KCSS
>(k-compare single-swap) and Transactional Memory with
>arbitrary numbers of locations involved.
>
>2) It doesn't seem to be necessary in a theoretical sense:

Agreed, but with reservations on the second point. Most of the
alternative approaches have ignored the performance implications.
Floating-point hardware isn't needed in a theoretical sense, either,
but there are people who would scream if you dropped it :-)

What is clear is that there are a zillion ways to provide adequate
'atomic' primitives, and most of them can be claimed to be 'best'
if you choose the right criteria. I doubt that agreement will be
reached in my lifetime.
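
To make Herlihy's precondition concrete, here is a minimal sketch
of the classic lock-free stack push in C11, where a single 'top'
pointer dominates the whole structure (illustration only - the pop
side, with its ABA problem, is omitted):

    #include <stdatomic.h>
    #include <stdlib.h>

    struct node {
        int value;
        struct node *next;
    };

    /* The one pointer that dominates the whole structure. */
    static _Atomic(struct node *) top;

    void push(int v) {
        struct node *n = malloc(sizeof *n);  /* NULL check omitted */
        n->value = v;
        struct node *old = atomic_load(&top);
        do {
            n->next = old;
            /* A single CAS on the dominating pointer publishes
               the whole new state atomically; on failure, old is
               reloaded and we retry. */
        } while (!atomic_compare_exchange_weak(&top, &old, n));
    }

A structure where two separate words must change together - the two
ends of a deque, say - is exactly where a single CAS stops being
enough, and DCAS starts looking attractive.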

>The thing I don't like about DCAS or KCSS or TM is that most of the
>hardware proposals depend on having a cache coherence protocol.
>
>As explained in other posts, of late I have been thinking in terms of
>supercomputers without a cache coherence protocol.

It's a thorny problem. I don't have any good ideas.


Regards,
Nick Maclaren.
From: Mayan Moudgill on
Robert Myers wrote:
> On Dec 31, 8:30 am, Mayan Moudgill <ma...(a)bestweb.net> wrote:
>
>
>>Any problem that is not I/O bound that can be solved using a
>>non-imperative language can be made to work faster in an imperative
>>language with the appropriate library (i.e. if you can write it in ML,
>>it can be re-written in C-as-high-level-assembly, and it will perform
>>better).

>
> If your original claim is correct even for one machine (and the truth
> of your claim is not obvious to me, given the lengthy discussion here
> about memory models),

It's not necessarily simple, but it is definitely doable. One of the
economic costs is being able to find the programmer(s) who can pull this
off.


> does it then follow that it is also possible for
> a second, or a third machine? If the cost of rewrite is bounded for
> one machine (that is to say that, you eventually succeeded in getting
> the program to work--a least you *think* you got it to work--
> correctly), it is bounded for a second or a third machine?

Yes - generally, to get it done for one machine, one will have to have
identified all sharing/communication between the parallel entities. This
is work you don't have to do again.

There is a caveat - if the code (algorithms, task partitioning) has to
be reworked because of large differences between the systems, then much
less code/learning carries over.


>
> I'll accept that Nick is an expert at finding and fixing weird memory
> problems. From reading his posts, I'm convinced that the memory model
> of C is not well-defined or even agreed-upon in any practical sense.

Which is why it's C-as-high-level-assembly, not C-as-an-ANSI-standard,
that will be used for getting the job done. Actually, that's not
strictly true - you have to partition the code into parts that are
isolated to one process/CPU and parts that deal with the sharing of
state/synchronization between processes. The isolated case can
generally be written in vanilla C (or any other language), while the
sharing case has to be written non-portably - perhaps actually in
assembly. Hopefully, a lot of the shared code can be hidden behind
macros, function calls or code generators to isolate the behavior.
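
For example, something as small as this can be the entire
non-portable surface (a sketch, assuming GCC-style __atomic
builtins; the names are illustrative):

    /* sync.h - the only place that knows how the target really
       publishes shared state.  GCC-style builtins assumed here;
       each port supplies its own version (or inline asm). */
    #ifndef SYNC_H
    #define SYNC_H

    static inline void publish_word(volatile long *p, long v)
    {
    #if defined(__GNUC__)
        __atomic_store_n(p, v, __ATOMIC_RELEASE);
    #else
        *p = v;   /* plus whatever barrier this target needs -
                     filled in per port, possibly in assembly */
    #endif
    }

    #endif /* SYNC_H */

Vanilla C code calls publish_word() and never learns how the target
actually orders its stores.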

> Thus, not only are you worried about whether your program will work
> with a second machine, but even with a second compiler with a
> different notion of critical details of C.

Again: you don't rely on the compiler to get those details correct. You
have to ensure that they are correct. This may mean constraining the
compiler in ways that make it no more than a high-level assembler, or
using inline assembly, or even straight assembly where necessary.

The problem is that unless you've done it, you don't know where the
friction points are, and you assume that it's too difficult. It isn't -
it's just complicated engineering. I can think of lots of systems codes
which are in many ways more complicated.


> From an economic point of view, the proposed trade doesn't even seem
> rational: trade a program with transparently correct memory semantics
> (perhaps because there are no memory semantics to be incorrect) for
> one that is faster but may or may not do the same thing under some
> limited set of circumstances.

Generally, tasks that are not trivially parallelizable/distributable and
are not I/O bound are parallelized because the performance is inadequate.
If the performance is inadequate, it may be because we don't have the
best serial implementation, or because the best serial implementation is
itself not sufficient.

What is the slowdown between approach X (for your favorite value of X)
and the best serial implementation? This slowdown matters - a lot.

If, for instance, the slowdown is 4x, does that mean that we will end up
with identical performance using 4-way parallelism? Probably not - the
parallel inefficiencies will probably mean that we break even at 6x-8x.

So: is it more economic to focus on writing a serial imperative program
or a parallel approach X program?

How about the case where we're going to *have to* parallelize - where
even the best serial program is just too slow? In that case, both
the imperative approach and the alternative(s) will have to be parallel.
What are the inefficiencies here?

The hierarchical nature of communication between processors means that a
4-way parallel machine will have better communication properties than a
16-way parallel machine, which in turn will be better than a 64-way, and
so on. This means that if we can fit an imperative parallel program onto
a 4-way machine, and approach X is 4x slower, then approach X will be
forced onto a 16-way. But since it is now one level down the
communication hierarchy, it is quite possible that it will be even
slower, requiring, say, a 32-way machine to be competitive.
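
Back-of-envelope, with made-up but plausible efficiency numbers:

    /* Break-even estimate.  Slowdown and efficiency figures are
       illustrative assumptions, not measurements. */
    #include <stdio.h>

    int main(void) {
        double slowdown = 4.0;   /* approach X vs. best serial C */
        /* Assumed parallel efficiency, falling as we move down
           the communication hierarchy. */
        double eff4 = 0.90, eff16 = 0.70, eff32 = 0.60;

        double c4  = 4.0 * eff4;                  /* C, 4-way: 3.6x  */
        double x16 = (16.0 / slowdown) * eff16;   /* X, 16-way: 2.8x */
        double x32 = (32.0 / slowdown) * eff32;   /* X, 32-way: 4.8x */

        printf("C/4-way %.1f, X/16-way %.1f, X/32-way %.1f\n",
               c4, x16, x32);
        return 0;
    }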

Also, in some programs, it is easy to extract a small amount of (task)
parallelism, but it is not possible to extract large (or unbounded)
parallelism.

It's possible that we have access to an N-way machine, there is N-way
parallelism available in the program, the N-way solution using
approach X is fast enough, and we prioritize the advantages of using
approach X (time-to-market, programmer availability, etc.) over the
baseline, highest-performance approach. In that case, we are free to
speculate about the various alternative programming approaches.

As an aside: if approach X uses garbage collection, then there may be a
significant penalty for going from a serial implementation to a parallel
implementation; a simple comparison of the performance of the sequential
implementations in approach X and an imperative language may understate
the comparative performance.