From: George Neuner on
On 12 Jan 2007 01:35:29 -0800, "Tim Bradshaw" <tfb+google(a)tfeb.org>
wrote:

>George Neuner wrote:
>
>> Well, on Cells the private memories are not cache but staging memories
>> ... the main processor has to move data into and out of them on behalf
>> of the coprocessors.
>
>It doesn't matter very much who moves the data, it's still cache :-).
>The issue that counts, really, is what the programming model is at the
>user level. No one should need to care whether things are done
>automagically by the hardware as most L1/L2 caches are today, or by
>hardware with substantial SW support as, say, MMUs, or almost entirely
>by SW with some small amount of HW support, as, say disk paging.
>(Actually, the second thing that counts is whether the HW can
>efficiently support the programming model you choose.)

I have considerable experience with manual staging (on DSPs) and I can
tell you that it is a royal PITA to schedule several functional units
and keep them going full blast using software alone.

Cell is less onerous only because of the granularity of the code the
coprocessors can execute - whole functions or miniprograms rather than
the baby steps DSP units can take.


>> AFAIK, no one has tried to offer a hardware solution to staging
>> computations in a distributed memory system since the KSR1 (circa
>> 1990, which failed due to the company's creative bookkeeping rather
>> than the machine's technology). Everyone now relies on software
>> approaches like MPI and PVM.
>
>Well, I think they have actually, in all but name: that's essentially
>what NUMA machines are. Such machines are quite common, of course
>(well, for bigger systems anyway): all Sun's recent larger machines (4
>& 5-digit sunfire boxes) are basically NUMA, and it may be that smaller
>ones are too.

Non Uniform Memory Access simply means different memories have
different access times - that describes just about every machine made
today. The NUMA model distinguishes between "near" and "far" memories
in terms of access time, but does not distinguish by how the memories
are connected - a system with fast cache and slower main memory fits
the model just as well as one with a butterfly network between CPU and
memory.


>Of course, as I said above, this comes down to programming model and
>how much HW support you need for it. I think the experience of the
>last 10-20 years is that a shared memory model (perhaps "shared address
>space"?), preferably with cache-coherency, is a substantially easier
>thing to program for than a distributed memory model. Whether that will
>persist, who knows (I suspect it will, for a surprisingly long time).
>Of course the physical memory that underlies this model will become
>increasingly distributed, as it already has to a great extent.

It's all about the programming model and I think you are on the right
track. Shared address space is the right approach, IMO, but further I
believe it should be implemented in hardware.

That is why I mentioned KSR1 - the only massively parallel
multiprocessor I know of that tried to help the programmer. KSR1 was a
distributed memory
multiprocessor (256..1088 CPUs) with a multilevel caching tree network
which provided the programmer with the illusion of a shared memory.
The KSR1 ran a version of OSF/1, so software written for any shared
memory Unix multiprocessor was relatively easy to port - an important
consideration because most people looking to buy a supercomputer were
outgrowing a shared memory machine.

There was, of course, a penalty paid for the illusion of shared
memory. Estimates were that the cache consistency model slowed the
machine by 15-25% vs comparable MPI designs, but IMO that was more
than made up for by the ease of programming. The second generation
KSR2 improved shared memory speeds considerably, but few people ever
saw one - the company went belly up before it was formally introduced.


George
--
for email reply remove "/" from address
From: Tim Bradshaw on
mark.hoemmen(a)gmail.com wrote:

> Rob Warnock wrote:
> > "dusty deck" codes
>
> 1960's codes
>
> parallel codes
>
> Recent codes

OK, I think this makes the point that "codes" is a common usage in the
HPC community. I will expect delivery of Chris Barts' head on a
platter tomorrow morning. You can do what you want with the rest of
him.

From: Tim Bradshaw on
George Neuner wrote:


> I have considerable experience with manual staging (on DSPs) and I can
> tell you that it is a royal PITA to schedule several functional units
> and keep them going full blast using software alone.

I bet it is!



> Non Uniform Memory Access simply means different memories have
> different access times - that describes just about every machine made
> today. The NUMA model distinguishes between "near" and "far" memories
> in terms of access time, but does not distinguish by how the memories
> are connected - a system with fast cache and slower main memory fits
> the model just as well as one with a butterfly network between CPU and
> memory.

I agree with this in theory, and of course nearly all machines are NUMA
in that sense (there have been some recent cacheless designs which
aimed to hide latency with heavily multithreaded processors instead).
But I think the conventional use of the term is for multiprocessors
where all memory is "more local" (in time terms) to some processors
than it is to others, and that was the sense in which I was using it.
You can think of these kinds of machines as systems where there is only
cache memory. It seems to me inevitable that all large machines will
become NUMA, if they are not all already, and that the nonuniformity
will increase over time.

My argument is that physically, these machines actually are distributed
memory systems, but their programming model is that of a shared memory
system. And this illusion is maintained by a combination of hardware
(route requests to non-local memory over the interconnect, deal with
cache-coherency etc) and system-level software (arrange life so that
memory is local to the threads which are using it where that is
possible etc).

Of course these machines typically are not MPP systems, and are also
typically not HPC-oriented. Though I think SGI made NUMA systems with
really quite large numbers of processors, and a Sun E25K can have 144
cores (72 2-core processors), though I think it would be quite unusual
to run a configuration like that as a single domain.

--tim

From: Chris Barts on
On Fri, 12 Jan 2007 14:20:11 -0800, Tim Bradshaw wrote:

>
> OK, I think this makes the point that "codes" is a common usage in the
> HPC community. I will expect delivery of Chris Barts' head on a
> platter tomorrow morning. You can do what you want with the rest of
> him.

Doesn't matter. It merely means a lot of people are wrong, and a lot of
people need frying.

--
My address happens to be com (dot) gmail (at) usenet (plus) chbarts,
wardsback and translated.
It's in my header if you need a spoiler.


----== Posted via Newsfeeds.Com - Unlimited-Unrestricted-Secure Usenet News==----
http://www.newsfeeds.com The #1 Newsgroup Service in the World! 120,000+ Newsgroups
----= East and West-Coast Server Farms - Total Privacy via Encryption =----
From: Chris Barts on
On Thu, 11 Jan 2007 03:37:59 -0800, Tim Bradshaw wrote:

> Chris Barts wrote:
>
>>
>> How many people have forgotten that 'code' is a mass noun and, as such,
>> does not take plurals? Do you also say 'these muds' and 'these dusts'?
>
> How many people have forgotten that *language changes over time* and is
> not something handed down from the elder days, never to be changed?

"Like, wow, dude! Language is whatever I say it is! Crumb buttercake up
the windowpane with the black shoehorn butterhorse!"

Grow up.

--
My address happens to be com (dot) gmail (at) usenet (plus) chbarts,
wardsback and translated.
It's in my header if you need a spoiler.

