From: jgd on
In article <4B21DA4C.8010402(a)patten-glew.net>, ag-news(a)patten-glew.net (
Glew) wrote:

> At SC09 the watchword was heterogeneity.
>
> E.g. a big OOO x86 core, with small efficient cores of your favorite
> flavour. On the same chip.

It's a nice idea, but it leaves some questions unanswered. The small
cores are going to need access to memory, and that means more
controllers in the packages, and more legs on the chip. That costs,
whatever.

Now, are the small cores cache-coherent with the big one? If so, that's
more complexity, if not, it's harder to program. I suspect that if they
share an instruction set with the big core, cache coherency is
worthwhile, but if not, not.

Overall, the main advantage of this idea seems to be having a low-
latency link between main and small cores. That is not to be sneezed at:
we've given up a co-processor project because of the geological ages
needed to communicate across PCI-Express buses. Back-of-the-envelope
calculations made it clear that even if the co-processor took zero time
to do its work, we would still have come out slower overall.
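
In toy form, with purely illustrative numbers rather than our real
figures, the break-even arithmetic looks something like this:

  /* Toy offload break-even check; the latency and compute figures
   * are made up for illustration, not measurements from our project. */
  #include <stdio.h>

  int main(void)
  {
      double pcie_round_trip_us = 10.0; /* host->card->host transfer */
      double host_compute_us    = 2.0;  /* just doing the work locally */

      /* Even an infinitely fast co-processor cannot win unless the
       * compute time saved exceeds the transfer cost. */
      if (host_compute_us < pcie_round_trip_us)
          printf("offload loses even at zero co-processor time\n");
      return 0;
  }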

> While you could put a bunch of small x86 cores on the side, I think
> that you would probably be better off putting a bunch of small
> non-x86 cores on the side. Like GPU cores. Like Nvidia. Or AMD/ATI
> Fusion.
>
> Although this makes sense to me, I wonder if the people who want x86
> really want x86 everywhere - on both the big cores, and the small.
>
> Nobody likes the hetero programming model. But if you get a 100x
> perf benefit from GPGPU...

The stuff I produce is libraries that get licensed to third parties and
put into a wide range of apps. Those get run on all sorts of machines,
from all sorts of manufacturers; we need to run on whatever the customer
has, rather than simply what the software developers' managers chose to
buy.

That means "small efficient cores of your favourite flavour" are
something of a pain: if there are several different varieties of such
things out there, I have to support (and thus build for and test) most
of them, or plump for one with a significant chance of being wrong, or
wait for a dominant one to emerge. Which is easiest?

That's the attraction of OpenCL as opposed to CUDA: it isn't tied to one
manufacturer's hardware. However, AMD don't seem to be doing a great job
of spreading it around at present.

The great potential advantage, to me, of the small cores being x86 is
not the x86 instruction set, or its familiarity, or its widespread
development tools. It's having them standardised. That doesn't solve the
problem of making good use of them, but it takes some logistic elements
(and thus costs) out of it.

--
John Dallman, jgd(a)cix.co.uk, HTML mail is treated as probable spam.
From: "Andy "Krazy" Glew" on
jgd(a)cix.compulink.co.uk wrote:
> In article <4B21DA4C.8010402(a)patten-glew.net>, ag-news(a)patten-glew.net (
> Glew) wrote:
>
>> At SC09 the watchword was heterogeneity.
>>
>> E.g. a big OOO x86 core, with small efficient cores of your favorite
>> flavour. On the same chip.
>
> It's a nice idea, but it leaves some questions unanswered. The small
> cores are going to need access to memory, and that means more
> controllers in the packages, and more legs on the chip. That costs,
> whatever.
>
> Now, are the small cores cache-coherent with the big one? If so, that's
> more complexity, if not, it's harder to program. I suspect that if they
> share an instruction set with the big core, cache coherency is
> worthwhile, but if not, not.

I must admit that I do not understand your term "legs on the chip". When
I first saw it, I thought that you meant pins. Like, the old two chips
in same package, or on same chip, not sharing a memory controller. But
that does not make sense here.

Whenever you have multicore, you have to arrange for memory access. The
main way this is done is to arrange for all to access the same memory
controller. (Multiple memory controllers are a possibility. Multiple
MCs subdividing the address space, either by address ranges or by
interleaved cache lines or similar blocks, are a possibility. Multiple
MCs with separate address spaces, dedicated to separate groups of
processors, are possible, but I don't know what would motivate that.
Bandwidth, perhaps - but non-cache-coherent shared memory has the same
bandwidth advantages. Security?)
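
For the interleaved-cache-line case the mapping is trivial. A sketch,
where the 64-byte line and the four controllers are assumptions for
illustration, not any particular design:

  /* Hypothetical address-to-MC mapping, interleaved at cache-line
   * granularity. Line size and controller count are made up. */
  #include <stdint.h>
  #include <stdio.h>

  #define LINE_BYTES 64u
  #define NUM_MCS    4u   /* power of two, so % reduces to a mask */

  static unsigned mc_for_address(uint64_t paddr)
  {
      return (unsigned)((paddr / LINE_BYTES) % NUM_MCS);
  }

  int main(void)
  {
      uint64_t a;
      for (a = 0; a < 6 * LINE_BYTES; a += LINE_BYTES)
          printf("0x%03llx -> MC %u\n",
                 (unsigned long long)a, mc_for_address(a));
      return 0;
  }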

I therefore do not understand you when you say "that means more
controllers in the package". The hetero chips would probably share the
same memory controller.

If you mean cache controllers, yes: if you want cache consistency, you
will need cache controllers for every small processor, or at least every group
of processors.

If you have a scalable interconnect on chip, then both big and small
processors will connect to it. Having N big cores + M small cores is no
more complex in that regard than having N+M big cores. Except... since
the sizes and shapes of the big and small cores are different, the
physical layout will be different. Timing, etc. (But if you are
creating a protocol that is timing and layout sensitive, you deserve to
be cancelled.) Logically, same complexity.

Testing-wise, of course, the complexity is different. You would have to
test all of the combinations: big/big, big/small, small/small,
small/small at the ends of the IC, ...

--

As for cache consistency, that is on and off. Folks like me aren't
afraid to take the cache protocols that work on multichip systems, and
put them on-chip. Integration is obvious. Where you get into problems
is wrt tweaking.

On the other hand, big MP / HPC systems tend to have nodes that consist
of 4-8-16 cache consistent shared memory cores, and then run PGAS style
non-cache-coherent shared memory between them, or MPI message passing.
Since integration is inevitable as well as obvious, inevitably we
will have more than one cache-coherent domain on chip, with PGAS-style
or MPI non-cache-coherent communication between the domains.
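
In MPI terms, a minimal sketch of that model - coherent sharing inside
a domain, explicit messages between domains. Illustrative only; the
ranks here just stand in for coherence domains:

  /* Minimal sketch: each MPI rank stands in for one cache-coherent
   * domain; communication between domains is by explicit message. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, buf = 42;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 0) {
          MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
          MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                   MPI_STATUS_IGNORE);
          printf("domain 1 received %d from domain 0\n", buf);
      }
      MPI_Finalize();
      return 0;
  }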

From: nmm1 on
In article <4B265271.6020809(a)patten-glew.net>,
Andy \"Krazy\" Glew <ag-news(a)patten-glew.net> wrote:
>jgd(a)cix.compulink.co.uk wrote:
>>
>>> At SC09 the watchword was heterogeneity.
>>>
>>> E.g. a big OOO x86 core, with small efficient cores of your favorite
>>> flavour. On the same chip.
>>
>> It's a nice idea, but it leaves some questions unanswered. ...
>>
>> Now, are the small cores cache-coherent with the big one? If so, that's
>> more complexity, if not, it's harder to program. I suspect that if they
>> share an instruction set with the big core, cache coherency is
>> worthwhile, but if not, not.
>
>As for cache consistency, that is on and off. Folks like me aren't
>afraid to take the cache protocols that work on multichip systems, and
>put them on-chip. Integration is obvious. Where you get into problems
>is wrt tweaking.

Precisely. Therefore, when considering larger multi-core than today's,
one should look at the systems that have already delivered that scale
using multiple chips, and see how they have fared. It's not pretty.

Now, it is POSSIBLE that multi-core coherence is easier to make
reliable and efficient than multi-chip coherence, but a wise man
will not assume that until he has investigated the causes of the
previous problems and seen at least draft solutions.

8-way shouldn't be a big deal, 32-way will be a lot trickier,
128-way will be a serious problem and 512-way will be a nightmare.
All numbers subject to scaling :-)
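
Crudely: with broadcast (snoopy) coherence every miss must interrogate
every other cache, so total snoop traffic grows roughly as the square
of the core count. With a made-up miss rate:

  /* Crude, illustrative snoop-traffic scaling; the per-core miss
   * rate is a made-up number, not a measurement. */
  #include <stdio.h>

  int main(void)
  {
      double misses_per_core = 1e7;  /* misses/second, hypothetical */
      int n;
      for (n = 8; n <= 512; n *= 4)
          printf("%4d-way: %.1e snoop events/sec\n",
                 n, misses_per_core * n * (n - 1));
      return 0;
  }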

>On the other hand, big MP / HPC systems tend to have nodes that consist
>of 4-8-16 cache consistent shared memory cores, and then run PGAS style
>non-cache-coherent shared memory between them, or MPI message passing.

The move to that was a response to the reliability, efficiency and
(most of all) cost problems on the previous multi-chip coherent
systems.

> Since integration is inevitable as well as obvious, inevitably we
>will have more than one cache-coherent domain on chip, with PGAS-style
>or MPI non-cache-coherent communication between the domains.

Extremely likely - nay, almost certain. Whether those domains will
share an address space or not, it's hard to say. My suspicion is
that they will, but there will be a SHMEM-like interface to them
from their non-owning cores.
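
For flavour, from a non-owning core such an interface might look
roughly like this. Everything here is invented for illustration - the
two "domains" are faked as arrays in one process, and the names are
not any real API:

  /* Toy model of a SHMEM-like interface between coherence domains.
   * All names are invented; nothing here is a real library. */
  #include <stdio.h>
  #include <string.h>
  #include <stddef.h>

  #define NDOMAINS 2
  #define DOM_MEM  1024

  static unsigned char dom_mem[NDOMAINS][DOM_MEM]; /* per-domain memory */

  /* One-sided put/get into another domain's memory. */
  static void dom_put(int dom, size_t off, const void *src, size_t n)
  {
      memcpy(&dom_mem[dom][off], src, n);
  }

  static void dom_get(void *dst, int dom, size_t off, size_t n)
  {
      memcpy(dst, &dom_mem[dom][off], n);
  }

  int main(void)
  {
      int x = 42, y = 0;
      dom_put(1, 0, &x, sizeof x);  /* non-owning core writes ... */
      dom_get(&y, 1, 0, sizeof y);  /* ... and reads it back */
      printf("read %d from domain 1\n", y);
      return 0;
  }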


Regards,
Nick Maclaren.
From: jgd on
In article <4B265271.6020809(a)patten-glew.net>, ag-news(a)patten-glew.net (
Glew) wrote:

> I must admit that I do not understand your term "legs on the chip".
> When I first saw it, I thought that you meant pins. Like, the old two
> chips in same package, or on same chip, not sharing a memory
> controller. But that does not make sense here.

That is what I meant. I just wasn't clear enough.

> Whenever you have multicore, you have to arrange for memory access.
> The main way this is done is to arrange for all to access the same
> memory controller. (Multiple memory controllers are a possibility.

I wasn't explaining enough. A single memory channel does not seem
to be enough for today's big OOO x86 cores: a Core 2 Duo system has two
memory channels; a Core i7 has three. This inevitably pushes up pin
count. If you add a bunch more small cores, you're going to need even
more memory bandwidth, and thus presumably more memory channels. This
is no doubt achievable, but the price may be a problem.
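
Back-of-the-envelope again, with assumed round numbers (the per-channel
and per-core figures below are illustrative, not datasheet values):

  /* Illustrative supply-vs-demand bandwidth arithmetic; every figure
   * here is an assumed round number, not a datasheet value. */
  #include <stdio.h>

  int main(void)
  {
      double gb_per_channel = 10.7;  /* e.g. roughly one DDR3-1333 channel */
      double gb_per_core    = 2.0;   /* assumed per-core streaming demand */
      int channels = 3, big = 4, small = 16;

      printf("supply %.1f GB/s vs demand %.1f GB/s\n",
             channels * gb_per_channel, (big + small) * gb_per_core);
      return 0;
  }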

--
John Dallman, jgd(a)cix.co.uk, HTML mail is treated as probable spam.
From: Robert Myers on
On Dec 14, 4:03 pm, j...(a)cix.compulink.co.uk wrote:
 
>
> I wasn't explaining enough. A single memory channel does not seem
> to be enough for today's big OOO x86 cores: a Core 2 Duo system has two
> memory channels; a Core i7 has three. This inevitably pushes up pin
> count. If you add a bunch more small cores, you're going to need even
> more memory bandwidth, and thus presumably more memory channels. This
> is no doubt achievable, but the price may be a problem.

Bandwidth. Bandwidth. Bandwidth.

It must be in scripture somewhere. It is, but no one reads the Gospel
according to Seymour any more.

Is an optical fat link out of the question? I know that optical on-
chip will take a miracle and maybe a Nobel prize, but just one fat
link. Is that too much to ask?

Robert.