From: Robert Myers on
On Jun 1, 10:55 am, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:

>
> I forgot to say that the Kilo-instruction paper speedups I report above make several assumptions that are probably
> conservative. E.g. 4-wide.
>

It's a thought-provoking paper.

I really didn't mean to be insulting, just provocative. I wanted you
to talk more, and you did.

I was thinking of David Dinucci's version of coarse-grained
parallelism when I read your proposal. In that approach, you snip the
problem into carefully-chosen pieces ahead of time, carefully specify
what's coming in and what should go out, and leave it to a run-time
scheduler to put it all back together. Has some nice formal
properties. That was my context for trying to understand what you
were proposing.
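
A minimal sketch of that style, as I understand it (my own toy notation,
not Dinucci's actual formalism): each piece declares what it consumes and
produces, and the scheduler fires a piece only once its inputs exist.

  # Toy "declare inputs/outputs, let a scheduler reassemble it" style.
  # My own strawman illustration, not Dinucci's formalism.
  pieces = {
      "load":   {"in": [],         "out": ["a"], "run": lambda env: {"a": 40}},
      "square": {"in": ["a"],      "out": ["b"], "run": lambda env: {"b": env["a"] ** 2}},
      "report": {"in": ["a", "b"], "out": [],    "run": lambda env: print(env["a"], env["b"]) or {}},
  }
  env, done = {}, set()
  while len(done) < len(pieces):
      for name, p in pieces.items():
          if name not in done and all(v in env for v in p["in"]):
              env.update(p["run"](env))   # a piece runs only when its inputs are ready
              done.add(name)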

I pictured you looking at his proposal and saying, "You want parallel
lumps? Easy." You take a meat cleaver to a serial instruction
stream, whack it into bite-sized chunks, and say, "There. What else
are you looking for?" There had to be some serious potential ugliness
hiding in what you were proposing. Or maybe the whole idea of trying
to identify the parallel pieces on the programming end was just wrong-
headed to begin with. I'm still thinking.

Thinking about how much you might throw into the pot at once and what
you might do with it once it's there is a useful exercise, whether it
leads to a commercially-viable architecture or not.

Robert.
From: nmm1 on
In article <4C041839.90907(a)patten-glew.net>,
Andy 'Krazy' Glew <ag-news(a)patten-glew.net> wrote:
>
> So, first, it is well-known that there is some potential in simply
> increasing instruction window size. The problem is how costly is the
> hardware, esp. power wise, to do so.

Yes.

> E.g. the kilo-instruction papers report 3.5x for FP and 1.5X for
> integer, going from instruction windows of 128 to 4K instructions.
>
> Now, that is not a very good payoff. Definitely sublinear. ....

Bluntly, it's dire. There are lots of other ways to optimise the
'floating-point' workloads to that scale, and a 50% improvement for
a 32-fold increase in window size (for integer) is horrible.
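
To put a number on that (back-of-the-envelope arithmetic of mine, not a
figure from the paper): 128 to 4K entries is five doublings, so a 1.5x
total gain works out to only about 8% per doubling of an already
power-hungry structure.

  # Rough per-doubling payoff implied by the kilo-instruction numbers
  # (my arithmetic, not the paper's):
  doublings = 5                      # 128 -> 4096 window entries
  for label, speedup in [("FP", 3.5), ("integer", 1.5)]:
      per_doubling = speedup ** (1.0 / doublings)
      print(f"{label}: {per_doubling:.3f}x per window doubling")
  # FP:      ~1.285x per doubling (about 28%)
  # integer: ~1.084x per doubling (about 8%)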

> But the whole point is not that we are trying to build processors that
> are 4096/128 = 32X bigger. First, the OOO hardware is only a fraction
> of processor area, especially if you have big vector units like some
> recent proposals. Second, more importantly, to build big effective
> instruction windows you do not need to build all that hardware. As the
> kilo-instruction papers, and my own work, shows.

I have downloaded the paper and will look at it, but the very fact
that people are considering such extreme measures shows that we are
running out of viable ideas. I am not denying that small speedups
can be obtained, expensively, but we have to rethink from a much more
basic starting point if we want large ones.

And there is no way that we will get there without pain - by which I
mean to the perpetrators of the current code! We have absolutely got
to simplify the hardware's task (which is one of the few points where
I agree with the Itanic's designers).


Regards,
Nick Maclaren.
From: Robert Myers on
On Jun 1, 2:48 pm, n...(a)cam.ac.uk wrote:
> In article <4C041839.90...(a)patten-glew.net>,
> Andy 'Krazy' Glew  <ag-n...(a)patten-glew.net> wrote:
>
>
>
> > So, first, it is well-known that there is some potential in simply
> > increasing instruction window size.  The problem is how costly is the
> > hardware, esp. power wise, to do so.
>
> Yes.
>
> > E.g. the kilo-instruction papers report 3.5x for FP and 1.5X for
> > integer, going from instruction windows of 128 to 4K instructions.
>
> > Now, that is not a very good payoff.  Definitely sublinear.  ....
>
> Bluntly, it's dire.  There are lots of other ways to optimise the
> 'floating-point' workloads to that scale, and a 50% improvement for
> a 32-fold increase in window size (for integer) is horrible.
>

But what are the alternatives? Computers deal with ever-larger
datasets, with ever-increasing numbers of processors, and the data
gradually slide off the latency-tolerance horizon of any given
processor. Faster memory? Chaa, right. Violate apparently basic
laws of physics? Maybe with some as yet undiscovered twist on the EPR
paradox.

Personally, I think speed-up is the wrong focus. Sooner or later, for
reasons that seem pretty fundamental to me, we will be back to
bragging, as Intel once did, about instructions in flight.

Robert.
From: nmm1 on
In article <0b318bfc-a635-4867-bdae-0b51ecae7156(a)m21g2000vbr.googlegroups.com>,
Robert Myers <rbmyersusa(a)gmail.com> wrote:
>
>But what are the alternatives? Computers deal with ever-larger
>datasets, with ever-increasing numbers of processors, and the data
>gradually slide off the latency-tolerance horizon of any given
>processor. Faster memory? Chaa, right. Violate apparently basic
>laws of physics? Maybe with some as yet undiscovered twist on the EPR
>paradox.

Hang on. You have seen me ride my hobby-horse off into the sunset
often enough! We aren't going to get anywhere until we change the
current approaches to programming - not just the languages, but the
paradigms used. In the most extreme case, I take that as far as
saying that we need to change the way that mathematics and algorithms
are taught!

But let's not go that far today (it is getting late). We know that
we can do a lot better with far less difficulty than most people
will admit - and will get better RAS, too! Fortran has done a lot
of that, and that's a language from way back. But it assuredly
doesn't come without effort.

>Personally, I think speed-up is the wrong focus. Sooner or later, for
>reasons that seem pretty fundamental to me, we will be back to
>bragging, as Intel once did, about instructions in flight.

God help us, yes :-(


Regards,
Nick Maclaren.
From: Andy 'Krazy' Glew on
On 6/1/2010 11:28 AM, Robert Myers wrote:
> On Jun 1, 10:55 am, Andy 'Krazy' Glew<ag-n...(a)patten-glew.net> wrote:
>
>>
>> I forgot to say that the Kilo-instruction paper speedups I report above make several assumptions that are probably
>> conservative. E.g. 4-wide.
>>
>
> It's a thought-provoking paper.
>
> I really didn't mean to be insulting, just provocative. I wanted you
> to talk more, and you did.
>
> I was thinking of David Dinucci's version of coarse-grained
> parallelism when I read your proposal. In that approach, you snip the
> problem into carefully-chosen pieces ahead of time, carefully specify
> what's coming in and what should go out, and leave it to a run-time
> scheduler to put it all back together. Has some nice formal
> properties. That was my context for trying to understand what you
> were proposing.
>
> I pictured you looking at his proposal and saying, "You want parallel
> lumps? Easy." You take a meat cleaver to a serial instruction
> stream, whack it into bite-sized chunks, and say, "There. What else
> are you looking for?" There had to be some serious potential ugliness
> hiding in what you were proposing. Or maybe the whole idea of trying
> to identify the parallel pieces on the programming end was just wrong-
> headed to begin with. I'm still thinking.
>
> Thinking about how much you might throw into the pot at once and what
> you might do with it once it's there is a useful exercise, whether it
> leads to a commercially-viable architecture or not.

Although I have been having supper once a month with David Dinucci, ever since the "Portland Geek Dinner/Friends of
Eugene Miya" started, I have not looked much at his work. Must do so.

Actually, I have spent much of the last twenty years trying to identify the parallel pieces of the program. SpMT is
part of that - restricting the pieces to start/stop at function boundaries (or loops, or IF/ENDIF). But while such
control flow techniques work pretty well, I have not yet figured out how to build hardware to automatically analyze the
data dependencies - or at least hardware that does such analysis in a way that is significantly cheaper than executing
the code itself. In my terminology, I say that such analysis is accomplished a posteriori, after forking the candidate
parallel threads, rather than a priori, beforehand.
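
Very roughly, the a posteriori check amounts to something like the following (a software caricature, not the real
hardware, and every name in it is invented): fork the candidate piece against a snapshot, record what it reads, and
squash it if the older, non-speculative code turns out to have written any of those locations.

  # Caricature of "fork first, check dependences afterwards".
  def speculate(piece, memory):
      """Run a candidate piece against a snapshot, recording its read set."""
      snapshot, reads = dict(memory), set()
      def load(addr):
          reads.add(addr)
          return snapshot[addr]
      result = piece(load)            # the piece sees only the snapshot
      return result, reads

  def commit_or_squash(result, reads, older_writes):
      """Commit unless the non-speculative code wrote something we read."""
      if reads & older_writes:        # true dependence discovered a posteriori
          return None                 # squash; re-execute sequentially
      return result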

For this reason I am very, very happy if a programmer or compiler is willing to do that analysis for me, even if just
in the form of hints.

Now, by the way, I am reasonably happy with approaches that do such compiler-style analysis in "hardware" - or, rather,
in firmware or microcode. The log-based mechanisms that I am so fond of even provide such firmware with much more data
than it would have when statically compiling, and more than most JIT code generators have access to. You can execute
the code once or a few times, record the log, and analyze the log to see where the dependences occur and where they do
not. Such an approach is at least as good as a compiler or programmer could be - since we would always give them HINT
instructions to control the fork mechanism themselves - and potentially better, since the firmware has more information
to analyze than a compiler has.
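
For concreteness, the log analysis can be as simple as the following (a sketch with a made-up record format, not
anything taken from the actual log-based work): walk the recorded trace and note which candidate chunk boundaries are
crossed by a write-then-read pair.

  # Toy analysis of an execution log.  The (pc, chunk, op, addr) record
  # format is invented for this illustration; a real log would be richer.
  def cross_chunk_dependences(trace):
      last_writer = {}                # addr -> chunk that last wrote it
      deps = set()
      for pc, chunk, op, addr in trace:
          if op == "R" and addr in last_writer and last_writer[addr] != chunk:
              deps.add((last_writer[addr], chunk))   # forking here needs sync
          elif op == "W":
              last_writer[addr] = chunk
      return deps

  # cross_chunk_dependences([(0x10, 0, "W", 0xbeef), (0x40, 1, "R", 0xbeef)])
  # -> {(0, 1)}: chunk 1 reads what chunk 0 wrote, so do not fork between them.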

But it has been my observation that a priori predictors always do better than a posteriori ones. The cost of making those
few initial wrong guesses can dominate, especially for code, such as transaction processing, that tends to thrash the
I-cache.

--

You are correct in observing that my batch scheduling across processor/cluster boundaries is a retreat from such
analysis. But, I emphasize: "batch" is just a technique for creating a larger contiguous instruction window. Its
strength lies in its simplicity. It does not have great upside, and it isn't supposed to.
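
In the same spirit of simplicity, the whole idea fits in a few lines (a software paraphrase, not the actual hardware
design): deal consecutive fixed-size slices of the one sequential stream to the clusters round-robin, so that their
modest individual windows add up to one much larger contiguous window, with retirement still in original program order.

  # The "batch" idea in miniature.  Each cluster needs only a modest
  # window; together the batches cover a much larger contiguous span
  # of the sequential instruction stream.
  def assign_batches(instructions, n_clusters, batch_size):
      schedule = {c: [] for c in range(n_clusters)}
      for i in range(0, len(instructions), batch_size):
          cluster = (i // batch_size) % n_clusters
          schedule[cluster].append(instructions[i:i + batch_size])
      return schedule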

I am just assembling a set of tools or components for a next generation out-of-order machine:

* Multicluster Multithreading
* => cheap thread forking and migration between clusters and processors
* Batch scheduling across cluster boundaries to use those clusters for single thread
* SpMT ... ditto ...
* Log based

All, or nearly all, of these mechanisms are also useful for parallel programming.