From: Chris Gray on
"nedbrek" <nedbrek(a)yahoo.com> writes:

> I'm particularly interested in parallel linking.

I wrote the original Myrias linker/librarian. One of my goals was to learn
how to use the Myrias "pardo" model on that kind of application, so the
linker was a parallel application. Normally it ran on a workstation, where
the pardo loops serialized. We ran it on the actual hardware as a bit of a
test, and it was basically I/O bound.

One of the tools it needed, which I may have described here before, was a
pair of new system I/O calls we added. In particular, the linker needed
"seekread", an atomic seek/read combination. It also needed "tellwrite", an
atomic write to the current end of the file that returned the file position
at which the write occurred. Those calls allowed parallel I/O without
confusion.
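
Roughly, those calls behaved like this (the signatures here are
reconstructed from memory, so treat them as a sketch; seekread is
essentially what POSIX later standardized as pread(), and tellwrite is an
append that also tells you where the data landed):

    #include <sys/types.h>   /* ssize_t, size_t, off_t */

    /* Atomic seek+read: two parallel tasks can't interleave their
       seek and read on a shared descriptor. */
    ssize_t seekread(int fd, void *buf, size_t nbytes, off_t offset);

    /* Atomic append at the current end of the file; returns the
       offset at which the write actually occurred. */
    off_t tellwrite(int fd, const void *buf, size_t nbytes);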

Those, in combination with a new object file format (officially known as
SCOFF - Super Computer Object File Format, but which was really "Stuart and
Chris's Object File Format") allowed me to write the linker. The object file
format started with a magic number, and then a pointer to the directory. The
directory was at the end of the file, written after all of the
code/data/whatever sections had been written out in parallel using tellwrite.
I *think* the file format also represented each function separately, so that
functions could be linked as separate entities. That problem has always
bugged me about things like ELF, which presumably were based on the way in
which the original C compiler translated each source file into a single large
assembly file which was then assembled as one indivisible blob. I think Tera
was one of several projects which worked around that.
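
To make the SCOFF layout concrete, here is roughly what the header and
directory might have looked like (the field names and widths are my guesses,
going only by the description above; the real headers were certainly more
elaborate):

    #include <stdint.h>

    struct scoff_header {           /* at offset 0 */
        uint32_t magic;             /* identifies the file as SCOFF */
        uint64_t directory_offset;  /* the directory sits at the END of
                                       the file, written last, after the
                                       sections have gone out in parallel
                                       via tellwrite */
    };

    struct scoff_dir_entry {        /* one per function or data section */
        uint64_t offset;            /* where tellwrite put it */
        uint64_t length;
        uint32_t kind;              /* code, data, whatever */
        uint32_t name;              /* index into a string table */
    };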

I don't have the code here, and it was a long time ago, so I don't remember
much in the way of how it worked internally, but I believe there were 2 or
3 pardos in the code. After I had finished with it, the new compiler group
decided to break the linker/librarian into two separate programs - I don't
recall why. So, even if I could get hold of the latest version of it, I
likely wouldn't be very familiar with the code.

Basically, the parallelism was over the input files, and then over the
functions and data sections within them. I believe there was an outer loop to
iterate over what was learned about unresolved symbols (you don't want the
parallel tasks to grab those things themselves, else you can end up with many
copies of them). We did not use the Myrias memory semantics - the child tasks
simply allocated what they needed, possibly reading code/data into it, and
that memory was then given to the main task, as the result of the child's
work.
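
In outline, the control flow was something like the skeleton below, as best
I can reconstruct it. None of this is the real code; the loops marked
"pardo" were Myrias parallel loops, shown here as plain for loops, and the
types are stand-ins:

    #include <stddef.h>

    struct section { long offset, length; }; /* one function or data blob */
    struct infile  { size_t n_sections; struct section *sections; };

    static void link_files(struct infile *files, size_t n_files)
    {
        int more_unresolved = 1;
        while (more_unresolved) {      /* iterate over what was learned
                                          about unresolved symbols */
            for (size_t f = 0; f < n_files; f++) {             /* pardo */
                for (size_t s = 0; s < files[f].n_sections; s++) {
                    /* pardo: child task allocates a private buffer,
                       seekreads the section into it, and notes the
                       symbols it defines and the ones it still needs;
                       the buffer is handed back to the main task as
                       the child's result */
                }
            }
            /* main task, serially: merge the symbol information and
               decide which library members satisfy the remaining
               unresolved symbols, so parallel tasks don't each pull
               in a copy */
            more_unresolved = 0;  /* placeholder: really, loop until no
                                     new unresolved symbols turn up */
        }
    }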

--
Experience should guide us, not rule us.

Chris Gray cg(a)GraySage.COM
From: Paul Wallich on
nedbrek wrote:
> Hello all,
>
> <nmm1(a)cam.ac.uk> wrote in message
> news:hre2p7$3nf$1(a)smaug.linux.pwf.cam.ac.uk...
>> That being said, MOST of the problem IS only that people are very
>> reluctant to change. We could parallelise ten or a hundred times
>> as many tasks as we do before we hit the really intractable cases.
>
> I'm curious what sort of problems these are? My day-to-day tasks are:
> 1) Compiling (parallel)
> 2) Linking (serial)
> 3) Running a Tcl interpreter (serial)
> 4) Simulating microarchitectures (serial, but I might be able to run
> multiple simulations at once, given enough RAM).

I know I'm not well-versed here, but isn't simulating microarchitectures
at least small-n parallel?
From: MitchAlsup on
On May 3, 11:06 am, Paul Wallich <p...(a)panix.com> wrote:
> I know I'm not well-versed here, but isn't simulating microarchitectures
> at least small-n parallel?

Where n is at least pipe-length and might be at least as big as
pipe-length*SuperScalarity + cache-hierarchy + memory-system

That is, n approaches 32-64 easily blocked-off units of work.
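(For instance, a 10-deep pipe at 4-wide is already 40 units before you
count the cache levels and the memory system.)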

Mitch
From: nedbrek on
Hello all,

"MitchAlsup" <MitchAlsup(a)aol.com> wrote in message
news:db84f8dd-54e1-4a9a-9f8c-29768da1e9be(a)d19g2000yqf.googlegroups.com...
> On May 3, 11:06 am, Paul Wallich <p...(a)panix.com> wrote:
>> I know I'm not well-versed here, but isn't simulating microarchitectures
>> at least small-n parallel?
>
> Where n is at least pipe-length and might be at least as big as
> pipe-length*SuperScalarity + cache-hierarchy + memory-system
>
> That is, n approaches 32-64 easily blocked-off units of work.

Potentially. The question is how much additional complexity it costs.
Accuracy already costs complexity; you don't want to pay still more just
for performance, at the expense of exploring ideas or of getting reliable
data.

There is much greater process-level parallelism (traces * configurations).
When I was at Intel, we had ~400 traces, so you had a parallelism of 400
just for a single config. And the configuration space can grow
exponentially (4 cache sizes * 4 cache latencies * 10 ROB sizes * 10
scheduler sizes - not that it has to).
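
(That example grid alone is 4*4*10*10 = 1,600 configurations; crossed with
the ~400 traces, that is 640,000 independent runs you can farm out.)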

Ned