From: mdj on
Michael J. Mahon wrote:

> > Parallelism is the big door, but I think the approaches that need to be
> > explored cover a wider gamut than multiprocess parallelism, which as
> > you point out has considerable latency issues.
>
> And I would say that tools to help with the decomposition of algorithms
> into parallel parts, while minimizing the effects of latency and limited
> bandwidth, are the most important "tool frontier" today.

The main 'commodity' area where large-scale parallelism was going to be
useful was image processing, particularly video. A large degree of
parallelism actually exists today in this field, but rather than
traditional parallel approaches, CPU vendors added SIMD extensions, and
video card vendors built GPUs. Today's GPUs are basically parallel
computing machines done on a single chip. I'd imagine most
parallelisable tasks will utilise these extensions first, and resort to
generic approaches second, as the specialised environments are, in
general, more efficient.
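
To make the SIMD point concrete, here's a rough sketch (a hypothetical
example, assuming an SSE-capable x86 compiler) of four float additions
issued as a single instruction:

    /* Hypothetical illustration of SIMD extensions: four float
       additions performed by one SSE instruction. */
    #include <stdio.h>
    #include <xmmintrin.h>

    int main(void)
    {
        __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
        __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);
        __m128 c = _mm_add_ps(a, b);     /* four adds in parallel */

        float out[4];
        _mm_storeu_ps(out, c);           /* out = {11, 22, 33, 44} */
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
        return 0;
    }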

Of course, you're right that down the track there are a lot more
domains that can be vastly improved by parallelisation, but most of
those remain unexplored.

> >>The popular "thread" model, in which all of memory is conceptually
> >>shared by all threads, is a disaster for real multiprocessors, since
> >>they will *always* have latency and bandwidth issues to move data
> >>between them, and a "single, coherent memory image" is both slow
> >>and wasteful.
> >
> >
> > It is however an extremely efficient form of multiprocessing for
> > applications with modest horizontal scaling potential.
>
> And it offers unprecedented potential for data races and
> nondeterministic behavior! ;-)
>
> The thread model should have fundamentally segregated memory, so
> that inter-thread references require special coordination commensurate
> with their special risks and costs.

The traditional UNIX multiprocessing model using fork(), pipe(),
semaphore() and friends has done exactly this for thirty years.
Originally this approach was expensive, but modern MMUs mean that the
cost of cloning a process is not much higher than creating a thread
(which merely replicates the stack). It exists on single machines, plus
machine clusters that support such techniques as process migration and
other more esoteric parallel computing ideas - see Beowulf clusters.
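
For anyone who hasn't seen it, the model looks roughly like this (a
minimal toy sketch; error handling elided):

    /* Parent and child have separate address spaces and communicate
       only through an explicit channel. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void)
    {
        int fd[2];
        pipe(fd);              /* fd[0] = read end, fd[1] = write end */

        if (fork() == 0) {     /* child: its own copy-on-write image */
            close(fd[0]);
            const char *msg = "result from child";
            write(fd[1], msg, strlen(msg) + 1);
            close(fd[1]);
            _exit(0);
        }

        close(fd[1]);          /* parent: read what the child sent */
        char buf[64];
        if (read(fd[0], buf, sizeof buf) > 0)
            printf("parent received: %s\n", buf);
        close(fd[0]);
        wait(NULL);
        return 0;
    }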

Since System V, this model has also included a message passing
interface that can be, and is, extended across cluster nodes.
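
A toy sketch of that interface (one private queue, one message; error
handling elided):

    #include <stdio.h>
    #include <string.h>
    #include <sys/ipc.h>
    #include <sys/msg.h>

    struct mymsg { long mtype; char mtext[64]; };

    int main(void)
    {
        int q = msgget(IPC_PRIVATE, IPC_CREAT | 0600);

        struct mymsg m = { .mtype = 1 };
        strcpy(m.mtext, "hello via System V IPC");
        msgsnd(q, &m, sizeof m.mtext, 0);     /* size excludes mtype */

        struct mymsg r;
        msgrcv(q, &r, sizeof r.mtext, 1, 0);  /* fetch a type-1 message */
        printf("received: %s\n", r.mtext);

        msgctl(q, IPC_RMID, NULL);            /* remove the queue */
        return 0;
    }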

The reasons for adding the thread model as well mainly concern
intra-process parallelism rather than inter-process. The traditional
I/O models are synchronous, but once you introduce asynchronous I/O
models you need threads to allow your program to do something useful
instead of waiting for I/O. In such scenarios, there's no reason at all
to introduce the overhead of IPC, as there simply isn't any IPC to be
done.
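
A minimal sketch of that intra-process case (hypothetical example using
POSIX threads):

    /* One thread blocks on I/O while main keeps computing; no IPC is
       needed because both see the same memory. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static char shared_buf[4096];   /* directly visible to both threads */
    static ssize_t shared_len;

    static void *reader(void *arg)
    {
        (void)arg;
        shared_len = read(STDIN_FILENO, shared_buf, sizeof shared_buf);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, reader, NULL);

        /* ... useful work happens here while the reader blocks ... */

        pthread_join(t, NULL);     /* join is the synchronisation point */
        printf("read %zd bytes\n", shared_len);
        return 0;
    }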

Additionally, the multiprocess model actually shares the very same
issues the thread model does - just because you've copied your data
into another address space doesn't alleviate the issue of data being
modified erroneously. The only real difference is that in the 'thread'
model, you have the opportunity to exploit far more efficient means of
synchronisation, which don't require marshalling data and copying it
across address spaces. It's better to think of it as the same, only
faster.
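
The 'faster' part is simply that coordination happens in place on
shared memory, along these lines (hypothetical fragment):

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static long shared_total;       /* shared by every thread */

    void add_result(long partial)
    {
        pthread_mutex_lock(&lock);  /* the "special coordination" step */
        shared_total += partial;
        pthread_mutex_unlock(&lock);
    }

No marshalling, no copying across address spaces - just a lock around
the shared data.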

That said, the multiprocessing approach offers a certain degree of
robustness in the case of an errant child crashing the program. Of
course, these issues only really affect programs written in languages
that allow arbitrary pointer arithmetic.

The problem isn't really one of this model or that model being wrong;
it's better to think of threads as just a way to exploit higher
efficiency when locality is closer.

Unfortunately, the other popular operating system only supports the
thread model, and, unfortunately again, its semantics differ from the
POSIX model.

What's really needed is languages that support parallel processing
constructs. It's not really fair to blame a parallelisation technique
for quirks that are really limitations of the tools we use.

> > There's essentially 3 basic models for parallelism that must be
> > exploited:
> >
> > Multithread - in which one processor core can execute multiple threads
> > simultaneously
>
> This is the only case that can even approximate "uniform memory", since
> at least most of the cache hierarchy will be common to all threads.

I guess you mean determinism with regard to execution timing? This
isn't really useful on any modern architecture; you must use enforced
synchronisation and real-time scheduling to achieve this.

> > Uniform Memory Multiprocessor - in which many processor cores share
> > the same physical memory subsystem. Note that this is further divided
> > into multiple cores in the same package, plus other cores in different
> > packages, which have very different latency properties.
>
> Even within one package, only lower cache levels will be common, so
> this is not fundamentally different from your next case...

Some implementations support high speed transports of L1 cache data
both to on-die cores, and inter-package cores. There's still a lot of
variance in the available implementations.

> > Non Uniform Memory Multiprocessor - In this case the latency can vary
> > wildly depending on the system configuration.
> >
> > Modern multiprocessor servers employ all three approaches, both on the
> > same system board, plus via high speed interconnects that join multiple
> > system boards together. OS's must weight the 'distance' to another CPU
> > when considering a potential execution unit for a process.
>
> All of your cases are actually the same, differing only in the level
> of memory hierarchy (and its corresponding latency and bandwidth) that
> is shared.

That's right. I drew a distinction because at present not all
multicore systems are implemented in the ideal way. The
current-generation Intel Xeon dual cores, for instance, only benefit
you on power consumption, and are actually of equivalent locality to
different processor packages. Eventually the idealised designs will
surface though and you'll have general localities of die, board,
system, and network. For the moment it's somewhat more complicated.

> Any practical system will consist of all levels of connectivity, with
> sharing at virtually all the different levels of the memory hierarchy.
> And I would add another set of levels, in which there is no "memory
> consistency" model, but message passing is the sharing mechanism.
> This extends the multiprocessing model across networks.

A brand new PC these days will get you two cores. There are different
scales of systems. The big ones, for sure, will exhibit all 4 levels,
but there's plenty of work to be done just to handle the opportunities
that exist in a new laptop.

> > What's slow and wasteful depends a great deal on the task at hand.
> > Multithreading used to be just as expensive as multiprocessing. But
> > consider a current generation CPU designed for low power, high
> > concurrency, the UltraSPARC T1.
> >
> > These units have execution cores capable of running 4 concurrent threads.
> > In the highest end configuration, there are 8 of these execution cores
> > per physical processor. The cores have a 3.2GB/s interconnect. Each
> > physical processor has 4 independent memory controllers, so you have
> > non-uniform memory access on the one die.
>
> Exactly. The general case is becoming the common case.

The thing to keep in mind is that there's two cores per memory
controller, so ideally, if you need to migrate a process, you'll
migrate it to its 'twin' core first if it's available before moving it
to another memory controller. You stay on package in preference to
moving to another package. There are a lot of variables involved in
picking the next most appropriate locality for a process.
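
A rough sketch of that preference ordering (hypothetical data
structures; a real scheduler weighs far more than this):

    enum locality { SAME_CORE, TWIN_CORE, SAME_PACKAGE, OTHER_PACKAGE,
                    NLEVELS };

    struct cpu { int id; enum locality distance; int idle; };

    /* Pick the idle CPU with the smallest migration "distance". */
    int pick_cpu(const struct cpu *cpus, int ncpus)
    {
        for (int lvl = SAME_CORE; lvl < NLEVELS; lvl++)
            for (int i = 0; i < ncpus; i++)
                if (cpus[i].idle && cpus[i].distance == (enum locality)lvl)
                    return cpus[i].id;
        return -1;              /* nothing idle at any level: stay put */
    }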

> And multi-threaded processors are actually a very old idea. The
> Honeywell 800 supported 8 "threads" (not called that, of course),
> by executing instructions in "rotation", skipping slots that were
> waiting for I/O to complete. At the time, it was considered to be
> a hardware implementation of multiprogramming.

Indeed. There are very few new ideas in Computing these days.

> Today, multithreaded processors do much the same, but the "I/O wait"
> has been replaced by the "cache miss".
>
> The peripheral processor of the CDC 6600 was another salient example
> of multi-threading. It was implemented in the same fast logic as
> the central processor, but presented the appearance of 10 separate
> PPs, each executing instructions at a 10th the rate of the central
> processor. This had the effect of matching its instruction rate to
> the latency of memory, and provided 10-fold concurrency for managing
> I/O and memory transfers.
>
> > Peak power consumption for this part is 79W at 1GHz. Considering you
> > can in theory run 32 threads simultaneously, that's pretty impressive.
> > How well you can exploit it depends on your application. An 'old
> > school' web server for instance, can only get 8 way parallelism on this
> > chip. A new school web server written in Java, can get 32 way, assuming
> > at any given time there are at least 32 concurrent requests for the same
> > dynamic page, or 32 static requests.
> >
> > It's getting to the stage where the power consumed by driving I/O over
> > a pin on an IC package is significant, so expect to see systems like
> > this grow in popularity.
>
> This was always an inevitable result of higher levels of integration.
> As soon as a significant amount of cache can be shared on the chip,
> it becomes advantageous to adorn it with multiple processors.

Just be thankful you aren't the author of the process scheduler of a
current operating system :-)

> > Interesting, you can download a VHDL description of this part from Sun,
> > and synthesise it on one of the higher end FPGA's. Oh how I wish I had
> > access to hardware like that!
> >
> > A top of the range Sun server uses parts that have 4 execution threads
> > per core, four cores per board, each with its own memory
> > controller+memory, and up to 18 boards per system (coupled together by
> > a 9GB/s crossbar switch). Exploiting all the resources in this system
> > and doing it efficiently is *hard*, as it employs every different style
> > of parallelism I mentioned before within the same 'machine'.
> >
> > And I haven't even considered computing clusters!
>
> Exactly. And the full hierarchy of latency and bandwidth needs to be
> addressed by both measurement tools and by behavioral models for code
> partitioning and optimization. *This* is the tools frontier that I see,
> with huge potential payoffs.

It's very difficult to construct these tools when the languages we use
exploit parallelism in such low level ways. What's needed is the
ability to abstract away the issues of locality, focus on the core
issues involved, and then instrument systems accordingly.

This is similar to the compiler you mentioned a while back that used
runtime instrumentation of its p-code engine to determine critical
sections then tune them. Although modern VM's apply this technique
today, that's about as far as it's got.

One of the reasons I still like Solaris is that it's the only common OS
that supports the concept of 'microstate' accounting. This essentially
means that you can log the transitions of a process through various
known 'microstates'. So rather than knowing just that a process is
blocked on a resource, you can find out which resources, why, for how
long, etc. This is integrated back into the development toolchain all
the way through to the Java-level profiler, so you can visualise
exactly what different threads are doing and spending time on, and
easily spot bottlenecks, which in many cases can be tuned out at either
the OS or the VM level. Very fancy stuff.
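
From memory, reading that data on Solaris looks something like this
(the structure and field names are best-effort assumptions - check
<procfs.h> before relying on them; error handling elided):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <procfs.h>

    int main(void)
    {
        prusage_t u;
        int fd = open("/proc/self/usage", O_RDONLY);
        read(fd, &u, sizeof u);
        close(fd);

        printf("user time  %lds\n", (long)u.pr_utime.tv_sec);
        printf("sleep time %lds\n", (long)u.pr_slptime.tv_sec);
        printf("lock wait  %lds\n", (long)u.pr_ltime.tv_sec);
        return 0;
    }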

> > The way it's panning out is that real multiprocessors are a disaster
> > for parallelism. The problem is that essentially any task that can be
> > parallelised needs to process the same data that it does in serial
> > form. Because of this, you can utilise the same buses, I/O subsystems,
> > and take advantage of 'nearness' to allow some pretty incredible IPC
> > speeds.
>
> No, that problem corresponds to a *very poor* partitioning of the
> problem onto parallel processors--ironically, one that is encouraged
> by current languages' simplistic "thread" models of parallel computing.

In truly partitioned systems, how do you propose to solve the issue of
providing access to the source data for all partitions? Shipping the
data via slower I/O trunks is generally less efficient than processing
it all on systems with greater locality.

It works fine for compute-bound loads, but data-bound loads need
different configurations, at least for currently available hardware.

It's worth noting though that many modern languages have absolutely
awful parallelism management. Java, in fact, is one of the worst in this
regard, requiring the developer to work around a very fundamental
design mistake in many cases. Luckily this issue is well understood.
Unfortunately there's not a lot of languages out there that are
markedly better.

Fortunately, there are a lot of very nice frameworks around that both
mitigate the known issues and provide excellent implementations of
known resource sharing idioms.

> Let me give a little example.
>
> Maximum efficiency of resource utilization is obtained by "pooling"
> all of a particular resource together so that all requestors obtain
> it by withdrawing from one pool. Then, you're not "out" of that
> resource until you are *really* out of it.
>
> But this creates a huge point of serial contention, since all
> requestors must lock the pool, allocate some resource, then unlock
> the pool. It is as if a large cafeteria put one giant salt shaker
> in the middle of the room for all to share.
>
> An alternative resource allocation scheme which is well adapted to
> multiple concurrent users and a hierarchy of latencies is to provide
> multiple local pools, shared by a small number of users at essentially
> the same level of connection latency. This is like the more common
> case of putting a small salt shaker within arms reach of each small
> group of diners.

It's worth noting, though, that having multiple resource pools
introduces its own problems. From the diner's perspective, it's most
efficient if each of them is allocated a salt shaker. However, this
introduces the problem of managing all the salt shakers, imposing an
unreasonable burden on restaurant staff :-) The challenge is to find
the balance between the two that provides adequate parallelism while
effectively managing shared resources. Hence we have a multitude of
approaches depending on the specifics of the problem domain.
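
A hypothetical sketch of that middle ground (using the GCC-style
__thread extension): each thread keeps a small local pool - its own
'salt shaker' - and takes the global lock only to refill it.

    #include <pthread.h>

    #define LOCAL_BATCH 16

    static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;
    static long global_pool = 1000000;   /* made-up total resource */

    static __thread long local_pool;     /* per-thread stash */

    int take_one(void)
    {
        if (local_pool == 0) {           /* shaker empty: refill */
            pthread_mutex_lock(&global_lock);
            long grab = global_pool < LOCAL_BATCH ? global_pool
                                                  : LOCAL_BATCH;
            global_pool -= grab;
            pthread_mutex_unlock(&global_lock);
            if (grab == 0)
                return -1;               /* *really* out of resource */
            local_pool = grab;
        }
        local_pool--;
        return 0;
    }

The batch size is the tuning knob: bigger batches mean less contention,
but more resource stranded in idle shakers.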

It's probably also worth pointing out that more than once, I've
visited a nearby table to 'borrow' a condiment, as the one at my table
was exhausted. The resource sharing introduces a latency, which
fortunately was easily resolved by the locality of similar resources.

In a very busy place, you can easily find that even neighbouring tables
have exhausted their resources, resulting in a somewhat ad-hoc 'hunt'
for nearby sources, before resorting to approaching a waiter, or
counter in order to obtain more. If this continues too long you can end
up with a gridlock situation, with an already busy staff trying
desperately to fill many tiny bottles and then distribute them before
the system returns to any sense of order.

A parallel system is more like a restaurant with a queue of people out
the front waiting.

In many cases, the partitioning of the problem isn't the hard part;
it's formulating strategies to keep each partition operating smoothly.
Only then are resources being consumed effectively.

> Of course, there is still the issue of resource balancing (when the
> resource is really uniform--not like memory), and this can be done
> by periodically re-balancing the amounts of resource in the local
> pools, and across hierarchical levels if necessary.

A technique that is always a lot harder to apply than it seems.

> This is the kind of thinking that must go into the next generation
> of systems, and it is very different from the thinking that has
> inspired the systems and tools of today.

Agreed - although don't overlook the fact that there are plenty of
systems out there that have faced a lot of these issues already.

> > Multithreading approaches are very important on these systems. In fact,
> > multithreading is important even on systems with single execution
> > units. The gap between I/O throughput and processing throughput means
> > you get a certain degree of 'parallelism' even though you can only run
> > one thread at a time. Free performance improvement if you employ
> > parallel design techniques.
> >
> > Of course, there are certain heavily compute-bound applications where
> > the degree of IPC is very low, and massive parallelism is possible
> > regardless of the interconnect used, as IPC constitutes a relatively
> > small part of the workload. For the rest of the cases though where lots
> > of data is being consumed, systems that allow low-overhead IPC through
> > multithreading are the way to go.
>
> And a tiny fraction of todays tools and designers are even out of
> kindergarten on issues of partitioning and locality.

Completely agree with this point. I'm continually surprised by the
number of people I encounter who do not understand the dining
philosophers problem, and that's perhaps the easiest one!

> Object orientation is almost totally orthogonal, if not antithetical,
> to the *real* problems of highly parallel computing, which is the
> platform of the future. I expect we'll figure this out sometime
> in the next decade. ;-(

Almost, other than the additional abstraction and encapsulation
required by both concepts. Current languages don't really provide
adequate abstractions for parallel computing primitives. Some of them
can be modelled effectively using OO techniques; some of them can't.

Matt

From: Michael J. Mahon on
mdj wrote:
> Michael J. Mahon wrote:
>
>
>>>Parallelism is the big door, but I think the approaches that need to be
>>>explored cover a wider gamut than multiprocess parallelism, which as
>>>you point out has considerable latency issues.
>>
>>And I would say that tools to help with the decomposition of algorithms
>>into parallel parts, while minimizing the effects of latency and limited
>>bandwidth, are the most important "tool frontier" today.
>
>
> The main 'commodity' area where large-scale parallelism was going to be
> useful was image processing, particularly video. A large degree of
> parallelism actually exists today in this field, but rather than
> traditional parallel approaches, CPU vendors added SIMD extensions, and
> video card vendors built GPUs. Today's GPUs are basically parallel
> computing machines done on a single chip. I'd imagine most
> parallelisable tasks will utilise these extensions first, and resort to
> generic approaches second, as the specialised environments are, in
> general, more efficient.
>
> Of course, you're right that down the track there are a lot more
> domains that can be vastly improved by parallelisation, but most of
> those remain unexplored.

Video transcoding is an "embarrassingly parallel" application, yet
I have not found a single popular software manufacturer that makes
a parallel transcoder. With all the multicore chips now, they are
leaving factors of performance on the table.

Apparently, the market for video encoders does not evaluate them
on performance--astonishingly. Some of the most popular perform
terribly.

>>>>The popular "thread" model, in which all of memory is conceptually
>>>>shared by all threads, is a disaster for real multiprocessors, since
>>>>they will *always* have latency and bandwidth issues to move data
>>>>between them, and a "single, coherent memory image" is both slow
>>>>and wasteful.
>>>
>>>
>>>It is however an extremely efficient form of multiprocessing for
>>>applications with modest horizontal scaling potential.
>>
>>And it offers unprecedented potential for data races and
>>nondeterministic behavior! ;-)
>>
>>The thread model should have fundamentally segregated memory, so
>>that inter-thread references require special coordination commensurate
>>with their special risks and costs.
>
>
> The traditional UNIX multiprocessing model using fork(), pipe(),
> semaphore() and friends has done exactly this for thirty years.
> Originally this approach was expensive, but modern MMUs mean that the
> cost of cloning a process is not much higher than creating a thread
> (which merely replicates the stack). It exists on single machines, plus
> machine clusters that support such techniques as process migration and
> other more esoteric parallel computing ideas - see Beowulf clusters.

The actual situation is a full continuum of bandwidths and latencies,
and that hierarchy is very poorly supported. Instead, we give designers
the static choice between shared memory and message passing. This is
a bad match to a world in which the system interconnect continuum is
neither very slow nor very fast, and is constantly moving.

> Since System V, this model has also included a message passing
> interface that can be, and is, extended across cluster nodes.
>
> The reasons for adding the thread model as well mainly concern
> intra-process parallelism rather than inter-process. The traditional
> I/O models are synchronous, but once you introduce asynchronous I/O
> models you need threads to allow your program to do something useful
> instead of waiting for I/O. In such scenarios, there's no reason at all
> to introduce the overhead of IPC, as there simply isn't any IPC to be
> done.
>
> Additionally, the multiprocess model actually shares the very same
> issues the thread model does - just because you've copied your data
> into another address space doesn't alleviate the issue of data being
> modified erroneously. The only real difference is that in the 'thread'
> model, you have the opportunity to exploit far more efficient means of
> synchronisation, which don't require marshalling data and copying it
> across address spaces. It's better to think of it as the same, only
> faster.
>
> That said, the multiprocessing approach offers a certain degree of
> robustness in the case of an errant child crashing the program. Of
> course, these issues only really affect programs written in languages
> that allow arbitrary pointer arithmetic.
>
> The problem isn't really one of this model or that model being wrong;
> it's better to think of threads as just a way to exploit higher
> efficiency when locality is closer.
>
> Unfortunately, the other popular operating system only supports the
> thread model, and, unfortunately again, its semantics differ from the
> POSIX model.
>
> What's really needed is languages that support parallel processing
> constructs. It's not really fair to blame a parallelisation technique
> for quirks that are really limitations of the tools we use.

The most important "tools" we are missing are the conceptual tools
to guide systematic and quantitative decomposition of applications into
concurrent "chunks" to be run on varying numbers of processors with
varying amounts and latencies of shared memory and/or message passing.

Until we can get our design heads on straight, it doesn't make much
sense to codify a particular approach in a language design--though it
will as we learn.

>>>There's essentially 3 basic models for parallelism that must be
>>>exploited:
>>>
>>>Multithread - in which one processor core can execute multiple threads
>>>simultaneously
>>
>>This is the only case that can even approximate "uniform memory", since
>>at least most of the cache hierarchy will be common to all threads.
>
>
> I guess you mean determinism with regard to execution timing? This
> isn't really useful on any modern architecture; you must use enforced
> synchronisation and real-time scheduling to achieve this.

When races are involved, or rendezvous with non-determinate order of
arrival, then actual nondeterminism enters the picture. Sometimes this
is planned, and enhances performance--as in the case of an evaluation
tree that takes advantage of the distributive property of the rollup
operation (like arithmetic addition or set addition). Sometimes it is
unplanned, as when an erroneous race condition exists that causes
intermittent bugs or unreproducible results.
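
A small sketch of the planned case (hypothetical POSIX-threads
example): the order in which partial sums arrive is a race, but the
rollup operation is addition, so the total is determinate.

    #include <pthread.h>
    #include <stdio.h>

    #define NWORKERS 4
    #define N 1000000L

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static long total;

    static void *worker(void *arg)
    {
        long id = (long)arg, sum = 0;
        for (long i = id; i < N; i += NWORKERS)
            sum += i;                   /* private partial sum */
        pthread_mutex_lock(&lock);      /* arrival order is a race... */
        total += sum;                   /* ...but the result is not */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NWORKERS];
        for (long i = 0; i < NWORKERS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NWORKERS; i++)
            pthread_join(t[i], NULL);
        printf("%ld\n", total);         /* always N*(N-1)/2 */
        return 0;
    }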

>>>Uniform Memory Multiprocessor - in which many processor cores share
>>>the same physical memory subsystem. Note that this is further divided
>>>into multiple cores in the same package, plus other cores in different
>>>packages, which have very different latency properties.
>>
>>Even within one package, only lower cache levels will be common, so
>>this is not fundamentally different from your next case...
>
>
> Some implementations support high speed transports of L1 cache data
> both to on-die cores, and inter-package cores. There's still a lot of
> variance in the available implementations.

Yes, that's my point. The *hierarchy* needs to be the thing that is
optimized for, not some ideal simplified case.

The only way to get a uniform memory model is to slow down all the
memory accesses so that the ones that could go faster run at the same
speed as the ones that can't run any faster--clearly simpler, but *way*
non-optimal!

>>>Non Uniform Memory Multiprocessor - In this case the latency can vary
>>>wildly depending on the system configuration.
>>>
>>>Modern multiprocessor servers employ all three approaches, both on the
>>>same system board, plus via high speed interconnects that join multiple
>>>system boards together. OS's must weight the 'distance' to another CPU
>>>when considering a potential execution unit for a process.
>>
>>All of your cases are actually the same, differing only in the level
>>of memory hierarchy (and its corresponding latency and bandwidth) that
>>is shared.
>
>
> That's right. I drew a distinction because at present not all
> multicore systems are implemented in the ideal way. The
> current-generation Intel Xeon dual cores, for instance, only benefit
> you on power consumption, and are actually of equivalent locality to
> different processor packages. Eventually the idealised designs will
> surface though and you'll have general localities of die, board,
> system, and network. For the moment it's somewhat more complicated.

There is no "ideal" way--there are only real ways, and none of them
have simple behavior if they're any good!

>>Any practical system will consist of all levels of connectivity, with
>>sharing at virtually all the different levels of the memory hierarchy.
>>And I would add another set of levels, in which there is no "memory
>>consistency" model, but message passing is the sharing mechanism.
>>This extends the multiprocessing model across networks.
>
>
> A brand new PC these days will get you two cores. There are different
> scales of systems. The big ones, for sure, will exhibit all 4 levels,
> but there's plenty of work to be done just to handle the opportunities
> that exist in a new laptop.

Agreed--and the tools for *optimizing* applications across even two
processors are almost nonexistent. A suboptimal data partitioning
can cause a large factor of performance degradation, and designers
have virtually no tools to understand and adjust this.

>>>What's slow and wasteful depends a great deal on the task at hand.
>>>Multithreading used to be just as expensive as multiprocessing. But
>>>consider a current generation CPU designed for low power, high
>>>concurrency, the UltraSPARC T1.
>>>
>>>These units have execution cores capable of running 4 concurrent threads.
>>>In the highest end configuration, there are 8 of these execution cores
>>>per physical processor. The cores have a 3.2GB/s interconnect. Each
>>>physical processor has 4 independent memory controllers, so you have
>>>non-uniform memory access on the one die.
>>
>>Exactly. The general case is becoming the common case.
>
>
> The thing to keep in mind is that there's two cores per memory
> controller, so ideally, if you need to migrate a process, you'll
> migrate it to its 'twin' core first if it's available before moving it
> to another memory controller. You stay on package in preference to
> moving to another package. There are a lot of variables involved in
> picking the next most appropriate locality for a process.

Now you see what I mean. OSs should automatically manage this
"dynamic affinity" or "memory reference locality", with input and
assistance from compile-time and run-time analysis.

And the designer should be able to visualize the range of behaviors
caused by various design choices relative to parallel execution.

The absence of these tools is a major obstacle to the design and
implementation of efficient parallel applications, and this is a
*much* more fruitful area for investment than playing with language
constructs.

>>And multi-threaded processors are actually a very old idea. The
>>Honeywell 800 supported 8 "threads" (not called that, of course),
>>by executing instructions in "rotation", skipping slots that were
>>waiting for I/O to complete. At the time, it was considered to be
>>a hardware implementation of multiprogramming.
>
>
> Indeed. There are very few new ideas in Computing these days.

Sad but true.

>>Today, multithreaded processors do much the same, but the "I/O wait"
>>has been replaced by the "cache miss".
>>
>>The peripheral processor of the CDC 6600 was another salient example
>>of multi-threading. It was implemented in the same fast logic as
>>the central processor, but presented the appearance of 10 separate
>>PPs, each executing instructions at a 10th the rate of the central
>>processor. This had the effect of matching its instruction rate to
>>the latency of memory, and provided 10-fold concurrency for managing
>>I/O and memory transfers.
>>
>>
>>>Peak power consumption for this part is 79W at 1GHz. Considering you
>>>can in theory run 32 threads simultaneously, that's pretty impressive.
>>>How well you can exploit it depends on your application. An 'old
>>>school' web server for instance, can only get 8 way parallelism on this
>>>chip. A new school web server written in Java, can get 32 way, assuming
>>>at any given time there are at least 32 concurrent requests for the same
>>>dynamic page, or 32 static requests.
>>>
>>>It's getting to the stage where the power consumed by driving I/O over
>>>a pin on an IC package is significant, so expect to see systems like
>>>this grow in popularity.
>>
>>This was always an inevitable result of higher levels of integration.
>>As soon as a significant amount of cache can be shared on the chip,
>>it becomes advantageous to adorn it with multiple processors.
>
>
> Just be thankful you aren't the author of the process scheduler of a
> current operating system :-)

Actually, that's one of the most important areas to do *much* better
and more adaptively than is done in current OSs.

The "distributed process allocation" problem in the OS is as fundamental
as the "code selection" problem in a compiler, and will have much larger
performance impact as the level of *system* concurrency continues to
increase.

I would even go so far as to say that the current *lack* of good
solutions to this problem is a major reason for the limited utility
of multicore systems. In a couple of years, we could all have 8-way
systems on our desktops, but if the fundamental enablers for parallel
apps don't get done, they won't perform much better than single-core
systems (as is true today for dual-core/single-core).

>>>Interesting, you can download a VHDL description of this part from Sun,
>>>and synthesise it on one of the higher end FPGA's. Oh how I wish I had
>>>access to hardware like that!
>>>
>>>A top of the range Sun server uses parts that have 4 execution threads
>>>per core, four cores per board, each with its own memory
>>>controller+memory, and up to 18 boards per system (coupled together by
>>>a 9GB/s crossbar switch). Exploiting all the resources in this system
>>>and doing it efficiently is *hard*, as it employs every different style
>>>of parallelism I mentioned before within the same 'machine'.
>>>
>>>And I haven't even considered computing clusters!
>>
>>Exactly. And the full hierarchy of latency and bandwidth needs to be
>>addressed by both measurement tools and by behavioral models for code
>>partitioning and optimization. *This* is the tools frontier that I see,
>>with huge potential payoffs.
>
>
> It's very difficult to construct these tools when the languages we use
> exploit parallelism in such low level ways. What's needed is the
> ability to abstract away the issues of locality, focus on the core
> issues involved, and then instrument systems accordingly.

Exactly--which is why I say this is the real "frontier" today.

> This is similar to the compiler you mentioned a while back that used
> runtime instrumentation of its p-code engine to determine critical
> sections then tune them. Although modern VM's apply this technique
> today, that's about as far as it's got.

It doesn't need to be a non-existent machine--most machines have timer
interrupts that permit background profiling and dynamic code changes.
All of these tricks can be (and have been) done on actual machine code.

> One of the reasons I still like Solaris is that it's the only common OS
> that supports the concept of 'microstate' accounting. This essentially
> means that you can log the transitions of a process through various
> known 'microstates'. So rather than knowing just that a process is
> blocked on a resource, you can find out which resources, why, for how
> long, etc. This is integrated back into the development toolchain all
> the way through to the Java-level profiler, so you can visualise
> exactly what different threads are doing and spending time on, and
> easily spot bottlenecks, which in many cases can be tuned out at either
> the OS or the VM level. Very fancy stuff.

And not too difficult or intrusive if it is based on sampling
information at interrupt time (which automatically bases sample
density on performance impact).

>>>The way it's panning out is that real multiprocessors are a disaster
>>>for parallelism. The problem is that essentially any task that can be
>>>parallelised needs to process the same data that it does in serial
>>>form. Because of this, you can utilise the same buses, I/O subsystems,
>>>and take advantage of 'nearness' to allow some pretty incredible IPC
>>>speeds.
>>
>>No, that problem corresponds to a *very poor* partitioning of the
>>problem onto parallel processors--ironically, one that is encouraged
>>by current languages' simplistic "thread" models of parallel computing.
>
>
> In truly partitioned systems, how do you propose to solve the issue of
> providing access to the source data for all partitions? Shipping the
> data via slower I/O trunks is generally less efficient than processing
> it all on systems with greater locality.

For systems with higher latency or lower bandwidth connections, the
problem partitioning must allow for divisions that do not cut "strong"
interactions. Like Aristotle and reality, we must find ways to cut
applications at their "joints", and ways to design applications that
present numerous opportunities for cuts at various levels of
connectivity.

Thus a *conceptual* toolset is required plus a *measurement* toolset
that can provide useful feedback to all levels, from high level design
time, to low-level design time, to compile time, to link time, to load
time, to execution time (and dynamically at execution time).

> It works fine for compute-bound loads, but data-bound loads need
> different configurations, at least for currently available hardware.

In many applications, there are relatively weak data dependencies
connecting large segments of data, so the data can be divided up,
processed, and then merged together--all very well suited to avoidance
of excess communication. (For example, video and audio transcoding
and most image processing are such cases.)

Note that many cases of multidimensional data have the characteristic
that the amount of computation grows as a power of the data volume,
while the size of the interacting "boundaries" between data chunks
grows as a lesser power, so that by adjusting "chunk" size, the
ratio of processing to communication can be easily adjusted over a
wide range--and *should* be so adjusted adaptively, based on run-time
realities and available system resources.
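
As a toy illustration (made-up numbers for a 3-D stencil): work per
chunk grows as n^3 while the halo exchanged with neighbours grows as
roughly 6n^2, so the compute-to-communication ratio scales with n.

    #include <stdio.h>

    int main(void)
    {
        for (int n = 8; n <= 256; n *= 2) {
            double work  = (double)n * n * n;   /* cells updated */
            double comms = 6.0 * n * n;         /* halo cells moved */
            printf("n = %4d   work/comms = %5.1f\n", n, work / comms);
        }
        return 0;
    }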

> It's worth noting though that many modern languages have absolutely
> awful parallelism management. Java, in fact, is one of the worst in this
> regard, requiring the developer to work around a very fundamental
> design mistake in many cases. Luckily this issue is well understood.
> Unfortunately there's not a lot of languages out there that are
> markedly better.
>
> Fortunately, there are a lot of very nice frameworks around that both
> mitigate the known issues and provide excellent implementations of
> known resource sharing idioms.

And it is exactly this fundamental neglect of the most important
dimension of future computing--parallelism--that I fault. Language
designers are fiddling with details while the main problem goes
unaddressed. Could it be because details are easy and the *real*
problem is harder? ;-)

As I said at ASPLOS a decade ago to language designers, "Your long
vacation is *over*! It's time to get to work on parallelism."

>>Let me give a little example.
>>
>>Maximum efficiency of resource utilization is obtained by "pooling"
>>all of a particular resource together so that all requestors obtain
>>it by withdrawing from one pool. Then, you're not "out" of that
>>resource until you are *really* out of it.
>>
>>But this creates a huge point of serial contention, since all
>>requestors must lock the pool, allocate some resource, then unlock
>>the pool. It is as if a large cafeteria put one giant salt shaker
>>in the middle of the room for all to share.
>>
>>An alternative resource allocation scheme which is well adapted to
>>multiple concurrent users and a hierarchy of latencies is to provide
>>multiple local pools, shared by a small number of users at essentially
>>the same level of connection latency. This is like the more common
>>case of putting a small salt shaker within arms reach of each small
>>group of diners.
>
>
> It's worth noting, though, that having multiple resource pools
> introduces its own problems. From the diner's perspective, it's most
> efficient if each of them is allocated a salt shaker. However, this
> introduces the problem of managing all the salt shakers, imposing an
> unreasonable burden on restaurant staff :-) The challenge is to find
> the balance between the two that provides adequate parallelism while
> effectively managing shared resources. Hence we have a multitude of
> approaches depending on the specifics of the problem domain.

It was exactly my point that the ratio of salt shakers to diners is
an important, and easily managed, property of a solution. And it can
have *huge* performance effects if it is not properly balanced. Either
extreme case is almost certainly suboptimal.

> It's probably also worth pointing out that more than once, I've
> visited a nearby table to 'borrow' a condiment, as the one at my table
> was exhausted. The resource sharing introduces a latency, which
> fortunately was easily resolved by the locality of similar resources.

And this is a relatively low extra overhead in the rare cases when a
local pool is exhausted. (Proper dynamic pool balancing will make this
case very unusual, unless local demand has a large time/space variance--
which is yet another parametric input into the allocation problem.)

> In a very busy place, you can easily find that even neighbouring tables
> have exhausted their resources, resulting in a somewhat ad-hoc 'hunt'
> for nearby sources, before resorting to approaching a waiter, or
> counter in order to obtain more. If this continues too long you can end
> up with a gridlock situation, with an already busy staff trying
> desperately to fill many tiny bottles and then distribute them before
> the system returns to any sense of order.

If the resource consumers make a local "run" on resources faster than
the anticipative/adaptive balancing algorithm can rebalance the pools,
then there will always be trouble--extending at least as far as the
offending locality.

This is simply reality. Everything in the real world exhibits the
same issues, and the same kinds of costs for avoiding or remedying
them.

It's good to recognize when we have reached the limits of our ability
to optimize a solution--like finding a "Second Law of Thermodynamics"
that tells us we've done as much to cope with residual randomness as
can be done. Of course, we're generally nowhere near that kind of
optimality today.

> A parallel system is more like a restaurant with a queue of people out
> the front waiting.
>
> In many cases, the partitioning of the problem isn't the hard part;
> it's formulating strategies to keep each partition operating smoothly.
> Only then are resources being consumed effectively.

You are actually making an argument for the relatively outmoded kind
of concurrency called multiprogramming, where the concurrency of the
load exceeds the concurrency of the system. This is obsolete.

In the PC arena today, we are already on the other side of that problem,
with systems that have more concurrency than the typical application
loads have (because they are so poorly designed to exploit concurrency).

This imbalance toward higher degrees of parallelism is going to increase
dramatically as we move forward. (And, if we even considered trying to
use our office networks of dozens or hundreds of machines to accelerate
applications, we are already *far* beyond the levels of concurrency
supported by our current tools.)

>>Of course, there is still the issue of resource balancing (when the
>>resource is really uniform--not like memory), and this can be done
>>by periodically re-balancing the amounts of resource in the local
>>pools, and across hierarchical levels if necessary.
>
>
> A technique that is always a lot harder to apply than it seems.

There are well known techniques for ensuring stability--analogous
to servo system design. (Programmers generally are not aware of
the mass of work done on stability in control systems.)

>>This is the kind of thinking that must go into the next generation
>>of systems, and it is very different from the thinking that has
>>inspired the systems and tools of today.
>
>
> Agreed - although don't overlook the fact that there are plenty of
> systems out there that have faced a lot of these issues already.

A few have made a start--none have approached the problem as the
absolutely fundamental problem that it is.

>>>Multithreading approaches are very important on these systems. In fact,
>>>multithreading is important even on systems with single execution
>>>units. The gap between I/O throughput and processing throughput means
>>>you get a certain degree of 'parallelism' even though you can only run
>>>one thread at a time. Free performance improvement if you employ
>>>parallel design techniques.
>>>
>>>Of course, there are certain heavily compute-bound applications where
>>>the degree of IPC is very low, and massive parallelism is possible
>>>regardless of the interconnect used, as IPC constitutes a relatively
>>>small part of the workload. For the rest of the cases though where lots
>>>of data is being consumed, systems that allow low-overhead IPC through
>>>multithreading are the way to go.
>>
>>And a tiny fraction of todays tools and designers are even out of
>>kindergarten on issues of partitioning and locality.
>
>
> Completely agree with this point. I'm continually surprised by the
> number of people I encounter who do not understand the dining
> philosophers problem, and that's perhaps the easiest one!
>
>
>>Object orientation is almost totally orthogonal, if not antithetical,
>>to the *real* problems of highly parallel computing, which is the
>>platform of the future. I expect we'll figure this out sometime
>>in the next decade. ;-(
>
>
> Almost, other than the additional abstraction and encapsulation
> required by both concepts. Current languages don't really provide
> adequate abstractions for parallel computing primitives. Some of them
> can be modelled effectively using OO techniques; some of them can't.

Right--which is why I'm trying to emphasize that the truly *important*
dimension of decomposition for concurrency has been sadly neglected,
when it should be the real focus of our language work.

We don't need language "lawyers", or even "mathematicians"; we need
language "engineers" with their sights firmly focused on the needs of
the future--and the exploitation of multi-leveled parallelism heads
that list, both in terms of challenge and in terms of opportunity.

-michael

Parallel computing for 8-bit Apple II's!
Home page: http://members.aol.com/MJMahon/

"The wastebasket is our most important design
tool--and it is seriously underused."
From: mdj on
Michael J. Mahon wrote:

> Video transcoding is an "embarrassingly parallel" application, yet
> I have not found a single popular software manufacturer that makes
> a parallel transcoder. With all the multicore chips now, they are
> leaving factors of performance on the table.

Unfortunately the main uses for transcoding in the consumer space are
to convert DVDs or broadcast digital into more compact forms. Since
this is perceived as a largely piracy-driven concern, you cannot obtain
a license for CSS to build such an application. If you go ahead and
build one anyway (it's not like the techniques are patented) you'll be
prosecuted under the terms of the DMCA. *grumble*

> Apparently, the market for video encoders does not evaluate them
> on performance--astonishingly. Some of the most popular perform
> terribly.

I would guess that there are some professional level solutions that do
a better job.

> The actual situation is a full continuum of bandwidths and latencies,
> and that hierarchy is very poorly supported. Instead, we give designers
> the static choice between shared memory and message passing. This is
> a bad match to a world in which the system interconnect continuum is
> neither very slow nor very fast, and is constantly moving.

Agreed. The moving target is one of the things that makes it difficult,
and a prime candidate for higher level abstractions, lest work done now
be less applicable to hardware available 2 years from now.

> The most important "tools" we are missing are the conceptual tools
> to guide systematic and quantitative decomposition of applications into
> concurrent "chunks" to be run on varying numbers of processors with
> varying amounts and latencies of shared memory and/or message passing.
>
> Until we can get our design heads on straight, it doesn't make much
> sense to codify a particular approach in a language design--though it
> will as we learn.

Interestingly, the hardware world has learned a lot about out-of-order
execution in recent years, so perhaps a good deal of the techniques
have been better explored than first appears. I agree though: we need
the quantitative approach first through tools, and then the better
languages can evolve from those discoveries.

> When races are involved, or rendezvous with non-determinate order of
> arrival, then actual nondeterminism enters the picture. Sometimes this
> is planned, and enhances performance--as in the case of an evaluation
> tree that takes advantage of the distributive property of the rollup
> operation (like arithmetic addition or set addition). Sometimes it is
> unplanned, as when an erroneous race condition exists that causes
> intermittent bugs or unreproducible results.

Yeah, and the planned nondeterministic methods are the case that has
the least exploration at this stage, and is the one that benefits most
as the size of clusters goes up.

> >>>Uniform Memory Multiprocessor - in which many processor cores share
> >>>the same physical memory subsystem. Note that this is further divided
> >>>into multiple cores in the same package, plus other cores in different
> >>>packages, which have very different latency properties.
> >>
> >>Even within one package, only lower cache levels will be common, so
> >>this is not fundamentally different from your next case...
> >
> >
> > Some implementations support high speed transports of L1 cache data
> > both to on-die cores, and inter-package cores. There's still a lot of
> > variance in the available implementations.
>
> Yes, that's my point. The *hierarchy* needs to be the thing that is
> optimized for, not some ideal simplified case.
>
> The only way to get a uniform memory model is to slow down all the
> memory accesses so that the ones that could go faster run at the same
> speed as the ones that can't run any faster--clearly simpler, but *way*
> non-optimal!

I suspect that given the number of possible localities, etc. and the
likelihood of these changing quickly over time, that the most
successful way will be to design for the worst case interconnect, and
allow optimisation to occur per-architecture by some form of runtime.
Still, it doesn't solve the problem of appropriate partitioning of
algorithms in the first place.

> >>>Non Uniform Memory Multiprocessor - In this case the latency can vary
> >>>wildly depending on the system configuration.
> >>>
> >>>Modern multiprocessor servers employ all three approaches, both on the
> >>>same system board, plus via high speed interconnects that join multiple
> >>>system boards together. OS's must weight the 'distance' to another CPU
> >>>when considering a potential execution unit for a process.
> >>
> >>All of your cases are actually the same, differing only in the level
> >>of memory hierarchy (and its corresponding latency and bandwidth) that
> >>is shared.
> >
> >
> > That's right. I drew a distinction because at present not all
> > multicore systems are implemented in the ideal way. The
> > current-generation Intel Xeon dual cores, for instance, only benefit
> > you on power consumption, and are actually of equivalent locality to
> > different processor packages. Eventually the idealised designs will
> > surface though and you'll have general localities of die, board,
> > system, and network. For the moment it's somewhat more complicated.
>
> There is no "ideal" way--there are only real ways, and none of them
> have simple behavior if they're any good!

:-) I meant "ideal" in terms of the ones that are currently 'better'
versus the 'worse' ones: processing units on the same system board
having closer locality for similar system cost.

> >>Any practical system will consist of all levels of connectivity, with
> >>sharing at virtually all the different levels of the memory hierarchy.
> >>And I would add another set of levels, in which there is no "memory
> >>consistency" model, but message passing is the sharing mechanism.
> >>This extends the multiprocessing model across networks.
> >
> >
> > A brand new PC these days will get you two cores. There are different
> > scales of systems. The big ones, for sure, will exhibit all 4 levels,
> > but there's plenty of work to be done just to handle the opportunities
> > that exist in a new laptop.
>
> Agreed--and the tools for *optimizing* applications across even two
> processors are almost nonexistent. A suboptimal data partitioning
> can cause a large factor of performance degradation, and designers
> have virtually no tools to understand and adjust this.

Indeed. There are plenty of cases where even non-parallel application
performance drops when moving to multiprocessor systems, simply because
of operating system design. A classic example, which is largely solved
now, is the naive scheduler which shifts a running process to another
CPU, throwing out the cache on the one it was on in the process - and
worse, this happens while the process in question was waiting on I/O
:-)
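
The blunt workaround was for the application to pin itself. On Linux,
for instance, something like this hypothetical fragment:

    #define _GNU_SOURCE
    #include <sched.h>

    /* Pin the calling process to CPU 0 so the scheduler cannot bounce
       it between CPUs and discard its warm cache. */
    int pin_to_cpu0(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        return sched_setaffinity(0, sizeof set, &set);  /* 0 = self */
    }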

> >>>What's slow and wasteful depends a great deal on the task at hand.
> >>>Multithreading used to be just as expensive as multiprocessing. But
> >>>consider a current generation CPU designed for low power, high
> >>>concurrency, the UltraSPARC T1.
> >>>
> >>>These units have execution cores capable of running 4 concurrent threads.
> >>>In the highest end configuration, there are 8 of these execution cores
> >>>per physical processor. The cores have a 3.2GB/s interconnect. Each
> >>>physical processor has 4 independent memory controllers, so you have
> >>>non-uniform memory access on the one die.
> >>
> >>Exactly. The general case is becoming the common case.
> >
> >
> > The thing to keep in mind is that there's two cores per memory
> > controller, so ideally, if you need to migrate a process, you'll
> > migrate it to its 'twin' core first if it's available before moving it
> > to another memory controller. You stay on package in preference to
> > moving to another package. There are a lot of variables involved in
> > picking the next most appropriate locality for a process.
>
> Now you see what I mean. OSs should automatically manage this
> "dynamic affinity" or "memory reference locality", with input and
> assistance from compile-time and run-time analysis.
>
> And the designer should be able to visualize the range of behaviors
> caused by various design choices relative to parallel execution.
>
> The absence of these tools is a major obstacle to the design and
> implementation of efficient parallel applications, and this is a
> *much* more fruitful area for investment than playing with language
> constructs.
>
> >>And multi-threaded processors are actually a very old idea. The
> >>Honeywell 800 supported 8 "threads" (not called that, of course),
> >>by executing instructions in "rotation", skipping slots that were
> >>waiting for I/O to complete. At the time, it was considered to be
> >>a hardware implementation of multiprogramming.
> >
> >
> > Indeed. There are very few new ideas in Computing these days.
>
> Sad but true.
>
> >>Today, multithreaded processors do much the same, but the "I/O wait"
> >>has been replaced by the "cache miss".
> >>
> >>The peripheral processor of the CDC 6600 was another salient example
> >>of multi-threading. It was implemented in the same fast logic as
> >>the central processor, but presented the appearance of 10 separate
> >>PPs, each executing instructions at a 10th the rate of the central
> >>processor. This had the effect of matching its instruction rate to
> >>the latency of memory, and provided 10-fold concurrency for managing
> >>I/O and memory transfers.
> >>
> >>
> >>>Peak power consumption for this part is 79W at 1GHz. Considering you
> >>>can in theory run 32 threads simultaneously, that's pretty impressive.
> >>>How well you can exploit it depends on your application. An 'old
> >>>school' web server for instance, can only get 8 way parallelism on this
> >>>chip. A new school web server written in Java, can get 32 way, assuming
> >>>at any given time there are at least 32 concurrent requests for the same
> >>>dynamic page, or 32 static requests.
> >>>
> >>>It's getting to the stage where the power consumed by driving I/O over
> >>>a pin on an IC package is significant, so expect to see systems like
> >>>this grow in popularity.
> >>
> >>This was always an inevitable result of higher levels of integration.
> >>As soon as a significant amount of cache can be shared on the chip,
> >>it becomes advantageous to adorn it with multiple processors.
> >
> >
> > Just be thankful you aren't the author of the process scheduler of a
> > current operating system :-)
>
> Actually, that's one of the most important areas to do *much* better
> and more adaptively than is done in current OSs.

Yes. The current approach effectively attaches a 'timeout' constant to
a process. The timeout represents the cost of shifting it to the
next-nearest locality, and a process doesn't move until this time has
expired without an opportunity to schedule it locally. Once the timeout
has expired, you look for a gap in the scheduling table on processors
at this locality, and if one exists, you slot the process in. If not,
you increase the timeout to the cost of moving one level further out,
and so on. Each time a process gets scheduled to run, you reset the
timeout to the nearest locality.

This approach works very well for avoiding unnecessary shifting of
single-threaded processes on a multitasking system, and things tend to
settle into relative harmony as the system runs.

Of course, handling scheduling and moving of massively parallel
processes is another kettle of fish!

> The "distributed process allocation" problem in the OS is as fundamental
> as the "code selection" problem in a compiler, and will have much larger
> performance impact as the level of *system* concurrency continues to
> increase.
>
> I would even go so far as to say that the current *lack* of good
> solutions to this problem is a major reason for the limited utility
> of multicore systems. In a couple of years, we could all have 8-way
> systems on our desktops, but if the fundamental enablers for parallel
> apps don't get done, they won't perform much better than single-core
> systems (as is true today for dual-core/single-core).

There's a real need to crawl before we walk here. One big problem that
needs to be solved even on current systems is how to deal with many
concurrent independent processes all in kernel space at the same time.
Unless we move to microkernel systems, which I think eventually we
will, we also have to solve the problem of how to effectively schedule
multiple concurrent threads of operating system code competing for
resources, which effectively means the problem has to be solved twice.
Most current OS's have pretty coarse granularity when it comes to
intra-kernel concurrency.

> >>>Interesting, you can download a VHDL description of this part from Sun,
> >>>and synthesise it on one of the higher-end FPGA's. Oh how I wish I had
> >>>access to hardware like that!
> >>>
> >>>A top of the range Sun server uses parts that have 4 execution threads
> >>>per core, four cores per board, each with its own memory
> >>>controller+memory, and up to 18 boards per system (coupled together by
> >>>a 9GB/s crossbar switch). Exploiting all the resources in this system
> >>>and doing it efficiently is *hard*, as it employs every different style
> >>>of parallelism I mentioned before within the same 'machine'.
> >>>
> >>>And I haven't even considered computing clusters!
> >>
> >>Exactly. And the full hierarchy of latency and bandwidth needs to be
> >>addressed by both measurement tools and by behavioral models for code
> >>partitioning and optimization. *This* is the tools frontier that I see,
> >>with huge potential payoffs.
> >
> >
> > It's very difficult to construct these tools when the languages we use
> > exploit parallelism in such low level ways. What's needed, is the
> > ability to abstract away the issues of locality, focus on the core
> > issues involved, and then instrument systems accordingly.
>
> Exactly--which is why I say this is the real "frontier" today.
>
> > This is similar to the compiler you mentioned a while back that used
> > runtime instrumentation of its p-code engine to determine critical
> > sections then tune them. Although modern VM's apply this technique
> > today, that's about as far as it's got.
>
> It doesn't need to be a non-existent machine--most machines have timer
> interrupts that permit background profiling and dynamic code changes.
> All of these tricks can be (and have been) done on actual machine code.

Yes. I've often considered playing with this myself with Apple Pascal:
profiling the bytecode interpreter as it runs, then using the output to
selectively native-compile various procedures or functions and relink
the application. Should be a hoot :-)
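
In modern dress the same trick looks something like this (a sketch
only: noteEntry(), the sampling period and the 'hot' threshold are all
invented, and a timer thread stands in for the timer interrupt):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class PcodeProfiler {
    private volatile int currentProc; // set by the dispatch loop
    private final Map<Integer, Integer> samples =
            new ConcurrentHashMap<Integer, Integer>();

    // The interpreter calls this on every procedure entry.
    void noteEntry(int procId) { currentProc = procId; }

    // Periodically sample whichever procedure is executing; once one
    // has been caught often enough, flag it for native compilation.
    void startSampling(final long periodMillis, final int hotThreshold) {
        Thread sampler = new Thread(new Runnable() {
            public void run() {
                while (true) {
                    int proc = currentProc;
                    Integer n = samples.get(proc);
                    int count = (n == null) ? 1 : n + 1;
                    samples.put(proc, count);
                    if (count == hotThreshold) {
                        System.out.println("proc " + proc
                                + " is hot: compile and relink it");
                    }
                    try {
                        Thread.sleep(periodMillis);
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            }
        });
        sampler.setDaemon(true);
        sampler.start();
    }
}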

> > One of the reasons I still like Solaris is that it's the only common OS
> > that supports the concept of 'microstate' accounting. This essentially
> > means that you can log the transitions of a process through various
> > known 'microstates'. So rather than knowing just that a process is
> > blocked on a resource, you can find out what resources, why, how
> > long, etc. This is integrated back into the development toolchain all
> > the way through to the Java-level profiler, so you can visualise
> > exactly what different threads are doing and spending time on, and
> > easily spot bottlenecks, which in many cases can be tuned out at
> > either the OS or the VM level. Very fancy stuff.
>
> And not too difficult or intrusive if it is based on sampling
> information at interrupt time (which automatically bases sample
> density on performance impact).

Unfortunately, other operating systems don't offer this level of
instrumentation, and they differ enough in architecture, operating
system design, etc. that many of the performance tuning possibilities
you uncover on Solaris aren't applicable :-(
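
About the closest you get portably is the JVM's own management
interface (java.lang.management, standard since Java 5), which at
least separates CPU time from time spent blocked or waiting - nowhere
near microstates, but better than nothing:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

class ThreadTimes {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (mx.isThreadCpuTimeSupported()) {
            mx.setThreadCpuTimeEnabled(true);
        }
        for (long id : mx.getAllThreadIds()) {
            ThreadInfo info = mx.getThreadInfo(id);
            if (info == null) continue; // thread exited while we looked
            System.out.println(info.getThreadName()
                    + " state=" + info.getThreadState()
                    + " cpu=" + (mx.getThreadCpuTime(id) / 1000000) + "ms"
                    + " blocked=" + info.getBlockedCount()
                    + " waited=" + info.getWaitedCount());
        }
    }
}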

> >>>The way it's panning out is that real multiprocessors are a disaster
> >>>for parallelism. The problem is that essentially any task that can be
> >>>parallelised needs to process the same data that it does in serial
> >>>form. Because of this, you can utilise the same buses, I/O subsystems,
> >>>and take advantage of 'nearness' to allow some pretty incredible IPC
> >>>speeds.
> >>
> >>No, that problem corresponds to a *very poor* partitioning of the
> >>problem onto parallel processors--ironically, one that is encouraged
> >>by current languages' simplistic "thread" models of parallel computing.
> >
> >
> > In truly partitioned systems, how do you propose to solve the issue of
> > providing access to the source data for all partitions? Shipping the
> > data via slower I/O trunks is generally less efficient than processing
> > it all on systems with greater locality.
>
> For systems with higher latency or lower bandwidth connections, the
> problem partitioning must allow for divisions that do not cut "strong"
> interactions. Like Plato carving nature at its joints, we must cut
> applications at their "joints", and ways to design applications that
> present numerous opportunities for cuts at various levels of
> connectivity.
>
> Thus a *conceptual* toolset is required plus a *measurement* toolset
> that can provide useful feedback to all levels, from high level design
> time, to low-level design time, to compile time, to link time, to load
> time, to execution time (and dynamically at execution time).
>
> > It works fine for compute-bound loads, but data bound loads need
> > different configurations, at least for currently available hardware.
>
> In many applications, there are relatively weak data dependencies
> connecting large segments of data, so the data can be divided up,
> processed, and then merged together--all very well suited to avoidance
> of excess communication. (For example, video and audio transcoding
> and most image processing are such cases.)
>
> Note that many cases of multidimensional data have the characteristic
> that the amount of computation grows as a power of the data volume,
> while the size of the interacting "boundaries" between data chunks
> grows as a lesser power, so that by adjusting "chunk" size, the
> ratio of processing to communication can be easily adjusted over a
> wide range--and *should* be so adjusted adaptively, based on run-time
> realities and available system resources.

Indeed. Video encoding can very easily be parallelised through time
division, across many more processors than we're likely to have in
consumer machines any time soon. This can be achieved through fairly
simple multithreading plus rendezvous techniques. Of course, if the
number of parallel slices becomes excessive, you introduce too many
full frames into your encoding and require a multi-tiered approach, but
it's still relatively feasible to do using simple tools.
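
The rough shape of it, assuming a hypothetical encodeChunk() that
begins every chunk with a full frame (the chunk size and frame counts
are invented numbers):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelEncode {
    // Stand-in for the real codec; each chunk starts with a key frame.
    static byte[] encodeChunk(int firstFrame, int lastFrame) {
        return new byte[0];
    }

    public static void main(String[] args) throws Exception {
        int totalFrames = 3000, chunk = 300; // one key frame per chunk
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        List<Future<byte[]>> parts = new ArrayList<Future<byte[]>>();
        for (int f = 0; f < totalFrames; f += chunk) {
            final int first = f;
            final int last = Math.min(f + chunk, totalFrames) - 1;
            parts.add(pool.submit(new Callable<byte[]>() {
                public byte[] call() {
                    return encodeChunk(first, last);
                }
            }));
        }
        for (Future<byte[]> part : parts) {
            byte[] bits = part.get(); // rendezvous: splice in order
            // ... append bits to the output stream ...
        }
        pool.shutdown();
    }
}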

> > It's worth noting though that many modern languages have absolutely
> > awful parallelism management. Java in fact, is one of the worst in this
> > regard, requiring the developer to work around a very fundamental
> > design mistake in many cases. Luckily this issue is well understood.
> > Unfortunately there's not a lot of languages out there that are
> > markedly better.
> >
> > Fortunately, there are a lot of very nice frameworks around that
> > mitigate both the known issues, plus provide excellent implementations
> > of known resource sharing idioms.
>
> And it is exactly this fundamental neglect of the most important
> dimension of future computing--parallelism--that I fault. Language
> designers are fiddling with details while the main problem goes
> unaddressed. Could it be because details are easy and the *real*
> problem is harder? ;-)

:-)

> As I said at ASPLOS a decade ago to language designers, "Your long
> vacation is *over*! It's time to get to work on parallelism."
>
> >>Let me give a little example.
> >>
> >>Maximum efficiency of resource utilization is obtained by "pooling"
> >>all of a particular resource together so that all requestors obtain
> >>it by withdrawing from one pool. Then, you're not "out" of that
> >>resource until you are *really* out of it.
> >>
> >>But this creates a huge point of serial contention, since all
> >>requestors must lock the pool, allocate some resource, then unlock
> >>the pool. It is as if a large cafeteria put one giant salt shaker
> >>in the middle of the room for all to share.
> >>
> >>An alternative resource allocation scheme which is well adapted to
> >>multiple concurrent users and a hierarchy of latencies is to provide
> >>multiple local pools, shared by a small number of users at essentially
> >>the same level of connection latency. This is like the more common
> >>case of putting a small salt shaker within arm's reach of each small
> >>group of diners.
> >
> >
> > It's worth noting though, that having multiple resource pools
> > introduces its own problems. From the diners' perspective, it's most
> > efficient if they are each allocated a salt shaker. However, this
> > introduces the problem of managing all the salt shakers, imposing an
> > unreasonable burden on restaurant staff :-) The challenge is to find the
> > balance between the two that provides adequate parallelism while
> > effectively managing shared resources. Hence we have a multitude of
> > approaches depending on the specifics of the problem domain.
>
> It was exactly my point that the ratio of salt shakers to diners is
> an important, and easily managed, property of a solution. And it can
> have *huge* performance effects if it is not properly balanced. Either
> extreme case is almost certainly suboptimal.
>
> > It's probably also worth pointing out that more than once, I've visited
> > a nearby table to 'borrow' a condiment, as the one at my table was
> > exhausted. The resource sharing introduces a latency, which fortunately
> > was easily resolved by the locality of similar resources.
>
> And this is a relatively low extra overhead in the rare cases when a
> local pool is exhausted. (Proper dynamic pool balancing will make this
> case very unusual, unless local demand has a large time/space variance--
> which is yet another parametric input into the allocation problem.)
>
> > In a very busy place, you can easily find that even neighbouring tables
> > have exhausted their resources, resulting in a somewhat ad-hoc 'hunt'
> > for nearby sources, before resorting to approaching a waiter, or
> > counter in order to obtain more. If this continues too long you can end
> > up with a gridlock situation, with an already busy staff trying
> > desperately to fill many tiny bottles and then distribute them before
> > the system returns to any sense of order.
>
> If the resource consumers make a local "run" on resources faster than
> the anticipative/adaptive balancing algorithm can rebalance the pools,
> then there will always be trouble--extending at least as far as the
> offending locality.
>
> This is simply reality. Everything in the real world exhibits the
> same issues, and the same kinds of costs for avoiding or remedying
> them.
>
> It's good to recognize when we have reached the limits of our ability
> to optimize a solution--like finding a "Second Law of Thermodynamics"
> that tells us we've done as much to cope with residual randomness as
> can be done. Of course, we're generally nowhere near that kind of
> optimality today.

This is part of my point, and I believe certain language constructs are
an impediment to achieving this within the application programming
space.
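
To put the salt shakers in code, here's a minimal sketch of a striped
pool: each 'table' draws from its own small pool, and only the
overflow and underflow paths touch the shared reserve. (The sizing and
the caller-supplied table index are assumptions.)

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class StripedPool<R> {
    private final BlockingQueue<R>[] local;
    private final BlockingQueue<R> reserve;

    @SuppressWarnings("unchecked")
    StripedPool(int tables, int perTable, BlockingQueue<R> reserve) {
        this.local = new BlockingQueue[tables];
        for (int i = 0; i < tables; i++) {
            local[i] = new ArrayBlockingQueue<R>(perTable);
        }
        this.reserve = reserve;
    }

    // Acquire from the nearest pool; fall back to the shared reserve.
    R acquire(int table) throws InterruptedException {
        R r = local[table].poll(); // cheap path: little contention
        if (r != null) {
            return r;
        }
        return reserve.take();     // rare, contended slow path
    }

    // Return to the local pool; overflow goes back to the reserve.
    void release(int table, R r) {
        if (!local[table].offer(r)) {
            reserve.add(r);
        }
    }
}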

> > A parallel system is more like a restaurant with a queue of people
> > out the front waiting.
> >
> > In many cases, the partitioning of the problem isn't the hard part;
> > it's formulating strategies to keep each partition ticking over
> > smoothly. Only then are resources being consumed effectively.
>
> You are actually making an argument for the relatively outmoded kind
> of concurrency called multiprogramming, where the concurrency of the
> load exceeds the concurrency of the system. This is obsolete.

I'm not so sure it is - OS's still fail to exploit multiprocessor
systems efficiently for multiple execution threads, whether they belong
to the same process/program or not. This is a fundamental problem with
any form of parallelism.

> There are well known techniques for ensuring stability--analogous
> to servo system design. (Programmers generally are not aware of
> the mass of work done on stability in control systems.)

Indeed.

> >>This is the kind of thinking that must go into the next generation
> >>of systems, and it is very different from the thinking that has
> >>inspired the systems and tools of today.
> >
> >
> > Agreed - although don't overlook the fact that there are plenty of
> > systems out there that have faced a lot of these issues already.
>
> A few have made a start--none have approached the problem as the
> absolutely fundamental problem that it is.

It's beyond the understanding of most to approach the problem this way.
There's a degree of research going on in the quantum computing field,
but that work is aimed at algorithm design for theoretical machines.
I'd imagine, though, that a lot of the techniques apply to real
parallel systems as well.

> >>>Multithreading approaches are very important on these systems. In fact,
> >>>multithreading is important even on systems with single execution
> >>>units. The gap between I/O throughput and processing throughput means
> >>>you get a certain degree of 'parallelism' even though you can only run
> >>>one thread at a time. Free performance improvement if you employ
> >>>parallel design techniques.
> >>>
> >>>Of course, there are certain heavily compute-bound applications where
> >>>the degree of IPC is very low, and massive parallelism is possible
> >>>regardless of the interconnect used, as IPC constitutes a relatively
> >>>small part of the workload. For the rest of the cases though where lots
> >>>of data is being consumed, systems that allow low-overhead IPC through
> >>>multithreading are the way to go.
> >>
> >>And a tiny fraction of todays tools and designers are even out of
> >>kindergarten on issues of partitioning and locality.
> >
> >
> > Completely agree with this point. I'm continually surprised by the
> > number of people I encounter that do not understand the dining
> > philosophers problem, and that's perhaps the easiest one!
> >
> >
> >>Object orientation is almost totally orthogonal, if not antithetical,
> >>to the *real* problems of highly parallel computing, which is the
> >>platform of the future. I expect we'll figure this out sometime
> >>in the next decade. ;-(
> >
> >
> > Almost, other than the additional abstraction and encapsulation being
> > required by both concepts. Current languages don't really provide
> > adequate abstractions for parallel computing primitives. Some of them
> > can be modelled effectively using OO techniques, some of them can't.
>
> Right--which is why I'm trying to emphasize that the truly *important*
> dimension of decomposition for concurrency has been sadly neglected,
> when it should be the real focus of our language work.

There is room, no, for multiple focus groups?
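
And since it came up: the kindergarten exercise itself, solved the
boring textbook way by imposing a global order on the forks (a sketch;
the println stands in for actual eating):

import java.util.concurrent.locks.ReentrantLock;

class DiningPhilosophers {
    public static void main(String[] args) {
        final int n = 5;
        final ReentrantLock[] forks = new ReentrantLock[n];
        for (int i = 0; i < n; i++) {
            forks[i] = new ReentrantLock();
        }
        for (int i = 0; i < n; i++) {
            final int seat = i;
            new Thread(new Runnable() {
                public void run() {
                    // Always lock the lower-numbered fork first, so no
                    // cycle of waiters can ever form.
                    ReentrantLock first =
                            forks[Math.min(seat, (seat + 1) % n)];
                    ReentrantLock second =
                            forks[Math.max(seat, (seat + 1) % n)];
                    for (int meals = 0; meals < 3; meals++) {
                        first.lock();
                        second.lock();
                        try {
                            System.out.println("philosopher " + seat
                                    + " eats");
                        } finally {
                            second.unlock();
                            first.unlock();
                        }
                    }
                }
            }).start();
        }
    }
}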

> We don't need language "lawyers", or even "mathematicians"; we need
> language "engineers" with their sights firmly focused on the needs of
> the future--and the exploitation of multi-leveled parallelism heads
> that list, both in terms of challenge and in terms of opportunity.

I agree with you, and with my engineer hat firmly on, I maintain that
keeping applications agnostic of architecture-centric concerns is a
necessary step towards this goal. Consider that massive-scale
parallelism is going to be heterogeneous. There are techniques out
there that support this, and some developers embrace them. Others are
content to blindly export architectural dependencies into their code
when, as you rightly point out, there are far more pressing concerns
awaiting us.

Matt

From: Linards Ticmanis on
Paul Schlyter wrote:

> Using soft interrupts as entry points on the Apple II would have been
> infeasible -- but Apple could have used a JMP table instead, positioned
> near the beginning or the end of the ROM address space. CP/M used that
> method for entry points to its BIOS: a series of JMP operations at the
> very beginning of the memory block used by the CP/M BIOS.

Commodore did the same by the way; they have that "kernal" jump table at
the end of the ROMs which stayed fairly stable throughout their 8-bit
line, and which is easy to "hook" since most of the important entry
points use indirect JMPs through a RAM vector. Didn't actually make the
machines very compatible with each other, though.

--
Linards Ticmanis
From: Michael on
Wow -- this has been an awesome thread to read on Software Engineering,
especially about the paradigm shifts happening in languages.

I especially appreciate the fact that it has remained civil, instead of
degenerating into some flame-fest, like so many other threads do when
they approach the territory of "Java Sucks", "C/C++ sucks", instead of
getting to the real issues --
a) what problem are you trying to solve?
b) what paradigms are you trying to use in the language?
c) what paradigms are you trying to use outside the language?

Cheers
