From: Michael J. Mahon on
mdj wrote:
> Michael J. Mahon wrote:
>
>
>>Video transcoding is an "embarrassingly parallel" application, yet
>>I have not found a single popular software manufacturer that makes
>>a parallel transcoder. With all the multicore chips now, they are
>>leaving factors of performance on the table.
>
>
> Unfortunately the main uses for transcoding in the consumer space are
> to convert DVDs or broadcast digital into more compact forms. Since
> this is perceived as largely a piracy-driven concern, you cannot obtain
> a license for CSS to build such an application. If you go ahead and
> build one anyway (it's not like the techniques are patented) you'll be
> prosecuted under the terms of the DMCA. *grumble*

I understand your point, but what about legitimate transcoding?
Most packages I have for generating MPEG2 don't do CSS at all.

>>Apparently, the market for video encoders does not evaluate them
>>on performance--astonishingly. Some of the most popular perform
>>terribly.
>
>
> I would guess that there are some professional level solutions that do
> a better job.

One hopes, but don't count on it. ;-)

>>The actual situation is a full continuum of bandwidths and latencies,
>>and that hierarchy is very poorly supported. Instead, we give designers
>>the static choice between shared memory and message passing. This is
>>a bad match to a world in which the system interconnect continuum is
>>neither very slow nor very fast, and is constantly moving.
>
>
> Agreed. The moving target is one of the things that makes it difficult,
> and a prime candidate for higher level abstractions, lest work done now
> be less applicable to hardware available 2 years from now.

Actually, the problem of allocating computation across a hierarchy of
memories and multiprocessors is quite old, and the topology has not
varied from the "general case" in several decades--only more levels
of hierarchy have been introduced. It's still exactly the same
problem, with exactly the same range of solutions.

Put another way, there is nothing "special purpose" about the model
that I propose we optimize for--it should apply to everything from
atoms to stars (in fact, the universe already seems to have an
allocation algorithm that works pretty well ;-).

Here's an outline of a possible (incomplete) conceptualization based,
interestingly enough, on the dynamic stability of (most) stars.

Given a population of processes, all working concurrently and
1) consuming resources, and 2) communicating, let us allocate
processes to a single "processing node" until its resources are
mostly used, then begin allocating processes to nearby nodes,
until all processes have been "spread out" across nodes. We will
treat resource contention on a node as "pressure" which leads
to "expansion" of the set of processes across more nodes.

Now, as the processes communicate, we treat non-local communication
as an attractive force between the communicating processes, which
biases the allocation algorithm toward moving them closer together
(in "communication cost" space).

Now note that a dynamic equilibrium is possible in which the contractive
"force" of communication is balanced by the expansive "force" of
resource contention. This is analogous to the equilibrium that exists
in a stable star between gravitational attraction and internal pressure.

Such models can form the basis, when melded with control system theory,
for stable, quasi-optimal allocation of multiple communicating processes
across a fabric of processor-memory nodes. Generalizations to a
hierarchy of connectivity latencies and bandwidths are possible and
would be required except in the simplest cases.
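
To make the model concrete, here's a toy relaxation loop in C. The
names and the cost model are mine, invented for illustration--a sketch
of the equilibrium idea, not a real allocator. Each sweep migrates each
process toward the node that minimizes the sum of local "pressure"
(contention, growing superlinearly with occupancy) and "attraction"
(communication volume times interconnect distance to its peers):

/* Toy "pressure vs. attraction" placement. All names and the
   cost model are invented for illustration. */
#include <stdio.h>

#define NODES 4
#define PROCS 8

static int load[NODES];          /* processes resident on each node    */
static int node_of[PROCS];       /* current placement of each process  */
static int talks[PROCS][PROCS];  /* communication volume between procs */
static int dist[NODES][NODES];   /* interconnect cost between nodes    */

/* Expansive force: contention on the node, growing superlinearly.
   Contractive force: cost of talking to peers on distant nodes.   */
static int cost(int p, int n)
{
    int occ = load[n] + (n == node_of[p] ? 0 : 1);
    int c = occ * occ;                      /* "pressure" term */
    for (int q = 0; q < PROCS; q++)
        if (q != p)
            c += talks[p][q] * dist[n][node_of[q]];
    return c;
}

/* One relaxation sweep: move each process to its cheapest node. */
static int sweep(void)
{
    int moved = 0;
    for (int p = 0; p < PROCS; p++) {
        int best = node_of[p];
        for (int n = 0; n < NODES; n++)
            if (cost(p, n) < cost(p, best))
                best = n;
        if (best != node_of[p]) {
            load[node_of[p]]--;
            load[best]++;
            node_of[p] = best;
            moved++;
        }
    }
    return moved;
}

int main(void)
{
    /* Ring interconnect: distance = hop count; two chatty pairs. */
    for (int a = 0; a < NODES; a++)
        for (int b = 0; b < NODES; b++) {
            int d = a > b ? a - b : b - a;
            dist[a][b] = d < NODES - d ? d : NODES - d;
        }
    talks[0][1] = talks[1][0] = 10;
    talks[2][3] = talks[3][2] = 10;
    load[0] = PROCS;                /* everything starts on node 0 */

    for (int i = 0; i < 100 && sweep(); i++)
        ;                           /* iterate toward equilibrium  */
    for (int p = 0; p < PROCS; p++)
        printf("process %d -> node %d\n", p, node_of[p]);
    return 0;
}

Repeated sweeps settle into the balance described above: chatty
processes pull together, crowded nodes push processes apart.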

>>The most important "tools" we are missing are the conceptual tools
>>to guide systematic and quantitative decomposition of applications into
>>concurrent "chunks" to be run on varying numbers of processors with
>>varying amounts and latencies of shared memory and/or message passing.
>>
>>Until we can get our design heads on straight, it doesn't make much
>>sense to codify a particular approach in a language design--though it
>>will as we learn.
>
>
> Interestingly the hardware world has learned a lot about out of order
> execution in recent years, so perhaps a good deal of the techniques
> have been better explored than first appears. I agree though, we need
> the quantitative approach first through tools, then the better
> languages can evolve from those discoveries.

Actually, most of what is being practiced in the realm of hardware
concurrency is quite well understood, and has been around for decades.
What has changed is that transistors have become cheap enough that the
techniques originally developed in limited form for supercomputers
can now be implemented in more general forms on microprocessor chips.

Buffering, speculation, scoreboarding, data flow, and associative memory
are not new, but their affordability in desktop processors is.

>>When races are involved, or rendezvous with non-determinate order of
>>arrival, then actual nondeterminism enters the picture. Sometimes this
>>is planned, and enhances performance--as in the case of an evaluation
>>tree that takes advantage of the distributive property of the rollup
>>operation (like arithmetic addition or set addition). Sometimes it is
>>unplanned, as when an erroneous race condition exists that causes
>>intermittent bugs or unreproducible results.
>
>
> Yeah, and the planned nondeterministic methods are the case that has
> the least exploration at this stage, and is the one that benefits most
> as the size of clusters goes up.

Bingo.

BTW, it guarantees the loss of strict reproducibility of results if
floating point arithmetic is used, because the order of operations is
non-deterministic, and that is what determines roundoff error. We'll
just have to be content with "fuzzy" reproducibility!

(Incidentally, the variance of results produced when a computation is
evaluated nondeterministically several times is a lower bound on the
roundoff error in the computation.)
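
A tiny C illustration of the order effect--the addends are contrived,
but any long reduction shows the same thing in miniature:

/* The same three addends, summed in two "arrival orders",
   give two different answers in float arithmetic. */
#include <stdio.h>

int main(void)
{
    float x[3] = { 1.0e8f, -1.0e8f, 1.0f };
    float fwd = 0.0f, rev = 0.0f;

    for (int i = 0; i < 3; i++)  fwd += x[i];  /* one order     */
    for (int i = 2; i >= 0; i--) rev += x[i];  /* another order */

    printf("forward %g, reverse %g\n", fwd, rev);  /* 1 vs. 0 */
    return 0;
}

Run a nondeterministic reduction a few times and the spread of the
answers is exactly the variance I mentioned.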

>>>>>Uniform Memory Multiprocessor - in which many processsor cores share
>>>>>the same physical memory subsystem. Note that this is further divided
>>>>>into multiple cores in the same package, plus other cores in different
>>>>>packages, which have very different latency properties.
>>>>
>>>>Even within one package, only lower cache levels will be common, so
>>>>this is not fundamentally different from your next case...
>>>
>>>
>>>Some implementations support high speed transports of L1 cache data
>>>both to on-die cores, and inter-package cores. There's still a lot of
>>>variance in the available implementations.
>>
>>Yes, that's my point. The *hierarchy* needs to be the thing that is
>>optimized for, not some ideal simplified case.
>>
>>The only way to get a uniform memory model is to slow down all the
>>memory accesses so that the ones that could go faster run at the same
>>speed as the ones that can't run any faster--clearly simpler, but *way*
>>non-optimal!
>
>
> I suspect that given the number of possible localities, etc. and the
> likelihood of these changing quickly over time, that the most
> successful way will be to design for the worst case interconnect, and
> allow optimisation to occur per-architecture by some form of runtime.
> Still, it doesn't solve the problem of appropriate partitioning of
> algorithms in the first place.

Right. And one must be able to adaptively change allocations to
maintain efficiency as the computation evolves, since many computations
have more than one "locality" behavior, depending on what's going on.

>>>>>Non Uniform Memory Multiprocessor - In this case the latency can vary
>>>>>wildly depending on the system configuration.
>>>>>
>>>>>Modern multiprocessor servers employ all three approaches, both on the
>>>>>same system board, plus via high speed interconnects that join multiple
>>>>>system boards together. OS's must weigh the 'distance' to another CPU
>>>>>when considering a potential execution unit for a process.
>>>>
>>>>All of your cases are actually the same, differing only in the level
>>>>of memory hierarchy (and its corresponding latency and bandwidth) that
>>>>is shared.
>>>
>>>
>>>That's right. I drew a distinction because at present not all multicore
>>>systems are implemented in the ideal way. The current generation Intel
>>>Xeon dual cores for instance only benefit you for power consumption,
>>>and are actually of equivalent locality to different processor
>>>packages. Eventually the idealised designs will surface though and
>>>you'll have general localities of die, board, system, and network. For
>>>the moment it's somewhat more complicated.
>>
>>There is no "ideal" way--there are only real ways, and none of them
>>have simple behavior if they're any good!
>
>
> :-) I meant "ideal" in terms of the ones that currently are 'better'
> versus the 'worse' ones, in terms of processing units on the same
> system board having closer locality for similar system cost.

Ah. Well, that's just physics. Unless the designers mess it up, things
that are closer together can afford to be connected at a higher level of
the hierarchy.

Early multi-core designs were just sub-optimal "pasteup" jobs.

>>>>Any practical system will consist of all levels of connectivity, with
>>>>sharing at virtually all the different levels of the memory hierarchy.
>>>>And I would add another set of levels, in which there is no "memory
>>>>consistency" model, but message passing is the sharing mechanism.
>>>>This extends the multiprocessing model across networks.
>>>
>>>
>>>A brand new PC these days will get you two cores. There are different
>>>scales of systems. The big ones, for sure, will exhibit all 4 levels,
>>>but there's plenty of work to be done just to handle the opportunities
>>>that exist in a new laptop.
>>
>>Agreed--and the tools for *optimizing* applications across even two
>>processors are almost nonexistent. A suboptimal data partitioning
>>can cause a large factor of performance degradation, and designers
>>have virtually no tools to understand and adjust this.
>
>
> Indeed. There are plenty of cases where even non-parallel application
> performance drops when moving to multiprocessor systems, simply because
> of operating system design. A classic example, which is largely solved
> now is the naive scheduler which shifts a running process to another
> CPU, throwing out the cache on the one it was on in the process, and
> worse, this happens while the process in question was waiting on I/O
> :-)

And you correctly identify this as an immature OS scheduler design--yet
another example of how even OS designers are blissfully unaware of the
concurrency trends barreling toward them like trucks! (And they've
been quite visibly coming for a decade, so no one can justly claim that
they were surprised!)

I've been amazed by how unaware OS designers are about the implications
of memory hierarchy and multiprocessing on their algorithms. And those
who figured out how to "cope" with a 4-way system can't understand
that they will need a completely different approach to deal efficiently
with a 32- or 64-way system. (Think about queueing theory, lock
contention, resource pools, and error recovery.)

>>>>This was always an inevitable result of higher levels of integration.
>>>>As soon as a significant amount of cache can be shared on the chip,
>>>>it becomes advantageous to adorn it with multiple processors.
>>>
>>>
>>>Just be thankful you aren't the author of the process scheduler of a
>>>current operating system :-)
>>
>>Actually, that's one of the most important areas to do *much* better
>>and more adaptively than is done in current OSs.
>
>
> Yes. The current approach adds effectively a 'timeout' constant to a
> process. The timeout represents the cost of shifting it to the next
> nearest locality, and processes don't move until this time has expired
> without the opportunity to shift them. Once the timeout has expired,
> you look for a gap in the scheduling table on processors of this
> locality, and if one exists, you slot it in. If not, you increase the
> timeout to the cost of moving one level further out, and so on. Each
> time a process gets scheduled to run, you reset the timeout to nearest
> locality.
>
> This approach works very well for avoiding unnecessary shifting of
> single-thread processes on a multitasking system, and things tend to
> settle into relative harmony as the system runs.
>
> Of course, handling scheduling and moving of massively parallel
> processes is another kettle of fish!

Not if approached with the right abstractions. ;-)
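
For concreteness, here's a minimal rendering of the timeout-escalation
policy you describe--the names and cost figures are mine, invented for
illustration:

/* Sketch of the "timeout escalation" migration policy described
   above. Names and cost figures are invented for illustration. */
#include <stdbool.h>

#define LEVELS 4  /* 0 = same core ... 3 = farthest locality */

/* Waiting threshold: the cost (in ticks) of shifting one level out. */
static const int migrate_cost[LEVELS] = { 0, 10, 100, 1000 };

struct proc {
    int waited;   /* ticks spent runnable but not run      */
    int level;    /* locality level we may next migrate to */
};

/* Stubs standing in for the scheduler proper. */
static bool slot_free_at_level(int level) { return level >= 2; }
static void migrate(struct proc *p, int level) { (void)p; (void)level; }
static void run(struct proc *p) { (void)p; }

static void tick(struct proc *p, bool got_cpu)
{
    if (got_cpu) {                      /* ran: reset to nearest locality */
        run(p);
        p->waited = 0;
        p->level = 1;
        return;
    }
    if (++p->waited < migrate_cost[p->level])
        return;                         /* timeout not yet expired        */
    if (slot_free_at_level(p->level)) { /* gap in the schedule: shift it  */
        migrate(p, p->level);
        p->waited = 0;
        p->level = 1;
    } else if (p->level < LEVELS - 1) { /* no gap: escalate one level out */
        p->level++;
    }
}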

>>The "distributed process allocation" problem in the OS is as fundamental
>>as the "code selection" problem in a compiler, and will have much larger
>>performance impact as the level of *system* concurrency continues to
>>increase.
>>
>>I would even go so far as to say that the current *lack* of good
>>solutions to this problem is a major reason for the limited utility
>>of multicore systems. In a couple of years, we could all have 8-way
>>systems on our desktops, but if the fundamental enablers for parallel
>>apps don't get done, they won't perform much better than single-core
>>systems (as is true today for dual-core/single-core).
>
>
> There's a real need to crawl before we walk here. One big problem that
> needs be solved even on current systems is how to deal with many
> concurrent independent processes all in kernel space at the same time.
> Unless we move to microkernel systems, which I think eventually we will,
> we have to solve the problem of how to effectively schedule multiple
> concurrent threads of operating system code competing for resources as
> well, effectively meaning the problem has to be solved twice. Most
> current OS's have pretty ordinary granularity when it comes to
> intra-kernel concurrency.

True--and significant changes in the low levels of OSs are inevitable.
But "crawl before walk" implies that we are facing this as a *new*
problem, when, in fact, it has been coming toward us for a generation,
but we have failed to acknowledge it in our designs.

I hope using our nose for a wall detector won't cause us too much more
pain in this process (though note that lack of application support for
even low levels of parallelism is really slowing the PC market down).

>>>This is similar to the compiler you mentioned a while back that used
>>>runtime instrumentation of its p-code engine to determine critical
>>>sections then tune them. Although modern VM's apply this technique
>>>today, that's about as far as it's got.
>>
>>It doesn't need to be a non-existent machine--most machines have timer
>>interrupts that permit background profiling and dynamic code changes.
>>All of these tricks can be (and have been) done on actual machine code.
>
>
> Yes. I've often considered playing with this myself with Apple Pascal,
> and profiling the bytecode interpreter as it runs. Then, use the output
> of it to selectively native compile various procedures or functions and
> relink the application. Should be a hoot :-)

Sounds like great fun!

Using a source of timer interrupts makes dynamic profiling a breeze.
(Of course, this would also identify the parts of the *interpreter* that
were hot.)
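
On any POSIX-ish system the whole trick fits in a page. A minimal
sketch--it buckets samples by a "current region" marker the program
sets as it enters each phase, since decoding the interrupted program
counter is platform-specific:

/* Minimal sampling profiler driven by timer interrupts. */
#include <signal.h>
#include <stdio.h>
#include <sys/time.h>

#define REGIONS 3
static volatile sig_atomic_t region;    /* what the program is doing now */
static volatile long samples[REGIONS];  /* profile hits per region       */

static void on_prof(int sig)
{
    (void)sig;
    samples[region]++;                  /* one sample per timer tick */
}

static double burn(long n)              /* stand-in workload */
{
    double x = 0.0;
    for (long i = 1; i <= n; i++)
        x += 1.0 / (double)i;
    return x;
}

int main(void)
{
    /* Fire SIGPROF every 10 ms of CPU time, repeating. */
    struct itimerval it = { { 0, 10000 }, { 0, 10000 } };
    double sink = 0.0;

    signal(SIGPROF, on_prof);
    setitimer(ITIMER_PROF, &it, NULL);

    region = 0; sink += burn(100000000L);  /* phases of the "application" */
    region = 1; sink += burn(300000000L);
    region = 2; sink += burn(100000000L);

    for (int r = 0; r < REGIONS; r++)      /* sample counts ~ time spent  */
        printf("region %d: %ld samples\n", r, samples[r]);
    printf("(checksum %g)\n", sink);
    return 0;
}

The sample density automatically tracks where the time goes, which is
exactly the property that makes this kind of profiling so cheap.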

>>>One of the reasons I still like Solaris is that it's the only common OS
>>>that supports the concept of 'microstate' accounting. This essentially
>>>means that you can log the transitions of a process through various
>>>known 'microstates'. So rather than know just that a process is blocked
>>>on a resource, you can find out what resources, why, how long, etc.
>>>This is integrated back into the development toolchain all the way
>>>through to the Java level profiler, so you can visualise exactly what
>>>different threads are doing, spending time on, and easily spot
>>>bottlenecks which in many cases can be tuned out at either the OS or
>>>the VM. Very fancy stuff.
>>
>>And not too difficult or intrusive if it is based on sampling
>>information at interrupt time (which automatically bases sample
>>density on performance impact).
>
>
> Unfortunately on other operating systems, you don't have this level of
> instrumentation, and the other systems are different enough in terms of
> architecture, operating system design, etc. as to make many of the
> performance tuning possibilities you uncover on Solaris not applicable
> :-(

Users should pester OS designers to provide them with background
profiling capabilities. It's actually easy to do on any OS.

Of course, if you can define interrupting timers, then you can do it
yourself--with considerably more overhead.

>>In many applications, there are relatively weak data dependencies
>>connecting large segments of data, so the data can be divided up,
>>processed, and then merged together--all very well suited to avoidance
>>of excess communication. (For example, video and audio transcoding
>>and most image processing are such cases.)
>>
>>Note that many cases of multidimensional data have the characteristic
>>that the amount of computation grows as a power of the data volume,
>>while the size of the interacting "boundaries" between data chunks
>>grows as a lesser power, so that by adjusting "chunk" size, the
>>ratio of processing to communication can be easily adjusted over a
>>wide range--and *should* be so adjusted adaptively, based on run-time
>>realities and available system resources.
>
>
> Indeed. Video encoding can very easily be parallelised through time
> division, across many more processors than we're likely to have in
> consumer machines any time soon. This can be achieved through fairly
> simple multithreading plus rendezvous techniques. Of course, if the
> number of parallel slices becomes excessive, you introduce too many full
> frames into your encoding, and require a multitiered approach, but it's
> still relatively feasible to do using simple tools.

Now *there's* a degree of parallelism that I'd like to have access to!
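
The surface-to-volume arithmetic above is easy to make concrete. For
cubic chunks of a 3-D dataset, compute grows as n^3 while boundary
traffic grows as n^2, so the ratio is linear in the chunk edge--the
constants below are placeholders, not measurements:

/* Compute/communication ratio vs. chunk size for cubic chunks. */
#include <stdio.h>

int main(void)
{
    const double work_per_cell  = 100.0; /* ops per cell (placeholder) */
    const double bytes_per_face = 8.0;   /* traffic per boundary cell  */

    for (long n = 8; n <= 512; n *= 2) {
        double compute = work_per_cell * (double)n * n * n;
        double comm    = bytes_per_face * 6.0 * (double)n * n;
        printf("edge %4ld: compute/comm = %.1f\n", n, compute / comm);
    }
    return 0;
}

Doubling the chunk edge doubles the ratio, so an adaptive allocator can
dial processing-to-communication over a very wide range just by
choosing n.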

<snip>

>>There are well known techniques for ensuring stability--analogous
>>to servo system design. (Programmers generally are not aware of
>>the mass of work done on stability in control systems.)
>
>
> Indeed.
>
>
>>>>This is the kind of thinking that must go into the next generation
>>>>of systems, and it is very different from the thinking that has
>>>>inspired the systems and tools of today.
>>>
>>>
>>>Agreed - although don't overlook the fact that there are plenty of
>>>systems out there that have faced a lot of these issues already.
>>
>>A few have made a start--none have approached the problem as the
>>absolutely fundamental problem that it is.
>
>
> It's beyond the understanding of most to approach the problem this way.
> There's a degree of research going on in the quantum computing field,
> but this is for algorithm design for theoretical machines. I'd imagine
> though that a lot of the techniques apply to real parallel systems as
> well.

I wouldn't say so. Quantum computing is so radically different from
state machine programming that there is almost no comparison. I'd say
it's at least as big a difference in paradigm as going from analog
computing to digital computing.

What needs to be done is for trained people with an *engineering*
mindset to sit down and squarely face the problem of multi-level
concurrency.

I expect the solutions will roll out in layers of successive refinement,
but we haven't yet even chosen to directly address the problem.

>>>>>Multithreading approaches are very important on these systems. In fact,
>>>>>multithreading is important even on systems with single execution
>>>>>units. The gap between I/O throughput and processing throughput means
>>>>>you get a certain degree of 'parallelism' even though you can only run
>>>>>one thread at a time. Free performance improvement if you employ
>>>>>parallel design techniques.
>>>>>
>>>>>Of course, there are certain heavily compute-bound applications where
>>>>>the degree of IPC is very low, and massive parallelism is possible
>>>>>regardless of the interconnect used, as IPC constitutes a relatively
>>>>>small part of the workload. For the rest of the cases though where lots
>>>>>of data is being consumed, systems that allow low-overhead IPC through
>>>>>multithreading are the way to go.
>>>>
>>>>And a tiny fraction of today's tools and designers are even out of
>>>>kindergarten on issues of partitioning and locality.
>>>
>>>
>>>Completely agree with this point. I'm continually surprised by the
>>>number of people I encounter that do not understand the dining
>>>philosophers problem, and that's perhaps the easiest one!
>>>
>>>
>>>
>>>>Object orientation is almost totally orthogonal, if not antithetical,
>>>>to the *real* problems of highly parallel computing, which is the
>>>>platform of the future. I expect we'll figure this out sometime
>>>>in the next decade. ;-(
>>>
>>>
>>>Almost, other than the additional abstraction and encapsulation being
>>>required by both concepts. Current languages don't really provide
>>>adequate abstractions for parallel computing primitives. Some of them
>>>can be modelled effectively using OO techniques, some of them can't.
>>
>>Right--which is why I'm trying to emphasize that the truly *important*
>>dimension of decomposition for concurrency has been sadly neglected,
>>when it should be the real focus of our language work.
>
>
> There is room, no, for multiple focus groups?

Well, after spending 50 years on notation, how about starting in
on parallelism? ;-)

I realize that there have been a few seminal approaches to supporting
parallelism in languages--including the radically concurrent applicative
languages--but none of the "rubber" was ever brought anywhere near the
"road" of real systems, let alone ones with hierarchical connectivity.

There are already dozens of designers focusing on notation. I'm
actually making a *plea* for more focus areas--and for one of
them to be concurrency!

-michael

Parallel computing for 8-bit Apple II's!
Home page: http://members.aol.com/MJMahon/

"The wastebasket is our most important design
tool--and it is seriously underused."
From: Sam Gillett on

"Linards Ticmanis" wrote ...

> Paul Schlyter wrote:
>
>> Using soft interrupts as entry points on the Apple II would have been
>> infeasible -- but Apple could have used a JMP table instead, positioned
>> near the beginning or the end of the ROM address space. CP/M used that
>> method for entry points to its BIOS: a series of JMP operations at the
>> very beginning of the memory block used by the CP/M BIOS.
>
> Commodore did the same by the way; they have that "kernal" jump table at
> the end of the ROMs which stayed fairly stable throughout their 8-bit
> line, and which is easy to "hook" since most of the important entry
> points use indirect JMPs through a RAM vector. Didn't actually make the
> machines very compatible with each other, though.

But the Kernel jump table did make it easier for a C64 programmer to adapt to
programming the C128. Going back a little farther in time, the jump table
made it easy for VIC-20 programmers to adapt to the C64. :-)

As others pointed out in this thread, when you have gigahertz processors and
gigabytes of RAM, excessive use of OS calls may not be good programming
practice. But, back in the days of one-megahertz processors and 64 KB of
RAM, coding right "down to the metal" made good sense.

8-bit machines may not be as useful today as they were 20 years ago, but they
are still a fun hobby. Twenty-year-old programming practices are obsolete on
modern machines, but still good practice on the machines of yesteryear.
:-)
--
Best regards,

Sam Gillett

Change is inevitable,
except from vending machines!



From: Lyrical Nanoha on
On Sat, 10 Jun 2006, Michael J. Mahon wrote:

> For over a decade I have preached to anyone who would listen that the
> biggest design error in C is its use of null-terminated strings. This
> inherently insecure data representation makes unsafe programming easy
> and safe programming harder.
>
> Security demands that *all* data moves into fixed space handle the case
> where the data does not fit. Yet programmers seduced by the "cuteness"
> of null-termination argue against fixing this no-brainer security lapse.
>
> Count-prefixed strings can also be very efficiently implemented, as can
> count-controlled move loops for null-terminated strings (the obvious fix
> for C).

I'd like to suggest that perhaps an alternate runtime library for C can be
coded on a few systems, "libnewc" perhaps, that includes, among other
things, better ways of handling strings. C's mechanism is very limited
and the null termination thing doesn't help.
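
As a rough sketch of what I mean--"libnewc" doesn't exist, and all the
names here are made up--a count-prefixed string next to the classic
idiom it would replace:

/* Hypothetical counted-string type for a "libnewc". */
#include <stdio.h>
#include <string.h>

struct nstr {
    size_t len;      /* count prefix: the length is always known */
    char   buf[64];  /* fixed space; size to taste               */
};

/* The classic unsafe idiom: copies until a null, however far away. */
static void unsafe_copy(char *dst, const char *src)
{
    while ((*dst++ = *src++))
        ;
}

/* Count-controlled move: destination capacity bounds the copy, and
   the caller is told whether the data fit.                        */
static int nstr_set(struct nstr *s, const char *src, size_t n)
{
    int fit = n <= sizeof s->buf;
    s->len = fit ? n : sizeof s->buf;
    memcpy(s->buf, src, s->len);
    return fit;  /* 0 means truncated: the caller must handle it */
}

int main(void)
{
    struct nstr s;
    char old[64];

    unsafe_copy(old, "hello");  /* "safe" only because old is big enough */
    if (!nstr_set(&s, "hello, world", 12))
        puts("didn't fit");
    printf("%.*s (%zu bytes)\n", (int)s.len, s.buf, s.len);
    return 0;
}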

-uso.
From: mdj on

Michael J. Mahon wrote:

> Actually, the problem of allocating computation across a hierarchy of
> memories and multiprocessors is quite old, and the topology has not
> varied from the "general case" in several decades--only more levels
> of hierarchy have been introduced. It's still exactly the same
> problem, with exactly the same range of solutions.
>
> Put another way, there is nothing "special purpose" about the model
> that I propose we optimize for--it should apply to everything from
> atoms to stars (in fact, the universe already seems to have an
> allocation algorithm that works pretty well ;-).
>
> Here's an outline of a possible (incomplete) conceptualization based,
> interestingly enough, on the dynamic stability of (most) stars.
>
> Given a population of processes, all working concurrently and
> 1) consuming resources, and 2) communicating, let us allocate
> processes to a single "processing node" until its resources are
> mostly used, then begin allocating processes to nearby nodes,
> until all processes have been "spread out" across nodes. We will
> treat resource contention on a node as "pressure" which leads
> to "expansion" of the set of processes across more nodes.
>
> Now, as the processes communicate, we treat non-local communication
> as an attractive force between the communicating processes, which
> biases the allocation algorithm toward moving them closer together
> (in "communication cost" space).
>
> Now note that a dynamic equilibrium is possible in which the contractive
> "force" of communication is balanced by the expansive "force" of
> resource contention. This is analogous to the equilibrium that exists
> in a stable star between gravitational attraction and internal pressure.
>
> Such models can form the basis, when melded with control system theory,
> for stable, quasi-optimal allocation of multiple communicating processes
> across a fabric of processor-memory nodes. Generalizations to a
> hierarchy of connectivity latencies and bandwidths are possible and
> would be required except in the simplest cases.

Indeed this is the goal. Perhaps you should get involved in one of the
open source projects working on such fields. There seems to be a lot
more 'hardcore' computing going on in the open source community these
days than inside corporate walls.

> Actually, most of what is being practiced in the realm of hardware
> concurrency is quite well understood, and has been around for decades.
> What has changed is that transistors have become cheap enough that the
> techniques originally developed in limited form for supercomputers
> can now be implemented in more general forms on microprocessor chips.
>
> Buffering, speculation, scoreboarding, data flow, and associative memory
> are not new, but their affordability in desktop processors is.

Certainly techniques that are not applied in the software realm, which
is what I was pointing out.

> Bingo.
>
> BTW, it guarantees the loss of strict reproducibility of results if
> floating point arithmetic is used, because the order of operations is
> non-deterministic, and that is what determines roundoff error. We'll
> just have to be content with "fuzzy" reproducibility!
>
> (Incidentally, the variance of results produced when a computation is
> evaluated nondeterministically several times is a lower bound on the
> roundoff error in the computation.)

Yes. The only practical solution is to use much larger floating point
representations such that the fuzziness is less of a concern, although
modern hardware can perform 256-bit fixed-point arithmetic quite
quickly :-)

> > I suspect that given the number of possible localities, etc. and the
> > likelihood of these changing quickly over time, that the most
> > successful way will be to design for the worst case interconnect, and
> > allow optimisation to occur per-architecture by some form of runtime.
> > Still, it doesn't solve the problem of appropriate partitioning of
> > algorithms in the first place.
>
> Right. And one must be able to adaptively change allocations to
> maintain efficiency as the computation evolves, since many computations
> have more than one "locality" behavior, depending on what's going on.

Hence my lack of faith in languages that bind us too tightly to
(hardware) architectural borders.

> >>There is no "ideal" way--there are only real ways, and none of them
> >>have simple behavior if they're any good!
> >
> >
> > :-) I meant "ideal" in terms of the ones that currently are 'better'
> > versus the 'worse' ones, in terms of processing units on the same
> > system board having closer locality for similar system cost.
>
> Ah. Well, that's just physics. Unless the designers mess it up, things
> that are closer together can afford to be connected at a higher level of
> the hierarchy.
>
> Early multi-core designs were just sub-optimal "pasteup" jobs.

Many of them still on the market, being espoused as equivalent or
superior to competing solutions :-(

> >>>>Any practical system will consist of all levels of connectivity, with
> >>>>sharing at virtually all the different levels of the memory hierarchy.
> >>>>And I would add another set of levels, in which there is no "memory
> >>>>consistency" model, but message passing is the sharing mechanism.
> >>>>This extends the multiprocessing model across networks.
> >>>
> >>>
> >>>A brand new PC these days will get you two cores. There are different
> >>>scales of systems. The big ones, for sure, will exhibit all 4 levels,
> >>>but there's plenty of work to be done just to handle the opportunities
> >>>that exist in a new laptop.
> >>
> >>Agreed--and the tools for *optimizing* applications across even two
> >>processors are almost nonexistent. A suboptimal data partitioning
> >>can cause a large factor of performance degradation, and designers
> >>have virtually no tools to understand and adjust this.
> >
> >
> > Indeed. There are plenty of cases where even non-parallel application
> > performance drops when moving to multiprocessor systems, simply because
> > of operating system design. A classic example, which is largely solved
> > now is the naive scheduler which shifts a running process to another
> > CPU, throwing out the cache on the one it was on in the process, and
> > worse, this happens while the process in question was waiting on I/O
> > :-)
>
> And you correctly identify this as an immature OS scheduler design--yet
> another example of how even OS designers are blissfully unaware of the
> concurrency trends barreling toward them like trucks! (And they've
> been quite visibly coming for a decade, so no one can justly claim that
> they were surprised!)
>
> I've been amazed by how unaware OS designers are about the implications
> of memory hierarchy and multiprocessing on their algorithms. And those
> who figured out how to "cope" with a 4-way system can't understand
> that they will need a completely different approach to deal efficiently
> with a 32- or 64-way system. (Think about queueing theory, lock
> contention, resource pools, and error recovery.)

Absolutely, although there are designs out there that deal well with
machines of that size (Solaris is one example). PC based operating
systems have conventionally been optimised only for the hardware that
appears at the time, which, considering the rate of improvement in that
field, has left a lot of stones unturned.

> >>>>This was always an inevitable result of higher levels of integration.
> >>>>As soon as a significant amount of cache can be shared on the chip,
> >>>>it becomes advantageous to adorn it with multiple processors.
> >>>
> >>>
> >>>Just be thankful you aren't the author of the process scheduler of a
> >>>current operating system :-)
> >>
> >>Actually, that's one of the most important areas to do *much* better
> >>and more adaptively than is done in current OSs.
> >
> >
> > Yes. The current approach adds effectively a 'timeout' constant to a
> > process. The timeout represents the cost of shifting it to the next
> > nearest locality, and processes don't move until this time has expired
> > without the opportunity to shift them. Once the timeout has expired,
> > you look for a gap in the scheduling table on processors of this
> > locality, and if one exists, you slot it in. If not, you increase the
> > timeout to the cost of moving one level further out, and so on. Each
> > time a process gets scheduled to run, you reset the timeout to nearest
> > locality.
> >
> > This approach works very well for avoiding unnecessary shifting of
> > single-thread processes on a multitasking system, and things tend to
> > settle into relative harmony as the system runs.
> >
> > Of course, handling scheduling and moving of massively parallel
> > processes is another kettle of fish!
>
> Not if approached with the right abstractions. ;-)

And ditching explicit pointer arithmetic in code is one of the
abstractions, dammit ;-)

> >>The "distributed process allocation" problem in the OS is as fundamental
> >>as the "code selection" problem in a compiler, and will have much larger
> >>performance impact as the level of *system* concurrency continues to
> >>increase.
> >>
> >>I would even go so far as to say that the current *lack* of good
> >>solutions to this problem is a major reason for the limited utility
> >>of multicore systems. In a couple of years, we could all have 8-way
> >>systems on our desktops, but if the fundamental enablers for parallel
> >>apps don't get done, they won't perform much better than single-core
> >>systems (as is true today for dual-core/single-core).
> >
> >
> > There's a real need to crawl before we walk here. One big problem that
> > needs be solved even on current systems is how to deal with many
> > concurrent independent processes all in kernel space at the same time.
> > Unless we move to microkernel systems, which I think eventually we will,
> > we have to solve the problem of how to effectively schedule multiple
> > concurrent threads of operating system code competing for resources as
> > well, effectively meaning the problem has to be solved twice. Most
> > current OS's have pretty ordinary granularity when it comes to
> > intra-kernel concurrency.
>
> True--and significant changes in the low levels of OSs are inevitable.
> But "crawl before walk" implies that we are facing this as a *new*
> problem, when, in fact, it has been coming toward us for a generation,
> but we have failed to acknowledge it in our designs.
>
> I hope using our nose for a wall detector won't cause us too much more
> pain in this process (though note that lack of application support for
> even low levels of parallelism is really slowing the PC market down).

Agreed. The crawling should have been done a decade ago, not starting
now while oodles of performance go begging. The process we have to
go through remains the same regardless of the timeline, unfortunately.

> >>>This is similar to the compiler you mentioned a while back that used
> >>>runtime instrumentation of its p-code engine to determine critical
> >>>sections then tune them. Although modern VM's apply this technique
> >>>today, that's about as far as it's got.
> >>
> >>It doesn't need to be a non-existent machine--most machines have timer
> >>interrupts that permit background profiling and dynamic code changes.
> >>All of these tricks can be (and have been) done on actual machine code.
> >
> >
> > Yes. I've often considered playing with this myself with Apple Pascal,
> > and profiling the bytecode interpreter as it runs. Then, use the output
> > of it to selectively native compile various procedures or functions and
> > relink the application. Should be a hoot :-)
>
> Sounds like great fun!
>
> Using a source of timer interrupts makes dynamic profiling a breeze.
> (Of course, this would also identify the parts of the *interpreter* that
> were hot.)

I'm awaiting a TimeMaster HO which has a rather nice programmable
interrupt controller. Much more flexible than relying on the 60Hz VBL I
use at the moment for experimentation.

It's good fun exploring these ideas on smaller environments that have
nice predictable behaviors, but you already know this :-)

> >>>One of the reasons I still like Solaris is that it's the only common OS
> >>>that supports the concept of 'microstate' accounting. This essentially
> >>>means that you can log the transitions of a process through various
> >>>known 'microstates'. So rather than know just that a process is blocked
> >>>on a resource, you can find out what resources, why, how long, etc.
> >>>This is integrated back into the development toolchain all the way
> >>>through to the Java level profiler, so you can visualise exactly what
> >>>different threads are doing, spending time on, and easily spot
> >>>bottlenecks which in many cases can be tuned out at either the OS or
> >>>the VM. Very fancy stuff.
> >>
> >>And not too difficult or intrusive if it is based on sampling
> >>information at interrupt time (which automatically bases sample
> >>density on performance impact).
> >
> >
> > Unfortunately on other operating systems, you don't have this level of
> > instrumentation, and the other systems are different enough in terms of
> > architecture, operating system design, etc. as to make many of the
> > performance tuning possibilities you uncover on Solaris not applicable
> > :-(
>
> Users should pester OS designers to provide them with background
> profiling capabilities. It's actually easy to do on any OS.
>
> Of course, if you can define interrupting timers, then you can do it
> yourself--with considerably more overhead.

One of the problems with the design of many operating systems is the
inability to expose interrupts to a user-level process. This has
resulted in many awful design hacks in OS's over the years (The
embedding of the UI subsystem of Windows NT into the kernel from
version 4 onward being a good example.)

> >>In many applications, there are relatively weak data dependencies
> >>connecting large segments of data, so the data can be divided up,
> >>processed, and then merged together--all very well suited to avoidance
> >>of excess communication. (For example, video and audio transcoding
> >>and most image processing are such cases.)
> >>
> >>Note that many cases of multidimensional data have the characteristic
> >>that the amount of computation grows as a power of the data volume,
> >>while the size of the interacting "boundaries" between data chunks
> >>grows as a lesser power, so that by adjusting "chunk" size, the
> >>ratio of processing to communication can be easily adjusted over a
> >>wide range--and *should* be so adjusted adaptively, based on run-time
> >>realities and available system resources.
> >
> >
> > Indeed. Video encoding can very easily be parallelised through time
> > division, across many more processors than we're likely to have in
> > consumer machines any time soon. This can be achieved through fairly
> > simple multithreading plus rendezvous techniques. Of course, if the
> > number of parallel slices becomes excessive, you introduce too many full
> > frames into your encoding, and require a multitiered approach, but it's
> > still relatively feasible to do using simple tools.
>
> Now *there's* a degree of parallelism that I'd like to have access to!

When I've needed to do such things (being a linux man) I've simply
split the source file into two parts, run two encoders at the same
time, then recombined. The fine granularity of the tools offered by the
UNIX-style interface opens up a lot of flexibility in how you approach
problems.

I agree that it should be far simpler than this, though.
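
For what it's worth, the same trick in C boils down to a fork per
chunk. The encoder name and file names here are made up, and naive
concatenation only works for stream formats that splice cleanly:

/* Run one encoder per input half, wait for both, recombine. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static pid_t encode(const char *in, const char *out)
{
    pid_t pid = fork();
    if (pid == 0) {                  /* child: become the encoder */
        execlp("encoder", "encoder", in, "-o", out, (char *)NULL);
        perror("execlp");            /* only reached on failure   */
        _exit(127);
    }
    return pid;
}

int main(void)
{
    pid_t a = encode("part1.mpg", "part1.out");
    pid_t b = encode("part2.mpg", "part2.out");  /* runs concurrently */
    int status;

    if (a < 0 || b < 0)
        return 1;
    waitpid(a, &status, 0);
    waitpid(b, &status, 0);
    return system("cat part1.out part2.out > whole.out");
}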

> >>There are well known techniques for ensuring stability--analogous
> >>to servo system design. (Programmers generally are not aware of
> >>the mass of work done on stability in control systems.)
> >
> >
> > Indeed.
> >
> >
> >>>>This is the kind of thinking that must go into the next generation
> >>>>of systems, and it is very different from the thinking that has
> >>>>inspired the systems and tools of today.
> >>>
> >>>
> >>>Agreed - although don't overlook the fact that there are plenty of
> >>>systems out there that have faced a lot of these issues already.
> >>
> >>A few have made a start--none have approached the problem as the
> >>absolutely fundamental problem that it is.
> >
> >
> > It's beyond the understanding of most to approach the problem this way.
> > There's a degree of research going on in the quantum computing field,
> > but this is for algorithm design for theoretical machines. I'd imagine
> > though that a lot of the techniques apply to real parallel systems as
> > well.
>
> I wouldn't say so. Quantum computing is so radically different from
> state machine programming that there is almost no comparison. I'd say
> it's at least as big a difference in paradigm as going from analog
> computing to digital computing.

I refer to the work done in selecting the correct computations from
many parallel ones, which should be applicable in some sense.

> What needs to be done is for trained people with an *engineering*
> mindset to sit down and squarely face the problem of multi-level
> concurrency.
>
> I expect the solutions will roll out in layers of successive refinement,
> but we haven't yet even chosen to directly address the problem.

It won't be long now. We can't wait much longer for companies to
engineer processors with massively faster sequential processing speeds
before realising that they can't :-)

> >>>>>Multithreading approaches are very important on these systems. In fact,
> >>>>>multithreading is important even on systems with single execution
> >>>>>units. The gap between I/O throughput and processing throughput means
> >>>>>you get a certain degree of 'parallelism' even though you can only run
> >>>>>one thread at a time. Free performance improvement if you employ
> >>>>>parallel design techniques.
> >>>>>
> >>>>>Of course, there are certain heavily compute-bound applications where
> >>>>>the degree of IPC is very low, and massive parallelism is possible
> >>>>>regardless of the interconnect used, as IPC constitutes a relatively
> >>>>>small part of the workload. For the rest of the cases though where lots
> >>>>>of data is being consumed, systems that allow low-overhead IPC through
> >>>>>multithreading are the way to go.
> >>>>
> >>>>And a tiny fraction of today's tools and designers are even out of
> >>>>kindergarten on issues of partitioning and locality.
> >>>
> >>>
> >>>Completely agree with this point. I'm continually surprised by the
> >>>number of people I encounter that do not understand the dining
> >>>philosophers problem, and that's perhaps the easiest one!
> >>>
> >>>
> >>>
> >>>>Object orientation is almost totally orthogonal, if not antithetical,
> >>>>to the *real* problems of highly parallel computing, which is the
> >>>>platform of the future. I expect we'll figure this out sometime
> >>>>in the next decade. ;-(
> >>>
> >>>
> >>>Almost, other than the additional abstraction and encapsulation being
> >>>required by both concepts. Current languages don't really provide
> >>>adequate abstractions for parallel computing primitives. Some of them
> >>>can be modelled effectively using OO techniques, some of them can't.
> >>
> >>Right--which is why I'm trying to emphasize that the truly *important*
> >>dimension of decomposition for concurrency has been sadly neglected,
> >>when it should be the real focus of our language work.
> >
> >
> > There is room, no, for multiple focus groups?
>
> Well, after spending 50 years on notation, how about starting in
> on parallelism? ;-)
>
> I realize that there have been a few seminal approaches to supporting
> parallelism in languages--including the radically concurrent applicative
> languages--but none of the "rubber" was ever brought anywhere near the
> "road" of real systems, let alone ones with hierarchical connectivity.
>
> There are already dozens of designers focusing on notation. I'm
> actually making a *plea* for more focus areas--and for one of
> them to be concurrency!

You're absolutely right, but there's a degree of dependency between the
two that needs to be addressed, and part of that is engineering out old
notational forms which inhibit the progress of parallel system design.

Matt

From: Michael J. Mahon on
Lyrical Nanoha wrote:
> On Sat, 10 Jun 2006, Michael J. Mahon wrote:
>
>> For over a decade I have preached to anyone who would listen that the
>> biggest design error in C is its use of null-terminated strings. This
>> inherently insecure data representation makes unsafe programming easy
>> and safe programming harder.
>>
>> Security demands that *all* data moves into fixed space handle the case
>> where the data does not fit. Yet programmers seduced by the "cuteness"
>> of null-termination argue against fixing this no-brainer security lapse.
>>
>> Count-prefixed strings can also be very efficiently implemented, as can
>> count-controlled move loops for null-terminated strings (the obvious fix
>> for C).
>
>
> I'd like to suggest that perhaps an alternate runtime library for C can
> be coded on a few systems, "libnewc" perhaps, that includes, among other
> things, better ways of handling strings. C's mechanism is very limited
> and the null termination thing doesn't help.

But part of the problem is that many programmers wrote the simple
"for"-loop-until-null string copy right into their code.

-michael

Parallel computing for 8-bit Apple II's!
Home page: http://members.aol.com/MJMahon/

"The wastebasket is our most important design
tool--and it is seriously underused."