From: Michael on

Michael J. Mahon wrote:
> The Apple ROM was not significantly changed until the //e, where much
> of the actual code was placed in a bank-switched area. Much of the F8
> region became stubs at documented entry points vectoring to the actual
> routines. Updating the F8 region of the ROM was known to be a minefield
> of compatibility issues.
>
> The notion of defining the ROM entry points more architecturally was
> not widespread at the time the Apple II was designed, and the impact
> of the subsequent loss of control over ROM code became a problem.

This is related to another hobby of mine -- the HP48 calculator (which,
in a lot of ways, has the same "Apple" feel). It had the same problem,
although it wasn't as bad, since certain 'entry points' into the system
ROM became well known; e.g. SYSEVAL on the HP28, which exposed the
problem to the general public. (You know how curious engineers/geeks
are when something has "undocumented" written all over it. ;-)

I'll agree that it was not widespread, but it was known. The question
is when?
Woz did work at HP -- when did the calculator people come up with a
solution? And did Woz know about it?



> (I've always wondered how much "compatibility issues" and how much
> "renegotiation issues" factored into the decision to never update
> the Applesoft ROMs to fix bugs...)
>
> Later systems used a combination of less documentation and more
> complexity to make calls into the middle of ROM less likely. Still
> not an ideal solution, but one well adapted to the Apple II. ;-)

It certainly looks like everyone was too busy writing their own code,
since the ROM code wasn't all that useful.

Cheers

From: mdj on

Michael wrote:

> While most code doesn't need to know the bit size of types, you still
> need to know the minimum sizes, so you don't have to worry about
> underflow / overflow. The size issue comes up when serializing. The
> language mandating features even when the hardware doesn't support
> them (say, doubles on DSPs, or the PS2) is one of the reasons Java is
> so slow.

Obviously, if your application requires a data size or type that's
unsupported by your hardware, you're going to have performance
problems, but this applies regardless of the language used.
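
To be concrete about the serializing point: because Java pins down both
the width and the byte order of its primitive types, a DataOutputStream
emits identical bytes on every platform, which is the flip side of the
"mandated features" cost. A quick sketch (the class name is just mine):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class FixedSizeSerialization {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);

        // Java defines int as 32 bits and long as 64 bits everywhere, so
        // these calls always emit 4 and 8 big-endian bytes respectively,
        // no matter what the underlying hardware prefers.
        out.writeInt(0x12345678);
        out.writeLong(0x1122334455667788L);
        out.flush();

        System.out.println("Serialized " + bytes.size() + " bytes"); // prints 12
    }
}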

> See: "How Java's Floating-Point Hurts Everyone Everywhere"
> http://www.cs.berkeley.edu/~wkahan/JAVAhurt.pdf

For heavily numerically intensive applications, you need dedicated
numerics libraries. This is the case for C and C++ as well.

It'd be nice if people could come up with a standardised, rich numerics
specification that could be added to Java. I'd imagine it'll happen
eventually.
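
Even without a full library, a classic example of the care a numerics
library takes for you is compensated (Kahan) summation -- fitting, given
the paper above. A rough sketch, with the class and method names mine:

public class KahanSum {
    public static double sum(double[] values) {
        double total = 0.0;
        double c = 0.0;              // running compensation for lost low-order bits
        for (double v : values) {
            double y = v - c;
            double t = total + y;    // low-order bits of y are lost here...
            c = (t - total) - y;     // ...and captured here for the next pass
            total = t;
        }
        return total;
    }

    public static void main(String[] args) {
        double[] data = new double[1000000];
        java.util.Arrays.fill(data, 0.1);
        // Agrees with the exact sum far better than a naive accumulation loop.
        System.out.println(sum(data));
    }
}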

> Maybe you have a different experience with "portability" you can
> comment on, compared to Carmack's (May 2006) experience with Java and
> cell phones?

I've only used Java in desktop and server environments, where the
problems are mitigated by mature virtual machines.

> http://www.armadilloaerospace.com/n.x/johnc/Recent%20Updates
>
> It turns out that I'm a lot less fond of Java for
> resource-constrained work. I remember all the little gripes I had with
> the Java language, like no unsigned bytes, and the consequences of
> strong typing, like no memset, and the inability to read resources into
> anything but a char array, but the frustrating issues are details down
> close to the hardware.
>
> The biggest problem is that Java is really slow. On a pure CPU / memory
> / display / communications level, most modern cell phones should be
> considerably better gaming platforms than a Game Boy Advance. With
> Java, on most phones you are left with about the CPU power of an
> original 4.77 MHz IBM PC, and lousy control over everything.
>
> I spent a fair amount of time looking at Java bytecode disassembly
> while optimizing my little rendering engine. This is interesting fun
> like any other optimization problem, but it alternates with a bleak
> knowledge that even the most inspired Java code is going to be a
> fraction of the performance of pedestrian native C code.
>
> Even compiled to completely native code, Java semantic requirements
> like range checking on every array access hobble it. One of the phones
> (Motorola i730) has an option that does some load time compiling to
> improve performance, which does help a lot, but you have no idea what
> it is doing, and innocuous code changes can cause the compilable
> heuristic to fail.

Every one of the issues you mention is the result of poorly implemented
virtual machines, or at least *young* virtual machines. There's no
real reason for Java code to execute more slowly than C or C++, and
benchmarks routinely show that it does indeed run as fast, and in some
cases faster. It all depends on the sophistication of your compiler, or
in the Java case, the JIT/HotSpot engine.
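
Carmack's range-checking complaint is a good illustration: the check is
a semantic requirement, but it doesn't have to be paid on every access.
A small sketch of the canonical loop shape that a mature JIT (HotSpot,
for one) can strip the per-access checks from -- exactly the sort of
optimisation a young phone VM won't bother with:

public class SumLoop {
    // Because the loop bound is the array's own length, the compiler can
    // prove every index is in range and drop the bounds checks entirely.
    static long sum(int[] a) {
        long total = 0;
        for (int i = 0; i < a.length; i++) {
            total += a[i];
        }
        return total;
    }

    public static void main(String[] args) {
        int[] data = new int[1000000];
        java.util.Arrays.fill(data, 3);
        System.out.println(sum(data)); // 3000000
    }
}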

> Write-once-run-anywhere. Ha. Hahahahaha. We are only testing on four
> platforms right now, and not a single pair has the exact same quirks.
> All the commercial games are tweaked and compiled individually for each
> (often 100+) platform. Portability is not a justification for the awful
> performance.

It's still a very young platform that's changing quickly. As a result,
there's little API standardisation across phones. Then there are the
vendors wanting to keep their own APIs under lock and key.

None of these are Java issues, or more generally VM-based platform
issues; they're just immature platforms growing fast and experiencing
pains as a result.

Matt

From: sicklittlemonkey on
mdj wrote:
> Michael wrote:
> > The biggest problem is that Java is really slow. ...
>
> Every one of the issues you mention is the result of poorly implemented
> virtual machines, or at least *young* virtual machines. There's no
> real reason for Java code to execute more slowly than C or C++, and
> benchmarks routinely show that it does indeed run as fast, and in some
> cases faster. It all depends on the sophistication of your compiler, or
> in the Java case, the JIT/HotSpot engine.

In case you didn't click the link, Michael doesn't make it clear that
he copied and pasted Carmack's actual comments, which are over a year
old. John also used NetBeans to develop, and criticized the IDE
performance. I use Eclipse, and see fewer slowdowns than I do in
Visual Studio Express. Java and .NET are the same thing really, and MS
is going for the same markets (gaming etc.).

Michael, you neglected to quote from John's recent post:
"O&E [game name] added a high end java version that kept most of the
quality of the high end BREW version on phones fast enough to support
it from carriers willing to allow the larger download. The download
size limits are probably the most significant restriction for gaming on
the high end phones."

This much more strongly implies that Matt is correct ...

> None of these are Java issues, or more generally VM-based platform
> issues; they're just immature platforms growing fast and experiencing
> pains as a result.

Cheers,
Nick.

From: Michael J. Mahon on
mdj wrote:
> Michael J. Mahon wrote:
>
>
>>The big "open door" opportunity is multiprocessor parallelism, but
>>we have invested so little in learning to apply parallelism that it
>>remains esoteric. (But AppleCrate makes it easy to experiment with! ;-)
>
>
> Parallelism is the big door, but I think the approaches that need to be
> explored cover a wider gamut than multiprocess parallelism, which, as
> you point out, has considerable latency issues.

And I would say that tools to help with the decomposition of algorithms
into parallel parts, while minimizing the effects of latency and limited
bandwidth, are the most important "tool frontier" today.

>>The popular "thread" model, in which all of memory is conceptually
>>shared by all threads, is a disaster for real multiprocessors, since
>>they will *always* have latency and bandwidth issues to move data
>>between them, and a "single, coherent memory image" is both slow
>>and wasteful.
>
>
> It is however an extremely efficient form of multiprocessing for
> applications with modest horizontal scaling potential.

And it offers unprecedented potential for data races and
nondeterministic behavior! ;-)

The thread model should have fundamentally segregated memory, so
that inter-thread references require special coordination commensurate
with their special risks and costs.
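
You can impose that discipline by hand even with today's languages: keep
each thread's working set private and make the only shared object an
explicit channel, so every inter-thread reference is visible at the
hand-off point. A minimal Java sketch of the idea, using
java.util.concurrent (the names are mine):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class HandOff {
    public static void main(String[] args) throws InterruptedException {
        // The queue is the *only* thing the two threads share; each one
        // otherwise works on its own private data.
        final BlockingQueue<int[]> queue = new ArrayBlockingQueue<int[]>(16);

        Thread producer = new Thread(new Runnable() {
            public void run() {
                for (int i = 0; i < 10; i++) {
                    int[] chunk = new int[] { i, i * i };
                    try {
                        queue.put(chunk);   // the explicit coordination point
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            }
        });

        producer.start();
        for (int i = 0; i < 10; i++) {
            int[] chunk = queue.take();     // blocks until a chunk arrives
            System.out.println(chunk[0] + " -> " + chunk[1]);
        }
        producer.join();
    }
}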

> There are essentially three basic models for parallelism that must be
> exploited:
>
> Multithread - in which one processor core can execute multiple threads
> simultaneously

This is the only case that can even approximate "uniform memory", since
at least most of the cache hierarchy will be common to all threads.

> Uniform Memory Multiprocessor - in which many processor cores share
> the same physical memory subsystem. Note that this is further divided
> into multiple cores in the same package, plus other cores in different
> packages, which have very different latency properties.

Even within one package, only lower cache levels will be common, so
this is not fundamentally different from your next case...

> Non Uniform Memory Multiprocessor - In this case the latency can vary
> wildly depending on the system configuration.
>
> Modern multiprocessor servers employ all three approaches, both on the
> same system board and via high-speed interconnects that join multiple
> system boards together. OSes must weigh the 'distance' to another CPU
> when considering a potential execution unit for a process.

All of your cases are actually the same, differing only in the level
of memory hierarchy (and its corresponding latency and bandwidth) that
is shared.

Any practical system will consist of all levels of connectivity, with
sharing at virtually all the different levels of the memory hierarchy.
And I would add another set of levels, in which there is no "memory
consistency" model, but message passing is the sharing mechanism.
This extends the multiprocessing model across networks.

> What's slow and wasteful depends a great deal on the task at hand.
> Multithreading used to be just as expensive as multiprocessing. But
> consider a current generation CPU designed for low power, high
> concurrency, the UltraSPARC T1.
>
> These units have execution cores capable of running 4 concurrent
> threads. In the highest-end configuration, there are 8 of these
> execution cores per physical processor. The cores have a 3.2GB/s
> interconnect. Each physical processor has 4 independent memory
> controllers, so you have non-uniform memory access on the one die.

Exactly. The general case is becoming the common case.

And multi-threaded processors are actually a very old idea. The
Honeywell 800 supported 8 "threads" (not called that, of course),
by executing instructions in "rotation", skipping slots that were
waiting for I/O to complete. At the time, it was considered to be
a hardware implementation of multiprogramming.

Today, multithreaded processors do much the same, but the "I/O wait"
has been replaced by the "cache miss".

The peripheral processor of the CDC 6600 was another salient example
of multi-threading. It was implemented in the same fast logic as
the central processor, but presented the appearance of 10 separate
PPs, each executing instructions at a tenth the rate of the central
processor. This had the effect of matching its instruction rate to
the latency of memory, and provided 10-fold concurrency for managing
I/O and memory transfers.

> Peak power consumption for this part is 79W at 1GHz. Considering you
> can in theory run 32 threads simultaneously, that's pretty impressive.
> How well you can exploit it depends on your application. An 'old
> school' web server, for instance, can only get 8-way parallelism on
> this chip. A new-school web server written in Java can get 32-way,
> assuming at any given time there are at least 32 concurrent requests
> for the same dynamic page, or 32 static requests.
>
> It's getting to the stage where the power consumed by driving I/O over
> a pin on an IC package is significant, so expect to see systems like
> this grow in popularity.

This was always an inevitable result of higher levels of integration.
As soon as a significant amount of cache can be shared on the chip,
it becomes advantageous to adorn it with multiple processors.
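
Returning to the web server example above: the "new school" 32-way case
is essentially the standard executor idiom -- size a worker pool to the
hardware's thread count and let requests queue behind it. A toy sketch
(the 32 is just the T1 figure quoted above, and the "request" is a
stand-in):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class RequestPool {
    public static void main(String[] args) {
        // One worker per hardware thread: 32 on the top-end T1 configuration.
        ExecutorService pool = Executors.newFixedThreadPool(32);

        for (int i = 0; i < 100; i++) {
            final int requestId = i;
            pool.execute(new Runnable() {
                public void run() {
                    // stand-in for rendering a dynamic page
                    System.out.println("served request " + requestId
                            + " on " + Thread.currentThread().getName());
                }
            });
        }
        pool.shutdown();
    }
}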

> Interestingly, you can download a VHDL description of this part from
> Sun, and synthesise it on one of the higher-end FPGAs. Oh how I wish I
> had access to hardware like that!
>
> A top-of-the-range Sun server uses parts that have 4 execution threads
> per core, four cores per board, each with its own memory
> controller+memory, and up to 18 boards per system (coupled together by
> a 9GB/s crossbar switch). Exploiting all the resources in this system
> and doing it efficiently is *hard*, as it employs every different style
> of parallelism I mentioned before within the same 'machine'.
>
> And I haven't even considered computing clusters!

Exactly. And the full hierarchy of latency and bandwidth needs to be
addressed by both measurement tools and by behavioral models for code
partitioning and optimization. *This* is the tools frontier that I see,
with huge potential payoffs.

> The way it's panning out is that real multiprocessors are a disaster
> for parallelism. The problem is that essentially any task that can be
> parallelised needs to process the same data that it does in serial
> form. Because of this, you can utilise the same buses, I/O subsystems,
> and take advantage of 'nearness' to allow some pretty incredible IPC
> speeds.

No, that problem corresponds to a *very poor* partitioning of the
problem onto parallel processors--ironically, one that is encouraged
by current languages' simplistic "thread" models of parallel computing.

Let me give a little example.

Maximum efficiency of resource utilization is obtained by "pooling"
all of a particular resource together so that all requestors obtain
it by withdrawing from one pool. Then, you're not "out" of that
resource until you are *really* out of it.

But this creates a huge point of serial contention, since all
requestors must lock the pool, allocate some resource, then unlock
the pool. It is as if a large cafeteria put one giant salt shaker
in the middle of the room for all to share.

An alternative resource allocation scheme which is well adapted to
multiple concurrent users and a hierarchy of latencies is to provide
multiple local pools, shared by a small number of users at essentially
the same level of connection latency. This is like the more common
case of putting a small salt shaker within arm's reach of each small
group of diners.

Of course, there is still the issue of resource balancing (when the
resource is really uniform--not like memory), and this can be done
by periodically re-balancing the amounts of resource in the local
pools, and across hierarchical levels if necessary.
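
Here's a rough Java sketch of that local-pool idea, with the periodic
rebalancing left out; the names and the batch size are made up for
illustration. Each thread draws from a small private cache and only
touches the contended global pool once per BATCH allocations:

import java.util.concurrent.atomic.AtomicLong;

public class HierarchicalPool {
    private static final long BATCH = 1000;
    private final AtomicLong globalUnits;           // the one big salt shaker
    private final ThreadLocal<long[]> localUnits =  // the per-thread shakers
        new ThreadLocal<long[]>() {
            protected long[] initialValue() { return new long[] { 0 }; }
        };

    public HierarchicalPool(long total) {
        globalUnits = new AtomicLong(total);
    }

    public boolean allocate() {
        long[] local = localUnits.get();
        if (local[0] == 0 && !refill(local)) {
            return false;                           // now we're *really* out
        }
        local[0]--;
        return true;
    }

    private boolean refill(long[] local) {
        while (true) {
            long avail = globalUnits.get();
            if (avail == 0) return false;
            long claim = Math.min(BATCH, avail);
            if (globalUnits.compareAndSet(avail, avail - claim)) {
                local[0] = claim;
                return true;
            }
            // lost a race with another thread's refill; just retry
        }
    }
}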

This is the kind of thinking that must go into the next generation
of systems, and it is very different from the thinking that has
inspired the systems and tools of today.

> Multithreading approaches are very important on these systems. In fact,
> multithreading is important even on systems with single execution
> units. The gap between I/O throughput and processing throughput means
> you get a certain degree of 'parallelism' even though you can only run
> one thread at a time. Free performance improvement if you employ
> parallel design techniques.
>
> Of course, there are certain heavily compute-bound applications where
> the degree of IPC is very low, and massive parallelism is possible
> regardless of the interconnect used, as IPC constitutes a relatively
> small part of the workload. For the rest of the cases though where lots
> of data is being consumed, systems that allow low-overhead IPC through
> multithreading are the way to go.

And a tiny fraction of today's tools and designers are even out of
kindergarten on issues of partitioning and locality.

Object orientation is almost totally orthogonal, if not antithetical,
to the *real* problems of highly parallel computing, which is the
platform of the future. I expect we'll figure this out sometime
in the next decade. ;-(

-michael

Parallel computing for 8-bit Apple II's!
Home page: http://members.aol.com/MJMahon/

"The wastebasket is our most important design
tool--and it is seriously underused."
From: Michael J. Mahon on
Michael wrote:
> Michael J. Mahon wrote:
>
>>The Apple ROM was not significantly changed until the //e, where much
>>of the actual code was placed in a bank-switched area. Much of the F8
>>region became stubs at documented entry points vectoring to the actual
>>routines. Updating the F8 region of the ROM was known to be a minefield
>>of compatibility issues.
>>
>>The notion of defining the ROM entry points more architecturally was
>>not widespread at the time the Apple II was designed, and the impact
>>of the subsequent loss of control over ROM code became a problem.
>
>
> This is related to another hobby of mine -- the HP48 calculator (which,
> in a lot of ways, has the same "Apple" feel). It had the same problem,
> although it wasn't as bad, since certain 'entry points' into the system
> ROM became well known; e.g. SYSEVAL on the HP28, which exposed the
> problem to the general public. (You know how curious engineers/geeks
> are when something has "undocumented" written all over it. ;-)
>
> I'll agree that it was not widespread, but it was known. The question
> is when?
> Woz did work at HP -- when did the calculator people come up with a
> solution? And did Woz know about it?

Woz was at HP in the early days of HP calculators, but before the
programmable handheld days.

Entry point vectors were well-known since the earliest days of
computing, but tended to be used only when their benefits were
seen as justifying the cost of the vector in RAM/ROM.

Woz did not envision the widespread popularity of the Apple II, and
so did not avail himself of such protective techniques.

And, of course, it's not really "protection" if you can address
anywhere in the ROM anyway. So he just published the entry points
he thought were useful and likely to be preserved, assuming that
fellow enthusiasts would take the information to heart.

>>(I've always wondered how much "compatibility issues" and how much
>>"renegotiation issues" factored into the decision to never update
>>the Applesoft ROMs to fix bugs...)
>>
>>Later systems used a combination of less documentation and more
>>complexity to make calls into the middle of ROM less likely. Still
>>not an ideal solution, but one well adapted to the Apple II. ;-)
>
>
> It certainly looks like everyone was too busy writing their own code,
> since the ROM code wasn't all that useful.

Have you tried writing 80-column scrolling? ;-)

-michael

Parallel computing for 8-bit Apple II's!
Home page: http://members.aol.com/MJMahon/

"The wastebasket is our most important design
tool--and it is seriously underused."