From: nmm1 on
In article <6830060f-e8a6-4ebb-a0ed-9bc42f14e319(a)5g2000yqj.googlegroups.com>,
Michael S <already5chosen(a)yahoo.com> wrote:
>>
>> And that's where you get the simplification.  No fiendishly complicated
>> FLIH, horrible and inadequate run-time system support, and so on.
>
>I think you are wrong.
>This behavior for [async] interrupts (i.e. all instructions before
>return address are fully completed; all instructions at and above
>return address are not started) is architected on all current ISAs
>that could be considered general-purpose.

That was not true when I investigated this area, and my experiments
confirmed that it wasn't the case in practice, either. I have just
looked at the Intel x86 architecture manual, to see if things have
changed, and they haven't. See, for example, the NOTE at the end
of 6.5 and section 6.6 of Intel® 64 and IA-32 Architectures Software
Developer's Manual Volume 3A: System Programming Guide, Part 1.

You may have missed the subtlety that the guarantee that an interrupt
handler is called between two instructions needs a LOT of logic that
my scheme does not. You may also have missed the subtle gotcha that
synchronising the view of memory on a parallel system is not a direct
consequence of taking an interrupt between two instructions.

>I, at least, am not aware of better ways for OS-level
>fragmentation-free memory allocation, especially when, upfront, the
>app is not sure about the real size of the contiguous buffer that it
>allocates and there is a big difference between max size and typical
>size. It (demand paging) also works very well for application stack(s).

Well, I am, and so are lots of other people. You are confusing
(relocatable) virtual memory with demand paging. There is absolutely
no difficulty in compiled code trapping stack overflow and extending
it, without relying on any form of interrupt, for example. I have
implemented that for several languages, including C. And buffer
extensibility is similar.
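
For concreteness, here is a minimal sketch, in C with invented names, of
the kind of prologue check a compiler can emit so that overflow is caught
by ordinary code rather than by a guard-page trap; a real implementation
would be generated inline and would follow the platform ABI:

  /* Illustrative only: the kind of prologue check a compiler can emit
     so that stack overflow is detected and handled by ordinary code,
     with no interrupt or page-fault trap involved.  All names invented. */

  #include <stdio.h>
  #include <stddef.h>

  /* Per-thread stack descriptor, set up when the thread is created. */
  static __thread char *stack_limit;       /* lowest usable address */

  /* Runtime routine: a real implementation would map more memory (or
     chain a new segment) and adjust stack_limit; this stub just reports. */
  static void grow_stack(size_t frame_size)
  {
      fprintf(stderr, "stack check: need %zu more bytes\n", frame_size);
  }

  /* Conceptually what the compiler inserts at each function entry: */
  #define STACK_CHECK(frame_size)                               \
      do {                                                      \
          char probe;                                           \
          if (&probe - stack_limit < (ptrdiff_t)(frame_size))   \
              grow_stack(frame_size);                           \
      } while (0)

  static void example(void)
  {
      STACK_CHECK(1 << 20);        /* pretend this frame needs 1 MiB */
      /* ... normal function body ... */
  }

  int main(void)
  {
      char anchor;
      stack_limit = &anchor - 64 * 1024;   /* pretend 64 KiB remain  */
      example();                           /* takes the slow path    */
      return 0;
  }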



Regards,
Nick Maclaren.
From: Morten Reistad on
In article <hr6gmd$oir$1(a)smaug.linux.pwf.cam.ac.uk>, <nmm1(a)cam.ac.uk> wrote:
>In article <6830060f-e8a6-4ebb-a0ed-9bc42f14e319(a)5g2000yqj.googlegroups.com>,
>Michael S <already5chosen(a)yahoo.com> wrote:
>>>

>You may have missed the subtlety that the guarantee that an interrupt
>handler is called between two instructions needs a LOT of logic that
>my scheme does not. You may also have missed the subtle gotcha that
>synchronising the view of memory on a parallel system is not a direct
>consequence of taking an interrupt between two instructions.

You all seem to be trapped in the single-processor world, despite
valiant efforts to break out.

We have already concluded that in a wide (tens of processors or more)
layout, a message-passing architecture with a fifo, using something
like hyperchannel and a fast mux, may be the signalling method of
choice. Further, we have identified three "walls": the "memory",
"power" and "synchronicity" walls.

Nick is perfectly correct in going for a "less is more" cpu design,
getting more cpus online, and more cache. A fast message-passing
fifo, conceptually similar to hyperchannel(s) and a routing mux, can
do what we previously did with interrupts. QNX did the design for
this in 1982, and it proved very viable.

Hardware then has to send and receive messages on this bus. This
is not very different from a SATA, etherchannel, or inter-cache
protocol on modern cpus. We then need to have one or more cpus
reading from this channel, performing kernel functions. And we
need some "channel-to-SATA" and "channel-to-PCI" etc. bridges.

But dispensing with interrupts does not necessarily mean
ditching demand paging. It just means the hardware must be
sufficiently intelligent to freeze the process, send a message
and wait for the reply; depending on the reply, it continues or fails.

Nothing particularly fancy there; we already do this for a handful
of layers of cache, except that the channel is internal to the cpu.
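
Sketched from the other side, again with invented names and glossing over
how the hardware actually parks the faulting process, a translation miss
would look something like this:

  #include <stdint.h>

  struct page_request {
      uint32_t requesting_cpu;
      uint64_t faulting_address;
  };

  struct page_reply {
      int status;                  /* 0: page is now mapped; else: error */
  };

  /* Hypothetical channel primitives; the requesting cpu can hibernate
     in wait_page_reply() until a kernel cpu answers. */
  void send_page_request(const struct page_request *req);
  void wait_page_reply(struct page_reply *rep);

  /* Conceptually what the memory hardware (or firmware) does on a miss,
     with the faulting process already frozen: */
  int handle_translation_miss(uint32_t cpu, uint64_t addr)
  {
      struct page_request req = { cpu, addr };
      struct page_reply   rep;

      send_page_request(&req);
      wait_page_reply(&rep);

      return rep.status;           /* 0: resume the process; else: fail it */
  }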

As long as a cpu that is waiting on a message can hibernate, and the
message system is fast and low-latency, I am willing to bet it can beat
an interrupt-based system.

>>I, at least, am not aware of better ways for OS-level
>>fragmentation-free memory allocation, especially when, upfront, the
>>app is not sure about the real size of the contiguous buffer that it
>>allocates and there is a big difference between max size and typical
>>size. It (demand paging) also works very well for application stack(s).
>
>Well, I am, and so are lots of other people. You are confusing
>(relocatable) virtual memory with demand paging. There is absolutely
>no difficulty in compiled code trapping stack overflow and extending
>it, without relying on any form of interrupt, for example. I have
>implemented that for several languages, including C. And buffer
>extensibility is similar.

Indeed. The important part there is to keep the instruction
stream rolling.

We all need to unthink the CPU as the core. We have reached a point
of sharply diminishing returns in cpu performance, and further
cpu fanciness will cost more than it is worth in terms of power, and
will be substantially hindered by the memory and synchronicity walls.
We are probably well past the optimum point for cpu design
by now, and need to back down quite a bit.

Rather, think about how we can handle the cache-memory-i/o interconnects
well, save energy, and address synchronicity. The latter does not
need full, global synchronous operation except in a few, very rare
cases. A lock/semaphore manager will do nicely in most cases, where
defining a sequence and passing tokens is more important than absolute time.

QNX got that right, and that is an important part of the neatness
of that OS.
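
As a rough sketch of that idea, with invented names (this is not the QNX
interface), acquiring a lock becomes a message to a manager cpu, and the
order in which grants come back, not any global clock, is what serialises
the contenders:

  #include <stdint.h>

  enum lock_op { LOCK_ACQUIRE, LOCK_RELEASE };

  struct lock_msg {
      uint32_t op;             /* one of lock_op */
      uint32_t lock_id;
      uint32_t client_cpu;
  };

  /* Hypothetical channel primitives, as in the earlier sketches. */
  void send_to_lock_manager(const struct lock_msg *m);
  void wait_for_grant(uint32_t lock_id);     /* blocks; cpu may hibernate */

  void lock_acquire(uint32_t lock_id, uint32_t my_cpu)
  {
      struct lock_msg m = { LOCK_ACQUIRE, lock_id, my_cpu };
      send_to_lock_manager(&m);
      wait_for_grant(lock_id);      /* the manager replies in grant order */
  }

  void lock_release(uint32_t lock_id, uint32_t my_cpu)
  {
      struct lock_msg m = { LOCK_RELEASE, lock_id, my_cpu };
      send_to_lock_manager(&m);     /* no reply needed */
  }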

So, if we need to build a hypervisor for windows, fine. And if
windows cannot license-wise run on more than 2 cpus, utilise the
rest of the cpus for i/o, cache, paging, running video, generating
graphics, etc. We probably just need to make a proof of concept
before Microsoft obliges with licenses. This is pretty close to
what we do with GPUs anyway.

Speaking of GPUs: what if we gave them an MMU, and access to a
cache/memory interconnect? Even if all non-cache references have
to go to a command channel, if that is sufficiently fast we can
do "paging" between gpu memory and dram.

Yes, it is wandering off the subject a bit, but as a
"gedankenexperiment": if the GPUs just have a small, fast memory, but
we can handle "page faults" through an mmu, and bring pages in and out
of cache at hyperchannel speeds, we can use those gpus pretty much like
an ordinary cpu.

-- mrr
From: Tim McCaffrey on
In article <hr6gmd$oir$1(a)smaug.linux.pwf.cam.ac.uk>, nmm1(a)cam.ac.uk says...
>
>In article <6830060f-e8a6-4ebb-a0ed-9bc42f14e319(a)5g2000yqj.googlegroups.com>,
>Michael S <already5chosen(a)yahoo.com> wrote:
>>>
>>> And that's where you get the simplification.  No fiendishly complicated
>>> FLIH, horrible and inadequate run-time system support, and so on.
>>
>>I think you are wrong.
>>This behavior for [async] interrupts (i.e. all instructions before
>>return address are fully completed; all instructions at and above
>>return address are not started) is architected on all current ISAs
>>that could be considered general-purpose.
>
>That was not true when I investigated this area, and my experiments
>confirmed that it wasn't the case in practice, either. I have just
>looked at the Intel x86 architecture manual, to see if things have
>changed, and they haven't. See, for example, the NOTE at the end
>of 6.5 and section 6.6 of Intel® 64 and IA-32 Architectures Software
>Developer's Manual Volume 3A: System Programming Guide, Part 1.
>
>You may have missed the subtlety that the guarantee that an interrupt
>handler is called between two instructions needs a LOT of logic that
>my scheme does not. You may also have missed the subtle gotcha that
>synchronising the view of memory on a parallel system is not a direct
>consequence of taking an interrupt between two instructions.
>
>>I, at least, am not aware of better ways for OS-level
>>fragmentation-free memory allocation, especially when, upfront, the
>>app is not sure about the real size of the contiguous buffer that it
>>allocates and there is a big difference between max size and typical
>>size. It (demand paging) also works very well for application stack(s).
>
>Well, I am, and so are lots of other people. You are confusing
>(relocatable) virtual memory with demand paging. There is absolutely
>no difficulty in compiled code trapping stack overflow and extending
>it, without relying on any form of interrupt, for example. I have
>implemented that for several languages, including C. And buffer
>extensibility is similar.
>

So, how is this different from the CDC 6600?

The OS was in the PPs (although MSU moved it (mostly) back into the
CPU). The I/O was handled by the PPs (the CPU couldn't do I/O), there
were interrupts, and instruction faults, but they weren't precise (I
think they were thought of more as guidelines... argh :) ).

And there were no page faults (no paging).

If you think about it, you should be able to implement an entire (including
memory) CDC 7600 on a single chip these days. You can use DDR3 as a paging
device. It might even run at 4 GHz...


- Tim

From: Robert Myers on
Rick Jones wrote:

> Robert Myers <rbmyersusa(a)gmail.com> wrote:
>> Genetic programming is only one possible model.
>
>> The current programming model is to tell the computer in detail what
>> to do.
>
>> The proposed paradigm is to shift from explicitly telling the
>> computer what to do to telling the computer what you want and
>> letting it figure out the details of how to go about it, with
>> appropriate environmental feedback, which could include human
>> intervention.
>
> Sounds like child rearing. I could handle a computer behaving like my
> nine year-old, at least most of the time. I'm not sure I want my
> computer behaving like my five year-old :)
>

It's a fair analogy, although computers have yet to reach the learning
capacity of infants.

"I am a HAL 9000 computer. I became operational at the H.A.L. plant in
Urbana, Illinois on the 12th of January 1992. My instructor was Mr.
Langley, and he taught me to sing a song."

It was a naively ambitious view of computers, but I think it was more
right than the Deist watchmaker-programmer God view we now have.

Robert.
From: Quadibloc on
On Apr 27, 1:27 am, n...(a)cam.ac.uk wrote:

> Well, actually, I blame the second-raters who turned some seminal
> results into dogma.
>
> None of Turing, Von Neumann or the best mathematicians and computer
> people would ever say that the model is the last word, still less
> that hardware architecture and programming languages must be forced
> into it.

The reason that, so far, parallel architectures are used to execute
programs which basically were written for a von Neumann machine, but
chopped into bits that can run in parallel, is not so much the fault
of a blind dogmatism as it is of the absence of a clear alternative.

While there are other models out there, such as genetic programming
and neural nets, (hey, let's not forget associative memory - Al Kossow
just put the STARAN manual up on bitsavers!) at the moment they're
only seen as applicable to a small number of isolated problem domains.

A computer tended to be conceived of as a device which automatically
does what people would have done by hand, perhaps with a desk
calculator or a slide rule or log tables, whether for a scientific
purpose or for accounting. How we addressed these problem domains
gradually evolved through the use of unit record equipment to the use
of digital computers. (During that evolution, though, another non-von
Neumann model was encountered - the mechanical differential analyzer
of Vannevar Bush, and its successors the analog computer and the
digital differential analyzer such as MADDIDA.)

So I go further: not only don't I "blame" Turing and von Neumann... I
also don't "blame" everyone else who came later for failing to come up
with some amazing new revolutionary insight that would transform how
we think about computing. Because unlike the von Neumann model, this
new insight would have to involve a way to fundamentally transform
algorithms from their old paper-and-pencil form.

Now, it _is_ true that there was a von Neumann bottleneck back when a
mainframe computer was considered impressive when it had 32K words of
memory (i.e. a 7090 with a filled address space) and that had become
worse by the time of the IBM 360/195 (up to 4 megabytes of regular
main memory, although you could fill the 16 megabyte address space if
you used bulk core).

When it comes to today's PCs, with possibly 2 gigabytes of DRAM, and
perhaps 2 megabytes of L2 cache on chip... the von Neumann bottleneck
has grown to grotesque proportions.

An "intelligent RAM" architecture that included a very low-power
processor for every 128 kilobytes of DRAM would provide considerable
additional parallel processing power without requiring as drastic a
change in how we write programs as, for example, a switch to a
neural-net model would. But the required change would likely still be
so drastic as to lead to this power usually lying unused.

It would be different, of course, if one PC were time-shared between
hundreds of thin clients - thin clients that were accessing it in
order to obtain the processing power of a 7090 mainframe. The trouble
is, of course, that this doesn't make economic sense - that level of
processing power is cheaper than the wires needed to connect to it at
a distance.

So, instead, IRAM ends up being a slow, but flexible, associative
memory...

John Savard