From: John Savard on
On Sun, 4 Sep 2005 01:15:40 +0200, "Skybuck Flying" <nospam(a)hotmail.com>
wrote, in part:

>To keep the system responsive windows
>xp would need to shorten each time slice and thereby create more context
>switches thus leaving less processing power to do anything useful.

To avoid context switching, you don't need to have another whole core.
Just use register renaming, and have a multi-threaded CPU.

First, make the fastest possible single core you can.

Then, make it multithreaded, so that the only context switches you have
are the unavoidable ones due to procedure calls.

Then, if you still have die area, add cores.

John Savard
http://home.ecn.ab.ca/~jsavard/index.html
http://www.quadibloc.com/index.html
From: Anne & Lynn Wheeler on
jsavard(a)excxn.aNOSPAMb.cdn.invalid (John Savard) writes:
> To avoid context switching, you don't need to have another whole
> core. Just use register renaming, and have a multi-threaded CPU.
>
> First, make the fastest possible single core you can.
>
> Then, make it multithreaded, so that the only context switches you
> have are the unavoidable ones due to procedure calls.
>
> Then, if you still have die area, add cores.

that was sort of the 370/195 dual i-stream proposal from the early
70s. the issue was that most codes kept the pipeline only about
half-full. having dual instruction counters and dual registers
... with pipeline tagging registers and i-stream had a chance of
maintaining close to aggregate, peak thruput of the pipeline.
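
A rough back-of-the-envelope sketch of why a second i-stream helps a half-full pipeline. This is only a toy model, not the 370/195 design: the function name, the single issue slot per cycle, and the `p_ready` parameter are all assumptions made for illustration. It brackets utilization between two cases: streams that stall independently of each other, and streams whose stalls never coincide.

```python
def utilization_bounds(p_ready, n_streams):
    """Return (independent, ideal) issue-slot utilization for n i-streams.

    p_ready is the assumed fraction of cycles in which one stream has an
    instruction ready to issue (0.5 models a "half-full" pipeline).
    """
    # Streams stall independently: a slot is wasted only if all are stalled.
    independent = 1.0 - (1.0 - p_ready) ** n_streams
    # Streams stall at perfectly complementary times: demand simply adds up.
    ideal = min(1.0, p_ready * n_streams)
    return independent, ideal


single = utilization_bounds(0.5, 1)  # one i-stream
dual = utilization_bounds(0.5, 2)    # two i-streams sharing the pipeline
print(single, dual)
```

Under these assumptions one stream leaves the pipeline half used, while two streams land somewhere between three quarters and fully used, depending on how often their stalls overlap.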

amdahl in the early 80s ... had another variation on that.

running straight virtual machine hypervisor ... resulted in context
switch on privilege instructions and i/o interrupts (between the
virtual machine and the virtual machine hypervisor), including saving
registers, other state, etc (and then restoring).

starting with virtual machine assist on the 370/158 and continuing
with ECPS on the 138/148 to the LPAR support on modern machines ...
more and more of the virtual machine hypervisor was being implemented
in the microcode of the real machine ... aka the real machine
instruction implementation (for some instructions) would recognize
whether it was in real machine state or virtual machine state and
modify the instruction decode and execution appropriately. one of the
issues was that microcode tended to be a totally different beast and
with little in the way of software support tools.
http://www.garlic.com/~lynn/subtopic.html#mcode

amdahl 370 clones implemented an intermediate layer called macrocode
that effectively looked and tasted almost exactly like 370 ... but had
its own independent machine state. this basically allowed almost
exactly moving some virtual machine hypervisor code to macrocode level
... w/o the sometimes difficult translation issues ... while
eliminating standard context switching overhead (register and other
state save/restore).

it was also an opportunity to do some quick speed up. standard 370
architecture allows for self-modifying instructions ... before stuff
like speculative execution (with rollback) support ... the checking
for catching whether the previous instruction had modified the current
(following) instruction being decoded & scheduled for execution
... frequently doubled 370 instruction elapsed time processing.
macrocode architecture was specified as not supporting self-modifying
370 instruction streams.

i've frequently claimed that the 801/risc formulation in the mid-70s,
http://www.garlic.com/~lynn/subtopic.html#801

was an opportunity to do the exact opposite of other stuff in the period:
combination of the failure of future system project (extremely complex
hardware architecture)
http://www.garlic.com/~lynn/subtopic.html#futuresys

and heavy overhead performance paid by 370 architecture supporting
self-modifying instruction stream and very strong memory coherency
(and overhead) in smp implementations.

separating instruction and data caches and providing for no coherency
support between stores to the data cache and what was in the
instruction cache ... precluded even worrying about self-modifying
instruction operation. no support for any kind of cache coherency
... down to the very lowest of levels ... also precluded such support
for multiprocessing operations.

i've sometimes observed that ibm/motorola/apple/etc somerset was sort
of taking the rios risc core and marrying it with 88k cache coherency.
recent, slightly related posting
http://www.garlic.com/~lynn/2005o.html#37 What ever happened to Tandem and NonStop OS?

--
Anne & Lynn Wheeler | http://www.garlic.com/~lynn/
From: robertwessel2@yahoo.com on

Skybuck Flying wrote:
> <robertwessel2(a)yahoo.com> wrote in message
> news:1125731066.086811.73210(a)z14g2000cwz.googlegroups.com...
> >
> > The number of context switches is not typically reduced in any
> > significant way by going to a dual (or multi-) CPU system. The vast
> > majority of context switches occur because a thread blocks or calls on
> > another thread or context for a service (really the same thing), and
> > those are *not* impacted by the number of CPUs in the system. Context
> > switches due to time slice expirations *can* be reduced by a multiple
> > CPU configuration, but only if the number of threads tends to be close
> > to the number of CPUs. If there are many runnable threads, and the time
> > slice interval does not change, even that can be slower on the dual
> > processor system, since slices will occur after only half the number of
> > instructions. In any event, time slice triggered context switches are
> > *very* rare. At best a few tens to a few hundred per second.
>
> As far as I can tell windows xp doesn't care how many threads are running,
> it will simply give each thread the same amount of time slice thereby
> lagging the whole system.


Windows, like pretty much any other OS, will dole out CPU time in
time-slice increments to threads that run CPU bound, but only rarely
does a thread (in most systems) ever run out a time slice before
blocking on some event (and thus threads are rarely CPU bound).
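
A minimal sketch of that distinction, assuming a simple round-robin scheduler in which some other thread is always ready to run. The function name, the burst lengths, and the 20 ms slice are made up for illustration and are not Windows internals. Each entry in `bursts_ms` is the CPU time a thread wants before it blocks on an event.

```python
def count_switches(bursts_ms, slice_ms):
    """Count context switches by cause for a list of CPU bursts.

    bursts_ms: CPU time each thread wants before it blocks on an event.
    slice_ms:  time-slice length (a hypothetical value, not a Windows constant).
    Returns (voluntary, preempted).
    """
    voluntary = 0  # thread blocked before its slice ran out
    preempted = 0  # slice expired while the thread still wanted the CPU
    for burst in bursts_ms:
        remaining = burst
        while remaining > slice_ms:
            preempted += 1
            remaining -= slice_ms
        voluntary += 1  # the burst ends with the thread blocking
    return voluntary, preempted


# Short bursts typical of threads waiting on I/O or messages.
mostly_blocking = count_switches([1, 2, 0.5, 3, 1], slice_ms=20)
# One long compute burst.
cpu_bound = count_switches([100], slice_ms=20)
print(mostly_blocking, cpu_bound)
```

With bursts much shorter than the slice, every switch is voluntary; only the long compute burst ever sees its slice expire.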


> That's why the number of context switches doesn't
> increase as the number of threads increase.


This is perhaps true for a collection of CPU bound threads, but the
vast majority of threads in a system are (typically) not.


> Windows xp makes absolutely no
> attempt to keep the system responsive. To keep the system responsive windows
> xp would need to shorten each time slice and thereby create more context
> switches thus leaving less processing power to do anything useful.


Different versions of Windows use different time-slice intervals. And
there are some semi-documented ways that you can change that. Windows
takes considerable effort to keep the system responsive (CPU bound
threads get a priority reduction, foreground applications get a boost,
etc.), but in the presence of multiple CPU bound threads attached to
message queues (which is an application design problem), there's not a
whole lot you can do.

In short, time slice intervals have only a little to do with system
responsiveness.

From: Jan Vorbrüggen on
> Different versions of Windows use different time-slice intervals. And
> there are some semi-documented ways that you can change that. Windows
> takes considerable effort to keep the system responsive (CPU bound
> threads get a priority reduction, foreground applications get a boost,
> etc.), but in the presence of multiple CPU bound threads attached to
> message queues (which is an application design problem), there's not a
> whole lot you can do.

I have never seen this work properly in W2K...once you have one compute-
bound thread running, the GUI becomes totally unresponsive. Nextstep, now
that is another matter.

Jan
From: Anne & Lynn Wheeler on
"robertwessel2(a)yahoo.com" <robertwessel2(a)yahoo.com> writes:
> Windows, like pretty much any other OS, will dole out CPU time in
> time-slice increments to threads that run CPU bound, but only rarely
> does a thread (in most systems) ever run out a time slice before
> blocking on some event (and thus threads are rarely CPU bound).

i had done dynamic adaptive scheduling back in the 60s as an
undergraduate. this was sometimes referred to as fairshare (or
wheeler) scheduler ... because the default scheduling policy
was fairshare

basically light-weight interactive tasks got shorter quanta than more
heavy-weight longer running tasks. advisory deadline was calculated
proportional to size of the quanta and the recent resource
utilization. quanta could span certain short i/os (disk, etc) ... so
heavy utilization wasn't necessarily restricted to absolutely pure cpu
bound (just reasonably high rate of cpu consumption).

light-weight interactive tasks would tend to have more frequent quanta
with shorter deadlines ... and heavy weight tasks would have less
frequent larger quanta. interactive tasks would appear to be more
responsive if they could complete within a few of the shorter quanta
and/or if they haven't been using a lot of resources recently.
basically the objective was to uniformly control the overall rate
of resource consumption (regardless of the quanta size).
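
A simplified sketch of the idea, not the cp67 code: it is a generic earliest-deadline scheme in which each task's next deadline is pushed out in proportion to the quantum it just used. The recent-utilization term is left out for brevity, and the task names, quantum sizes, and the `simulate` function are hypothetical.

```python
import heapq


def simulate(quanta, total_time):
    """Run tasks by earliest advisory deadline under an equal-share policy.

    quanta: dict mapping task name -> quantum size (assumed values).
    Returns dict mapping task name -> (cpu_used, dispatches).
    """
    n = len(quanta)
    # Heap entries are (deadline, name); every task starts out due at time 0.
    heap = [(0, name) for name in sorted(quanta)]
    heapq.heapify(heap)
    used = {name: 0 for name in quanta}
    dispatches = {name: 0 for name in quanta}
    now = 0
    while now < total_time:
        deadline, name = heapq.heappop(heap)
        q = quanta[name]
        now += q
        used[name] += q
        dispatches[name] += 1
        # Deadline advances in proportion to the quantum: after using q of
        # cpu, the task is not due again until a 1/n share has covered q.
        heapq.heappush(heap, (deadline + q * n, name))
    return {name: (used[name], dispatches[name]) for name in quanta}


result = simulate({"interactive": 1, "heavy": 10}, total_time=200)
print(result)
```

In this run both tasks end up with the same total cpu, but the interactive task is dispatched ten times as often in quanta one tenth the size, which is the sense in which the rate of consumption is controlled regardless of quantum size.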

the other issue was that this was done as modifications to existing
cp67 system that had an extremely complex scheduler that wasn't
directly controlling resource utilization ... just moving priorities
up & down and itself could consume 10-15 percent of total cpu
utilization. so a major objective of the scheduler change was to not
only implement effective resource consumption supporting a variety of
scheduling policies ... but do it with as close as possible to zero
pathlength.
http://www.garlic.com/~lynn/subtopic.html#fairshare

--
Anne & Lynn Wheeler | http://www.garlic.com/~lynn/