From: Eric P. on
I was unable to send this back when we were discussing this in the
summer because my ISP dropped newsgroup support and it took a while
to find a free server that allows posting. Belated as it is, since
I put in the effort I figured I might as well send it anyway.

Anne & Lynn Wheeler wrote:
>
> grenoble modified they same cp67 for "working set" dispatcher and
> local LRU running on 1mbyte 360/67, 154 "pageable pages" after fixed
> kernel requirements (basically 50 percent more real storage for
> application paging). they published a paper on the work in the early
> 70s in cacm
>
> J. Rodriquez-Rosell, The design, implementation, and evaluation of a
> working set dispatcher, cacm16, apr73

I finally got a copy of this paper. The description is a bit vague and
confusing. While I accept your assertion that Cambridge outperformed
Grenoble, after reading this paper I see no basis for generalizing
from those results to conclude "global is superior to local".

This paper discusses adding a Denning style working set to their
existing system. A Denning working set manager is similar
to ***swapping*** and independent of global vs local issues.

This paper only shows that Grenoble's working set dispatcher,
a combination of scheduler and bulk memory manager, improved on
the CP67 scheduler and eliminated CP67's tendency to thrash. In fact
they explicitly state that this dispatcher can be applied to many
page replacement policies: random, FIFO, LRU or other. Though the
mechanism is quite different, the result looks similar to VMS swapping.

IIUC their working set is determined by scanning, at regular intervals,
a process's PTE Used bits, then resetting the bits. Pages not used
are candidates for replacement and the PTE is marked Invalid.
If touched again it soft-faults back into the address space.
By this it dynamically estimates the WS size for each process.
If the free list gets too small, a victim process is selected from
a FIFO list and the whole working set is tossed onto the free list
(equivalent to outswapping). If there is enough free pages for a
process WS size to fit, an outswapped process is inswapped.

As for the page replacement algorithm, they state their design options
were restricted by CP67 v3.0 existing structure. Prior to the Grenoble
modifications, on a page fault it would look for an unused page,
and if not available selects one from any active process at _Random_.
("it will choose the first page it finds in what essentially amounts
to random replacement"). (They also don't say how the unused page is
selected, which is as important as the selection of stealing page).

Their wording is somewhat vague but if it does steal pages from
any processes, then this is a ***global*** replacement policy.

With Grenoble mods, in order to track the process working sets it
does not use a linked list. It uses the working set determined
above to decide which pages are in use and which are not.
Shared pages are not part of this mechanism and are locked down.
This separates the pages into two piles, those in the working sets
and those not. IIUC replacement pages are still chosen at random from
the free pile or the active pile (based in interactive or batch class).

Suffice it to say that this is so different to VMS or WNT paging that
I certainly wouldn't extrapolate Grenoble's results to them.

> basically running same kind of workload, cambridge ran approx. 80
> users with the same interactive response and thruput as grenoble did
> with 35 users i.e. cambridge system with 1/3rd less real storage for
> application paging was able to support more than twice as many users
> with comperable thruput and no worse interactive response
> ... typically much better. the interactive response was somewhat more
> sensitive to latency associated with the "working dispatcher"
> attempting to avoid page thrashing and how effective local LRU was in
> selecting pages for replacement ... so the cambridge system response
> ... even with more than twice as many users typically had better
> interactive response and lower latency delays.

The following looks like it might be a write up of those Cambridge
tests (it includes an acknowledgment to L. Wheeler at the end)
where they compare two page replacement algorithms.

Experimental evaluation of system performance
http://www.research.ibm.com/journal/sj/123/ibmsj1203F.pdf

<quote>
CP-67 maintains a table of all pages in main storage and a pointer
Item 1 that cycles around this table. CP-67 also maintains, at any
given moment, a list of ?in-queue? users, and these are the only ones
eligible to receive service at that time. The manner in which this
list is maintained is further discussed below.

The two algorithms may now be described as follows:

Algorithm 1 (used in CP67 v3.0):
Move the pointer around the table until a page belonging
to a user not in queue is found. If no such page is found
in a complete trip around the table, select the next page for removal.

Algorithm 2 (used in CP67 v3.1):
Select the next page if its reference bit is off.
Otherwise, turn the reference bit off, move the pointer down one
page, and repeat. Note: The reference bit is turned on whenever
the user references the page in the course of running his program.
The bit is turned off when the user is removed from inqueue status.
<end quote>

Algorithm 1 sounds like a global random mechanism.
Algorithm 2 sounds like a global clock variant.

Rodriquez-Rosell states that they started from CP67 v3.0 (Algorithm 1)
which seems to confirm that Grenoble CP67 started from a global random
replacement policy and modified it to add the Denning working set
dispatching (aka swapping).

Again this is quite different from VMS or WNT paging.

Eric

From: Anne & Lynn Wheeler on
"Eric P." <eric_pattison(a)sympaticoREMOVE.ca> writes:

> I was unable to send this back when we were discussing this in the
> summer because my ISP dropped newsgroup support and it took a while
> to find a free server that allows posting. Belated as it is, since
> I put in the effort I figured I might as well send it anyway.
>
> ....
>
> I finally got a copy of this paper. The description is a bit vague and
> confusing. While I accept your assertion that Cambridge outperformed
> Grenoble, after reading this paper I see no basis for generalizing
> from those results to conclude "global is superior to local".
>
> This paper discusses adding a Denning style working set to their
> existing system. A Denning working set manager is similar
> to ***swapping*** and independent of global vs local issues.
>
> This paper only shows that Grenoble's working set dispatcher,
> a combination of scheduler and bulk memory manager, improved on
> the CP67 scheduler and eliminated CP67's tendency to thrash. In fact
> they explicitly state that this dispatcher can be applied to many
> page replacement policies: random, FIFO, LRU or other. Though the
> mechanism is quite different, the result looks similar to VMS swapping.
>
> IIUC their working set is determined by scanning, at regular intervals,
> a process's PTE Used bits, then resetting the bits. Pages not used
> are candidates for replacement and the PTE is marked Invalid.
> If touched again it soft-faults back into the address space.
> By this it dynamically estimates the WS size for each process.
> If the free list gets too small, a victim process is selected from
> a FIFO list and the whole working set is tossed onto the free list
> (equivalent to outswapping). If there is enough free pages for a
> process WS size to fit, an outswapped process is inswapped.
>
> As for the page replacement algorithm, they state their design options
> were restricted by CP67 v3.0 existing structure. Prior to the Grenoble
> modifications, on a page fault it would look for an unused page,
> and if not available selects one from any active process at _Random_.
> ("it will choose the first page it finds in what essentially amounts
> to random replacement"). (They also don't say how the unused page is
> selected, which is as important as the selection of stealing page).
>
> Their wording is somewhat vague but if it does steal pages from
> any processes, then this is a ***global*** replacement policy.
>
> With Grenoble mods, in order to track the process working sets it
> does not use a linked list. It uses the working set determined
> above to decide which pages are in use and which are not.
> Shared pages are not part of this mechanism and are locked down.
> This separates the pages into two piles, those in the working sets
> and those not. IIUC replacement pages are still chosen at random from
> the free pile or the active pile (based in interactive or batch class).
>
> Suffice it to say that this is so different to VMS or WNT paging that
> I certainly wouldn't extrapolate Grenoble's results to them.
>
>> basically running same kind of workload, cambridge ran approx. 80
>> users with the same interactive response and thruput as grenoble did
>> with 35 users i.e. cambridge system with 1/3rd less real storage for
>> application paging was able to support more than twice as many users
>> with comperable thruput and no worse interactive response
>> ... typically much better. the interactive response was somewhat more
>> sensitive to latency associated with the "working dispatcher"
>> attempting to avoid page thrashing and how effective local LRU was in
>> selecting pages for replacement ... so the cambridge system response
>> ... even with more than twice as many users typically had better
>> interactive response and lower latency delays.
>
> The following looks like it might be a write up of those Cambridge
> tests (it includes an acknowledgment to L. Wheeler at the end)
> where they compare two page replacement algorithms.
>
> Experimental evaluation of system performance
> http://www.research.ibm.com/journal/sj/123/ibmsj1203F.pdf
>
> <quote>
> CP-67 maintains a table of all pages in main storage and a pointer
> Item 1 that cycles around this table. CP-67 also maintains, at any
> given moment, a list of .in-queue. users, and these are the only ones
> eligible to receive service at that time. The manner in which this
> list is maintained is further discussed below.
>
> The two algorithms may now be described as follows:
>
> Algorithm 1 (used in CP67 v3.0):
> Move the pointer around the table until a page belonging
> to a user not in queue is found. If no such page is found
> in a complete trip around the table, select the next page for removal.
>
> Algorithm 2 (used in CP67 v3.1):
> Select the next page if its reference bit is off.
> Otherwise, turn the reference bit off, move the pointer down one
> page, and repeat. Note: The reference bit is turned on whenever
> the user references the page in the course of running his program.
> The bit is turned off when the user is removed from inqueue status.
> <end quote>
>
> Algorithm 1 sounds like a global random mechanism.
> Algorithm 2 sounds like a global clock variant.
>
> Rodriquez-Rosell states that they started from CP67 v3.0 (Algorithm 1)
> which seems to confirm that Grenoble CP67 started from a global random
> replacement policy and modified it to add the Denning working set
> dispatching (aka swapping).
>
> Again this is quite different from VMS or WNT paging.

again ... i did the original stuff when i was an undergraduate for cp67
release 2 in the 60s ... the changes weren't picked up and distributed
in the product until release 3.1 (effectively the implementation stayed
the same between release 2 and release 3)
http://www.garlic.com/~lynn/subtopic.html#wsclock

with any load on the system (i.e. all pages in memory belonged to
"in-queue" uses, having quickly swept out pages belonging to users
that weren't running) ... algorithm 1 quickly degenerated to FIFO ...
since it would always re-search all memory for a non-running user page
.... and then select the next page belonging to a running user. Except
for typically swift transition when user was descheduled ... the
algoritm would sequently move one page at a time around real memory
replacing the next encountered page with a newly faulted page.
From: Anne & Lynn Wheeler on
Anne & Lynn Wheeler <lynn(a)garlic.com> writes:
> so i had done global lru replacement as an undergraduate in the 60s.
> by the time, the "single" bit replacement code was shipping in cp67
> 3.1 ... we were investigating 2, 3, 4, ... 8bit ... variations as well
> as moving the reset of the reference bit offset from the examination
> of the reference bit.
>
> the cp67 release 3 distributed implementation didn't differ from the
> release 2 distributed implementation. the implementation that grenoble
> started with (in release 3) and modified to implement local LRU
> ... was the same implementation that I started with as an
> undergraduate in cp67 release 2 and modified to implement global
> LRU. My global LRU implementation existed concurrently with the
> grenoble local LRU implementation as modification against the same
> base implementation (both release 2 and release 3).
>
> My global LRU implementation was then selected to ship in cp67 release
> 3.1 (which was the single reference bit reset at the same time as the
> examination). However, about the time it started shipping, we were
> also starting to look at issues like multiple bits and offsetting the
> reset from the examination. However, on 360/67, the examination of the
> bit used the ISK instruction and the resetting of the bit used the SSK
> instruction. For 370, the syncronous examination and reset strategy
> resulted in the new RRB instruction ... which examined and reset in a
> single operation.

re:
http://www.garlic.com/~lynn/2006q.html#19 virtual memory
http://www.garlic.com/~lynn/2006q.html#20 virtual memory

one of the issues with cycling around real memory for global LRU
was that it tended to be naturally adaptive (self-regulating) over
a relatively broad range of environments and configurations.

the point of LRU replacement ... is that *least recently used* is
supposedly predictive of future use ... pages used in the recent past
have a high probability being used in the near future (than pages that
haven't been used in the recent past).

the interval between the reset of the reference bit and the next
examination of the reference bit is supposed to represent a "recent
interval" that retains the predictive characteristics associated with
*least recently used* observations (pages that have the reference bit
off haven't been used in the recent interval and pages that have the
reference bit on have been used in the recent interval).

the naturally adaptive characteristics is that if the implementation
has cycled memory too quickly (cycled around storage once) ... then
there will be a larger precentage of pages with the reference bit off
.... which means it will take much longer to make one complete cycle
through all of memory (increasing the interval that allows pages to
have their reference bit set). if the implemenation has cycled memory
too slowly ... then there will be a larger percentage of pages with
the reference bit on ... which means that it will take much shorter
time to make one complete cycle through all of memory (one complete
cycle of memory is a function by the rate of page/faults and the
percentage of pages with their reference bits off).

so naturally adaptive worked over a fairly broad range ... but not
completely. In page thrashing scenario (severe over commitment of real
storage), everybody is waiting for page fault and not executing ... so
nobody is getting any page reference bit set. majority of reference
bits are always off and the algorithm degenerates to pretty much FIFO
(as in the case of "alogorithm 1" for cp67 which had effectively
assumed all the reference bits were the same and didn't even both to
check).

the other situation where the natural adaptive starts to break down is
where there was increasing amounts of available real storage and the
time it takes to cycle around all storage begins to consistently
exceed any common definition of "recent" ... as applied to least
recently used algorithms. large percentage of pages may have their
reference bit on (because the interval since recent was so long) and
there was no differentiation between pages that had been "recently
used" and pages that hadn't been used for quite some time.

this is where the "offset reset" started to appear. you can think of
the original implementation as having a "offset reset" exactly equal
to one complete cycle of all storage. however, with increasing real
storage sizes the avg. interval to cycle around all pages (interval
between reset and examiniation) was becoming excesively larger and no
longer conformed to any reasonable definition of "recent". so a new
variation was introduced to offset the resetting the reference bit
from the examination of the reference bit. this offset could be
adjusted based on reasonable definition and expectation of "recent"
having some valid prediction of future page reference behavior. This
could possibly be one-half of all memory .... so the pages being reset
were effectively "half of real pages" away from the pages being
examined (cutting the average interval between reset and examination
in half).
From: Anne & Lynn Wheeler on

Anne & Lynn Wheeler <lynn(a)garlic.com> writes:
> again ... i did the original stuff when i was an undergraduate for cp67
> release 2 in the 60s ... the changes weren't picked up and distributed
> in the product until release 3.1 (effectively the implementation stayed
> the same between release 2 and release 3)
> http://www.garlic.com/~lynn/subtopic.html#wsclock

re:
http://www.garlic.com/~lynn/2006q.html#19 virtual memory

while the global LRU stuff that I had done as an undergraduate in the
60s (against release 1 and release 2 cp67 systems), didn't ship in
cp67 release 3 ... gobs of other stuff that I had done did ship in
release 3 (lots of kernel rewrite, pathlengths, fastpath, etc).

repeat reference to part of an old presentation that i gave at the
aug68 share user group meeting held in boston
http://www.garlic.com/~lynn/94.html#18 CP/67 & OS MFT14

a couple more recent references to that presentation ... mostly
concerned large amount of pathlength rewrites for reducing
processor kernel overhead:
http://www.garlic.com/~lynn/2006.html#7 EREP , sense ... manual
http://www.garlic.com/~lynn/2006e.html#45 using 3390 mod-9s
http://www.garlic.com/~lynn/2006h.html#57 PDS Directory Question
From: jacko on
hi

the problem with lru is that some programs are compiled for speed with
large size, so that there relative performance improves, while reducing
system thruput.

number of pages per task should be os priority settable, with only
grabbing pages from own tasks space.

cheers