|
Prev: what is the minimal Instruction set of a microprocessor ?
Next: Trying to design low level hard disk manipulation program
From: Eric P. on 11 Sep 2006 11:52 I was unable to send this back when we were discussing this in the summer because my ISP dropped newsgroup support and it took a while to find a free server that allows posting. Belated as it is, since I put in the effort I figured I might as well send it anyway. Anne & Lynn Wheeler wrote: > > grenoble modified they same cp67 for "working set" dispatcher and > local LRU running on 1mbyte 360/67, 154 "pageable pages" after fixed > kernel requirements (basically 50 percent more real storage for > application paging). they published a paper on the work in the early > 70s in cacm > > J. Rodriquez-Rosell, The design, implementation, and evaluation of a > working set dispatcher, cacm16, apr73 I finally got a copy of this paper. The description is a bit vague and confusing. While I accept your assertion that Cambridge outperformed Grenoble, after reading this paper I see no basis for generalizing from those results to conclude "global is superior to local". This paper discusses adding a Denning style working set to their existing system. A Denning working set manager is similar to ***swapping*** and independent of global vs local issues. This paper only shows that Grenoble's working set dispatcher, a combination of scheduler and bulk memory manager, improved on the CP67 scheduler and eliminated CP67's tendency to thrash. In fact they explicitly state that this dispatcher can be applied to many page replacement policies: random, FIFO, LRU or other. Though the mechanism is quite different, the result looks similar to VMS swapping. IIUC their working set is determined by scanning, at regular intervals, a process's PTE Used bits, then resetting the bits. Pages not used are candidates for replacement and the PTE is marked Invalid. If touched again it soft-faults back into the address space. By this it dynamically estimates the WS size for each process. If the free list gets too small, a victim process is selected from a FIFO list and the whole working set is tossed onto the free list (equivalent to outswapping). If there is enough free pages for a process WS size to fit, an outswapped process is inswapped. As for the page replacement algorithm, they state their design options were restricted by CP67 v3.0 existing structure. Prior to the Grenoble modifications, on a page fault it would look for an unused page, and if not available selects one from any active process at _Random_. ("it will choose the first page it finds in what essentially amounts to random replacement"). (They also don't say how the unused page is selected, which is as important as the selection of stealing page). Their wording is somewhat vague but if it does steal pages from any processes, then this is a ***global*** replacement policy. With Grenoble mods, in order to track the process working sets it does not use a linked list. It uses the working set determined above to decide which pages are in use and which are not. Shared pages are not part of this mechanism and are locked down. This separates the pages into two piles, those in the working sets and those not. IIUC replacement pages are still chosen at random from the free pile or the active pile (based in interactive or batch class). Suffice it to say that this is so different to VMS or WNT paging that I certainly wouldn't extrapolate Grenoble's results to them. > basically running same kind of workload, cambridge ran approx. 80 > users with the same interactive response and thruput as grenoble did > with 35 users i.e. cambridge system with 1/3rd less real storage for > application paging was able to support more than twice as many users > with comperable thruput and no worse interactive response > ... typically much better. the interactive response was somewhat more > sensitive to latency associated with the "working dispatcher" > attempting to avoid page thrashing and how effective local LRU was in > selecting pages for replacement ... so the cambridge system response > ... even with more than twice as many users typically had better > interactive response and lower latency delays. The following looks like it might be a write up of those Cambridge tests (it includes an acknowledgment to L. Wheeler at the end) where they compare two page replacement algorithms. Experimental evaluation of system performance http://www.research.ibm.com/journal/sj/123/ibmsj1203F.pdf <quote> CP-67 maintains a table of all pages in main storage and a pointer Item 1 that cycles around this table. CP-67 also maintains, at any given moment, a list of ?in-queue? users, and these are the only ones eligible to receive service at that time. The manner in which this list is maintained is further discussed below. The two algorithms may now be described as follows: Algorithm 1 (used in CP67 v3.0): Move the pointer around the table until a page belonging to a user not in queue is found. If no such page is found in a complete trip around the table, select the next page for removal. Algorithm 2 (used in CP67 v3.1): Select the next page if its reference bit is off. Otherwise, turn the reference bit off, move the pointer down one page, and repeat. Note: The reference bit is turned on whenever the user references the page in the course of running his program. The bit is turned off when the user is removed from inqueue status. <end quote> Algorithm 1 sounds like a global random mechanism. Algorithm 2 sounds like a global clock variant. Rodriquez-Rosell states that they started from CP67 v3.0 (Algorithm 1) which seems to confirm that Grenoble CP67 started from a global random replacement policy and modified it to add the Denning working set dispatching (aka swapping). Again this is quite different from VMS or WNT paging. Eric
From: Anne & Lynn Wheeler on 11 Sep 2006 12:25 "Eric P." <eric_pattison(a)sympaticoREMOVE.ca> writes: > I was unable to send this back when we were discussing this in the > summer because my ISP dropped newsgroup support and it took a while > to find a free server that allows posting. Belated as it is, since > I put in the effort I figured I might as well send it anyway. > > .... > > I finally got a copy of this paper. The description is a bit vague and > confusing. While I accept your assertion that Cambridge outperformed > Grenoble, after reading this paper I see no basis for generalizing > from those results to conclude "global is superior to local". > > This paper discusses adding a Denning style working set to their > existing system. A Denning working set manager is similar > to ***swapping*** and independent of global vs local issues. > > This paper only shows that Grenoble's working set dispatcher, > a combination of scheduler and bulk memory manager, improved on > the CP67 scheduler and eliminated CP67's tendency to thrash. In fact > they explicitly state that this dispatcher can be applied to many > page replacement policies: random, FIFO, LRU or other. Though the > mechanism is quite different, the result looks similar to VMS swapping. > > IIUC their working set is determined by scanning, at regular intervals, > a process's PTE Used bits, then resetting the bits. Pages not used > are candidates for replacement and the PTE is marked Invalid. > If touched again it soft-faults back into the address space. > By this it dynamically estimates the WS size for each process. > If the free list gets too small, a victim process is selected from > a FIFO list and the whole working set is tossed onto the free list > (equivalent to outswapping). If there is enough free pages for a > process WS size to fit, an outswapped process is inswapped. > > As for the page replacement algorithm, they state their design options > were restricted by CP67 v3.0 existing structure. Prior to the Grenoble > modifications, on a page fault it would look for an unused page, > and if not available selects one from any active process at _Random_. > ("it will choose the first page it finds in what essentially amounts > to random replacement"). (They also don't say how the unused page is > selected, which is as important as the selection of stealing page). > > Their wording is somewhat vague but if it does steal pages from > any processes, then this is a ***global*** replacement policy. > > With Grenoble mods, in order to track the process working sets it > does not use a linked list. It uses the working set determined > above to decide which pages are in use and which are not. > Shared pages are not part of this mechanism and are locked down. > This separates the pages into two piles, those in the working sets > and those not. IIUC replacement pages are still chosen at random from > the free pile or the active pile (based in interactive or batch class). > > Suffice it to say that this is so different to VMS or WNT paging that > I certainly wouldn't extrapolate Grenoble's results to them. > >> basically running same kind of workload, cambridge ran approx. 80 >> users with the same interactive response and thruput as grenoble did >> with 35 users i.e. cambridge system with 1/3rd less real storage for >> application paging was able to support more than twice as many users >> with comperable thruput and no worse interactive response >> ... typically much better. the interactive response was somewhat more >> sensitive to latency associated with the "working dispatcher" >> attempting to avoid page thrashing and how effective local LRU was in >> selecting pages for replacement ... so the cambridge system response >> ... even with more than twice as many users typically had better >> interactive response and lower latency delays. > > The following looks like it might be a write up of those Cambridge > tests (it includes an acknowledgment to L. Wheeler at the end) > where they compare two page replacement algorithms. > > Experimental evaluation of system performance > http://www.research.ibm.com/journal/sj/123/ibmsj1203F.pdf > > <quote> > CP-67 maintains a table of all pages in main storage and a pointer > Item 1 that cycles around this table. CP-67 also maintains, at any > given moment, a list of .in-queue. users, and these are the only ones > eligible to receive service at that time. The manner in which this > list is maintained is further discussed below. > > The two algorithms may now be described as follows: > > Algorithm 1 (used in CP67 v3.0): > Move the pointer around the table until a page belonging > to a user not in queue is found. If no such page is found > in a complete trip around the table, select the next page for removal. > > Algorithm 2 (used in CP67 v3.1): > Select the next page if its reference bit is off. > Otherwise, turn the reference bit off, move the pointer down one > page, and repeat. Note: The reference bit is turned on whenever > the user references the page in the course of running his program. > The bit is turned off when the user is removed from inqueue status. > <end quote> > > Algorithm 1 sounds like a global random mechanism. > Algorithm 2 sounds like a global clock variant. > > Rodriquez-Rosell states that they started from CP67 v3.0 (Algorithm 1) > which seems to confirm that Grenoble CP67 started from a global random > replacement policy and modified it to add the Denning working set > dispatching (aka swapping). > > Again this is quite different from VMS or WNT paging. again ... i did the original stuff when i was an undergraduate for cp67 release 2 in the 60s ... the changes weren't picked up and distributed in the product until release 3.1 (effectively the implementation stayed the same between release 2 and release 3) http://www.garlic.com/~lynn/subtopic.html#wsclock with any load on the system (i.e. all pages in memory belonged to "in-queue" uses, having quickly swept out pages belonging to users that weren't running) ... algorithm 1 quickly degenerated to FIFO ... since it would always re-search all memory for a non-running user page .... and then select the next page belonging to a running user. Except for typically swift transition when user was descheduled ... the algoritm would sequently move one page at a time around real memory replacing the next encountered page with a newly faulted page.
From: Anne & Lynn Wheeler on 11 Sep 2006 14:03 Anne & Lynn Wheeler <lynn(a)garlic.com> writes: > so i had done global lru replacement as an undergraduate in the 60s. > by the time, the "single" bit replacement code was shipping in cp67 > 3.1 ... we were investigating 2, 3, 4, ... 8bit ... variations as well > as moving the reset of the reference bit offset from the examination > of the reference bit. > > the cp67 release 3 distributed implementation didn't differ from the > release 2 distributed implementation. the implementation that grenoble > started with (in release 3) and modified to implement local LRU > ... was the same implementation that I started with as an > undergraduate in cp67 release 2 and modified to implement global > LRU. My global LRU implementation existed concurrently with the > grenoble local LRU implementation as modification against the same > base implementation (both release 2 and release 3). > > My global LRU implementation was then selected to ship in cp67 release > 3.1 (which was the single reference bit reset at the same time as the > examination). However, about the time it started shipping, we were > also starting to look at issues like multiple bits and offsetting the > reset from the examination. However, on 360/67, the examination of the > bit used the ISK instruction and the resetting of the bit used the SSK > instruction. For 370, the syncronous examination and reset strategy > resulted in the new RRB instruction ... which examined and reset in a > single operation. re: http://www.garlic.com/~lynn/2006q.html#19 virtual memory http://www.garlic.com/~lynn/2006q.html#20 virtual memory one of the issues with cycling around real memory for global LRU was that it tended to be naturally adaptive (self-regulating) over a relatively broad range of environments and configurations. the point of LRU replacement ... is that *least recently used* is supposedly predictive of future use ... pages used in the recent past have a high probability being used in the near future (than pages that haven't been used in the recent past). the interval between the reset of the reference bit and the next examination of the reference bit is supposed to represent a "recent interval" that retains the predictive characteristics associated with *least recently used* observations (pages that have the reference bit off haven't been used in the recent interval and pages that have the reference bit on have been used in the recent interval). the naturally adaptive characteristics is that if the implementation has cycled memory too quickly (cycled around storage once) ... then there will be a larger precentage of pages with the reference bit off .... which means it will take much longer to make one complete cycle through all of memory (increasing the interval that allows pages to have their reference bit set). if the implemenation has cycled memory too slowly ... then there will be a larger percentage of pages with the reference bit on ... which means that it will take much shorter time to make one complete cycle through all of memory (one complete cycle of memory is a function by the rate of page/faults and the percentage of pages with their reference bits off). so naturally adaptive worked over a fairly broad range ... but not completely. In page thrashing scenario (severe over commitment of real storage), everybody is waiting for page fault and not executing ... so nobody is getting any page reference bit set. majority of reference bits are always off and the algorithm degenerates to pretty much FIFO (as in the case of "alogorithm 1" for cp67 which had effectively assumed all the reference bits were the same and didn't even both to check). the other situation where the natural adaptive starts to break down is where there was increasing amounts of available real storage and the time it takes to cycle around all storage begins to consistently exceed any common definition of "recent" ... as applied to least recently used algorithms. large percentage of pages may have their reference bit on (because the interval since recent was so long) and there was no differentiation between pages that had been "recently used" and pages that hadn't been used for quite some time. this is where the "offset reset" started to appear. you can think of the original implementation as having a "offset reset" exactly equal to one complete cycle of all storage. however, with increasing real storage sizes the avg. interval to cycle around all pages (interval between reset and examiniation) was becoming excesively larger and no longer conformed to any reasonable definition of "recent". so a new variation was introduced to offset the resetting the reference bit from the examination of the reference bit. this offset could be adjusted based on reasonable definition and expectation of "recent" having some valid prediction of future page reference behavior. This could possibly be one-half of all memory .... so the pages being reset were effectively "half of real pages" away from the pages being examined (cutting the average interval between reset and examination in half).
From: Anne & Lynn Wheeler on 11 Sep 2006 14:05 Anne & Lynn Wheeler <lynn(a)garlic.com> writes: > again ... i did the original stuff when i was an undergraduate for cp67 > release 2 in the 60s ... the changes weren't picked up and distributed > in the product until release 3.1 (effectively the implementation stayed > the same between release 2 and release 3) > http://www.garlic.com/~lynn/subtopic.html#wsclock re: http://www.garlic.com/~lynn/2006q.html#19 virtual memory while the global LRU stuff that I had done as an undergraduate in the 60s (against release 1 and release 2 cp67 systems), didn't ship in cp67 release 3 ... gobs of other stuff that I had done did ship in release 3 (lots of kernel rewrite, pathlengths, fastpath, etc). repeat reference to part of an old presentation that i gave at the aug68 share user group meeting held in boston http://www.garlic.com/~lynn/94.html#18 CP/67 & OS MFT14 a couple more recent references to that presentation ... mostly concerned large amount of pathlength rewrites for reducing processor kernel overhead: http://www.garlic.com/~lynn/2006.html#7 EREP , sense ... manual http://www.garlic.com/~lynn/2006e.html#45 using 3390 mod-9s http://www.garlic.com/~lynn/2006h.html#57 PDS Directory Question
From: jacko on 12 Sep 2006 15:22
hi the problem with lru is that some programs are compiled for speed with large size, so that there relative performance improves, while reducing system thruput. number of pages per task should be os priority settable, with only grabbing pages from own tasks space. cheers |