From: Keith Keller on
On 2010-07-20, Grant <omg(a)grrr.id.au> wrote:
>
> I ignore runaways as a valid thing to plan for. After all, in the
> last ten years I think the only time I lost a Linux box was when I
> played with a recursion bomb, out of curiosity.

I've had two different users unintentionally start runaway processes
on at least three different occasions in the past two years. Obviously
there are ways to deal with this (ulimit resources for one) other than
simply not utilizing swap, but there could be a legitimate reason
someone needs 100GB of memory for one process.
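
For the curious, the ulimit route looks something like this in bash
(the numbers are made up and would need tuning per site):

    # cap per-process virtual memory at ~2GB (ulimit -v takes KB)
    ulimit -v 2097152
    # cap CPU time at an hour so a spinning loop dies on its own
    ulimit -t 3600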

> And, in that circumstance,
> a large swap area can give one time to take action before the machine
> dies.

A large swap area makes things worse, I've found. If you have a small
swap space, the OOM killer can kill off processes without letting
them spend a ton of time swapping out. If your swap is large, then
the OOM killer doesn't kick in right away, processes swap like crazy,
and even a task like getting a shell takes minutes.
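
You can at least see what you're up against, and on 2.6 kernels bias
the box away from swapping; a rough sketch (the 10 is just an
example value):

    # list active swap areas with their sizes and priorities
    swapon -s
    # 0..100; lower values make the kernel less eager to swap
    sysctl -w vm.swappiness=10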

There is a lot of argument about the OOM killer and the Linux
kernel's memory overcommit. A Google search will turn up numerous
links on the subject (about which I am decidedly not an expert).
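
From what I've read (again, not an expert), the knobs are sysctls,
along these lines:

    # 0 = heuristic overcommit (default), 1 = always allow, 2 = strict
    sysctl -w vm.overcommit_memory=2
    # in strict mode the commit limit is swap + this percentage of RAM
    sysctl -w vm.overcommit_ratio=80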

--keith

--
kkeller-usenet(a)wombat.san-francisco.ca.us
(try just my userid to email me)
AOLSFAQ=http://www.therockgarden.ca/aolsfaq.txt
see X- headers for PGP signature information

From: Grant on
On Tue, 20 Jul 2010 21:11:58 GMT, unruh <unruh(a)wormhole.physics.ubc.ca> wrote:

>On 2010-07-20, Grant <omg(a)grrr.id.au> wrote:
>> On Tue, 20 Jul 2010 09:20:19 -0700, Keith Keller <kkeller-usenet(a)wombat.san-francisco.ca.us> wrote:
>>
>>>On 2010-07-20, Grant <omg(a)grrr.id.au> wrote:
>>>>
>>>> I usually put swap in at partition five, first in the logicals, on
>>>> each drive, then run them at the same priority. Large swap rarely
>>>> comes in handy, but is good for the occasional large or silly task.
>>>> Better than having the kernel start killing off processes in
>>>> response to out-of-memory.
>>>
>>>I don't think this is necessarily true. If your process is a runaway
>>>task, it's much much better to have the kernel kill it off right away
>>>than to let it fester in swap, dragging everything else down with it.
>>>This is of course assuming that the runaway process in question is
>>>using the most memory, which might not be the case if you have a big
>>>RDBMS running, for example. The OOM killer can be customized in recent
>>>kernels to help protect certain classes of processes.
>>
>> I ignore runaways as a valid thing to plan for. After all, in the
>> last ten years I think the only time I lost a Linux box was when I
>> played with a recursion bomb, out of curiosity. And, in that
>> circumstance, a large swap area can give one time to take action
>> before the machine dies. Can be a fun race, particularly if one
>> forgets the 'killall' command at the time.
>
>How in the world could you "lose" the box? Do you mean it crashed, or
>that some irretrievable badness occurred (CPU caught fire, hard disk was
>erased, screen exploded in a shower of glass....)

:-)

Lost, as in no services, not available, gone, deceased, dead, crashed.
Not a live box providing expected services, a navel gazer...
>
>>
>> Much more likely to lose the box on power failure.

Grant.
From: Grant on
On Tue, 20 Jul 2010 14:17:40 -0700, Keith Keller <kkeller-usenet(a)wombat.san-francisco.ca.us> wrote:

>On 2010-07-20, Grant <omg(a)grrr.id.au> wrote:
>>
>> I ignore runaways as a valid thing to plan for. After all, in the
>> last ten years I think the only time I lost a Linux box was when I
>> played with a recursion bomb, out of curiosity.
>
>I've had two different users unintentionally start runaway processes
>on at least three different occasions in the past two years. Obviously
>there are ways to deal with this (ulimit resources for one) other than
>simply not utilizing swap, but there could be a legitimate reason
>someone needs 100GB of memory for one process.

Yes, and that's the problem. You could set reasonable limits for
users and allow the odd user more for a good reason?

I suppose a reasonable limit is one that causes few problems and
draws only a few requests for larger limits?
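
I imagine /etc/security/limits.conf is the place for that; something
like this, where the numbers are made up and 'bob' is the
hypothetical odd user:

    # needs pam_limits; 'as' = address space (KB), 'nproc' = processes
    # ~4GB per process for everyone, ~100GB for the one user with a
    # good reason, plus a blunt cap on process count as fork-bomb
    # insurance
    *      hard  as     4194304
    bob    hard  as     104857600
    *      hard  nproc  200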

And, my viewpoint is that of a box with one user, me ;) No idea
what's best for a box serving many users.

Back when I was at uni, anyone fork-bombing the system 'lost'
their password until they reported to the sys-admin for a gentle
chat ;) Usually runaway programs in the Unix lab ate the local
machine, not the shared filesystem or the server box (the lab
machines were SGI Indys and O2s running IRIX; dunno what the server
was).
>
>> And, in that circumstance,
>> a large swap area can give one time to take action before the machine
>> dies.
>
>A large swap area makes things worse, I've found. If you have a small
>swap space, the OOM killer can kill off processes without letting
>them spend a ton of time swapping out. If your swap is large, then
>the OOM killer doesn't kick in right away, processes swap like crazy,
>and even a task like getting a shell takes minutes.

Yes, here I make sure there's a root console open if I'm playing
dangerous, or after a reboot, when I discover I am, in fact, playing
dangerous.
>
>There is a lot of argument about the OOM killer and the Linux
>kernel's memory overcommit. A Google search will turn up numerous
>links on the subject (about which I am decidedly not an expert).

I don't like the OOM killer (and, by implication, the over-commit it
has to cope with), but then, I've not triggered the thing in recent
years.

There are techniques and tunings to better control the thing, but
I'm not in an environment that needs them, so it's not an area I've
explored.
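
The tuning I've seen mentioned is the per-process knob under /proc
on 2.6 kernels; a sketch, where 12345 is a made-up PID:

    # range is -16..15: higher = killed sooner; -17 = exempt entirely
    echo -17 > /proc/$$/oom_adj      # shield the current shell
    echo 10 > /proc/12345/oom_adj    # make PID 12345 a preferred victim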

Grant.
From: Keith Keller on
On 2010-07-20, Grant <omg(a)grrr.id.au> wrote:
>
> Yes, and that's the problem. You could set reasonable limits for
> users, allow the odd user more for a good reason?
>
> I suppose a reasonable limit is where there are few problems, and
> only a few requests for larger limits?
>
> And, my viewpoint is from there being one user, me ;) No idea what's
> best for a box serving many users.

It depends a lot on the box and the users.

In my environment, we have about a half-dozen regular users, plus
another half-dozen occasional users, plus about a dozen who seldom
log in. The regular users all work on the same project together, so
if one writes a program that uses all available RAM (on our dev
boxes, never on our public-facing boxes!), the worst that happens is
a forced reset; next worst is the OOM killer; next worst is that I
kill the process before it gets that bad. The odds of losing critical
data (most of which is hosted on the fileserver, not on the dev
boxes) are negligible, and the odds of losing more than a few hours'
computation are also slim. So usually the aftermath is just ridicule
from the other developers, and I tend to err on the side of letting
the developers work with no resource limits.

On a server with more risk, you would want to be less lenient about
resource limits.


--keith

--
kkeller-usenet(a)wombat.san-francisco.ca.us
(try just my userid to email me)
AOLSFAQ=http://www.therockgarden.ca/aolsfaq.txt
see X- headers for PGP signature information

From: Doug Freyburger on
Grant wrote:
>
>>There is a lot of argument about the OOM killer and the Linux
>>kernel's memory overcommit. A Google search will turn up numerous
>>links on the subject (about which I am decidedly not an expert).
>
> I don't like the OOM killer (and, by implication the over-commit that it
> has to cope with), but then, I've not triggered the thing in recent years.

My latest run-in with the OOM killer is a system with Acronis doing
backups across a CIFS/SMB mount, plus Oracle RMAN doing backups
across a CIFS/SMB mount. Every so often I get a ticket that the
paging rate has gone through the roof, and by the time I can get
connected the system does not respond to SSH. If I don't reset it,
it stays hung all night. The syslog file has lots of OOM-killer
lines from just before it hung.
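
The evidence is easy enough to pull out after the fact (log path
varies by distro; /var/log/messages here is an assumption):

    grep -iE 'oom|out of memory' /var/log/messages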

I don't see why the client side of a CIFS/SMB mount would fill swap
space and hang the system, so I figure Acronis does that. A
kernel-invasive program that does snapshot OS backups? Such a
program was specified by the client, and it was not my choice! There
is an endless stream of updates to Acronis; clearly it is not ready
for prime time.