From: Chris Webb on
We run a number of relatively large x86-64 hosts with twenty or so qemu-kvm
virtual machines on each of them, and I'm having some trouble with over-eager
swapping on some (but not all) of the machines. This is resulting in
customer reports of very poor response latency from the virtual machines
which have been swapped out, despite the hosts apparently having large
amounts of free memory, and running fine if swap is turned off.

All of the hosts are running a 2.6.32.7 kernel and have ksm enabled with
32GB of RAM and 2x quad-core processors. There is a cluster of Xeon E5420
machines which apparently doesn't exhibit the problem, and a cluster of
2352/2378 Opteron (NUMA) machines, some of which do. The kernel config of
the affected machines is at

http://cdw.me.uk/tmp/config-2.6.32.7

This differs very little from the config on the unaffected Xeon machines,
essentially just

-CONFIG_MCORE2=y
+CONFIG_MK8=y
-CONFIG_X86_P6_NOP=y

On a typical affected machine, the virtual machines and other processes
would apparently leave around 5.5GB of RAM available for buffers, but the
system seems to want to swap out 3GB of anonymous pages to give itself more
like 9GB of buffers:

# cat /proc/meminfo
MemTotal: 33083420 kB
MemFree: 693164 kB
Buffers: 8834380 kB
Cached: 11212 kB
SwapCached: 1443524 kB
Active: 21656844 kB
Inactive: 8119352 kB
Active(anon): 17203092 kB
Inactive(anon): 3729032 kB
Active(file): 4453752 kB
Inactive(file): 4390320 kB
Unevictable: 5472 kB
Mlocked: 5472 kB
SwapTotal: 25165816 kB
SwapFree: 21854572 kB
Dirty: 4300 kB
Writeback: 4 kB
AnonPages: 20780368 kB
Mapped: 6056 kB
Shmem: 56 kB
Slab: 961512 kB
SReclaimable: 438276 kB
SUnreclaim: 523236 kB
KernelStack: 10152 kB
PageTables: 67176 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 41707524 kB
Committed_AS: 39870868 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 150880 kB
VmallocChunk: 34342404996 kB
HardwareCorrupted: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 5824 kB
DirectMap2M: 3205120 kB
DirectMap1G: 30408704 kB
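
Doing the arithmetic on that dump, SwapTotal - SwapFree comes to a little
over 3GB of swap in use against nearly 9GB sitting in Buffers, which is where
the figures above come from. A one-liner along these lines pulls the same
numbers straight out of /proc/meminfo (which reports in kB, hence the divide
by 1e6):

  # awk '/^SwapTotal/{st=$2} /^SwapFree/{sf=$2} /^Buffers/{b=$2} /^Cached/{c=$2}
         END {printf "swap used %.1fGB, file cache %.1fGB\n", (st-sf)/1e6, (b+c)/1e6}' /proc/meminfo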

We see this despite the machine having vm.swappiness set to 0 in an attempt
to skew the reclaim as far as possible in favour of releasing page cache
instead of swapping anonymous pages.
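
For completeness, swappiness is set in the obvious way via sysctl (plus the
equivalent line in sysctl.conf so it survives a reboot), i.e. roughly:

  # sysctl -w vm.swappiness=0
  # cat /proc/sys/vm/swappiness
  0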

After running swapoff -a, the machine is immediately much healthier. Even
while the swap is still being reduced, load goes down and response times in
virtual machines are much improved. Once the swap is completely gone, there
are still several gigabytes of RAM left free which are used for buffers, and
the virtual machines are no longer laggy because they are no longer swapped
out. Running swapon -a again, the affected machine waits for about a minute
with zero swap in use, before the amount of swap in use very rapidly
increases to around 2GB and then continues to increase more steadily to 3GB.
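
For anyone wanting to watch this happen, something like the following is
enough, keeping an eye on the si/so columns in vmstat and the Swap*, Buffers
and Cached lines in meminfo:

  # swapoff -a
  # vmstat 5
  # swapon -a
  # watch -n 10 "grep -E 'Swap|Buffers|Cached' /proc/meminfo"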

We could run these machines without swap (in the worst cases we're
already doing so), but I'd prefer to have a reserve of swap available in
case of genuine emergency. If it's a choice between swapping out a guest or
oom-killing it, I'd prefer to swap... but I really don't want to swap out
running virtual machines in order to have eight gigabytes of page cache
instead of five!

Is this a problem with the page reclaim priorities, or am I just tuning
these hosts incorrectly? Is there more detailed info than /proc/meminfo
available which might shed more light on what's going wrong here?
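
The only other places I know to look are the ones below; I'm happy to post
output from any of them if it would help:

  # cat /proc/vmstat        (pgscan_*/pgsteal_*/pswpin/pswpout reclaim counters)
  # cat /proc/zoneinfo      (per-zone free pages, watermarks and LRU sizes)
  # cat /proc/buddyinfo     (free memory broken down by allocation order)
  # numactl --hardware      (per-node free memory on the NUMA boxes)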

Best wishes,

Chris.
From: Minchan Kim on
On Mon, Aug 2, 2010 at 9:47 PM, Chris Webb <chris(a)arachsys.com> wrote:
> We run a number of relatively large x86-64 hosts with twenty or so qemu-kvm
> virtual machines on each of them, and I'm having some trouble with over-eager
> swapping on some (but not all) of the machines. This is resulting in
> customer reports of very poor response latency from the virtual machines
> which have been swapped out, despite the hosts apparently having large
> amounts of free memory, and running fine if swap is turned off.
>
> All of the hosts are running a 2.6.32.7 kernel and have ksm enabled with
> 32GB of RAM and 2x quad-core processors. There is a cluster of Xeon E5420
> machines which apparently doesn't exhibit the problem, and a cluster of
> 2352/2378 Opteron (NUMA) machines, some of which do. The kernel config of
> the affected machines is at
>
>  http://cdw.me.uk/tmp/config-2.6.32.7
>
> This differs very little from the config on the unaffected Xeon machines,
> essentially just
>
>  -CONFIG_MCORE2=y
>  +CONFIG_MK8=y
>  -CONFIG_X86_P6_NOP=y
>
> On a typical affected machine, the virtual machines and other processes
> would apparently leave around 5.5GB of RAM available for buffers, but the
> system seems to want to swap out 3GB of anonymous pages to give itself more
> like 9GB of buffers:
>
>  # cat /proc/meminfo
>  MemTotal:       33083420 kB
>  MemFree:          693164 kB
>  Buffers:         8834380 kB
>  Cached:            11212 kB
>  SwapCached:      1443524 kB
>  Active:         21656844 kB
>  Inactive:        8119352 kB
>  Active(anon):   17203092 kB
>  Inactive(anon):  3729032 kB
>  Active(file):    4453752 kB
>  Inactive(file):  4390320 kB
>  Unevictable:        5472 kB
>  Mlocked:            5472 kB
>  SwapTotal:      25165816 kB
>  SwapFree:       21854572 kB
>  Dirty:              4300 kB
>  Writeback:             4 kB
>  AnonPages:      20780368 kB
>  Mapped:             6056 kB
>  Shmem:                56 kB
>  Slab:             961512 kB
>  SReclaimable:     438276 kB
>  SUnreclaim:       523236 kB
>  KernelStack:       10152 kB
>  PageTables:        67176 kB
>  NFS_Unstable:          0 kB
>  Bounce:                0 kB
>  WritebackTmp:          0 kB
>  CommitLimit:    41707524 kB
>  Committed_AS:   39870868 kB
>  VmallocTotal:   34359738367 kB
>  VmallocUsed:      150880 kB
>  VmallocChunk:   34342404996 kB
>  HardwareCorrupted:     0 kB
>  HugePages_Total:       0
>  HugePages_Free:        0
>  HugePages_Rsvd:        0
>  HugePages_Surp:        0
>  Hugepagesize:       2048 kB
>  DirectMap4k:        5824 kB
>  DirectMap2M:     3205120 kB
>  DirectMap1G:    30408704 kB
>
> We see this despite the machine having vm.swappiness set to 0 in an attempt
> to skew the reclaim as far as possible in favour of releasing page cache
> instead of swapping anonymous pages.
>

Hmm, strange.
We reclaim only anon pages when the system has very little page cache
(ie, file + free <= high_water_mark).
But in your meminfo, your system has lots of page cache pages, so that
isn't likely to be what is happening here.
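
If you want to double-check that condition, you can compare nr_free_pages
plus the file LRU counters against each zone's "high" watermark by hand.
This is only a rough version of the in-kernel check, but it is easy to
eyeball:

  # grep -E '^Node|high |nr_free_pages|nr_inactive_file|nr_active_file' /proc/zoneinfo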

Another possibility is _zone_reclaim_ on NUMA, since your working set
has many anonymous pages.

zone_reclaim sets the scan priority to ZONE_RECLAIM_PRIORITY, which can
push reclaim into lumpy mode, and lumpy reclaim can page out anon pages.

Could you show me /proc/sys/vm/[zone_reclaim_mode/min_unmapped_ratio] ?

--
Kind regards,
Minchan Kim
From: Chris Webb on
Minchan Kim <minchan.kim(a)gmail.com> writes:

> Another possibility is _zone_reclaim_ in NUMA.
> Your working set has many anonymous page.
>
> The zone_reclaim set priority to ZONE_RECLAIM_PRIORITY.
> It can make reclaim mode to lumpy so it can page out anon pages.
>
> Could you show me /proc/sys/vm/[zone_reclaim_mode/min_unmapped_ratio] ?

Sure, no problem. On the machine with the /proc/meminfo I showed earlier,
these are

# cat /proc/sys/vm/zone_reclaim_mode
0
# cat /proc/sys/vm/min_unmapped_ratio
1

I haven't changed either of these from the kernel default.

Many thanks,

Chris.
From: Minchan Kim on
On Tue, Aug 3, 2010 at 12:31 PM, Chris Webb <chris(a)arachsys.com> wrote:
> Minchan Kim <minchan.kim(a)gmail.com> writes:
>
>> Another possibility is _zone_reclaim_ in NUMA.
>> Your working set has many anonymous page.
>>
>> The zone_reclaim set priority to ZONE_RECLAIM_PRIORITY.
>> It can make reclaim mode to lumpy so it can page out anon pages.
>>
>> Could you show me /proc/sys/vm/[zone_reclaim_mode/min_unmapped_ratio] ?
>
> Sure, no problem. On the machine with the /proc/meminfo I showed earlier,
> these are
>
>  # cat /proc/sys/vm/zone_reclaim_mode
>  0
>  # cat /proc/sys/vm/min_unmapped_ratio
>  1

If zone_reclaim_mode is zero, zone reclaim doesn't swap out anon pages.

1) How is the VM reclaiming anonymous pages even though vm.swappiness == 0
and there is plenty of page cache?
2) It is also suspicious that your file pages are almost entirely Buffers
while Cached is only about 10MB. Why do those buffers remain while anon
pages are swapped out and cached pages are reclaimed?

Hmm. I have no idea. :(

--
Kind regards,
Minchan Kim
From: Wu Fengguang on
On Tue, Aug 03, 2010 at 12:09:18PM +0800, Minchan Kim wrote:
> On Tue, Aug 3, 2010 at 12:31 PM, Chris Webb <chris(a)arachsys.com> wrote:
> > Minchan Kim <minchan.kim(a)gmail.com> writes:
> >
> >> Another possibility is _zone_reclaim_ in NUMA.
> >> Your working set has many anonymous page.
> >>
> >> The zone_reclaim set priority to ZONE_RECLAIM_PRIORITY.
> >> It can make reclaim mode to lumpy so it can page out anon pages.
> >>
> >> Could you show me /proc/sys/vm/[zone_reclaim_mode/min_unmapped_ratio] ?
> >
> > Sure, no problem. On the machine with the /proc/meminfo I showed earlier,
> > these are
> >
> >  # cat /proc/sys/vm/zone_reclaim_mode
> >  0
> >  # cat /proc/sys/vm/min_unmapped_ratio
> >  1
>
> If zone_reclaim_mode is zero, zone reclaim doesn't swap out anon pages.

If there are lots of order-1 or higher allocations, anonymous pages
will be randomly evicted, regardless of their LRU ages. This is
probably another factor behind what the users are reporting. Are there
easy ways to confirm this other than patching the kernel?

Chris, what's in your /proc/slabinfo?
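
Caches whose slabs need order-1 or higher pages show up in the
<pagesperslab> column; assuming the usual "slabinfo - version: 2.1" layout,
something like this lists them, and /proc/buddyinfo shows how fragmented
the free lists already are:

  # awk 'NR > 2 && $6 > 1 {printf "%-28s pagesperslab=%d num_slabs=%s\n", $1, $6, $(NF-1)}' /proc/slabinfo
  # cat /proc/buddyinfo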

Thanks,
Fengguang