From: John Berthels
Hi folks,

[I'm afraid that I'm not subscribed to the list, please cc: me on any
reply].

Problem: kernel.org 2.6.33.2 x86_64 kernel locks up under write-heavy
I/O load. It is "fixed" by changing THREAD_ORDER to 2.

Is this an acceptable long-term solution, and should it even be needed?
As far as I can see from searching, the expectation is that xfs should
generally work with 8k stacks (THREAD_ORDER 1). We don't have xfs
stacked over LVM or anything else.
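
(For reference, the only change we made is the stack-size define. If I'm
reading arch/x86/include/asm/page_64_types.h correctly -- quoting from
memory here, so please check your own tree -- it is roughly:

#define THREAD_ORDER    1                              /* we bumped this to 2 */
#define THREAD_SIZE     (PAGE_SIZE << THREAD_ORDER)    /* 8k stock, 16k at order 2 */

so THREAD_ORDER 2 means each kernel stack is a 16k, order-2 page
allocation instead of the stock 8k, order-1 one.)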

If anyone can offer any advice on this, that would be great. I
understand that larger kernel stacks need a higher-order contiguous
allocation, which can fail once memory becomes fragmented. So am I right
in thinking that the symptom to look out for would be an error from
fork() or clone()? Or would the box panic in that case?
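
My (quite possibly wrong) reading of the fork path is that the stack is
a plain higher-order page allocation made at task creation, something
like the following (paraphrased from arch/x86/include/asm/thread_info.h
from memory, not a verbatim quote):

/* roughly how x86 allocates the thread_info + kernel stack */
#define alloc_thread_info(tsk) \
        ((struct thread_info *)__get_free_pages(GFP_KERNEL, THREAD_ORDER))

If that returns NULL I'd expect copy_process() to bail out and
fork()/clone() to return an error to userspace rather than the box
panicking -- but I'd appreciate confirmation.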

Details below.

regards,

jb


Background: We have a cluster of systems with roughly the following
specs (2GB RAM, 24 (twenty-four) 1TB+ disks, Intel Core2 Duo @ 2.2GHz).

Following the addition of three new servers to the cluster, we started
seeing a high incidence of intermittent lockups (up to several times per
day for some servers) across both the old and new servers. Prior to
that, we saw this problem only rarely (perhaps once every 3 months).

Adding the new servers will have changed the I/O patterns seen by all
servers. The servers receive a heavy write load, often with many slow
writers (as well as a read load).

Servers would become unresponsive, with nothing written to
/var/log/messages. Setting sysctl kernel.panic=300 caused a restart,
which showed the kernel was panicking but unable to write anything out
at the time. netconsole showed a variety of stack traces, mostly related
to xfs_write activity (but then, that's what the box spends its time
doing).

22/24 of the disks have 1 partition, formatted with xfs (over the
partition, not over LVM). The other 2 disks have 3 partitions: xfs data,
swap and a RAID1 partition contributing to an ext3 root filesystem
mounted on /dev/md0.

We have tried various fixes, including different Ubuntu server kernels
(2.6.28 through 2.6.32).

Vanilla 2.6.33.2 from kernel.org + stack tracing still has the problem,
and logged:

kernel: [58552.740032] flush-8:112 used greatest stack depth: 184 bytes left

a short while before dying.

Vanilla 2.6.33.2 + stack tracing + THREAD_ORDER 2 is much more stable
(no lockups so far, we would have expected 5-6 by now) and has logged:

kernel: [44798.183507] apache2 used greatest stack depth: 7208 bytes left

which I understand (possibly wrongly) as concrete evidence that we have
exceeded 8k of stack space.
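
(As I understand it, the "bytes left" figure comes from
CONFIG_DEBUG_STACK_USAGE: at task exit the kernel scans up from the end
of the stack for the first word that was ever dirtied, roughly the
helper below, paraphrased from include/linux/sched.h from memory:

static inline unsigned long stack_not_used(struct task_struct *p)
{
        unsigned long *n = end_of_stack(p);

        do {            /* skip the canary and never-touched words */
                n++;
        } while (!*n);

        return (unsigned long)n - (unsigned long)end_of_stack(p);
}

So 7208 bytes left on this THREAD_ORDER 2 box means 16384 - 7208 = 9176
bytes were actually used, i.e. more than an 8k stack could have held.)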

From: Eric Sandeen
John Berthels wrote:
> Dave Chinner wrote:
>> I'm not seeing stacks deeper than about 5.6k on XFS under heavy write
>> loads. That's nowhere near blowing an 8k stack, so there must be
>> something special about what you are doing. Can you post the stack
>> traces that are being generated for the deepest stack generated -
>> /sys/kernel/debug/tracing/stack_trace should contain it.
>>
> Appended below. That doesn't seem to reach 8192 but the box it's from
> has logged:
>
> [74649.579386] apache2 used greatest stack depth: 7024 bytes left

but that's -left- (out of 8k or is that from a THREAD_ORDER=2 box?)

I guess it must be out of 16k...

> Depth Size Location (47 entries)
> ----- ---- --------
> 0) 7568 16 mempool_alloc_slab+0x16/0x20
> 1) 7552 144 mempool_alloc+0x65/0x140
> 2) 7408 96 get_request+0x124/0x370
> 3) 7312 144 get_request_wait+0x29/0x1b0
> 4) 7168 96 __make_request+0x9b/0x490
> 5) 7072 208 generic_make_request+0x3df/0x4d0
> 6) 6864 80 submit_bio+0x7c/0x100
> 7) 6784 96 _xfs_buf_ioapply+0x128/0x2c0 [xfs]
> 8) 6688 48 xfs_buf_iorequest+0x75/0xd0 [xfs]
> 9) 6640 32 _xfs_buf_read+0x36/0x70 [xfs]
> 10) 6608 48 xfs_buf_read+0xda/0x110 [xfs]
> 11) 6560 80 xfs_trans_read_buf+0x2a7/0x410 [xfs]
> 12) 6480 80 xfs_btree_read_buf_block+0x5d/0xb0 [xfs]
> 13) 6400 80 xfs_btree_lookup_get_block+0x84/0xf0 [xfs]
> 14) 6320 176 xfs_btree_lookup+0xd7/0x490 [xfs]
> 15) 6144 16 xfs_alloc_lookup_eq+0x19/0x20 [xfs]
> 16) 6128 96 xfs_alloc_fixup_trees+0xee/0x350 [xfs]
> 17) 6032 144 xfs_alloc_ag_vextent_near+0x916/0xb30 [xfs]
> 18) 5888 32 xfs_alloc_ag_vextent+0xe5/0x140 [xfs]
> 19) 5856 96 xfs_alloc_vextent+0x49f/0x630 [xfs]
> 20) 5760 160 xfs_bmbt_alloc_block+0xbe/0x1d0 [xfs]
> 21) 5600 208 xfs_btree_split+0xb3/0x6a0 [xfs]
> 22) 5392 96 xfs_btree_make_block_unfull+0x151/0x190 [xfs]
> 23) 5296 224 xfs_btree_insrec+0x39c/0x5b0 [xfs]
> 24) 5072 128 xfs_btree_insert+0x86/0x180 [xfs]
> 25) 4944 352 xfs_bmap_add_extent_delay_real+0x41e/0x1660 [xfs]
> 26) 4592 208 xfs_bmap_add_extent+0x41c/0x450 [xfs]
> 27) 4384 448 xfs_bmapi+0x982/0x1200 [xfs]

This one, I'm afraid, has always been big.

> 28) 3936 256 xfs_iomap_write_allocate+0x248/0x3c0 [xfs]
> 29) 3680 208 xfs_iomap+0x3d8/0x410 [xfs]
> 30) 3472 32 xfs_map_blocks+0x2c/0x30 [xfs]
> 31) 3440 256 xfs_page_state_convert+0x443/0x730 [xfs]
> 32) 3184 64 xfs_vm_writepage+0xab/0x160 [xfs]
> 33) 3120 384 shrink_page_list+0x65e/0x840
> 34) 2736 528 shrink_zone+0x63f/0xe10

that's a nice one (actually the two together at > 900 bytes, ouch)

> 35) 2208 112 do_try_to_free_pages+0xc2/0x3c0
> 36) 2096 128 try_to_free_pages+0x77/0x80
> 37) 1968 240 __alloc_pages_nodemask+0x3e4/0x710
> 38) 1728 48 alloc_pages_current+0x8c/0xe0
> 39) 1680 16 __get_free_pages+0xe/0x50
> 40) 1664 48 __pollwait+0xca/0x110
> 41) 1616 32 unix_poll+0x28/0xc0
> 42) 1584 16 sock_poll+0x1d/0x20
> 43) 1568 912 do_select+0x3d6/0x700

912, ouch!

int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
{
        ktime_t expire, *to = NULL;
        struct poll_wqueues table;
        ...

(gdb) p sizeof(struct poll_wqueues)
$1 = 624

I guess that's been there forever, though.
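
Most of that 624 bytes is the inline poll_table_entry array that select
keeps on the stack to avoid an allocation for small fd sets -- roughly
this, paraphrasing include/linux/poll.h from memory:

#define MAX_STACK_ALLOC         832
#define FRONTEND_STACK_ALLOC    256
#define WQUEUES_STACK_ALLOC     (MAX_STACK_ALLOC - FRONTEND_STACK_ALLOC)
#define N_INLINE_POLL_ENTRIES   (WQUEUES_STACK_ALLOC / sizeof(struct poll_table_entry))

struct poll_wqueues {
        poll_table pt;
        struct poll_table_page *table;
        struct task_struct *polling_task;
        int triggered;
        int error;
        int inline_index;
        /* the inline array is what makes the struct ~600 bytes */
        struct poll_table_entry inline_entries[N_INLINE_POLL_ENTRIES];
};

i.e. the inline entries are deliberately sized to use up most of an
832-byte stack budget rather than hit the allocator in the common case.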

> 44) 656 416 core_sys_select+0x18c/0x2c0

416 hurts too.

The xfs callchain is deep, no doubt, but the combination of the select path
and the shrink calls is almost 2k in just a few calls, and that doesn't
help much.

-Eric

> 45) 240 112 sys_select+0x4f/0x110
> 46) 128 128 system_call_fastpath+0x16/0x1b

From: John Berthels
Dave Chinner wrote:
> On Thu, Apr 08, 2010 at 09:43:41AM +1000, Dave Chinner wrote:
>
> And there's a patch attached that stops direct reclaim from writing
> back dirty pages - it seems to work fine from some rough testing
> I've done. Perhaps you might want to give it a spin on a
> test box, John?
>
Thanks very much for this. The patch is in and soaking on a THREAD_ORDER
1 kernel (2.6.33.2 + patch + stack instrumentation), so far so good, but
it's early days. After about 2hrs of uptime:

$ dmesg | grep stack | tail -1
[ 60.350766] apache2 used greatest stack depth: 2544 bytes left

(8192 - 2544 = 5648 bytes used, which tallies well with the ~5.6k usage
figure you quoted).
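
For anyone following along without the attachment: my rough
understanding of the idea (a hypothetical sketch of the gist, not Dave's
actual diff, and reclaim_may_writepage is a made-up name) is that during
reclaim only kswapd, which starts from its own shallow stack, gets to
call ->writepage(); a direct reclaimer that is already deep inside a
write path, as in the traces posted earlier, skips the writeback and
leaves the dirty page to kswapd and the flusher threads:

/* hypothetical illustration only, not the attached patch */
static bool reclaim_may_writepage(void)
{
        /* current_is_kswapd() is (current->flags & PF_KSWAPD) */
        return current_is_kswapd();
}

Apologies if I've misread the intent.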

I'll reply again after it's been running long enough to draw conclusions.

jb
From: John Berthels
John Berthels wrote:
> I'll reply again after it's been running long enough to draw conclusions.
We're getting pretty close on the 8k stack on this box now. It's running
2.6.33.2 + your patch, with THREAD_ORDER 1, stack tracing and
CONFIG_LOCKDEP=y. (Sorry that LOCKDEP is on; please advise if that's
going to skew the figures and we'll restart the test systems with new
kernels.)

The worst case leaves only 720 bytes of the 8k stack free (about 7.3k
used), which is significantly more than 5.6K -- does that point to a
real problem, or is 720 bytes enough headroom?

jb

[ 4005.541869] apache2 used greatest stack depth: 2480 bytes left
[ 4005.541973] apache2 used greatest stack depth: 2240 bytes left
[ 4005.542070] apache2 used greatest stack depth: 1936 bytes left
[ 4005.542614] apache2 used greatest stack depth: 1616 bytes left
[ 5531.406529] apache2 used greatest stack depth: 720 bytes left

$ cat /sys/kernel/debug/tracing/stack_trace
Depth Size Location (55 entries)
----- ---- --------
0) 7440 48 add_partial+0x26/0x90
1) 7392 64 __slab_free+0x1a9/0x380
2) 7328 64 kmem_cache_free+0xb9/0x160
3) 7264 16 free_buffer_head+0x25/0x50
4) 7248 64 try_to_free_buffers+0x79/0xc0
5) 7184 160 xfs_vm_releasepage+0xda/0x130 [xfs]
6) 7024 16 try_to_release_page+0x33/0x60
7) 7008 384 shrink_page_list+0x585/0x860
8) 6624 528 shrink_zone+0x636/0xdc0
9) 6096 112 do_try_to_free_pages+0xc2/0x3c0
10) 5984 112 try_to_free_pages+0x64/0x70
11) 5872 256 __alloc_pages_nodemask+0x3d2/0x710
12) 5616 48 alloc_pages_current+0x8c/0xe0
13) 5568 32 __page_cache_alloc+0x67/0x70
14) 5536 80 find_or_create_page+0x50/0xb0
15) 5456 160 _xfs_buf_lookup_pages+0x145/0x350 [xfs]
16) 5296 64 xfs_buf_get+0x74/0x1d0 [xfs]
17) 5232 48 xfs_buf_read+0x2f/0x110 [xfs]
18) 5184 80 xfs_trans_read_buf+0x2bf/0x430 [xfs]
19) 5104 80 xfs_btree_read_buf_block+0x5d/0xb0 [xfs]
20) 5024 80 xfs_btree_lookup_get_block+0x84/0xf0 [xfs]
21) 4944 176 xfs_btree_lookup+0xd7/0x490 [xfs]
22) 4768 16 xfs_alloc_lookup_ge+0x1c/0x20 [xfs]
23) 4752 144 xfs_alloc_ag_vextent_near+0x58/0xb30 [xfs]
24) 4608 32 xfs_alloc_ag_vextent+0xe5/0x140 [xfs]
25) 4576 96 xfs_alloc_vextent+0x49f/0x630 [xfs]
26) 4480 160 xfs_bmbt_alloc_block+0xbe/0x1d0 [xfs]
27) 4320 208 xfs_btree_split+0xb3/0x6a0 [xfs]
28) 4112 96 xfs_btree_make_block_unfull+0x151/0x190 [xfs]
29) 4016 224 xfs_btree_insrec+0x39c/0x5b0 [xfs]
30) 3792 128 xfs_btree_insert+0x86/0x180 [xfs]
31) 3664 352 xfs_bmap_add_extent_delay_real+0x564/0x1670 [xfs]
32) 3312 208 xfs_bmap_add_extent+0x41c/0x450 [xfs]
33) 3104 448 xfs_bmapi+0x982/0x1200 [xfs]
34) 2656 256 xfs_iomap_write_allocate+0x248/0x3c0 [xfs]
35) 2400 208 xfs_iomap+0x3d8/0x410 [xfs]
36) 2192 32 xfs_map_blocks+0x2c/0x30 [xfs]
37) 2160 256 xfs_page_state_convert+0x443/0x730 [xfs]
38) 1904 64 xfs_vm_writepage+0xab/0x160 [xfs]
39) 1840 32 __writepage+0x1a/0x60
40) 1808 288 write_cache_pages+0x1f7/0x400
41) 1520 16 generic_writepages+0x27/0x30
42) 1504 48 xfs_vm_writepages+0x5a/0x70 [xfs]
43) 1456 16 do_writepages+0x24/0x40
44) 1440 64 writeback_single_inode+0xf1/0x3e0
45) 1376 128 writeback_inodes_wb+0x31e/0x510
46) 1248 16 writeback_inodes_wbc+0x1e/0x20
47) 1232 224 balance_dirty_pages_ratelimited_nr+0x277/0x410
48) 1008 192 generic_file_buffered_write+0x19b/0x240
49) 816 288 xfs_write+0x849/0x930 [xfs]
50) 528 16 xfs_file_aio_write+0x5b/0x70 [xfs]
51) 512 272 do_sync_write+0xd1/0x120
52) 240 48 vfs_write+0xcb/0x1a0
53) 192 64 sys_write+0x55/0x90
54) 128 128 system_call_fastpath+0x16/0x1b

From: John Berthels
John Berthels wrote:
> John Berthels wrote:
>> I'll reply again after it's been running long enough to draw
>> conclusions.
The box with patch + THREAD_ORDER 1 + LOCKDEP went down (with nothing
further retrievable via /var/log/messages or netconsole).

We're loading up a 2.6.33.2 + patch + THREAD_ORDER 2 (no LOCKDEP) to get
better info as to whether we are still blowing the 8k limit with the
patch in place.

jb
