From: Chris Wilson on
On Wed, 30 Jun 2010 18:24:04 -0700, Linus Torvalds <torvalds(a)linux-foundation.org> wrote:
> On Wed, Jun 30, 2010 at 4:07 PM, Linus Torvalds
> <torvalds(a)linux-foundation.org> wrote:
> >
> > That commit changes the page cache allocation to use
> >
> > +                                          mapping_gfp_mask (mapping) |
> > +                                          __GFP_COLD |
> > +                                          gfpmask);
> >
> > if I read it right. And the default mapping_gfp_mask() is
> > GFP_HIGHUSER_MOVABLE, so I think you get all of
> > (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL | __GFP_HIGHMEM)
> > set by default.
>
> .. and then I left out the one flag I _meant_ to have there, namely
> __GFP_MOVABLE.
>
> > The old code didn't just play games with ~__GFP_NORETRY and change
> > that at runtime (which was buggy - no locking, no protection, no
> > nothing), it also initialized the gfp mask. And that code also got
> > removed:

That code I added with the original shrinker patch, and the flags lifted
from the shmem defaults, tweaked to what seemed sane with the addition of
NORETRY and friends. I see that i915 is unique in using shmem as the page
allocator, which perhaps explains why this failure is not observed with
the ttm drivers. ttm uses two sets of gfp mask: HIGHUSER and USER | DMA32.
So replacing the mapping_gfp_mask() with HIGHUSER would seem appropriate.
And the interaction of MOVABLE could explain why hibernate broke with the
introduction of GEM.

* turns to his trusty copy of LDD to explain the various meanings of
gfp_t...
--
Chris Wilson, Intel Open Source Technology Centre
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: M. Vefa Bicakci on
On 01/07/10 04:24 AM, Linus Torvalds wrote:
> On Wed, Jun 30, 2010 at 4:07 PM, Linus Torvalds
> <torvalds(a)linux-foundation.org> wrote:
>>
>> That commit changes the page cache allocation to use
>>
>> + mapping_gfp_mask (mapping) |
>> + __GFP_COLD |
>> + gfpmask);
>>
>> if I read it right. And the default mapping_gfp_mask() is
>> GFP_HIGHUSER_MOVABLE, so I think you get all of
>> (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL | __GFP_HIGHMEM)
>> set by default.
>
> .. and then I left out the one flag I _meant_ to have there, namely
> __GFP_MOVABLE.
>
>> The old code didn't just play games with ~__GFP_NORETRY and change
>> that at runtime (which was buggy - no locking, no protection, no
>> nothing), it also initialized the gfp mask. And that code also got
>> removed:
>
> In fact, I don't really see why we should use that mapping_gfp_mask()
> at all, since all allocations should be going through that
> i915_gem_object_get_pages() function anyway. So why not just change
> that function to ignore the default gfp mask for the mapping, and just
> use the mask that the o915 driver wants?
>
> Btw, why did it want to mark the pages reclaimable?
>
> Anyway, what I'm suggesting somebody who sees this test is just
> something like the patch below (whitespace-damage - I'm cutting and
> pasting, it's a trivial one-liner). Does this change any behavior?
> Vefa?
>
> Linus

Dear Linus,

I made the code change you documented below to a vanilla 2.6.34 tree,
compiled it and tested hibernate/thaw cycles. In total, I tested
16 cycles, with 8 consecutive cycles in one installation (Debian Sid)
and 8 consecutive cycles in another one (Fedora 13). For every cycle,
I tried to run "old" and "new" programs, in terms of whether they were
run in previous cycles. I tried a few extra cycles with uswsusp as well.

Based on my testing, I am happy to report that the change you suggest
fixes the "memory corruption (segfaults) after thaw" issue for me.
I can't thank you enough times for this.

Now, the obligatory question: Could we have this fix applied to 2.6.32,
2.6.33 and 2.6.34 ?

Thanks a lot again!

M. Vefa Bicakci

--- linux-2.6.34/drivers/gpu/drm/i915/i915_gem.c.orig 2010-07-01 17:47:30.000000000 +0000
+++ linux-2.6.34/drivers/gpu/drm/i915/i915_gem.c 2010-07-01 17:54:03.000000000 +0000
@@ -2312,7 +2312,7 @@
mapping = inode->i_mapping;
for (i = 0; i < page_count; i++) {
page = read_cache_page_gfp(mapping, i,
- mapping_gfp_mask (mapping) |
+ __GFP_HIGHMEM |
__GFP_COLD |
gfpmask);
if (IS_ERR(page))

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds on
On Thu, Jul 1, 2010 at 5:49 PM, Dave Airlie <airlied(a)gmail.com> wrote:
>
> RECLAIMABLE added also seems fine, of course you can't have
> RECLAIMABLE and MOVABLE (I find this out when it oopses on boot).

Yes. They are both flags for the anti-fragmentation code, and I think
I'll leave the decision as to whether the i915 driver should use
__GFP_RECLAIMABLE to the people who work with and care about the
fragmentation issues. I doubt it matters much in practice, at least
not for the loads that the fragmentation people tend to care most
about.

> So I suspect MOVABLE is the problem. but I don't know enough about gfp
> flags to know what RECLAIMABLE buys us, and where it �might bite us so
> I can test some more.

I think I'll just apply your previous tested patch - GFP_HIGHUSER
should take care of all the flags that matter fundamentally, and then
the reclaimable flag is really just a small detail for others to worry
about.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: M. Vefa Bicakci on
On 02/07/10 04:28 AM, Linus Torvalds wrote:
> On Thu, Jul 1, 2010 at 5:49 PM, Dave Airlie <airlied(a)gmail.com> wrote:
>>
>> RECLAIMABLE added also seems fine, of course you can't have
>> RECLAIMABLE and MOVABLE (I find this out when it oopses on boot).
>
> Yes. They are both flags for the anti-fragmentation code, and I think
> I'll leave the decision as to whether the i915 driver should use
> __GFP_RECLAIMABLE to the people who work with and care about the
> fragmentation issues. I doubt it matters much in practice, at least
> not for the loads that the fragmentation people tend to care most
> about.
>
>> So I suspect MOVABLE is the problem. but I don't know enough about gfp
>> flags to know what RECLAIMABLE buys us, and where it might bite us so
>> I can test some more.
>
> I think I'll just apply your previous tested patch - GFP_HIGHUSER
> should take care of all the flags that matter fundamentally, and then
> the reclaimable flag is really just a small detail for others to worry
> about.
>

Dear Linus,

I have bad news regarding your fix for self-reclaim and i915.

Apparently, I haven't tried enough hibernate/thaw cycles while
initially testing your fix.

After applying your fix to 2.6.34.1 and using it for two weeks,
I noticed that every now and then I get a black screen or random
kernel errors after thawing. I thought maybe this might be the
same problem caused by d8e0902806c0bd2ccc4f6a267ff52565a3ec933b .
(It turns out that my guess was right.)

So I compiled two vanilla 2.6.34.1 kernels. One with
d8e0902806c0bd2ccc4f6a267ff52565a3ec933b reverted to get back
to pre 2.6.32.8 state, and another one with your fix applied.

Then I set up an automated process where the computer would
hibernate, and reboot at the end of the hibernation sequence
(by setting /sys/power/disk to reboot) and then thaw back.
I made this process loop at least 20 times.

The kernel with d8e0902806c0bd2ccc4f6a267ff52565a3ec933b reverted
was able to hibernate/thaw at least 40 times in one go, while
the one with your fix applied was able to hibernate/thaw at most
17 times (in two separate trials) after which it crashed during
the next thaw.

Is there anything I can do find out the correct flags to use
in addition to GFP_HIGHUSER ? Can I do something like a bisection
for the flags one by one starting from the pre 2.6.32.8 state?
If you could outline a procedure to do this, I would be glad to
follow it.

Sorry to bug you again about this problem because of incomplete
testing on my part.

Regards,

M. Vefa Bicakci
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds on
On Sat, Jul 17, 2010 at 11:58 AM, M. Vefa Bicakci
<bicave(a)superonline.com> wrote:
>
> The kernel with d8e0902806c0bd2ccc4f6a267ff52565a3ec933b reverted
> was able to hibernate/thaw at least 40 times in one go, while
> the one with your fix applied was able to hibernate/thaw at most
> 17 times (in two separate trials) after which it crashed during
> the next thaw.

Ok. I do wonder if the bug is possibly something entirely different,
and the allocation patterns just happen to expose/hide it. Reverting
the original commit should be pretty darn close to applying my fix.
Any remaining issues would seem to be more about the actual bug in the
original code (racing on changing that mapping->gfp_mask witthout any
locking) than about anything else.

> Is there anything I can do find out the correct flags to use
> in addition to GFP_HIGHUSER ? Can I do something like a bisection
> for the flags one by one starting from the pre 2.6.32.8 state?
> If you could outline a procedure to do this, I would be glad to
> follow it.

You can try adding __GFP_RECLAIMABLE | __GFP_NOMEMALLOC to the set of
flags in i915_gem_object_get_pages(). That's what the old code had
(and then it played games with NORETRY|NOWARN). I've attached a patch
(UNTESTED! Maybe it won't compile).

Now, I don't see why those flags would matter, but NOMEMALLOC in
particular does make a difference for memory allocation patterns under
low memory conditions, so maybe it could make a difference.

And if it _does_ make a difference, it would be interesting to know
which of the two flags matter. So try both flags first, and see if
that gets you something reliable. And if it does, remove one of them
and try again - just to see _which_ flag it is that the i915 driver
would care about. That would hopefully give us a hint.

> Sorry to bug you again about this problem because of incomplete
> testing on my part.

Oh, never be sorry for testing even more, and testing something nobody
else bothered to test.

Linus