From: Jerome Glisse on
On Tue, Feb 02, 2010 at 04:42:21PM +0100, Ingo Molnar wrote:
>
> * Jerome Glisse <glisse(a)freedesktop.org> wrote:
>
> > On Tue, Feb 02, 2010 at 09:17:27AM +0100, Ingo Molnar wrote:
> > >
> > > * Dave Airlie <airlied(a)linux.ie> wrote:
> > >
> > > > > Hi Linus,
> > > > >
> > > > > Please pull the 'drm-linus' branch from
> > > > > ssh://master.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6.git drm-linus
> > > > >
> > > >
> > > > I've also added an oops fix I seem to lose off my radar to this tree.
> > > >
> > > > commit 17aafccab4352b422aa01fa6ebf82daff693a5b3
> > > > Author: Michel Dänzer <daenzer(a)vmware.com>
> > > > Date: Fri Jan 22 09:20:00 2010 +0100
> > > >
> > > > drm/radeon/kms: Fix oops after radeon_cs_parser_init() failure.
> > >
> > > FYI, this drm pull into mainline has triggered quick boot crashes in -tip
> > > testing (even with the above fix applied), on an Athlon64 whitebox PC with:
> > >
> > > 01:00.0 VGA compatible controller: ATI Technologies Inc RV370 5B60 [Radeon X300 (PCIE)]
> > > 01:00.1 Display controller: ATI Technologies Inc RV370 [Radeon X300SE]
> > >
> > > the crash is:
> > >
> > > [ 7.111003] radeon 0000:01:00.0: Disabling GPU acceleration
> > > [ 7.273547] Failed to wait GUI idle while programming pipes. Bad things might happen.
> > > [ 7.436296] [drm:r100_cp_fini] *ERROR* Wait for CP idle timeout, shutting down CP.
> > > [ 7.598755] Failed to wait GUI idle while programming pipes. Bad things might happen.
> > > [ 7.599306] BUG: unable to handle kernel paging request at f8380000
> > > [ 7.599999] IP: [<c149f0de>] rv370_pcie_gart_set_page+0x2d/0x3c
> > > [ 7.599999] *pde = 36d44067 *pte = 00000000
> > > [ 7.599999] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
> > > [ 7.599999] last sysfs file:
> > >
> > > i have bisected it back to:
> > >
> > > | 97b94ccb9aa1b82ed7a9a045d0ae5b32c99b84a0 is the first bad commit
> > > | commit 97b94ccb9aa1b82ed7a9a045d0ae5b32c99b84a0
> > > | Author: Dave Airlie <airlied(a)redhat.com>
> > > | Date: Fri Jan 29 15:31:47 2010 +1000
> > > |
> > > | drm/radeon/kms: fix incorrect logic in DP vs eDP connector checking.
> > > |
> > > | This makes displayport work again here.
> > >
> > > Unfortunately even with that patch reverted it still crashes. Config and
> > > bootlog attached.
> > >
> > > It's the moving of radeon KMS out of staging after -rc6 that causes it,
> > > because it brought it into the scope of my testing:
> > >
> > > f71d018: drm/radeon/kms: move radeon KMS on/off switch out of staging.
> > >
> > > So at least on this box it's clearly not ready for mainline enablement yet.
> > > I've attached the revert patch further below.
> > >
> > > Ingo
> > >
> >
> > Attached is a patch which will fix the oops, still it's strange that CP
> > fails to init on your config. [...]
>
> Thanks, that fixes the crash here!
>
> Tested-by: Ingo Molnar <mingo(a)elte.hu>
>
> > [...] Do you have IOMMU enabled? I haven't played with iommu stuff, thus i
> > wonder if we are missing something in this area.
>
> No IOMMU here - this is a 5 years old box. (beyond GART that is)
>
> Your patch fixes a bona-fide illegal-access bug in the DRM code, that's more
> than enough to crash the box ;-)
>
> Btw., there's a new warning in the DRM code
>
> drivers/gpu/drm/ati_pcigart.c: In function 'drm_ati_pcigart_init':
> drivers/gpu/drm/ati_pcigart.c:115: warning: format '%Lx' expects type 'long long unsigned int', but argument 3 has type 'dma_addr_t'
>
> Please fix that too, the kernel build is noisy enough as-is.
>
> Thanks,
>
> Ingo

I think i saw a patch for this. It's often a nice thing to let people do their
first patch on this kind of thing, but i try to fix such things when i run into
them, though i haven't always been a well-behaved kid in the % format area.
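
For reference, a minimal sketch of the usual way to silence this kind of
warning: cast the value to unsigned long long so it matches the format
whatever width dma_addr_t has in a given config. This is a userspace
illustration, not the actual patch; the typedef and the helper name
format_bus_addr are assumptions made up for the example.

```c
#include <stdint.h>
#include <stdio.h>

/* Userspace stand-in for the kernel's dma_addr_t (assumption: the real
 * type is 32 or 64 bits wide depending on the kernel configuration). */
typedef uint64_t dma_addr_t;

/* Hypothetical helper: format a bus address without a format warning.
 * The cast makes the argument match %llx regardless of how wide
 * dma_addr_t is. Returns snprintf()'s result. */
int format_bus_addr(char *buf, size_t len, dma_addr_t addr)
{
    return snprintf(buf, len, "0x%llx", (unsigned long long)addr);
}
```

The same cast applies to a printk-style call in the driver; only the
argument changes, not the message.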

Cheers,
Jerome
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Ingo Molnar on

* Dave Airlie <airlied(a)gmail.com> wrote:

> On Wed, Feb 3, 2010 at 1:46 AM, Ingo Molnar <mingo(a)elte.hu> wrote:
> >
> > * Dave Airlie <airlied(a)gmail.com> wrote:
> >
> >> On Tue, Feb 2, 2010 at 6:17 PM, Ingo Molnar <mingo(a)elte.hu> wrote:
> >> >
> >> > * Dave Airlie <airlied(a)linux.ie> wrote:
> >> >
> >> >> > Hi Linus,
> >> >> >
> >> >> > Please pull the 'drm-linus' branch from
> >> >> > ssh://master.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6.git drm-linus
> >> >> >
> >> >>
> >> >> I've also added an oops fix I seem to lose off my radar to this tree.
> >> >>
> >> >> commit 17aafccab4352b422aa01fa6ebf82daff693a5b3
> >> >> Author: Michel Dänzer <daenzer(a)vmware.com>
> >> >> Date:   Fri Jan 22 09:20:00 2010 +0100
> >> >>
> >> >>     drm/radeon/kms: Fix oops after radeon_cs_parser_init() failure.
> >> >
> >>
> >> Weird, this suggests something else is wrong on that machine. Can you get me
> >> the whole dmesg? I'm guessing some iommu or swiotlb issue.
> >
> > This box has no known hardware or software problems, just this week it booted
> > in excess of 1000 kernels so i'd exclude that angle for now.
> >
> > I have bisected the crash back to the DRM tree and the crash went away with
> > the Kconfig revert i applied - and it got fixed by Jerome's patch. I posted
> > my config and i posted the relevant boot log as well. Find below the full
> > bootlog as well with vanilla -git (ab65832) and the config. (i dont think it
> > matters)
> >
> >> I've asked Jerome to fix the oops, but really anyone with an old .config
> >> won't get hit by this, and we've booted this on quite a lot of machines at
> >> this point.
> >
> > I dont see the commit in yesterday's linux-next. It has very fresh
> > timestamps:
> >
> >  commit f71d0187987e691516cd10c2702f002c0e2f0edc
> >  Author:     Dave Airlie <airlied(a)redhat.com>
> >  AuthorDate: Mon Feb 1 11:35:47 2010 +1000
> >  Commit:     Dave Airlie <airlied(a)redhat.com>
> >  CommitDate: Mon Feb 1 11:35:47 2010 +1000
> >
> > What kind of widespread testing could this commit have gotten in the less
> > than 24 hours before it hit mainline?
> >
>
> It's shipping in a major distro by default, it's planned to be shipped in an
> even more major distro. It's been boot tested on 1000s of machines by 1000s
> of people.

Well but that's not the precise tree you sent to Linus, is it?

Ingo
From: Ingo Molnar on

* Dave Airlie <airlied(a)gmail.com> wrote:

> >> On Wed, Feb 3, 2010 at 1:46 AM, Ingo Molnar <mingo(a)elte.hu> wrote:
> >> >
> >> > * Dave Airlie <airlied(a)gmail.com> wrote:
> >> >
> >> >> On Tue, Feb 2, 2010 at 6:17 PM, Ingo Molnar <mingo(a)elte.hu> wrote:
> >> >> >
> >> >> > * Dave Airlie <airlied(a)linux.ie> wrote:
> >> >> >
> >> >> >> > Hi Linus,
> >> >> >> >
> >> >> >> > Please pull the 'drm-linus' branch from
> >> >> >> > ssh://master.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6.git drm-linus
> >> >> >> >
> >> >> >>
> >> >> >> I've also added an oops fix I seem to lose off my radar to this tree.
> >> >> >>
> >> >> >> commit 17aafccab4352b422aa01fa6ebf82daff693a5b3
> >> >> >> Author: Michel Dänzer <daenzer(a)vmware.com>
> >> >> >> Date:   Fri Jan 22 09:20:00 2010 +0100
> >> >> >>
> >> >> >>     drm/radeon/kms: Fix oops after radeon_cs_parser_init() failure.
> >> >> >
> >> >>
> >> >> Weird, this suggests something else is wrong on that machine. Can you get me
> >> >> the whole dmesg? I'm guessing some iommu or swiotlb issue.
> >> >
> >> > This box has no known hardware or software problems, just this week it booted
> >> > in excess of 1000 kernels so i'd exclude that angle for now.
> >> >
> >> > I have bisected the crash back to the DRM tree and the crash went away with
> >> > the Kconfig revert i applied - and it got fixed by Jerome's patch. I posted
> >> > my config and i posted the relevant boot log as well. Find below the full
> >> > bootlog as well with vanilla -git (ab65832) and the config. (i dont think it
> >> > matters)
> >> >
> >> >> I've asked Jerome to fix the oops, but really anyone with an old .config
> >> >> won't get hit by this, and we've booted this on quite a lot of machines at
> >> >> this point.
> >> >
> >> > I dont see the commit in yesterday's linux-next. It has very fresh
> >> > timestamps:
> >> >
> >> >  commit f71d0187987e691516cd10c2702f002c0e2f0edc
> >> >  Author:     Dave Airlie <airlied(a)redhat.com>
> >> >  AuthorDate: Mon Feb 1 11:35:47 2010 +1000
> >> >  Commit:     Dave Airlie <airlied(a)redhat.com>
> >> >  CommitDate: Mon Feb 1 11:35:47 2010 +1000
> >> >
> >> > What kind of widespread testing could this commit have gotten in the less
> >> > than 24 hours before it hit mainline?
> >> >
> >>
> >> It's shipping in a major distro by default, it's planned to be shipped in an
> >> even more major distro. It's been boot tested on 1000s of machines by 1000s
> >> of people.
> >
> > Well but that's not the precise tree you sent to Linus, is it?
>
> It pretty much is. If I could blame your crash on any of the recent patches
> I would, but it's something new and unfun. [...]

You dont seem to realize the plain and simple fact that the bug (and some
other bug) was obscure before because this particular KMS aspect of the
radeon driver was in drivers/staging/, and it became more prominent via this
post-rc6 commit:

| From f71d0187987e691516cd10c2702f002c0e2f0edc Mon Sep 17 00:00:00 2001
| From: Dave Airlie <airlied(a)redhat.com>
| Date: Mon, 1 Feb 2010 11:35:47 +1000
| Subject: [PATCH] drm/radeon/kms: move radeon KMS on/off switch out of staging.
|
| We are happy enough that the KMS driver is stable enough for enough people
| for the kms enable/disable to leave staging. Distros can now contemplate
| turning this on.
|
| Signed-off-by: Dave Airlie <airlied(a)redhat.com>
| ---
| drivers/gpu/drm/Kconfig | 2 ++
| drivers/staging/Kconfig | 2 --
| 2 files changed, 2 insertions(+), 2 deletions(-)

I never claimed (and still dont claim) that the bug is 'new' per se, so why
do you keep beating down on that straw man argument? I said it in my very
first mail that this bug got brought upon us by the Kconfig commit above:

> > It's the moving of radeon KMS out of staging after -rc6 that causes it,
> > because it brought it into the scope of my testing:
> >
> > f71d018: drm/radeon/kms: move radeon KMS on/off switch out of staging.
> >
> > So at least on this box it's clearly not ready for mainline enablement
> > yet.

I dont mind reporting bugs and testing patches (as i did), all i said is that
from a QA angle it's somewhat late to do that in -rc7. (It's not even a
completely new driver either, which people would know to stay away from -
it's a new config option of an existing driver, so i'd expect many people to
turn it on when they see it in the oldconfig - even though it's default-off.)

You made the bug more prominent by moving it into the driver proper, after
-rc6, and while i dont mind reporting and working on bugs, your constant
denial is somewhat counter-productive, as (beyond the waste of time on these
emails) it suggests that we might see repeat incidents of this kind in the
future.

Anyway, with two bugs in a row this commit is clearly too problematic for me
so i have reverted f71d018 from -tip.

Thanks,

Ingo
From: Ingo Molnar on

* Dave Airlie <airlied(a)gmail.com> wrote:

> On Wed, Feb 3, 2010 at 1:44 AM, Ingo Molnar <mingo(a)elte.hu> wrote:
> >
> > * Dave Airlie <airlied(a)linux.ie> wrote:
> >
> >> > It's the moving of radeon KMS out of staging after -rc6 that causes it,
> >> > because it brought it into the scope of my testing:
> >> >
> >> >  f71d018: drm/radeon/kms: move radeon KMS on/off switch out of staging.
> >> >
> >> > So at least on this box it's clearly not ready for mainline enablement
> >> > yet. I've attached the revert patch further below.
> >>
> >> It's not enabled by default so reverting this doesn't make much sense.
> >
> > I boot allyesconfig kernels regularly, which testing method works fine
> > with another 2000+ upstream drivers. (including the dozens of drivers
> > which match to active hardware components on that box)
>
> Okay this was something I wondered about, since these are *not*
> allyesconfig .configs, I've generated some and CONFIG_FB_RADEON is always
> on here, and you seem to not have that enabled (not that enabling it is a
> good idea it is in fact a really bad idea).

These were random configs - the size doesnt match an allyesconfig, those are
way bigger. My above comment related to the first crash, and to my argument
that all other drivers are fine during bootup - and there's a lot of them.

> So do you have something you are running after allyesconfig to fix things?
> or have you just got a config that is close enough to allyesconfig.
>
> I'm building kernels with your .config now and boot testing them on the
> full range of hardware I have.

Thanks. Is there something i can enable to get a better log for you to find
out where (and why) it's hanging? It's still early during bootup so the box
is not particularly debuggable - so i'm not sure i can get a task list dump,
etc., unfortunately.

Ingo
From: Ingo Molnar on

* Dave Airlie <airlied(a)gmail.com> wrote:

> >>
> >> These were random configs - the size doesnt match an allyesconfig, those are
> >> way bigger. My above comment related to the first crash, and to my argument
> >> that all other drivers are fine during bootup - and there's a lot of them.
> >>
> >>> So do you have something you are running after allyesconfig to fix things?
> >>> or have you just got a config that is close enough to allyesconfig.
> >>>
> >>> I'm building kernels with your .config now and boot testing them on the
> >>> full range of hardware I have.
> >>
> >> Thanks. Is there something i can enable to get a better log for you to find
> >> out where (and why) it's hanging? It's still early during bootup so the box
> >> is not particularly debuggable - so i'm not sure i can get a task list dump,
> >> etc., unfortunately.
> >>
> > Do you have NMI watchdog enabled? (does it work that early)
> >
> > a backtrace of where it hangs would be nice,
> >
> > A dmesg from booting with drm.debug=15 might also help narrow it
> > down.
> >
>
> Okay, I've booted this on the rv370 + rv380 machines I have. I've no old
> Athlons though, so I'm trying to get RHTS to give me access to one or two
> internally.
>
> Another question that came to mind: are there any monitors plugged in,
> or a KVM or something? If yes, can you try without, and if no, can you try
> with?
>
> Also can you add CONFIG_FRAMEBUFFER_CONSOLE as well since I don't think we
> test often without it.

ok - i'll try your suggestions and send an update - will probably have to
wait until next week. Feel free to deprioritize the bug until then (i have my
revert as a short-term band-aid), unless others report similar problems too.
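
For reference, drm.debug is a bitmask, so 15 (0xf) sets the four low bits.
Below is a minimal sketch of that arithmetic, assuming the four bits select
core, driver, KMS and mode debug output. The DBG_* names and the helper are
hypothetical stand-ins for illustration, not copied from the DRM headers, so
check the tree being booted for the actual categories.

```c
#include <stdio.h>

/* Assumed category bits (hypothetical names; the real definitions live
 * in the DRM headers and may differ between kernel versions). */
enum {
    DBG_CORE   = 0x01,
    DBG_DRIVER = 0x02,
    DBG_KMS    = 0x04,
    DBG_MODE   = 0x08,
};

/* Count how many of the four assumed categories a mask enables. */
int drm_debug_categories(unsigned int mask)
{
    int n = 0;
    unsigned int bit;

    for (bit = DBG_CORE; bit <= DBG_MODE; bit <<= 1)
        if (mask & bit)
            n++;
    return n;
}
```

Under that assumption drm_debug_categories(15) counts all four categories,
which is why 15 is the usual "log everything" value to ask for.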

Thanks,

Ingo