stable? quality assurance? [Kernel]


From: David Newall on 12 Jul 2010 16:00

Stefan Richter wrote:
> David Newall wrote:
>
>> Thus 2.6.34 is the latest gamma-test kernel. It's not stable and I
>> doubt anybody honestly thinks otherwise.
>>
>
> It works stable for what I use it for.
>
Mea culpa. I didn't mean that 2.6.34 is unstable, but that the term
"stable" is not appropriate for a newly released kernel; "gamma" should
be used instead.

Merely six months ago 2.6.32 was released; today we're preparing for
2.6.35; a new kernel every two months! Perhaps 2.6.31 is truly the
latest stable kernel; or else 2.6.27 is, which is the other 2.6 on the
front page of kernel.org. I'm pretty sure 2.4 is stable (which might
explain why I see it embedded *much* more frequently than 2.6.)

> If it doesn't for you, then I hope you are already in contact with the
> respective subsystem developers to get the regressions that you
> experience fixed.
>
(Segue to a problem which follows from calling bleeding-edge kernels
"stable".)

When reporting bugs, the first response is often, "we're not interested
in such an old kernel; try it with the latest." That's not hugely
useful when the latest kernels are not suitable for production use. If
kernels weren't marked stable until they had earned the moniker, for
example 2.6.27, then the expectation of developers and of users would be
consistent: developers could expect users to try it again with latest
stable kernel, and users could reasonably expect that trying it wouldn't
break their system.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Nix on 12 Jul 2010 16:30

On 11 Jul 2010, Martin Steigerwald said:

> 2.6.34 was a disaster for me: bug #15969 - patch was available before
> 2.6.34 already, bug #15788, also reported with 2.6.34-rc2 already, as well
> as most important two complete lockups - well maybe just X.org and radeon
> KMS, I didn't start my second laptop to SSH into the locked up one - on my
> ThinkPad T42. I fixed the first one with the patch, but after the lockups I
> just downgraded to 2.6.33 again.
[...]
> hang on hibernation with kernel 2.6.34.1 and TuxOnIce 3.1.1.1
>
> on this mailing list just a moment ago. But then 2.6.33 did hang with
> TuxOnIce which apparently (!) wasn't a TuxOnIce problem either, since
> 2.6.34 did not hang with it anymore which was a reason for me to try
> 2.6.34 earlier.

To introduce yet more anecdata into this thread, I too had problems with
TuxOnIce-driven suspend/resume from just post-2.6.32 to just pre-2.6.34.
The solution was, surprise surprise, to *raise a bug report*, whereupon
in short order I had a workaround. In 2.6.34, the problem vanished as
mysteriously as it appeared, as did the bug whereby X coredumped and the
screen stayed dark forever upon quitting X. 2.6.34 and 2.6.34.1 have
worked better for me than any kernel I've used since 2.6.30, with no
bugs noticeable on any of my machines (that's a first since 2.6.26).

I speculate that there may be some subtle piece of overwriting inside
the Radeon KMS and/or DRM code, which is obscure enough that it is
relatively easily perturbed by changes elsewhere in the kernel.

But nonetheless, one cannot extrapolate from a single bug in a subsystem
as complex as DRM/KMS to the quality of the entire kernel. This is
doubly true given the degree of difference between different cards
labelled as Radeons: I'd venture to state that most of the Radeon bugs
I've seen flow past over the last year or so only affect a small subset
of cards: but if you add them all up, it's likely that most users have
been bitten by at least one. But the problem here is not the kernel
developers, nor the kernel quality: it's that ATI Radeons are a
horrifically complicated and tangled web of slightly variable hardware.
(In this they are no different from any other modern graphics card.)

Martin, might I suggest considering stable kernels 'experimental' until
at least .1 is out? Before Linus releases a kernel, its only users are
dedicated masochists and developers: after the release, piles of regular
early adopters pour in, and heaps of bug reports head to lkml and fixes
head to -stable. The .1 kernels, with fixes for some of those, are the
first you can really call *stable*, as they've got fixes for bugs
isolated after testing by a much larger userbase of suckers.

-- N., dedicated sucker and masochist

From: Stefan Richter on 12 Jul 2010 17:20

David Newall wrote:
> Stefan Richter wrote:
>> If it doesn't for you, then I hope you are already in contact with the
>> respective subsystem developers to get the regressions that you
>> experience fixed.
>>
> (Segue to a problem which follows from calling bleeding-edge kernels
> "stable".)
>
> When reporting bugs, the first response is often, "we're not interested
> in such an old kernel; try it with the latest."

Because bug fixes are continuously going into the new kernels.

> That's not hugely useful when the latest kernels are not suitable for
> production use.

"I have this bug here." - "It might be fixed in 2.6.mn. Try it." - "I
don't want to because I got burned by 2.6.jk." Well, then don't do it
and keep using the old buggy kernel. Or use a forked kernel where
somebody adds bugfix backports and feature backports as you require
them, if that somebody does a really good job.

> If kernels weren't marked stable until they had earned the moniker,
> for example 2.6.27, then the expectation of developers and of users
> would be consistent:

2.6.27.y is what you call stable exactly because none of the boatloads
of bug fixes and improvements of each subsequent 2.6.x release goes into
it anymore.

That's the nature of the beast. You can't have your cake and eat it too.
Which is why it is important that we keep the regression count in new
kernels low and try to detect and fix regressions as early as possible.
I admit that I do not really help with this myself outside the subsystem
which I maintain. I usually start running -rc kernels only at later -rc's
(say, -rc5, only sometimes earlier) and don't test them beyond the one
or two configurations that I use personally. There were occasionally
regressions in the subsystem that I maintain but they were few and
always fixed quickly, and each one was a lesson in how to do better. So,
for that subsystem, the "Latest Stable Kernel" that is advertised on the
front page of kernel.org really and truly /is/ the latest stable release
that is recommended for production use, as far as that subsystem is
concerned.
--
Stefan Richter
-=====-==-=- -=== -==--
http://arcgraph.de/sr/

From: Martin Steigerwald on 12 Jul 2010 17:50

On Monday 12 July 2010, David Newall wrote:
> Stefan Richter wrote:
> > David Newall wrote:
> >> Thus 2.6.34 is the latest gamma-test kernel. It's not stable and I
> >> doubt anybody honestly thinks otherwise.
> >
> > It works stable for what I use it for.
>
> Mea culpa. I didn't mean that 2.6.34 is unstable, but that the term
> "stable" is not appropriate for a newly released kernel; "gamma" should
> be used instead.

I do think stable should mean "stable for the majority of users". That is
difficult to estimate, but I doubt that every dot-0 release qualifies for
it.

> Merely six months ago 2.6.32 was released; today we're preparing for
> 2.6.35; a new kernel every two months! Perhaps 2.6.31 is truly the
> latest stable kernel; or else 2.6.27 is, which is the other 2.6 on
> the front page of kernel.org. I'm pretty sure 2.4 is stable (which
> might explain why I see it embedded *much* more frequently than 2.6.)

I have these metrics:

martin@shambhala:~> uprecords -m 20 | cut -c1-70
     #               Uptime | System
----------------------------+-----------------------------------------
     1    36 days, 09:57:31 | Linux 2.6.32.3-tp42-toi-  Tue Jan 12 09:
     2    31 days, 01:07:24 | Linux 2.6.26.5-tp42-toi-  Tue Sep 30 13:
     3    24 days, 13:29:07 | Linux 2.6.33.2-tp42-toi-  Mon May 31 22:
     4    21 days, 15:08:21 | Linux 2.6.29.2-tp42-toi-  Tue Apr 28 22:
     5    19 days, 21:22:14 | Linux 2.6.33.2-tp42-toi-  Tue May 11 17:
     6    19 days, 09:49:05 | Linux 2.6.32.8-tp42-toi-  Fri Mar  5 11:
     7    18 days, 02:31:41 | Linux 2.6.29.6-tp42-toi-  Thu Jul  9 09:
     8    17 days, 12:38:36 | Linux 2.6.28.8-tp42-toi-  Wed Mar 18 10:
     9    16 days, 16:10:28 | Linux 2.6.31-tp42-toi-3.  Tue Sep 22 21:
    10    15 days, 14:39:26 | Linux 2.6.28.4-tp42-toi-  Mon Feb  9 22:
    11    15 days, 13:58:12 | Linux 2.6.27.7-tp42-toi-  Tue Dec  9 22:
    12    13 days, 21:11:06 | Linux 2.6.31-rc7-tp42-to  Mon Aug 31 21:
    13    13 days, 18:34:00 | Linux 2.6.29.2-tp42-toi-  Wed May 27 19:
    14    12 days, 21:54:18 | Linux 2.6.26.5-tp42-toi-  Fri Oct 31 13:
    15    10 days, 22:02:14 | Linux 2.6.28.7-tp42-toi-  Thu Feb 26 16:
    16    10 days, 16:29:02 | Linux 2.6.33.2-tp42-toi-  Fri Jun 25 19:
    17    10 days, 08:04:52 | Linux 2.6.26.2-tp42-toi-  Thu Sep 18 14:
    18    10 days, 03:52:30 | Linux 2.6.31.3-tp42-toi-  Thu Oct 15 09:
    19     9 days, 22:03:29 | Linux 2.6.31.5-tp42-toi-  Tue Nov  3 11:
    20     9 days, 00:24:22 | Linux 2.6.29.2-tp42-toi-  Thu Jun 25 14:
----------------------------+-----------------------------------------
-> 116     0 days, 00:52:03 | Linux 2.6.33.6-tp42-toi-  Mo
----------------------------+-----------------------------------------
1up in     0 days, 00:31:56 | at                        Mon Jul 12 23:
t10 in    15 days, 13:47:24 | at                        Wed Jul 28 12:
no1 in    36 days, 09:05:29 | at                        Wed Aug 18 08:
    up   608 days, 02:40:08 | since                     Thu Sep 18 14:
  down    54 days, 06:12:57 | since                     Thu Sep 18 14:
   %up               91.808 | since                     Thu Sep 18 14:
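
As a sanity check, the %up figure in the summary can be re-derived from the
up and down totals. This is a minimal Python sketch, assuming that %up is
simply up / (up + down); the helper name to_seconds is made up for the
illustration and is not part of uprecords.

```python
import re

def to_seconds(text):
    """Convert an uprecords-style duration such as '608 days, 02:40:08' to seconds."""
    days, hours, minutes, seconds = map(
        int, re.fullmatch(r"(\d+) days, (\d\d):(\d\d):(\d\d)", text).groups()
    )
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

# Totals copied from the "up" and "down" summary lines.
up = to_seconds("608 days, 02:40:08")
down = to_seconds("54 days, 06:12:57")

# Assumption: the percentage is uptime over total elapsed time.
percent_up = 100.0 * up / (up + down)
print(f"{percent_up:.3f}")  # prints 91.808
```

The result agrees with the reported value, so the summary lines are at least
internally consistent.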

And 228 entries in there in total since 2.6.26, with

martin@shambhala:~> uprecords -m 300 | cut -c1-70 | grep "0 days" | wc -l
148

entries for shorter than one day.
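
One caveat about this count: the pattern "0 days" also matches lines such as
"10 days, 22:02:14" and the "1up in" summary line, so 148 probably overstates
the number of sub-day sessions by a few. The following is a hedged Python
sketch that counts only ranked entries whose day field is exactly 0. It
assumes the column layout of the uprecords output shown above; the function
name and the sample lines are illustrative only.

```python
import re

# Rank (optionally marked "->"), then "<N> day(s), HH:MM:SS", then the "|" separator.
# The layout is an assumption taken from the output quoted in this message.
ENTRY = re.compile(r"^\s*(?:->\s*)?(\d+)\s+(\d+) days?, \d\d:\d\d:\d\d \|")

def count_short_sessions(lines):
    """Count ranked entries whose uptime is under one day (day field == 0)."""
    count = 0
    for line in lines:
        match = ENTRY.match(line)
        if match and int(match.group(2)) == 0:
            count += 1
    return count

sample = [
    "    15    10 days, 22:02:14 | Linux 2.6.28.7-tp42-toi-  Thu Feb 26 16:",
    "    20     9 days, 00:24:22 | Linux 2.6.29.2-tp42-toi-  Thu Jun 25 14:",
    "-> 116     0 days, 00:52:03 | Linux 2.6.33.6-tp42-toi-  Mo",
    "1up in     0 days, 00:31:56 | at                        Mon Jul 12 23:",
]
print(count_short_sessions(sample))  # prints 1
```

A plain substring match on "0 days" would count three of these four sample
lines; only the current session at rank 116 is really shorter than a day.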

Sure, these numbers should not be read without knowing my experiences and
the reasons for rebooting, since sometimes I just messed up some kernel
option and compiled another one.

AFAIR 2.6.26 up to 2.6.32 have been fine, except 2.6.30 where TuxOnIce just
didn't work, but I am not yet sure whether this was caused by TuxOnIce or
by some problem with the general hibernation infrastructure. I then just
skipped 2.6.30. Since I only tried 2.6.31 on my T42, I got a whopping
uptime of over 100 days for 2.6.29 on my T23! That's stable. Well, any
kernel that reproducibly reaches more than 15 or 30 days is quite stable
in my own subjective consideration. Most kernels that got that far would
likely have lasted much longer if I hadn't just compiled the next one, be
it a dot release or a major release.

This all without Radeon KMS!

2.6.33.2 was only stable when I used Radeon KMS without TuxOnIce. OK, so
it might be a TuxOnIce problem, but then at least the quite frequent hangs
on hibernation that I had with 2.6.33.2, at the point where the screen goes
black for a few seconds and then comes back, were gone with 2.6.34. Maybe
they are gone with 2.6.33.6 too, since it carries some more radeon drm fixes.

2.6.34 did not reach an uptime of more than 2 or 3 days yet.

Well, maybe Nix is right and it's just that Radeon KMS has not been
stabilized enough, while the rest of the kernel is quite stable.

And if the combination of 2.6.33, now .6, and userspace software suspend
works for me, so be it for the time being. It is the first time it does;
usually it was TuxOnIce that worked, and none of the in-kernel methods I
tried from time to time. This holds even though userspace software suspend
is way slower and doesn't saturate the disk when writing the image.

> > If it doesn't for you, then I hope you are already in contact with
> > the respective subsystem developers to get the regressions that you
> > experience fixed.
>
> (Segue to a problem which follows from calling bleeding-edge kernels
> "stable".)
>
> When reporting bugs, the first response is often, "we're not interested
> in such an old kernel; try it with the latest." That's not hugely
> useful when the latest kernels are not suitable for production use. If
> kernels weren't marked stable until they had earned the moniker, for
> example 2.6.27, then the expectation of developers and of users would
> be consistent: developers could expect users to try it again with
> latest stable kernel, and users could reasonably expect that trying it
> wouldn't break their system.

I think that's really a question of how to attract more widespread testing.
To get tested widely, a kernel needs to be stable enough that enough users
are willing to deal with it. But without wider testing it might not get there.

I just dropped 2.6.34 for now and I will wait for more dot releases. Maybe
I am really the only one for whom 2.6.34 doesn't work; maybe other people
just dropped it too, out of frustration, without saying so here or in bugzilla.

Maybe providing better ways to report bugs and gather information, even on
freeze bugs, without too much manual setup could help. I certainly
think that the enhanced DrKonqi crash reporter from KDE 4.3 and up helped
users to provide *good bug reports*. Maybe there could be something like
that for the kernel, plus an easy option to have the kernel store
backtraces even for hard crashes. Unfortunately there is no reset button on
notebooks, so memory might be the wrong place. Well, one could dedicate
ring buffer space on the swap partition for that, or something similar;
that area should be writable even when no filesystem is working
anymore. On next reboot the bug report application recovers the crash data
from there. This would pose a risk that, on severe memory corruption, the
kernel writes crash data somewhere else, where it shouldn't. A USB
stick comes to mind, but what if the USB stack doesn't work anymore?
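
To make the ring buffer idea concrete, here is a minimal user-space Python
sketch, not a kernel implementation. A bytearray stands in for the reserved
area on the swap partition, and the CrashRing class, the "CRSH" magic and
the header layout are all hypothetical choices made for this illustration.

```python
import struct

MAGIC = b"CRSH"                  # hypothetical marker for an initialized region
HEADER = struct.Struct("<4sQ")   # magic + total number of bytes ever written

class CrashRing:
    """Fixed-size byte ring kept in a preallocated buffer.

    The buffer is a stand-in for a reserved, filesystem-independent disk area.
    """

    def __init__(self, region):
        self.region = region
        self.capacity = len(region) - HEADER.size
        magic, total = HEADER.unpack_from(region, 0)
        # An uninitialized region (wrong magic) is treated as empty.
        self.total = total if magic == MAGIC else 0

    def write(self, data):
        """Append bytes, overwriting the oldest data once the ring is full."""
        for byte in data:
            self.region[HEADER.size + self.total % self.capacity] = byte
            self.total += 1
        HEADER.pack_into(self.region, 0, MAGIC, self.total)

    def read(self):
        """Return the surviving bytes, oldest first."""
        length = min(self.total, self.capacity)
        start = self.total - length
        return bytes(
            self.region[HEADER.size + (start + i) % self.capacity]
            for i in range(length)
        )

region = bytearray(HEADER.size + 16)   # tiny region so the wrap-around is visible
CrashRing(region).write(b"oops: first trace\n")
CrashRing(region).write(b"panic!\n")
recovered = CrashRing(region).read()   # a fresh object plays the role of the next boot
print(recovered)
```

Because the write position lives in the header of the region itself, the
reader on the next boot needs no other state. Only the most recent bytes
survive, which matches the goal of keeping the last backtrace before a hang.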

Well not every bug is a freeze bug and maybe something could be done for
non freeze bugs. Like an application which records selected data while the
user reproduces the bug. Just like enhanced DrKonqi collects crash data
and even helps the user to install necessary debug packages.

But I think when a kernel behaves too unstably for lots of users, they just
drop it. Some bugs are okay, but freeze bugs in particular, and even more so
fs corruption bugs, scare away everyone who is not a die-hard kernel
debugger bisecting a kernel a day.

Maybe I just had lots of bad luck, so I would love to hear about other
experiences; some have already said 2.6.34 works pretty stably for them.

I will leave 2.6.34.1 on my T23, which has a Savage that may never get
KMS, who knows, and on the workstation at work, which doesn't use
Radeon KMS because of its rock-solid Debian Lenny userspace. Maybe this
will at least shed some light on whether most of my issues have been
Radeon KMS related.

As a side note: Ext4 is absolutely rock stable for me! As is XFS on my T23
and even BTRFS for the T23 /home and some work directory on the
workstation (not yet on my production T42).

Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7

From: Stefan Richter on 12 Jul 2010 18:50

Martin Steigerwald wrote:
> And when the combination of 2.6.33 now .6 and userspace software suspend
> works for me - for the first time, often it was TuxOnIce that worked, but
> not any in kernel method I tried from time to time - so be it for the time
> being, even if userspace software suspend is way slower and doesn't
> saturate the disk when writing the image.

BTW, the need to rely on a quite fundamental kernel component that is
not in the mainline (for whichever reason) in the long term, almost
guarantees you a lot of recurring pain, one way or another.
--
Stefan Richter
-=====-==-=- -=== -==-=
http://arcgraph.de/sr/
