stable? quality assurance? [Kernel]


From: Theodore Tso on 14 Jul 2010 02:40

On Jul 13, 2010, at 4:45 PM, David Newall wrote:
>
> Calling it stable instils and reinforces a Pavlovian response in typical users, that recent Linux kernels are dangerous and unreliable; one year old was suggested as a safe benchmark. Typical users being 99% of the population, testing hardly begins until a kernel is "sufficiently old." This Pavlovian response is what really delays finding and fixing bugs. Being up-front and saying which kernels are likely to fail would help many users calculate the risk and improve their willingness to try newer kernels. "Sufficiently old" might well come down to six months, maybe four.

Most typical users should be using distribution kernels. Period.

We can't say which kernels are likely to fail, because we don't know. If people don't test newer kernels, the mere passage of time, whether it's four months, or six months, or a year, or two years, is not going to magically make problems go away and get fixed. That only happens if someone steps up and tries it out and, if it breaks, submits bug reports or patches. A fairly large number of Linux developers seem to prefer relatively recent vintage Thinkpads, preferably without Nvidia or ATI chipsets. These laptops are generally safe and reliable by -rc3 or so --- because if they aren't, the Linux developers step up, complain, do code bisections, and fix the problem.

If someone has a T23 laptop, and they help out by doing the same, then it will also be safe and reliable by the time of 2.6.X.0. If they just kvetch and complain, and stamp their feet, and say "Linux is unsafe and unreliable", and no other T23 owners step up to the challenge, then two years might go by and the same kernel might still be unreliable --- for them.

-- Ted

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: david on 15 Jul 2010 03:30

On Tue, 13 Jul 2010, David Newall wrote:

> (Segue to a problem which follows from calling bleeding-edge kernels
> "stable".)
>
> When reporting bugs, the first response is often, "we're not interested in
> such an old kernel; try it with the latest." That's not hugely useful when
> the latest kernels are not suitable for production use. If kernels weren't
> marked stable until they had earned the moniker, for example 2.6.27, then the
> expectation of developers and of users would be consistent: developers could
> expect users to try it again with latest stable kernel, and users could
> reasonably expect that trying it wouldn't break their system.

2.6.27 didn't get declared 'stable' because it had very few bugs; it was
declared 'stable' because someone volunteered to maintain it longer and
back-port patches to it long past the normal process.

2.6.32 got declared 'long-term stable' before 2.6.33 was released, again
not because it was especially good, but because it didn't appear to be
especially bad and several distros were shipping kernels based on it, so
again someone volunteered (or was volunteered by the distro that pays
their paycheck) to back-port patches to it longer.

I have been running kernel.org kernels on my production systems for >13
years. I am _very_ short of time, so I generally don't get a chance to
test the -rc kernels (once in a while I do get a chance to do so on my
laptop). What I do is every 2-3 kernel releases I wait a couple days after
the kernel release to see if there are show-stopper bugs, and if nothing
shows up (which is the common case for the last several years) I compile a
kernel and load it on machines in my lab. I try to have a selection of
machines that match the systems I have in production in what I have found
are the 'important' ways (a definition that changes once in a while when
I find something that should 'just work' that doesn't ;-). This primarily
includes systems with all the network card types and RAID card types that
I use in production, but now also includes a machine with an SSD (after I
found a bug that only affected that combination).

If my lab machines don't crash immediately, I leave them running (usually
not even stress testing them, again lack of time) for a week or so, then I
put the new kernel on my development machines, wait a few days, then put
it on QA machines, wait a few days, then put it in production. I keep
the old kernel around so that I can reboot into it if needed.
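
As a rough illustration of the staged promotion described above, here is a
minimal Python sketch. The stage names come from the message; the exact
soak times (7 days for "a week or so", 3 days for "a few days"), the
2-day wait after release, and the names STAGES, rollout_plan and healthy
are assumptions made for the example, not part of anyone's actual tooling.

```python
# Each stage is (name, days to let the kernel soak before promoting it).
# The numbers are assumed readings of "a week or so" and "a few days".
STAGES = [
    ("lab", 7),
    ("development", 3),
    ("QA", 3),
    ("production", 0),
]


def rollout_plan(stages, start_day=0, healthy=lambda name: True):
    """Return a list of (stage, day the new kernel is installed there).

    `healthy` is a hypothetical check that reports whether a stage ran
    the new kernel without trouble. Promotion stops at the first stage
    that reports a problem, which is where one would reboot into the
    old kernel and file a bug report.
    """
    plan = []
    day = start_day
    for name, soak_days in stages:
        plan.append((name, day))
        if not healthy(name):
            break  # do not promote any further
        day += soak_days
    return plan


# Wait a couple of days after the release before the lab install.
full = rollout_plan(STAGES, start_day=2)

# Same schedule, but a problem shows up on the development machines.
stopped = rollout_plan(STAGES, start_day=2,
                       healthy=lambda name: name != "development")
```

With these assumed numbers a kernel reaches production roughly two weeks
after release, and a failure at any stage keeps it away from the later ones.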

This tends to work very well for me. It's not perfect and every couple of
cycles I run into grief and have to report a bug to the kernel list.
Usually I find it before I get into production, but I have run into cases
that got all the way into production before I found a problem.

With the 'new' -stable series, I generally wait until at least 2.6.x.1 is
released before I consider it ready to go anywhere outside my lab (I'll
still install the 2.6.x kernel in the lab, but I'll wait for the
additional testing that comes with the .1 stable kernels before moving it
on).

I don't go through this entire process with the later -stable kernels. If
I'm already running 2.6.x and there is a 2.6.x.y released that contains
fixes that look like they are relevant to the configuration that I run
(which rules out the majority of changes, I do fairly minimal kernel
configs) I will just test it in the lab to do a smoke test, then schedule
a rollout through the rest of my network. If there are no problems before
I get permission to deploy to production I put it on half my boxes,
fail over to them, then wait a little bit (a day to a week) before
upgrading the backups.

This writeup actually makes it sound like I spend a lot of time working
with kernels, but I really don't. I'll spend a couple of half days twice a
year on testing, and then additional time rolling it out to the 150+ clusters
of servers I have in place. If you can't spend at least this much time on
the kernel you are probably better off just running your distro kernel,
but even there you really should do a very similar set of tests on its
kernel releases.

There's another department in my company that uses distro kernels (big
name distro, but I will avoid flames by not naming names) without the
testing routine that I use, and my track record for stability compares
favorably to theirs over the last 7 years or so (they haven't been
running Linux as long as I have, so we can't go back as far ;-). They also
do more updates than I do simply because they can't as easily look at the
kernel release and decide it doesn't apply to them.

David Lang

From: david on 15 Jul 2010 03:40

On Tue, 13 Jul 2010, Stefan Richter wrote:

> Plus, a
> good bug report often requires experience or good intuition, besides
> patience and rigor.

In my experience these are less of a requirement than patience and
persistence. With these attributes you will be able to work your way
through figuring out what data is needed for this bug report by answering
questions (and if you get no response, trying again).

Nobody starts off knowing how to report a bug, and frequently you don't
start off knowing all the info that will be needed to solve the bug, but
if you report it and keep digging you will almost always get helped.

David Lang

From: Greg KH on 16 Jul 2010 03:00

On Sun, Jul 11, 2010 at 07:58:42PM +0400, William Pitcock wrote:
> 2.6.32.16 (possibly 2.6.32.15) has a regression where it is unusable
> as a Xen domU. I would say 2.6.32.12 is the best choice since who knows
> what other regressions there are in .16.

Did you happen to tell the stable maintainer about this and do a simple
'git bisect' to find the offending patch so that it can be resolved?

{sigh}
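
For readers unfamiliar with bisection, the idea behind 'git bisect' is a
binary search between a known-good and a known-bad point. The Python
sketch below shows that search at the granularity of point releases
rather than commits. The names first_bad and boots_as_domu_fails are
hypothetical, and treating 2.6.32.15 as the first broken release is only
an assumption for the example (the report above says "possibly").

```python
from typing import Callable, Sequence


def first_bad(versions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first version for which is_bad() is true.

    Assumes versions[0] is known good, versions[-1] is known bad, and
    every version after the first bad one is also bad.
    """
    lo, hi = 0, len(versions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid  # the regression is at mid or earlier
        else:
            lo = mid  # the regression is after mid
    return versions[hi]


releases = [f"2.6.32.{n}" for n in range(12, 17)]  # .12 through .16
tested = []


def boots_as_domu_fails(version: str) -> bool:
    """Stand-in for a real test (build, boot as a Xen domU, observe)."""
    tested.append(version)
    # Assumption for the example only: .15 introduced the regression.
    return int(version.rsplit(".", 1)[1]) >= 15


culprit = first_bad(releases, boots_as_domu_fails)
```

Only two of the five releases need to be built and booted here. The real
tool applies the same halving to individual commits, driven by
'git bisect start', 'git bisect good' and 'git bisect bad'.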


From: Greg KH on 16 Jul 2010 03:10

On Thu, Jul 15, 2010 at 10:09:03AM +0100, Valeo de Vries wrote:
> That said, from what I've seen of late, there's only one guy (Greg) handling
> most of the stable stuff (there are probably others working behind the
> scenes), and he has a hell of a lot on his plate.

Nope, it's just me :)

thanks,

greg "i need some minions" k-h
