reiserfs broken in 2.6.32 (was: Re: [GIT PULL] reiserfs fixes)

From: Andi Kleen on 2 Jan 2010 12:50

> I only have reiserfs partitions in my laptop and my testbox,
> nothing else. And that because I'm now maintaining it de facto.

AFAIK it's widely used in SUSE installations. It was the default
for a long time.

And right now as in 2.6.32 it's in a state of
"may randomly explode/deadlock". And no clear path out of it. Not good.

I am very concerned about destabilizing a widely used file system
like this. This has the potential to really hurt users.

> - that would require a notifier in schedule(), one notifier
> per sub-bkl. That's horrible for performances. And for
> the scheduler. I will be the first to NAK.

I thought the original idea was to find everything that
can sleep in reiserfs and simply wrap it with lock dropping?

That should be roughly equivalent to the old BKL semantics.

Where did it go wrong?

-Andi
--
ak(a)linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Frederic Weisbecker on 2 Jan 2010 14:10

On Sat, Jan 02, 2010 at 06:43:12PM +0100, Andi Kleen wrote:
> > I only have reiserfs partitions in my laptop and my testbox,
> > nothing else. And that because I'm now maintaining it de facto.
>
> AFAIK it's widely used in SUSE installations. It was the default
> for a long time.
>
> And right now as in 2.6.32 it's in a state of
> "may randomly explode/deadlock". And no clear path out of it. Not good.
>
> I am very concerned about destabilizing a widely used file system
> like this. This has the potential to really hurt users.

I understand your worries. And I've been very cautious with that,
waiting for three cycles before requesting an upstream merge. I did
it because the isolated tree model did not scale anymore.

Now that it's upstream, I get more testing, and I expect that by
the end of this cycle most of these issues will have been reported and
fixed.

Serious users who run serious data won't ship 2.6.33; they will ship
a later stable version, 2.6.33.x (if they haven't converted their
filesystems already).
And by that time, things should be 99% fixed.

> > - that would require a notifier in schedule(), one notifier
> > per sub-bkl. That's horrible for performances. And for
> > the scheduler. I will be the first to NAK.
>
> I thought the original idea was to find everything that
> can sleep in reiserfs and simply wrap it with lock dropping?
>
> That should be roughly equivalent to the old BKL semantics.
>
> Where did it go wrong?

That's the theory. Fitting into this strict scheme brings performance
regressions. The bkl is a spinlock, it disables preemption, it is
relaxed on sleep, and doesn't have locking dependencies. Moreover
it's not a lock but a simulation of a NO_PREEMPT UP flow, with all
the fixup guardians that come with it (fixup if we schedule, as
scheduling brings races).

The conversion gives birth to a mutex. Even though we have
adaptive spinning, we don't match spinlock performance,
as it's not a pure optimized looping fast path, and it may
actually just sleep.

The bkl is relaxed only when we sleep. Now simulating that with
a mutex that gets explicitly relaxed is not the same thing as
we need to relax the lock each time we _might_ sleep. It means
we relax more and that brings performance regressions.
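
The difference described above can be sketched in userspace C. This is only an illustration under stated assumptions, not the reiserfs code: `struct fs_lock`, `call_relaxed()`, `read_blocks()` and `fake_read_block()` are hypothetical names, and a pthread mutex stands in for the kernel mutex. The lock is dropped around every call that might sleep, whether or not it actually sleeps, whereas the bkl would only have been released on a real sleep.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for a per-filesystem mutex that replaces the bkl. */
struct fs_lock {
    pthread_mutex_t mutex;
    int relax_count;            /* times the lock was dropped and retaken */
};

/* Hypothetical operation that *might* sleep; here it never does. */
int fake_read_block(void *arg)
{
    int *blocks_read = arg;

    (*blocks_read)++;
    return 0;
}

/*
 * Caller holds l->mutex. Drop it around a call that might sleep, then
 * retake it. The relaxation is paid even when fn() does not sleep.
 */
int call_relaxed(struct fs_lock *l, int (*fn)(void *), void *arg)
{
    int rc, ret;

    rc = pthread_mutex_unlock(&l->mutex);
    assert(rc == 0);
    ret = fn(arg);
    rc = pthread_mutex_lock(&l->mutex);
    assert(rc == 0);
    (void)rc;
    l->relax_count++;
    return ret;
}

/* Caller holds l->mutex for the whole loop, except inside call_relaxed(). */
int read_blocks(struct fs_lock *l, int nr, int *blocks_read)
{
    for (int i = 0; i < nr; i++) {
        int ret = call_relaxed(l, fake_read_block, blocks_read);

        if (ret)
            return ret;
    }
    return 0;
}
```

In this sketch, reading three blocks relaxes the lock three times although nothing slept; with bkl semantics the count would have been zero.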

That said, it's sometimes a drawback for the bkl to be relaxed
every time we schedule, because we need to fix up after that;
sometimes we need to re-walk the entire tree, etc...

So sometimes we can do better. There are some places where
we don't relax like the bkl did, so that we don't need to fix up,
and we get a performance win.
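
The fixup cost mentioned above can be illustrated with a small userspace sketch. It is an assumption-laden model, not the reiserfs implementation: `struct tree`, `struct path`, `tree_search()`, `update_item()` and the two I/O callbacks are hypothetical, and a generation counter is used here as one possible way to detect that the tree changed while the lock was dropped.

```c
#include <assert.h>
#include <pthread.h>

struct tree {
    pthread_mutex_t lock;
    unsigned long generation;   /* bumped on every structural change */
};

struct path {
    unsigned long generation;   /* tree generation seen by the search */
    int slot;
};

/* Caller holds t->lock. Placeholder for a real tree walk. */
void tree_search(struct tree *t, int key, struct path *p)
{
    p->generation = t->generation;
    p->slot = key;
}

/* Blocking step during which nobody touches the tree. */
void quiet_io(struct tree *t)
{
    (void)t;
}

/* Blocking step during which "another task" changes the tree. */
void io_with_concurrent_change(struct tree *t)
{
    pthread_mutex_lock(&t->lock);
    t->generation++;
    pthread_mutex_unlock(&t->lock);
}

/* Returns how many tree walks were needed to update one item. */
int update_item(struct tree *t, int key, void (*blocking_io)(struct tree *))
{
    struct path p;
    int walks = 0;

    pthread_mutex_lock(&t->lock);
    tree_search(t, key, &p);
    walks++;

    /* Relax the lock around the blocking step, as the bkl would. */
    pthread_mutex_unlock(&t->lock);
    blocking_io(t);
    pthread_mutex_lock(&t->lock);

    /* Fixup: the path may be stale, so walk the tree again. */
    if (p.generation != t->generation) {
        tree_search(t, key, &p);
        walks++;
    }
    /* ... modify the item at p.slot while holding the lock ... */
    pthread_mutex_unlock(&t->lock);
    return walks;
}
```

If the lock were simply held across the blocking step, the staleness check and the second walk would not be needed, which is the kind of win a conventional lock makes possible.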

You see? The bkl semantics need not always be strictly imitated in
such a conversion. It depends on what the code does. In reiserfs,
sometimes it was desired that the bkl get relaxed, sometimes it
wasn't. And all the reiserfs code deals with that.

With a mutex we have the choice. So the conversion has been
a balance between performance regressions brought by the mutex
conversion, and the performance win because we have actually
more control with a traditional lock.

That said there are places where we really need to sleep, like
when we grab another lock, so that we don't create inverted
dependencies.
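
A minimal sketch of that last case, assuming two pthread mutexes as stand-ins: `lock_other_safe()` and `both_held()` are hypothetical helpers, not kernel functions. Dropping the big lock before taking the second one means the second lock is never acquired while the big lock is held, so every path ends up using the same order (other lock first, big lock second).

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/*
 * Caller holds `big`. Take `other` without ever holding `big` while
 * waiting for it: drop big, take other, then retake big.
 */
void lock_other_safe(pthread_mutex_t *big, pthread_mutex_t *other)
{
    int rc;

    rc = pthread_mutex_unlock(big);
    assert(rc == 0);
    rc = pthread_mutex_lock(other);
    assert(rc == 0);
    rc = pthread_mutex_lock(big);
    assert(rc == 0);
    (void)rc;
}

/*
 * Test helper: pthread_mutex_trylock() returns EBUSY when the mutex is
 * already locked, including by the calling thread. Only meant to be
 * called when both mutexes are expected to be held.
 */
int both_held(pthread_mutex_t *big, pthread_mutex_t *other)
{
    return pthread_mutex_trylock(big) == EBUSY &&
           pthread_mutex_trylock(other) == EBUSY;
}
```

As with any relaxation, state computed under the big lock before the call may need to be revalidated afterwards.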

That said, if the general opinion is in favour of unmerging
the bkl removal changes in reiserfs, then please do.

Just to express my point of view, as my primary goal is not
to fix reiserfs but the kernel: if you are afraid of such
changes, your kernel will just become mildewed over time.
You need to drop such bad legacies if you want it to
evolve. As long as users of the big kernel lock remain
in the kernel, vanilla upstream will keep it as a ball and chain
and won't ever be able to perform any serious real-time service,
etc...

So yes, this is risky. But I think it is necessary. And as I
explained above, things will be fine, as serious data is not
manipulated with a random -rc2 (except my own data...).


From: Andi Kleen on 2 Jan 2010 14:30

On Sat, Jan 02, 2010 at 08:02:15PM +0100, Frederic Weisbecker wrote:
> On Sat, Jan 02, 2010 at 06:43:12PM +0100, Andi Kleen wrote:
> > > I only have reiserfs partitions in my laptop and my testbox,
> > > nothing else. And that because I'm now maintaining it de facto.
> >
> > AFAIK it's widely used in SUSE installations. It was the default
> > for a long time.
> >
> > And right now as in 2.6.32 it's in a state of
> > "may randomly explode/deadlock". And no clear path out of it. Not good.
> >
> > I am very concerned about destabilizing a widely used file system
> > like this. This has the potential to really hurt users.
>
>
> I understand your worries. And I've been very cautious with that,
> waiting for three cycles before requesting an upstream merge. I did
> it because the isolated tree model did not scale anymore.
>
> Now that it's upstream, I get more testing and I expect that, in
> the end of this cycle, I get most of these issues reported and
> fixed.

Will you?

How many users' systems could it break by then?

>
> Serious users who run serious datas won't ship 2.6.33, they will ship
> a further stable version 2.6.33.x (if they haven't converted their
> filesystems already).
> And at this time, things should be 99% fixed.

That seems very risky. For some rarely used obscure subsystems
that might work but a widely used file system that keeps people's $HOME?
I don't think seriously destabilizing that for a potentially longer
time is a good idea. There's the potential to break
a lot of porcelain.

Probably you could do a ext3/ext4 like thing by starting
with a "reiserfs3.5" copy and do the work there and then
merge back once things work and have been reasonably verified
by code review.

> That's the theory. Fitting into this strict scheme brings performance
> regressions. The bkl is a spinlock, it disables preemption, it is
> relaxed on sleep, and doesn't have locking dependencies. Moreover
> it's not a lock but a simulation of a NO_PREEMPT UP flow, with all
> the fixup guardians that come with (fixup if we schedule, as
> scheduling brings races).
>
> From the conversion is borned a mutex. Even though we have
> adaptive spinning, we don't catch up spinlock performances
> as it's not a pure optimized looping fast path, and it may
> actually just sleep.

Fix the adaptive spinlock then?

>
> The bkl is relaxed only when we sleep. Now simulating that with
> a mutex that gets explicitly relaxed is not the same thing as
> we need to relax the lock each time we _might_ sleep. It means
> we relax more and that brings performance regressions.

At least in the cases where the decision is in reiserfs code
directly you could predict it by using need_resched(), couldn't you?

That might not be 100% accurate, but good enough.
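
A rough userspace sketch of that idea, with clearly labeled assumptions: `resched_wanted` is a hypothetical flag standing in for the kernel's need_resched() hint, `cond_relax()` is a made-up helper name, and sched_yield() stands in for the actual reschedule.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the need_resched() hint. */
atomic_bool resched_wanted;

/*
 * Caller holds `lock`. Drop it only when a reschedule is wanted.
 * Returns true if the lock was dropped and retaken, in which case
 * anything derived under the lock must be revalidated by the caller.
 */
bool cond_relax(pthread_mutex_t *lock)
{
    int rc;

    /* Consume the hint; if it was not set, keep the lock. */
    if (!atomic_exchange(&resched_wanted, false))
        return false;

    rc = pthread_mutex_unlock(lock);
    assert(rc == 0);
    sched_yield();
    rc = pthread_mutex_lock(lock);
    assert(rc == 0);
    (void)rc;
    return true;
}
```

The hint only predicts a voluntary reschedule; a callee that blocks for other reasons would still sleep with the lock held.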

> That said, if the general opinion is in favour of unmerging
> the bkl removal changes in reiserfs. Then please do.

For me it seems too aggressive at this point.

If it was just a case of fixing a few known bugs, but
if you're not even sure how many problems are left ...

Perhaps do the reiserfs35 variant?

> Just to express my point of view, as my primary goal is not
> to fix reiserfs but the kernel: If you are afraid of such
> changes, your kernel will just become mildewed by the time.

Better some mildew than a seriously-broken-for-enough-people
release (although I have my doubts that's the right metaphor
for the BKL anyway).

Having stable releases is an important part of
getting enough testers (we already have too few). And
if we start breaking their $HOMEs there might be
even fewer.

-Andi

--
ak(a)linux.intel.com -- Speaking for myself only.

From: Ingo Molnar on 2 Jan 2010 15:20

* Andi Kleen <andi(a)firstfloor.org> wrote:

> > I only have reiserfs partitions in my laptop and my testbox, nothing else.
> > And that because I'm now maintaining it de facto.
>
> AFAIK it's widely used in SUSE installations. It was the default for a long
> time.

[ Btw., if so then SuSE/Novell should sponsor Frederic's reiserfs maintenance
work. ]

> And right now as in 2.6.32 it's in a state of "may randomly
> explode/deadlock". And no clear path out of it. Not good.

You are quite wrong about that accusation: Frederic's changes are not in
v2.6.32 - they were merged by Linus in the v2.6.33 cycle so that code cannot
physically have caused any problems in v2.6.32 ...

If there are stability problems with reiserfs in v2.6.32 then it's doubly good
that Linus merged Frederic's tree in v2.6.33 [beyond the obvious advantage
that it gets rid of the BKL, which was a serious and oft reported limit to
reiserfs scalability]: finally there's again reiserfs development activity,
which might lead to further fixes, and which might solve the v2.6.32 stability
problems you mention.

In my view this work could have been merged sooner, in v2.6.32 already, maybe
that way the v2.6.32 reiserfs stability problems you mentioned could have been
avoided.

Thanks,

Ingo

From: Frederic Weisbecker on 2 Jan 2010 15:20

On Sat, Jan 02, 2010 at 08:23:37PM +0100, Andi Kleen wrote:
> On Sat, Jan 02, 2010 at 08:02:15PM +0100, Frederic Weisbecker wrote:
> > On Sat, Jan 02, 2010 at 06:43:12PM +0100, Andi Kleen wrote:
> > > > I only have reiserfs partitions in my laptop and my testbox,
> > > > nothing else. And that because I'm now maintaining it de facto.
> > >
> > > AFAIK it's widely used in SUSE installations. It was the default
> > > for a long time.
> > >
> > > And right now as in 2.6.32 it's in a state of
> > > "may randomly explode/deadlock". And no clear path out of it. Not good.
> > >
> > > I am very concerned about destabilizing a widely used file system
> > > like this. This has the potential to really hurt users.
> >
> >
> > I understand your worries. And I've been very cautious with that,
> > waiting for three cycles before requesting an upstream merge. I did
> > it because the isolated tree model did not scale anymore.
> >
> > Now that it's upstream, I get more testing and I expect that, in
> > the end of this cycle, I get most of these issues reported and
> > fixed.
>
> Will you?
>
> How many users systems could it break by then?

I've never lost any data since I began this work. And
I run it every day. Although I have experienced lock inversions,
and sometimes soft lockups, I have not experienced serious
damage. It's a journaling filesystem that can fix things up
pretty well.

Also we are talking about potential lock inversions, in potentially
rare paths, that could potentially raise soft lockups. That makes
a lot of potentials, for things that are going to be fixed and
for which I've never seen serious damage.

> >
> > Serious users who run serious datas won't ship 2.6.33, they will ship
> > a further stable version 2.6.33.x (if they haven't converted their
> > filesystems already).
> > And at this time, things should be 99% fixed.
>
> That seems very risky. For some rarely used obscure subsystems
> that might work but a widely used file system that keeps people's $HOME?
> I don't think seriously destabilizing that for a potentially longer
> time is a good idea. There's the potential to break
> a lot of porcelain.
>
> Probably you could do a ext3/ext4 like thing by starting
> with a "reiserfs3.5" copy and do the work there and then
> merge back once things work and have been reasonably verified
> by code review.

I fear nobody other than me will review it that deeply, which
limits the scalability of this plan.

We could make a new reiserfs version by duplicating the code
base. But nobody will test it. That would require patching
mkreiserfs, waiting for distros to ship it, and waiting for
users to install the distros, assuming that by then there
are still users setting up new reiserfs partitions.

We could also have a reiserfs-no-bkl config option that
would pick the duplicated code base. Again I fear few people
will test it.

>
> > That's the theory. Fitting into this strict scheme brings performance
> > regressions. The bkl is a spinlock, it disables preemption, it is
> > relaxed on sleep, and doesn't have locking dependencies. Moreover
> > it's not a lock but a simulation of a NO_PREEMPT UP flow, with all
> > the fixup guardians that come with (fixup if we schedule, as
> > scheduling brings races).
> >
> > From the conversion is borned a mutex. Even though we have
> > adaptive spinning, we don't catch up spinlock performances
> > as it's not a pure optimized looping fast path, and it may
> > actually just sleep.
>
> Fix the adaptive spinlock then?

Believe me, I've reviewed the mutex code several dozen times.
I just fail to find weaknesses in it, especially in the adaptive
spinning code.

We just can not make it as fast as a spinlock fast path, as it needs
to do regular checks to ensure it can continue to spin.
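
To make the extra checks concrete, here is a simplified userspace model, not the kernel mutex code. `struct adaptive_mutex`, `owner_on_cpu`, `adaptive_spin()` and the `budget` parameter are all assumptions made for this sketch; the budget is only a stand-in for the other conditions that end spinning, so that the example terminates.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum spin_result { SPIN_ACQUIRED, SPIN_MUST_SLEEP };

struct adaptive_mutex {
    atomic_bool locked;
    atomic_bool owner_on_cpu;   /* is the current owner running? */
};

/* Plain spinlock fast path, for comparison: one atomic op per loop. */
void plain_spin_lock(atomic_bool *locked)
{
    while (atomic_exchange(locked, true))
        ;
}

/*
 * Adaptive spinning: every failed attempt is followed by a check that
 * spinning is still worthwhile. `checks` counts those extra checks.
 */
enum spin_result adaptive_spin(struct adaptive_mutex *m, int budget,
                               int *checks)
{
    assert(checks != 0);

    while (budget-- > 0) {
        bool expected = false;

        if (atomic_compare_exchange_strong(&m->locked, &expected, true)) {
            atomic_store(&m->owner_on_cpu, true);
            return SPIN_ACQUIRED;
        }
        (*checks)++;
        /* An owner that is not running will not release soon. */
        if (!atomic_load(&m->owner_on_cpu))
            return SPIN_MUST_SLEEP;
    }
    return SPIN_MUST_SLEEP;     /* budget exhausted: block instead */
}
```

Each loop iteration in `adaptive_spin()` does strictly more work than the loop in `plain_spin_lock()`, and it can still end with the caller going to sleep.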

> >
> > The bkl is relaxed only when we sleep. Now simulating that with
> > a mutex that gets explicitly relaxed is not the same thing as
> > we need to relax the lock each time we _might_ sleep. It means
> > we relax more and that brings performance regressions.
>
> At least in the cases where the decision is in reiserfs code
> directly you could predict it by using need_resched(), couldn't you?
>
>
> That might not be 100% accurate, but good enough.

Sometimes I do. Sometimes it's just wasteful. We don't want to relax
the lock just because of a kmalloc(GFP_NOFS).

Sometimes relaxing the lock even when we are going to schedule is not
something we want, for performance reasons.

>
> > That said, if the general opinion is in favour of unmerging
> > the bkl removal changes in reiserfs. Then please do.
>
> For me it seems too aggressive at this point.
>
> If it was just a case of fixing a few known bugs, but
> if you're not even sure how many problems are left ...
>
> Perhaps do the reiserfs35 variant?

As explained above, I think this just reschedules the problem
for later. This model won't have any testers and won't evolve.

>
> > Just to express my point of view, as my primary goal is not
> > to fix reiserfs but the kernel: If you are afraid of such
> > changes, your kernel will just become mildewed by the time.
>
> Better some mildew than a seriously-broken-for-enough people's
> release (although I have my doubts that's the right metapher
> for the BKL anyways)
>
> Having stable releases is an important part for
> getting enough testers (we already have too little). And
> if we start breaking their $HOMEs they might become
> even less.

This is very unlikely to break their $HOME.
