From: Christof Schmitt on
On Tue, Jun 01, 2010 at 09:16:35AM -0400, Martin K. Petersen wrote:
> >>>>> "Christof" == Christof Schmitt <christof.schmitt(a)de.ibm.com> writes:
>
> >> Yep, known bug. Page writeback locking is messed up for buffer_head
> >> users. The extNfs folks volunteered to look into this a while back
> >> but I don't think they have found the time yet.
>
> Christof> Thanks for the info. This means that this bug appears with all
> Christof> filesystems?
>
> XFS and btrfs should be fine.

XFS looks good in my test, thanks for the hint. I am going to use XFS
for anything related to DIF for now. It would be nice to have a
solution that works for all filesystems, but it looks like this will
take some time and work.

Christof
From: Nick Piggin on
On Wed, Jun 02, 2010 at 09:17:56AM -0400, Martin K. Petersen wrote:
> >>>>> "Nick" == Nick Piggin <npiggin(a)suse.de> writes:
>
> >> 1) filesystem changed it
> >> 2) corruption on the wire or in the raid controller
> >> 3) the page was corrupted while the IO layer was doing the IO.
> >>
> >> 1 and 2 are easy, we bounce, retry and everyone continues on with
> >> their lives. With #3, we'll recrc and send the IO down again
> >> thinking the data is correct when really we're writing garbage.
> >>
> >> How can we tell these three cases apart?
>
> Nick> Do we really need to handle #3? It could have happened before the
> Nick> checksum was calculated.
>
> Reason #3 is one of the main reasons for having the checksum in the
> first place. The whole premise of the data integrity extensions is that
> the checksum is calculated in close temporal proximity to the data
> creation. I.e. eventually in userland.
>
> Filesystems will inevitably have to be integrity-aware for that to work.
> And it will be their job to keep the data pages stable during DMA.

Let's just think hard about what windows can actually be closed versus
how much effort goes into closing them. I also prefer not to accept
half-solutions in the kernel because they don't want to implement real
solutions in hardware (it's pretty hard to checksum and protect all
kernel data structures by hand).

For "normal" writes into pagecache, the data can get corrupted anywhere
from after it is generated in userspace, during the copy, while it is
dirty in cache, and while it is being written out.

Closing the "while it is dirty / while it is being written back" window
still leaves a pretty big one. Also, how do you handle mmap writes?
Write-protect and checksum the destination page after every store? Or
leave some window between when the pagecache is dirtied and when it is
written back? So I don't know whether it's worth putting a lot of effort
into this case.

If you had an interface for userspace to attach checksums to direct IO
requests or pagecache ranges, then not only could you close the entire
gap between userspace data generation and writeback, you could also
handle mmap writes and anything else just fine: userspace knows about
the concurrency details, so it can add the right checksum (and
potentially fsync etc.) when it's ready.
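
To make that concrete, here is a rough sketch of what such an interface
could look like. Everything in it is hypothetical: neither the struct
nor io_submit_with_integrity() exists today, it's just the shape of the
idea.

/* Hypothetical interface for attaching application-generated guard tags
 * to a write; none of these names exist in the kernel today. */
#include <stdint.h>
#include <stddef.h>
#include <sys/types.h>

struct io_integrity {
        uint64_t offset;      /* byte offset in the file the tags cover */
        uint32_t interval;    /* bytes per guard tag, e.g. 512 */
        uint32_t nr_tags;     /* number of entries in guard_tags[] */
        uint16_t *guard_tags; /* CRCs computed by the application */
};

/* Hypothetical call: submit a write together with the checksums that
 * were computed right after the data was generated. */
int io_submit_with_integrity(int fd, const void *buf, size_t len,
                             off_t offset, const struct io_integrity *integ);

Userspace computes the CRCs immediately after generating the data, so
the only remaining window is whatever the application itself allows.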

And the bounce-retry method would be sufficient to handle IO
transmission errors for normal IOs without being intrusive.
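
For illustration, the bounce idea in a simplified userspace model
(crc16_stub() and submit_write() are placeholders, not the real T10 CRC
or the real block-layer submission path): checksumming a private copy
guarantees the guard tag matches the bytes that actually go to the
device, whatever happens to the original page afterwards.

#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512

/* Placeholder checksum, standing in for the real T10 DIF CRC. */
static uint16_t crc16_stub(const uint8_t *buf, size_t len)
{
        uint16_t crc = 0;

        while (len--)
                crc = (uint16_t)((crc << 1) ^ *buf++);
        return crc;
}

/* Placeholder for handing the buffer and guard tag to the HBA. */
static void submit_write(uint64_t lba, const uint8_t *data, uint16_t guard)
{
        (void)lba; (void)data; (void)guard;
}

static void bounce_and_submit(const uint8_t *page_data, uint64_t lba)
{
        uint8_t bounce[SECTOR_SIZE];

        memcpy(bounce, page_data, SECTOR_SIZE);   /* stable private copy */
        submit_write(lba, bounce, crc16_stub(bounce, SECTOR_SIZE));
}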

From: Dave Chinner on
On Wed, Jun 02, 2010 at 03:37:48PM +0200, Christof Schmitt wrote:
> On Tue, Jun 01, 2010 at 09:16:35AM -0400, Martin K. Petersen wrote:
> > >>>>> "Christof" == Christof Schmitt <christof.schmitt(a)de.ibm.com> writes:
> >
> > >> Yep, known bug. Page writeback locking is messed up for buffer_head
> > >> users. The extNfs folks volunteered to look into this a while back
> > >> but I don't think they have found the time yet.
> >
> > Christof> Thanks for the info. This means that this bug appears with all
> > Christof> filesystems?
> >
> > XFS and btrfs should be fine.
>
> XFS looks good in my test, thanks for the hint. I am going to use XFS
> for anything related to DIF for now. It would be nice to have a
> solution that works for all filesystems, but it looks like this will
> take some time and work.

If you are running DIF hardware, then XFS is only OK for direct IO.
XFS will still get torn writes if you are overwriting buffered data
(either by write() or mmap()) because there are no interlocks to
prevent cached pages under writeback from being modified while DMA
is being performed.....
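
Roughly, the interlock that is missing looks something like this (a
kernel-flavoured sketch of the idea only, not a real patch):

#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Sketch: before a buffered write or mmap fault is allowed to redirty a
 * cached page, wait for any writeback of that page to finish so the DMA
 * and the guard tag computed over it stay consistent.
 */
static void wait_for_stable_page_sketch(struct page *page)
{
        lock_page(page);
        wait_on_page_writeback(page);   /* any in-flight DMA is done now */
        /* ...caller copies in new data / installs a writable pte... */
        set_page_dirty(page);
        unlock_page(page);
}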

Cheers,

Dave.
--
Dave Chinner
david(a)fromorbit.com
From: Vladislav Bolkhovitin on
James Bottomley, on 06/01/2010 05:27 PM wrote:
> On Tue, 2010-06-01 at 12:30 +0200, Christof Schmitt wrote:
>> What is the best strategy to continue with the invalid guard tags on
>> write requests? Should this be fixed in the filesystems?
>
> For write requests, as long as the page dirty bit is still set, it's
> safe to drop the request, since it's already going to be repeated. What
> we probably want is an error code we can return so that the layer that sees
> both the request and the page flags can make the call.
>
>> Another idea would be to pass invalid guard tags on write requests
>> down to the hardware, expect an "invalid guard tag" error and report
>> it to the block layer where a new checksum is generated and the
>> request is issued again. Basically implement a retry through the whole
>> I/O stack. But this also sounds complicated.
>
> No, no ... as long as the guard tag is wrong because the fs changed the
> page, the write request for the updated page will already be queued or
> in-flight, so there's no need to retry.

There's one interesting problem here, at least theoretically, with SCSI
or similar transports that allow a command queue depth greater than 1
and are allowed to internally reorder queued requests. I don't know the
FS/block layers well enough to tell whether sending several requests for
the same page is really possible, but we can see a real-life problem
that would be well explained if it is.

The problem would arise if the second (rewrite) request (SCSI command)
for the same page were queued to the corresponding device before the
original request finished. Since the device is allowed to freely reorder
requests, there is a chance that the original write request hits the
permanent storage *AFTER* the retry request, so the data changes the
retry is carrying would be lost. Welcome, data corruption.

For a single parallel SCSI or SAS device such a race may look
practically impossible, but for sophisticated clusters where many nodes
pretend to be a single SCSI device in a load-balancing configuration, it
becomes very real.

We can see the real-life problem in an active-active DRBD setup. In
this configuration two nodes act as a single SCST-powered SCSI device,
and they both run DRBD to keep their backing storage in sync. The
initiator uses them as a single multipath device in an active-active
round-robin load-balancing configuration, i.e. it sends requests to both
nodes in parallel, and DRBD takes care of replicating the requests to
the other node.

The problem is that sometimes DRBD complains about concurrent local
writes, like:

kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected!
[DISCARD L] new: 144072784s +8192; pending: 144072784s +8192

This message means that DRBD detected that both nodes received
overlapping writes on the same block(s) and DRBD can't figure out which
one to store. This is possible only if the initiator sent the second
write request before the first one completed.
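
For illustration only, the check behind that message boils down to a
sector-range overlap test along these lines (this is not DRBD's actual
code, and the units are simplified to sectors):

#include <stdint.h>
#include <stdbool.h>

struct pending_write {
        uint64_t sector;        /* start sector of a write still in flight */
        uint64_t nr_sectors;    /* its length */
};

/* Two writes conflict when their sector ranges overlap. */
static bool concurrent_local_write(const struct pending_write *pending,
                                   uint64_t new_sector, uint64_t new_nr)
{
        return new_sector < pending->sector + pending->nr_sectors &&
               pending->sector < new_sector + new_nr;
}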

The topic of this discussion could well explain the cause of that. But,
unfortunately, the people who reported it did not note which OS they run
on the initiator, so I can't say for sure that it's Linux.

Vlad

From: Boaz Harrosh on
On 06/03/2010 02:20 PM, Vladislav Bolkhovitin wrote:
>
> There's one interesting problem here, at least theoretically, with SCSI
> or similar transports that allow a command queue depth greater than 1
> and are allowed to internally reorder queued requests. I don't know the
> FS/block layers well enough to tell whether sending several requests for
> the same page is really possible, but we can see a real-life problem
> that would be well explained if it is.
>
> The problem would arise if the second (rewrite) request (SCSI command)
> for the same page were queued to the corresponding device before the
> original request finished. Since the device is allowed to freely reorder
> requests, there is a chance that the original write request hits the
> permanent storage *AFTER* the retry request, so the data changes the
> retry is carrying would be lost. Welcome, data corruption.
>

I might be totally wrong here, but I think NCQ can reorder sectors but
not writes. That is, if a sector is cached in device memory and a later
write comes in to modify the same sector, the original should be
replaced; two values of the same sector should not be kept in the device
cache at the same time.

Failing to do so is a SCSI device problem.

Please note that the page-to-sector mapping is not necessarily constant,
and the same page might get written to a different sector next time. But
filesystems will have to barrier in this case.
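
As a toy model of the cache semantics I mean (purely illustrative,
nothing like real firmware): a later write to an already-cached LBA
replaces the earlier data in place, so the device never holds two
versions of the same sector.

#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS 64
#define SECTOR_SIZE 512

struct cache_entry {
        int      valid;
        uint64_t lba;
        uint8_t  data[SECTOR_SIZE];
};

static struct cache_entry cache[CACHE_SLOTS];

static void cache_write(uint64_t lba, const uint8_t *data)
{
        int i, free_slot = -1;

        for (i = 0; i < CACHE_SLOTS; i++) {
                if (cache[i].valid && cache[i].lba == lba) {
                        /* Same sector already cached: replace it. */
                        memcpy(cache[i].data, data, SECTOR_SIZE);
                        return;
                }
                if (!cache[i].valid && free_slot < 0)
                        free_slot = i;
        }
        if (free_slot >= 0) {
                cache[free_slot].valid = 1;
                cache[free_slot].lba = lba;
                memcpy(cache[free_slot].data, data, SECTOR_SIZE);
        }
        /* (eviction when the cache is full is omitted) */
}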

> For a single parallel SCSI or SAS device such a race may look
> practically impossible, but for sophisticated clusters where many nodes
> pretend to be a single SCSI device in a load-balancing configuration, it
> becomes very real.
>
> We can see the real-life problem in an active-active DRBD setup. In
> this configuration two nodes act as a single SCST-powered SCSI device,
> and they both run DRBD to keep their backing storage in sync. The
> initiator uses them as a single multipath device in an active-active
> round-robin load-balancing configuration, i.e. it sends requests to both
> nodes in parallel, and DRBD takes care of replicating the requests to
> the other node.
>
> The problem is that sometimes DRBD complains about concurrent local
> writes, like:
>
> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected!
> [DISCARD L] new: 144072784s +8192; pending: 144072784s +8192
>
> This message means that DRBD detected that both nodes received
> overlapping writes on the same block(s) and DRBD can't figure out which
> one to store. This is possible only if the initiator sent the second
> write request before the first one completed.
>

It is totally possible in today's code.

DRBD should store the original command_sn of the write and discard
the sector with the lower SN. It should appear as a single device
to the initiator.
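
Something along these lines (illustrative only; the names are made up
and this is not SCST or DRBD code):

#include <stdint.h>
#include <stdbool.h>

struct queued_write {
        uint64_t sector;     /* start sector of the write */
        uint32_t cmd_sn;     /* sequence number assigned by the initiator */
};

/* When two writes for the same sectors collide, keep the one carrying
 * the higher command sequence number and drop the other. */
static bool incoming_wins(const struct queued_write *pending,
                          const struct queued_write *incoming)
{
        return incoming->cmd_sn > pending->cmd_sn;
}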

> The topic of this discussion could well explain the cause of that. But,
> unfortunately, the people who reported it did not note which OS they run
> on the initiator, so I can't say for sure that it's Linux.
>
> Vlad
>

Boaz