Wrong DIF guard tag on ext2 write [Kernel]

Prev: KVM, Fix QEMU-KVM is killed by guest SRAO MCE (resend)
Next: drivers/watchdog: Eliminate a NULL pointer dereference

From: James Bottomley on 31 May 2010 11:10

On Mon, 2010-05-31 at 10:20 -0400, Martin K. Petersen wrote:
> >>>>> "Christof" == Christof Schmitt <christof.schmitt(a)de.ibm.com> writes:
>
> Christof> Since the guard tags are created in Linux, it seems that the
> Christof> data attached to the write request changes between the
> Christof> generation in bio_integrity_generate and the call to
> Christof> sd_prep_fn.
>
> Yep, known bug. Page writeback locking is messed up for buffer_head
> users. The extNfs folks volunteered to look into this a while back but
> I don't think they have found the time yet.
>
>
> Christof> Using ext3 or ext4 instead of ext2 does not show the problem.
>
> Last I looked there were still code paths in ext3 and ext4 that
> permitted pages to be changed during flight. I guess you've just been
> lucky.

Pages have always been modifiable in flight. The OS guarantees they'll
be rewritten, so the drivers can drop them if it detects the problem.
This is identical to the iscsi checksum issue (iscsi adds a checksum
because it doesn't trust TCP/IP and if the checksum is generated in
software, there's time between generation and page transmission for the
alteration to occur). The solution in the iscsi case was not to
complain if the page is still marked dirty.

James

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Boaz Harrosh on 31 May 2010 11:40

On 05/31/2010 06:01 PM, James Bottomley wrote:
> On Mon, 2010-05-31 at 10:20 -0400, Martin K. Petersen wrote:
>>>>>>> "Christof" == Christof Schmitt <christof.schmitt(a)de.ibm.com> writes:
>>
>> Christof> Since the guard tags are created in Linux, it seems that the
>> Christof> data attached to the write request changes between the
>> Christof> generation in bio_integrity_generate and the call to
>> Christof> sd_prep_fn.
>>
>> Yep, known bug. Page writeback locking is messed up for buffer_head
>> users. The extNfs folks volunteered to look into this a while back but
>> I don't think they have found the time yet.
>>
>>
>> Christof> Using ext3 or ext4 instead of ext2 does not show the problem.
>>
>> Last I looked there were still code paths in ext3 and ext4 that
>> permitted pages to be changed during flight. I guess you've just been
>> lucky.
>
> Pages have always been modifiable in flight. The OS guarantees they'll
> be rewritten, so the drivers can drop them if it detects the problem.
> This is identical to the iscsi checksum issue (iscsi adds a checksum
> because it doesn't trust TCP/IP and if the checksum is generated in
> software, there's time between generation and page transmission for the
> alteration to occur). The solution in the iscsi case was not to
> complain if the page is still marked dirty.
>

And also why RAID1 and RAID4/5/6 need the data bounced. I wish VFS
would prevent data writing given a device queue flag that requests
it. So all these devices and modes could just flag the VFS/filesystems
that: "please don't allow concurrent writes, otherwise I need to copy data"

From what Chris Mason has said before, all the mechanics are there, and it's
what btrfs is doing. Though I don't know how myself?

> James
>

Boaz
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Nick Piggin on 31 May 2010 11:50

On Mon, May 31, 2010 at 06:30:05PM +0300, Boaz Harrosh wrote:
> On 05/31/2010 06:01 PM, James Bottomley wrote:
> > On Mon, 2010-05-31 at 10:20 -0400, Martin K. Petersen wrote:
> >>>>>>> "Christof" == Christof Schmitt <christof.schmitt(a)de.ibm.com> writes:
> >>
> >> Christof> Since the guard tags are created in Linux, it seems that the
> >> Christof> data attached to the write request changes between the
> >> Christof> generation in bio_integrity_generate and the call to
> >> Christof> sd_prep_fn.
> >>
> >> Yep, known bug. Page writeback locking is messed up for buffer_head
> >> users. The extNfs folks volunteered to look into this a while back but
> >> I don't think they have found the time yet.
> >>
> >>
> >> Christof> Using ext3 or ext4 instead of ext2 does not show the problem.
> >>
> >> Last I looked there were still code paths in ext3 and ext4 that
> >> permitted pages to be changed during flight. I guess you've just been
> >> lucky.
> >
> > Pages have always been modifiable in flight. The OS guarantees they'll
> > be rewritten, so the drivers can drop them if it detects the problem.
> > This is identical to the iscsi checksum issue (iscsi adds a checksum
> > because it doesn't trust TCP/IP and if the checksum is generated in
> > software, there's time between generation and page transmission for the
> > alteration to occur). The solution in the iscsi case was not to
> > complain if the page is still marked dirty.
> >
>
> And also why RAID1 and RAID4/5/6 need the data bounced. I wish VFS
> would prevent data writing given a device queue flag that requests
> it. So all these devices and modes could just flag the VFS/filesystems
> that: "please don't allow concurrent writes, otherwise I need to copy data"
>
> >From what Chris Mason has said before, all the mechanics are there, and it's
> what btrfs is doing. Though I don't know how myself?

The filesystems can do it themselves, they should have everything
required.

Easiest way would be to not unlock page during the writeback, unmap
mmaps before taking the checksum, and using vm_flags to prevent
get_user_pages.

More complex and maybe more performant would be to avoid holding page
lock but wait_on_page_writeback in page-modification (write, fault)
paths. More complex again could opportunistically replace the page
with a duplicate one and allow modification ops to continue from there.

That's all possible by overriding existing callbacks though. I don't
think I would like to put branches and flag dependent locking all
over existing functions.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Boaz Harrosh on 31 May 2010 12:30

On 05/31/2010 06:49 PM, Nick Piggin wrote:
> On Mon, May 31, 2010 at 06:30:05PM +0300, Boaz Harrosh wrote:
>> And also why RAID1 and RAID4/5/6 need the data bounced. I wish VFS
>> would prevent data writing given a device queue flag that requests
>> it. So all these devices and modes could just flag the VFS/filesystems
>> that: "please don't allow concurrent writes, otherwise I need to copy data"
>>
>> >From what Chris Mason has said before, all the mechanics are there, and it's
>> what btrfs is doing. Though I don't know how myself?
>
> The filesystems can do it themselves, they should have everything
> required.
>
> Easiest way would be to not unlock page during the writeback, unmap
> mmaps before taking the checksum, and using vm_flags to prevent
> get_user_pages.
>
> More complex and maybe more performant would be to avoid holding page
> lock but wait_on_page_writeback in page-modification (write, fault)
> paths. More complex again could opportunistically replace the page
> with a duplicate one and allow modification ops to continue from there.
>
> That's all possible by overriding existing callbacks though. I don't
> think I would like to put branches and flag dependent locking all
> over existing functions.
>

Thanks. I'll need to get to this soon enough when doing raid5/6.
At the exofs level at least. This is most valuable information.
I'll keep it in mind. (And I'll also need to do it at NFS which
will be a fight)

I agree that doing it clean at VFS level is match harder. But then
also duplicating code all over is also hard. As it stands RAID copies
data, and iscsi checksums are turned off by distros. And so will DIF.
I guess we need to leave room for the HW vendors, out there ;-)

Boaz
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: FUJITA Tomonori on 31 May 2010 22:50

On Mon, 31 May 2010 10:01:42 -0500
James Bottomley <James.Bottomley(a)suse.de> wrote:

> On Mon, 2010-05-31 at 10:20 -0400, Martin K. Petersen wrote:
> > >>>>> "Christof" == Christof Schmitt <christof.schmitt(a)de.ibm.com> writes:
> >
> > Christof> Since the guard tags are created in Linux, it seems that the
> > Christof> data attached to the write request changes between the
> > Christof> generation in bio_integrity_generate and the call to
> > Christof> sd_prep_fn.
> >
> > Yep, known bug. Page writeback locking is messed up for buffer_head
> > users. The extNfs folks volunteered to look into this a while back but
> > I don't think they have found the time yet.
> >
> >
> > Christof> Using ext3 or ext4 instead of ext2 does not show the problem.
> >
> > Last I looked there were still code paths in ext3 and ext4 that
> > permitted pages to be changed during flight. I guess you've just been
> > lucky.
>
> Pages have always been modifiable in flight. The OS guarantees they'll
> be rewritten, so the drivers can drop them if it detects the problem.
> This is identical to the iscsi checksum issue (iscsi adds a checksum
> because it doesn't trust TCP/IP and if the checksum is generated in
> software, there's time between generation and page transmission for the
> alteration to occur). The solution in the iscsi case was not to
> complain if the page is still marked dirty.

In the iSCSI case, what we do is decreasing chance that pages in
flight could be modified (use sendmsg instead of sendpage) and praying.

If the initiator sends the wrong checksum, the target complains (and
starts the recovery process). The target can't know if the page is
marked dirty or not.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

First | Prev | Next | Last
Pages: 1 2 3 4 5 6 7 8 9 10 11 12
Prev: KVM, Fix QEMU-KVM is killed by guest SRAO MCE (resend)
Next: drivers/watchdog: Eliminate a NULL pointer dereference