From: Christof Schmitt on
On Mon, May 31, 2010 at 06:30:05PM +0300, Boaz Harrosh wrote:
> On 05/31/2010 06:01 PM, James Bottomley wrote:
> > On Mon, 2010-05-31 at 10:20 -0400, Martin K. Petersen wrote:
> >>>>>>> "Christof" == Christof Schmitt <christof.schmitt(a)de.ibm.com> writes:
> >>
> >> Christof> Since the guard tags are created in Linux, it seems that the
> >> Christof> data attached to the write request changes between the
> >> Christof> generation in bio_integrity_generate and the call to
> >> Christof> sd_prep_fn.
> >>
> >> Yep, known bug. Page writeback locking is messed up for buffer_head
> >> users. The extNfs folks volunteered to look into this a while back but
> >> I don't think they have found the time yet.
> >>
> >>
> >> Christof> Using ext3 or ext4 instead of ext2 does not show the problem.
> >>
> >> Last I looked there were still code paths in ext3 and ext4 that
> >> permitted pages to be changed during flight. I guess you've just been
> >> lucky.
> >
> > Pages have always been modifiable in flight. The OS guarantees they'll
> > be rewritten, so the drivers can drop them if it detects the problem.
> > This is identical to the iscsi checksum issue (iscsi adds a checksum
> > because it doesn't trust TCP/IP and if the checksum is generated in
> > software, there's time between generation and page transmission for the
> > alteration to occur). The solution in the iscsi case was not to
> > complain if the page is still marked dirty.
> >
>
> And also why RAID1 and RAID4/5/6 need the data bounced. I wish VFS
> would prevent data writing given a device queue flag that requests
> it. So all these devices and modes could just flag the VFS/filesystems
> that: "please don't allow concurrent writes, otherwise I need to copy data"
>
> From what Chris Mason has said before, all the mechanics are there, and it's
> what btrfs is doing. Though I don't know how myself?

I also tested with btrfs and invalid guard tags in writes have been
encountered as well (again in 2.6.34). The only difference is that no
error was reported to userspace, although this might be a
configuration issue.

What is the best strategy to continue with the invalid guard tags on
write requests? Should this be fixed in the filesystems?

Another idea would be to pass invalid guard tags on write requests
down to the hardware, expect an "invalid guard tag" error and report
it to the block layer where a new checksum is generated and the
request is issued again. Basically implement a retry through the whole
I/O stack. But this also sounds complicated.
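
A minimal user-space sketch of that retry idea, in Python rather than
kernel code. It is a model under stated assumptions, not a proposed
implementation: device_write, submit_with_retry and the
mutate_after_generate hook are hypothetical names, and the "device" is
just a function that recomputes the guard tag. The guard tag is computed
as a CRC-16 with the T10 DIF polynomial 0x8BB7.

```python
def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16 with polynomial 0x8BB7, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


class GuardError(Exception):
    """Raised by the simulated device when data and guard tag disagree."""


def device_write(sector: bytes, guard: int) -> None:
    # Hypothetical device model: verify the guard tag like DIF hardware would.
    if crc16_t10dif(sector) != guard:
        raise GuardError("invalid guard tag")


def submit_with_retry(page: bytearray, mutate_after_generate=None,
                      max_retries: int = 3) -> int:
    """Generate a guard tag, write, and regenerate on a guard error.

    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_retries + 1):
        guard = crc16_t10dif(bytes(page))      # checksum generation step
        if mutate_after_generate is not None and attempt == 1:
            mutate_after_generate(page)        # page changes while in flight
        try:
            device_write(bytes(page), guard)   # device sees the current data
            return attempt
        except GuardError:
            continue                           # new checksum, reissue request
    raise GuardError("page kept changing; giving up")
```

In this model a page that is modified once after checksum generation
fails the first write and succeeds on the second attempt. The retry
bound matters: a page that is rewritten continuously would otherwise
never complete.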

--
Christof Schmitt
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Boaz Harrosh on
On 06/01/2010 01:30 PM, Christof Schmitt wrote:
> On Mon, May 31, 2010 at 06:30:05PM +0300, Boaz Harrosh wrote:
>> On 05/31/2010 06:01 PM, James Bottomley wrote:
>>> On Mon, 2010-05-31 at 10:20 -0400, Martin K. Petersen wrote:
>>>>>>>>> "Christof" == Christof Schmitt <christof.schmitt(a)de.ibm.com> writes:
>>>>
>>>> Christof> Since the guard tags are created in Linux, it seems that the
>>>> Christof> data attached to the write request changes between the
>>>> Christof> generation in bio_integrity_generate and the call to
>>>> Christof> sd_prep_fn.
>>>>
>>>> Yep, known bug. Page writeback locking is messed up for buffer_head
>>>> users. The extNfs folks volunteered to look into this a while back but
>>>> I don't think they have found the time yet.
>>>>
>>>>
>>>> Christof> Using ext3 or ext4 instead of ext2 does not show the problem.
>>>>
>>>> Last I looked there were still code paths in ext3 and ext4 that
>>>> permitted pages to be changed during flight. I guess you've just been
>>>> lucky.
>>>
>>> Pages have always been modifiable in flight. The OS guarantees they'll
>>> be rewritten, so the drivers can drop them if it detects the problem.
>>> This is identical to the iscsi checksum issue (iscsi adds a checksum
>>> because it doesn't trust TCP/IP and if the checksum is generated in
>>> software, there's time between generation and page transmission for the
>>> alteration to occur). The solution in the iscsi case was not to
>>> complain if the page is still marked dirty.
>>>
>>
>> And also why RAID1 and RAID4/5/6 need the data bounced. I wish VFS
>> would prevent data writing given a device queue flag that requests
>> it. So all these devices and modes could just flag the VFS/filesystems
>> that: "please don't allow concurrent writes, otherwise I need to copy data"
>>
>> From what Chris Mason has said before, all the mechanics are there, and it's
>> what btrfs is doing. Though I don't know how myself?
>
> I also tested with btrfs and invalid guard tags in writes have been
> encountered as well (again in 2.6.34). The only difference is that no
> error was reported to userspace, although this might be a
> configuration issue.
>

I think in btrfs you need a raid1/5 multi-device configuration for this
to work. If you use a single device then it is just like ext4.

BTW: you could use DM or MD and it will guard your DIF by copying the
data before IO.
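
A small Python sketch of why copying (bouncing) the data before IO
protects the checksum. This is a user-space illustration only:
submit_in_place, submit_bounced and verify are hypothetical names, and
zlib.crc32 stands in for the real guard tag calculation.

```python
import zlib


def submit_in_place(page: bytearray):
    """Checksum the live page and hand the same buffer to the 'device'."""
    checksum = zlib.crc32(page)
    return page, checksum          # payload still aliases the caller's page


def submit_bounced(page: bytearray):
    """Copy the page first, then checksum and submit the private copy."""
    bounce = bytes(page)           # the extra copy is the cost of bouncing
    return bounce, zlib.crc32(bounce)


def verify(payload, checksum: int) -> bool:
    """What the receiving side does: recompute and compare."""
    return zlib.crc32(payload) == checksum
```

With submit_in_place, a write to the page after submission changes the
payload under the checksum, so verify fails. With submit_bounced the
payload is a private copy, so later writes to the page cannot
invalidate it, at the price of one memory copy per request.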

> What is the best strategy to continue with the invalid guard tags on
> write requests? Should this be fixed in the filesystems?
>
> Another idea would be to pass invalid guard tags on write requests
> down to the hardware, expect an "invalid guard tag" error and report
> it to the block layer where a new checksum is generated and the
> request is issued again. Basically implement a retry through the whole
> I/O stack. But this also sounds complicated.
>

I suggest we talk about this issue at the upcoming LSF, because it
affects not only DIF but any checksum subsystem. It could also improve
Linux RAID performance.

> --
> Christof Schmitt

Boaz
From: Chris Mason on
On Tue, Jun 01, 2010 at 12:30:42PM +0200, Christof Schmitt wrote:
> On Mon, May 31, 2010 at 06:30:05PM +0300, Boaz Harrosh wrote:
> > On 05/31/2010 06:01 PM, James Bottomley wrote:
> > > On Mon, 2010-05-31 at 10:20 -0400, Martin K. Petersen wrote:
> > >>>>>>> "Christof" == Christof Schmitt <christof.schmitt(a)de.ibm.com> writes:
> > >>
> > >> Christof> Since the guard tags are created in Linux, it seems that the
> > >> Christof> data attached to the write request changes between the
> > >> Christof> generation in bio_integrity_generate and the call to
> > >> Christof> sd_prep_fn.
> > >>
> > >> Yep, known bug. Page writeback locking is messed up for buffer_head
> > >> users. The extNfs folks volunteered to look into this a while back but
> > >> I don't think they have found the time yet.
> > >>
> > >>
> > >> Christof> Using ext3 or ext4 instead of ext2 does not show the problem.
> > >>
> > >> Last I looked there were still code paths in ext3 and ext4 that
> > >> permitted pages to be changed during flight. I guess you've just been
> > >> lucky.
> > >
> > > Pages have always been modifiable in flight. The OS guarantees they'll
> > > be rewritten, so the drivers can drop them if it detects the problem.
> > > This is identical to the iscsi checksum issue (iscsi adds a checksum
> > > because it doesn't trust TCP/IP and if the checksum is generated in
> > > software, there's time between generation and page transmission for the
> > > alteration to occur). The solution in the iscsi case was not to
> > > complain if the page is still marked dirty.
> > >
> >
> > And also why RAID1 and RAID4/5/6 need the data bounced. I wish VFS
> > would prevent data writing given a device queue flag that requests
> > it. So all these devices and modes could just flag the VFS/filesystems
> > that: "please don't allow concurrent writes, otherwise I need to copy data"
> >
> > From what Chris Mason has said before, all the mechanics are there, and it's
> > what btrfs is doing. Though I don't know how myself?
>
> I also tested with btrfs and invalid guard tags in writes have been
> encountered as well (again in 2.6.34). The only difference is that no
> error was reported to userspace, although this might be a
> configuration issue.

This would be a btrfs bug. We have strict checks in place that are
supposed to prevent buffers changing while in flight. What was the
workload that triggered this problem?

>
> What is the best strategy to continue with the invalid guard tags on
> write requests? Should this be fixed in the filesystems?
>

Long term, I think the filesystems shouldn't be changing pages in
flight. Bouncing just hurts way too much.

-chris
From: Martin K. Petersen on
>>>>> "Christof" == Christof Schmitt <christof.schmitt(a)de.ibm.com> writes:

>> Yep, known bug. Page writeback locking is messed up for buffer_head
>> users. The extNfs folks volunteered to look into this a while back
>> but I don't think they have found the time yet.

Christof> Thanks for the info. This means that this bug appears with all
Christof> filesystems?

XFS and btrfs should be fine.

--
Martin K. Petersen Oracle Linux Engineering
From: Martin K. Petersen on
>>>>> "Nick" == Nick Piggin <npiggin(a)suse.de> writes:

Nick> More complex and maybe more performant would be to avoid holding
Nick> page lock but wait_on_page_writeback in page-modification (write,
Nick> fault) paths.

That's what I was doing last I looked at this. I seem to recall that my
head exploded once I added buffer_heads to the mix. And then the extfs
folks promised to take a look so I didn't mess more with it.
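
The approach Nick describes can be modelled in user space as follows.
This is only a sketch in Python, not kernel code: the Page class and
its methods are hypothetical, and a threading.Condition stands in for
the page writeback state. Modification paths block until writeback has
finished, so the data cannot change between checksum generation and
completion of the IO.

```python
import threading


class Page:
    """Toy page whose modification path waits for writeback to finish."""

    def __init__(self, size: int = 512):
        self.data = bytearray(size)
        self._writeback = False
        self._cond = threading.Condition()

    def start_writeback(self) -> None:
        # Called when the page is submitted for IO.
        with self._cond:
            self._writeback = True

    def end_writeback(self) -> None:
        # Called on IO completion; wakes up any waiting modifiers.
        with self._cond:
            self._writeback = False
            self._cond.notify_all()

    def modify(self, offset: int, value: int) -> None:
        # Write/fault path: wait until writeback is over, then modify.
        # The store happens under the same lock so that writeback cannot
        # start between the wait and the modification in this model.
        with self._cond:
            while self._writeback:
                self._cond.wait()
            self.data[offset] = value
```

A thread calling modify() while writeback is in progress sleeps until
end_writeback() is called, which is the latency cost of this scheme
compared with bouncing.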

--
Martin K. Petersen Oracle Linux Engineering