From: Christof Schmitt on
On Mon, Jun 07, 2010 at 01:40:21PM -0400, Martin K. Petersen wrote:
> >>>>> "Boaz" == Boaz Harrosh <bharrosh(a)panasas.com> writes:
>
> Boaz> Do you remember some performance numbers that show degradation /
> Boaz> sameness?
>
> Boaz> What type of work loads?
>
> I haven't been using XFS much for over a year. I'm using an internal
> async I/O tool and btrfs for most of my DIX/DIF testing these days.
>
> But my original changes were along the lines of what Jan mentioned
> earlier (hooking into page_mkwrite and waiting for writeback. I could
> have sworn that I only did it for ext[23] and that XFS waited out of the
> box but git proves me wrong). Anyway, I'll try to get some benchmarking
> happening later this week.

Is there a patch with this change available somewhere? It might be
useful to patch a kernel with this XFS change for reliable DIF/DIX
testing.

> This won't fix things completely, though. ext2fs, for instance,
> frequently changes metadata buffers in flight so it trips the guard
> check in no time.
>
> --
> Martin K. Petersen Oracle Linux Engineering
From: Christof Schmitt on
On Tue, Jun 01, 2010 at 09:58:18AM -0400, Chris Mason wrote:
> On Tue, Jun 01, 2010 at 03:50:59PM +0200, Christof Schmitt wrote:
> > On Tue, Jun 01, 2010 at 09:03:25AM -0400, Chris Mason wrote:
> > > On Tue, Jun 01, 2010 at 12:30:42PM +0200, Christof Schmitt wrote:
> > > > On Mon, May 31, 2010 at 06:30:05PM +0300, Boaz Harrosh wrote:
> > > > > On 05/31/2010 06:01 PM, James Bottomley wrote:
> > > > > > On Mon, 2010-05-31 at 10:20 -0400, Martin K. Petersen wrote:
> > > > > >>>>>>> "Christof" == Christof Schmitt <christof.schmitt(a)de.ibm.com> writes:
> > > > > >>
> > > > > >> Christof> Since the guard tags are created in Linux, it seems that the
> > > > > >> Christof> data attached to the write request changes between the
> > > > > >> Christof> generation in bio_integrity_generate and the call to
> > > > > >> Christof> sd_prep_fn.
> > > > > >>
> > > > > >> Yep, known bug. Page writeback locking is messed up for buffer_head
> > > > > >> users. The extNfs folks volunteered to look into this a while back but
> > > > > >> I don't think they have found the time yet.
> > > > > >>
> > > > > >>
> > > > > >> Christof> Using ext3 or ext4 instead of ext2 does not show the problem.
> > > > > >>
> > > > > >> Last I looked there were still code paths in ext3 and ext4 that
> > > > > >> permitted pages to be changed during flight. I guess you've just been
> > > > > >> lucky.
> > > > > >
> > > > > > Pages have always been modifiable in flight. The OS guarantees they'll
> > > > > > be rewritten, so the driver can drop them if it detects the problem.
> > > > > > This is identical to the iSCSI checksum issue (iSCSI adds a checksum
> > > > > > because it doesn't trust TCP/IP, and if the checksum is generated in
> > > > > > software, there's time between generation and page transmission for the
> > > > > > alteration to occur). The solution in the iSCSI case was not to
> > > > > > complain if the page is still marked dirty.
> > > > > >
> > > > >
> > > > > This is also why RAID1 and RAID4/5/6 need the data bounced. I wish the
> > > > > VFS would prevent writes to in-flight data when a device queue flag
> > > > > requests it, so all these devices and modes could just tell the
> > > > > VFS/filesystems: "please don't allow concurrent writes, otherwise I
> > > > > need to copy data"
> > > > >
> > > > > From what Chris Mason has said before, all the mechanics are there, and
> > > > > it's what btrfs is doing, though I don't know how myself.
> > > >
> > > > I also tested with btrfs, and invalid guard tags were encountered in
> > > > writes as well (again in 2.6.34). The only difference is that no
> > > > error was reported to userspace, although this might be a
> > > > configuration issue.
> > >
> > > This would be a btrfs bug. We have strict checks in place that are
> > > supposed to prevent buffers changing while in flight. What was the
> > > workload that triggered this problem?
> >
> > I am running an internal test tool that creates files with a known
> > pattern until the disk is full, reads the data to verify if the
> > pattern is still intact, removes the files and starts over.
>
> Ok, is the lba in the output the sector offset? We can map that to a
> btrfs block and figure out what it was.
>
> Btrfs never complains about the IO error? We really should explode.

Right now, I don't have a system available to continue with this. When
I have a chance to run it again, I can try the latest rc kernel to see
if there is any difference.
From: Dave Chinner on
On Tue, Jun 08, 2010 at 09:15:35AM +0200, Christof Schmitt wrote:
> On Mon, Jun 07, 2010 at 01:40:21PM -0400, Martin K. Petersen wrote:
> > >>>>> "Boaz" == Boaz Harrosh <bharrosh(a)panasas.com> writes:
> >
> > Boaz> Do you remember some performance numbers that show degradation /
> > Boaz> sameness?
> >
> > Boaz> What type of work loads?
> >
> > I haven't been using XFS much for over a year. I'm using an internal
> > async I/O tool and btrfs for most of my DIX/DIF testing these days.
> >
> > But my original changes were along the lines of what Jan mentioned
> > earlier (hooking into page_mkwrite and waiting for writeback. I could
> > have sworn that I only did it for ext[23] and that XFS waited out of the
> > box but git proves me wrong). Anyway, I'll try to get some benchmarking
> > happening later this week.
>
> Is there a patch with this change available somewhere? It might be
> useful to patch a kernel with this XFS change for reliable DIF/DIX
> testing.

Adding a wait_on_page_writeback() call to page_mkwrite() won't help
by itself - you need to unmap the pages after transitioning them
from dirty to writeback but before the hardware starts processing
the IO if you want to lock out mmap writes this way....
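
Roughly, the ->page_mkwrite half would look something like the sketch
below. This is only an illustration - "myfs" is a made-up filesystem,
error handling is omitted, and as I said it only closes the race if the
writeback path has also write-protected the page's mappings first:

#include <linux/mm.h>
#include <linux/fs.h>
#include <linux/pagemap.h>

static int myfs_page_mkwrite(struct vm_area_struct *vma,
                             struct vm_fault *vmf)
{
        struct page *page = vmf->page;

        lock_page(page);
        /* Raced with truncate/invalidation while blocked on the lock? */
        if (page->mapping != vma->vm_file->f_mapping) {
                unlock_page(page);
                return VM_FAULT_NOPAGE;
        }

        /* Block the mmap writer until any write in flight (and the
         * guard tags computed over its data) is done with this page. */
        wait_on_page_writeback(page);

        /* Page stays locked; the fault code makes the pte writable. */
        return VM_FAULT_LOCKED;
}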

Cheers,

Dave.
--
Dave Chinner
david(a)fromorbit.com
From: Nick Piggin on
On Tue, Jun 08, 2010 at 06:47:18PM +1000, Dave Chinner wrote:
> On Tue, Jun 08, 2010 at 09:15:35AM +0200, Christof Schmitt wrote:
> > > But my original changes were along the lines of what Jan mentioned
> > > earlier (hooking into page_mkwrite and waiting for writeback. I could
> > > have sworn that I only did it for ext[23] and that XFS waited out of the
> > > box but git proves me wrong). Anyway, I'll try to get some benchmarking
> > > happening later this week.
> >
> > Is there a patch with this change available somewhere? It might be
> > useful to patch a kernel with this XFS change for reliable DIF/DIX
> > testing.
>
> Adding a wait_on_page_writeback() call to page_mkwrite() won't help
> by itself - you need to unmap the pages after transitioning them
> from dirty to writeback but before the hardware starts processing
> the IO if you want to lock out mmap writes this way....

Actually, as Jan pointed out, clear_page_dirty_for_io already does
this, so it seems pretty doable. get_user_pages remains a problem
(which I believe NFS just ignores), but it could be avoided by
just disallowing it.
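
In other words the writeback path already provides the write-protect
half. A rough sketch of the dirty -> writeback transition, illustrative
only and not lifted from any particular filesystem (the page is assumed
locked and the bio already built):

#include <linux/mm.h>
#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/bio.h>

static void example_write_one_page(struct page *page, struct bio *bio)
{
        if (clear_page_dirty_for_io(page)) {
                /* clear_page_dirty_for_io() called page_mkclean(),
                 * which write-protected every pte mapping the page,
                 * so any later mmap store faults into ->page_mkwrite()
                 * where it can wait_on_page_writeback(). */
                set_page_writeback(page);
                unlock_page(page);
                /* Data is stable until end_page_writeback(), so guard
                 * tags generated over it remain valid. */
                submit_bio(WRITE, bio);
        } else {
                unlock_page(page);
        }
}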
From: Vladislav Bolkhovitin on
Vladislav Bolkhovitin, on 06/03/2010 04:46 PM wrote:
>
> Vladislav Bolkhovitin, on 06/03/2010 04:41 PM wrote:
>> Boaz Harrosh, on 06/03/2010 04:07 PM wrote:
>>> On 06/03/2010 02:20 PM, Vladislav Bolkhovitin wrote:
>>>> There's one interesting problem here, at least theoretically, with SCSI
>>>> or similar transports which allow a command queue depth >1 and are
>>>> allowed to internally reorder queued requests. I don't know the FS/block
>>>> layers sufficiently well to tell whether sending several requests for the
>>>> same page is really possible or not, but we can see a real-life problem
>>>> which is well explained if it is.
>>>>
>>>> The problem arises if the second (rewrite) request (SCSI command) for
>>>> the same page is queued to the corresponding device before the original
>>>> request finishes. Since the device is allowed to freely reorder requests,
>>>> there's a probability that the original write request would hit the
>>>> permanent storage *AFTER* the retry request, hence the data changes the
>>>> retry is carrying would be lost - welcome, data corruption.
>>>>
>>> I might be totally wrong here, but I think NCQ can reorder sectors, not
>>> writes. That is, if the sector is cached in device memory and a later
>>> write comes to modify the same sector, then the original should be
>>> replaced, not two values of the same sector kept in the device cache at
>>> the same time.
>>>
>>> Failing to do so is a SCSI device problem.
>> SCSI devices supporting the Full task management model (almost all) and
>> having the QUEUE ALGORITHM MODIFIER bits in the Control mode page set to
>> 1 are allowed to freely reorder any commands with the SIMPLE task
>> attribute. If an application wants to maintain the order of some
>> commands for such devices, it must issue them with the ORDERED task
>> attribute and over a _single_ MPIO path to the device.
>>
>> Linux neither uses the ORDERED attribute, nor honors or enforces the
>> QUEUE ALGORITHM MODIFIER bits in any way, nor takes care to send commands
>> with order dependencies (overlapping writes in our case) over a single
>> MPIO path.
>>
>>> Please note that the page-to-sector mapping is not necessarily constant,
>>> and the same page might get written to a different sector next time. But
>>> FSs will have to barrier in that case.
>>>
>>>> For single parallel SCSI or SAS devices such a race may look practically
>>>> impossible, but for sophisticated clusters where many nodes pretend to
>>>> be a single SCSI device in a load-balancing configuration, it becomes
>>>> very real.
>>>>
>>>> We can see the real-life problem in an active-active DRBD setup. In this
>>>> configuration 2 nodes act as a single SCST-powered SCSI device, and they
>>>> both run DRBD to keep their backing storage in sync. The initiator uses
>>>> them as a single multipath device in an active-active round-robin
>>>> load-balancing configuration, i.e. it sends requests to both nodes in
>>>> parallel, and DRBD takes care of replicating the requests to the other
>>>> node.
>>>>
>>>> The problem is that sometimes DRBD complains about concurrent local
>>>> writes, like:
>>>>
>>>> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected!
>>>> [DISCARD L] new: 144072784s +8192; pending: 144072784s +8192
>>>>
>>>> This message means that DRBD detected that both nodes received
>>>> overlapping writes on the same block(s) and DRBD can't figure out which
>>>> one to store. This is possible only if the initiator sent the second
>>>> write request before the first one completed.
>>> It is totally possible in today's code.
>>>
>>> DRBD should store the original command_sn of the write and discard
>>> the sector with the lower SN. It should appear as a single device
>>> to the initiator.
>> How can it find the SN? The commands were sent over _different_ MPIO
>> paths to the device, so at the moment of sending, all the order
>> information was lost.
>>
>> Until SCSI generally allows preserving ordering information between
>> MPIO paths, in such configurations the only way to maintain command
>> order is queue draining. Hence, for safety, all initiators working
>> with such devices must do it.
>>
>> But it looks like Linux doesn't do it, so is it unsafe with MPIO
>> clusters?
>
> I meant load balancing MPIO clusters.

Actually, if we consider the processing of exception conditions like
Task Set Full status or Unit Attentions, queuing several write commands
for the same page(s) is not safe for all other MPIO clusters either, nor
for single-path SCSI-transport devices, like regular HDDs or other
SAS/FC/iSCSI/... storage.

This is because in case of exception conditions the first write command
could be finished prematurely to deliver the exception condition status
to the initiator, while the commands queued after it would be neither
aborted nor suspended. So, after the retry, the first command can be
queued _after_ the second write command; hence they would be executed in
the reverse order, with the resulting data corruption.

To prevent such things, the SCSI standard provides the ACA and UA
interlock facilities, but Linux doesn't use them.

Thus, to be safe, Linux should:

1. Either not write to pages under IO, and hence not queue such retries,

2. Or queue retries only after the original write has finished (sketched
below).
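
E.g., a rough sketch of option (2) at the page level - purely
hypothetical code to illustrate the idea, not from any existing driver
or filesystem:

#include <linux/mm.h>
#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/bio.h>

static void resubmit_page_write(struct page *page, struct bio *retry_bio)
{
        lock_page(page);
        /* Drain: the original write for these blocks must complete
         * before another write for them is queued, so the device can
         * never reorder the two overlapping commands. */
        wait_on_page_writeback(page);
        set_page_writeback(page);
        unlock_page(page);
        submit_bio(WRITE, retry_bio);
}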

Vlad