From: Gennadiy Nerubayev on
On Thu, Jun 3, 2010 at 7:20 AM, Vladislav Bolkhovitin <vst(a)vlnb.net> wrote:
>
> James Bottomley, on 06/01/2010 05:27 PM wrote:
>>
>> On Tue, 2010-06-01 at 12:30 +0200, Christof Schmitt wrote:
>>>
>>> What is the best strategy to continue with the invalid guard tags on
>>> write requests? Should this be fixed in the filesystems?
>>
>> For write requests, as long as the page dirty bit is still set, it's
>> safe to drop the request, since it's already going to be repeated. What
>> we probably want is an error code we can return so that the layer that
>> sees both the request and the page flags can make the call.
>>
>>> Another idea would be to pass invalid guard tags on write requests
>>> down to the hardware, expect an "invalid guard tag" error and report
>>> it to the block layer where a new checksum is generated and the
>>> request is issued again. Basically implement a retry through the whole
>>> I/O stack. But this also sounds complicated.
>>
>> No, no ... as long as the guard tag is wrong because the fs changed the
>> page, the write request for the updated page will already be queued or
>> in-flight, so there's no need to retry.
>
> There's one interesting problem here, at least theoretically, with SCSI or similar transports which allow command queue depths >1 and are allowed to internally reorder queued requests. I don't know the FS/block layers well enough to tell whether sending several requests for the same page is really possible or not, but we can see a real-life problem which is well explained if it is.
>
> The problem arises if the second (rewrite) request (SCSI command) for the same page is queued to the corresponding device before the original request has finished. Since the device is allowed to freely reorder requests, there's a probability that the original write request hits permanent storage *AFTER* the retry request, so the data changes the retry is carrying would be lost. Hence, welcome, data corruption.
>
> For single parallel SCSI or SAS devices such a race may look practically impossible, but for sophisticated clusters where many nodes pretend to be a single SCSI device in a load-balancing configuration, it becomes very real.
>
> The real-life problem can be seen in an active-active DRBD setup. In this configuration 2 nodes act as a single SCST-powered SCSI device and both run DRBD to keep their backing storage in sync. The initiator uses them as a single multipath device in an active-active round-robin load-balancing configuration, i.e. it sends requests to both nodes in parallel, and DRBD then takes care of replicating the requests to the other node.
>
> The problem is that sometimes DRBD complains about concurrent local writes, like:
>
> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected! [DISCARD L] new: 144072784s +8192; pending: 144072784s +8192
>
> This message means that DRBD detected that both nodes received overlapping writes on the same block(s) and DRBD can't figure out which one to store. This is possible only if the initiator sent the second write request before the first one completed.
>
> The topic of this discussion could well explain the cause of that. But, unfortunately, the people who reported it forgot to note which OS they run on the initiator, so I can't say for sure it's Linux.

Sorry for the late chime in, but here's some more information of
potential interest, as I've previously inquired about this on the drbd
mailing list:

1. It only happens when using blockio mode in IET or SCST. Fileio,
nv_cache, and write_through do not generate the warnings.
2. It happens on active/passive drbd clusters (on the active node
obviously), NOT active/active. In fact, I've found that doing round
robin on active/active is a Bad Idea (tm) even with a clustered
filesystem, until at least the target software is able to synchronize
the command state of either node.
3. Linux and ESX initiators can generate the warning, but I've so far
only been able to reliably reproduce it using a Windows initiator and
sqlio or iometer benchmarks. I'll be trying again using iometer when I
have the time.
4. It only happens using a random write io workload (any block size),
with initiator threads >1, OR initiator queue depth >1. The higher
either of those is, the more spammy the warnings become.
5. The transport does not matter (reproduced with iSCSI and SRP)
6. If DRBD is disconnected (primary/unknown), the warnings are not
generated. As soon as it's reconnected (primary/secondary), the
warnings will reappear.

(sorry for the duplicate, forgot to plaintext)

-Gennadiy
From: Vladislav Bolkhovitin on
Gennadiy Nerubayev, on 07/23/2010 09:59 PM wrote:
> On Thu, Jun 3, 2010 at 7:20 AM, Vladislav Bolkhovitin<vst(a)vlnb.net> wrote:
>>
>> James Bottomley, on 06/01/2010 05:27 PM wrote:
>>>
>>> On Tue, 2010-06-01 at 12:30 +0200, Christof Schmitt wrote:
>>>>
>>>> What is the best strategy to continue with the invalid guard tags on
>>>> write requests? Should this be fixed in the filesystems?
>>>
>>> For write requests, as long as the page dirty bit is still set, it's
>>> safe to drop the request, since it's already going to be repeated. What
>>> we probably want is an error code we can return so that the layer that
>>> sees both the request and the page flags can make the call.
>>>
>>>> Another idea would be to pass invalid guard tags on write requests
>>>> down to the hardware, expect an "invalid guard tag" error and report
>>>> it to the block layer where a new checksum is generated and the
>>>> request is issued again. Basically implement a retry through the whole
>>>> I/O stack. But this also sounds complicated.
>>>
>>> No, no ... as long as the guard tag is wrong because the fs changed the
>>> page, the write request for the updated page will already be queued or
>>> in-flight, so there's no need to retry.
>>
>> There's one interesting problem here, at least theoretically, with SCSI or similar transports which allow command queue depths >1 and are allowed to internally reorder queued requests. I don't know the FS/block layers well enough to tell whether sending several requests for the same page is really possible or not, but we can see a real-life problem which is well explained if it is.
>>
>> The problem arises if the second (rewrite) request (SCSI command) for the same page is queued to the corresponding device before the original request has finished. Since the device is allowed to freely reorder requests, there's a probability that the original write request hits permanent storage *AFTER* the retry request, so the data changes the retry is carrying would be lost. Hence, welcome, data corruption.
>>
>> For single parallel SCSI or SAS devices such a race may look practically impossible, but for sophisticated clusters where many nodes pretend to be a single SCSI device in a load-balancing configuration, it becomes very real.
>>
>> The real-life problem can be seen in an active-active DRBD setup. In this configuration 2 nodes act as a single SCST-powered SCSI device and both run DRBD to keep their backing storage in sync. The initiator uses them as a single multipath device in an active-active round-robin load-balancing configuration, i.e. it sends requests to both nodes in parallel, and DRBD then takes care of replicating the requests to the other node.
>>
>> The problem is that sometimes DRBD complains about concurrent local writes, like:
>>
>> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected! [DISCARD L] new: 144072784s +8192; pending: 144072784s +8192
>>
>> This message means that DRBD detected that both nodes received overlapping writes on the same block(s) and DRBD can't figure out which one to store. This is possible only if the initiator sent the second write request before the first one completed.
>>
>> The topic of this discussion could well explain the cause of that. But, unfortunately, the people who reported it forgot to note which OS they run on the initiator, so I can't say for sure it's Linux.
>
> Sorry for the late chime in, but here's some more information of
> potential interest, as I've previously inquired about this on the drbd
> mailing list:
>
> 1. It only happens when using blockio mode in IET or SCST. Fileio,
> nv_cache, and write_through do not generate the warnings.

Some explanations for those who are not familiar with the terminology:

- "Fileio" means the Linux I/O stack on the target receives I/O via
vfs_readv()/vfs_writev().

- "NV_CACHE" means all cache synchronization requests
(SYNCHRONIZE_CACHE, FUA) from the initiator are ignored.

- "WRITE_THROUGH" means write through, i.e. the corresponding backend
file for the device is opened with the O_SYNC flag.
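
Roughly, in user-space terms, the difference looks like this (an
illustrative sketch only, not SCST's actual code; the backend path below
is made up):

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Illustrative analogue of the three modes above; not SCST code. */
static void service_write(int fd, void *data, size_t len, off_t off,
			  int nv_cache)
{
	struct iovec iov = { .iov_base = data, .iov_len = len };

	/* FILEIO: the write goes through the page cache, as with vfs_writev() */
	pwritev(fd, &iov, 1, off);

	/* A SYNCHRONIZE_CACHE/FUA from the initiator would normally force a
	 * flush; with NV_CACHE it is simply ignored. */
	if (!nv_cache)
		fdatasync(fd);
}

int main(void)
{
	char buf[4096];
	/* WRITE_THROUGH: the backend file is opened with O_SYNC, so every
	 * write reaches stable storage before it completes. */
	int fd_wt = open("/path/to/backend", O_RDWR | O_SYNC);
	/* FILEIO write-back: the same open, without O_SYNC. */
	int fd_wb = open("/path/to/backend", O_RDWR);

	memset(buf, 0, sizeof(buf));
	service_write(fd_wb, buf, sizeof(buf), 0, 0); /* flushed on sync request */
	service_write(fd_wb, buf, sizeof(buf), 0, 1); /* NV_CACHE: flush skipped */
	service_write(fd_wt, buf, sizeof(buf), 0, 1); /* O_SYNC: already durable */

	close(fd_wb);
	close(fd_wt);
	return 0;
}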

> 2. It happens on active/passive drbd clusters (on the active node
> obviously), NOT active/active. In fact, I've found that doing round
> robin on active/active is a Bad Idea (tm) even with a clustered
> filesystem, until at least the target software is able to synchronize
> the command state of either node.
> 3. Linux and ESX initiators can generate the warning, but I've so far
> only been able to reliably reproduce it using a Windows initiator and
> sqlio or iometer benchmarks. I'll be trying again using iometer when I
> have the time.
> 4. It only happens using a random write io workload (any block size),
> with initiator threads>1, OR initiator queue depth>1. The higher
> either of those is, the more spammy the warnings become.
> 5. The transport does not matter (reproduced with iSCSI and SRP)
> 6. If DRBD is disconnected (primary/unknown), the warnings are not
> generated. As soon as it's reconnected (primary/secondary), the
> warnings will reappear.

It would be great if you could prove or disprove our suspicion that Linux
can produce several write requests for the same blocks simultaneously. To
be sure, we need:

1. The initiator to be Linux. Windows and ESX are not needed for this
particular case.

2. If you are able to reproduce it, a full description of which
application was used on the initiator to generate the load and in
which mode.

The target and DRBD configuration doesn't matter; you can use any.

Thanks,
Vlad
From: Gennadiy Nerubayev on
On Fri, Jul 23, 2010 at 3:16 PM, Vladislav Bolkhovitin <vst(a)vlnb.net> wrote:
> Gennadiy Nerubayev, on 07/23/2010 09:59 PM wrote:
>>
>> On Thu, Jun 3, 2010 at 7:20 AM, Vladislav Bolkhovitin<vst(a)vlnb.net>
>> wrote:
>>>
>>> James Bottomley, on 06/01/2010 05:27 PM wrote:
>>>>
>>>> On Tue, 2010-06-01 at 12:30 +0200, Christof Schmitt wrote:
>>>>>
>>>>> What is the best strategy to continue with the invalid guard tags on
>>>>> write requests? Should this be fixed in the filesystems?
>>>>
>>>> For write requests, as long as the page dirty bit is still set, it's
>>>> safe to drop the request, since it's already going to be repeated. What
>>>> we probably want is an error code we can return so that the layer that
>>>> sees both the request and the page flags can make the call.
>>>>
>>>>> Another idea would be to pass invalid guard tags on write requests
>>>>> down to the hardware, expect an "invalid guard tag" error and report
>>>>> it to the block layer where a new checksum is generated and the
>>>>> request is issued again. Basically implement a retry through the whole
>>>>> I/O stack. But this also sounds complicated.
>>>>
>>>> No, no ... as long as the guard tag is wrong because the fs changed the
>>>> page, the write request for the updated page will already be queued or
>>>> in-flight, so there's no need to retry.
>>>
>>> There's one interesting problem here, at least theoretically, with SCSI
>>> or similar transports which allow command queue depths >1 and are allowed
>>> to internally reorder queued requests. I don't know the FS/block layers
>>> well enough to tell whether sending several requests for the same page is
>>> really possible or not, but we can see a real-life problem which is well
>>> explained if it is.
>>>
>>> The problem arises if the second (rewrite) request (SCSI command) for
>>> the same page is queued to the corresponding device before the original
>>> request has finished. Since the device is allowed to freely reorder
>>> requests, there's a probability that the original write request hits
>>> permanent storage *AFTER* the retry request, so the data changes the
>>> retry is carrying would be lost. Hence, welcome, data corruption.
>>>
>>> For single parallel SCSI or SAS devices such a race may look practically
>>> impossible, but for sophisticated clusters where many nodes pretend to be
>>> a single SCSI device in a load-balancing configuration, it becomes very
>>> real.
>>>
>>> The real-life problem can be seen in an active-active DRBD setup. In this
>>> configuration 2 nodes act as a single SCST-powered SCSI device and both
>>> run DRBD to keep their backing storage in sync. The initiator uses them as
>>> a single multipath device in an active-active round-robin load-balancing
>>> configuration, i.e. it sends requests to both nodes in parallel, and DRBD
>>> then takes care of replicating the requests to the other node.
>>>
>>> The problem is that sometimes DRBD complains about concurrent local
>>> writes, like:
>>>
>>> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected! [DISCARD
>>> L] new: 144072784s +8192; pending: 144072784s +8192
>>>
>>> This message means that DRBD detected that both nodes received
>>> overlapping writes on the same block(s) and DRBD can't figure out which one
>>> to store. This is possible only if the initiator sent the second write
>>> request before the first one completed.
>>>
>>> The topic of this discussion could well explain the cause of that. But,
>>> unfortunately, the people who reported it forgot to note which OS they run
>>> on the initiator, so I can't say for sure it's Linux.
>>
>> Sorry for the late chime in, but here's some more information of
>> potential interest, as I've previously inquired about this on the drbd
>> mailing list:
>>
>> 1. It only happens when using blockio mode in IET or SCST. Fileio,
>> nv_cache, and write_through do not generate the warnings.
>
> Some explanations for those who are not familiar with the terminology:
>
> - "Fileio" means the Linux I/O stack on the target receives I/O via
> vfs_readv()/vfs_writev().
>
> - "NV_CACHE" means all cache synchronization requests
> (SYNCHRONIZE_CACHE, FUA) from the initiator are ignored.
>
> - "WRITE_THROUGH" means write through, i.e. the corresponding backend
> file for the device is opened with the O_SYNC flag.
>
>> 2. It happens on active/passive drbd clusters (on the active node
>> obviously), NOT active/active. In fact, I've found that doing round
>> robin on active/active is a Bad Idea (tm) even with a clustered
>> filesystem, until at least the target software is able to synchronize
>> the command state of either node.
>> 3. Linux and ESX initiators can generate the warning, but I've so far
>> only been able to reliably reproduce it using a Windows initiator and
>> sqlio or iometer benchmarks. I'll be trying again using iometer when I
>> have the time.
>> 4. It only happens using a random write io workload (any block size),
>> with initiator threads>1, OR initiator queue depth>1. The higher
>> either of those is, the more spammy the warnings become.
>> 5. The transport does not matter (reproduced with iSCSI and SRP)
>> 6. If DRBD is disconnected (primary/unknown), the warnings are not
>> generated. As soon as it's reconnected (primary/secondary), the
>> warnings will reappear.
>
> It would be great if you could prove or disprove our suspicion that Linux
> can produce several write requests for the same blocks simultaneously. To
> be sure, we need:
>
> 1. The initiator to be Linux. Windows and ESX are not needed for this
> particular case.
>
> 2. If you are able to reproduce it, a full description of which
> application was used on the initiator to generate the load and in which
> mode.
>
> The target and DRBD configuration doesn't matter; you can use any.

I just tried, and this particular DRBD warning is not reproducible
with I/O (iometer) coming from a Linux initiator (2.6.30.10). The same
iometer parameters were used as on Windows, and both the base device
and a filesystem (ext3) were tested; both were negative. I'll try a
few more tests, but it seems that this is a non-issue with a Linux
initiator.

Hope that helps,

-Gennadiy
From: Dave Chinner on
On Fri, Jul 23, 2010 at 11:16:33PM +0400, Vladislav Bolkhovitin wrote:
> It would be great if you could prove or disprove our suspicion that Linux
> can produce several write requests for the same blocks simultaneously. To
> be sure, we need:

Just use direct IO. Case in point is the concurrent sub-block
AIO-DIO data corruption we're chasing on XFS and ext4 at the moment
where we have two concurrent unaligned write IOs to the same
filesystem block:

http://oss.sgi.com/archives/xfs/2010-07/msg00278.html
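
For reference, the user-space shape of that case is two sub-block AIO +
O_DIRECT writes in flight to the same filesystem block at once. A minimal
sketch (assuming a 4096-byte filesystem block size and a made-up file name;
build with -laio), not the actual reproducer from the thread above:

#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	int fd = open("testfile", O_RDWR | O_CREAT | O_DIRECT, 0644);
	void *buf1, *buf2;
	io_context_t ctx = 0;
	struct iocb cb1, cb2, *iocbs[2] = { &cb1, &cb2 };
	struct io_event events[2];

	/* O_DIRECT buffers must be sector aligned */
	posix_memalign(&buf1, 512, 512);
	posix_memalign(&buf2, 512, 512);
	memset(buf1, 0xaa, 512);
	memset(buf2, 0xbb, 512);

	io_setup(2, &ctx);

	/* Two 512-byte writes landing in the same (assumed 4 KB) fs block,
	 * at offsets 0 and 512, submitted together so both are in flight
	 * at the same time. */
	io_prep_pwrite(&cb1, fd, buf1, 512, 0);
	io_prep_pwrite(&cb2, fd, buf2, 512, 512);
	io_submit(ctx, 2, iocbs);

	io_getevents(ctx, 2, 2, events, NULL);
	io_destroy(ctx);
	close(fd);
	return 0;
}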

Cheers,

Dave.
--
Dave Chinner
david(a)fromorbit.com
From: Vladislav Bolkhovitin on
Gennadiy Nerubayev, on 07/24/2010 12:51 AM wrote:
>>>> The real-life problem can be seen in an active-active DRBD setup. In this
>>>> configuration 2 nodes act as a single SCST-powered SCSI device and both
>>>> run DRBD to keep their backing storage in sync. The initiator uses them as
>>>> a single multipath device in an active-active round-robin load-balancing
>>>> configuration, i.e. it sends requests to both nodes in parallel, and DRBD
>>>> then takes care of replicating the requests to the other node.
>>>>
>>>> The problem is that sometimes DRBD complains about concurrent local
>>>> writes, like:
>>>>
>>>> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected! [DISCARD
>>>> L] new: 144072784s +8192; pending: 144072784s +8192
>>>>
>>>> This message means that DRBD detected that both nodes received
>>>> overlapping writes on the same block(s) and DRBD can't figure out which one
>>>> to store. This is possible only if the initiator sent the second write
>>>> request before the first one completed.
>>>>
>>>> The topic of this discussion could well explain the cause of that. But,
>>>> unfortunately, the people who reported it forgot to note which OS they
>>>> run on the initiator, so I can't say for sure it's Linux.
>>>
>>> Sorry for the late chime in, but here's some more information of
>>> potential interest, as I've previously inquired about this on the drbd
>>> mailing list:
>>>
>>> 1. It only happens when using blockio mode in IET or SCST. Fileio,
>>> nv_cache, and write_through do not generate the warnings.
>>
>> Some explanations for those who are not familiar with the terminology:
>>
>> - "Fileio" means the Linux I/O stack on the target receives I/O via
>> vfs_readv()/vfs_writev().
>>
>> - "NV_CACHE" means all cache synchronization requests
>> (SYNCHRONIZE_CACHE, FUA) from the initiator are ignored.
>>
>> - "WRITE_THROUGH" means write through, i.e. the corresponding backend
>> file for the device is opened with the O_SYNC flag.
>>
>>> 2. It happens on active/passive drbd clusters (on the active node
>>> obviously), NOT active/active. In fact, I've found that doing round
>>> robin on active/active is a Bad Idea (tm) even with a clustered
>>> filesystem, until at least the target software is able to synchronize
>>> the command state of either node.
>>> 3. Linux and ESX initiators can generate the warning, but I've so far
>>> only been able to reliably reproduce it using a Windows initiator and
>>> sqlio or iometer benchmarks. I'll be trying again using iometer when I
>>> have the time.
>>> 4. It only happens using a random write io workload (any block size),
>>> with initiator threads>1, OR initiator queue depth>1. The higher
>>> either of those is, the more spammy the warnings become.
>>> 5. The transport does not matter (reproduced with iSCSI and SRP)
>>> 6. If DRBD is disconnected (primary/unknown), the warnings are not
>>> generated. As soon as it's reconnected (primary/secondary), the
>>> warnings will reappear.
>>
>> It would be great if you could prove or disprove our suspicion that Linux
>> can produce several write requests for the same blocks simultaneously. To
>> be sure, we need:
>>
>> 1. The initiator to be Linux. Windows and ESX are not needed for this
>> particular case.
>>
>> 2. If you are able to reproduce it, a full description of which
>> application was used on the initiator to generate the load and in which
>> mode.
>>
>> The target and DRBD configuration doesn't matter; you can use any.
>
> I just tried, and this particular DRBD warning is not reproducible
> with I/O (iometer) coming from a Linux initiator (2.6.30.10). The same
> iometer parameters were used as on Windows, and both the base device
> and a filesystem (ext3) were tested; both were negative. I'll try a
> few more tests, but it seems that this is a non-issue with a Linux
> initiator.

OK, but to be completely sure, can you also check with load generators
other than IOmeter, please? IOmeter on Linux is a lot less effective
than on Windows, because it uses sync I/O, while we need a big multi-I/O
load to trigger the problem we are discussing, if it exists. Plus, to
catch it we need an FS on the initiator side, not raw devices. So,
something like fio over files on an FS, or diskbench, should be more
appropriate. Please don't use direct I/O, to avoid the bug Dave Chinner
pointed out.
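
For example, something along these lines over files on the FS of the
exported device (the mount point, size and runtime are only an
illustration):

fio --name=overlap --directory=/mnt/test --size=1g --rw=randwrite \
    --bs=4k --ioengine=psync --numjobs=16 --runtime=300 --time_based \
    --direct=0 --end_fsync=1

i.e. buffered random rewrites with enough parallel jobs to keep a deep
queue of writeback I/O toward the device.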

Also, you mentioned above that Linux can generate the warning. Can
you recall on which configuration, including the kernel version, the
load application and its configuration, you have seen it?

Thanks,
Vlad