From: Vladislav Bolkhovitin on
Boaz Harrosh, on 06/03/2010 04:07 PM wrote:
> On 06/03/2010 02:20 PM, Vladislav Bolkhovitin wrote:
>> There's one interesting problem here, at least theoretically, with SCSI
>> or similar transports which allow a command queue depth >1 and are
>> allowed to internally reorder queued requests. I don't know the FS/block
>> layers well enough to tell whether sending several requests for the same
>> page is really possible, but we can see a real-life problem which is
>> well explained if it is.
>>
>> The problem arises if the second (rewrite) request (SCSI command) for
>> the same page is queued to the corresponding device before the original
>> request has finished. Since the device is allowed to freely reorder
>> requests, there's a probability that the original write request will hit
>> the permanent storage *AFTER* the retry request, so the data changes it
>> is carrying will be lost; hence, welcome data corruption.
>>
>
> I might be totally wrong here, but I think NCQ can reorder sectors but
> not writes. That is, if the sector is cached in device memory and a later
> write comes to modify the same sector, then the original should be
> replaced; two values of the same sector should not be kept in the device
> cache at the same time.
>
> Failing to do so is a SCSI device problem.

SCSI devices supporting the full task management model (almost all of
them) and having the QUEUE ALGORITHM MODIFIER field in the Control mode
page set to 1 are allowed to freely reorder any commands with the SIMPLE
task attribute. If an application wants to maintain the order of some
commands for such devices, it must issue them with the ORDERED task
attribute and over a _single_ MPIO path to the device.

Linux neither uses the ORDERED attribute, nor honors or enforces the
QUEUE ALGORITHM MODIFIER setting in any way, nor takes care to send
commands with order dependencies (overlapping writes in our case) over a
single MPIO path.
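Purely for illustration (this is not dm-multipath or SCST code, and
path_for_lba() is a made-up helper): if the path were chosen as a stable
function of the LBA rather than round-robin, two overlapping writes could
never race down different I_T nexuses, because they would always queue on
the same path.

/*
 * Purely illustrative, not dm-multipath or SCST code: one way to keep
 * order-dependent (overlapping) writes on a single MPIO path is to pick
 * the path from the starting LBA instead of round-robin, so two writes
 * to the same blocks always travel down the same I_T nexus.
 */
#include <stdint.h>
#include <stdio.h>

#define NR_PATHS 2

/* hypothetical helper: map a starting LBA to a path index */
static unsigned int path_for_lba(uint64_t lba)
{
	/* any stable function of the LBA works; a coarse region split here */
	return (unsigned int)((lba >> 10) % NR_PATHS);
}

int main(void)
{
	uint64_t lba = 144072784;	/* the block from the DRBD message below */

	/* both the original write and the rewrite map to the same path */
	printf("write   -> path %u\n", path_for_lba(lba));
	printf("rewrite -> path %u\n", path_for_lba(lba));
	return 0;
}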

> Please note that page-to-sector is not necessarily constant, and the same
> page might get written to a different sector next time. But FSs will have
> to barrier in this case.
>
>> For single parallel SCSI or SAS devices such a race may look practically
>> impossible, but for sophisticated clusters where many nodes pretend to be
>> a single SCSI device in a load-balancing configuration, it becomes very
>> real.
>>
>> The real-life problem can be seen in an active-active DRBD setup. In this
>> configuration 2 nodes act as a single SCST-powered SCSI device and both
>> run DRBD to keep their backstorage in sync. The initiator uses them as a
>> single multipath device in an active-active round-robin load-balancing
>> configuration, i.e. it sends requests to both nodes in parallel, and DRBD
>> then takes care of replicating the requests to the other node.
>>
>> The problem is that sometimes DRBD complains about concurrent local
>> writes, like:
>>
>> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected!
>> [DISCARD L] new: 144072784s +8192; pending: 144072784s +8192
>>
>> This message means that DRBD detected that both nodes received
>> overlapping writes on the same block(s) and DRBD can't figure out which
>> one to store. This is possible only if the initiator sent the second
>> write request before the first one completed.
>
> It is totally possible in today's code.
>
> DRBD should store the original command_sn of the write and discard
> the sector with the lower SN. It should appear as a single device
> to the initiator.

How can it find the SN? The commands were sent over _different_ MPIO
paths to the device, so at the moment of sending all the ordering
information was lost.

Until SCSI in general allows ordering information to be preserved between
MPIO paths in such configurations, the only way to maintain command order
is queue draining. Hence, for safety, all initiators working with such
devices must do it.
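To make "queue draining" concrete, here is a minimal userspace sketch
(illustrative only, not SCST or kernel code; the device path is a
placeholder, error handling is mostly omitted, build with -laio): before
queueing a write that overlaps one still in flight, the issuer waits for
the in-flight write to complete, so the device never holds both as SIMPLE
tasks at the same time.

#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLK 4096

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	void *buf;
	int fd = open("/dev/sdX", O_WRONLY | O_DIRECT);	/* placeholder path */

	if (fd < 0 || posix_memalign(&buf, BLK, BLK) || io_setup(1, &ctx))
		return 1;

	/* first write of the block */
	memset(buf, 0xAA, BLK);
	io_prep_pwrite(&cb, fd, buf, BLK, 0);
	io_submit(ctx, 1, cbs);

	/*
	 * Drain: wait for the first write to complete before queueing the
	 * overlapping rewrite.  Without this wait both writes could sit in
	 * the device queue as SIMPLE tasks and hit the media in either order.
	 */
	io_getevents(ctx, 1, 1, &ev, NULL);

	/* now the rewrite of the same block is safe to issue */
	memset(buf, 0xBB, BLK);
	io_prep_pwrite(&cb, fd, buf, BLK, 0);
	io_submit(ctx, 1, cbs);
	io_getevents(ctx, 1, 1, &ev, NULL);

	io_destroy(ctx);
	close(fd);
	free(buf);
	return 0;
}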

But it looks like Linux doesn't do that, so is it unsafe with MPIO clusters?

Vlad

From: Vladislav Bolkhovitin on


Vladislav Bolkhovitin, on 06/03/2010 04:41 PM wrote:
> Boaz Harrosh, on 06/03/2010 04:07 PM wrote:
>> On 06/03/2010 02:20 PM, Vladislav Bolkhovitin wrote:
>>> There's one interesting problem here, at least theoretically, with SCSI
>>> or similar transports which allow a command queue depth >1 and are
>>> allowed to internally reorder queued requests. I don't know the FS/block
>>> layers well enough to tell whether sending several requests for the same
>>> page is really possible, but we can see a real-life problem which is
>>> well explained if it is.
>>>
>>> The problem arises if the second (rewrite) request (SCSI command) for
>>> the same page is queued to the corresponding device before the original
>>> request has finished. Since the device is allowed to freely reorder
>>> requests, there's a probability that the original write request will hit
>>> the permanent storage *AFTER* the retry request, so the data changes it
>>> is carrying will be lost; hence, welcome data corruption.
>>>
>> I might be totally wrong here, but I think NCQ can reorder sectors but
>> not writes. That is, if the sector is cached in device memory and a later
>> write comes to modify the same sector, then the original should be
>> replaced; two values of the same sector should not be kept in the device
>> cache at the same time.
>>
>> Failing to do so is a SCSI device problem.
>
> SCSI devices supporting the full task management model (almost all of
> them) and having the QUEUE ALGORITHM MODIFIER field in the Control mode
> page set to 1 are allowed to freely reorder any commands with the SIMPLE
> task attribute. If an application wants to maintain the order of some
> commands for such devices, it must issue them with the ORDERED task
> attribute and over a _single_ MPIO path to the device.
>
> Linux neither uses the ORDERED attribute, nor honors or enforces the
> QUEUE ALGORITHM MODIFIER setting in any way, nor takes care to send
> commands with order dependencies (overlapping writes in our case) over a
> single MPIO path.
>
>> Please note that page-to-sector is not necessarily constant, and the same
>> page might get written to a different sector next time. But FSs will have
>> to barrier in this case.
>>
>>> For single parallel SCSI or SAS devices such a race may look practically
>>> impossible, but for sophisticated clusters where many nodes pretend to be
>>> a single SCSI device in a load-balancing configuration, it becomes very
>>> real.
>>>
>>> The real-life problem can be seen in an active-active DRBD setup. In this
>>> configuration 2 nodes act as a single SCST-powered SCSI device and both
>>> run DRBD to keep their backstorage in sync. The initiator uses them as a
>>> single multipath device in an active-active round-robin load-balancing
>>> configuration, i.e. it sends requests to both nodes in parallel, and DRBD
>>> then takes care of replicating the requests to the other node.
>>>
>>> The problem is that sometimes DRBD complains about concurrent local
>>> writes, like:
>>>
>>> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected!
>>> [DISCARD L] new: 144072784s +8192; pending: 144072784s +8192
>>>
>>> This message means that DRBD detected that both nodes received
>>> overlapping writes on the same block(s) and DRBD can't figure out which
>>> one to store. This is possible only if the initiator sent the second
>>> write request before the first one completed.
>> It is totally possible in today's code.
>>
>> DRBD should store the original command_sn of the write and discard
>> the sector with the lower SN. It should appear as a single device
>> to the initiator.
>
> How can it find the SN? The commands were sent over _different_ MPIO
> paths to the device, so at the moment of sending all the ordering
> information was lost.
>
> Until SCSI in general allows ordering information to be preserved between
> MPIO paths in such configurations, the only way to maintain command order
> is queue draining. Hence, for safety, all initiators working with such
> devices must do it.
>
> But it looks like Linux doesn't do that, so is it unsafe with MPIO clusters?

I meant load balancing MPIO clusters.

Vlad
From: Boaz Harrosh on
On 06/03/2010 03:41 PM, Vladislav Bolkhovitin wrote:
> Boaz Harrosh, on 06/03/2010 04:07 PM wrote:
>> On 06/03/2010 02:20 PM, Vladislav Bolkhovitin wrote:
>>> There's one interesting problem here, at least theoretically, with SCSI
>>> or similar transports which allow a command queue depth >1 and are
>>> allowed to internally reorder queued requests. I don't know the FS/block
>>> layers well enough to tell whether sending several requests for the same
>>> page is really possible, but we can see a real-life problem which is
>>> well explained if it is.
>>>
>>> The problem arises if the second (rewrite) request (SCSI command) for
>>> the same page is queued to the corresponding device before the original
>>> request has finished. Since the device is allowed to freely reorder
>>> requests, there's a probability that the original write request will hit
>>> the permanent storage *AFTER* the retry request, so the data changes it
>>> is carrying will be lost; hence, welcome data corruption.
>>>
>>
>> I might be totally wrong here, but I think NCQ can reorder sectors but
>> not writes. That is, if the sector is cached in device memory and a later
>> write comes to modify the same sector, then the original should be
>> replaced; two values of the same sector should not be kept in the device
>> cache at the same time.
>>
>> Failing to do so is a SCSI device problem.
>
> SCSI devices supporting the full task management model (almost all of
> them) and having the QUEUE ALGORITHM MODIFIER field in the Control mode
> page set to 1 are allowed to freely reorder any commands with the SIMPLE
> task attribute. If an application wants to maintain the order of some
> commands for such devices, it must issue them with the ORDERED task
> attribute and over a _single_ MPIO path to the device.
>
> Linux neither uses the ORDERED attribute, nor honors or enforces the
> QUEUE ALGORITHM MODIFIER setting in any way, nor takes care to send
> commands with order dependencies (overlapping writes in our case) over a
> single MPIO path.
>

OK, I take your word for it. But that sounds stupid to me. I would think
that sectors can be ordered, not commands per se. What happens with reads
then? Do they get ordered? I mean, for a read in between the two writes,
which value is read? It gets so complicated that only a sector model makes
sense to me.

>> Please note that page-to-sector is not necessarily constant, and the same
>> page might get written to a different sector next time. But FSs will have
>> to barrier in this case.
>>
>>> For single parallel SCSI or SAS devices such a race may look practically
>>> impossible, but for sophisticated clusters where many nodes pretend to be
>>> a single SCSI device in a load-balancing configuration, it becomes very
>>> real.
>>>
>>> The real-life problem can be seen in an active-active DRBD setup. In this
>>> configuration 2 nodes act as a single SCST-powered SCSI device and both
>>> run DRBD to keep their backstorage in sync. The initiator uses them as a
>>> single multipath device in an active-active round-robin load-balancing
>>> configuration, i.e. it sends requests to both nodes in parallel, and DRBD
>>> then takes care of replicating the requests to the other node.
>>>
>>> The problem is that sometimes DRBD complains about concurrent local
>>> writes, like:
>>>
>>> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected!
>>> [DISCARD L] new: 144072784s +8192; pending: 144072784s +8192
>>>
>>> This message means that DRBD detected that both nodes received
>>> overlapping writes on the same block(s) and DRBD can't figure out which
>>> one to store. This is possible only if the initiator sent the second
>>> write request before the first one completed.
>>
>> It is totally possible in today's code.
>>
>> DRBD should store the original command_sn of the write and discard
>> the sector with the lower SN. It should appear as a single device
>> to the initiator.
>
> How can it find the SN? The commands were sent over _different_ MPIO
> paths to the device, so at the moment of sending all the ordering
> information was lost.
>

I'm not hard on the specifics here. But I think the initiator has set
the same SN on the two paths, or has incremented it between paths.
You said:

> The initiator uses them as a single multipath device in an active-active
> round-robin load-balancing configuration, i.e. sends requests to both nodes
> in parallel.

So what was the SN sent to each side? Is there a relationship between them,
or do they each advance independently?

If there is a relationship then the targets on the two sides should store
the SN for later comparison. (Life is hard.)

> Until SCSI in general allows ordering information to be preserved between
> MPIO paths in such configurations, the only way to maintain command order
> is queue draining. Hence, for safety, all initiators working with such
> devices must do it.
>
> But it looks like Linux doesn't do that, so is it unsafe with MPIO clusters?
>
> Vlad
>

Thanks
Boaz
From: Vladislav Bolkhovitin on
Boaz Harrosh, on 06/03/2010 05:06 PM wrote:
> On 06/03/2010 03:41 PM, Vladislav Bolkhovitin wrote:
>> Boaz Harrosh, on 06/03/2010 04:07 PM wrote:
>>> On 06/03/2010 02:20 PM, Vladislav Bolkhovitin wrote:
>>>> There's one interesting problem here, at least theoretically, with SCSI
>>>> or similar transports which allow a command queue depth >1 and are
>>>> allowed to internally reorder queued requests. I don't know the FS/block
>>>> layers well enough to tell whether sending several requests for the same
>>>> page is really possible, but we can see a real-life problem which is
>>>> well explained if it is.
>>>>
>>>> The problem arises if the second (rewrite) request (SCSI command) for
>>>> the same page is queued to the corresponding device before the original
>>>> request has finished. Since the device is allowed to freely reorder
>>>> requests, there's a probability that the original write request will hit
>>>> the permanent storage *AFTER* the retry request, so the data changes it
>>>> is carrying will be lost; hence, welcome data corruption.
>>>>
>>> I might be totally wrong here, but I think NCQ can reorder sectors but
>>> not writes. That is, if the sector is cached in device memory and a later
>>> write comes to modify the same sector, then the original should be
>>> replaced; two values of the same sector should not be kept in the device
>>> cache at the same time.
>>>
>>> Failing to do so is a SCSI device problem.
>> SCSI devices supporting the full task management model (almost all of
>> them) and having the QUEUE ALGORITHM MODIFIER field in the Control mode
>> page set to 1 are allowed to freely reorder any commands with the SIMPLE
>> task attribute. If an application wants to maintain the order of some
>> commands for such devices, it must issue them with the ORDERED task
>> attribute and over a _single_ MPIO path to the device.
>>
>> Linux neither uses the ORDERED attribute, nor honors or enforces the
>> QUEUE ALGORITHM MODIFIER setting in any way, nor takes care to send
>> commands with order dependencies (overlapping writes in our case) over a
>> single MPIO path.
>>
>
> OK, I take your word for it. But that sounds stupid to me. I would think
> that sectors can be ordered, not commands per se. What happens with reads
> then? Do they get ordered? I mean, for a read in between the two writes,
> which value is read? It gets so complicated that only a sector model makes
> sense to me.

Look wider. For a single HDD your way of thinking makes sense. But how
about big clusters consisting of many nodes with many clients? In them,
maintaining an internal command order is generally bad and often way too
expensive for performance.

It's the same as with modern CPUs, where for performance reasons
programmers also must live with the possibility of reordered memory
accesses and use barriers when necessary.
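Spelling out the CPU analogy only (nothing SCSI-specific here; a minimal
sketch using the GCC/Clang __atomic builtins, build with -pthread): without
the two fences, the producer's stores may become visible to the other
thread in either order, much like two SIMPLE-tagged writes sitting in a
deep queue.

#include <pthread.h>
#include <stdio.h>

static int data;
static int ready;

static void *producer(void *arg)
{
	(void)arg;
	data = 42;					/* "first write"  */
	__atomic_thread_fence(__ATOMIC_RELEASE);	/* barrier between the two stores */
	__atomic_store_n(&ready, 1, __ATOMIC_RELAXED);	/* "second write" */
	return NULL;
}

static void *consumer(void *arg)
{
	(void)arg;
	while (!__atomic_load_n(&ready, __ATOMIC_RELAXED))
		;					/* spin until the flag is seen */
	__atomic_thread_fence(__ATOMIC_ACQUIRE);
	printf("data = %d\n", data);			/* 42 is guaranteed only because of the fences */
	return NULL;
}

int main(void)
{
	pthread_t p, c;

	pthread_create(&p, NULL, producer, NULL);
	pthread_create(&c, NULL, consumer, NULL);
	pthread_join(p, NULL);
	pthread_join(c, NULL);
	return 0;
}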

>>> Please note that page-to-sector is not necessarily constant, and the same
>>> page might get written to a different sector next time. But FSs will have
>>> to barrier in this case.
>>>
>>>> For single parallel SCSI or SAS devices such a race may look practically
>>>> impossible, but for sophisticated clusters where many nodes pretend to be
>>>> a single SCSI device in a load-balancing configuration, it becomes very
>>>> real.
>>>>
>>>> The real-life problem can be seen in an active-active DRBD setup. In this
>>>> configuration 2 nodes act as a single SCST-powered SCSI device and both
>>>> run DRBD to keep their backstorage in sync. The initiator uses them as a
>>>> single multipath device in an active-active round-robin load-balancing
>>>> configuration, i.e. it sends requests to both nodes in parallel, and DRBD
>>>> then takes care of replicating the requests to the other node.
>>>>
>>>> The problem is that sometimes DRBD complains about concurrent local
>>>> writes, like:
>>>>
>>>> kernel: drbd0: scsi_tgt0[12503] Concurrent local write detected!
>>>> [DISCARD L] new: 144072784s +8192; pending: 144072784s +8192
>>>>
>>>> This message means that DRBD detected that both nodes received
>>>> overlapping writes on the same block(s) and DRBD can't figure out which
>>>> one to store. This is possible only if the initiator sent the second
>>>> write request before the first one completed.
>>> It is totally possible in today's code.
>>>
>>> DRBD should store the original command_sn of the write and discard
>>> the sector with the lower SN. It should appear as a single device
>>> to the initiator.
>> How can it find the SN? The commands were sent over _different_ MPIO
>> paths to the device, so at the moment of sending all the ordering
>> information was lost.
>>
>
> I'm not hard on the specifics here. But I think the initiator has set
> the same SN on the two paths, or has incremented it between paths.
> You said:
>
>> The initiator uses them as a single multipath device in an active-active
>> round-robin load-balancing configuration, i.e. sends requests to both nodes
>> in parallel.
>
> So what was the SN sent to each side? Is there a relationship between them,
> or do they each advance independently?
>
> If there is a relationship then the targets on the two sides should store
> the SN for later comparison. (Life is hard.)

None of the SCSI transports carry any SN-related ordering information
between paths (I_T nexuses) in their internal packets, including iSCSI.
It's simply outside the scope of SAM. If you need ordering information
between paths, you must use "extensions" like iSCSI MC/S, but they are
bad for many other reasons. I summarized them at
http://scst.sourceforge.net/mc_s.html.

Vlad

From: Chris Mason on
On Wed, Jun 02, 2010 at 11:41:21PM +1000, Nick Piggin wrote:
> On Wed, Jun 02, 2010 at 09:17:56AM -0400, Martin K. Petersen wrote:
> > >>>>> "Nick" == Nick Piggin <npiggin(a)suse.de> writes:
> >
> > >> 1) filesystem changed it
> > >> 2) corruption on the wire or in the raid controller
> > >> 3) the page was corrupted while the IO layer was doing the IO.
> > >>
> > >> 1 and 2 are easy, we bounce, retry and everyone continues on with
> > >> their lives. With #3, we'll recrc and send the IO down again
> > >> thinking the data is correct when really we're writing garbage.
> > >>
> > >> How can we tell these three cases apart?
> >
> > Nick> Do we really need to handle #3? It could have happened before the
> > Nick> checksum was calculated.
> >
> > Reason #3 is one of the main reasons for having the checksum in the
> > first place. The whole premise of the data integrity extensions is that
> > the checksum is calculated in close temporal proximity to the data
> > creation. I.e. eventually in userland.
> >
> > Filesystems will inevitably have to be integrity-aware for that to work.
> > And it will be their job to keep the data pages stable during DMA.
>
> Let's just think hard about what windows can actually be closed versus
> how much effort goes into closing them. I also prefer not to accept
> half-solutions in the kernel because they don't want to implement real
> solutions in hardware (it's pretty hard to checksum and protect all
> kernel data structures by hand).
>
> For "normal" writes into pagecache, the data can get corrupted anywhere
> from after it is generated in userspace, during the copy, while it is
> dirty in cache, and while it is being written out.

This is why the DIF/DIX spec has the idea of a crc generated in userland
when the data is generated. At any rate, the basic idea is to crc early
but not often... recalculating the crc after we hand our precious memory
to the evil device driver does weaken the end-to-end integrity checks.

What I don't want to do is weaken the basic DIF/DIX structure by letting
the lower layers recrc stuff as they find faults. It would be fine if we
had some definitive way to say "the FS raced, just recrc", but we really
don't.
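A minimal sketch of that split (illustrative only; zlib's crc32 stands in
for the real DIF guard tag, and the struct is made up; build with -lz):
the checksum is computed where the data is generated, and the submit path
refuses to silently re-tag a buffer whose contents no longer match,
because it cannot tell a legitimate rewrite from memory corruption.

#include <zlib.h>
#include <stdio.h>
#include <string.h>

struct tagged_buf {
	unsigned char data[4096];
	uLong crc;		/* guard computed at data-generation time */
};

static void generate(struct tagged_buf *b)
{
	memset(b->data, 0x5a, sizeof(b->data));		/* application creates the data */
	b->crc = crc32(crc32(0L, Z_NULL, 0), b->data, sizeof(b->data));
}

static int submit(const struct tagged_buf *b)
{
	uLong now = crc32(crc32(0L, Z_NULL, 0), b->data, sizeof(b->data));

	if (now != b->crc) {
		/*
		 * The buffer changed after the tag was generated: either a
		 * legitimate rewrite (case 1) or corruption in memory
		 * (case 3).  The submitter cannot tell which, so silently
		 * re-tagging here would defeat the end-to-end check.
		 */
		fprintf(stderr, "guard mismatch, refusing to re-tag\n");
		return -1;
	}
	/* ... hand data plus crc to the lower layers here ... */
	return 0;
}

int main(void)
{
	struct tagged_buf b;

	generate(&b);
	return submit(&b) ? 1 : 0;
}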

>
> Closing the "while it is dirty / while it is being written back" window
> still leaves a pretty big window. Also, how do you handle mmap writes?
> Write protect and checksum the destination page after every store? Or
> leave some window between when the pagecache is dirtied and when it is
> written back? So I don't know whether it's worth putting a lot of effort
> into this case.

So, changing gears from the generic idea of DIF/DIX to how we protect
filesystem page cache pages: btrfs crcs just before writing, which does
leave a pretty big window for the page to get corrupted. The storage
layer shouldn't care or know about that, though; we hand it a crc and it
makes sure that data matching that crc goes to the media.

>
> If you had an interface for userspace to insert checksums into direct IO
> requests or pagecache ranges, then not only could you close the entire gap
> between userspace data generation and writeback, but you could also handle
> mmap writes and anything else just fine: userspace knows about the
> concurrency details, so it can add the right checksum (and potentially
> fsync etc.) when it's ready.

Yeah, I do agree here.

-chris