dm: only initialize full request_queue for request-based device [Kernel]

Prev: [PATCH] blkdev: cgroup whitelist permission fix
Next: i2c-i801: fix all coding style issues found by checkpatch.pl

From: Kiyoshi Ueda on 14 May 2010 04:20

Hi Mike,

On 05/13/2010 12:57 PM +0900, Mike Snitzer wrote:
> On Wed, May 12 2010 at 4:23am -0400,
> Kiyoshi Ueda <k-ueda(a)ct.jp.nec.com> wrote:
>
>> Hi Mike,
>>
>> On 05/11/2010 10:15 PM +0900, Mike Snitzer wrote:
>>> It is clear we need to resolve the current full request_queue
>>> initialization that occurs even for bio-based DM devices.
>>>
>>> I believe the 2 patches I posted accomplish this in a stright-forward
>>> way. We can always improve on it (by looking at what you proposed
>>> below) but we need a minimlaist fix that doesn't depend on userspace
>>> LVM2 changes right now.
>> Humm, OK.
>> Indeed, showing iosched directory in bio-based device's sysfs is
>> confusing users actually, and we need something to resolve that soon.
>> So I don't strongly object to your approach as the first step, as long
>> as we can accept the risk of the maintenance cost which I mentioned.
>
> OK, I understand your concern (especially after having gone through this
> further over the past couple days). I'm hopeful maintenance will be
> minimal.
>
>> By the way, your current patch has a problem below.
>> It needs to be fixed at least.
>
> Thanks for the review. I've addressed both your points with additional
> changes (split between 2 patches).

I found another bug; blk_init_allocated_queue() overwrites q->unplug_fn
with generic one. (Although it's currently harmless for multipath target.)

I feel your patch-set is growing and becoming complex fix rather than
minimalist/simple fix.
I think touching mapped_device/queue at table loading time makes
things complex. It is natural that table's arguments are reflected
to mapped_device/queue at table binding time instead.

> I confirmed with Alasdair that we don't want to conditionally relax DM's
> allocation constraints for request-based DM. Best to be consistent
> about not allowing allocations during resume.

OK.
Then, how about the patch below?
It still fully initializes queue at device creation time for both
bio-based and request-based. Then, it drops elevator when the device
type is decided as bio-based at table binding time.
So no memory allocation during resume (freeing instead).

I hope this simple work-around approach is acceptable for you and
Alasdair (and others).
(Then, let's focus on making a right fix rather than hacking
the block layer.)

[PATCH] dm: drop elevator when the device is bio-based

I/O scheduler affects nothing for bio-based dm device.
Showing I/O scheduler's attributes in sysfs for bio-based dm devices
is confusing users.
So drop them from sysfs when a dm device is decided as bio-based.

This patch depends on the commit below in the for-2.6.35 of Jens'
linux-2.6-block git:
commit 01effb0dc1451fad55925873ffbfb88fa4eadce0
Author: Mike Snitzer <snitzer(a)redhat.com>
Date: Tue May 11 08:57:42 2010 +0200

block: allow initialization of previously allocated request_queue

Signed-off-by: Kiyoshi Ueda <k-ueda(a)ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura(a)ce.jp.nec.com>
Cc: Mike Snitzer <snitzer(a)redhat.com>
Cc: Alasdair G Kergon <agk(a)redhat.com>
---
drivers/md/dm.c | 8 ++++++++
1 file changed, 8 insertions(+)

Index: 2.6.34-rc7/drivers/md/dm.c
===================================================================
--- 2.6.34-rc7.orig/drivers/md/dm.c
+++ 2.6.34-rc7/drivers/md/dm.c
@@ -2410,6 +2410,14 @@ struct dm_table *dm_swap_table(struct ma
goto out;
}

+ /* drop elevator when the device type is decided as bio-based */
+ if (!md->map && dm_table_get_type(table) == DM_TYPE_BIO_BASED) {
+ elv_unregister_queue(md->queue);
+ elevator_exit(md->queue->elevator);
+ md->queue->request_fn = NULL;
+ md->queue->elevator = NULL;
+ }
+
map = __bind(md, table, &limits);

out:
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Kiyoshi Ueda on 17 May 2010 05:40

Hi Mike,

On 05/14/2010 11:08 PM +0900, Mike Snitzer wrote:
> On Fri, May 14 2010 at 4:06am -0400,
> Kiyoshi Ueda <k-ueda(a)ct.jp.nec.com> wrote:
>
>> Hi Mike,
>>
>> On 05/13/2010 12:57 PM +0900, Mike Snitzer wrote:
>>> On Wed, May 12 2010 at 4:23am -0400,
>>> Kiyoshi Ueda <k-ueda(a)ct.jp.nec.com> wrote:
>>>
>>>> Hi Mike,
>>>>
>>>> On 05/11/2010 10:15 PM +0900, Mike Snitzer wrote:
>>>>> It is clear we need to resolve the current full request_queue
>>>>> initialization that occurs even for bio-based DM devices.
>>>>>
>>>>> I believe the 2 patches I posted accomplish this in a stright-forward
>>>>> way. We can always improve on it (by looking at what you proposed
>>>>> below) but we need a minimlaist fix that doesn't depend on userspace
>>>>> LVM2 changes right now.
>>>>
>>>> Humm, OK.
>>>> Indeed, showing iosched directory in bio-based device's sysfs is
>>>> confusing users actually, and we need something to resolve that soon.
>>>> So I don't strongly object to your approach as the first step, as long
>>>> as we can accept the risk of the maintenance cost which I mentioned.
>>>
>>> OK, I understand your concern (especially after having gone through this
>>> further over the past couple days). I'm hopeful maintenance will be
>>> minimal.
snip
>> I feel your patch-set is growing and becoming complex fix rather than
>> minimalist/simple fix.
>>
>> I think touching mapped_device/queue at table loading time makes
>> things complex. It is natural that table's arguments are reflected
>> to mapped_device/queue at table binding time instead.
>
> Seems the numerous patches I've posted over the past couple days have
> given a false impression. But I do understand your concern.
>
> The longer-term direction you want to take DM (known type during
> alloc_dev) illustrates that we share a common goal: only do the precise
> work that is needed to initialize the request_queue for a DM device
> (whether it is bio-based or request-based).
>
> My changes do accomplish that in the near-term without requiring
> userspace change. Many of my proposed changes are just refactoring
> existing code to clearly split out what is needed for request-based vs
> bio-based.

As far as I understand, the current model of device-mapper is:
- a table (precisely, a target) has various attributes,
bio-based/request-based is one of such attributes
- a table and its attributes are bound to the block device on resume
If we want to fix a problem, I think we should either work based on
this model or change the model.

Your patch makes that loading table affects the block device, so you
are changing the model.

If you change the model, it should be done carefully.
For example, the current model allows most of the table loading code
to run without exclusive lock on the device because it doesn't affect
the device itself. If you change this model, table loading needs to
be serialized with appropriate locking.

My suggestion was (and is) to change the current model by determining
bio-based/request-based at device creation time, not by a table.

> I'll post a single patch that contains my changes (no need to have 2
> patches any more). With that patch I'm hopeful you'll see my changes
> aren't as complex as they might have seemed over the past few days.

The number of patches doesn't matter.
I still feel it's becoming complex, for example it now has to take
care of reverting md/queue changes by previous preload.

Also, you have to take care of race condition between concurrent table
loadings which was mentioned above, because the table which was used to
initialized the queue may not be set in hc->new_map.
Although such races may not happen in real usages, they could cause
some critical problems (maybe kernel panic, maybe memory leak).

Thanks,
Kiyoshi Ueda
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Kiyoshi Ueda on 18 May 2010 04:40

Hi Mike,

On 05/18/2010 02:27 AM +0900, Mike Snitzer wrote:
> Kiyoshi Ueda <k-ueda(a)ct.jp.nec.com> wrote:
>> On 05/14/2010 11:08 PM +0900, Mike Snitzer wrote:
>>> Kiyoshi Ueda <k-ueda(a)ct.jp.nec.com> wrote:
>>>> On 05/13/2010 12:57 PM +0900, Mike Snitzer wrote:
>>>>> Kiyoshi Ueda <k-ueda(a)ct.jp.nec.com> wrote:
>>>>>> On 05/11/2010 10:15 PM +0900, Mike Snitzer wrote:
>>>>>>> It is clear we need to resolve the current full request_queue
>>>>>>> initialization that occurs even for bio-based DM devices.
>>>>>>>
>>>>>>> I believe the 2 patches I posted accomplish this in a stright-forward
>>>>>>> way. We can always improve on it (by looking at what you proposed
>>>>>>> below) but we need a minimlaist fix that doesn't depend on userspace
>>>>>>> LVM2 changes right now.
>>>>>>
>>>>>> Humm, OK.
>>>>>> Indeed, showing iosched directory in bio-based device's sysfs is
>>>>>> confusing users actually, and we need something to resolve that soon.
>>>>>> So I don't strongly object to your approach as the first step, as long
>>>>>> as we can accept the risk of the maintenance cost which I mentioned.
>>>>>
>>>>> OK, I understand your concern (especially after having gone through this
>>>>> further over the past couple days). I'm hopeful maintenance will be
>>>>> minimal.
>> snip
>>>> I feel your patch-set is growing and becoming complex fix rather than
>>>> minimalist/simple fix.
>>>>
>>>> I think touching mapped_device/queue at table loading time makes
>>>> things complex. It is natural that table's arguments are reflected
>>>> to mapped_device/queue at table binding time instead.
>>>
>>> Seems the numerous patches I've posted over the past couple days have
>>> given a false impression. But I do understand your concern.
>>>
>>> The longer-term direction you want to take DM (known type during
>>> alloc_dev) illustrates that we share a common goal: only do the precise
>>> work that is needed to initialize the request_queue for a DM device
>>> (whether it is bio-based or request-based).
>>>
>>> My changes do accomplish that in the near-term without requiring
>>> userspace change. Many of my proposed changes are just refactoring
>>> existing code to clearly split out what is needed for request-based vs
>>> bio-based.
>>
>> As far as I understand, the current model of device-mapper is:
>> - a table (precisely, a target) has various attributes,
>> bio-based/request-based is one of such attributes
>> - a table and its attributes are bound to the block device on resume
>> If we want to fix a problem, I think we should either work based on
>> this model or change the model.
>>
>> Your patch makes that loading table affects the block device, so you
>> are changing the model.
>>
>> If you change the model, it should be done carefully.
>> For example, the current model allows most of the table loading code
>> to run without exclusive lock on the device because it doesn't affect
>> the device itself. If you change this model, table loading needs to
>> be serialized with appropriate locking.
>
> Nice catch, yes md->queue needs protection (see patch below).

Not enough. (See drivers/md/dm-ioctl.c:table_load().)
Table load sequence is:
1. populate table
2. set the table to ->new_map of the hash_cell for the mapped_device
in protection by _hash_lock.

Since your fix only serializes the step 1, concurrent table loading
could end up with inconsistent status; e.g. request-based table is
bound to the mapped_device while the queue is initialized as bio-based.
With your new model, those 2 steps above must be atomic.

Thanks,
Kiyoshi Ueda
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Kiyoshi Ueda on 19 May 2010 02:00

Hi Mike,

On 05/18/2010 10:46 PM +0900, Mike Snitzer wrote:
> Kiyoshi Ueda <k-ueda(a)ct.jp.nec.com> wrote:
>> On 05/18/2010 02:27 AM +0900, Mike Snitzer wrote:
>>> Kiyoshi Ueda <k-ueda(a)ct.jp.nec.com> wrote:
>>>> As far as I understand, the current model of device-mapper is:
>>>> - a table (precisely, a target) has various attributes,
>>>> bio-based/request-based is one of such attributes
>>>> - a table and its attributes are bound to the block device on resume
>>>> If we want to fix a problem, I think we should either work based on
>>>> this model or change the model.
>>>>
>>>> Your patch makes that loading table affects the block device, so you
>>>> are changing the model.
>>>>
>>>> If you change the model, it should be done carefully.
>>>> For example, the current model allows most of the table loading code
>>>> to run without exclusive lock on the device because it doesn't affect
>>>> the device itself. If you change this model, table loading needs to
>>>> be serialized with appropriate locking.
>>>
>>> Nice catch, yes md->queue needs protection (see patch below).
>>
>> Not enough. (See drivers/md/dm-ioctl.c:table_load().)
>> Table load sequence is:
>> 1. populate table
>> 2. set the table to ->new_map of the hash_cell for the mapped_device
>> in protection by _hash_lock.
>>
>> Since your fix only serializes the step 1, concurrent table loading
>> could end up with inconsistent status; e.g. request-based table is
>> bound to the mapped_device while the queue is initialized as bio-based.
>> With your new model, those 2 steps above must be atomic.
>
> Ah, yes.. I looked at the possibility of serializing the entirety of
> table_load but determined that would be too excessive (would reduce
> parallelism of table_load). But I clearly missed the fact that there
> could be a race to the _hash_lock protected critical section in
> table_load() -- leading to queue inconsistency.
>
> I'll post v5 of the overall patch which will revert the mapped_device
> 'queue_lock' serialization that I proposed in v4. v5 will contain
> the following patch to localize all table load related queue
> manipulation to the _hash_lock protected critical section in
> table_load(). So it sets the queue up _after_ the table's type is
> established with dm_table_set_type().

dm_table_setup_md_queue() may allocate memory with blocking mode.
Blocking allocation inside exclusive _hash_lock can cause deadlock;
e.g. when it has to wait for other dm devices to resume to free some
memory.

Also, your patch changes the queue configuration even when a table is
already active and used. (e.g. Loading bio-based table to a mapped_device
which is already active/used as request-based sets q->requst_fn in NULL.)
That could cause some critical problems.

Thanks,
Kiyoshi Ueda
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Kiyoshi Ueda on 20 May 2010 07:30

Hi Mike,

On 05/19/2010 11:39 PM +0900, Mike Snitzer wrote:
> Mike Snitzer <snitzer(a)redhat.com> wrote:
>> On Wed, May 19, 2010 at 1:57 AM, Kiyoshi Ueda <k-ueda(a)ct.jp.nec.com> wrote:
>>> On 05/18/2010 10:46 PM +0900, Mike Snitzer wrote:
>>>> I'll post v5 of the overall patch which will revert the mapped_device
>>>> 'queue_lock' serialization that I proposed in v4. v5 will contain
>>>> the following patch to localize all table load related queue
>>>> manipulation to the _hash_lock protected critical section in
>>>> table_load(). So it sets the queue up _after_ the table's type is
>>>> established with dm_table_set_type().
>>>
>>> dm_table_setup_md_queue() may allocate memory with blocking mode.
>>> Blocking allocation inside exclusive _hash_lock can cause deadlock;
>>> e.g. when it has to wait for other dm devices to resume to free some
>>> memory.
>>
>> We make no guarantees that other DM devices will resume before a table
>> load -- so calling dm_table_setup_md_queue() within the exclusive
>> _hash_lock is no different than other DM devices being suspended while
>> a request-based DM device performs its first table_load().
>>
>> My thinking was this should not be a problem as it is only valid to
>> call dm_table_setup_md_queue() before the newly created request-based
>> DM device has been resumed.
>>
>> AFAIK we don't have any explicit constraints on memory allocations
>> during table load (e.g. table loads shouldn't depend on other devices'
>> writeback) -- but any GFP_KERNEL allocation could recurse
>> (elevator_alloc() currently uses GFP_KERNEL with kmalloc_node)...
>>
>> I'll have to review the DM code further to see if all memory
>> allocations during table_load() are done via mempools. I'll also
>> bring this up on this week's LVM call.
>
> We discussed this and I understand the scope of the problem now.
>
> Just reiterating what you covered when you first pointed this issue out:
>
> It could be that a table load gets blocked (waiting on a memory
> allocation). The table load can take as long as it needs. But we can't
> have it block holding the exclusive _hash_lock while blocking. Having
> _hash_lock prevents further DM ioctls. The table load's allocation may
> be blocking waiting for writeback to a DM device that will be resumed by
> another thread.

That's right.

> Thanks again for pointing this out; I'll work to arrive at an
> alternative locking scheme. Likely introduce a lock local to the
> multiple_device (effectively the 'queue_lock' I had before). But
> difference is I'd take that lock before taking _hash_lock.

You have to see do_resume(), too.
do_resume() gets hc->new_map with _hash_lock held but without your
'queue_lock' held. This could cause inconsistent status;
table_load() for bio-based table do_resume() for rq-based table
--------------------------------------------------------------------
dm_lock_md_queue(md)
dm_table_setup_md_queue(t) (<= bio-based)
dm_clear_request_based_queue(md)
md->queue->request_fn = NULL
down_write(_hash_lock)
new_map = hc->new_map (<= rq-based)
up_write(_hash_lock)
dm_swap_table(md, new_map)
__bind(md, new_map) (<= rq-based)
down_write(_hash_lock)
hc->new_map = t
up_write(_hash_lock)
dm_unlock_md_queue(md)

Other ioctls might be as well, since I have checked only interaction
between table load and resume.

> I hope to have v6 available at some point today but it may be delayed by
> a day.

Sorry for repeating but since you are trying to change the current dm
model, please design/consider it carefully; what objects/ioctls are
involved and how you change their relationships, where are the critical
sections and how you protect them.
Otherwise, I'm afraid we will just spend your time in transforming
one bug to another. (and my time for reviewing and verifying problems :)

>>> Also, your patch changes the queue configuration even when a table is
>>> already active and used. (e.g. Loading bio-based table to a mapped_device
>>> which is already active/used as request-based sets q->requst_fn in NULL.)
>>> That could cause some critical problems.
>>
>> Yes, that is possible and I can add additional checks to prevent this.
>> But this speaks to a more general problem with the existing DM code.
>>
>> dm_swap_table() has the negative check to prevent such table loads, e.g.:
>> /* cannot change the device type, once a table is bound */
>>
>> This check should come during table_load, as part of
>> dm_table_set_type(), rather than during table resume.
>
> I'll look to address this issue in v6 too.

It can race between table_load() and do_resume() as well;
table_load() for bio-based do_resume() for rq-based
---------------------------------------------------------------------
dm_table_set_type(t) (<= bio-based)
live_table = dm_get_live_table(md)
(assume no live_table yet)
new_map = hc->new_map (<= rq-based)
dm_swap_table(md, new_map)
__bind(md, new_map)
dm_table_setup_md_queue(t)
dm_clear_request_based_queue(md)
md->queue->request_fn = NULL

Thanks,
Kiyoshi Ueda
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

| Next | Last
Pages: 1 2
Prev: [PATCH] blkdev: cgroup whitelist permission fix
Next: i2c-i801: fix all coding style issues found by checkpatch.pl