From: Christoph Hellwig on
On Thu, Aug 05, 2010 at 10:08:44AM -0700, Jeremy Fitzhardinge wrote:
> On 08/04/2010 09:44 AM, Christoph Hellwig wrote:
> >>But either the blkfront patch is wrong and it needs to be fixed,
> >Actually both the old and the new one are wrong, but I'd say the new
> >one is even more wrong.
> >
> >_TAG implies that the device can do ordering by tag. And at least the
> >qemu xen_disk backend doesn't when it advertises this feature.
>
> We don't use qemu at all for block storage; qemu (afaik) doesn't
> have a blkback protocol implementation in it. I'm guessing xen_disk
> is to allow kvm to be compatible with Xen disk images? It certainly
> isn't a reference implementation.

Disk image formats have nothing to do with the I/O interface. I
believe Gerd added it for running unmodified Xen guests in qemu,
but he can explain more of it.

I've only mentioned it here because it's the one I easily have access
to. Given Xen's roughly four different I/O backends and the various forked
trees, it's rather hard to find the official reference.

> >I'm pretty sure most if not all of the original Xen backends do the
> >same. Given that I have tried to implement tagged ordering in qemu
> >I know that comes down to doing exactly the same draining we already
> >do in the kernel, just duplicated in the virtual disk backend. That
> >is for a userspace implementation - for a kernel implementation only
> >using block devices we could in theory implement it using barriers,
> >but that would be even more inefficient. And last time I looked
> >at the in-kernel xen disk backend it didn't do that either.
>
> blkback - the in-kernel backend - does generate barriers when it
> receives one from the guest. Could you expand on why passing a
> guest barrier through to the host IO stack would be bad for
> performance? Isn't this exactly the same as a local writer
> generating a barrier?

If you pass it on it has the same semantics, but given that you'll
usually end up having multiple guest disks on a single volume using
lvm or similar you'll end up draining even more I/O as there is one
queue for all of them. That way you can easily have one guest starve
others.

Note that we're going to get rid of the draining for common cases
anyway, but that's a separate discussion thread, the "relaxed barriers"
one.

> It's true that a number of the Xen backends end up implementing
> barriers via drain for simplicity's sake, but there's no inherent
> reason why they couldn't implement a more complete tagged model.

If they are in Linux/POSIX userspace they can't, because there are
no system calls to achieve that. And then again there really is
no need to implement all this in the host anyway - the draining
is something we enforced on ourselves in Linux without good reason,
which we're trying to get rid of and which no other OS ever did.

> >Now where both the old and the new one are buggy is that they don't
> >include the QUEUE_ORDERED_DO_PREFLUSH and
> >QUEUE_ORDERED_DO_POSTFLUSH/QUEUE_ORDERED_DO_FUA, which means any
> >explicit cache flush (aka empty barrier) is silently dropped, making
> >fsync and co not preserve data integrity.
>
> Ah, OK, something specific. What level ends up dropping the empty
> barrier? Certainly an empty WRITE_BARRIER operation to the backend
> will cause all prior writes to be durable, which should be enough.
> Are you saying that there's an extra flag we should be passing to
> blk_queue_ordered(), or is there some other interface we should be
> implementing for explicit flushes?
>
> Is there a good reference implementation we can use as a model?

Just read Documentation/block/barriers.txt; it's very well described
there. Even the naming of the various ORDERED constants should
give enough hints.
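
For illustration only, a minimal sketch (not taken from blkfront or any
of the backends discussed here) of what picking one of those constants
looks like, assuming the two-argument blk_queue_ordered() left after
prepare_flush_fn was removed (see later in this thread); the helper
name and its cache parameter are made up:

    #include <linux/blkdev.h>

    /* hypothetical setup helper for a blkfront-like driver */
    static void example_setup_ordered(struct request_queue *q,
                                      bool backend_writeback_cache)
    {
            if (backend_writeback_cache)
                    /*
                     * QUEUE_ORDERED_TAG_FLUSH is ordering by tag plus
                     * DO_PREFLUSH/DO_POSTFLUSH, so the block layer also
                     * issues pure cache flushes (empty barriers) instead
                     * of silently dropping them.
                     */
                    blk_queue_ordered(q, QUEUE_ORDERED_TAG_FLUSH);
            else
                    /*
                     * Plain QUEUE_ORDERED_TAG only claims ordering by
                     * tag; no flushes are generated, which is only safe
                     * if there is no volatile write cache behind the
                     * backend.
                     */
                    blk_queue_ordered(q, QUEUE_ORDERED_TAG);
    }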

> As I said before, the qemu xen backend is irrelevant.

It's one of the many backends written to the protocol specification,
so I don't think it's fair to call it irrelevant. And as mentioned before,
I'd be very surprised if the other backends all get it right. If you
send me pointers to one or two backends you consider "relevant", I'm
happy to look at them.

From: Gerd Hoffmann on
On 08/05/10 19:19, Christoph Hellwig wrote:
> On Thu, Aug 05, 2010 at 10:08:44AM -0700, Jeremy Fitzhardinge wrote:
>> On 08/04/2010 09:44 AM, Christoph Hellwig wrote:
>>>> But either the blkfront patch is wrong and it needs to be fixed,
>>> Actually both the old and the new one are wrong, but I'd say the new
>>> one is even more wrong.
>>>
>>> _TAG implies that the device can do ordering by tag. And at least the
>>> qemu xen_disk backend doesn't when it advertises this feature.
>>
>> We don't use qemu at all for block storage; qemu (afaik) doesn't
>> have a blkback protocol implementation in it.

Upstream qemu has.

>> I'm guessing xen_disk
>> is to allow kvm to be compatible with Xen disk images?

No, it actually is a blkback implementation.

>> It certainly
>> isn't a reference implementation.

Indeed. I also haven't tested it for ages; not sure whether it still works.

> Disk image formats have nothing to do with the I/O interface. I
> believe Gerd added it for running unmodified Xen guests in qemu,
> but he can explain more of it.

Well, you can boot pv kernels with upstream qemu. qemu must be compiled
with xen support enabled, you need xen underneath and xenstored must
run, but nothing else (xend, tapdisk, ...) is required. qemu will call
xen libraries to build the domain and run the pv kernel. qemu provides
backends for console, framebuffer, network and disk.

There was also a plan to allow xen to be emulated, so you could run pv
kernels in qemu without xen (using tcg or kvm): basically xenner merged
into qemu. That project was never finished, though, and I haven't spent
any time on it for at least a year ...

Hope this clarifies,
Gerd

From: Daniel Stodden on
On Thu, 2010-08-05 at 13:19 -0400, Christoph Hellwig wrote:

> > blkback - the in-kernel backend - does generate barriers when it
> > receives one from the guest. Could you expand on why passing a
> > guest barrier through to the host IO stack would be bad for
> > performance? Isn't this exactly the same as a local writer
> > generating a barrier?
>
> If you pass it on it has the same semantics, but given that you'll
> usually end up having multiple guest disks on a single volume using
> lvm or similar you'll end up draining even more I/O as there is one
> queue for all of them. That way you can easily have one guest starve
> others.

> > >Now where both the old and the new one are buggy is that they don't
> > >include the QUEUE_ORDERED_DO_PREFLUSH and
> > >QUEUE_ORDERED_DO_POSTFLUSH/QUEUE_ORDERED_DO_FUA, which means any
> > >explicit cache flush (aka empty barrier) is silently dropped, making
> > >fsync and co not preserve data integrity.
> >
> > Ah, OK, something specific. What level ends up dropping the empty
> > barrier? Certainly an empty WRITE_BARRIER operation to the backend
> > will cause all prior writes to be durable, which should be enough.
> > Are you saying that there's an extra flag we should be passing to
> > blk_queue_ordered(), or is there some other interface we should be
> > implementing for explicit flushes?
> >
> > Is there a good reference implementation we can use as a model?
>
> Just read Documentation/block/barriers.txt; it's very well described
> there. Even the naming of the various ORDERED constants should
> give enough hints.

That one is read and well understood.

I presently don't see a point in having the frontend perform its own
pre- or post-flushes as long as there's a single queue in the block
layer. But if the kernel drops the plain _TAG mode, there is no problem
with that. Essentially the frontend may drain the queue as much as it
wants. It just won't buy you much if the backend I/O was actually
buffered, other than adding latency to the transport.

The only thing that matters is that the frontend lld gets to see the
actual barrier point; anything else needs to be sorted out next to the
physical layer anyway, so it's better left to the backends.

Not sure if I understand your above comment regarding the flush and fua
bits. Did you mean to indicate that _TAG on the frontend's request_queue
is presently not coming up with the empty barrier request to make
_explicit_ cache flushes happen? That would be something which
definitely needs a workaround in the frontend then. In that case, would
PRE/POSTFLUSH help, to get a call into prepare_flush_fn, which might
insert the tag itself then? It sounds a bit over the top to combine
this with a queue drain on the transport, but I'm rather after
correctness.

Regarding the potential starvation problems when accessing shared
physical storage you mentioned above: Yes, good point, we discussed that
too, although only briefly, and it's a todo which I don't think has been
solved in any present backend. But again, scheduling/merging
drain/flush/fua on shared physical nodes more carefully would be
something better *enforced*. The frontend can't even avoid it.

I wonder if there's a userspace solution for that. Does e.g. fdatasync()
deal with independent invocations other than serializing? Couldn't find
anything which indicates that, but I might not have looked hard enough.
The blktap userspace component presently doesn't buffer, so a _DRAIN is
sufficient. But if it did, then it'd be kinda cool if handled more
carefully. If the kernel does it, all the better.

Thanks,
Daniel


From: Jeremy Fitzhardinge on
On 08/05/2010 10:19 AM, Christoph Hellwig wrote:
>>> I'm pretty sure most if not all of the original Xen backends do the
>>> same. Given that I have tried to implement tagged ordering in qemu
>>> I know that comes down to doing exactly the same draining we already
>>> do in the kernel, just duplicated in the virtual disk backend. That
>>> is for a userspace implementation - for a kernel implementation only
>>> using block devices we could in theory implement it using barriers,
>>> but that would be even more inefficient. And last time I looked
>>> at the in-kernel xen disk backend it didn't do that either.
>> blkback - the in-kernel backend - does generate barriers when it
>> receives one from the guest. Could you expand on why passing a
>> guest barrier through to the host IO stack would be bad for
>> performance? Isn't this exactly the same as a local writer
>> generating a barrier?
> If you pass it on it has the same semantics, but given that you'll
> usually end up having multiple guest disks on a single volume using
> lvm or similar you'll end up draining even more I/O as there is one
> queue for all of them. That way you can easily have one guest starve
> others.

Yes, that's unfortunate. In the normal case the IO streams would
actually be independent so they wouldn't need to be serialized with
respect to each other. But I don't know if that kind of partial-order
dependency is possible or on the cards.

> Note that we're going to get rid of the draining for common cases
> anyway, but that's a separate discussion thread, the "relaxed barriers"
> one.

Does that mean barriers which enforce ordering without flushing?

>> It's true that a number of the Xen backends end up implementing
>> barriers via drain for simplicity's sake, but there's no inherent
>> reason why they couldn't implement a more complete tagged model.
> If they are in Linux/POSIX userspace they can't, because there are
> no system calls to achieve that. And then again there really is
> no need to implement all this in the host anyway - the draining
> is something we enforced on ourselves in Linux without good reason,
> which we're trying to get rid of and which no other OS ever did.

Userspace might not be relying on the kernel to do storage (it might
have its own iscsi implementation or something).

>>> Now where both the old and the new one are buggy is that they don't
>>> include the QUEUE_ORDERED_DO_PREFLUSH and
>>> QUEUE_ORDERED_DO_POSTFLUSH/QUEUE_ORDERED_DO_FUA, which means any
>>> explicit cache flush (aka empty barrier) is silently dropped, making
>>> fsync and co not preserve data integrity.
>> Ah, OK, something specific. What level ends up dropping the empty
>> barrier? Certainly an empty WRITE_BARRIER operation to the backend
>> will cause all prior writes to be durable, which should be enough.
>> Are you saying that there's an extra flag we should be passing to
>> blk_queue_ordered(), or is there some other interface we should be
>> implementing for explicit flushes?
>>
>> Is there a good reference implementation we can use as a model?
> Just read Documentation/block/barriers.txt; it's very well described
> there. Even the naming of the various ORDERED constants should
> give enough hints.

I've gone over it a few times. Since the blkback barriers do both
ordering and flushing, it seems to me that plain _TAG is the right
choice; we don't need _TAG_FLUSH or _TAG_FUA. I still don't understand
what you mean about "explicit cache flush (aka empty barrier) is
silently dropped". Who drops it where? Do you mean the block subsystem
will drop an empty write, even if it has a barrier associated with it,
but if I set PREFLUSH and POSTFLUSH/FUA then those will still come
through? If so, isn't dropping a write with a barrier the problem?

> It's one of the many backends written to the protocol specification,
> so I don't think it's fair to call it irrelevant. And as mentioned before,
> I'd be very surprised if the other backends all get it right. If you
> send me pointers to one or two backends you consider "relevant", I'm
> happy to look at them.

You can see the current state in
git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen.git
xen/dom0/backend/blkback is the actual backend part. It can either
attach directly to a file/device, or go via blktap for usermode processing.

J
From: Christoph Hellwig on
On Thu, Aug 05, 2010 at 02:07:42PM -0700, Daniel Stodden wrote:
> That one is read and well understood.

Given that xen blkfront does not actually implement cache flushes
correctly, that doesn't seem to be the case.

> I presently don't see a point in having the frontend perform its own
> pre or post flushes as long as there's a single queue in the block
> layer. But if the kernel drops the plain _TAG mode, there is no problem
> with that. Essentially the frontend may drain the queue as much as as it
> wants. It just won't buy you much if the backend I/O was actually
> buffered, other than adding latency to the transport.

You do need the _FLUSH or _FUA modes (either with TAG or DRAIN) to get
the block layer to send you pure cache flush requests (aka "empty
barriers"); without them they don't work. The way the current barrier
code is implemented means you will always get manual cache flushes
before the actual barrier requests once you implement that. By using
the _FUA mode you can still do your own post flush.
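
For concreteness, a hedged sketch of the branch a blkfront-style
request mapping function would then need; example_map_request() and
example_map_segments() are made-up names, while REQ_HARDBARRIER,
blk_rq_sectors() and the BLKIF_OP_* values are the existing interfaces:

    #include <linux/blkdev.h>
    #include <xen/interface/io/blkif.h>

    /* hypothetical helper standing in for the real segment mapping */
    static int example_map_segments(struct request *req,
                                    struct blkif_request *ring_req);

    static void example_map_request(struct request *req,
                                    struct blkif_request *ring_req)
    {
            if (req->cmd_flags & REQ_HARDBARRIER)
                    /*
                     * With a _FLUSH/_FUA ordered mode the block layer
                     * also sends empty barriers (blk_rq_sectors() == 0),
                     * i.e. pure cache flushes from fsync and friends.
                     * They must be forwarded as WRITE_BARRIER, not
                     * dropped or completed early.
                     */
                    ring_req->operation = BLKIF_OP_WRITE_BARRIER;
            else
                    ring_req->operation = rq_data_dir(req) ?
                            BLKIF_OP_WRITE : BLKIF_OP_READ;

            ring_req->nr_segments = blk_rq_sectors(req) ?
                    example_map_segments(req, ring_req) : 0;
    }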

I've been through doing all this, and given how hard it is to do a
semi-efficient drain in a backend driver, and given that non-Linux
guests don't even benefit from it, just leaving the draining to the
guest is the easiest solution. If you already have the draining around
and are confident that it gets all corner cases right, you can of
course keep it and use the QUEUE_ORDERED_TAG_FLUSH/QUEUE_ORDERED_TAG_FUA
modes. But from dealing with data integrity issues in virtualized
environments I'm not confident that things will just work, both on the
backend side, especially if image formats are around, and also on the
guest side, given that QUEUE_ORDERED_TAG* has zero real-life testing.

> Not sure if I understand your above comment regarding the flush and fua
> bits. Did you mean to indicate that _TAG on the frontend's request_queue
> is presently not coming up with the empty barrier request to make
> _explicit_ cache flushes happen?

Yes.

> That would be something which
> definitely needs a workaround in the frontend then. In that case, would
> PRE/POSTFLUSH help, to get a call into prepare_flush_fn, which might
> insert the tag itself then? It sounds a bit over the top to combine
> this with a queue drain on the transport, but I'm rather after
> correctness.

prepare_flush_fn is gone now.

> I wonder if there's a userspace solution for that. Does e.g. fdatasync()
> deal with independent invocations other than serializing?

fsync/fdatasync is serialized by i_mutex.

> The blktap userspace component presently doesn't buffer, so a _DRAIN is
> sufficient. But if it did, then it'd be kinda cool if handled more
> carefully. If the kernel does it, all the better.

Doesn't buffer as in using O_SYNC/O_DSYNC or O_DIRECT? You still need
to call fdatasync for the latter, to flush out transactions for block
allocations in sparse / fallocated images and to flush the volatile
write cache of the host disks.
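
To make that concrete, a small userspace sketch (the image path is just
an example, error handling trimmed to the minimum): even with O_DIRECT
only the page cache is bypassed, so fdatasync() is still needed to make
the allocation metadata and the drive cache stable.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            void *buf;
            int fd = open("/var/lib/images/guest.img", O_WRONLY | O_DIRECT);

            if (fd < 0 || posix_memalign(&buf, 4096, 4096)) {
                    perror("setup");
                    return 1;
            }
            memset(buf, 0, 4096);

            /* the data itself goes straight to the device ... */
            if (pwrite(fd, buf, 4096, 0) != 4096)
                    perror("pwrite");

            /*
             * ... but block allocations in a sparse/fallocated image and
             * the disk's volatile write cache only become stable here.
             */
            if (fdatasync(fd))
                    perror("fdatasync");

            free(buf);
            close(fd);
            return 0;
    }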
