From: Takuya Yoshikawa on
On Mon, 12 Jul 2010 02:49:55 -0700 (PDT)
david(a)lang.hm wrote:

> On Mon, 12 Jul 2010, Takuya Yoshikawa wrote:
>

[...]

> > 1: Pacemaker starts Qemu.
> >
> > 2: Pacemaker checks the state of Qemu via RA.
> > RA checks the state of Qemu using virsh(libvirt).
> > Qemu replies to RA "RUNNING"(normally executing), (*1)
> > and RA returns the state to Pacemaker as it's running correctly.
> >
> > (*1): libvirt defines the following domain states:
> >
> > enum virDomainState {
> >     VIR_DOMAIN_NOSTATE  = 0,  /* no state */
> >     VIR_DOMAIN_RUNNING  = 1,  /* the domain is running */
> >     VIR_DOMAIN_BLOCKED  = 2,  /* the domain is blocked on resource */
> >     VIR_DOMAIN_PAUSED   = 3,  /* the domain is paused by user */
> >     VIR_DOMAIN_SHUTDOWN = 4,  /* the domain is being shut down */
> >     VIR_DOMAIN_SHUTOFF  = 5,  /* the domain is shut off */
> >     VIR_DOMAIN_CRASHED  = 6,  /* the domain is crashed */
> > }
> >
> > We took the most common case RUNNING as an example, but this might be
> > other states except for failover targets: SHUTOFF and CRASHED ?
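[Editorial sketch: the monitor check described above could map `virsh domstate` answers to OCF return codes roughly as below. This is an illustration, not the actual resource agent; the exact state strings printed by virsh are assumptions, and treating both "shut off" and "crashed" as failover targets follows the discussion in this thread.]

```shell
#!/bin/sh
# Sketch of an RA monitor check (illustrative, not the real RA).
# OCF return codes from the resource agent API:
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Map the answer of `virsh domstate <domain>` to an OCF monitor result.
# "shut off" and "crashed" are the failover targets discussed here.
monitor_state() {
    case "$1" in
        running|blocked|paused|"in shutdown")
            return $OCF_SUCCESS ;;
        "shut off"|crashed)
            return $OCF_NOT_RUNNING ;;
        *)
            return $OCF_ERR_GENERIC ;;
    esac
}
```

A real RA would call something like `monitor_state "$(virsh domstate "$DOMAIN")"`.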
> >
> > --- SOME ERROR HAPPENS ---
> >
> > 3: Pacemaker checks the state of Qemu via RA.
> > RA checks the state of Qemu using virsh(libvirt).
> > Qemu replies to RA "SHUTOFF", (*2)
>
> why would it return 'shutoff' if an error happened instead of 'crashed'?


Yes, it would be 'crashed'.

But I think 'shutoff' may also be returned: it depends on the type of the
error and how KVM/qemu handles it.

I have in mind not only hardware errors but also virtualization-specific
errors such as emulation errors.


>
> > and RA returns the state to Pacemaker as it's already stopped.
> >
> > (*2): Currently we check for the "shut off" answer from the domstate command.
> > Yes, we should handle both SHUTOFF and CRASHED if possible.
> >
> > 4: Pacemaker finally tries to confirm if it can safely start failover by
> > sending stop command. After killing Qemu, RA replies to Pacemaker
> > "OK" so that Pacemaker can start failover.
> >
> > Problems: We lose debuggable information about the VM, such as the
> > contents of guest memory.
>
> the OCF interface has start, stop, status (running or not) or an error
> (plus API info)
>
> what I would do in this case is have the script notice that it's in
> crashed status and return an error if it's told to start it. This will
> cause pacemaker to start the service on another system.


I see.
So the key point is how to check the target status, 'crashed' in this case.

From the HA point of view, we need qemu to guarantee that:
- Guest never starts again
- VM never modifies external resources

But I'm not sure whether qemu currently guarantees such conditions in a
generic manner.



Generally, I agree that for failover we should always start the guest on
another node. But is there any benefit if we can start the guest on the
same node?


>
> if it's told to stop it, do whatever you can to save state, but definitely
> pause/freeze the instance and return 'stopped'
>
>
>
> no need to define some additional state. As far as pacemaker is concerned
> it's safe as long as there is no chance of it changing the state of any
> shared resources that the other system would use, so simply pausing the
> instance will make it safe. It will be interesting when someone wants to
> investigate what's going on inside the instance (you need to have it be
> functional, but not able to use the network or any shared
> drives/filesystems), but I don't believe that you can get that right in a
> generic manner, the details of what will cause grief and what won't will
> vary from site to site.
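[Editorial sketch: the "stop by pausing" idea above could look roughly like this in an RA stop action. The helper name and the use of `virsh suspend` are illustrative assumptions, not the real agent.]

```shell
#!/bin/sh
OCF_SUCCESS=0

# Illustrative stop action: instead of killing a failed domain, suspend it
# so it can no longer touch shared resources, then report success so that
# Pacemaker treats it as stopped and proceeds with failover. Guest memory
# stays available for later inspection.
stop_by_pausing() {
    domain=$1
    state=$(virsh domstate "$domain" 2>/dev/null)
    case "$state" in
        running|blocked)
            virsh suspend "$domain" >/dev/null 2>&1   # freeze, do not destroy
            ;;
    esac
    return $OCF_SUCCESS   # "stopped" as far as Pacemaker is concerned
}
```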


If we cannot say in a generic manner, we usually choose the most conservative
option: memory and ... preservation only.

What concerns us most is whether qemu actually guarantees the conditions we
are discussing in this thread.



>
>
> > B. Our proposal: "introduce a new domain state to indicate failover-safe"
> >
> > Pacemaker...(OCF)....RA...(libvirt)...Qemu
> >    |                 |                 |
> >    |                 |                 |
> > 1: +---- start ----->+---------------->+ state=RUNNING
> >    |                 |                 |
> >    +---- monitor --->+---- domstate -->+
> > 2: |                 |                 |
> >    +<---- "OK" ------+<--- "RUNNING" --+
> >    |                 |                 |
> >    |                 |                 |
> >    |                 |                 * Error: state=FROZEN
> >    |                 |                 |   Qemu releases resources
> >    |                 |                 |   and VM gets frozen. (*3)
> >    +---- monitor --->+---- domstate -->+
> > 3: |                 |                 |
> >    +<-- "STOPPED" ---+<--- "FROZEN" ---+
> >    |                 |                 |
> >    +---- stop ------>+---- domstate -->+
> > 4: |                 |                 |
> >    +<---- "OK" ------+<--- "FROZEN" ---+
> >    |                 |                 |
> >    |                 |                 |
> >
> >
> > 1: Pacemaker starts Qemu.
> >
> > 2: Pacemaker checks the state of Qemu via RA.
> > RA checks the state of Qemu using virsh(libvirt).
> > Qemu replies to RA "RUNNING"(normally executing), (*1)
> > and RA returns the state to Pacemaker as it's running correctly.
> >
> > --- SOME ERROR HAPPENS ---
> >
> > 3: Pacemaker checks the state of Qemu via RA.
> > RA checks the state of Qemu using virsh(libvirt).
> > Qemu replies to RA "FROZEN"(VM stopped in a failover-safe state), (*3)
> > and RA keeps it in mind, then replies to Pacemaker "STOPPED".
> >
> > (*3): this is what we want to introduce as a new state. Failover-safe means
> > that Qemu released the external resources, including some namespaces, to be
> > available from another instance.
>
> it doesn't need to release the resources. It just needs to not be able to
> modify them.
>
> pacemaker on the host won't try to start another instance on the same
> host, it will try to start an instance on another host. so you don't need
> to worry about releasing memory, file locks, etc locally. for remote
> resources you _can't_ release them gracefully if you crash, so your apps
> already need to be able to handle that situation. there's no difference to
> the other instances between a machine that gets powered off via STONITH
> and a virtual system that gets paused.


Can't pacemaker be configured to start another instance on the same host?
Of course, I agree that this may not be valuable in most situations.


Takuya



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: david on
On Tue, 13 Jul 2010, Takuya Yoshikawa wrote:

> On Mon, 12 Jul 2010 02:49:55 -0700 (PDT)
> david(a)lang.hm wrote:
>
>> On Mon, 12 Jul 2010, Takuya Yoshikawa wrote:
>>
>>
>>> and RA returns the state to Pacemaker as it's already stopped.
>>>
>>> (*2): Currently we check for the "shut off" answer from the domstate command.
>>> Yes, we should handle both SHUTOFF and CRASHED if possible.
>>>
>>> 4: Pacemaker finally tries to confirm if it can safely start failover by
>>> sending stop command. After killing Qemu, RA replies to Pacemaker
>>> "OK" so that Pacemaker can start failover.
>>>
>>> Problems: We lose debuggable information about the VM, such as the
>>> contents of guest memory.
>>
>> the OCF interface has start, stop, status (running or not) or an error
>> (plus API info)
>>
>> what I would do in this case is have the script notice that it's in
>> crashed status and return an error if it's told to start it. This will
>> cause pacemaker to start the service on another system.
>
>
> I see.
> So the key point is how to check the target status, 'crashed' in this case.
>
> From the HA point of view, we need qemu to guarantee that:
> - Guest never starts again
> - VM never modifies external resources
>
> But I'm not sure whether qemu currently guarantees such conditions in a
> generic manner.

you don't have to depend on the return from qemu. there are many OCF
scripts that maintain state internally (look at the e-mail script as an
example). if your OCF script thinks that it should be running and it
isn't, mark it as crashed and don't try to start it again until external
actions clear the status (and you can have a boot do so in case you had
an unclean shutdown)
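[Editorial sketch of the state-file pattern described above; the file location and helper names are made up for illustration, not taken from any real RA.]

```shell
#!/bin/sh
# Illustrative internal state tracking for an OCF script: remember that a
# resource crashed and refuse to start it again until something external
# (an admin, or a boot-time cleanup after an unclean shutdown) clears it.
STATEDIR=${STATEDIR:-/var/run/myra}   # hypothetical location

mark_crashed()  { mkdir -p "$STATEDIR" && : > "$STATEDIR/$1.crashed"; }
clear_crashed() { rm -f "$STATEDIR/$1.crashed"; }

start_resource() {
    name=$1
    if [ -e "$STATEDIR/$name.crashed" ]; then
        return 1   # still marked crashed: fail, so Pacemaker fails over
    fi
    # ... actually start the resource here ...
    return 0
}
```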

> Generally, I agree that for failover we should always start the guest on
> another node. But is there any benefit if we can start the guest on the
> same node?

I don't believe that pacemaker supports this concept.

however, if you wanted to you could have the OCF script know that there is
a 'crashed' instance and instead of trying to start it, start a fresh copy.

>
>>
>> if it's told to stop it, do whatever you can to save state, but definitely
>> pause/freeze the instance and return 'stopped'
>>
>>
>>
>> no need to define some additional state. As far as pacemaker is concerned
>> it's safe as long as there is no chance of it changing the state of any
>> shared resources that the other system would use, so simply pausing the
>> instance will make it safe. It will be interesting when someone wants to
>> investigate what's going on inside the instance (you need to have it be
>> functional, but not able to use the network or any shared
>> drives/filesystems), but I don't believe that you can get that right in a
>> generic manner, the details of what will cause grief and what won't will
>> vary from site to site.
>
>
> If we cannot say in a generic manner, we usually choose the most conservative
> option: memory and ... preservation only.
>
> What concerns us most is whether qemu actually guarantees the conditions we
> are discussing in this thread.

I'll admit that I'm not familiar with using qemu/KVM, but vmware, virtualbox
and XEN all have an option to freeze all activity and save the ram to a
disk file for a future restart. the OCF script can trigger such an action
easily.
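[Editorial note: for qemu/KVM under libvirt the analogous operation is `virsh save`, which stops the domain and writes its memory image to a file that `virsh restore` can later resume from. A minimal sketch, with the wrapper name and paths as assumptions:]

```shell
#!/bin/sh
# Illustrative "freeze and save RAM to disk" hook for an OCF script.
# `virsh save` stops the domain and writes its memory image to the given
# file; `virsh restore <file>` can bring it back later.
save_for_debug() {
    domain=$1
    dest=$2
    virsh save "$domain" "$dest"
}
```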

>>> B. Our proposal: "introduce a new domain state to indicate failover-safe"
>>>
>>> Pacemaker...(OCF)....RA...(libvirt)...Qemu
>>>    |                 |                 |
>>>    |                 |                 |
>>> 1: +---- start ----->+---------------->+ state=RUNNING
>>>    |                 |                 |
>>>    +---- monitor --->+---- domstate -->+
>>> 2: |                 |                 |
>>>    +<---- "OK" ------+<--- "RUNNING" --+
>>>    |                 |                 |
>>>    |                 |                 |
>>>    |                 |                 * Error: state=FROZEN
>>>    |                 |                 |   Qemu releases resources
>>>    |                 |                 |   and VM gets frozen. (*3)
>>>    +---- monitor --->+---- domstate -->+
>>> 3: |                 |                 |
>>>    +<-- "STOPPED" ---+<--- "FROZEN" ---+
>>>    |                 |                 |
>>>    +---- stop ------>+---- domstate -->+
>>> 4: |                 |                 |
>>>    +<---- "OK" ------+<--- "FROZEN" ---+
>>>    |                 |                 |
>>>    |                 |                 |
>>>
>>>
>>> 1: Pacemaker starts Qemu.
>>>
>>> 2: Pacemaker checks the state of Qemu via RA.
>>> RA checks the state of Qemu using virsh(libvirt).
>>> Qemu replies to RA "RUNNING"(normally executing), (*1)
>>> and RA returns the state to Pacemaker as it's running correctly.
>>>
>>> --- SOME ERROR HAPPENS ---
>>>
>>> 3: Pacemaker checks the state of Qemu via RA.
>>> RA checks the state of Qemu using virsh(libvirt).
>>> Qemu replies to RA "FROZEN"(VM stopped in a failover-safe state), (*3)
>>> and RA keeps it in mind, then replies to Pacemaker "STOPPED".
>>>
>>> (*3): this is what we want to introduce as a new state. Failover-safe means
>>> that Qemu released the external resources, including some namespaces, to be
>>> available from another instance.
>>
>> it doesn't need to release the resources. It just needs to not be able to
>> modify them.
>>
>> pacemaker on the host won't try to start another instance on the same
>> host, it will try to start an instance on another host. so you don't need
>> to worry about releasing memory, file locks, etc locally. for remote
>> resources you _can't_ release them gracefully if you crash, so your apps
>> already need to be able to handle that situation. there's no difference to
>> the other instances between a machine that gets powered off via STONITH
>> and a virtual system that gets paused.
>
>
> Can't pacemaker be configured to start another instance on the same host?

I don't think so. If you think about it from the pacemaker/heartbeat point
of view (where they don't know anything about virtual servers, they just
see them as applications), there are two choices for handling a failed
service.

1. issue a start command to try and bring it back up (as I note above, the
OCF script could be written to have this start a new copy instead of
restarting the old copy)

2. decide that if applications are crashing there may be something
wrong with the host and migrate services to another server


> Of course, I agree that this may not be valuable in most situations.

a combination of this and the fact that this can be done so easily (and
flexibly) with scripts in the existing tools makes me question the value
of modifying the kernel.

David Lang