From: Andi Kleen on
> Please, as Peter and Boris asked you already, quote a concrete, specific
> example:

It was already in my answer to Peter.

>
> 'Specific event X occurs, kernel wants/needs to do Y. This cannot be done
> via the suggested method due to Z.'
>
> Your generic arguments look wrong (to the extent they are specified) and it
> makes it much easier and faster to address your points if you dont blur them
> by vagaries.

It's one of the fundamental properties of recoverable errors.

Error happens.
Machine check or NMI or other exception happens.
That exception runs on the exception stack
The error is not fatal, but recoverable.
For example you want to kill a process or call hwpoison or do some other
recovery action. These generally have to sleep to do anything
interesting.
You cannot do the sleeping on the exception stack, so you push it to
another context.

Now just because an error is recoverable doesn't mean it's not critical
(I think that was the mistake Boris made). If you don't do something
(like killing or recovery) you could end up in a loop or consume
corrupted data or something else bad.

So the error has to have a fail safe path from detection to handling.

That's quite different from logging or performance counting etc.
where dropping events on overload is normal and expected.

Normally it can be only done by using dedicated resources.

-Andi

--
ak(a)linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Borislav Petkov on
From: Andi Kleen <andi(a)firstfloor.org>
Date: Thu, Jun 24, 2010 at 10:01:43AM -0400

> > Please, as Peter and Boris asked you already, quote a concrete, specific
> > example:
>
> It was already in my answer to Peter.
>
> >
> > 'Specific event X occurs, kernel wants/needs to do Y. This cannot be done
> > via the suggested method due to Z.'
> >
> > Your generic arguments look wrong (to the extent they are specified) and it
> > makes it much easier and faster to address your points if you dont blur them
> > by vagaries.
>
> It's one of the fundamental properties of recoverable errors.
>
> Error happens.
> Machine check or NMI or other exception happens.
> That exception runs on the exception stack
> The error is not fatal, but recoverable.
> For example you want to kill a process or call hwpoison or do some other
> recovery action. These generally have to sleep to do anything
> interesting.
> You cannot do the sleeping on the exception stack, so you push it to
> another context.
>
> Now just because an error is recoverable doesn't mean it's not critical
> (I think that was the mistake Boris made).

It wasn't a mistake - I was simply trying to lure you into giving a more
concrete example so that we all land on the same page and we know what
the heck you/we/all are talking about.

> If you don't do something
> (like killing or recovery) you could end up in a loop or consume
> corrupted data or something else bad.
>
> So the error has to have a fail safe path from detection to handling.

So we are talking about a more involved and "could-sleep" error
recovery.

> That's quite different from logging or performance counting etc.
> where dropping events on overload is normal and expected.

So I went back and reread the whole thread, and correct me if I'm
wrong but the whole run softirq after NMI has one use case for now -
"could-sleep" error handling for MCEs _only_ on x86. So you're changing
a bunch of generic and x86 kernel code just for error handling. Hmm,
that's a kinda big hammer in my book.

A slimmer solution is a much better way to go, IMHO. I think Peter said
something about irq_exit(), which should be just fine.

But AFAICT an arch-specific solution would be even better, e.g.
if you call into your deferred work helper from paranoid_exit in
<arch/x86/kernel/entry_64.S>. I.e, something like

#ifdef CONFIG_X86_MCE
testl $_TIF_NEED_POST_NMI,%ebx
jnz do_post_nmi_work
#endif

Or even slimmer, rewrite the paranoidzeroentry to a MCE-specific variant
which does the added functionality. But that wouldn't be extensible if
other entities want post-NMI work later.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Andi Kleen on
On Thu, Jun 24, 2010 at 05:41:24PM +0200, Borislav Petkov wrote:
> > If you don't do something
> > (like killing or recovery) you could end up in a loop or consume
> > corrupted data or something else bad.
> >
> > So the error has to have a fail safe path from detection to handling.
>
> So we are talking about a more involved and "could-sleep" error
> recovery.

That's one case, there are other too.

>
> > That's quite different from logging or performance counting etc.
> > where dropping events on overload is normal and expected.
>
> So I went back and reread the whole thread, and correct me if I'm
> wrong but the whole run softirq after NMI has one use case for now -
> "could-sleep" error handling for MCEs _only_ on x86. So you're changing

Nope, there are multiple use cases. Today it's background MCE
and possibly perf if it ever decides to share code
with the rest of the kernel instead of wanting to be Bork of Linux.
Future ones would be more MCE errors and also non MCE errors like NMIs.

> a bunch of generic and x86 kernel code just for error handling. Hmm,
> that's a kinda big hammer in my book.

Actually no, it would just make the current code slightly cleaner
and somewhat more general. But for most cases it works without it.
>
> A slimmer solution is a much better way to go, IMHO. I think Peter said
> something about irq_exit(), which should be just fine.

The "slimmer solution" is there, but it has some limitations.
I merely said that softirqs would be useful for solving these limitations
(but are not strictly needed)

Anyways slimmer solution was even originally proposed,
just some of the earlier review proposed softirqs instead.
So Ying posts softirqs and then he gets now flamed for posting
softirqs. Overall there wasn't much consistency in the suggestions,
three different reviewers suggested three incompatible approaches.

Anyways if there are no softirqs that's fine too, the error
handler can probably live with not having that.

> But AFAICT an arch-specific solution would be even better, e.g.
> if you call into your deferred work helper from paranoid_exit in
> <arch/x86/kernel/entry_64.S>. I.e, something like

Yes that helps for part of the error handling (in fact this
has been implemented), but that does not solve the self interrupt
problem which requires delaying until next cli.

-Andi
--
ak(a)linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Huang Ying on
On Thu, 2010-06-24 at 14:35 +0800, Peter Zijlstra wrote:
> Something like this, but filled out with some arch code that does the
> self-ipi and calls irq_work_run() should do.
>
> No need to molest the softirq code, no need for limited vectors of any
> kind.

Now, as far as my understanding goes, hard IRQ based solution is
acceptable for everyone.

Ingo and Andi,

Do you agree?

> Signed-off-by: Peter Zijlstra <a.p.zijlstra(a)chello.nl>
> ---
> include/linux/irq_callback.h | 13 ++++++++
> kernel/irq_callback.c | 66 +++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 79 insertions(+)
>
> Index: linux-2.6/include/linux/irq_callback.h
> ===================================================================
> --- /dev/null
> +++ linux-2.6/include/linux/irq_callback.h
> @@ -0,0 +1,13 @@
> +#ifndef _LINUX_IRQ_CALLBACK_H
> +#define _LINUX_IRQ_CALLBACK_H
> +
> +struct irq_work {
> + struct irq_work *next;
> + void (*func)(struct irq_work *);
> +};

It is better to add "void *data" field in this struct to allow same
function can be used for multiple struct irq_work.

And I think IRQ is the implementation detail here, so irq_work is
probably not a good name. nmi_return_notifier or nmi_callback is better?

> +int irq_work_queue(struct irq_work *entry, void (*func)(struct irq_work *));
> +void irq_work_run(void);
> +void irq_work_sync(struct irq_work *entry);
> +
> +#endif /* _LINUX_IRQ_CALLBACK_H */
> Index: linux-2.6/kernel/irq_callback.c
> ===================================================================
> --- /dev/null
> +++ linux-2.6/kernel/irq_callback.c
> @@ -0,0 +1,66 @@
> +
> +#include <linux/irq_callback.h>
> +
> +#define CALLBACK_TAIL ((struct irq_work *)-1UL)
> +
> +static DEFINE_PER_CPU(struct irq_work *, irq_work_list) = {
> + CALLBACK_TAIL,
> +};
> +
> +int irq_work_queue(struct irq_work *entry, void (*func)(struct irq_work *))
> +{
> + struct irq_work **head;
> +
> + if (cmpxchg(&entry->next, NULL, CALLBACK_TAIL) != NULL)
> + return 0;
> +
> + entry->func = func;
> +
> + head = &get_cpu_var(irq_work_list);
> +
> + do {
> + entry->next = *head;
> + } while (cmpxchg(head, entry->next, entry) != entry->next);
> +
> + if (entry->next == CALLBACK_TAIL)
> + arch_self_ipi();
> +
> + put_cpu_var(irq_work_list);
> + return 1;
> +}
> +
> +void irq_work_run(void)
> +{
> + struct irq_work *list;
> +
> + list = xchg(&__get_cpu_var(irq_work_list), CALLBACK_TAIL);
> + while (list != CALLBACK_TAIL) {
> + struct irq_work *entry = list;
> +
> + list = list->next;
> + entry->func(entry);
> +
> + entry->next = NULL;

entry->next = NULL should be put before entry->func(entry), so that we
will not lose a notification from NMI. And maybe check irq_work_list for
several times to make sure nothing is lost.

> + /*
> + * matches the mb in cmpxchg() in irq_work_queue()
> + */
> + smp_wmb();
> + }
> +}

I don't know why we need smp_wmb() here and smp_rmb() in
irq_work_pending(). The smp_<x>mb() in original perf_pending_xxx code is
not necessary too. Because smp_<x>mb is invoked in wake_up_process() and
__wait_event() already.

> +static int irq_work_pending(struct irq_work *entry)
> +{
> + /*
> + * matches the wmb in irq_work_run
> + */
> + smp_rmb();
> + return entry->next != NULL;
> +}
> +
> +void irq_work_sync(struct irq_work *entry)
> +{
> + WARN_ON_ONCE(irqs_disabled());
> +
> + while (irq_work_pending(entry))
> + cpu_relax();
> +}

If we move entry->next = NULL earlier in irq_work_run(), we need another
flag to signify the entry->func is running here.

Best Regards,
Huang Ying


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Peter Zijlstra on
On Fri, 2010-06-25 at 10:12 +0800, Huang Ying wrote:
>
> It is better to add "void *data" field in this struct to allow same
> function can be used for multiple struct irq_work.

No, simply do:

struct my_foo {
struct irq_work work;
/* my extra data */
}

void my_func(struct irq_work *work)
{
struct my_foo *foo = container_of(work, struct my_foo, work);

/* tada! */
}


> And I think IRQ is the implementation detail here, so irq_work is
> probably not a good name. nmi_return_notifier or nmi_callback is better?

Well, its ran in hard-irq context, so its an irq work. There's nothing
that says it can only be used from NMI context.

> > +void irq_work_run(void)
> > +{
> > + struct irq_work *list;
> > +
> > + list = xchg(&__get_cpu_var(irq_work_list), CALLBACK_TAIL);
> > + while (list != CALLBACK_TAIL) {
> > + struct irq_work *entry = list;
> > +
> > + list = list->next;
> > + entry->func(entry);
> > +
> > + entry->next = NULL;
>
> entry->next = NULL should be put before entry->func(entry), so that we
> will not lose a notification from NMI. And maybe check irq_work_list for
> several times to make sure nothing is lost.

But then _sync() will return before its done executing.

I think clearing after the function is done executing is the only sane
semantics (and yes, I should fix the current perf code).

You can always miss an NMI since it can always happen before the
callback gets done, and allowing another enqueue before the callback is
complete is asking for trouble.

> > + /*
> > + * matches the mb in cmpxchg() in irq_work_queue()
> > + */
> > + smp_wmb();
> > + }
> > +}
>
> I don't know why we need smp_wmb() here and smp_rmb() in
> irq_work_pending(). The smp_<x>mb() in original perf_pending_xxx code is
> not necessary too. Because smp_<x>mb is invoked in wake_up_process() and
> __wait_event() already.

The smp_wmb() wants to be before ->next = NULL; so that all writes are
completed before we release the entry. To that same effect _sync() and
_queue need the (r)mb.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/