From: David Miller on
From: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr(a)intel.com>
Date: Sun, 18 Apr 2010 21:57:41 -0700

> This patch adds a callback function pointer to the irq_desc
> structure, along with a registration function and a read-only
> proc entry for each interrupt.
>
> This affinity_hint handle for each interrupt can be used by
> underlying drivers that need a better mechanism to control
> interrupt affinity. The underlying driver can register a
> callback for the interrupt, which will allow the driver to
> provide the CPU mask for the interrupt to anything that
> requests it. The intent is to extend the userspace daemon,
> irqbalance, to help hint to it a preferred CPU mask to balance
> the interrupt into.
>
> Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr(a)intel.com>

I'll leave it to the IRQ layer experts whether this is
appropriate or not, it doesn't look too bad to me.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Ben Hutchings on
On Thu, 2010-04-22 at 05:11 -0700, Peter P Waskiewicz Jr wrote:
> On Wed, 21 Apr 2010, Ben Hutchings wrote:
>
> > On Tue, 2010-04-20 at 11:01 -0700, Peter P Waskiewicz Jr wrote:
> >> This patch adds a callback function pointer to the irq_desc
> >> structure, along with a registration function and a read-only
> >> proc entry for each interrupt.
> >>
> >> This affinity_hint handle for each interrupt can be used by
> >> underlying drivers that need a better mechanism to control
> >> interrupt affinity. The underlying driver can register a
> >> callback for the interrupt, which will allow the driver to
> >> provide the CPU mask for the interrupt to anything that
> >> requests it. The intent is to extend the userspace daemon,
> >> irqbalance, to help hint to it a preferred CPU mask to balance
> >> the interrupt into.
> >
> > Doesn't it make more sense to have the driver follow affinity decisions
> > made from user-space? I realise that reallocating queues is disruptive
> > and we probably don't want irqbalance to trigger that, but there should
> > be a mechanism for the administrator to trigger it.
>
> The driver here would be assisting userspace (irqbalance) to provide
> better details how the HW is laid out with respect to flows. As it stands
> today, irqbalance is almost guaranteed to move interrups to CPUs that are
> not aligned with where applications are running for network adapters.
> This is very apparent when running at speeds in the 10 Gigabit range, or
> even multiple 1 Gigabit ports running at the same time.

I'm well aware that irqbalance isn't making good decisions at the
moment. The question is whether this will really help irqbalance to do
better.

[...]
> > This just assigns IRQs to the first n CPU threads. Depending on the
> > enumeration order, this might result in assigning an IRQ to each of 2
> > threads on a core while leaving other cores unused!
>
> This ixgbe patch is only meant to be an example of how you could use it.
> I didn't hammer out all the corner cases of interrupt alignment in it yet.
> However, ixgbe is already aligning Tx flows onto the CPU/queue pair the Tx
> occurred (i.e. Tx session from CPU 4 will be queued on Tx queue 4),
[...]

OK, now I remember ixgbe has this odd select_queue() implementation.
But this behaviour can result in reordering whenever a user thread
migrates, and in any case Dave discourages people from setting
select_queue(). So I see that these changes would be useful for ixgbe
(together with an update to irqbalance), but they don't seem to fit the
general direction of multiqueue networking on Linux.

(Actually, the hints seem to be incomplete. If there are more than 16
CPU threads then multiple CPU threads can map to the same queues, but it
looks like you only include the first in the queue's hint.)

An alternate approach is to use the RX queue index to drive TX queue
selection. I posted a patch to do that earlier this week. However I
haven't yet had a chance to try that on a suitably large system.

Ben.

--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Ben Hutchings on
On Tue, 2010-04-20 at 11:01 -0700, Peter P Waskiewicz Jr wrote:
> This patch adds a callback function pointer to the irq_desc
> structure, along with a registration function and a read-only
> proc entry for each interrupt.
>
> This affinity_hint handle for each interrupt can be used by
> underlying drivers that need a better mechanism to control
> interrupt affinity. The underlying driver can register a
> callback for the interrupt, which will allow the driver to
> provide the CPU mask for the interrupt to anything that
> requests it. The intent is to extend the userspace daemon,
> irqbalance, to help hint to it a preferred CPU mask to balance
> the interrupt into.

Doesn't it make more sense to have the driver follow affinity decisions
made from user-space? I realise that reallocating queues is disruptive
and we probably don't want irqbalance to trigger that, but there should
be a mechanism for the administrator to trigger it.

Looking at your patch for ixgbe:

[...]
> diff --git a/drivers/net/ixgbe/ixgbe_main.c
> b/drivers/net/ixgbe/ixgbe_main.c
> index 1b1419c..3e00d41 100644
> --- a/drivers/net/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ixgbe/ixgbe_main.c
[...]
> @@ -1083,6 +1113,16 @@ static void ixgbe_configure_msix(struct ixgbe_adapter *adapter)
> q_vector->eitr = adapter->rx_eitr_param;
>
> ixgbe_write_eitr(q_vector);
> +
> + /*
> + * Allocate the affinity_hint cpumask, assign the mask for
> + * this vector, and register our affinity_hint callback.
> + */
> + alloc_cpumask_var(&q_vector->affinity_mask, GFP_KERNEL);
> + cpumask_set_cpu(v_idx, q_vector->affinity_mask);
> + irq_register_affinity_hint(adapter->msix_entries[v_idx].vector,
> + adapter,
> + &ixgbe_irq_affinity_callback);
> }
>
> if (adapter->hw.mac.type == ixgbe_mac_82598EB)
[...]

This just assigns IRQs to the first n CPU threads. Depending on the
enumeration order, this might result in assigning an IRQ to each of 2
threads on a core while leaving other cores unused!

irqbalance can already find the various IRQs associated with a single
net device by looking at the handler names. So it can do at least as
well as this without such a hint. Unless drivers have *useful* hints to
give, I don't see the point in adding this mechanism.

Ben.

--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Thomas Gleixner on
On Sun, 18 Apr 2010, Peter P Waskiewicz Jr wrote:
> +/**
> + * struct irqaffinityhint - per interrupt affinity helper
> + * @callback: device driver callback function
> + * @dev: reference for the affected device
> + * @irq: interrupt number
> + */
> +struct irqaffinityhint {
> + irq_affinity_hint_t callback;
> + void *dev;
> + int irq;
> +};

Why do you need that extra data structure ? The device and the irq
number are known, so all you need is the callback itself. So no need
for allocating memory ....

> +static ssize_t irq_affinity_hint_proc_write(struct file *file,
> + const char __user *buffer, size_t count, loff_t *pos)
> +{
> + /* affinity_hint is read-only from proc */
> + return -EOPNOTSUPP;
> +}
> +

Why do you want a write function when the file is read only ?

Thanks,

tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Peter P Waskiewicz Jr on
On Tue, 27 Apr 2010, Thomas Gleixner wrote:

> On Sun, 18 Apr 2010, Peter P Waskiewicz Jr wrote:
>> +/**
>> + * struct irqaffinityhint - per interrupt affinity helper
>> + * @callback: device driver callback function
>> + * @dev: reference for the affected device
>> + * @irq: interrupt number
>> + */
>> +struct irqaffinityhint {
>> + irq_affinity_hint_t callback;
>> + void *dev;
>> + int irq;
>> +};
>
> Why do you need that extra data structure ? The device and the irq
> number are known, so all you need is the callback itself. So no need
> for allocating memory ....

When I register the function callback with the interrupt layer, I need to
know what device structures to reference back in the driver. In other
words, if I call into an underlying driver with just an interrupt number,
then I have no way at getting at the dev structures (netdevice for me,
plus my private adapter structures), unless I declare them globally
(yuck).

I had a different approach before this one where I assumed the device from
the irq handler callback was safe to use for the device in this new
callback. I didn't feel really great about that, since it's an implicit
assumption that could cause things to go sideways really quickly.

Let me know what you think either way. I'm certainly willing to make a
change, I just don't know at this point what's the safest approach from
what I currently have.

>
>> +static ssize_t irq_affinity_hint_proc_write(struct file *file,
>> + const char __user *buffer, size_t count, loff_t *pos)
>> +{
>> + /* affinity_hint is read-only from proc */
>> + return -EOPNOTSUPP;
>> +}
>> +
>
> Why do you want a write function when the file is read only ?

It's leftover paranoia. I put it in early on, then changed the mode
later. I can remove this function. I'll re-send something once we agree
on how the code in your first comment should look.

Thanks Thomas!

-PJ
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/