From: Linus Torvalds
On Sun, Jul 11, 2010 at 7:19 PM, Rusty Russell <rusty(a)rustcorp.com.au> wrote:
>
> PS. When did we start top-commenting and quoting the whole patch?

Sorry, my bad. I've been using the gmail web interface for a while now
(that's how I tracked my email on my cellphone while I was on
vacation, which helped a lot when I got back). I like many of the
features, but the email posting takes some getting used to. Partly
because gmail seems to actively encourage some bad behavior (like top
posting and obviously not having working tabs), but mostly because I'm
just a klutz.

(The big upside of the gmail web interface being that searching works
across folders. So I think I'll stick with it despite the downsides.
And I'll try to be less klutzy)

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Eric Dumazet
On Sunday, 11 July 2010 at 18:18 -0700, Linus Torvalds wrote:
> On Sun, Jul 11, 2010 at 3:03 PM, Steven Rostedt <rostedt(a)goodmis.org> wrote:
> >
> > I have seen some hits with cli-sti. I was considering swapping all
> > preempt_disable() with local_irq_save() in ftrace, but hackbench showed
> > a 30% performance degradation when I did that.
>
> Yeah, but in that case you almost certainly keep the per-cpu cacheline
> hot in the D$ L1 cache, and the stack tracer is presumably also not
> taking any extra I$ L1 misses. So you're not seeing any of the
> downsides. The upside of plain cli/sti is that they're small, and have
> no D$ footprint.
>
> And it's possible that the interrupt flag - at least if/when
> positioned right - wouldn't have any additional D$ footprint under
> normal load either. IOW, if there is an existing per-cpu cacheline
> that is effectively always already dirty and in the cache, the flag
> could live there essentially for free.
> But that's something that really needs macro-benchmarks - exactly
> because microbenchmarks don't show those effects since they are always
> basically hot-cache.
>

Some kernel devs incorrectly assume they own the CPU caches...

This discussion reminds me that I noticed a performance problem, under a
network load, with the placement of cpu_online_bits and cpu_online_mask
in separate sections (and thus separate cache lines):

static DECLARE_BITMAP(cpu_online_bits, CONFIG_NR_CPUS) __read_mostly;
const struct cpumask *const cpu_online_mask = to_cpumask(cpu_online_bits);

Two changes are possible:

1) Get rid of cpu_online_mask (it's a const pointer to a known
target). I can't actually see why it's needed...

2) Don't const-qualify the pointer itself; mark it __read_mostly
instead, so cpu_online_mask lands in the same section.

Rusty, could you comment on one or the other approach before I submit
a patch?

(Of course, possible/present/active have the same problem)


From: Tejun Heo
Hello,

On 07/11/2010 10:29 PM, Linus Torvalds wrote:
> You need to show some real improvement on real hardware.
>
> I can't really care less about qemu behavior. If the emulator is bad
> at emulating cli/sti, that's a qemu problem.

Yeah, qemu is just nice for developing things like this, and I
mentioned it mainly to point out how immature the patch is: it behaves
correctly only there so far, probably because qemu doesn't use one of
the fancier idle states.

> But if it actually helps on real hardware (which is possible), that
> would be interesting. However, quite frankly, I doubt you can really
> measure it on any bigger load. cli-sti do not tend to be all that
> expensive any more (on a P4 it's probably noticeable, I doubt it shows
> up very much anywhere else).

I'm not very convinced either. Nehalems are said to be able to do a
cli-sti sequence every 13 cycles or so, which sounds pretty good, and
managing it asynchronously might not buy anything. But what they
quoted was cli-sti bandwidth, probably meaning that if you do cli-sti's
in succession or in a tight loop, each iteration takes 13 cycles. So
there could still be a cost related to instruction scheduling.

Another thing is the cost difference of cli/sti's on different
archs/machines. This is the reason Rusty suggested it in the first
place, I think (please correct me if I'm wrong). It means that we're
forced to assume that cli/sti's are relatively expensive when writing
generic code. This, for example, affects how the generic percpu
access operations are defined. Their semantics are defined as
preemption-safe but not IRQ-safe, i.e. an IRQ handler may run in the
middle of percpu_add(), although on many archs, including x86, these
operations are atomic w.r.t. IRQs. If the cost of the interrupt
masking operation can be brought down to that of preemption masking
across the major architectures, those restrictions can be removed.

x86 might not be the architecture that would benefit the most from
such a change, but it's the most widely tested architecture, so I
think it would be better to have it applied on x86 first, if it helps
a bit without being too invasive, before doing this on multiple
platforms. (Plus, it's the architecture I'm most familiar with :-)

It only took me a couple of days to get it working and the changes are
pretty localized, so I think it's worthwhile to see whether it
actually helps anything on x86. I'm thinking about doing raw IOs on
SSDs, which isn't too unrealistic and is heavy on both IRQ masking and
IRQ handling, although the actual hardware access cost might just
drown out any difference; workloads that are heavy on memory
allocation and such might be a better fit. If you have any better
ideas for testing, please let me know.

Thanks.

--
tejun
From: Tejun Heo
Hello, Rusty.

On 07/12/2010 04:19 AM, Rusty Russell wrote:
> Also, is it worth trying to implement this soft disable generically?
> I know at least ppc64 does it today...
>
> (Y'know, because your initial patch wasn't ambitious enough...)

We can evolve things so that the common parts are factored into
generic code, but with most of the important parts being heavily
dependent on the specific architecture, I don't think there will be
too much to share (calling the IRQ handler on a separate stack if
necessary, generic IRQ masking flag management maybe merged into the
preemption flag, and so on).

Thanks.

--
tejun
From: Rusty Russell
On Mon, 12 Jul 2010 02:41:33 pm Eric Dumazet wrote:
> Two changes are possible:
>
> 1) Get rid of cpu_online_mask (it's a const pointer to a known
> target). I can't actually see why it's needed...

There was a reason, but I'm trying to remember it.

ISTR, it was to catch direct frobbing of the masks. That was important:
we were converting code everywhere to hand around cpumasks by pointer
rather than by copy. But that semantic change meant that a function
which previously harmlessly frobbed a copy would now frob (say)
cpu_online_mask.

However, ((const struct cpumask *)cpu_online_bits) would work for that
too. (Well, renaming cpu_online_bits to __cpu_online_bits would be
better, since it would then be non-static.)

Ideally, those masks too would be dynamically allocated. But the boot
changes required for that are best left until someone really needs > 64k
CPUs.

> 2) Don't const-qualify the pointer itself; mark it __read_mostly
> instead, so cpu_online_mask lands in the same section.
>
> Rusty, could you comment on one or the other approach before I submit
> a patch?
>
> (Of course, possible/present/active have the same problem)

Yep. Might want to do a patch to get rid of the remaining 100
references to cpu_online_map (etc.) as well, if you're feeling
enthusiastic :)

Thanks!
Rusty.