vmalloc performance [Kernel]

Prev: KVM: remove unused code in kvm_coalesced_mmio_init()
Next: Trying to fix ITE-887x parallel/serial driver bugs (including unhandled IRQs)

From: Steven Whitehouse on 16 Apr 2010 04:50

Hi,

On Fri, 2010-04-16 at 16:12 +1000, Nick Piggin wrote:
> On Thu, Apr 15, 2010 at 09:33:08AM +0100, Steven Whitehouse wrote:
> > Hi,
> >
> > On Thu, 2010-04-15 at 01:35 +0900, Minchan Kim wrote:
> > > On Thu, 2010-04-15 at 00:13 +0900, Minchan Kim wrote:
> > > > On Wed, Apr 14, 2010 at 9:49 PM, Steven Whitehouse <swhiteho(a)redhat.com> wrote:
> > > > >> When this module is run on my x86_64, 8 core, 12 Gb machine, then on an
> > > > >> otherwise idle system I get the following results:
> > > > >>
> > > > >> vmalloc took 148798983 us
> > > > >> vmalloc took 151664529 us
> > > > >> vmalloc took 152416398 us
> > > > >> vmalloc took 151837733 us
> > > > >>
> > > > >> After applying the two line patch (see the same bz) which disabled the
> > > > >> delayed removal of the structures, which appears to be intended to
> > > > >> improve performance in the smp case by reducing TLB flushes across cpus,
> > > > >> I get the following results:
> > > > >>
> > > > >> vmalloc took 15363634 us
> > > > >> vmalloc took 15358026 us
> > > > >> vmalloc took 15240955 us
> > > > >> vmalloc took 15402302 us
> > >
> > >
> > > > >>
> > > > >> So thats a speed up of around 10x, which isn't too bad. The question is
> > > > >> whether it is possible to come to a compromise where it is possible to
> > > > >> retain the benefits of the delayed TLB flushing code, but reduce the
> > > > >> overhead for other users. My two line patch basically disables the delay
> > > > >> by forcing a removal on each and every vfree.
> > > > >>
> > > > >> What is the correct way to fix this I wonder?
> > > > >>
> > > > >> Steve.
> > > > >>
> > >
> > > In my case(2 core, mem 2G system), 50300661 vs 11569357.
> > > It improves 4 times.
> > >
> > Looking at the code, it seems that the limit, against which my patch
> > removes a test, scales according to the number of cpu cores. So with
> > more cores, I'd expect the difference to be greater. I have a feeling
> > that the original reporter had a greater number than the 8 of my test
> > machine.
> >
> > > It would result from larger number of lazy_max_pages.
> > > It would prevent many vmap_area freed.
> > > So alloc_vmap_area takes long time to find new vmap_area. (ie, lookup
> > > rbtree)
> > >
> > > How about calling purge_vmap_area_lazy at the middle of loop in
> > > alloc_vmap_area if rbtree lookup were long?
> > >
> > That may be a good solution - I'm happy to test any patches but my worry
> > is that any change here might result in a regression in whatever
> > workload the lazy purge code was originally designed to improve. Is
> > there any way to test that I wonder?
>
> Ah this is interesting. What we could do is have a "free area cache"
> like the user virtual memory allocator has, which basically avoids
> restarting the search from scratch.
>
> Or we could perhaps go one better and do a more sophisticated free space
> allocator.
>
> Bigger systems will indeed get hurt by increasing flushes so I'd prefer
> to avoid that. But that's not a good justification for a slowdown for
> small systems. What good is having cake if you can't also eat it? :)
>
I'm all for cake, particularly if its lemon cake :-)

>
> > > BTW, Steve. Is is real issue or some test?
> > > I doubt such vmalloc bomb workload is real.
> >
> > Well the answer is both yes and no :-) So this is how I came across the
> > issue. I received a report that GFS2 performance had regressed in recent
> > kernels in relation to a test which basically fires lots of requests at
> > it via NFS. The reporter of this problem gave me two bits of
> > information: firstly that by eliminating all readdir calls from the
> > test, the regression is never seen and secondly that oprofile showed
> > that two functions related to vmalloc (rb_next, find_vmap_area,
> > alloc_vmap_area in that order) were taking between them about 60% of the
> > total cpu time.
>
> Thanks for tracking this down. I didn't realize GFS2 used vmalloc
> extensively. How large are typical vmalloc requests here, can you
> tell me? There is a per-cpu virtual memory allocator that is more
> scalable than the global one, and would help avoid these problems
> too.
>
> XFS is using it at the moment, but we are looking for some more
> users of the API so as to get more testing coverage. I was
> considering moving vmalloc over to use it (vm_map_ram).
>
> It's still probably a good idea to improve the global allocator
> regression first, but that might get you even more performance.
>
> Thanks,
> Nick
>

Well, I wouldn't say extensively... its used just once in readdir. Even
then we only use it for larger directories. We use it for two things,
basically as a temporary buffer to record pointers to all the "leaf
blocks" in one hash chain, and also as a temporary buffer to record
pointers to all the directory entries in the same hash chain. The only
reason that its used to keep track of the pointers to the leaf blocks
themselves is simply that it was easier than having two separate
allocations.

The reason that we need a list of pointers to hash entries is so that we
can feed the resulting buffer to sort() in order to put the entries into
hash order. Sorting into hash order isn't really the optimal way to
return the entries in readdir() but due to the slightly odd way in which
directories expand as entries are added, it is the only ordering which
allows us to be certain of not listing entries twice or missing entries
if insertions are made by one process while another process is making
successive calls to readdir().

The per-cpu virtual memory allocator though, sounds like a better fit
for GFS2's needs here, so we should look into using that in future I
think.

As for the size of the allocations, that depends a entirely on the
directory size. It could be anything from a single page to a couple of
dozen or more.

For the test which the original reporter was running, I suspect that it
would be multiple pages, but probably less than 10.

If a readdir spans multiple hash chains, then its possible that there
will be two or more calls to vmalloc/vfree per readdir. However since
readdir calls tend to use buffers based on the inode's optimal I/O size,
its pretty unlikely that this will happen very often, and even then its
only likely to span two hash chains at most.

Steve.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Steven Whitehouse on 16 Apr 2010 10:20

Hi,

On Fri, 2010-04-16 at 01:51 +0900, Minchan Kim wrote:
[snip]
> Thanks for the explanation. It seems to be real issue.
>
> I tested to see effect with flush during rb tree search.
>
> Before I applied your patch, the time is 50300661 us.
> After your patch, 11569357 us.
> After my debug patch, 6104875 us.
>
> I tested it as changing threshold value.
>
> threshold time
> 1000 13892809
> 500 9062110
> 200 6714172
> 100 6104875
> 50 6758316
>
My results show:

threshold time
100000 139309948
1000 13555878
500 10069801
200 7813667
100 18523172
50 18546256

> And perf shows smp_call_function is very low percentage.
>
> In my cases, 100 is best.
>
Looks like 200 for me.

I think you meant to use the non _minmax version of proc_dointvec too?
Although it doesn't make any difference for this basic test.

The original reporter also has 8 cpu cores I've discovered. In his case
divided by 4 cpus where as mine are divided by 2 cpus, but I think that
makes no real difference in this case.

I'll try and get some further test results ready shortly. Many thanks
for all your efforts in tracking this down,

Steve.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Steven Whitehouse on 19 Apr 2010 09:00

Hi,

On Mon, 2010-04-19 at 00:14 +0900, Minchan Kim wrote:
> On Fri, 2010-04-16 at 15:10 +0100, Steven Whitehouse wrote:
> > Hi,
> >
> > On Fri, 2010-04-16 at 01:51 +0900, Minchan Kim wrote:
> > [snip]
> > > Thanks for the explanation. It seems to be real issue.
> > >
> > > I tested to see effect with flush during rb tree search.
> > >
> > > Before I applied your patch, the time is 50300661 us.
> > > After your patch, 11569357 us.
> > > After my debug patch, 6104875 us.
> > >
> > > I tested it as changing threshold value.
> > >
> > > threshold time
> > > 1000 13892809
> > > 500 9062110
> > > 200 6714172
> > > 100 6104875
> > > 50 6758316
> > >
> > My results show:
> >
> > threshold time
> > 100000 139309948
> > 1000 13555878
> > 500 10069801
> > 200 7813667
> > 100 18523172
> > 50 18546256
> >
> > > And perf shows smp_call_function is very low percentage.
> > >
> > > In my cases, 100 is best.
> > >
> > Looks like 200 for me.
> >
> > I think you meant to use the non _minmax version of proc_dointvec too?
>
> Yes. My fault :)
>
> > Although it doesn't make any difference for this basic test.
> >
> > The original reporter also has 8 cpu cores I've discovered. In his case
> > divided by 4 cpus where as mine are divided by 2 cpus, but I think that
> > makes no real difference in this case.
> >
> > I'll try and get some further test results ready shortly. Many thanks
> > for all your efforts in tracking this down,
> >
> > Steve.
>
> I voted "free area cache".
My results with this patch are:

vmalloc took 5419238 us
vmalloc took 5432874 us
vmalloc took 5425568 us
vmalloc took 5423867 us

So thats about a third of the time it took with my original patch, so
very much going in the right direction :-)

I did get a compile warning:
CC mm/vmalloc.o
mm/vmalloc.c: In function '__free_vmap_area':
mm/vmalloc.c:454: warning: unused variable 'prev'

.....harmless, but it should be fixed before the final version,

Steve.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Nick Piggin on 19 Apr 2010 09:40

On Mon, Apr 19, 2010 at 12:14:09AM +0900, Minchan Kim wrote:
> On Fri, 2010-04-16 at 15:10 +0100, Steven Whitehouse wrote:
> Nick, What do you think about "free area cache" approach?

Thanks, yep something like this is what I had in mind. Looks like you
have some really nice speed improvements which is great.

> In this version, I don't consider last hole and backward cache movement which is
> like mmap's cached_hole_size
> That's because I want to flush vmap_areas freed intentionally if we meet vend.
> It makes flush frequent than old but it's trade-off. In addition, vmalloc isn't
> critical compared to mmap about performance. So I think that's enough.
>
> If you don't opposed, I will repost formal patch without code related to debug.

I think I would prefer to be a little smarter about using lower
addresses first. I know the lazy TLB flushing works against this, but
that is an important speed tradeoff, wheras there is not really any
downside to trying hard to allocate low areas first. Keeping virtual
addresses dense helps with locality of reference of page tables, for
one.

So I would like to see:
- invalidating the cache in the case of vstart being decreased.
- Don't unconditionally reset the cache to the last vm area freed,
because you might have a higher area freed after a lower area. Only
reset if the freed area is lower.
- Do keep a cached hole size, so smaller lookups can restart a full
search.

Probably also at this point, moving some of the rbtree code (like the
search code) into functions would manage the alloc_vmap_area complexity.
Maybe do this one first if you're going to write a patchset.

What do you think? Care to have a go? :)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Minchan Kim on 19 Apr 2010 10:20

On Mon, Apr 19, 2010 at 10:38 PM, Nick Piggin <npiggin(a)suse.de> wrote:
> On Mon, Apr 19, 2010 at 12:14:09AM +0900, Minchan Kim wrote:
>> On Fri, 2010-04-16 at 15:10 +0100, Steven Whitehouse wrote:
>> Nick, What do you think about "free area cache" approach?
>
> Thanks, yep something like this is what I had in mind. Looks like you
> have some really nice speed improvements which is great.
>
>
>> In this version, I don't consider last hole and backward cache movement which is
>> like mmap's cached_hole_size
>> That's because I want to flush vmap_areas freed intentionally if we meet vend.
>> It makes flush frequent than old but it's trade-off. In addition, vmalloc isn't
>> critical compared to mmap about performance. So I think that's enough.
>>
>> If you don't opposed, I will repost formal patch without code related to debug.
>
> I think I would prefer to be a little smarter about using lower
> addresses first. I know the lazy TLB flushing works against this, but
> that is an important speed tradeoff, wheras there is not really any
> downside to trying hard to allocate low areas first. Keeping virtual
> addresses dense helps with locality of reference of page tables, for
> one.
>
> So I would like to see:
> - invalidating the cache in the case of vstart being decreased.
> - Don't unconditionally reset the cache to the last vm area freed,
> because you might have a higher area freed after a lower area. Only
> reset if the freed area is lower.
> - Do keep a cached hole size, so smaller lookups can restart a full
> search.

Firstly, I considered it which is used by mmap.
But I thought it might be overkill since vmalloc space isn't large
compared to mmaped addresses.
I should have thought about locality of reference of page tables. ;-)

> Probably also at this point, moving some of the rbtree code (like the
> search code) into functions would manage the alloc_vmap_area complexity.
> Maybe do this one first if you're going to write a patchset.
>
> What do you think? Care to have a go? :)

Good. I will add your requirements to TODO list.
But don't wait me. If you care to have a go, RUN!!!
I am looking forward to seeing your awesome patches. :)

Thanks for careful review, Nick.
--
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

First | Prev | Next | Last
Pages: 1 2 3 4
Prev: KVM: remove unused code in kvm_coalesced_mmio_init()
Next: Trying to fix ITE-887x parallel/serial driver bugs (including unhandled IRQs)