From: "Luke Lonergan" on
I'm putting this out there before we publish a fix so that we can discuss
how best to fix it.

Doug and Sherry recently found the source of an important performance issue
with the Postgres shared buffer cache.

The issue is summarized like this: the buffer cache in PGSQL is not "scan
resistant" as advertised. A sequential scan of a table larger than cache
will pollute the buffer cache in almost all circumstances.

Here is performance of GPDB 2.301 (Postgres 8.1.6) on a single X4500
(thumper-3) with 4 cores where "bigtable" is a table 2x the size of RAM and
"memtable" is a table that fits into I/O cache:

With our default setting of shared_buffers (16MB):

Operation        memtable    bigtable
-------------------------------------
SELECT COUNT(*)  1221 MB/s   973 MB/s
VACUUM           1709 MB/s   1206 MB/s

We had observed that VACUUM performs better when run right after a
SELECT. In the example above, VACUUM then read from disk at 1608 MB/s,
compared to its normal rate of 1206 MB/s.

We verified this behavior on Postgres 8.2 as well. The buffer selection
algorithm is choosing buffer pages scattered throughout the buffer cache in
almost all circumstances.

Sherry traced the behavior to the processor repeatedly flushing the L2
cache. Doug found that we weren't using the Postgres buffer cache the way
we expected: we were loading the scanned data from disk into the cache
even though there was no possibility of reusing it. Besides evicting
other, possibly useful pages from the cache, this also invalidates the L2
cache for the remainder of the executor path that uses the data.

To prove that the buffer cache was the source of the problem, we dropped
the shared buffer size to fit within L2 cache (1MB per Opteron core),
setting shared_buffers to 680KB, and this is what we saw:

Operation        memtable    bigtable
-------------------------------------
SELECT COUNT(*)  1320 MB/s   1059 MB/s
VACUUM           3033 MB/s   1597 MB/s

These results do not vary with the order of operations.

Thoughts on the best way to fix the buffer selection algorithm? Ideally,
only one buffer page would be used when the table to be scanned is
(significantly?) larger than the size of the buffer cache.

- Luke



---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

From: Tom Lane on
"Luke Lonergan" <llonergan(a)greenplum.com> writes:
> The issue is summarized like this: the buffer cache in PGSQL is not "scan
> resistant" as advertised.

Sure it is. As near as I can tell, your real complaint is that the
bufmgr doesn't attempt to limit its usage footprint to fit in L2 cache;
which is hardly surprising considering it doesn't know the size of L2
cache. That's not a consideration that we've ever taken into account.

I'm also less than convinced that it'd be helpful for a big seqscan:
won't reading a new disk page into memory via DMA cause that memory to
get flushed from the processor cache anyway? I wonder whether your
numbers are explained by some other consideration than you think.

regards, tom lane


From: "Luke Lonergan" on
When we instrument the page selections made within the buffer cache, they are sequential and span the entire address space of the cache.

With respect to whether it's L2: that's a conclusion based on the experimental results. It's not the TLB, as we also tested for the 512 TLB entries for each L2.

One thing I left out of the previous post: the difference between fast and slow behavior was that in the fast case, the buffer selection alternated between two buffer pages. This was the case only when the preceding statement was a SELECT and the current statement was a VACUUM.

- Luke

Msg is shrt cuz m on ma treo


From: "Luke Lonergan" on
One more thing: the L2 is invalidated when rewritten from the kernel I/O cache, but the pages addressed in L2 retain their values when 'written through', which allows the new data to be reused up the executor chain.

- Luke

Msg is shrt cuz m on ma treo


From: Mark Kirkwood on
Tom Lane wrote:
> "Luke Lonergan" <llonergan(a)greenplum.com> writes:
>> The issue is summarized like this: the buffer cache in PGSQL is not "scan
>> resistant" as advertised.
>
> Sure it is. As near as I can tell, your real complaint is that the
> bufmgr doesn't attempt to limit its usage footprint to fit in L2 cache;
> which is hardly surprising considering it doesn't know the size of L2
> cache. That's not a consideration that we've ever taken into account.
>

To add a little to this - forgetting the scan resistant point for the
moment... cranking down shared_buffers to be smaller than the L2 cache
seems to help *any* sequential scan immensely, even on quite modest HW:

e.g.: PIII 1.26GHz, 512KB L2 cache, 2GB RAM,

SELECT count(*) FROM lineitem (which is about 11GB) performance:

Shared_buffers   Elapsed
--------------   -------
400MB            101 s
128KB             74 s

When I've profiled this activity, I've seen a lot of time spent
searching for/allocating a new buffer for each page being fetched.
Obviously having fewer of them to search through will help, but having
less than an L2 cache's worth of 'em seems to help a whole lot!

Cheers

Mark




