From: Heikki Linnakangas on
I've been looking at the way we do vacuums.

The fundamental performance issue is that a vacuum generates
nheapblocks+nindexblocks+ndirtyblocks I/Os. Vacuum cost delay helps to
spread the cost over time, like paying in installments, but the total is
the same. In an I/O-bound system, the extra I/O directly translates to
lower throughput.

Therefore, we need to do less I/O. The dead space map helps by allowing us
to skip blocks that don't need vacuuming, reducing the number of I/Os to
2*ndirtyblocks+nindexblocks. That's great, but it doesn't help us if the
dead tuples are spread uniformly across the table.
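
To put rough numbers on that (the figures here are purely illustrative):
for a table with 1,000,000 heap blocks, 300,000 index blocks and 50,000
blocks containing dead tuples, a plain vacuum costs about
1,000,000 + 300,000 + 50,000 = 1,350,000 I/Os, while the dead-space-map
approach costs about 2*50,000 + 300,000 = 400,000. The gap shrinks back
towards nothing as ndirtyblocks approaches nheapblocks, i.e. when the dead
tuples are spread uniformly.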

If we could piggyback the vacuum I/Os on the I/Os that we're doing
anyway, vacuum ideally wouldn't have to issue any I/O of its own. I've
tried to figure out a way to do that.

Vacuum is done in 3 phases:

1. Scan heap
2. Vacuum index
3. Vacuum heap

Instead of doing a sequential scan, we could perform the 1st phase by
watching the buffer pool, scanning blocks for dead tuples when they're
in memory and keeping track of which pages we've seen. When all pages
have been seen, the tid list is sorted and 1st phase is done.
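
To make the bookkeeping concrete, here's a minimal sketch of what the
phase-1 side might look like. None of these structures or functions exist
in PostgreSQL today: PiggybackVacState, piggyback_scan_page,
collect_dead_tids and finish_phase_one are all invented names, and only
the Bitmapset and buffer-manager calls are real APIs.

#include "postgres.h"
#include "nodes/bitmapset.h"
#include "storage/bufmgr.h"
#include "storage/itemptr.h"

/* Hypothetical per-vacuum state for the piggyback scheme. */
typedef struct PiggybackVacState
{
    Bitmapset  *seen_pages;      /* heap blocks we have already inspected */
    ItemPointerData *dead_tids;  /* collected dead tuple TIDs */
    int         num_dead_tids;
    int         max_dead_tids;
    BlockNumber nblocks;         /* total blocks in the relation */
} PiggybackVacState;

/*
 * Hypothetical hook: called whenever a page of the target relation is
 * available in the buffer pool, instead of reading pages ourselves.
 */
static void
piggyback_scan_page(PiggybackVacState *state, Buffer buf)
{
    BlockNumber blkno = BufferGetBlockNumber(buf);

    if (bms_is_member(blkno, state->seen_pages))
        return;                  /* this block has already been inspected */

    /* Collect dead tuple TIDs from the page (hypothetical helper). */
    collect_dead_tids(state, buf);

    state->seen_pages = bms_add_member(state->seen_pages, blkno);

    /*
     * Phase 1 is complete once every block has been seen at least once;
     * the TID list is then sorted and the index pass (phase 2) can start.
     */
    if (bms_num_members(state->seen_pages) == (int) state->nblocks)
        finish_phase_one(state); /* hypothetical: sort TIDs, go to phase 2 */
}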

In theory, the index vacuum could also be done that way, but let's
assume for now that indexes would be scanned like they are currently.

The 3rd phase can be performed similarly to the 1st phase. Whenever a
page enters the buffer pool, we check the tid list and remove any
matching tuples from the page. When the list is empty, vacuum is complete.
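
Continuing the same hypothetical sketch for the 3rd phase; again,
remove_dead_tids_on_block and finish_vacuum are invented helpers, not
existing functions:

/*
 * Hypothetical counterpart for phase 3: when a page of the target relation
 * shows up in the buffer pool anyway, remove any of its tuples that appear
 * in the sorted dead-TID list (the indexes were already cleaned in phase 2).
 */
static void
piggyback_vacuum_page(PiggybackVacState *state, Buffer buf)
{
    BlockNumber blkno = BufferGetBlockNumber(buf);
    int         nremoved;

    /* Remove the TIDs that fall on this block from the page itself. */
    nremoved = remove_dead_tids_on_block(state, buf, blkno); /* hypothetical */
    state->num_dead_tids -= nremoved;

    /* The vacuum is complete once the TID list has been drained. */
    if (state->num_dead_tids == 0)
        finish_vacuum(state);                                /* hypothetical */
}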

Of course, there are some issues with the design as described above. For
example, the vacuum might take a long time if there are cold spots in the
table. In fact, a block full of dead tuples might never be visited again.

A variation of the scheme would be to keep scanning pages that are in
cache until the tid list reaches a predefined size, instead of keeping
track of which pages have already been seen. That would deal better with
tables that have hot and cold spots, but it couldn't advance relfrozenxid
because there would be no guarantee that all pages are visited. Also, we
could start the 1st phase of the next vacuum while we're still in the 3rd
phase of the previous one.

Also, after we've seen 95% of the pages or a timeout expires, we could
fetch the rest of them with random I/O to let the vacuum finish.
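
One possible way to combine the pieces above -- opportunistic scanning, a
bounded tid list, and the 95%/timeout fallback -- is sketched below.
Everything here is invented (scan_cached_pages, fraction_seen,
fetch_unseen_pages, elapsed_ms, vacuum_indexes_and_heap, the
piggyback_timeout_ms setting); only GetCurrentTimestamp and pg_usleep are
existing calls, and the thresholds are just the tunables mentioned above.

#include "miscadmin.h"           /* pg_usleep */
#include "utils/timestamp.h"     /* TimestampTz, GetCurrentTimestamp */

/* Hypothetical control loop for one round of piggyback vacuuming. */
static void
piggyback_vacuum_round(PiggybackVacState *state)
{
    TimestampTz start = GetCurrentTimestamp();

    for (;;)
    {
        /* Phase 1: scan whatever pages of the relation happen to be in
         * the buffer pool right now (hypothetical helper). */
        scan_cached_pages(state);

        /* Variation: once the tid list reaches its predefined size, stop
         * collecting and process what we have. */
        if (state->num_dead_tids >= state->max_dead_tids)
            break;

        /* Fallback: after we've seen ~95% of the pages, or a timeout has
         * expired, fetch the remaining pages with explicit (random) I/O
         * so this round can finish. */
        if (fraction_seen(state) > 0.95 ||
            elapsed_ms(start) > piggyback_timeout_ms)  /* hypothetical */
        {
            fetch_unseen_pages(state);
            break;
        }

        pg_usleep(100000L);      /* wait a bit before polling the pool again */
    }

    vacuum_indexes_and_heap(state);  /* phases 2 and 3 (hypothetical) */
}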

I'm not sure how exactly this would be implemented. Perhaps bgwriter or
autovacuum would do it, or a new background process. Presumably the
process would need access to relcache.

One issue is that if we're trying to vacuum every table simultaneously
this way, we'll need more memory overall for the tid lists. I'm hoping
there's a way to implement this without keeping the tid lists in shared
memory; that would make the memory management a nightmare. Also, we'd
need changes to the bufmgr API to support this.
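
As a straw man for the bufmgr API change, it might be enough to let a
background process register a callback that fires when a page of an
interesting relation is resident, so it never has to issue reads of its
own. Nothing like this exists today; both names below are invented.

/*
 * Straw-man addition to the bufmgr API (purely hypothetical).  A piggyback
 * vacuum process would register one callback per table it is working on,
 * keeping its tid lists in backend-local memory rather than shared memory.
 */
#include "storage/bufmgr.h"
#include "storage/relfilenode.h"

typedef void (*BufferVisitCallback) (RelFileNode rnode,
                                     BlockNumber blkno,
                                     Buffer buf,
                                     void *arg);

extern void RegisterBufferVisitCallback(RelFileNode rnode,
                                        BufferVisitCallback callback,
                                        void *arg);

Whether the callback runs in the backend that touched the page, or merely
queues (rnode, blkno) pairs for the vacuum process to inspect while the
page is still likely to be resident, is an open question; the latter
avoids doing extra work in foreground paths.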

This would work nicely with the DSM. The list of pages that need to be
visited in phase 1 could be initialized from the DSM, largely avoiding
the problem with cold spots.

Any thoughts before I start experimenting?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Simon Riggs" on
On Mon, 2007-01-22 at 13:41 +0000, Heikki Linnakangas wrote:
> Any thoughts before I start experimenting?

Probably only to detail the various use cases we are discussing.

My thoughts on various use cases are:

- small table with frequent update/delete, heap and indexes all/mostly
cached
e.g. counter tables, DBT2 (TPC-C): District/Warehouse, pgbench:
Branches/Tellers
Current VACUUM works well in this situation, since the only I/O incurred
is the WAL written for the VACUUM. VACUUM is very cheap even if the table
isn't in cache, because of sequential I/O. Keeping track of whether there
are hot spots in these tables seems like a waste of cycles and could
potentially introduce contention and hence reduce performance. These
tables need to be VACUUMed very frequently, even while other VACUUMs are
running.
My current view: we just need multiple concurrent autovacuum processes.

- large table with severe hotspots
e.g. DBT2: NewOrder, larger queue-style tables
The hotspots are likely to be in cache and the not-so-hot spots might or
might not be in cache, but we don't care either way. The DSM concept works
well for this case, since we are able to avoid lots of I/O by appropriate
book-keeping. It works well for removing rows after a file-scan DELETE, as
well as for DELETE or UPDATE hot spots.
My current view: DSM would be great for this.

- large table with few hotspots
e.g. DBT2: Stock, pgbench: Accounts, most Customer tables
Current VACUUM works very badly in this case, since updates are sparsely
distributed across the table. DSM wouldn't help either, unless we
differentiate between few/many updates to a block.
My current view: the piggyback concept seems on the right track, but it
clearly needs further thought.

Currently we have only one technique for garbage collection, plus one
process to perform it. We need multiple techniques, executed by multiple
processes when required, plus some way of automatically selecting which
technique is appropriate for each use case. Yes, automatic :-)

DSM and this piggyback idea need not be thought of as competing
techniques.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com




From: "Jim C. Nasby" on
On Mon, Jan 22, 2007 at 02:51:47PM +0000, Heikki Linnakangas wrote:
> I've been looking at the way we do vacuums.
>
> The fundamental performance issue is that a vacuum generates
> nheapblocks+nindexblocks+ndirtyblocks I/Os. Vacuum cost delay helps to
> spread the cost over time, like paying in installments, but the total is
> the same. In an I/O-bound system, the extra I/O directly translates to
> lower throughput.
>
> Therefore, we need to do less I/O. The dead space map helps by allowing us
> to skip blocks that don't need vacuuming, reducing the number of I/Os to
> 2*ndirtyblocks+nindexblocks. That's great, but it doesn't help us if the
> dead tuples are spread uniformly across the table.
>
> If we could piggyback the vacuum I/Os on the I/Os that we're doing
> anyway, vacuum ideally wouldn't have to issue any I/O of its own. I've
> tried to figure out a way to do that.
>
> Vacuum is done in 3 phases:
>
> 1. Scan heap
> 2. Vacuum index
> 3. Vacuum heap



> Instead of doing a sequential scan, we could perform the 1st phase by
> watching the buffer pool, scanning blocks for dead tuples when they're
> in memory and keeping track of which pages we've seen. When all pages
> have been seen, the tid list is sorted and 1st phase is done.
>
> In theory, the index vacuum could also be done that way, but let's
> assume for now that indexes would be scanned like they are currently.
>
> The 3rd phase can be performed similarly to the 1st phase. Whenever a
> page enters the buffer pool, we check the tid list and remove any
> matching tuples from the page. When the list is empty, vacuum is complete.

Is there any real reason to demarcate the start and end of a vacuum? Why
not just make it a continuous process? One possibility is to keep a list
of TIDs for each phase, though that could prove tricky with multiple
indexes.

> A variation of the scheme would be to keep scanning pages that are in
> cache until the tid list reaches a predefined size, instead of keeping
> track of which pages have already been seen. That would deal better with
> tables that have hot and cold spots, but it couldn't advance relfrozenxid
> because there would be no guarantee that all pages are visited. Also, we
> could start the 1st phase of the next vacuum while we're still in the 3rd
> phase of the previous one.

What if we tracked freeze status on a per-page basis? Perhaps track the
minimum XID present on each page. That would allow us to ensure that we
freeze pages that are approaching XID wraparound.
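
A minimal sketch of that bookkeeping, under the assumption that we keep an
in-memory array of per-block minimum XIDs alongside the DSM. The
PageFreezeMap structure and function names are invented; only the
TransactionId macros are real.

#include "access/transam.h"
#include "storage/block.h"

/* Hypothetical per-page freeze map: the oldest XID present on each block. */
typedef struct PageFreezeMap
{
    BlockNumber     nblocks;
    TransactionId  *min_xid;  /* min_xid[blkno]; InvalidTransactionId if unknown */
} PageFreezeMap;

/* Update the map whenever we look at a block, e.g. from the same hook
 * that collects dead TIDs. */
static void
freeze_map_note_xid(PageFreezeMap *map, BlockNumber blkno, TransactionId xid)
{
    if (!TransactionIdIsNormal(xid))
        return;
    if (!TransactionIdIsValid(map->min_xid[blkno]) ||
        TransactionIdPrecedes(xid, map->min_xid[blkno]))
        map->min_xid[blkno] = xid;
}

/* A block must be visited (and frozen) before relfrozenxid can be advanced
 * past freeze_limit if its minimum XID is unknown or older than the limit. */
static bool
freeze_map_needs_freeze(PageFreezeMap *map, BlockNumber blkno,
                        TransactionId freeze_limit)
{
    return !TransactionIdIsValid(map->min_xid[blkno]) ||
           TransactionIdPrecedes(map->min_xid[blkno], freeze_limit);
}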
--
Jim Nasby jim(a)nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: ITAGAKI Takahiro on

Heikki Linnakangas <heikki(a)enterprisedb.com> wrote:

> Vacuum is done in 3 phases:
> 1. Scan heap
> 2. Vacuum index
> 3. Vacuum heap

> A variation of the scheme would be to keep scanning pages that are in
> cache until the tid list reaches a predefined size, instead of keeping
> track of which pages have already been seen. That would deal better with
> tables that have hot and cold spots, but it couldn't advance relfrozenxid
> because there would be no guarantee that all pages are visited. Also, we
> could start the 1st phase of the next vacuum while we're still in the 3rd
> phase of the previous one.

ISTM this is another DSM, one with tuple-level accuracy rather than
page-level. One of the benefits is that we can skip the 1st phase of
vacuum: we would already have a TID list of dead tuples at the start of
the vacuum, so we could start from the 2nd phase.

I have another idea for the use of TID lists -- store the TIDs after the
1st or 2nd phase, and exit the vacuum. At the next vacuum, we would do
both the previous vacuum's 3rd phase and the new 1st phase at once, so
that the I/O is reduced from (2*ndirtyblocks + nindexblocks) to
(ndirtyblocks + nindexblocks) on average. We already use a similar method
in vacuuming btree indexes to collect recyclable empty pages.
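
A rough outline of how the combined pass might look, assuming the saved
TID list and a page-level DSM are available. DeadTidList and all the
helper functions here are invented; only the buffer-manager calls are real.

#include "storage/bufmgr.h"
#include "utils/rel.h"

/*
 * Hypothetical combined pass: finish the previous cycle's heap vacuum
 * (its phase 3) while collecting dead TIDs for the next cycle (its
 * phase 1) in a single scan, visiting only blocks we actually need.
 */
static void
vacuum_combined_pass(Relation rel,
                     DeadTidList *previous, /* saved after last cycle's index pass */
                     DeadTidList *next)     /* to be saved for the next cycle */
{
    BlockNumber blkno;
    BlockNumber nblocks = RelationGetNumberOfBlocks(rel);

    for (blkno = 0; blkno < nblocks; blkno++)
    {
        Buffer  buf;

        /* Skip blocks that neither hold leftover TIDs to remove nor are
         * marked in the DSM -- this is where the I/O saving comes from. */
        if (!tidlist_has_block(previous, blkno) &&
            !dsm_block_needs_vacuum(rel, blkno))
            continue;

        buf = ReadBuffer(rel, blkno);

        remove_tids_on_block(previous, buf, blkno);    /* old cycle's phase 3 */
        collect_dead_tids_on_block(next, buf, blkno);  /* new cycle's phase 1 */

        ReleaseBuffer(buf);
    }

    /*
     * 'next' then goes through the index pass (phase 2), is written out,
     * and its heap pass is deferred to the following vacuum cycle.
     */
}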

I think piggybacking of I/Os is very useful. The buffer manager helps us
fold some of the I/Os together, but explicit coordination is more
effective.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center




From: "Pavan Deolasee" on
On 1/22/07, Heikki Linnakangas <heikki(a)enterprisedb.com> wrote:
>
> I've been looking at the way we do vacuums.
>
> The fundamental performance issue is that a vacuum generates
> nheapblocks+nindexblocks+ndirtyblocks I/Os. Vacuum cost delay helps to
> spread the cost over time, like paying in installments, but the total is
> the same. In an I/O-bound system, the extra I/O directly translates to
> lower throughput.
>
>
Another source of I/O is perhaps the CLOG reads/writes for checking
transaction status. If we are talking about large tables like accounts in
pgbench or customer/stock in DBT2, the tables are vacuumed much later than
the actual UPDATEs. I don't have any numbers to prove it yet, but my sense
is that the CLOG pages holding the status of many of those transactions
might already have been flushed out of the cache and require an I/O. Since
the default number of CLOG SLRU buffers is 8, there could be severe CLOG
SLRU thrashing during VACUUM, as the transaction ids in a heap page will
be essentially random.

Would it help to set the status of the XMIN/XMAX of tuples early enough
that the heap page is still in the buffer cache, but late enough that the
XMIN/XMAX transactions have finished? How about doing it when the bgwriter
is about to write the page to disk? Assuming a heap page lives in the
buffer cache for a few seconds, most of the XMIN/XMAX transactions should
have completed by then, and the bgwriter could set XMIN(XMAX)_COMMITTED or
XMIN(XMAX)_INVALID for most of the tuples in the page. This would save us
CLOG I/Os later, during subsequent access to the tuples and/or during
vacuum.
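
A sketch of the idea, as a hypothetical step in the bgwriter's write-out
path. The hook itself and the maybe_set_xmin_hint/maybe_set_xmax_hint
helpers don't exist; the page and tuple macros are the real ones.

#include "access/htup.h"
#include "storage/bufpage.h"

/*
 * Hypothetical hook: just before the bgwriter flushes a dirty heap page,
 * set hint bits for tuples whose xmin/xmax transactions have finished,
 * while their CLOG pages are still likely to be cached.
 */
static void
set_hint_bits_before_write(Page page)
{
    OffsetNumber off;
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page);

    for (off = FirstOffsetNumber; off <= maxoff; off = OffsetNumberNext(off))
    {
        ItemId          itemid = PageGetItemId(page, off);
        HeapTupleHeader tup;

        if (!ItemIdIsUsed(itemid))
            continue;

        tup = (HeapTupleHeader) PageGetItem(page, itemid);

        /* Only bother with tuples whose xmin status is still unknown. */
        if (!(tup->t_infomask & (HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID)))
            maybe_set_xmin_hint(tup);  /* hypothetical: consult CLOG, set bit */

        /* ... and the same for xmax ... */
        if (!(tup->t_infomask & (HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID)))
            maybe_set_xmax_hint(tup);  /* hypothetical */
    }
}

Setting the bits just before the write also has the nice property that the
page doesn't have to be dirtied again later merely to record the hint bits.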

Any thoughts?

Thanks,
Pavan

EnterpriseDB http://www.enterprisedb.com