|
Prev: Does anything dump per-database config settings? (was Re: ALTER DATABASE vs pg_dump)
Next: XML index support
From: =?UTF-8?B?SmFuIFVyYmHFhHNraQ==?= on 27 Jun 2008 13:39 Jan Urbański wrote: > I'll add the first one to the commit fest page, and I'm sending it to > -hackers with congratulations on the decision to ditch -patches ;) Hm... someone apparently added this mail to the wiki pag independently of me, so there were two duplicate entries. I found the second description better, so I removed my original entry and left the other one. -- Jan Urbanski GPG key ID: E583D7D2 ouden estin -- Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
From: Josh Berkus on 27 Jun 2008 17:25 Jan, > Hm... someone apparently added this mail to the wiki pag independently > of me, so there were two duplicate entries. I found the second > description better, so I removed my original entry and left the other > one. Yeah, I've been going through -hackers and -patches and adding stuff to the wiki page. -- --Josh Josh Berkus PostgreSQL @ Sun San Francisco -- Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
From: Tom Lane on 3 Jul 2008 15:25 =?UTF-8?B?SmFuIFVyYmHFhHNraQ==?= <j.urbanski(a)students.mimuw.edu.pl> writes: > attached are two patches against HEAD. The smaller one is meant to be > commited - it adds some functions that manipulate double-linked lists, > namely inserting a new cell after or before another cell and swapping > two adjacent cells. > The gzipped one is WIP for my GSoC project. I've reworked the algorithm > for determing most common lexemes. I looked over this a bit. I'm not excited about adding functionality to Dllist --- that data structure is barely used at all in the backend, and I think a better case could be made for getting rid of it than adding code to it. The analyze patch doesn't change my mind on the point, because I don't think that Dllist is really helping you there anyway. The data structure I'd suggest is a simple array of pointers to the underlying hash table entries. Since you have a predetermined maximum number of lexemes to track, you can just palloc the array once --- you don't need the expansibility properties of a list. The only operations you need are "add an entry at the end" (if you keep the array sorted by descending count not ascending), "remove the end entry", and "swap adjacent entries", all of which are actually cheaper on an array than on a Dllist. Another point is that you don't really need the array to be sorted all the time. Instead of using what is basically an O(N^2) incremental sort, you could consider applying qsort() when you do need it to be sorted, which is at the end or when the table overflows and you need to discard some entries. If you are willing to discard multiple entries per overflow event, this could be quite cheap --- although I think in the worst case where there are many overflows, it'd go back to being O(N^2). Note BTW that adding a count=1 entry at the end cannot make the array unsorted if it was sorted before. The only event that renders the array unsorted is increasing an item's count to more than the count of its predecessor --- so it might be worth keeping a predecessor pointer in each hashtable entry that identifies its predecessor as of the time of the last array sort, so you could check on-the-fly and avoid unnecessary re-sorts. I'm not totally sure that this idea is a win, but it seems worth investigating. Other than that, the analyze patch looked generally sane to me, and I think you're on the right track. Please rework and resubmit. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
From: "Heikki Linnakangas" on 4 Jul 2008 03:32 Tom Lane wrote: > The data structure I'd suggest is a simple array of pointers > to the underlying hash table entries. Since you have a predetermined > maximum number of lexemes to track, you can just palloc the array once > --- you don't need the expansibility properties of a list. The number of lexemes isn't predetermined. It's 2 * (longest tsvector seen so far), and we don't know beforehand how long the longest tsvector is. repalloc()ing shouldn't be a problem, though. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
From: Tom Lane on 4 Jul 2008 11:53 "Heikki Linnakangas" <heikki(a)enterprisedb.com> writes: > Tom Lane wrote: >> The data structure I'd suggest is a simple array of pointers >> to the underlying hash table entries. Since you have a predetermined >> maximum number of lexemes to track, you can just palloc the array once >> --- you don't need the expansibility properties of a list. > The number of lexemes isn't predetermined. It's 2 * (longest tsvector > seen so far), and we don't know beforehand how long the longest tsvector is. Hmm, I had just assumed without looking too closely that it was stats target times a fudge factor. What is the rationale for doing it as above? I don't think I like the idea of the limit varying over the course of the scan --- that means that lexemes in different places in the input will have significantly different probabilities of surviving to the final result. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
|
Next
|
Last
Pages: 1 2 3 4 5 6 Prev: Does anything dump per-database config settings? (was Re: ALTER DATABASE vs pg_dump) Next: XML index support |