From: "H.Harada" on
Hi,

As I proposed a month before, I am working on window function.
Although this work is at quite early step, I would like to introduce
it since a part of it have been finished. If you can afford and are
interested in it, please review the document and patch, or compile the
applied source to execute an attached sample SQL.

http://umitanuki.net/pgsql/wfv01/design.html

Currently, only aggregation over window does work. I am planning to
work for the combination of window and normal aggregation from now on,
which I guess I can manage to do.

The problem is, as written in the "Things to discussed" section of the
document, how you define window functions (e.g. RANK()). My idea is to
treat them as specialized functions such as SET OF functions and mark
it in pg_proc. But this doesn't resolve RANK() boundary problem.

I am so happy with any kind of comments, reviews or critiques.

Regards,

--
Hitoshi Harada

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Martijn van Oosterhout on
On Sat, Jul 05, 2008 at 07:04:29PM +0900, H.Harada wrote:
> Hi,
>
> As I proposed a month before, I am working on window function.

Very nice!

> http://umitanuki.net/pgsql/wfv01/design.html
>
> The problem is, as written in the "Things to discussed" section of the
> document, how you define window functions (e.g. RANK()). My idea is to
> treat them as specialized functions such as SET OF functions and mark
> it in pg_proc. But this doesn't resolve RANK() boundary problem.

Actually, I would make RANK() and ROW_NUMBER() act more like
aggregates. ISTM you have two kinds of window functions:

- aggregation: a result is calculated over a set and the result copied
across all the rows.
- order depenadant: same as above, but the result is different for each
row.

I think you could make the latter work using the current aggregation
setup, just by calling the final_func for each row rather than just
once at the end.

That would make RANK() a normal aggrgate which returns the number of
distinct values seen so far (assuming input is ordered) and
ROW_NUMBER() is just an alias for COUNT().

I hope this is clear, let me know if it doesn't make sense.

Have a nice day,
--
Martijn van Oosterhout <kleptog(a)svana.org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
From: Simon Riggs on

On Sat, 2008-07-05 at 16:20 +0200, Martijn van Oosterhout wrote:
> On Sat, Jul 05, 2008 at 07:04:29PM +0900, H.Harada wrote:

> > http://umitanuki.net/pgsql/wfv01/design.html
> >
> > The problem is, as written in the "Things to discussed" section of the
> > document, how you define window functions (e.g. RANK()). My idea is to
> > treat them as specialized functions such as SET OF functions and mark
> > it in pg_proc. But this doesn't resolve RANK() boundary problem.
>
> Actually, I would make RANK() and ROW_NUMBER() act more like
> aggregates. ISTM you have two kinds of window functions:
>
> - aggregation: a result is calculated over a set and the result copied
> across all the rows.
> - order depenadant: same as above, but the result is different for each
> row.
>
> I think you could make the latter work using the current aggregation
> setup, just by calling the final_func for each row rather than just
> once at the end.

AFAICS there's no overlap between windowed aggregates and normal
aggregates, so we can different infrastructure for each. I like the
suggestion of doing it very similarly to current aggregates, but I would
introduce a new function hook for windowed aggregates, wfunc.

i.e. to create a windowed aggregate you would do

CREATE AGGREGATE window_func()
(
sfunc = ...
stype = ...
wfunc = ...
initcond =
)

For each row we would execute the transition function (sfunc) then, if
there is a window function (wfunc) then we call that to return a value
for this tuple (so in that case we execute two functions per tuple in
the window). If wfunc is not set then we return the transition datatype
itself.

Doing it this way

* it will be clear which aggregates are windowed and which non-windowed,
so we can avoid errors running a windowed aggregate in a non-windowed
context

* it also allows us to avoid executing two functions when the windowed
function is very simple - denserank() for example just returns the
number of rows seen so far in the window.

Denserank is fairly simple

CREATE AGGREGATE denserank()
(
sfunc = increment
stype = bigint
initcond = 0
)

rank() is fairly complex because the state data must track 3 things:
* the number of tuples seen so far (bigint)
* the value of the last tuple seen (anyelement)
* the rank of the last tuple seen (bigint)

sfunc would compare the new value with the last value, if they match
then we return the rank of the last tuple. If they don't match then we
set the stored value and rank, then return the number of tuples seen so
far as the rank.

> That would make RANK() a normal aggrgate which returns the number of
> distinct values seen so far (assuming input is ordered) and
> ROW_NUMBER() is just an alias for COUNT().

> I hope this is clear, let me know if it doesn't make sense.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: "H.Harada" on
Hi,

2008/7/6 Simon Riggs <simon(a)2ndquadrant.com>:
>
> On Sat, 2008-07-05 at 16:20 +0200, Martijn van Oosterhout wrote:
>> On Sat, Jul 05, 2008 at 07:04:29PM +0900, H.Harada wrote:
>
>> > http://umitanuki.net/pgsql/wfv01/design.html
>> >
>> > The problem is, as written in the "Things to discussed" section of the
>> > document, how you define window functions (e.g. RANK()). My idea is to
>> > treat them as specialized functions such as SET OF functions and mark
>> > it in pg_proc. But this doesn't resolve RANK() boundary problem.
>>
>> Actually, I would make RANK() and ROW_NUMBER() act more like
>> aggregates. ISTM you have two kinds of window functions:
>>
>> - aggregation: a result is calculated over a set and the result copied
>> across all the rows.
>> - order depenadant: same as above, but the result is different for each
>> row.
>>
>> I think you could make the latter work using the current aggregation
>> setup, just by calling the final_func for each row rather than just
>> once at the end.
>
> AFAICS there's no overlap between windowed aggregates and normal
> aggregates, so we can different infrastructure for each. I like the
> suggestion of doing it very similarly to current aggregates, but I would
> introduce a new function hook for windowed aggregates, wfunc.

I think there are two types of functions for windowed mode.
- windowed aggregate
this type of function is exactly same as normal aggregate. So we use
functions that have been in pgsql already. Actually in my patch above,
I didn't introduce any new function. This type of function includes
simply sum(), avg(), etc. which returns same values on a partition or
a window frame.

- windowed function
this is the NEW type of function. I guess we should add a new function
type to pgsql. This type of function includes rank(), rank_dense(),
row_number(), etc. Windowed functions returns different values per
tuple.

The difference between two types is if the function returns the same
value during a partition or different values.

So, windowed aggregate and normal aggregate overlap each other. How
you know which one is that you see OVER clause in SQL just after the
function call. When you see OVER after func(), and pg_proc says it's
an aggregate, it's a windowed aggregate. Otherwise, it's a windowed
function.

If I misunderstood about those definitions please correct me.

> Denserank is fairly simple
>
> CREATE AGGREGATE denserank()
> (
> sfunc = increment
> stype = bigint
> initcond = 0
> )

I think this is ROW_NUMBER(), not DENSE_RANK(), isn't it?

> rank() is fairly complex because the state data must track 3 things:
> * the number of tuples seen so far (bigint)
> * the value of the last tuple seen (anyelement)
> * the rank of the last tuple seen (bigint)
>
> sfunc would compare the new value with the last value, if they match
> then we return the rank of the last tuple. If they don't match then we
> set the stored value and rank, then return the number of tuples seen so
> far as the rank.

Yeah, I think RANK() needs to know some or all of tuple of current
row. But it seems not to take any argument, which is the "boundary
problem" I said. Definitely RANK() or windowed function should know
about current tuple so that when to increment rank. But how? As
explicit argument? or specialized method?


--
Hitoshi Harada

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: "H.Harada" on
2008/7/5 Martijn van Oosterhout <kleptog(a)svana.org>:
> On Sat, Jul 05, 2008 at 07:04:29PM +0900, H.Harada wrote:
>> Hi,
>>
>> As I proposed a month before, I am working on window function.
>
> Very nice!
>
>> http://umitanuki.net/pgsql/wfv01/design.html
>>
>> The problem is, as written in the "Things to discussed" section of the
>> document, how you define window functions (e.g. RANK()). My idea is to
>> treat them as specialized functions such as SET OF functions and mark
>> it in pg_proc. But this doesn't resolve RANK() boundary problem.
>
> Actually, I would make RANK() and ROW_NUMBER() act more like
> aggregates. ISTM you have two kinds of window functions:
>
> - aggregation: a result is calculated over a set and the result copied
> across all the rows.
> - order depenadant: same as above, but the result is different for each
> row.

So I agree the definition of these two types.

> I think you could make the latter work using the current aggregation
> setup, just by calling the final_func for each row rather than just
> once at the end.

How do you know which type of two above is used in the same SQL syntax?


--
Hitoshi Harada

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers