From: Markus Wanner on
Hi,

On 08/01/2010 08:04 PM, Sushant Sinha wrote:
> 1. We do not have separate tokens "wikipedia" and "org"
> 2. If we have the two tokens we should have them at adjacent position so
> that a phrase search for "wikipedia org" should work.

This would needlessly increase the number of tokens. Instead you'd
better make it work like compound word support, having just "wikipedia"
and "org" as tokens.

Searching for "wikipedia.org" or "wikipedia org" should then result in
the same search query with the two tokens: "wikipedia" and "org".
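
As a rough sketch of this idea, assuming a simple hypothetical helper `word_tokens` (not the actual PostgreSQL parser), both spellings of the query can be reduced to the same pair of word tokens:

```python
import re

def word_tokens(text):
    """Split text into lowercase alphanumeric word tokens.

    Hypothetical helper for illustration only; the real text search
    parser has many more token types and rules.
    """
    return re.findall(r"[a-z0-9]+", text.lower())

# Both query spellings collapse to the same token list.
query_a = word_tokens("wikipedia.org")
query_b = word_tokens("wikipedia org")
print(query_a, query_b, query_a == query_b)
```

With a tokenizer like this, the document and the query go through the same splitting, so either spelling would match a document containing "wikipedia.org".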

> position 0: WORD(wikipedia), URL(wikipedia.org/search?q=sushant)

IMO the differentiation between WORDs and URLs is not something the text
search engine should have to care much about. Let it just do the
searching and make it do that well.

What does a token "wikipedia.org/search?q=sushant" buy you in terms of
text searching? Or even result highlighting? I wouldn't expect anybody
to want to search for a full URL, would you?

Regards

Markus Wanner

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Sushant Sinha on
> On 08/01/2010 08:04 PM, Sushant Sinha wrote:
> > 1. We do not have separate tokens "wikipedia" and "org"
> > 2. If we have the two tokens we should have them at adjacent position so
> > that a phrase search for "wikipedia org" should work.
>
> This would needlessly increase the number of tokens. Instead you'd
> better make it work like compound word support, having just "wikipedia"
> and "org" as tokens.

The current text parser already returns url and url_path. That already
increases the number of unique tokens. I am only asking to add the
normal English words as well, so that someone who types only
"wikipedia" gets a match.

>
> Searching for "wikipedia.org" or "wikipedia org" should then result in
> the same search query with the two tokens: "wikipedia" and "org".

People have previously expressed the need to index URLs and emails, and
the text parser already does so. Reverting that would be a
regression of functionality. Further, a ranking function can take
advantage of a direct match on a token.

> > position 0: WORD(wikipedia), URL(wikipedia.org/search?q=sushant)
>
> IMO the differentiation between WORDs and URLs is not something the text
> search engine should have to care much about. Let it just do the
> searching and make it do that well.

The Postgres English parser already emits URLs as tokens. All I am
asking for is improved tokenization and positioning.
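
To make the proposal concrete, here is a minimal sketch, assuming a hypothetical `tokenize` function and simplified regular expressions (this is not the PostgreSQL parser, and it ignores words in the path and query string). It keeps the full URL token and also emits the host's words at adjacent positions, so that a phrase search for "wikipedia org" could work:

```python
import re

# Simplified, assumed patterns; real URL detection is more involved.
URL_RE = re.compile(r"[a-z0-9-]+(?:\.[a-z0-9-]+)+(?:/\S*)?", re.IGNORECASE)
WORD_RE = re.compile(r"[A-Za-z0-9]+")

def tokenize(text):
    """Return (position, type, token) triples for whitespace-separated text."""
    out = []
    pos = 0
    for chunk in text.split():
        if URL_RE.fullmatch(chunk):
            # The URL token shares its position with the first host word.
            out.append((pos, "URL", chunk.lower()))
            host = chunk.split("/", 1)[0]
            for word in WORD_RE.findall(host):
                out.append((pos, "WORD", word.lower()))
                pos += 1
        else:
            for word in WORD_RE.findall(chunk):
                out.append((pos, "WORD", word.lower()))
                pos += 1
    return out

for triple in tokenize("see wikipedia.org/search?q=sushant now"):
    print(triple)
```

In this sketch "wikipedia" and the URL both sit at the same position, and "org" follows at the next one, which is the positioning asked for above.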

> What does a token "wikipedia.org/search?q=sushant" buy you in terms of
> text searching? Or even result highlighting? I wouldn't expect anybody
> to want to search for a full URL, would you?

That need has been expressed in the past. An exact token match can also
result in better ranking functions. For example, a tf-idf ranking will
rank a match on such a unique token significantly higher.
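
A small sketch of why that happens, assuming a generic smoothed inverse document frequency and a made-up three-document corpus (this is an illustration of tf-idf in general, not a claim about what PostgreSQL's built-in ranking functions compute):

```python
import math

def idf(token, docs):
    """Smoothed inverse document frequency of a token over token lists."""
    df = sum(1 for d in docs if token in d)
    return math.log((1 + len(docs)) / (1 + df)) + 1.0

# Hypothetical corpus: each document is a list of already-parsed tokens.
docs = [
    ["wikipedia", "org", "wikipedia.org/search?q=sushant"],
    ["wikipedia", "org", "article"],
    ["wikipedia", "history"],
]

common = idf("wikipedia", docs)
rare = idf("wikipedia.org/search?q=sushant", docs)
print(common, rare, rare > common)
```

The full URL occurs in only one document, so its weight is higher than that of "wikipedia", which occurs in all of them.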

-Sushant.

> Regards
>
> Markus Wanner




From: Markus Wanner on
Hi,

On 08/02/2010 03:12 PM, Sushant Sinha wrote:
> The current text parser already returns url and url_path. That already
> increases the number of unique tokens.

Well, I think I simply turned that off to be able to search for plain
words. It still works for complete URLs; those are then just treated
like text.

> People have previously expressed the need to index URLs and emails, and
> the text parser already does so. Reverting that would be a
> regression of functionality. Further, a ranking function can take
> advantage of a direct match on a token.

That's a point, yes. However, simply making the same string turn up
twice in the tokenizer's output doesn't sound like the right solution to
me. Especially considering that the query parser uses the very same
tokenizer.

Regards

Markus Wanner
