Seeking suggestions for site search [General Linux]


From: Allen Kistler on 11 Oct 2009 20:15

As the title states, I'm seeking suggestions for a site search engine to
search wikis, regular web sites, and possibly CVS. At the highest
level, the requirements are:

1. Must be open source & fee-free
2. Must not be Java or C/C++ (not debatable, don't try)
3. Should be Python (I might be able to sell Perl, though)
4. Should have an active development community
5. Should have an API that allows apps to query
6. Should have ability to tweak results administratively
(e.g., choose which pages get listed for a certain word,
even if they don't have that word, and which pages don't
get listed, even if they do have the word)
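
To make Req 6 concrete, here is a minimal sketch in Python of the kind of override layer I mean. It is not tied to any of the engines named in this thread: the table names, the URLs, and the function apply_overrides are all hypothetical, and it assumes the engine hands back an ordered list of URLs for a single-word query.

```python
from typing import Dict, List, Set

# Hypothetical admin-maintained tables; every name and URL here is made up.
PINNED: Dict[str, List[str]] = {
    # List this page for "vpn" even if the page never contains the word.
    "vpn": ["/wiki/RemoteAccess"],
}
BLOCKED: Dict[str, Set[str]] = {
    # Never list this page for "vpn" even though it contains the word.
    "vpn": {"/old/vpn-howto.html"},
}


def apply_overrides(term: str, engine_hits: List[str]) -> List[str]:
    """Adjust the engine's ordered hits for one search term using the admin tables."""
    key = term.lower()
    pinned = PINNED.get(key, [])
    blocked = BLOCKED.get(key, set())

    # Pinned pages come first, in the order the admin listed them.
    merged = list(pinned)
    for url in engine_hits:
        # Skip blocked pages and anything already placed by a pin.
        if url not in blocked and url not in merged:
            merged.append(url)
    return merged
```

An engine that exposes a query API (Req 5) would let a wrapper like this sit in front of it; an engine with built-in support for pinning and blocking would obviously be better.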

Most stuff that I've been able to find that meets Req 1 gets killed by
Req 2.

What I've found so far that survives Req 2:

Gonzui - written in Ruby, doesn't appear to be actively maintained
Lucene - although written in Java, it has ports to Perl and Ruby
Namazu - written in Perl
OpenFTS - written in Perl, doesn't appear to be actively maintained

I haven't dug deeply into those above, but are there any others I should
consider? Any experience with those above?

From: Keith Keller on 11 Oct 2009 23:08

On 2009-10-12, Allen Kistler <ackistler(a)oohay.moc> wrote:
> As the title states, I'm seeking suggestions for a site search engine to
> search wikis, regular web sites, and possibly CVS.

You didn't say: searching from the front-end (i.e., screenscraping
someone else's site) or the back-end (i.e., indexing your own site)?
I'm assuming the latter.

> Lucene - although written in Java, it has ports to Perl and Ruby

I'm assuming you're talking about KinoSearch here as the Perl port? We
use it for a fairly large chunk of data, and still literally tens of
millions of "documents" fit into an index about 10GB in size, and it's
incredibly fast to return results. But it does require you to write
code to build the index--you can't just throw it at a wiki off the shelf
and hope it works. So you'll also have to figure out how to get the
list of documents to feed to it, and what sort of data to stuff into the
index.
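
As a rough picture of those two pieces of work (getting the list of documents, and deciding what goes into the index), here is a toy sketch in plain Python. To be clear, this is not KinoSearch's API and not how it stores anything; the Doc tuple layout and the names tokenize, build_index, and search are assumptions made up for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Assumed document shape: (url, title, body). A real feed might come from a
# wiki database dump, a directory walk, or a crawler.
Doc = Tuple[str, str, str]


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: Iterable[Doc]) -> Dict[str, Set[str]]:
    """Map each term to the set of URLs whose title or body contains it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for url, title, body in docs:
        for term in tokenize(title + " " + body):
            index[term].add(url)
    return index


def search(index: Dict[str, Set[str]], query: str) -> List[str]:
    """Return the URLs containing every query term (AND search), sorted by URL."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return sorted(hits)
```

A real engine adds ranking, stemming, and an on-disk format on top of this, but the part you have to write yourself is roughly the loop in build_index: where the documents come from and which fields get indexed.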

The KinoSearch page is here: http://www.rectangular.com/kinosearch/

If you're talking about Plucene, forget about it, it's dog slow. See
http://www.rectangular.com/kinosearch/benchmarks.html if you're willing
to trust their benchmarks (I've never done the benchmarks myself).

--keith

--
kkeller-usenet(a)wombat.san-francisco.ca.us
(try just my userid to email me)
AOLSFAQ=http://www.therockgarden.ca/aolsfaq.txt
see X- headers for PGP signature information

From: Allen Kistler on 12 Oct 2009 02:50

Keith Keller wrote:
> On 2009-10-12, Allen Kistler <ackistler(a)oohay.moc> wrote:
>> As the title states, I'm seeking suggestions for a site search engine to
>> search wikis, regular web sites, and possibly CVS.
>
> You didn't say: searching from the front-end (i.e., screenscraping
> someone else's site) or the back-end (i.e., indexing your own site)?
> I'm assuming the latter.

Yes, indexing our own site, which would actually be multiple content
sources, so I was expecting there to be some crawling involved. If I
can avoid crawling, that's okay, too.

>> Lucene - although written in Java, it has ports to Perl and Ruby
>
> I'm assuming you're talking about KinoSearch here as the Perl port? We
> use it for a fairly large chunk of data, and still literally tens of
> millions of "documents" fit into an index about 10GB in size, and it's
> incredibly fast to return results. But it does require you to write
> code to build the index--you can't just throw it at a wiki off the shelf
> and hope it works. So you'll also have to figure out how to get the
> list of documents to feed to it, and what sort of data to stuff into the
> index.
>
> The KinoSearch page is here: http://www.rectangular.com/kinosearch/
>
> If you're talking about Plucene, forget about it, it's dog slow. See
> http://www.rectangular.com/kinosearch/benchmarks.html if you're willing
> to trust their benchmarks (I've never done the benchmarks myself).

I was thinking of both. I like the statement in the benchmark:
"Lucene's data structures are almost pathologically ill-matched with Perl."

Thanks for the feedback.
