From: Erick T. Barkhuis on
Ignoramus12110:
....
>So... Any suggestion for software to ran strings by similarity and
>provide "top 5" or something like that?

All I can come up with is Levenshtein (not much experience using it,
though).
May I suggest you use "levenshtein mysql" or "levenshtein php" as a
search phrase?


--
Erick
From: Axel Schwenke on
Ignoramus12110 <ignoramus12110(a)NOSPAM.12110.invalid> wrote:
>
> I am hoping that, perhaps, there is some free package that could take
> a few hundreds of thousands of text strings and could provide me with
> "find similar" functionality.
>
> Realizing the potential difficulty of the task, I would be content if
> it worked only moderately well. I just want something along the lines.
>
> Are there any MySQL functions or other software packages or perl
> modules that provide something of the sort.

CPAN has some packages for approximate string matching. Levenstein has
been named. And virtually all SQL databases have SOUNDEX(). Another
approach is trigram counting.

The problem ist hard, especially when you look for a solution that runs
faster than O(n). Outside the database you cannot be faster than O(n)
anyway. For "few thousands" candidates it will however be fast enough.


XL
From: Axel Schwenke on
Ignoramus12110 <ignoramus12110(a)NOSPAM.12110.invalid> wrote:
> On 2010-07-05, Axel Schwenke <axel.schwenke(a)gmx.de> wrote:
>>
>> CPAN has some packages for approximate string matching. Levenstein has
>> been named. And virtually all SQL databases have SOUNDEX(). Another
>> approach is trigram counting.
>
> Thanks. Do you know any package names?

CPAN does:

http://search.cpan.org/search?query=levenshtein&mode=all

This also lists some non-Levenstein implementations

>> The problem ist hard, especially when you look for a solution that runs
>> faster than O(n). Outside the database you cannot be faster than O(n)
>> anyway. For "few thousands" candidates it will however be fast enough.
>
> Right now I have 208,919 candidates and the number is growing by
> appx. 200 per day.

Then non-indexed solutions migh be a little slow.

Approximate matching is one facette of fulltext search engines. So you
might want to try one of those. MySQL itself comes with a (limited)
implementation of a FULLTEXT index. And there is a wealth of fulltext
engines: Sphinx, Lucene, Xapian, Swish++, mnogosearch, ...

SOUNDEX() I named only for completeness. The nice thing about it is
that it is virtually everywhere available. Though its usefulness is
quite limited because it is very coarse and works only for English.


XL
From: Philipp Pagel on
In comp.os.linux.misc Ignoramus12110 <ignoramus12110(a)nospam.12110.invalid> wrote:

> ``A flagpole casts a shadow of 32 ft, Nearby, a 10-ft tree casts a
> shadow of 2 ft. What is the height of the flag pole?''

> ``A flag pole casts a shadow of 32 feet. Nearby, a 10 foot tree
> casts a shadow of 2 ft. Find the height of the flag pole?''

> are similar.
>
> I am hoping that, perhaps, there is some free package that could take
> a few hundreds of thousands of text strings and could provide me with
> "find similar" functionality.

The BLAST software roughly does what you are looking for - but outside
of mysql and specifically written for finding similar sequences in DNA
or protein databases. That said, it should be possible to tweak it to
accept input with arbitrary alphabets. Another programm from the
bioinformatics world would be FASTA. At least BLAST can be found as a
debian package (and probably for others, too).

ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
http://faculty.virginia.edu/wrpearson/fasta/

For a few hundred or thousand comparisons and a little patience, you
may also consider using the sequence alignment algorithms by Needleman
and Wunsch or Smith and Waterman -- depending on your needs. If you
would like to explore this route and are willing to do some code
tweaking/expanding I could provide a simple implementation written in
Python (just email me). And of course you can find tons of other
implementations on the web because it's a classic for bioinformatics
students.


cu
Philipp

--
Dr. Philipp Pagel
Lehrstuhl f. Genomorientierte Bioinformatik
Technische Universit�t M�nchen
http://webclu.bio.wzw.tum.de/~pagel/
From: ccc31807 on
On Jul 5, 4:16 pm, Ignoramus12110 <ignoramus12...(a)NOSPAM.
12110.invalid> wrote:
> I have a MySQL database of answered algebra questions. Questions are
> stored as text strings.
> When students ask questions, often (if not usually) there is already
> something similar answered in the database. Note that I am not
> defining what is "similar" and I do realize that it is a difficult
> definition to make.

As strong as Perl is at string manipulation, this is the kind of
problem domain that Lisp is ideally suited for. At least one
introduction, Lisp 3rd (Winston and Horn) devotes the last half of the
book to consideration of these kinds of problems, and can be had for
$1.68 as a used book on amazon.com, half.com, etc.

I don't know what kind of time you have to devote to solving the
problem, or the strength of your interest, or your previous
experience, but I would strongly suggest that if you have the time,
interest, and experience, that you would do well to read through W&H,
Lisp 3rd.

If you want something meatier, Paradigms of Artificial Intelligence
Programming: Case Studies in Common Lisp (Norvig) seems to make the
top ten list of everyone's Best Books in computer science.

Essentially, what you would want to do is parse the student query for
key words, perhaps building a database of common search terms, and
match them against your database, perhaps iteratively using random
permutations, using the standard Lisp pattern matching techniques.

CC.