Are there any MySQL queries or software packages for "finding similar items" [Perl]

From: Ignoramus12110 on 5 Jul 2010 16:16

I have a MySQL database of answered algebra questions. Questions are
stored as text strings.

Examples are

``two dice are rolled. find the odds that the score on the dice is either 10 or at most 5''
``if x is the first of three consecutive even intethe product of twice a number and three is the same as the difference''
``Write the equation of the line with a slope of -1/3 and passing through the point (6, -4).''
``A flagpole casts a shadow of 32 ft, Nearby, a 10-ft tree casts a shadow of 2 ft. What is the height of the flag pole?''
``A flag pole casts a shadow of 32 feet. Nearby, a 10 foot tree casts a shadow of 2 ft. Find the height of the flag pole?''

When students ask questions, often (if not usually) there is already
something similar answered in the database. Note that I am not
defining what is "similar" and I do realize that it is a difficult
definition to make.

Example: to my human mind, questions

``A flagpole casts a shadow of 32 ft, Nearby, a 10-ft tree casts a shadow of 2 ft. What is the height of the flag pole?''

and

``A flag pole casts a shadow of 32 feet. Nearby, a 10 foot tree casts a shadow of 2 ft. Find the height of the flag pole?''

are similar.

I am hoping that, perhaps, there is some free package that could take
a few hundred thousand text strings and provide me with
"find similar" functionality.

Realizing the potential difficulty of the task, I would be content if
it worked only moderately well. I just want something along those lines.

Are there any MySQL functions, other software packages, or Perl
modules that provide something of the sort?

I have seen some web forums that provide "do perhaps those other
threads answer your question?" functionality by giving a list of
matching threads. None of them seems to have super cow powers, but they
look like a decent start.

So... Any suggestions for software to rank strings by similarity and
provide "top 5" or something like that?

thanks

i

From: Norman Peelman on 5 Jul 2010 16:42

Ignoramus12110 wrote:
> So... Any suggestions for software to rank strings by similarity and
> provide "top 5" or something like that?

Check:
http://us.php.net/manual/en/function.soundex.php

and other links/algorithms on that page.

If you computed and stored the soundex of each question in the db
(in an indexed column), you could search by the soundex of the input
question.
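As a rough illustration of the stored-soundex idea, here is a minimal Python sketch. It assumes the standard American Soundex rules, and the names soundex, build_index and QUESTIONS are made up for this example. A dict stands in for the indexed database column.

```python
from collections import defaultdict

# Standard American Soundex digit groups (letter -> digit).
CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(text):
    """Return the 4-character Soundex code of text, ignoring non-letters."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
            if len(result) == 4:
                break
        prev = code  # vowels reset prev, so a repeated code counts again
    return result.ljust(4, "0")


def build_index(questions):
    """Map each Soundex key to the ids of the questions that produce it."""
    index = defaultdict(list)
    for qid, text in enumerate(questions):
        index[soundex(text)].append(qid)
    return index


# Hypothetical sample data taken from the examples in this thread.
QUESTIONS = [
    "two dice are rolled. find the odds that the score on the dice is either 10 or at most 5",
    "Write the equation of the line with a slope of -1/3 and passing through the point (6, -4).",
    "A flagpole casts a shadow of 32 ft, Nearby, a 10-ft tree casts a shadow of 2 ft. What is the height of the flag pole?",
    "A flag pole casts a shadow of 32 feet. Nearby, a 10 foot tree casts a shadow of 2 ft. Find the height of the flag pole?",
]
INDEX = build_index(QUESTIONS)

new_question = "A flag pole casts a 32 foot shadow. How tall is it?"
key = soundex(new_question)
print(key, INDEX.get(key, []))
```

Note that the key is only four characters long, so it is decided by the first few consonants of the string. Both flagpole questions collide, but so would any other question that starts with "A flag...".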

--
Norman
Registered Linux user #461062
-Have you been to www.mysql.com yet?-

From: Norman Peelman on 5 Jul 2010 16:46

Ignoramus12110 wrote:
> So... Any suggestions for software to rank strings by similarity and
> provide "top 5" or something like that?

Looks like Perl has it too:

http://perldoc.perl.org/Text/Soundex.html

--
Norman
Registered Linux user #461062
-Have you been to www.mysql.com yet?-

From: Ignoramus12110 on 5 Jul 2010 18:33

On 2010-07-05, Axel Schwenke <axel.schwenke(a)gmx.de> wrote:
> Ignoramus12110 <ignoramus12110(a)NOSPAM.12110.invalid> wrote:
>>
>> I am hoping that, perhaps, there is some free package that could take
>> a few hundred thousand text strings and provide me with
>> "find similar" functionality.
>>
>> Realizing the potential difficulty of the task, I would be content if
>> it worked only moderately well. I just want something along those lines.
>>
>> Are there any MySQL functions, other software packages, or Perl
>> modules that provide something of the sort?
>
> CPAN has some packages for approximate string matching. Levenshtein has
> been named. And virtually all SQL databases have SOUNDEX(). Another
> approach is trigram counting.

Thanks. Do you know any package names?
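The trigram counting mentioned in the quoted reply can be tried without any package. The following is only a sketch in Python: it counts character trigrams and compares the counts with cosine similarity, which is one common choice and not necessarily what the poster had in mind. The function names and the sample list are assumptions made for this example.

```python
import math
from collections import Counter


def trigrams(text):
    """Count overlapping 3-character substrings of the normalized text."""
    norm = " ".join(text.lower().split())  # lowercase, collapse whitespace
    return Counter(norm[i:i + 3] for i in range(len(norm) - 2))


def cosine(a, b):
    """Cosine similarity of two trigram count vectors (0.0 to 1.0)."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_matches(query, candidates, k=5):
    """Linear scan: score every candidate and return the k best (score, text)."""
    q = trigrams(query)
    scored = [(cosine(q, trigrams(c)), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical sample data taken from the examples in this thread.
CANDIDATES = [
    "two dice are rolled. find the odds that the score on the dice is either 10 or at most 5",
    "Write the equation of the line with a slope of -1/3 and passing through the point (6, -4).",
    "A flag pole casts a shadow of 32 feet. Nearby, a 10 foot tree casts a shadow of 2 ft. Find the height of the flag pole?",
]
QUERY = ("A flagpole casts a shadow of 32 ft, Nearby, a 10-ft tree casts a "
         "shadow of 2 ft. What is the height of the flag pole?")

for score, text in top_matches(QUERY, CANDIDATES):
    print(f"{score:.2f}  {text[:60]}")
```

This scans every candidate, so it is O(n) per query. In practice the trigram counts would be precomputed and stored rather than rebuilt on each search.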

> The problem is hard, especially when you look for a solution that runs
> faster than O(n). Outside the database you cannot be faster than O(n)
> anyway. For a "few thousand" candidates it will, however, be fast enough.

Right now I have 208,919 candidates and the number is growing by
approx. 200 per day.

::~==>algsql "select count(*) from XXXXXXX where yyyyy = 1"
count(*)
208919

I agree that it is a hard problem.

The Perl Levenshtein module seems to be more single-word oriented.
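One way around the single-word orientation is to run the same edit-distance recurrence over lists of words instead of characters. The Python sketch below does that; it is an illustration, not any particular module's behaviour, and tokens, edit_distance and word_similarity are names invented here.

```python
import re


def tokens(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def word_similarity(s, t):
    """1.0 for identical word sequences, falling toward 0.0 as they diverge."""
    a, b = tokens(s), tokens(t)
    longest = max(len(a), len(b))
    return 1.0 - edit_distance(a, b) / longest if longest else 1.0


# Sample questions taken from the examples in this thread.
FLAG_1 = ("A flagpole casts a shadow of 32 ft, Nearby, a 10-ft tree casts a "
          "shadow of 2 ft. What is the height of the flag pole?")
FLAG_2 = ("A flag pole casts a shadow of 32 feet. Nearby, a 10 foot tree casts "
          "a shadow of 2 ft. Find the height of the flag pole?")
DICE = ("two dice are rolled. find the odds that the score on the dice is "
        "either 10 or at most 5")

print(round(word_similarity(FLAG_1, FLAG_2), 2))
print(round(word_similarity(FLAG_1, DICE), 2))
```

Each comparison costs time proportional to the product of the two word counts, so over a couple of hundred thousand questions this is probably better used to re-rank a shortlist than to scan everything.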

i

From: Norman Peelman on 5 Jul 2010 19:33

Ignoramus12110 wrote:
> The Perl Levenshtein module seems to be more single-word oriented.

So does soundex, now that I've tried it a bit.

--
Norman
Registered Linux user #461062
-Have you been to www.mysql.com yet?-
