From: Ignoramus12110 on
On 2010-07-05, Norman Peelman <npeelman(a)cfl.rr.com> wrote:
> Ignoramus12110 wrote:
>> On 2010-07-05, Axel Schwenke <axel.schwenke(a)gmx.de> wrote:
>>> Ignoramus12110 <ignoramus12110(a)NOSPAM.12110.invalid> wrote:
>>>> I am hoping that, perhaps, there is some free package that could take
>>>> a few hundred thousand text strings and provide me with
>>>> "find similar" functionality.
>>>>
>>>> Realizing the potential difficulty of the task, I would be content if
>>>> it worked only moderately well. I just want something along those lines.
>>>>
>>>> Are there any MySQL functions, other software packages, or Perl
>>>> modules that provide something of the sort?
>>> CPAN has some packages for approximate string matching. Levenshtein has
>>> been named. And virtually all SQL databases have SOUNDEX(). Another
>>> approach is trigram counting.
>>
>> Thanks. Do you know any package names?
>>
>>> The problem is hard, especially when you look for a solution that runs
>>> faster than O(n). Outside the database you cannot be faster than O(n)
>>> anyway. For a "few thousand" candidates it will, however, be fast enough.
>>
>> Right now I have 208,919 candidates and the number is growing by
>> approx. 200 per day.
>>
>> ::~==>algsql "select count(*) from XXXXXXX where yyyyy = 1"
>> count(*)
>> 208919
>>
>> I agree that it is a hard problem.
>>
>> The Perl Levenshtein module seems to be more single-word oriented.
>>
>> i
>
> So does Soundex, now that I've tried it a bit.
>
>

Yes, Soundex is for misspellings.
From: Jerry Stuckle on
Ignoramus12110 wrote:
> Yes, Soundex is for misspellings.

Agreed. Soundex is not for trying to understand phrases or sentences.
It is to find words which "sound" alike - i.e. misspelled words.

It can't, for instance, tell the difference between "here" and "hear" -
but it can tell they sound alike.
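To make the point concrete, here is a minimal sketch of classic American
Soundex in Python. It is an illustration only: the function name and the
sample words are assumptions for this example, and the SOUNDEX() built into
MySQL or other databases may differ in details such as code length.

```python
# Classic American Soundex: first letter plus up to three digits.
# Letters are grouped by rough articulation; vowels carry no code.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(word):
    """Return the 4-character Soundex code of a word ("" if no letters)."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate two letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]


print(soundex("here"), soundex("hear"))  # both H600
```

Because the code is built from the first letter and the next few consonants,
a whole sentence collapses to a code for its first word or two, which is why
it is of little use for comparing questions.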

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex(a)attglobal.net
==================
From: Ignoramus12110 on
On 2010-07-06, Jerry Stuckle <jstucklex(a)attglobal.net> wrote:
> Agreed. Soundex is not for trying to understand phrases or sentences.
> It is to find words which "sound" alike - i.e. misspelled words.
>
> It can't, for instance, tell the difference between "here" and "hear" -
> but it can tell they sound alike.
>

I actually looked quite a bit, and did not find anything. Maybe my
brother in law could find something.

i
From: Jerry Stuckle on
Ignoramus12110 wrote:
> I actually looked quite a bit, and did not find anything. Maybe my
> brother in law could find something.
>
> i

But what you're looking for is to get a computer to be a natural
language processor, which is still beyond our current programming
capabilities. IBM has recently come up with a test system ("Watson")
which does a fair job, but still has a long way to go. Once we get
there, we'll have a Star Trek capability :)

With that said, it doesn't mean all is hopeless. Levenshtein can help,
as can trigram matching and the other things mentioned (except Soundex).
But it will also require a lot of work on your part to "train" the
system as to whether two questions are similar or not.
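As a rough illustration of trigram matching, here is a minimal Python sketch
that scores two questions by the overlap of their character trigrams (Jaccard
similarity). The function names, the normalization, and the choice of Jaccard
are assumptions made for this example, not an established package; the sample
strings are questions quoted elsewhere in this thread.

```python
import re


def trigrams(text):
    """Return the set of character trigrams of the normalized text."""
    # Lowercase and keep only letters and digits, joined by single spaces.
    norm = " ".join(re.findall(r"[a-z0-9]+", text.lower()))
    if not norm:
        return set()
    padded = "  " + norm + " "  # padding lets word starts and ends count
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Jaccard similarity of the two trigram sets, from 0.0 to 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def most_similar(query, candidates, top=3):
    """Linear scan: score every candidate and return the best few."""
    scored = sorted(((similarity(query, c), c) for c in candidates), reverse=True)
    return scored[:top]


flag_a = ("A flagpole casts a shadow of 32 ft, Nearby, a 10-ft tree casts a "
          "shadow of 2 ft. What is the height of the flag pole?")
flag_b = ("A flag pole casts a shadow of 32 feet. Nearby, a 10 foot tree casts a "
          "shadow of 2 ft. Find the height of the flag pole?")
dice = ("two dice are rolled. find the odds that the score on the dice is "
        "either 10 or at most 5")

for score, text in most_similar(flag_b, [dice, flag_a]):
    print(round(score, 2), text[:40])
```

This is still an O(n) scan over all stored questions. A threshold for "similar
enough" would have to be tuned by hand against real pairs, which is the
"training" work mentioned above.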




From: Marc Espie on
In article <maudnQZR9LURoK_RnZ2dnUVZ_qadnZ2d(a)giganews.com>,
Ignoramus12110 <ignoramus12110(a)NOSPAM.12110.invalid> wrote:
>I have a MySQL database of answered algebra questions. Questions are
>stored as text strings.
>
>Examples are
>
>``two dice are rolled. find the odds that the score on the dice is
>either 10 or at most 5''
>``if x is the first of three consecutive even intethe product of twice a
>number and three is the same as the difference''
>``Write the equation of the line with a slope of -1/3 and passing
>through the point (6, -4).''
>``A flagpole casts a shadow of 32 ft, Nearby, a 10-ft tree casts a
>shadow of 2 ft. What is the height of the flag pole?''
>``A flag pole casts a shadow of 32 feet. Nearby, a 10 foot tree casts a
>shadow of 2 ft. Find the height of the flag pole?''
>
>When students ask questions, often (if not usually) there is already
>something similar answered in the database. Note that I am not
>defining what is "similar" and I do realize that it is a difficult
>definition to make.

Are you hell-bent on MySQL?

Because SQLite has an FTS3 extension that looks like a prime candidate
for locating similar questions first, before using some approximate-matching
Perl code to figure out whether a candidate is the same question or not...
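A minimal sketch of that two-stage idea, written in Python rather than Perl
for brevity. It assumes the local SQLite build has FTS3 enabled; the table
name, the helper names, the "words longer than three letters" filter, and the
use of difflib for the second stage are all assumptions made for this example.

```python
import re
import sqlite3
from difflib import SequenceMatcher


def build_index(questions):
    """Load the questions into an in-memory FTS3 table (requires FTS3 support)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE VIRTUAL TABLE questions USING fts3(body)")
    db.executemany("INSERT INTO questions(body) VALUES (?)",
                   [(q,) for q in questions])
    return db


def find_similar(db, query, limit=5):
    """Stage 1: full-text OR query for candidates. Stage 2: rerank by ratio."""
    # Keep only longer words; short ones ("a", "of", "the") match nearly everything.
    words = sorted({w for w in re.findall(r"[a-z0-9]+", query.lower()) if len(w) > 3})
    if not words:
        return []
    match_expr = " OR ".join(words)  # FTS3 treats an uppercase OR as the operator
    rows = db.execute("SELECT body FROM questions WHERE body MATCH ?",
                      (match_expr,)).fetchall()
    scored = [(SequenceMatcher(None, query.lower(), body.lower()).ratio(), body)
              for (body,) in rows]
    scored.sort(reverse=True)
    return scored[:limit]


flag_a = ("A flagpole casts a shadow of 32 ft, Nearby, a 10-ft tree casts a "
          "shadow of 2 ft. What is the height of the flag pole?")
flag_b = ("A flag pole casts a shadow of 32 feet. Nearby, a 10 foot tree casts a "
          "shadow of 2 ft. Find the height of the flag pole?")
dice = ("two dice are rolled. find the odds that the score on the dice is "
        "either 10 or at most 5")
line = ("Write the equation of the line with a slope of -1/3 and passing "
        "through the point (6, -4).")

db = build_index([flag_a, dice, line])
for score, body in find_similar(db, flag_b):
    print(round(score, 2), body[:40])
```

The full-text index does the cheap filtering, so the slower pairwise
comparison only runs on the handful of rows that share words with the new
question.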