From: BK on
hey,
I'm trying to extract US States from a character string... true to
from, the dataentry inclueds many misspellings and both full names and
appriviations. So far, I'm using a few of the built in SAS functions (
& mods of them) for testing the SCANs to see if they are valid, and
they work fairly well. Does anyone have any other ideas? I've thought
about using the SOUNDEX function with SPEDIS to determin if its close,
but am unsure of a good implememtation.

I was thinking about incorporating this kind of logic
{score(i)=spedis(soundex(schst),soundex(StateArray(i)))} into an array
taking the minimum score, but it seems ineffecient;

I'd like to have the code stay structured so that the match between the
origional value and Actual state is used first and then move toward
more fuzzy logic.

Thanks in advanced;
Byron

Here's the code so far:

***************************************************

data fmt(drop=i);
do i=0 to 56;
label=fipstate(i);
start=fipname(i);
fmtname="$FSfip";
if label not in ('--', ' ') then output fmt;
end;
start="OTHER"; label=" "; OUTPUT;
run;
proc sort data=fmt; by start; run;
proc format cntlin=fmt; run;


data test(drop=l);
set MyData;
l=(length(schst)-length(compress(schst," ,-'")))+1;
nschst=compress(schst,".'`");

do i=1 to l;
** orig. text **;
if stfips(scan(schst,i)) ne . then do; nstate=scan(schst,i); i=l;
goto el; end;
** compressed version **;
if nstate=' ' and stfips(scan(nschst,i)) ne . then do;
nstate=scan(nschst,i); i=l; goto el; end;
** single word states **;
if nstate=' ' and put(scan(nschst,i),$FSfip.) ne ' ' then do;
nstate=put(scan(nschst,i),$FSfip.); i=l; goto el; end;
** for two word states**;
if nstate=' ' and i < l and put((scan(nschst,i)||"
"||scan(nschst,i+1)),$FSfip.) ne ' '
then do; nstate=put((scan(nschst,i)||"
"||scan(nschst,i+1)),$FSfip.); i=l; goto el; end;
** for District of Columbia **;
if nstate=' ' and i+1 < l and put((scan(nschst,i)||"
"||scan(nschst,i+1)||" "||scan(nschst,i+2)),$FSfip.) ne ' '
then do; nstate=put((scan(nschst,i)||"
"||scan(nschst,i+1)||" "||scan(nschst,i+2)),$FSfip.); i=l; goto el;
end;
el: end;
run;


/** excert of names left out of the above statements ***;
MASS
MASSACHUETTS
MASSACHUSETTE
MASSACHUSSETTS
MASSSCHUSETTE
** it does get these **;
MASSACHUSETTS
SPRINGFIELD MA
SPRINGFIELD, MA
SPRINGFIELD, MA 01107
TYNGSBORO MA
W BARNSTALL, MA
W BARSTABLE MA
WELLESLEY HILLS MA
WELLESLEY, MASSACHUSETTS
WEST BARNSTABLE MA
WORCESTER MA
*******************************************************/;

From: Sigurd Hermansen on
BK:
I have to wonder whether any pattern matching method will work that well =
in this situation. Two alternatives may supplement pattern matching and =
give you acceptable results. First, an equivalent of soundex() for state =
names and abbreviations could help by removing vowels and limiting =
typographic patterns to variations of phonetic spellings. Second, take =
advantage of the context. In your example the state name either appears =
alone or as the second element.
Sig

________________________________

From: owner-sas-l(a)listserv.uga.edu on behalf of BK
Sent: Sat 8/26/2006 12:09 AM
To: sas-l(a)uga.edu
Subject: Help: fuzzy match in string



hey,
I'm trying to extract US States from a character string... true to
from, the dataentry inclueds many misspellings and both full names and
appriviations. So far, I'm using a few of the built in SAS functions (
& mods of them) for testing the SCANs to see if they are valid, and
they work fairly well. Does anyone have any other ideas? I've thought
about using the SOUNDEX function with SPEDIS to determin if its close,
but am unsure of a good implememtation.

I was thinking about incorporating this kind of logic
{score(i)=3Dspedis(soundex(schst),soundex(StateArray(i)))} into an array
taking the minimum score, but it seems ineffecient;

I'd like to have the code stay structured so that the match between the
origional value and Actual state is used first and then move toward
more fuzzy logic.

Thanks in advanced;
Byron

Here's the code so far:

***************************************************

data fmt(drop=3Di);
do i=3D0 to 56;
label=3Dfipstate(i);
start=3Dfipname(i);
fmtname=3D"$FSfip";
if label not in ('--', ' ') then output fmt;
end;
start=3D"OTHER"; label=3D" "; OUTPUT;
run;
proc sort data=3Dfmt; by start; run;
proc format cntlin=3Dfmt; run;


data test(drop=3Dl);
set MyData;
l=3D(length(schst)-length(compress(schst," ,-'")))+1;
nschst=3Dcompress(schst,".'`");

do i=3D1 to l;
** orig. text **;
if stfips(scan(schst,i)) ne . then do; nstate=3Dscan(schst,i); =
i=3Dl;
goto el; end;
** compressed version **;
if nstate=3D' ' and stfips(scan(nschst,i)) ne . then do;
nstate=3Dscan(nschst,i); i=3Dl; goto el; end;
** single word states **;
if nstate=3D' ' and put(scan(nschst,i),$FSfip.) ne ' ' then do;
nstate=3Dput(scan(nschst,i),$FSfip.); i=3Dl; goto el; end;
** for two word states**;
if nstate=3D' ' and i < l and put((scan(nschst,i)||"
"||scan(nschst,i+1)),$FSfip.) ne ' '
then do; nstate=3Dput((scan(nschst,i)||"
"||scan(nschst,i+1)),$FSfip.); i=3Dl; goto el; end;
** for District of Columbia **;
if nstate=3D' ' and i+1 < l and put((scan(nschst,i)||"
"||scan(nschst,i+1)||" "||scan(nschst,i+2)),$FSfip.) ne ' '
then do; nstate=3Dput((scan(nschst,i)||"
"||scan(nschst,i+1)||" "||scan(nschst,i+2)),$FSfip.); i=3Dl; goto el;
end;
el: end;
run;


/** excert of names left out of the above statements ***;
MASS
MASSACHUETTS
MASSACHUSETTE
MASSACHUSSETTS
MASSSCHUSETTE
** it does get these **;
MASSACHUSETTS
SPRINGFIELD MA
SPRINGFIELD, MA
SPRINGFIELD, MA 01107
TYNGSBORO MA
W BARNSTALL, MA
W BARSTABLE MA
WELLESLEY HILLS MA
WELLESLEY, MASSACHUSETTS
WEST BARNSTABLE MA
WORCESTER MA
*******************************************************/;
From: BK on
Thanks to you both...

For a little clarification what I'm trying to end up with is a new
variable with the two letter abbreviation for the state found in the
string.

I've never used regular expressions so I'll have to look them, so
thanks for the reference assistance.

Also, I'll probably try the first solution you mention Sig of
compressing the vowels, that may catch a few more. The method I've got
already searches (nearly) all possible segments of the string; it's
just making the automating determination of which of the 50 it the
value is... The example only shows MA, but there are may other state
mispellings (ie:"South Crlina" for "South Carolina") and also countries
mixed in.

After this, I may be getting about as good as it gets without having a
few "false positives" (so to speak) in my results.

Byron

From: David L Cassell on
byronkirby(a)GMAIL.COM wrote:
>hey,
>I'm trying to extract US States from a character string... true to
>from, the dataentry inclueds many misspellings and both full names and
>appriviations. So far, I'm using a few of the built in SAS functions (
>& mods of them) for testing the SCANs to see if they are valid, and
>they work fairly well. Does anyone have any other ideas? I've thought
>about using the SOUNDEX function with SPEDIS to determin if its close,
>but am unsure of a good implememtation.
>
>I was thinking about incorporating this kind of logic
>{score(i)=spedis(soundex(schst),soundex(StateArray(i)))} into an array
>taking the minimum score, but it seems ineffecient;
>
>I'd like to have the code stay structured so that the match between the
>origional value and Actual state is used first and then move toward
>more fuzzy logic.
>
>Thanks in advanced;
>Byron
>
>Here's the code so far:
>
>***************************************************
>
>data fmt(drop=i);
> do i=0 to 56;
> label=fipstate(i);
> start=fipname(i);
> fmtname="$FSfip";
> if label not in ('--', ' ') then output fmt;
> end;
> start="OTHER"; label=" "; OUTPUT;
>run;
>proc sort data=fmt; by start; run;
>proc format cntlin=fmt; run;
>
>
>data test(drop=l);
> set MyData;
> l=(length(schst)-length(compress(schst," ,-'")))+1;
> nschst=compress(schst,".'`");
>
> do i=1 to l;
>** orig. text **;
> if stfips(scan(schst,i)) ne . then do; nstate=scan(schst,i); i=l;
>goto el; end;
>** compressed version **;
> if nstate=' ' and stfips(scan(nschst,i)) ne . then do;
>nstate=scan(nschst,i); i=l; goto el; end;
>** single word states **;
> if nstate=' ' and put(scan(nschst,i),$FSfip.) ne ' ' then do;
>nstate=put(scan(nschst,i),$FSfip.); i=l; goto el; end;
>** for two word states**;
> if nstate=' ' and i < l and put((scan(nschst,i)||"
>"||scan(nschst,i+1)),$FSfip.) ne ' '
> then do; nstate=put((scan(nschst,i)||"
>"||scan(nschst,i+1)),$FSfip.); i=l; goto el; end;
>** for District of Columbia **;
> if nstate=' ' and i+1 < l and put((scan(nschst,i)||"
>"||scan(nschst,i+1)||" "||scan(nschst,i+2)),$FSfip.) ne ' '
> then do; nstate=put((scan(nschst,i)||"
>"||scan(nschst,i+1)||" "||scan(nschst,i+2)),$FSfip.); i=l; goto el;
>end;
> el: end;
>run;
>
>
>/** excert of names left out of the above statements ***;
> MASS
> MASSACHUETTS
> MASSACHUSETTE
> MASSACHUSSETTS
> MASSSCHUSETTE
>** it does get these **;
> MASSACHUSETTS
> SPRINGFIELD MA
> SPRINGFIELD, MA
> SPRINGFIELD, MA 01107
> TYNGSBORO MA
> W BARNSTALL, MA
> W BARSTABLE MA
> WELLESLEY HILLS MA
> WELLESLEY, MASSACHUSETTS
> WEST BARNSTABLE MA
> WORCESTER MA
>*******************************************************/;

First, note that SOUNDEX() and SPEDIS() are two very different critters.

And note that there are some very nice tools like COMPGED() and
COMPLEV() which are also worth looking into here.

SOUNDEX() might actually work in your context. It oguht to catch common
phonetic misspellings, since what it does is compress out the vowels and
then treat similar-sounding consonants as being identical. You could
build a format or hash that holds all the soundex-compressed versions
of the full state names, and try to match against that.

However, tools like COMPLEV() yield an edit distance for a pair of words.
You could compute COMPLEV(yourword,statename{i}) for all 50 states
and then look at anything which has at least one edit distance below
some reasonable cutoff that you would assign by experimentation.

You can't get away from at least *some* false positives and false negatives.
Expect that the more you try to avoid one, the more of the other you
will get.

HTH,
David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330

_________________________________________________________________
Windows Live Spaces is here! It?s easy to create your own personal Web site.
http://spaces.live.com/signup.aspx
From: "Howard Schreier <hs AT dc-sug DOT org>" on
Can it be presumed that the string expresses a state name? Or is it possible
that it's just noise?

I wouldn't use SOUNDEX *with* SPEDIS (or withone of the new v. 9 edit
distance functions). Perhaps they can be used in parallel, one approach
serving to confirm the other or to serve as a "Plan B".

Start by standardizing your string. Change everything to upper case. Replace
non-alpha characters with blanks. Replace multiple blanks with single ones.
Change "NORTH", "SOUTH", and "WEST" to one-letter abbreviations.

There are three types of abbreviations to consider: the 2-letter postal
codes (Florida=FL), traditionally recognized ones (Florida=FLA), and
arbitrary truncations (FLORIDA=FLOR, etc.). Missouri and Mississippi require
5-letter truncations to be differentiated from each other; other names can
be distinguished with fewer letters. You many want to take a few minutes to
build a table for this purpose.

These bits of logic should yield a lot of matches before you have to start
with fuzzy approaches.

On Fri, 25 Aug 2006 21:09:05 -0700, BK <byronkirby(a)GMAIL.COM> wrote:

>hey,
>I'm trying to extract US States from a character string... true to
>from, the dataentry inclueds many misspellings and both full names and
>appriviations. So far, I'm using a few of the built in SAS functions (
>& mods of them) for testing the SCANs to see if they are valid, and
>they work fairly well. Does anyone have any other ideas? I've thought
>about using the SOUNDEX function with SPEDIS to determin if its close,
>but am unsure of a good implememtation.
>
>I was thinking about incorporating this kind of logic
>{score(i)=spedis(soundex(schst),soundex(StateArray(i)))} into an array
>taking the minimum score, but it seems ineffecient;
>
>I'd like to have the code stay structured so that the match between the
>origional value and Actual state is used first and then move toward
>more fuzzy logic.
>
>Thanks in advanced;
>Byron
>
>Here's the code so far:
>
>***************************************************
>
>data fmt(drop=i);
> do i=0 to 56;
> label=fipstate(i);
> start=fipname(i);
> fmtname="$FSfip";
> if label not in ('--', ' ') then output fmt;
> end;
> start="OTHER"; label=" "; OUTPUT;
>run;
>proc sort data=fmt; by start; run;
>proc format cntlin=fmt; run;
>
>
>data test(drop=l);
> set MyData;
> l=(length(schst)-length(compress(schst," ,-'")))+1;
> nschst=compress(schst,".'`");
>
> do i=1 to l;
>** orig. text **;
> if stfips(scan(schst,i)) ne . then do; nstate=scan(schst,i); i=l;
>goto el; end;
>** compressed version **;
> if nstate=' ' and stfips(scan(nschst,i)) ne . then do;
>nstate=scan(nschst,i); i=l; goto el; end;
>** single word states **;
> if nstate=' ' and put(scan(nschst,i),$FSfip.) ne ' ' then do;
>nstate=put(scan(nschst,i),$FSfip.); i=l; goto el; end;
>** for two word states**;
> if nstate=' ' and i < l and put((scan(nschst,i)||"
>"||scan(nschst,i+1)),$FSfip.) ne ' '
> then do; nstate=put((scan(nschst,i)||"
>"||scan(nschst,i+1)),$FSfip.); i=l; goto el; end;
>** for District of Columbia **;
> if nstate=' ' and i+1 < l and put((scan(nschst,i)||"
>"||scan(nschst,i+1)||" "||scan(nschst,i+2)),$FSfip.) ne ' '
> then do; nstate=put((scan(nschst,i)||"
>"||scan(nschst,i+1)||" "||scan(nschst,i+2)),$FSfip.); i=l; goto el;
>end;
> el: end;
>run;
>
>
>/** excert of names left out of the above statements ***;
> MASS
> MASSACHUETTS
> MASSACHUSETTE
> MASSACHUSSETTS
> MASSSCHUSETTE
>** it does get these **;
> MASSACHUSETTS
> SPRINGFIELD MA
> SPRINGFIELD, MA
> SPRINGFIELD, MA 01107
> TYNGSBORO MA
> W BARNSTALL, MA
> W BARSTABLE MA
> WELLESLEY HILLS MA
> WELLESLEY, MASSACHUSETTS
> WEST BARNSTABLE MA
> WORCESTER MA
>*******************************************************/;