From: BK on
Without using the SPEDIS and SOUNDEX functions, I've gotten it down to
<2% of the non-missing values not being assigned a state. I did do
some testing with those functions and did indeed get MANY incorrect
results, some of which I would never have expected, as they were SO far
off.

The data is read in using the $UpperW. informat, so that's been taken care
of, and SCAN by default treats consecutive delimiters as one. In my
multi-word passes I force a single-space delimiter before testing, so
that's taken care of too... (though I'll probably take the suggestion, as it's
good for permanently cleaning the orig field)
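
As a rough illustration of the clean-up being discussed (upper-casing, turning non-alpha characters into blanks, collapsing repeated blanks, and shortening the direction words), here is a minimal sketch in Python rather than SAS. The function name `standardize` and the `DIRECTIONS` table are hypothetical names chosen for this example, not anything from the original code.

```python
import re

# Assumed mapping for the direction words mentioned in the thread.
DIRECTIONS = {"NORTH": "N", "SOUTH": "S", "WEST": "W"}

def standardize(raw):
    """Return an upper-cased, single-space-delimited version of raw."""
    text = raw.upper()
    # Replace every run of non-alpha characters with one blank.
    text = re.sub(r"[^A-Z]+", " ", text)
    # split() with no argument treats consecutive blanks as one delimiter
    # and drops leading/trailing blanks.
    words = [DIRECTIONS.get(word, word) for word in text.split()]
    return " ".join(words)
```

Storing the result back into the field would give the permanent cleaning mentioned above; keeping the raw value in a separate column is probably wise while the matching rules are still changing.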

I think the biggest issue I have left is the non-standard but
traditional abbreviations for the states.
Thanks to all!!!!

Byron

> Start by standardizing your string. Change everything to upper case. Replace
> non-alpha characters with blanks. Replace multiple blanks with single ones.
> Change "NORTH", "SOUTH", and "WEST" to one-letter abbreviations.
>
> There are three types of abbreviations to consider: the 2-letter postal
> codes (Florida=FL), traditionally recognized ones (Florida=FLA), and
> arbitrary truncations (FLORIDA=FLOR, etc.). Missouri and Mississippi require
> 5-letter truncations to be differentiated from each other; other names can
> be distinguished with fewer letters. You may want to take a few minutes to
> build a table for this purpose.
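
One way to build such a table is to compute, for each name, the shortest truncation that no other name shares. The sketch below is in Python for illustration only and uses a deliberately small sample list; the function names are hypothetical, and a real run would use the full, already standardized list of state names.

```python
def common_prefix_len(a, b):
    """Number of leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def shortest_unique_prefix_lengths(names):
    """Map each name to the shortest truncation length that identifies it."""
    result = {}
    for name in names:
        longest_shared = max(
            (common_prefix_len(name, other) for other in names if other != name),
            default=0,
        )
        # One character past the longest shared prefix is enough,
        # capped at the full length of the name.
        result[name] = min(longest_shared + 1, len(name))
    return result

# Small sample only, not the full list of states.
SAMPLE = ["MISSOURI", "MISSISSIPPI", "MICHIGAN", "MINNESOTA",
          "TENNESSEE", "TEXAS", "FLORIDA"]
MIN_LEN = shortest_unique_prefix_lengths(SAMPLE)
```

On this sample, MISSOURI and MISSISSIPPI both come out at 5, which matches the observation above.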

From: "Howard Schreier <hs AT dc-sug DOT org>" on
On Tue, 29 Aug 2006 05:23:42 -0700, BK <byronkirby(a)GMAIL.COM> wrote:

>Without using the SPEDIS and SOUNDEX functions, I've gotten it down to
><2% of the non-missing values not being assigned a state. I did do
>some testing with those functions and did indeed get MANY incorrect
>results, some of which I would never have expected, as they were SO far
>off.

If you have Version 9, try COMPGED instead of SPEDIS. One advantage is that
you can use CALL COMPCOST to tune the coefficients used by COMPGED.
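
To show why tunable costs matter, here is a minimal Python sketch of an edit distance with adjustable per-operation costs. It is not COMPGED: it is a plain Levenshtein-style distance with only three operations, and the function names and parameters are hypothetical. The idea it illustrates is the same, though: changing the cost of an operation changes which candidates count as "close".

```python
def weighted_edit_distance(source, target,
                           insert_cost=1, delete_cost=1, replace_cost=1):
    """Cheapest total cost of turning source into target."""
    # prev[j] = cost of turning the processed part of source into target[:j]
    prev = [j * insert_cost for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        curr = [i * delete_cost]
        for j, t in enumerate(target, start=1):
            substitute = prev[j - 1] + (0 if s == t else replace_cost)
            curr.append(min(prev[j] + delete_cost,      # drop s
                            curr[j - 1] + insert_cost,  # add t
                            substitute))                # keep or replace
        prev = curr
    return prev[-1]

def closest(value, candidates, max_cost, **costs):
    """Best-scoring candidate, or None if even the best exceeds max_cost."""
    best = min(candidates,
               key=lambda c: weighted_edit_distance(value, c, **costs))
    if weighted_edit_distance(value, best, **costs) <= max_cost:
        return best
    return None
```

The `max_cost` cutoff is the guard against the "SO far off" matches: anything whose best score is above the threshold is left unassigned rather than forced onto the nearest state.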

>
>The data is read in using the $UpperW. informat, so that's been taken care
>of, and SCAN by default treats consecutive delimiters as one. In my
>multi-word passes I force a single-space delimiter before testing, so
>that's taken care of too... (though I'll probably take the suggestion, as it's
>good for permanently cleaning the orig field)
>
>I think the biggest issue I have left is the non-standard but
>traditional abbreviations for the states.

Most of those are what I called truncations (e.g., TENN for TENNESSEE). If you
systematically handle the truncations, there should be only a handful of
other traditional abbreviations (e.g., PENNA for PENNSYLVANIA).
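
A minimal Python sketch of that two-step approach, for illustration only: a short exception table for the abbreviations that are not truncations, then a prefix match that accepts a truncation only when it points to exactly one state. The tables are a small sample, and the names `STATES`, `TRADITIONAL`, and `resolve` are hypothetical. Input is assumed to be already upper-cased and cleaned.

```python
# Sample only: full name -> postal code.
STATES = {
    "TENNESSEE": "TN", "PENNSYLVANIA": "PA", "MISSOURI": "MO",
    "MISSISSIPPI": "MS", "FLORIDA": "FL", "TEXAS": "TX",
}

# Traditional abbreviations that are not prefixes of the full name.
TRADITIONAL = {"PENNA": "PA", "FLA": "FL"}

def resolve(value, min_len=3):
    """Return a postal code for value, or None if it cannot be decided."""
    if value in TRADITIONAL:
        return TRADITIONAL[value]
    if value in STATES.values():      # already a 2-letter postal code
        return value
    if len(value) < min_len:          # too short to trust as a truncation
        return None
    matches = [code for name, code in STATES.items()
               if name.startswith(value)]
    # Accept only an unambiguous truncation.
    return matches[0] if len(matches) == 1 else None
```

With these sample tables, TENN resolves to TN and MISSO to MO, while MISS is left unassigned because it fits both Missouri and Mississippi.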

>Thanks to all!!!!
>
>Byron