From: Albert van der Horst on
In article <4bcddc5a$0$1630$742ec2ed(a)news.sonic.net>,
John Nagle <nagle(a)animats.com> wrote:
>Iain King wrote:
>> Not sure on the volume of addresses you're working with, but as an
>> alternative you could try grabbing the zip code, looking up all
>> addresses in that zip code, and then finding whatever one of those
>> address strings most closely resembles your address string (smallest
>> Levenshtein distance?).
>
> The parser doesn't have to be perfect, but it should
>reliably report when it fails. Then I can run the hard cases through
>one of the commercial online address standardizers. I'd like to
>be able to knock off the easy cases cheaply.

In a similar situation (analysing assembler code sequences for their
stack effect) I did the exact reverse.
I made a list of all exceptions and checked against that first.
If an input is not an exception, the rule should apply.
If it doesn't, call Houston.
(Of course, one starts by making the input canonical: all upper case,
maybe reordering, etc.)
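
Applied to addresses, the scheme could look roughly like the sketch
below. Everything in it is my own illustration, not a tested parser:
the EXCEPTIONS table, the RULE pattern, the short lists of directionals
and suffixes, and the function names are all assumptions. The point is
only the order of the steps: canonicalize, check the exceptions, apply
the rule, and report failure when the rule does not fit.

```python
import re

# Hypothetical exception table: canonical input -> (number, street name).
EXCEPTIONS = {
    "ONE MICROSOFT WAY": ("1", "MICROSOFT"),
}

# Hypothetical rule: digits, optional directional, name, optional suffix.
# The directional and suffix lists are deliberately incomplete.
RULE = re.compile(
    r"^(?P<number>\d+)\s+"
    r"(?:(?:N|S|E|W|NE|NW|SE|SW)\s+)?"
    r"(?P<name>[A-Z0-9 ]+?)"
    r"(?:\s+(?:ST|AVE|BLVD|RD|DR|LN|WAY|CT))?$"
)


def canonical(text):
    """Upper-case, turn punctuation into spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.upper())
    return " ".join(text.split())


def extract(address):
    """Return (number, street name), or None to signal 'call Houston'."""
    key = canonical(address)
    if key in EXCEPTIONS:          # exceptions are checked first
        return EXCEPTIONS[key]
    match = RULE.match(key)        # otherwise the rule should apply
    if match is None:
        return None                # the rule does not fit: report failure
    return match.group("number"), match.group("name")
```

A caller would send every address for which extract() returns None to
the expensive standardizer, and add recurring failures to the
exception table.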

>
> What I want to do is to first extract the street number and
>undecorated street name only, match that to a large database of US businesses
>stored in MySQL, and then find the best match from the database
>hits. So I need reliable extraction of undecorated street name and number. The
>other fields are less important.

This kind of problem remains very tricky ...
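
For the last step, picking the best match from the database hits, the
edit distance Iain King mentioned is one possible measure. The sketch
below is only an illustration under my own assumptions: the candidates
are taken to be canonical strings already fetched from the database,
and best_match and the max_distance cutoff of 3 are made up for the
example.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        prev = cur
    return prev[-1]


def best_match(query, candidates, max_distance=3):
    """Return the closest candidate, or None if nothing is close enough."""
    if not candidates:
        return None
    best = min(candidates, key=lambda c: levenshtein(query, c))
    # max_distance is an arbitrary cutoff so that bad matches are reported
    # as failures instead of being accepted silently.
    return best if levenshtein(query, best) <= max_distance else None
```

Returning None when nothing is close keeps the property John asked
for: the cheap path reports its failures.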

At least in the Netherlands we have a book that specifies how a street
name should officially be spelled using a limited number of
characters.

>
> John Nagle

Groetjes Albert

--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert(a)spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst