Usable street address parser in Python? [Python]


From: Tim Roberts on 20 Apr 2010 02:53

John Nagle <nagle(a)animats.com> wrote:
>
> Unfortunately, now it won't run with the released
>version of "pyparsing" (1.5.2, from April 2009), because it uses
>"originalTextFor", a feature introduced since then. I worked around that,
>but discovered that the new version is case-sensitive. Changed
>"Keyword" to "CaselessKeyword" where appropriate.
>
> I put in the full list of USPS street types, and discovered
>that "1500 DEER CREEK LANE" still parses with a street name
>of "DEER", and a street type of "CREEK", because "CREEK" is a
>USPS street type. Need to do something to pick up the last street
>type, not the first. I'm not sure how to do that with pyparsing.
>Maybe if I buy the book...
>
> There's still a problem with: "2081 N Webb Rd", where the street name
>comes out as "N WEBB".
>Addresses like "1234 5th St. S." yield a street name of "5 TH",
>but if the directional is before the name, it ends up with the name.
>
> Getting closer, though. If I can get to 95% of common cases, I'll
>be happy.

This is a very tricky problem. Consider Salem, Oregon, which puts the
direction after the street:

3340 Astoria Way NE
Salem, OR 97303

Consider northern Los Angeles County, which uses directions both before and
after. I used to live at:

44720 N 2nd St E
Lancaster, CA 93534

Consider much of Utah, which is both easy (because of its very neat grid)
and a pain, because of addresses like:

389 W 1700 S
Salt Lake City, UT 84115
--
Tim Roberts, timr(a)probo.com
Providenza & Boekelheide, Inc.

From: John Yeung on 20 Apr 2010 03:24

My response is similar to John Roth's. It's mainly just sympathy. ;)

I deal with addresses a lot, and I know that a really good parser is
both rare/expensive to find and difficult to write yourself. We have
commercial, USPS-certified products where I work, and even with those
I've written a good deal of pre-processing and post-processing code,
consisting almost entirely of very silly-looking fixes for special
cases.

I don't have any experience whatsoever with pyparsing, but I will say
I agree that you should try to get the street type from the end of the
line. Just be aware that it can be valid to leave off the street type
completely. And of course it's a plus if you can handle suites that
are on the same line as the street (which is where the USPS prefers
them to be).

I would take the approach which John R. seems to be suggesting, which
is to tokenize and then write a whole bunch of very hairy, special-
case-laden logic. ;) I'm almost positive this is what all the
commercial packages are doing, and I have a tough time imagining what
else you could do. Addresses inherently have a high degree of
irregularity.
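
A minimal sketch of the tokenize-then-work-from-the-end idea, not what any commercial package actually does. The names split_street, STREET_TYPES and DIRECTIONALS are made up for illustration, and the two sets are tiny stand-ins for the real USPS lists. It peels a trailing directional, then a trailing street type, then a leading directional, and never consumes the last remaining token, so a street name always survives:

```python
import re

# Illustrative subsets only; the real USPS suffix list is far longer.
STREET_TYPES = {"ST", "STREET", "AVE", "AVENUE", "RD", "ROAD", "LN", "LANE",
                "WAY", "DR", "DRIVE", "BLVD", "CREEK"}
DIRECTIONALS = {"N", "S", "E", "W", "NE", "NW", "SE", "SW"}


def split_street(line):
    """Split one street line into parts; return None if it has no leading number."""
    # Upper-case and drop punctuation, so "St." and "S." become "ST" and "S".
    tokens = re.findall(r"[A-Z0-9]+", line.upper())
    if len(tokens) < 2 or not tokens[0].isdigit():
        return None
    number, rest = tokens[0], tokens[1:]

    # Work inward from the END of the line, always leaving at least one
    # token behind to serve as the street name.
    post_dir = None
    if len(rest) > 1 and rest[-1] in DIRECTIONALS:
        post_dir = rest.pop()
    street_type = None
    if len(rest) > 1 and rest[-1] in STREET_TYPES:
        street_type = rest.pop()  # last street type wins, so "CREEK" stays in the name
    pre_dir = None
    if len(rest) > 1 and rest[0] in DIRECTIONALS:
        pre_dir = rest.pop(0)

    return {"number": number, "pre_dir": pre_dir, "name": " ".join(rest),
            "type": street_type, "post_dir": post_dir}
```

With these assumptions, "1500 DEER CREEK LANE" keeps "DEER CREEK" as the name, and "389 W 1700 S" comes out with no street type at all. Suites on the same line, fractional or lettered house numbers, and the many other special cases are deliberately not handled here.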

Good luck!

John Y.

From: Iain King on 20 Apr 2010 05:23

On Apr 20, 8:24 am, John Yeung <gallium.arsen...(a)gmail.com> wrote:
> My response is similar to John Roth's. It's mainly just sympathy. ;)
>
> I deal with addresses a lot, and I know that a really good parser is
> both rare/expensive to find and difficult to write yourself. We have
> commercial, USPS-certified products where I work, and even with those
> I've written a good deal of pre-processing and post-processing code,
> consisting almost entirely of very silly-looking fixes for special
> cases.
>
> I don't have any experience whatsoever with pyparsing, but I will say
> I agree that you should try to get the street type from the end of the
> line. Just be aware that it can be valid to leave off the street type
> completely. And of course it's a plus if you can handle suites that
> are on the same line as the street (which is where the USPS prefers
> them to be).
>
> I would take the approach which John R. seems to be suggesting, which
> is to tokenize and then write a whole bunch of very hairy, special-
> case-laden logic. ;) I'm almost positive this is what all the
> commercial packages are doing, and I have a tough time imagining what
> else you could do. Addresses inherently have a high degree of
> irregularity.
>
> Good luck!
>
> John Y.

Not sure on the volume of addresses you're working with, but as an
alternative you could try grabbing the zip code, looking up all
addresses in that zip code, and then finding whatever one of those
address strings most closely resembles your address string (smallest
Levenshtein distance?).
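
A rough sketch of that closest-match step, assuming the candidate address strings for the zip code have already been fetched from somewhere (that lookup is not shown, and the function names are made up for illustration). The distance function is a plain dynamic-programming Levenshtein, written out so that no third-party package is needed:

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_address(query, candidates):
    """Return the candidate with the smallest edit distance, or None if there are none."""
    if not candidates:
        return None
    normalized = query.upper().strip()
    return min(candidates,
               key=lambda c: levenshtein(normalized, c.upper().strip()))
```

Something like closest_address("1500 Deer Creek Ln", candidates) would then pick the nearest string. A cutoff on the distance would probably be needed in practice, so that a bad address is rejected rather than matched to its least-bad neighbour.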

Iain

From: Grant Edwards on 20 Apr 2010 09:41

On 2010-04-20, Tim Roberts <timr(a)probo.com> wrote:

> This is a very tricky problem. Consider Salem, Oregon, which puts the
> direction after the street:
>
> 3340 Astoria Way NE
> Salem, OR 97303

In Minneapolis, the direction comes before the street in some
quadrants and after it in others. I used to live on W 43rd Street.
Now I live on 24th Ave NE. And just to be more inconsistent, only the
"NE" section uses two directions; everywhere else it's just W, S, N,
or E.

--
Grant Edwards, grant.b.edwards at gmail.com
Yow! Is it NOUVELLE CUISINE when 3 olives are struggling with a scallop
in a plate of SAUCE MORNAY?

From: John Nagle on 20 Apr 2010 13:16

Iain King wrote:
> Not sure on the volume of addresses you're working with, but as an
> alternative you could try grabbing the zip code, looking up all
> addresses in that zip code, and then finding whatever one of those
> address strings most closely resembles your address string (smallest
> Levenshtein distance?).

The parser doesn't have to be perfect, but it should
reliably report when it fails. Then I can run the hard cases through
one of the commercial online address standardizers. I'd like to
be able to knock off the easy cases cheaply.

What I want to do is to first extract the street number and
undecorated street name only, match that to a large database of US businesses
stored in MySQL, and then find the best match from the database
hits. So I need reliable extraction of undecorated street name and number. The
other fields are less important.
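
That workflow might be sketched roughly as below. Everything here is an assumption made for illustration: the table and column names are invented, the sample row is made-up data, and sqlite3 stands in for MySQL so the example runs on its own (the common MySQL drivers use %s placeholders rather than ?). The parse argument is whatever extraction function is in use, with the one requirement that it returns None when it cannot parse a line:

```python
import sqlite3


def find_candidates(conn, number, name):
    """Return (id, business) rows whose street number and undecorated name match."""
    cur = conn.execute(
        "SELECT id, business FROM businesses "
        "WHERE street_number = ? AND street_name = ? ORDER BY id",
        (number, name),  # parameterized, never string-formatted into the SQL
    )
    return cur.fetchall()


def route(lines, parse):
    """Split lines into (parsed, failed) using any parse(line) -> key or None."""
    parsed, failed = [], []
    for line in lines:
        key = parse(line)
        if key is None:
            failed.append(line)  # hard case: hand off to an external standardizer
        else:
            parsed.append((line, key))
    return parsed, failed


def make_demo_db():
    """Build a tiny in-memory table with one made-up row (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE businesses (id INTEGER PRIMARY KEY, business TEXT, "
        "street_number TEXT, street_name TEXT)"
    )
    conn.execute(
        "INSERT INTO businesses (business, street_number, street_name) "
        "VALUES (?, ?, ?)",
        ("Example Feed Store", "1500", "DEER CREEK"),
    )
    return conn
```

The lines in the failed list are the ones that would go to the commercial standardizer. Picking the best match among the rows returned for the parsed ones is a separate step not shown here.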

John Nagle
