Open Source Address Correction / Parser with Fuzzy Matching

Here is a somewhat detailed question related to address parsing/geocoding which I feel should be interesting to many users.

So, essentially I am curious to know if anyone has had any experience installing, building, or extending an open-source geocoding and/or address-correction tool.

I am aware of the Geocoder:US 2.0 initiative, which I think is maintained by GeoCommons, but I am unsure whether there are better alternatives or other open-source tools, whether its system can be effectively extended, or whether there are developments I might not be aware of.

My goals are as follows:

  1. I need a highly accurate tool capable of automatically parsing and/or standardizing location data entered by users in a single input field, in real time and at the highest possible throughput.
  2. Input data would be one or more of the following address components: zipcode, county, city, street, address, state.
  3. Input data also needs to be matched against our custom geonames database. For example, a user may enter the name of a neighborhood or a non-USPS location name, which naturally are not standard address components.
Given these goals, I am well aware that with a single form field for such a lookup, each user will enter data in a different format, and misspellings are the other common source of error.

Besides utilizing the census database as the core source of valid addresses/ranges (which I believe Geocoder:US does), some ability to define known "aliases" would be ideal for known misspellings of street names. The same goes for variations such as a user entering Ave versus Ave. versus Avenue. I don't think such alias capabilities are fully possible with the Geocoder:US tool.
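To illustrate the kind of alias handling I have in mind, here is a minimal sketch in Python. The alias tables and function names are hypothetical, just hand-maintained dictionaries standing in for a real alias store; a production system would presumably load these from a database:

```python
# Hypothetical alias tables: canonical forms for common suffix
# abbreviations and for known misspellings of local street names.
STREET_SUFFIX_ALIASES = {
    "ave": "avenue", "ave.": "avenue", "av": "avenue",
    "st": "street", "st.": "street",
    "blvd": "boulevard", "blvd.": "boulevard",
}

STREET_NAME_ALIASES = {
    "lexingtin": "lexington",  # example of a known misspelling
}

def normalize_token(token):
    """Lower-case a token and replace it with its canonical form if aliased."""
    t = token.lower()
    t = STREET_NAME_ALIASES.get(t, t)
    return STREET_SUFFIX_ALIASES.get(t, t)

def normalize_street(street):
    """Normalize every token of a street string, e.g. 'Lexingtin Ave.'."""
    return " ".join(normalize_token(t) for t in street.split())
```

With this, "Lexingtin Ave." and "Lexington Avenue" normalize to the same string before any matching is attempted, which is exactly the behavior I don't think Geocoder:US gives me out of the box.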

While the above elements may indeed solve the majority of issues, I think some type of effective fuzzy matching needs to exist for the cases where the input can't be matched with a high enough confidence percentage.

If the input can be parsed into individual elements based on some assumed rules, a "match score" component could then fuzzy-match any unmatched elements, anchored on those elements that were already matched with a high degree of confidence.
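As a rough sketch of what I mean by a match score, the standard-library `difflib.SequenceMatcher` gives a similarity ratio in [0, 1] that could serve as the score; the threshold value and function names here are just assumptions for illustration:

```python
from difflib import SequenceMatcher

def match_score(candidate, query):
    """Similarity ratio in [0, 1] between a candidate name and the user's input."""
    return SequenceMatcher(None, candidate.lower(), query.lower()).ratio()

def best_fuzzy_match(query, candidates, threshold=0.8):
    """Return the best-scoring candidate above the threshold, else None.

    In practice `candidates` would already be narrowed down using the
    elements that matched with high confidence (e.g. streets within the
    confirmed zipcode), which keeps the fuzzy pass cheap."""
    if not candidates:
        return None
    scored = [(match_score(c, query), c) for c in candidates]
    score, best = max(scored)
    return best if score >= threshold else None
```

A dedicated metric such as Levenshtein or Jaro-Winkler distance would likely do better on short street names, but the shape of the component would be the same: score every candidate in the narrowed set, then accept only above a cutoff.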

For example: I assume that for geocoding to be as effective as possible, we need to extract individual data elements from the input field first, in an attempt to narrow down the "area" the user is trying to find results for. In my view this means a 5-digit number could be assumed to be a zipcode; if another element, such as a city name, matches that zipcode, we can assume we have the "area" correct. Next we use the remaining data to find full, partial, or fuzzy matches, score them, and list possible results.
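The zipcode heuristic above could look something like this. The `ZIP_TO_CITY` table is a hypothetical stand-in for a census-derived lookup, and the rule set is deliberately naive (a 5-digit house number would defeat it), so this is only a sketch of the first "narrow the area" pass:

```python
import re

# Hypothetical zipcode -> city lookup, standing in for a census-derived table.
ZIP_TO_CITY = {"10001": "new york", "60614": "chicago"}

def extract_area(raw):
    """Pull a 5-digit token out of free-form input and, if a city in the
    remaining text agrees with it, treat the pair as the confirmed 'area'.

    Returns (zipcode, confirmed_city_or_None, remaining_tokens)."""
    tokens = raw.lower().replace(",", " ").split()
    zipcode = next((t for t in tokens if re.fullmatch(r"\d{5}", t)), None)
    rest = [t for t in tokens if t != zipcode]
    city = ZIP_TO_CITY.get(zipcode)
    confirmed = city is not None and city in " ".join(rest)
    return zipcode, (city if confirmed else None), rest
```

Everything left in `rest` after the area is confirmed would then feed the full/partial/fuzzy matching stage described above.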

In any case, I would greatly appreciate any advice, performance stats, or upcoming developments people are aware of that might adjust my direction (such as the use of PostGIS 2.0 as a means for enhanced matching capabilities).


