Texas A&M Geography
Address Parsing and Normalization

Technical Details

Address parsing is the process of dividing a single address string into its individual component parts, while address normalization converts these parts into their standardized equivalents. The services on this page perform both tasks deterministically: one processes a single record at a time, and the other processes a database of records in batch. For address parsing, a set of rules has been developed that uses tokenization on the white space between words, the ordering of the tokens, and a series of alias tables to identify each of the individual address components.
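As an illustration only, the rule-based approach described above can be sketched in a few lines of Python. The alias tables, the rule order, and the function name parse_street_address are assumptions made for this sketch; they are not taken from the service, whose tables and rules are far more extensive.

```python
# Hypothetical, heavily reduced alias tables (assumptions for this sketch).
DIRECTIONALS = {"N", "S", "E", "W", "NORTH", "SOUTH", "EAST", "WEST"}
SUFFIXES = {"ST", "STREET", "AVE", "AVENUE", "BLVD", "BOULEVARD", "RD", "ROAD"}


def parse_street_address(address):
    """Label the tokens of a street address using token order and alias tables."""
    # Tokenize on white space; commas and periods are dropped first.
    tokens = address.replace(",", " ").replace(".", "").split()
    parts = {
        "number": "",
        "pre_directional": "",
        "name": "",
        "suffix": "",
        "post_directional": "",
    }

    # Rule 1: a leading token that starts with a digit is the house number.
    if tokens and tokens[0][0].isdigit():
        parts["number"] = tokens.pop(0)

    # Rule 2: a trailing directional is a post-directional, provided at
    # least one other token remains to serve as the street name.
    if len(tokens) > 1 and tokens[-1].upper() in DIRECTIONALS:
        parts["post_directional"] = tokens.pop()

    # Rule 3: a trailing suffix alias is the street suffix.
    if len(tokens) > 1 and tokens[-1].upper() in SUFFIXES:
        parts["suffix"] = tokens.pop()

    # Rule 4: a leading directional is a pre-directional, again only if a
    # name token is left over (so "100 North Street" keeps "North" as name).
    if len(tokens) > 1 and tokens[0].upper() in DIRECTIONALS:
        parts["pre_directional"] = tokens.pop(0)

    # Whatever remains is the street name.
    parts["name"] = " ".join(tokens)
    return parts
```

With these sample tables, parse_street_address("123 North Main Street") labels 123 as the number, North as the pre-directional, Main as the name, and Street as the suffix. The suffix and post-directional are consumed before the pre-directional so that a directional word can still act as the street name when nothing else is left.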

Once the types of the tokens have been identified, normalization uses further alias tables in a best-effort attempt to convert the components to the desired output format. Currently, we have fairly compliant implementations of the USPS Publication 28 Standard, the US Census TIGER/Line Address Format, and the address format used by the Los Angeles County Chief Information Office. Details of these formats are below.
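A minimal sketch of alias-table normalization follows, assuming a USPS Publication 28 style output in which suffixes and directionals are replaced by their standard abbreviations. The table contents are a small sample, and the names normalize_component and normalize_usps are hypothetical; the sketch does not reproduce the service's own tables or code.

```python
# Sample alias tables mapping common variants to USPS-style abbreviations.
# These are a small illustrative subset, not the service's actual tables.
USPS_SUFFIX_ALIASES = {
    "STREET": "ST", "STR": "ST", "ST": "ST",
    "AVENUE": "AVE", "AV": "AVE", "AVE": "AVE",
    "BOULEVARD": "BLVD", "BLVD": "BLVD",
    "ROAD": "RD", "RD": "RD",
}
USPS_DIRECTIONAL_ALIASES = {
    "NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W",
    "N": "N", "S": "S", "E": "E", "W": "W",
}

# Which alias table applies to which parsed component (an assumption).
FIELD_TABLES = {
    "pre_directional": USPS_DIRECTIONAL_ALIASES,
    "suffix": USPS_SUFFIX_ALIASES,
    "post_directional": USPS_DIRECTIONAL_ALIASES,
}


def normalize_component(value, alias_table):
    """Look a component up in its alias table; unknown values pass through."""
    key = value.strip().rstrip(".").upper()
    return alias_table.get(key, key)


def normalize_usps(parts):
    """Best-effort normalization of already-parsed components."""
    normalized = dict(parts)
    for field, table in FIELD_TABLES.items():
        if field in normalized:
            normalized[field] = normalize_component(normalized[field], table)
    if "name" in normalized:
        normalized["name"] = normalized["name"].strip().upper()
    return normalized
```

Values that are missing from a table are returned unchanged apart from upper-casing, which mirrors the best-effort behavior described above: an unrecognized suffix is kept rather than rejected.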

Please note that this software is not USPS CASS Certified, but it works fairly well on most normal address data and is constantly being improved. If you encounter any problems or find any cases where the software fails, we would like to hear about it so we can work to improve the quality of the services. Please let us know.

Version 4.01 - February 4 2013

The Texas A&M Geoservices Geocoder uses a deterministic, token-based, context-aware substitution table strategy to parse the input data.

Technical Reports

The following technical report details the inner workings of the Texas A&M Geoservices Geocoding and Parsing Platform:

Goldberg, D.W., 2009. The USC WebGIS Open Source Geocoding Platform. Technical Report No 11. Los Angeles CA: University of Southern California GIS Research Laboratory. Available online at: http://gislab.usc.edu/i/publications/gislabtr11.pdf.

Address Formats

Users can select the output format in which they would like their data returned. Currently, we are fairly compliant with the following address formats.
Format
USPS Publication 28 Address Standard
US Census TIGER/Line Address Format
Los Angeles County Chief Information Office
The main difference between these formats is the granularity into which the address components are broken up. The following table lists the address components that each address format reports. An X means the format supports a separate field for the address component, while a - means that the address component is not supported. When a component is not supported, it is instead included as part of the Name field.
Address component       USPSPublication28  USCensusTiger  LACounty  Examples
Number                  X                  X              X         123
Number Fractional       X                  X              X         1/2
Pre-Directional         X                  X              X         North, S, E, West
Pre-Qualifier           -                  X              X         Old, New, Business
Pre-Type                -                  X              X         Route, US HWY, Avenue, Via, Paseo
Pre-Article             -                  -              X         La, De La, Del
Name                    X                  X              X         Main
Post-Article            -                  -              X         La, De La, Del
Suffix                  X                  X              X         Street, Blvd
Post-Qualifier          -                  X              X         Private, Bypass
Post-Directional        X                  X              X         North, S, E, West
Suite Type              X                  X              X         Apt, Suite, FL
Suite Number            X                  X              X         24A
Post Office Box Type    X                  X              X         PO Box, RR, HC
Post Office Box Number  X                  X              X         0255
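
The rule that unsupported components end up in the Name field can be sketched as follows. The set of optional components per format is read from the table above; the dictionary layout, the function name fold_into_name, and the order in which folded components are joined around the name are assumptions made for illustration, since this page does not specify them.

```python
# Optional components each format reports as a separate field,
# taken from the X / - columns of the table above.
SUPPORTED = {
    "USPSPublication28": set(),
    "USCensusTiger": {"pre_qualifier", "pre_type", "post_qualifier"},
    "LACounty": {
        "pre_qualifier", "pre_type", "pre_article",
        "post_article", "post_qualifier",
    },
}

# Assumed join order of the optional components around the street name.
BEFORE_NAME = ["pre_qualifier", "pre_type", "pre_article"]
AFTER_NAME = ["post_article", "post_qualifier"]


def fold_into_name(parts, output_format):
    """Merge components the output format lacks into the name field."""
    supported = SUPPORTED[output_format]
    optional = set(BEFORE_NAME + AFTER_NAME)

    # Keep every field except optional ones the format does not support.
    result = {k: v for k, v in parts.items()
              if k not in optional or k in supported}

    # Collect the unsupported values on each side of the name.
    before = [parts.get(k, "") for k in BEFORE_NAME if k not in supported]
    after = [parts.get(k, "") for k in AFTER_NAME if k not in supported]

    pieces = before + [parts.get("name", "")] + after
    result["name"] = " ".join(p for p in pieces if p)
    return result
```

For an address parsed as Pre-Type "Via", Pre-Article "De La", and Name "Paz", this sketch yields the name "Via De La Paz" for USPSPublication28, "De La Paz" (with Pre-Type kept separate) for USCensusTiger, and "Paz" with all fields separate for LACounty.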

Known Bugs

The following list contains the set of known bugs for this release of the address parsing/normalization service. We are constantly working to improve the service and will be addressing these bugs in future releases. If you discover or suspect another bug, please report it.

ID  Fixed       Description
1               Only street address data are currently parsed - not the city, state, and zip portions of an address
2   04/15/2009  PO Box addresses are not currently parsed
3   04/15/2009  Rural Route addresses are not currently parsed
4   04/15/2009  Highway Contract addresses are not currently parsed
5               Street intersections are not parsed