LocationTools is an open source platform, developed under the direction of data scientists at the National Fire Protection Association (NFPA), that efficiently geocodes very large amounts of data. For fast bulk geocoding, LocationTools combines a suite of technologies such as Spark, Lucene, and libpostal to run on a Cloudera Hadoop cluster. LocationTools also provides an easily managed web API for single-address geocoding that can be deployed with Docker. Finally, LocationTools uses U.S. Census TIGER files as its base layer of street address location data. To learn how to use LocationTools on your machine or cluster, read the information below or visit the project webpage at https://nfpa.github.io/LocationTools.
We're looking for partners to help maintain and grow LocationTools. If you're interested, please reach out to us at [email protected]
Geocoding millions of address points is technically challenging, time consuming, and often very expensive. NFPA encountered these problems when geocoding fire department response information from the National Fire Incident Reporting System (NFIRS), a data set disseminated annually by the United States Fire Administration (USFA) containing information from more than 20,000 fire departments across the United States. A single year of NFIRS data contains records of 20-25 million incidents that fire departments responded to, including events such as structure fires, EMS calls, and HAZMAT incidents. While individual fire departments often collect and store geolocation data for the incidents they respond to, the public data provided by USFA contains only street addresses for those incidents. Complicating matters, those street addresses are often only partially filled out, leaving us with a very large amount of very messy data.
We quickly discovered that standard approaches to geocoding, whether through commercial services or existing open source platforms, couldn't produce results that were accurate, fast, and cost-effective for the NFIRS data. So we launched this project, initially codenamed 'Wandering Moose', to make geocoding huge datasets tractable for downstream spatial analysis. Trade-offs, though, are inevitable, and we focused on speed and cost at the expense of some accuracy. We accepted this trade-off because our use cases didn't require rooftop accuracy: being "close enough" was "good enough" for what we wanted.
While this tool was initially geared towards NFIRS-type datasets, we expect it can be extended to any dataset with address fields. We provide documentation on how it can be set up in different ways (webserver or Docker), with code, examples, and sample result outputs. See the Getting Started section below for more information.
Open source tools such as PostGIS do support geocoding, but they face significant execution bottlenecks on large datasets due to limited distribution capabilities. As a result, geocoding millions of records becomes painful and time-consuming. To solve this problem, we wanted to scale the processing across multiple machines with multiple cores to achieve maximum throughput with the least amount of latency.
Our underlying data comes from the TIGER/Line database, an extensive collection of geographic and cartographic information about the United States from the US Census Bureau. The shapefiles include polygon boundaries of geographic areas and features, linear features such as roads and hydrography, and point features. TIGER data is freely available and does not contain any sensitive data.
We created a Java-based wrapper to download specific versions and layers of the TIGER/Line Shapefiles. GeoSpark utilities then convert all shapefiles to DataFrames and join specific attribute types (such as Faces and Edges) into one large CSV file for indexing (the query used for joining is provided in the project's resource folder).
For indexing, Apache Lucene and the JTS Topology Suite are used for non-GIS and GIS fields respectively. Lucene's spatial package contains different SpatialStrategies for indexing spatial data; we use SpatialPrefixTrees, which decompose geographic shapes into variable-length strings with a given precision. Each string corresponds to a rectangular spatial region, also referred to as a "grid" or "tiles". Each line in the CSV described above is converted to a Lucene document and indexed. The size of the index also depends on the precision of the geohash used for constructing the grid.
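To illustrate the idea behind SpatialPrefixTrees, here is a minimal, stdlib-only geohash encoder. This is a toy sketch of the underlying concept, not Lucene's `GeohashPrefixTree` implementation: each additional character halves the cell repeatedly, so a longer hash is a finer tile whose string is prefixed by all of its coarser ancestors, which is what makes prefix-based index lookups work.

```java
// Toy geohash encoder -- shows how a SpatialPrefixTree turns a point into a
// variable-length string where every prefix names a coarser grid cell.
// NOT the Lucene implementation, just the underlying idea.
public class GeohashSketch {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    public static String encode(double lat, double lon, int precision) {
        double latLo = -90, latHi = 90, lonLo = -180, lonHi = 180;
        StringBuilder hash = new StringBuilder();
        boolean lonBit = true;   // geohash interleaves lon/lat bits, lon first
        int bits = 0, ch = 0;
        while (hash.length() < precision) {
            if (lonBit) {
                double mid = (lonLo + lonHi) / 2;
                if (lon >= mid) { ch = (ch << 1) | 1; lonLo = mid; }
                else            { ch = ch << 1;       lonHi = mid; }
            } else {
                double mid = (latLo + latHi) / 2;
                if (lat >= mid) { ch = (ch << 1) | 1; latLo = mid; }
                else            { ch = ch << 1;       latHi = mid; }
            }
            lonBit = !lonBit;
            if (++bits == 5) {   // 5 bits per base32 character
                hash.append(BASE32.charAt(ch));
                bits = 0; ch = 0;
            }
        }
        return hash.toString();
    }

    public static void main(String[] args) {
        // Coarser vs. finer tiles for the same point (Quincy, MA):
        System.out.println(encode(42.2303, -71.0269, 5)); // drt2p
        System.out.println(encode(42.2303, -71.0269, 9)); // finer tile, same drt2p prefix
    }
}
```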
For geocoding (which, in our approach, corresponds to searching a Lucene index), we support two modes: a single address, or a batch of addresses. In both, jpostal is used to parse and normalize the input address into components (house number, street name, state, ZIP, etc.). The geocoder constructs a composite query based on the quality of the components parsed by jpostal. Searching a single address simply prints the results to standard output. Batch geocoding uses Spark RDD utilities to map the input batch DataFrames as partitions, parallelize execution, and perform fast in-memory searching. We created a GeocodeWrapper that maps the raw output of the geocoder into multiple output formats such as JSON, String, or a Scala Map for Hive. We use the Scala Map to directly create the Hive output table for batch geocoding (all resources are in the Utils folder).
For reverse geocoding, we use the same spatial strategies mentioned above to convert the input LAT, LON into a point, then find all results within a specified distance in kilometers, sorted from closest to farthest.
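The document doesn't spell out the distance math, but the standard way to relate a LAT/LON pair to distances in kilometers is a great-circle (haversine) calculation. The sketch below is illustrative only, not LocationTools' actual code:

```java
// Haversine great-circle distance in km between two lat/lon points -- the
// standard formula behind radius searches like the reverse-geocoding step
// described above. Illustrative sketch, not the project's implementation.
public class Haversine {
    private static final double EARTH_RADIUS_KM = 6371.0;

    public static double distanceKm(double lat1, double lon1,
                                    double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    public static void main(String[] args) {
        // Distance from the Quincy, MA example point to downtown Boston:
        System.out.printf("%.2f km%n",
                distanceKm(42.2303, -71.0269, 42.3601, -71.0589));
    }
}
```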
All results were benchmarked on a CDH cluster with roughly 224 cores (7 data nodes with 32 cores each). Geocoding approximately 25 million records takes 12-14 hours depending on the specifics of the data, significantly better than PostGIS, which took weeks.
Figure 1.0 - Geocoding Architecture
- IntelliJ Idea - JAVA IDE for code development
- Maven - Dependency Management
- Lucene - For Indexing and searching geo-spatial data
- libpostal - NLP based address processing/normalization engine
- US Census Shapefiles - Shapefiles (2018) used for indexing
- GeoSpark - for Spark based geo-processing
- Vert.x - For Simple API Deployment
- CDH - For Cluster Deployment
- Docker - For container based Deployment
These instructions will allow you to get a copy of the project up and running on your local machine for development and testing purposes. You can run the project either by spinning up an API service (localhost:8080), which geocodes addresses sequentially, or in batch mode using the Spark-on-YARN service on a CDH cluster. Both of these runtimes require you to have:
- JAR file built from the project codebase
- Location to Lucene Indexes
- Address Parsing libraries
- Configuration files to control geocoding parameters (`driver.ini` for batch mode and `vertx-conf.json` for API mode)
The setup has been tried and tested only on Linux/Ubuntu 16.04; for Windows, the preferred way is to use the Docker setup.
- Install libpostal from here
- Install the Java bindings (jpostal) for libpostal from here
Shared libraries will be available at `jpostal/src/main/jniLibs`. This location will be used in the driver configuration later.
- Install git (>=2.17) and maven (>=3.6.0):
sudo apt-get update
sudo apt-get install git maven
git clone https://github.com/NFPA/LocationTools.git
sudo mvn clean install
This generates a JAR named `location-tools-1.0-SNAPSHOT.jar` in your `target` directory.
See the sample configuration files `driver.ini` and `vertx-conf.json` in the project root folder.
The `driver.ini` configuration file serves multiple purposes:
- Params for downloading and processing TIGER Census data and creating Lucene indexes (see sections `download`, `process`, and `index`)
- Params for geocoding a single address and multiple files (see sections `geocode` and `batch-geocode`)
- Params for simple reverse geocoding (see section `reverse-geocode`)
[ Note: Make sure that `*.dir` paths end in `/`, except `input.dir`, which can be a regular expression ]
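As a rough illustration only, `driver.ini` might look something like the sketch below. The section and key names here are partly hypothetical (only `states`, `input.dir`, `lucene.index.dir`, and `hive.output.table` are referenced elsewhere in this document); always start from the sample file shipped in the project root.

```
; Hypothetical sketch -- consult the sample driver.ini in the repo for real keys
[download]
states = MA,RI,CT                         ; comma-separated US state codes

[batch-geocode]
lucene.index.dir = /data/lucene-index/    ; note the trailing /
input.dir = /data/input/.*\.tsv           ; input.dir may be a regular expression
hive.output.table = geocode_run_01        ; must be new for each run
```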
The `vertx-conf.json` file contains params for the webserver.
[ Note: Make sure you have about 30 GB free for a complete US build ]
# Download the TIGER data:
# We use TIGER2018 data published by the United States Census Bureau. TIGER data is in shapefile format. We first download the zip files from the Census website. You can specify which states to download in the `driver.ini` file by changing the `states` key in the `[download]` section, as a comma-separated list of US state codes.
```java
java -cp /usr/src/location-tools-1.0-SNAPSHOT.jar org.nfpa.spatial.Driver --download driver.ini
```
# Preprocess the data for indexing:
# For Lucene to consume the downloaded data, we first convert the downloaded shapefiles to CSV and combine information from multiple file types such as FACES, EDGES, STATE, and COUNTY using GeoSpark functionality.
```java
java -cp /usr/src/location-tools-1.0-SNAPSHOT.jar org.nfpa.spatial.Driver --process driver.ini
```
# Build Lucene Indexes:
# We can now index all the CSV files into Lucene.
```java
java -cp /usr/src/location-tools-1.0-SNAPSHOT.jar org.nfpa.spatial.Driver --index driver.ini
```
# You should now have files in the `lucene.index.dir` directory.
# It's always a good idea to inspect the index with Lucene Luke, which you can find in [lucene binary releases](https://lucene.apache.org/core/downloads.html) (Lucene >= 8.1)
# Test geocoding a single address:
```java
java -cp target/location-tools-1.0-SNAPSHOT.jar org.nfpa.spatial.Driver --geocode driver.ini
```
# Test reverse geocoding a single point:
```java
java -cp target/location-tools-1.0-SNAPSHOT.jar org.nfpa.spatial.Driver --reverse-geocode driver.ini
```
You now have the JAR file, the location of the address parser libraries (jpostal), and the location of the generated Lucene indexes. The app can be deployed either as a simple Java API webserver using Vert.x, or in batch mode using a cluster setup.
Starting the web server with the command below creates two endpoints:
- `/geocoder/v1` - geocoding; takes two arguments, `address` and `n` (number of results)
- `/reverse-geocoder/v1` - reverse geocoding; takes arguments `lat`, `lon`, `n`, and `radius`
java -Dvertx.options.blockedThreadCheckInterval=9999 \
-Djava.library.path="/path/to/jniLibs" \
-jar /path/to/location-tools-1.0-SNAPSHOT.jar \
-conf vertx-conf.json
Edit the `vertx-conf.json` file to run on a different host and port.
Testing geocoding endpoint:
curl "http://localhost:8080/geocoder/v1?n=1&address=1%20Batterymarch%20Park%20Quincy%20MA"
Response:
{
"version": "1.0",
"input": "1 Batterymarch Park Quincy MA",
"results": [
{
"ADDRESS_SCORE": "60.0",
"BLKGRPCE": "1",
"BLKGRPCE10": "1",
"BLOCKCE10": "1022",
"COUNTY": "Norfolk",
"COUNTYFP": "021",
"FULLNAME": "Batterymarch Park",
"GEOMETRY": "LINESTRING (-71.02700999999999 42.23033999999999, -71.02670999999998 42.230299999999986, -71.02660999999999 42.23028)",
"INTPTLAT": "42.23122024536133",
"INTPTLON": "-71.02692413330078",
"LFROMADD": "1",
"LINT_LAT": "42.23033893043786",
"LINT_LON": "-71.027001978284",
"LTOADD": "99",
"NAME": "Massachusetts",
"PLACE": "Quincy",
"RFROMADD": "",
"RTOADD": "",
"SEARCH_SCORE": "57.5",
"STATEFP": "25",
"STUSPS": "MA",
"SUFFIX1CE": "",
"TRACTCE": "418003",
"UACE10": "09271",
"ZCTA5CE10": "02169",
"ZIPL": "02169",
"ZIPR": "",
"ip_postal_city": "quincy",
"ip_postal_house_number": "1",
"ip_postal_road": "batterymarch park",
"ip_postal_state": "ma"
}
]
}
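The `LINT_LAT`/`LINT_LON` values in this response lie on the returned `GEOMETRY`, which is consistent with classic address-range interpolation: house number 1 is placed proportionally within the edge's left-side range (`LFROMADD`=1 to `LTOADD`=99). Below is a stdlib-only sketch of that general technique; it is not LocationTools' actual implementation, and the helper names are made up for illustration.

```java
// Classic address-range interpolation along a street edge, as TIGER-based
// geocoders commonly do it. Illustrative sketch only -- not the project's
// actual code.
public class AddressInterpolation {

    /** Places a house number proportionally along a polyline of [lon, lat] points. */
    public static double[] interpolate(int house, int fromAddr, int toAddr,
                                       double[][] line) {
        double frac = (toAddr == fromAddr)
                ? 0.0
                : (double) (house - fromAddr) / (toAddr - fromAddr);
        double total = 0;
        for (int i = 1; i < line.length; i++) total += dist(line[i - 1], line[i]);
        double target = frac * total, walked = 0;
        // Walk the polyline until we reach the target distance, then lerp.
        for (int i = 1; i < line.length; i++) {
            double seg = dist(line[i - 1], line[i]);
            if (seg > 0 && walked + seg >= target) {
                double t = (target - walked) / seg;
                return new double[]{
                        line[i - 1][0] + t * (line[i][0] - line[i - 1][0]),
                        line[i - 1][1] + t * (line[i][1] - line[i - 1][1])};
            }
            walked += seg;
        }
        return line[line.length - 1]; // house at the far end of the range
    }

    // Planar distance is an acceptable approximation over short street edges.
    private static double dist(double[] a, double[] b) {
        return Math.hypot(b[0] - a[0], b[1] - a[1]);
    }

    public static void main(String[] args) {
        // The LINESTRING from the example response, as [lon, lat] points:
        double[][] edge = {{-71.02701, 42.23034}, {-71.02671, 42.23030},
                           {-71.02661, 42.23028}};
        double[] p = interpolate(1, 1, 99, edge); // house 1 of range 1-99
        System.out.println(p[1] + ", " + p[0]);   // maps to the start of the edge
    }
}
```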
Testing Reverse geocoding endpoint:
curl "http://localhost:8080/reverse-geocoder/v1?lat=42.2303&lon=-71.0269&radius=0.01&n=1"
Response:
{
"version": "1.0",
"input": "42.2303, -71.0269",
"results": [
{
"BLKGRPCE": "1",
"BLKGRPCE10": "1",
"BLOCKCE10": "1022",
"COUNTY": "Norfolk",
"COUNTYFP": "021",
"FULLNAME": "Batterymarch Park",
"GEOMETRY": "LINESTRING (-71.02700999999999 42.23033999999999, -71.02670999999998 42.230299999999986, -71.02660999999999 42.23028)",
"INTPTLAT": "42.23122024536133",
"INTPTLON": "-71.02692413330078",
"LFROMADD": "1",
"LTOADD": "99",
"NAME": "Massachusetts",
"PLACE": "Quincy",
"RFROMADD": "",
"RTOADD": "",
"STATEFP": "25",
"STUSPS": "MA",
"SUFFIX1CE": "",
"TRACTCE": "418003",
"UACE10": "09271",
"ZCTA5CE10": "02169",
"ZIPL": "02169",
"ZIPR": ""
}
]
}
Our assumption is that you have already set up a CDH cluster with Spark-on-YARN enabled. Please see the Cloudera documentation on how to run Spark applications on YARN.
Note: In order to run the Spark application, all the nodes in the cluster need to have the previously built Lucene index at the same path. This path is then set as the `lucene.index.dir` key in the `[batch-geocode]` section of `driver.ini`.
In addition to the Lucene index, all the nodes in the cluster need to have libpostal installed and must have the compiled jpostal Java bindings from `jpostal/src/main/jniLibs`. The jniLibs path is passed as config to `spark2-submit` as `--conf spark.driver.extraLibraryPath=/path/to/jniLibs`.
The input files should be TSV (tab-separated values); the first column is taken as `address` and the second as `join_key`.
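For illustration, an input file might look like the two rows below (columns separated by a tab; the `join_key` values are hypothetical identifiers you would use to link results back to your source records):

```
1 Batterymarch Park Quincy MA	rec-0001
123 Main St Worcester MA	rec-0002
```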
This application outputs data directly to Hive. You need to change the `hive.output.table` key for each run, or the Spark application will fail with `AnalysisException: Table already exists`.
spark2-submit --master yarn \
--deploy-mode cluster \
--class "org.nfpa.spatial.Driver" \
--num-executors 80 \
--executor-memory 18G \
--executor-cores 2 \
--files=driver.ini \
--conf "spark.storage.memoryFraction=1" \
--conf "spark.yarn.executor.memoryOverhead=2048" \
--conf spark.executor.extraLibraryPath=/path/to/jniLibs \
--conf spark.driver.extraLibraryPath=/path/to/jniLibs \
path/to/location-tools-1.0-SNAPSHOT.jar --batch-geocode driver.ini
You should run spark2-submit either in headless mode or in a tmux session, since batch jobs may take several hours to execute.
Docker needs ~3 GB of disk space to download libpostal data. We provide a pre-built Lucene index for Massachusetts as well as all US Census shapefile data, stored in S3. Provision sufficient space (~10 GB) for the all-states Lucene index (unzipped).
(Massachusetts)
git clone https://github.com/NFPA/LocationTools.git
cd LocationTools
mvn clean install
docker image build -t nfpa-location-tools .
docker container run -p 8080:8080 nfpa-location-tools
These steps will automate most of the setup and run the API webserver on port 8080. You can refer to the Dockerfile in the project folder for more details. As you can see, it uses the `build_libpostal.sh`, `build_jpostal.sh`, and `onstart-docker.sh` scripts and the `vertx-docker-conf.json` configuration file.
Refer to the API WebServer section on how to query the endpoints.
To run the full US build, edit the `onstart-docker.sh` file and change the `BUILD` variable to `all` instead of `sample`, then run the commands below again:
docker image build -t nfpa-location-tools .
docker container run -p 8080:8080 nfpa-location-tools
Refer to the API WebServer section on how to query the endpoints.
You can export the image as a tar file using the command below:
docker save -o nfpa-location-tools.tar nfpa-location-tools:latest
See the Build your own docker section to generate a pre-built docker image.
Load and run the image file
docker load --input nfpa-location-tools.tar
docker container run -p 8080:8080 nfpa-location-tools:latest
For additional documentation please check out https://nfpa.github.io/LocationTools
- Rahul Pande - A WPI data science graduate student who worked as an intern at NFPA.
- Jason Yates - A developer at Clairvoyant who helped get the project off the ground.
- Mohammed Ayub - NFPA
- Joe Gochal - NFPA
- Clairvoyant - Shekhar Vemuri, Jason Yates, and team for their help in framing initial versions of the project.
- Cloudera Team for their support.
- NFPA Data Analytics Team for testing and providing feedback
- Initially funded as part of NFPA - WPI Data Science Graduate Qualifying Project (GQP) Initiative.
This project is licensed under the BSD 3-Clause License - see the LICENSE file for details