In this repository, we describe how to reproduce our evaluation study. More information can be found in our paper "Geoparsing: Solved or Biased? An Evaluation of Geographic Biases in Geoparsing", accepted by AGILE 2022.
To reproduce our data preprocessing steps, you first need to download several datasets and put them in the corresponding directories. Alternatively, you can use our shared preprocessed datasets directly and jump to the Deploying Geoparsers section.
The country/region shapefile can be accessed as Admin-0 Countries from Natural Earth, and should be put in data/admin0-natural-earth/.
LGL, GeoVirus, and WikToR can be accessed here, and should be put in data/evaluation-corpora/original-datasets/.
Their data patches can be accessed here, and should be put in data/evaluation-corpora/data-patches/.
GeoCorpora can be accessed here, and should be put in data/evaluation-corpora/original-datasets/.
GeoWiki can be accessed here, and should be put in data/training-corpora/.
The GeoNames gazetteer used by CamCoder can be accessed here, and should be put in data/gazetteers/.
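Before running the preprocessing notebooks, it can help to confirm that the downloaded datasets ended up in the directories listed above. The following is a minimal sketch of such a check, not part of the repository: the function name `missing_or_empty` is hypothetical, and the check only verifies that each directory exists and contains at least one entry, since the expected file names are not listed here.

```python
from pathlib import Path

# Directories named in the download instructions above.
EXPECTED_DIRS = [
    "data/admin0-natural-earth",
    "data/evaluation-corpora/original-datasets",
    "data/evaluation-corpora/data-patches",
    "data/training-corpora",
    "data/gazetteers",
]


def missing_or_empty(root="."):
    """Return the expected directories that are absent or empty under root.

    Hypothetical helper: it does not check file names or contents,
    only that each directory exists and holds at least one entry.
    """
    root = Path(root)
    problems = []
    for rel in EXPECTED_DIRS:
        directory = root / rel
        if not directory.is_dir() or not any(directory.iterdir()):
            problems.append(rel)
    return problems
```

Calling `missing_or_empty(".")` from the repository root returns an empty list once every directory has been populated.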
To support both our spatially explicit geoparsing performance evaluation and our geographic bias evaluation, we extracted all annotated locations from the training and evaluation corpora as well as the GeoNames gazetteer. The script used is scripts/annotated-poi-extraction.ipynb. The extracted locations should be stored in data/extracted-annotated-locations/.
To make it easier to run Edinburgh Geoparser, we split the articles in LGL, GeoVirus, and WikToR into separate datasets. The script used is scripts/toponym-resolution-evaluation-corpora-splitting.ipynb. The split datasets are stored in data/evaluation-corpora/split-datasets/.
For the representation bias analysis, we generated grids that summarize, for each dataset, the number of annotated locations falling within each grid cell. The grid summaries were generated in ArcGIS Pro 2.9.0. An example using the annotated locations extracted from WikToR is given below. The grid summary dataset can be downloaded from our publicly accessible figshare repository.
(1) First, use XY Table To Point to convert WikToR's annotated locations into point features.
(2) Then, use Project to project the point features from WGS 1984 to Eckert IV (World), which is the selected projected coordinate system in our study.
(3) After adding the country/region shapefile to the map project, apply the same projection to it.
(4) Use Grid Index Features to generate grids from the country/region polygon features.
(5) Then, use Spatial Join to join the country/region polygon features and grids to add the country/region information to every grid.
(6) Use Summarize Within to summarize the number of annotated locations within each grid.
(7) Use Export Table to export the attribute table of grid features to data/grids-100sqkm-admin0-natural-earth/.
Note that you only need to perform step (2) to step (5) once to generate grids from the country/region shapefile. You can repeat step (1), step (6), and step (7) to generate grid summaries for LGL, GeoVirus, GeoWiki, GeoCorpora, and GeoNames, respectively.
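To make clear what the per-grid summary in step (6) represents, the sketch below counts points per square cell in plain Python. It is an illustration only and not the ArcGIS Pro workflow used in the study: it assumes the coordinates are already projected (for example to Eckert IV, in metres), it ignores the country/region join from step (5), and the function name and the `cell_size` parameter are hypothetical.

```python
import math
from collections import Counter


def count_points_per_cell(points, cell_size):
    """Count projected (x, y) points per square grid cell.

    Assumption: cells are axis-aligned squares of side cell_size,
    anchored at the origin of the projected coordinate system.
    Returns a Counter keyed by (column, row) cell indices.
    """
    if cell_size <= 0:
        raise ValueError("cell_size must be positive")
    counts = Counter()
    for x, y in points:
        # floor() keeps negative coordinates in the correct cell.
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        counts[cell] += 1
    return counts
```

For example, `count_points_per_cell([(5, 5), (7, 9), (15, 5)], 10)` places the first two points in cell `(0, 0)` and the third in cell `(1, 0)`.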
For toponym recognition, we used spaCy (version 2.1) with the en_core_web_lg English pipeline, and NeuroTPR.
After you unzip the pre-trained NeuroTPR models, please put all the files in models/NeuroTPR. Also, because NeuroTPR uses tensorflow_hub, which supports TensorFlow 1.15 but not TensorFlow 1.14, make sure you install TensorFlow 1.15. To deal with the error InvalidArgumentError: ConcatOp : Dimensions of inputs should match, which you may encounter when running NeuroTPR, you can change geoparse.py in the NeuroTPR site-packages as below.
For toponym resolution, we used Edinburgh Geoparser and CamCoder. Because we ran the CamCoder experiments in a Python 3.6.13 environment rather than the Python 2.7+ environment used by its authors, we provide the scripts that were updated for our study: models/CamCoder/root/context2vec.py, models/CamCoder/root/geoparse.py, models/CamCoder/root/preprocessing.py, and models/CamCoder/root/text2mapVec.py.
If you are interested in how we ran the toponym recognition and resolution models, you can follow the step-by-step instructions below. Alternatively, you can directly use our shared geoparsed results in geoparsed-results/ and jump to the Exploratory Analysis on Geoparsing Performance Indicators section.
scripts/toponym-recognition-GeoCorpora-spaCy.py and scripts/toponym-recognition-GeoCorpora-NeuroTPR.py perform toponym recognition on GeoCorpora with spaCy and NeuroTPR, respectively.
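As a rough illustration of the spaCy side, toponym recognition with a general-purpose NER pipeline usually amounts to keeping the named entities whose labels denote places. The sketch below is not the repository script. Treating `GPE`, `LOC`, and `FAC` as toponym labels is an assumption, and the function names are hypothetical. spaCy is imported lazily so that the filtering step can be read and tested without the model installed.

```python
# Assumption: these entity labels are treated as toponyms.
# The label set used in the study's script may differ.
TOPONYM_LABELS = {"GPE", "LOC", "FAC"}


def toponyms_from_doc(doc, labels=TOPONYM_LABELS):
    """Return (text, start_char, end_char) for place-like entities in a doc."""
    return [
        (ent.text, ent.start_char, ent.end_char)
        for ent in doc.ents
        if ent.label_ in labels
    ]


def recognize_toponyms(text, model_name="en_core_web_lg"):
    """Run a spaCy pipeline on text and keep only place-like entities."""
    import spacy  # imported here so the module loads without spaCy

    nlp = spacy.load(model_name)
    return toponyms_from_doc(nlp(text))
```

The character offsets make it possible to match recognized toponyms against the annotated spans in an evaluation corpus.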
A tutorial on how to properly use Edinburgh Geoparser can be accessed here. After using Edinburgh Geoparser to perform toponym resolution on LGL, GeoVirus, and WikToR, scripts/toponym-resolution-results-Edinburgh-Geoparser-integration.ipynb needs to be run to integrate the output files for further evaluation. Note that the toponym resolution results produced by Edinburgh Geoparser may not be exactly the same from one run to the next. Therefore, please make sure to use our provided results for further analyses. models/CamCoder/root/toponym-resolution-LGL-CamCoder.py, models/CamCoder/root/toponym-resolution-GeoVirus-CamCoder.py, and models/CamCoder/root/toponym-resolution-WikToR-CamCoder.py perform toponym resolution with CamCoder on LGL, GeoVirus, and WikToR, respectively.
scripts/exploratory-analysis-recall.ipynb and scripts/exploratory-analysis-mdned.ipynb perform exploratory analyses on Recall and Median Error Distance (MdnED), respectively.
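For readers unfamiliar with the two indicators, the sketch below shows one common way to compute them. It is not taken from the notebooks: it assumes error distance is the great-circle (haversine) distance in kilometres between the annotated and the predicted coordinates, and that recall is the share of annotated toponyms that were recognized. The exact definitions used in the study are given in the paper, and all function names here are hypothetical.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def median_error_distance(pairs):
    """Median haversine distance over ((gold_lat, gold_lon), (pred_lat, pred_lon)) pairs."""
    return median(haversine_km(*gold, *pred) for gold, pred in pairs)


def recall(n_recognized, n_annotated):
    """Share of annotated toponyms that were recognized."""
    if n_annotated <= 0:
        raise ValueError("n_annotated must be positive")
    return n_recognized / n_annotated
```

The median is preferred over the mean in this kind of summary because a few predictions on the wrong continent would otherwise dominate the result.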
scripts/standard-deviation-toponym-resolution-ambiguity.ipynb calculates the standard deviation of MdnED for highly ambiguous toponyms.
Once all toponym recognition and resolution results are ready, you can perform spatial autocorrelation analysis on them, which was also done in ArcGIS Pro 2.9.0. An example using the toponym resolution result generated by CamCoder on WikToR is given below.
(1) Same as the first two steps in Grid Summary Generation, use XY Table To Point to convert the toponym resolution result to point features, and then use Project to project the point features from WGS 1984 to Eckert IV (World).
(2) Use Hot Spot Analysis (Getis-Ord Gi*) to perform spatial autocorrelation analysis. In our study, the parameters Conceptualization of Spatial Relationships and Number of Neighbors were set to K nearest neighbors and 8, respectively. Note that the Input Field should be changed from median_error_distance to recall when performing spatial autocorrelation analysis on toponym recognition results.
You can repeat the above steps to perform spatial autocorrelation analysis on the remaining toponym recognition and resolution results.
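To give a sense of what the hot spot analysis computes, the sketch below evaluates the Getis-Ord Gi* z-score with binary weights over each feature's K nearest neighbours. It is a simplified illustration and not a reimplementation of the ArcGIS Pro tool: it assumes projected coordinates, counts the feature itself in its own neighbourhood, uses a brute-force neighbour search, and omits significance testing and any corrections the tool applies. The function name is hypothetical.

```python
import math


def getis_ord_gi_star(points, values, k=8):
    """Return one Gi* z-score per feature, using binary K-nearest-neighbour weights.

    points: list of projected (x, y) coordinates.
    values: the analysed attribute per feature, e.g. median error distance.
    Assumption: each feature's neighbourhood is itself plus its k nearest features.
    """
    n = len(points)
    if n != len(values):
        raise ValueError("points and values must have the same length")
    if n <= k + 1:
        raise ValueError("need more than k + 1 features")
    mean = sum(values) / n
    variance = sum(v * v for v in values) / n - mean * mean
    if variance <= 0:
        raise ValueError("values must not be constant")
    std = math.sqrt(variance)

    scores = []
    for i, (xi, yi) in enumerate(points):
        # Brute-force search for the k nearest other features.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.hypot(points[j][0] - xi, points[j][1] - yi),
        )
        neighbourhood = [i] + others[:k]
        w_sum = len(neighbourhood)  # binary weights: sum of w equals sum of w squared
        local_sum = sum(values[j] for j in neighbourhood)
        numerator = local_sum - mean * w_sum
        denominator = std * math.sqrt((n * w_sum - w_sum ** 2) / (n - 1))
        scores.append(numerator / denominator)
    return scores
```

Large positive scores indicate clusters of high values (hot spots) and large negative scores indicate clusters of low values (cold spots).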
For the geographic bias evaluation, representation bias was measured with the script scripts/representation-bias-measurement.ipynb.