Consolidate geo matching of hospitals #14

Open
mradamcox opened this issue May 14, 2020 · 0 comments

We have added a GeoJSON file to the repo that contains all of the hospital locations to date. I have tested it and can confirm that the data is complete as of this writing, but we still haven't fully integrated it into the workflow, which is what this ticket is about.

Background: Each row in an input CSV contains the name of a hospital, and that name has to be matched to a latitude/longitude, street address, etc., which are then written to the output CSV. The input CSVs do come with some of this information for some of the hospitals, but it is not reliable, so we disregard it.

Currently, the matching process uses two different files, geocode_cache.csv and pa_hospitals, and combines them in HospitalLocations() in geo_utils.py. process_csv then uses the result to match coordinates to hospitals based on their names; for an example, see https://github.com/RTCovid/PADataIngestion/blob/master/operators/process_csv.py#L54 (also scroll down to lines 120 and 133).

Instead of that process, we can consolidate greatly by loading the GeoJSON file, matching each name to a feature, and then taking all of the necessary information from that feature. We recently started using that matching process in the Validator class, in validate_locations(self, input_csv). That example also shows the simple pattern in place to handle misspellings or new names for hospitals: a "HospitalNameAliases" field stored in the GeoJSON can hold pipe-delimited alternate spellings, and it is parsed in the load_geojson function. Then, if a name doesn't immediately match one of the features, the list is iterated again to check the aliases.
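
For anyone picking this up, here is a minimal sketch of the two-pass matching described above. It is not the code in the Validator class. The "HospitalName" property, the case-insensitive comparison, and the helper names `parse_aliases` and `find_feature` are assumptions for illustration; only the pipe-delimited "HospitalNameAliases" field comes from this ticket. The sample features are made up.

```python
def parse_aliases(properties):
    """Split the pipe-delimited HospitalNameAliases field into a clean list."""
    raw = properties.get("HospitalNameAliases") or ""
    return [alias.strip() for alias in raw.split("|") if alias.strip()]


def find_feature(name, features):
    """Return the feature whose name (or, failing that, an alias) matches."""
    target = name.strip().lower()

    # First pass: compare against the primary name of each feature.
    # "HospitalName" is an assumed property name.
    for feature in features:
        primary = feature["properties"].get("HospitalName", "")
        if primary.strip().lower() == target:
            return feature

    # Second pass: iterate the list again, this time checking aliases.
    for feature in features:
        aliases = parse_aliases(feature["properties"])
        if target in (alias.lower() for alias in aliases):
            return feature

    return None


# Made-up sample data in the shape of GeoJSON features.
SAMPLE_FEATURES = [
    {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [-77.0, 40.0]},
        "properties": {
            "HospitalName": "Example General Hospital",
            "HospitalNameAliases": "Example Gen Hosp|Exmple General Hospital",
        },
    },
    {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [-75.5, 40.5]},
        "properties": {"HospitalName": "Sample Medical Center"},
    },
]

by_name = find_feature("sample medical center", SAMPLE_FEATURES)
by_alias = find_feature("Example Gen Hosp", SAMPLE_FEATURES)
no_match = find_feature("Unknown Clinic", SAMPLE_FEATURES)
```

Running the primary-name pass to completion before looking at aliases means an alias on one feature can never shadow the real name of another.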

Completing this ticket means revamping process_csv to use the new matching method, so that HospitalLocations (and therefore the CSV files mentioned above) are no longer needed.
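
As a rough illustration of what the revamped flow could look like, the sketch below builds a name lookup from the features once and then uses it to fill in location fields for each input row. This is only one possible shape, not the existing process_csv code: the function names `build_name_index` and `enrich_rows`, the column names ("HospitalName", "AvailableBeds", "Latitude", "Longitude", "StreetAddress"), and the sample data are all hypothetical. The one fixed detail is that GeoJSON point coordinates are ordered longitude first, then latitude.

```python
import csv
import io


def build_name_index(features):
    """Map lowercased names and aliases to their feature."""
    index = {}
    # Primary names go in first so they take precedence over aliases.
    for feature in features:
        name = feature["properties"]["HospitalName"]  # assumed property name
        index[name.strip().lower()] = feature
    for feature in features:
        raw = feature["properties"].get("HospitalNameAliases") or ""
        for alias in raw.split("|"):
            if alias.strip():
                index.setdefault(alias.strip().lower(), feature)
    return index


def enrich_rows(rows, index):
    """Attach location fields from the matched feature to each input row."""
    matched, unmatched = [], []
    for row in rows:
        feature = index.get(row["HospitalName"].strip().lower())
        if feature is None:
            unmatched.append(row["HospitalName"])
            continue
        # GeoJSON stores point coordinates as [longitude, latitude].
        lon, lat = feature["geometry"]["coordinates"][:2]
        props = feature["properties"]
        matched.append({
            **row,
            "Latitude": lat,
            "Longitude": lon,
            "StreetAddress": props.get("StreetAddress", ""),  # assumed property
        })
    return matched, unmatched


# Made-up feature and input CSV for illustration only.
FEATURES = [
    {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [-77.0, 40.0]},
        "properties": {
            "HospitalName": "Example General Hospital",
            "HospitalNameAliases": "Example Gen Hosp",
            "StreetAddress": "1 Example Street",
        },
    },
]
input_csv = io.StringIO(
    "HospitalName,AvailableBeds\n"
    "Example Gen Hosp,12\n"
    "Unknown Clinic,3\n"
)

matched, unmatched = enrich_rows(csv.DictReader(input_csv), build_name_index(FEATURES))
```

Collecting unmatched names separately makes it easy to spot hospitals that need a new alias added to the GeoJSON.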
