
[praeparātiō ex automatīs] MVP of the idea of automating the pre-processing of external data into the LSF internal file format #42

Open
fititnt opened this issue May 16, 2022 · 1 comment
Labels
praeparatio-ex-automatis praeparātiō ex automatīs; preparation out of automations

Comments


fititnt commented May 16, 2022


This issue is about a minimal viable product of one or more "crawlers", "scripts", or "converters" that transform external dictionaries (aka the ones we would label origo_per_automata, origin through automation, versus origo_per_amanuenses, origin through amanuenses, the way we are mostly optimized for now) into the working format.
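
As a rough illustration of the shape such a converter could take, here is a minimal sketch; the function names, record shape, and output file are hypothetical, not the project's actual API:

```python
# Hypothetical skeleton of an "origo_per_automata" converter; names and the
# trivial record shape are assumptions, not the project's actual API.
import csv
from typing import Iterable


def fetch_external_dictionary() -> Iterable[dict]:
    # A real crawler would download from the primary source here.
    return [{"code": "3550308", "label": "São Paulo"}]


def write_working_format(records: Iterable[dict], path: str) -> None:
    # Emit the normalized records in a tabular working format (raw CSV).
    with open(path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=["code", "label"])
        writer.writeheader()
        writer.writerows(records)


write_working_format(fetch_external_dictionary(), "working-format.csv")
```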

Focuses

  • The data we're interested in is already referential data, which is a smaller subset of what is shared
    • It is more important to have less data that is actively updated from the primary source and of very high quality than to hoard data and ignore the important datasets.
  • We're really interested in referential data for which we can document how to use it
    • This also means we may intentionally name the data fields in ways that make them easier to document, even if this means automatically generating user documentation
    • The entire idea must allow ways to receive collaborators' help to translate the documentation (it does not need to happen in the short term, but it should at least be planned from the very start)
  • Referential data can be public, but most information managers will deal with sensitive data
    • The best potential end users, aka the information managers, are likely to ingest all the data as soon as a new emergency happens.
    • Even if information managers have good data proficiency, or know some programming language, they're likely to be overloaded; so we need to make it as easy as possible to mitigate human error (on the reference tables)
  • We're interested in reference data useful for disaster preparedness
    • This makes it even more important to optimize for faster releases and user documentation, to make it less likely that users would leak sensitive data, and to make the data schema interoperable at the international level

External examples of types of reference data

International Federation of Red Cross and Red Crescent Societies | IFRC Data initiatives

Common Operational Datasets (overview)

2007 reference (somewhat outdated)

From https://interagencystandingcommittee.org/system/files/legacy_files/Country%20Level%20OCHA%20and%20HIC%20Minimum%20Common%20Operational%20Datasets%20v1.1.pdf

Table One: Minimum Common Operational Datasets

| Category | Data layer | Recommended scale of source material |
| --- | --- | --- |
| Political/Administrative boundaries | Country boundaries; Admin level 1; Admin level 2; Admin level 3; Admin level 4 | 1:250K |
| Populated places (with attributes including: latitude/longitude, alternative names, population figures, classification) | Settlements | 1:100K – 1:250K |
| Transportation network | Roads; Railways | 1:250K |
| Transportation infrastructure | Airports/Helipads; Seaports | 1:250K |
| Hydrology | Rivers; Lakes | 1:250K |
| City maps | Scanned city maps | 1:10K |

Table Two: Optional Datasets

| Category | Data layer | Recommended scale of source material |
| --- | --- | --- |
| Marine | Coastlines | 1:250K |
| Terrain | Elevation | 1:250K |
| National map series | Scanned topo sheets | 1:50K – 1:250K |
| Satellite imagery | Landsat, ASTER, Ikonos, Quickbird imagery | Various |
| Natural hazards | Various | Various |
| Thematic | Various | Various |

fititnt commented May 17, 2022

Okay. Doing some dogfooding with the previous step on #41.

  • 999999999_268072.py is a script to preprocess IBGE (Brazil) related data (at least COD-ABs) from the primary source at https://servicodados.ibge.gov.br/api/docs
  • 999999999_10263485.py is a script to preprocess CNES codes (Brazil), mostly related to healthcare sites.

Both cases are what would be considered referential data (see the sketch below for the kind of API call the first script wraps).
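
As a minimal, hypothetical sketch (not the actual 999999999_268072.py), assuming the public v1 "localidades" endpoint of the same IBGE API; the output file name and headers are illustrative:

```python
# Minimal sketch: fetch the municipality list from the IBGE "localidades" API
# and dump it as raw CSV so later steps can annotate and document it.
import csv
import json
import urllib.request

IBGE_MUNICIPIOS = "https://servicodados.ibge.gov.br/api/v1/localidades/municipios"

with urllib.request.urlopen(IBGE_MUNICIPIOS) as response:
    municipios = json.load(response)

with open("ibge_municipios.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["codigo_ibge", "nome"])  # raw headers; documented later
    for item in municipios:
        writer.writerow([item["id"], item["nome"]])
```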

On the issue of converting files: not a problem at all

I think the file converters (and, later, doing the pipeline with GitHub Actions or equivalent) are quite feasible. This is not what is taking more time (at least, considering everything already done with HXLTM and Numerodinatio).

However, most of the time it was testing with data already added to HDX, not from scratch (which has far more data to potentially document).

The new challenge: the MAPPINGS to full-blown linked data

The number of identifiers which already have a Wikidata P is quite limited (see https://www.wikidata.org/wiki/Wikidata:Database_reports/List_of_properties/all). The IBGE municipality code is a perfect example. From this post https://dadosabertos.social/t/lista-de-codigos-de-referencia-para-intercambio-de-dados/1138/2, we're starting to get more data, which would need to be mapped.

In short: things that already have a Wikidata P code are much simpler to deal with, but we need to be smarter about how to make it viable. I'm not saying it is not possible. But at this point, the level of abstraction of the converters (which now even means RDF, but at the same time it should work on relational databases) is such that not only would this allow converting the reference data to pretty much any format out there, but it would also make it easier to do automated discovery of the final datasets.
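
For the mapping side, here is a minimal sketch of how one could ask Wikidata which items carry a given external identifier, assuming P1585 is the property for the IBGE municipality code (the User-Agent string is a placeholder):

```python
# Minimal sketch: query the Wikidata SPARQL endpoint for items that carry a
# given external identifier property (P1585 assumed to be the IBGE code).
import json
import urllib.parse
import urllib.request

QUERY = "SELECT ?item ?codigo WHERE { ?item wdt:P1585 ?codigo . } LIMIT 5"

url = "https://query.wikidata.org/sparql?" + urllib.parse.urlencode(
    {"query": QUERY, "format": "json"}
)
request = urllib.request.Request(url, headers={"User-Agent": "lsf-sketch/0.1"})
with urllib.request.urlopen(request) as response:
    data = json.load(response)

for row in data["results"]["bindings"]:
    print(row["item"]["value"], row["codigo"]["value"])
```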

(For what is already referential data) moving the logic to a non-code file

The screenshot is from a very early stage, but the goal is to generalize better: it is "easier" to write software that converts whatever the source was into something in tabular format; then, after the tabular format is done (think raw CSV headers), a strategy to explain what it means will take time.


[Screenshot: Captura de tela de 2022-05-16 23-34-46]


Since at some point this will take far more thinking than just creating more crawlers, the idea of moving the logic to a YAML configuration makes sense.
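
A minimal sketch of what that could look like, with the YAML inlined for brevity; the field names, HXL hashtags, and schema layout are illustrative, not the real configuration format:

```python
# Minimal sketch of "logic in a non-code file": a YAML document describes what
# each raw CSV header means, and generic code applies it. All names here are
# illustrative, not the project's real schema.
import yaml  # pip install pyyaml

CONFIG = yaml.safe_load("""
fields:
  codigo_ibge:
    hxl: "#adm2+code+v_ibge"
    description: "IBGE municipality code"
  nome:
    hxl: "#adm2+name"
    description: "Municipality name"
""")


def hxl_header(raw_header: str) -> str:
    # Fall back to the raw header when the YAML does not document it yet.
    field = CONFIG["fields"].get(raw_header)
    return field["hxl"] if field else raw_header


print([hxl_header(h) for h in ["codigo_ibge", "nome"]])
```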

fititnt added a commit that referenced this issue May 17, 2022
fititnt added a commit that referenced this issue May 18, 2022
…t better explanation (like relation with places)
fititnt added a commit that referenced this issue May 18, 2022
…ines from (until now) fully standalone 999999999_*.py stripts
fititnt added a commit that referenced this issue May 18, 2022
fititnt added a commit that referenced this issue May 18, 2022
fititnt added a commit that referenced this issue May 19, 2022
…235.py already is able to create the CSV, HXL and HXL+tm
fititnt added a commit that referenced this issue May 20, 2022
fititnt added a commit that referenced this issue May 21, 2022
fititnt added a commit that referenced this issue May 21, 2022
fititnt added a commit that referenced this issue May 21, 2022
fititnt added a commit that referenced this issue May 21, 2022