
[praeparātiō ex automatīs] MVP of the idea of automating the pre-processing of external data into the LSF internal file format #42

Open
fititnt opened this issue May 16, 2022 · 1 comment
Labels
praeparatio-ex-automatis praeparātiō ex automatīs; preparation out of automations

Comments


fititnt commented May 16, 2022


This issue is about a minimal viable product of one or more "crawlers", "scripts", or "converters" that transform external dictionaries (aka the ones we would label origo_per_automata, origin through automation, versus origo_per_amanuenses, origin through amanuenses, the way we are mostly optimized for now) into the working format.
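
As a rough illustration of the shape such a converter could take, here is a minimal sketch; the function names, record shape, and output file are hypothetical, not the project's actual API:

```python
# Hypothetical skeleton of an "origo_per_automata" converter; names and the
# trivial record shape are assumptions, not the project's actual API.
import csv
from typing import Iterable


def fetch_external_dictionary() -> Iterable[dict]:
    # A real crawler would download from the primary source here.
    return [{"code": "3550308", "label": "São Paulo"}]


def write_working_format(records: Iterable[dict], path: str) -> None:
    # Emit the normalized records in a tabular working format (raw CSV).
    with open(path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=["code", "label"])
        writer.writeheader()
        writer.writerows(records)


write_working_format(fetch_external_dictionary(), "working-format.csv")
```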

Focuses

  • The data we're interested in is already referential data, which is a smaller subset of what is shared
    • It is more important to have less data that is actively updated from the primary source and of very high quality than to hoard data and ignore the important datasets.
  • We're really interested in referential data for which we can document how to use it
    • This also means we may intentionally name the data fields in ways that make them easier to document, even if this means automatically generating user documentation
    • The entire idea must allow ways to receive collaborators' help to translate the documentation (it does not need to happen in the short term, but it should at least be planned from the very start)
  • Referential data can be public, but most information managers will deal with sensitive data
    • The best potential end users, aka the information managers, are likely to ingest all the data as soon as a new emergency happens.
    • Even if information managers have good data proficiency, or know some programming language, they're likely to be overloaded; so we need to make it as easy as possible to mitigate human error (on the reference tables)
  • We're interested in reference data useful for disaster preparedness
    • This makes it even more important to optimize for faster releases and user documentation, to make it less likely that users would leak sensitive data, and to make the data schema interoperable at the international level

External examples of types of reference data

International Federation of Red Cross and Red Crescent Societies | IFRC Data initiatives

Common Operational Datasets (overview)

2007 reference (somewhat outdated)

From https://interagencystandingcommittee.org/system/files/legacy_files/Country%20Level%20OCHA%20and%20HIC%20Minimum%20Common%20Operational%20Datasets%20v1.1.pdf

Table One: Minimum Common Operational Datasets

| Category | Data layer | Recommended scale of source material |
| --- | --- | --- |
| Political/Administrative boundaries | Country boundaries; Admin level 1; Admin level 2; Admin level 3; Admin level 4 | 1:250K |
| Populated places (with attributes including: latitude/longitude, alternative names, population figures, classification) | Settlements | 1:100K – 1:250K |
| Transportation network | Roads; Railways | 1:250K |
| Transportation infrastructure | Airports/Helipads; Seaports | 1:250K |
| Hydrology | Rivers; Lakes | 1:250K |
| City maps | Scanned city maps | 1:10K |

Table Two: Optional Datasets

| Category | Data layer | Recommended scale of source material |
| --- | --- | --- |
| Marine | Coastlines | 1:250K |
| Terrain | Elevation | 1:250K |
| National map series | Scanned topo sheets | 1:50K – 1:250K |
| Satellite imagery | Landsat, ASTER, Ikonos, Quickbird imagery | Various |
| Natural hazards | Various | Various |
| Thematic | Various | Various |

fititnt commented May 17, 2022

Okay. Doing some dogfooding with the previous step on #41.

  • 999999999_268072.py is a script to preprocess IBGE (Brazil) related data (at least COD-ABs) from the primary source at https://servicodados.ibge.gov.br/api/docs
  • 999999999_10263485.py is a script to preprocess CNES codes (Brazil), mostly related to healthcare sites.

Both cases are what would be considered referential data (see the sketch below for the kind of API call the first script wraps).
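
As a minimal, hypothetical sketch (not the actual 999999999_268072.py), assuming the public v1 "localidades" endpoint of the same IBGE API; the output file name and headers are illustrative:

```python
# Minimal sketch: fetch the municipality list from the IBGE "localidades" API
# and dump it as raw CSV so later steps can annotate and document it.
import csv
import json
import urllib.request

IBGE_MUNICIPIOS = "https://servicodados.ibge.gov.br/api/v1/localidades/municipios"

with urllib.request.urlopen(IBGE_MUNICIPIOS) as response:
    municipios = json.load(response)

with open("ibge_municipios.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["codigo_ibge", "nome"])  # raw headers; documented later
    for item in municipios:
        writer.writerow([item["id"], item["nome"]])
```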

On the issue of converting files: not a problem at all

I think the file converters (and, later, doing the pipeline with GitHub Actions or equivalent) are quite feasible. This is not what is taking more time (at least, considering everything already done with HXLTM and Numerodinatio).

However, most of the time it was testing with data already added to HDX, not from scratch (which has far more data to potentially document).

The new challenge: the MAPPINGS to full-blown linked data

The number of identifiers which already have a Wikidata P is quite limited (see https://www.wikidata.org/wiki/Wikidata:Database_reports/List_of_properties/all). The IBGE municipality code is a perfect example. From this post https://dadosabertos.social/t/lista-de-codigos-de-referencia-para-intercambio-de-dados/1138/2, we're starting to get more data, which would need to be mapped.

In short: things that already have a Wikidata P code are much simpler to deal with, but we need to be smarter about how to make it viable. I'm not saying it is not possible. But at this point, the level of abstraction of the converters (which now even means RDF, but at the same time it should work on relational databases) is such that not only would this allow converting the reference data to pretty much any format out there, but it would also make it easier to do automated discovery of the final datasets.
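
For the mapping side, here is a minimal sketch of how one could ask Wikidata which items carry a given external identifier, assuming P1585 is the property for the IBGE municipality code (the User-Agent string is a placeholder):

```python
# Minimal sketch: query the Wikidata SPARQL endpoint for items that carry a
# given external identifier property (P1585 assumed to be the IBGE code).
import json
import urllib.parse
import urllib.request

QUERY = "SELECT ?item ?codigo WHERE { ?item wdt:P1585 ?codigo . } LIMIT 5"

url = "https://query.wikidata.org/sparql?" + urllib.parse.urlencode(
    {"query": QUERY, "format": "json"}
)
request = urllib.request.Request(url, headers={"User-Agent": "lsf-sketch/0.1"})
with urllib.request.urlopen(request) as response:
    data = json.load(response)

for row in data["results"]["bindings"]:
    print(row["item"]["value"], row["codigo"]["value"])
```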

(For what is already referential data) moving the logic to a non-code file

The screenshot is from a very early stage, but the goal is to generalize better: it is "easier" to write software that converts whatever the source was into something in tabular format; then, after the tabular format is done (think raw CSV headers), a strategy to explain what it means will take time.


[Screenshot: Captura de tela de 2022-05-16 23-34-46]


Since at some point this will take far more thinking than just creating more crawlers, the idea of moving the logic to a YAML configuration makes sense.
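
A minimal sketch of what that could look like, with the YAML inlined for brevity; the field names, HXL hashtags, and schema layout are illustrative, not the real configuration format:

```python
# Minimal sketch of "logic in a non-code file": a YAML document describes what
# each raw CSV header means, and generic code applies it. All names here are
# illustrative, not the project's real schema.
import yaml  # pip install pyyaml

CONFIG = yaml.safe_load("""
fields:
  codigo_ibge:
    hxl: "#adm2+code+v_ibge"
    description: "IBGE municipality code"
  nome:
    hxl: "#adm2+name"
    description: "Municipality name"
""")


def hxl_header(raw_header: str) -> str:
    # Fall back to the raw header when the YAML does not document it yet.
    field = CONFIG["fields"].get(raw_header)
    return field["hxl"] if field else raw_header


print([hxl_header(h) for h in ["codigo_ibge", "nome"]])
```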

fititnt added a commit that referenced this issue May 17, 2022
fititnt added a commit that referenced this issue May 18, 2022
…t better explanation (like relation with places)
fititnt added a commit that referenced this issue May 18, 2022
…ines from (until now) fully standalone 999999999_*.py stripts
fititnt added a commit that referenced this issue May 18, 2022
fititnt added a commit that referenced this issue May 18, 2022
fititnt added a commit that referenced this issue May 19, 2022
…235.py already is able to create the CSV, HXL and HXL+tm
fititnt added a commit that referenced this issue May 20, 2022
fititnt added a commit that referenced this issue May 21, 2022
fititnt added a commit that referenced this issue May 21, 2022
fititnt added a commit that referenced this issue May 21, 2022
fititnt added a commit that referenced this issue May 21, 2022