
Taxonomic strategy to encode collective of humans (such as population statistics by total and thematic) by P-codes #43

fititnt opened this issue May 27, 2022 · 4 comments



fititnt commented May 27, 2022


After #39, we will eventually have to integrate at least population statistics. There are several sources for it, but we at least need to decide on the organization strategy and prepare the tooling.

fititnt changed the title from "Organization strategy to deal with population statistics (total and thematic) by published codes" to "Organization strategy to deal with population statistics (total and thematic) by published place codes" on May 27, 2022
fititnt changed the title from "Organization strategy to deal with population statistics (total and thematic) by published place codes" to "Taxonomic strategy to encode collective of humans (such as population statistics by total and thematic) by P-codes" on Jul 13, 2022

fititnt commented Jul 25, 2022

Known challenges

1. Primary: ways to semantically encode in both tabular and graph formats

The newer versions of No1 and No11 in CSV already allow generic conversion to RDF and (this is important) are on the path to allowing related concepts which are not the same thing. This may seem strange for a tabular database format, but it is quite hard to do both without overcomplicating things for the end user.

We could simply do a 1:1 Wikidata mapping (some attributes, like male and female populations, are already differentiated), but we should still leave room for more specialized variants.

While not fully self-testable (we still rely only on the frictionless validator and, to simplify, the Apache Jena riot validator), content already published on the fully automated organization @MDCIII is using stricter HXL Standard in ways which allow RDF mappings. That's why the way we encode metadata about collections of humans by theme becomes quite a big deal. What matters is less the data itself and more the well-documented/predictable schemas: they allow tooling integration for data which doesn't need to be public as well as data which is public, and they would make it easier for others to convert their data to our taxonomy and reuse everything else.
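
Just to make the 1:1 idea concrete, here is a minimal sketch (not the project's actual schema) of mapping a few tabular population columns to Wikidata properties with rdflib; the column names and the placeholder QID are assumptions, the P1082/P1539/P1540 properties are the real Wikidata ones mentioned later in this thread.

```python
# Minimal sketch: map a few hypothetical tabular columns to Wikidata
# properties so the same row can be read either as CSV or as RDF triples.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import XSD

WDT = Namespace("http://www.wikidata.org/prop/direct/")
WD = Namespace("http://www.wikidata.org/entity/")

# Hypothetical column -> Wikidata property crosswalk
COLUMN_TO_PROPERTY = {
    "population_total": WDT.P1082,   # population
    "population_female": WDT.P1539,  # female population
    "population_male": WDT.P1540,    # male population
}

def row_to_triples(graph: Graph, subject: URIRef, row: dict) -> None:
    """Add one triple per mapped column that has a value."""
    for column, prop in COLUMN_TO_PROPERTY.items():
        value = row.get(column)
        if value not in (None, ""):
            graph.add((subject, prop, Literal(int(value), datatype=XSD.integer)))

g = Graph()
g.bind("wdt", WDT)
g.bind("wd", WD)
# Q123456 is a placeholder QID for "some administrative boundary".
row_to_triples(g, WD.Q123456, {"population_total": "1120000", "population_female": "580000"})
print(g.serialize(format="turtle"))
```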

2. Secondary: the crawlers

While we obviously can also fetch Wikidata population statistics (and are likely to do this even just for testing the schemas), it is still viable to get population data from other places.

However, this is already a part where I'm not sure it is worth the trouble to focus on data already published on, for example, the Humanitarian Data Exchange (HDX) rather than crawling the common APIs directly. Maybe it could be, but only for a small subset of countries.
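
For the "fetch from Wikidata just to test the schemas" part, something like the sketch below could be enough. The endpoint and the SPARQL label service are the real Wikidata Query Service ones; the QID is assumed to be Mozambique's and error handling is minimal.

```python
# Sketch: pull total population (P1082) from the Wikidata Query Service,
# which may be enough for testing the schemas before touching HDX or
# country-specific APIs.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?place ?placeLabel ?population WHERE {
  VALUES ?place { wd:Q1029 }            # assumed: Mozambique
  ?place wdt:P1082 ?population .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

response = requests.get(
    SPARQL_ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "population-schema-test/0.1"},
    timeout=60,
)
response.raise_for_status()
for binding in response.json()["results"]["bindings"]:
    print(binding["placeLabel"]["value"], binding["population"]["value"])
```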

3. Comment: about population data for humanitarian use which is inferred

On the notes about population statistics, at least for Mozambique, comments warn that humanitarians are requesting granular information (such as age by gender and disability) which simply is not available for real; so, apart from some extra human review, those numbers are already simulations.

By no means am I saying these data are bad or not worth it. In fact, when done well, they are cost-effective. But by having other metrics which could act as seeds, users could derive other inferences on demand or compare different sources. Not something for the short term (maybe not even the mid term), but the level of detail we're going to taxonomize would allow it.

However, while it might seem strange, a major feature is the population statistics which already use P-Codes. Unless we resolve the P-Codes to Wikidata QIDs, we can't automate several features which we could otherwise get really fast. We're really aiming to get things very well integrated, not mere data hoarding.
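
A sketch of what that resolution step could look like, assuming a hand-maintained crosswalk CSV (the file name and its columns are hypothetical): once P-Codes map to QIDs, statistics keyed by P-Code can be joined to anything already keyed by QID.

```python
# Sketch: enrich P-Code keyed statistics with Wikidata QIDs via a
# hypothetical crosswalk file.
import csv

def load_pcode_to_qid(path: str = "pcode_wikidata_crosswalk.csv") -> dict:
    """Return {pcode: qid}; the 'pcode' and 'qid' column names are assumptions."""
    with open(path, newline="", encoding="utf-8") as handle:
        return {row["pcode"]: row["qid"] for row in csv.DictReader(handle)}

def attach_qids(rows: list, crosswalk: dict) -> list:
    """Copy statistics rows, adding the QID when the P-Code is known."""
    enriched_rows = []
    for row in rows:
        enriched = dict(row)
        enriched["qid"] = crosswalk.get(row.get("pcode", ""), "")
        enriched_rows.append(enriched)
    return enriched_rows
```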


fititnt commented Jul 30, 2022

The ./999999999/0/999999999_521850.py program is used as the data scraper for some thematic data. Regardless of the challenge pointed out here #45 (comment) about eventually needing to map 1:1 with P-Codes at subnational levels, the amount of thematic data for which we can already create tabular data is so high that it brings up the next issue:

To dos

Thematic data will need stricter graph mappings

Several places can describe nearly the same content. HXL could work with plain natural language, but to allow automated documentation (and things like semantic reasoning without a centralized server) we really need very well documented mappings.

This is one of the reasons why the slowdown is less about data scraping and getting the crawlers working, and more about... how to taxonomize the results. Things which are Wikidata properties could even use properties which would make sense if ingested back into Wikidata, but we're likely to deal with things that still need to be encoded and will need to refer to individual Q items.

The decision about naming things (e.g. the infixes to use for thematic data)

There are only two hard things in Computer Science: cache invalidation and naming things.

-- Phil Karlton

Regardless of how we use RDF or other mappings, since we don't rely on public URLs but on URNs with very well-defined meaning, we need to think carefully about how to organize the structural numbering. From the taxonomy alone it is already possible to infer what the data is about, and this simplifies a lot when the data is ingested into SQL databases (the #37), but we still need to think about the entrypoints.
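
Purely as an illustration of "predictable structural numbering" (the `urn:data` prefix and the helper below are hypothetical; only the `1603_9966_1` numbering comes from this thread), something like this keeps the identifiers machine-buildable:

```python
# Illustrative sketch: assemble URN-like identifiers from numeric parts.
# The "urn:data" prefix and infix scheme are placeholders, not the
# project's real structural numbering.
def build_urn(base_group: str, thematic_infix: str, suffix: str = "") -> str:
    parts = ["urn:data", base_group, thematic_infix]
    if suffix:
        parts.append(suffix)
    return ":".join(parts)

# One identifier per thematic dataset, predictable enough that a SQL loader
# (or a human) can infer what the data is about from the name alone.
print(build_urn("1603", "9966", "1"))  # -> urn:data:1603:9966:1
```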

BFO, RDF and how to package the datasets when in tabular format

For now, in addition to thinking about such numeric infixes, we're likely to think about how the final organization would look in terms of Basic Formal Ontology. For the sake of RDF and to maximize reuse for people who can't have centralized systems, this means thinking in ways that work both for basic usage (likely just common tables as entrypoints) and for advanced usage (triplestores, where the massive number of properties could easily blow past the column limits of PostgreSQL).

Likely one of the main features would be to offer one strongly suggested option of single inheritance which could still be reasonable. Hardcore ontologists could still reshape the data later, but having at least one way that we can make work with real data makes sense.
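
A minimal sketch of what "one strongly suggested single-inheritance chain" could look like when serialized, using placeholder IRIs (the EX namespace and class names are invented here, not an actual BFO alignment):

```python
# Sketch: a single-inheritance chain expressed as rdfs:subClassOf, using
# placeholder IRIs rather than a real BFO alignment.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

EX = Namespace("urn:example:")

g = Graph()
g.bind("ex", EX)
for cls in (EX.PopulationStatistic, EX.ThematicPopulationStatistic):
    g.add((cls, RDF.type, OWL.Class))
# Exactly one parent per class: ontologists can reshape this later, but a
# default chain keeps reasoning and tabular flattening simple.
g.add((EX.ThematicPopulationStatistic, RDFS.subClassOf, EX.PopulationStatistic))
print(g.serialize(format="turtle"))
```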

Non-goals of the to-dos

However, note that while this point needs documentation, for things that would break reasoning even in graph format for the final user (most notably when ingesting data lets contradictory facts co-exist, such as one fact stating that some administrative boundary is a sovereign country while this breaks other parts), we can simply not assume the user will ingest all data at the same time. Another case is when users ingest two or more models to explain the world (for example, different ways to express what is biological sex and what is gender identity) and this would break the reasoning because they would try to re-classify facts in different ways.

I think this limitation of not trying to make the structural taxonomy deal with such advanced inconsistencies (and the justification that we need to leave users to decide what is not trivial to decide) can actually simplify the overall structure. And, this is important, end users (or other ontologists) will likely be less inclined to complain about the structural taxonomy being less flexible, because the limitations would be aimed at things already unlikely to make sense to anyone opinionated about how to model the world. This approach would:

  • allow users with very strong views (or users who receive data from providers closer to the real world)
  • (while not the main goal) likely tolerate the idea that sometimes providers which do have strong views might have to produce something they don't agree with, because the target system receiving the data can't understand the context
    • A region which is not supposed to conduct foreign relations but does something trivial (likely the data simply doesn't have the context to explain itself) would look inconsistent if a provider deployed complex rules of international diplomacy over something as mundane as commercial trade (most likely actually the case for donations from other countries, or where the source or target is governmental)
      • One alternative here could be either to make the global model more flexible (but this would make it increasingly complex with exceptions) or... simply disable the rule in this case because of the massive amount of false positives.
      • The tempting top-level idea is way too academic and impractical in the real world: imagine asking everyone to add much more metadata to every data point just to make the model simpler. That's unrealistic. People would immediately refuse, sometimes not even because it is slower, but because the additional data they would need might not be accessible to them, since it would require information from other organizations (likely top-level validation, entrypoint by entrypoint).
        • I think it is fair, for example, that at least top organizations which would need to validate the data to make it context-aware should be able to disable their rule if it would make use in some contexts impractical.

fititnt added a commit that referenced this issue Jul 30, 2022
fititnt added a commit that referenced this issue Jul 30, 2022
…aybe we just use anonymous nodes and abuse turtle [] syntax all the way?
fititnt added a commit that referenced this issue Jul 30, 2022

fititnt commented Jul 30, 2022

Hum... the first early attempt, while it does store the statistical data in RDF, has this issue:


[Screenshot from 2022-07-30 12-22-45]


As expected, without any additional step compared to what we're doing, it will store the statistical data (great), but it loses the reference to... the years. Things like population (P1082) (meaning total population), female population (P1539), male population (P1540), urban, rural, households: for these we can have shared verbs. However, for things which are very structured, like dates and maybe other variants, we need to optimize how to add sufficient metadata to know the context.

Either the way https://sdmx.org/ does it or (more likely what we would do) the way Wikidata does it with Qualifiers https://www.wikidata.org/wiki/Help:Qualifiers could be our goal. However, even Wikidata tends not to have massive amounts of data as well structured as what we use in tabular formats.

To do

While it might need future review (likely based on making the data easier to query), in the short term we can adopt any strategy that at least stores the reference to the date of the statistics, so at least we can say that the data in .no1.tm.hxl.csv and .no1.owl.ttl are equivalent.
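
One possible short-term strategy, sketched with rdflib blank nodes (close to the "abuse turtle [] syntax" idea in the commit referenced below): keep the value and its reference year together on an observation node. The EX predicates and the QID are placeholders; P1082 is the real Wikidata property.

```python
# Sketch: keep the statistic and its reference year together on a blank node,
# so the year is not lost when the table is converted to RDF.
from rdflib import BNode, Graph, Literal, Namespace
from rdflib.namespace import XSD

WDT = Namespace("http://www.wikidata.org/prop/direct/")
WD = Namespace("http://www.wikidata.org/entity/")
EX = Namespace("urn:example:")

g = Graph()
g.bind("wdt", WDT)
g.bind("ex", EX)

place = WD.Q123456  # placeholder QID for some administrative boundary
observation = BNode()
g.add((place, EX.hasObservation, observation))
g.add((observation, EX.statisticalVariable, WDT.P1082))   # total population
g.add((observation, EX.value, Literal(1120000, datatype=XSD.integer)))
g.add((observation, EX.referenceYear, Literal(2020, datatype=XSD.gYear)))
print(g.serialize(format="turtle"))
```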

fititnt added a commit that referenced this issue Aug 3, 2022
fititnt added a commit that referenced this issue Aug 4, 2022
…) allow explaining which HXLTM -x- BCP47 attributes would become RDF main entrypoints
fititnt added a commit that referenced this issue Aug 4, 2022
fititnt added a commit to EticaAI/MDCIII-boostrapper that referenced this issue Aug 4, 2022

fititnt commented Aug 9, 2022

Hummm. The drafted 1603_9966_1 (from the World Bank) is great: it has population by year and other classifications (such as gender-or-sex, urban/rural, etc.), but for most cases we also need something far simpler (only population statistics from some recent year).

Eventually a more focused dataset (without series by year) should also be available. Maybe we place it alongside other variables.
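
A sketch of deriving that simpler "latest year only" table from the by-year series; the file name and column names here are assumptions, not the actual 1603_9966_1 layout.

```python
# Sketch: keep only the most recent year per place from a by-year series.
import pandas as pd

df = pd.read_csv("1603_9966_1.csv")  # assumed columns: pcode, year, population
latest = (
    df.sort_values("year")
      .groupby("pcode", as_index=False)
      .tail(1)
      .reset_index(drop=True)
)
latest.to_csv("population_latest_year.csv", index=False)
```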
