
Taxonomic strategy to encode collective of humans (such as population statistics by total and thematic) by P-codes #43

fititnt opened this issue May 27, 2022 · 4 comments



fititnt commented May 27, 2022


After #39, we will eventually have to integrate at least population statistics. There are several sources for it, but we at least need to decide on the organization strategy and prepare the tooling.

fititnt changed the title from "Organization strategy to deal with population statistics (total and thematic) by published codes" to "Organization strategy to deal with population statistics (total and thematic) by published place codes" on May 27, 2022
fititnt changed the title from "Organization strategy to deal with population statistics (total and thematic) by published place codes" to "Taxonomic strategy to encode collective of humans (such as population statistics by total and thematic) by P-codes" on Jul 13, 2022

fititnt commented Jul 25, 2022

Known challenges

1. Primary: ways to semantically encode in both tabular and graph formats

The newer versions of No1 and No11 in CSV already allow generic conversion to RDF and (this is important) are on the path to allowing related concepts which are not the same thing. This may seem strange for a tabular database format, but it is quite hard to do both without overcomplicating things for the end user.

We could simply do a 1:1 Wikidata mapping (some attributes, like male and female populations, are already differentiated), but we should still leave room for more specialized variants.

While not fully self-testable (we still rely only on the frictionless validator and, to simplify, the Apache Jena riot validator), content already published on the fully automated organization @MDCIII is using stricter HXL Standard in ways which allow RDF mappings. That's why the way we encode metadata about collections of humans by theme becomes quite a big deal. What matters is less the data itself and more the well-documented/predictable schemas: they allow tooling integration for data which doesn't need to be public as well as data which is public, and they would make it easier for others to convert their data to our taxonomy and reuse everything else.
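
Just to make the 1:1 idea concrete, here is a minimal sketch (not the project's actual schema) of mapping a few tabular population columns to Wikidata properties with rdflib; the column names and the placeholder QID are assumptions, the P1082/P1539/P1540 properties are the real Wikidata ones mentioned later in this thread.

```python
# Minimal sketch: map a few hypothetical tabular columns to Wikidata
# properties so the same row can be read either as CSV or as RDF triples.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import XSD

WDT = Namespace("http://www.wikidata.org/prop/direct/")
WD = Namespace("http://www.wikidata.org/entity/")

# Hypothetical column -> Wikidata property crosswalk
COLUMN_TO_PROPERTY = {
    "population_total": WDT.P1082,   # population
    "population_female": WDT.P1539,  # female population
    "population_male": WDT.P1540,    # male population
}

def row_to_triples(graph: Graph, subject: URIRef, row: dict) -> None:
    """Add one triple per mapped column that has a value."""
    for column, prop in COLUMN_TO_PROPERTY.items():
        value = row.get(column)
        if value not in (None, ""):
            graph.add((subject, prop, Literal(int(value), datatype=XSD.integer)))

g = Graph()
g.bind("wdt", WDT)
g.bind("wd", WD)
# Q123456 is a placeholder QID for "some administrative boundary".
row_to_triples(g, WD.Q123456, {"population_total": "1120000", "population_female": "580000"})
print(g.serialize(format="turtle"))
```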

2. Secondary: the crawlers

While we obviously can also fetch Wikidata population statistics (and are likely to do this even just for testing the schemas), it is still viable to get population data from other places.

However, this is already a part where I'm not sure it is worth the trouble to focus on data already published on, for example, the Humanitarian Data Exchange (HDX) rather than crawling the common APIs directly. Maybe it could be, but only for a small subset of countries.
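
For the "fetch from Wikidata just to test the schemas" part, something like the sketch below could be enough. The endpoint and the SPARQL label service are the real Wikidata Query Service ones; the QID is assumed to be Mozambique's and error handling is minimal.

```python
# Sketch: pull total population (P1082) from the Wikidata Query Service,
# which may be enough for testing the schemas before touching HDX or
# country-specific APIs.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?place ?placeLabel ?population WHERE {
  VALUES ?place { wd:Q1029 }            # assumed: Mozambique
  ?place wdt:P1082 ?population .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

response = requests.get(
    SPARQL_ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "population-schema-test/0.1"},
    timeout=60,
)
response.raise_for_status()
for binding in response.json()["results"]["bindings"]:
    print(binding["placeLabel"]["value"], binding["population"]["value"])
```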

3. Comment: about population data for humanitarian use which is inferred

On the notes about population statistics, at least for Mozambique, comments warn that humanitarians are requesting granular information (such as age by gender and disability) which simply is not available for real; so, apart from some extra human review, those numbers are already simulations.

By no means am I saying these data are bad or not worth it. In fact, when done well, they are cost-effective. But by having other metrics which could act as seeds, users could derive other inferences on demand or compare different sources. Not something for the short term (maybe not even the mid term), but the level of detail we're going to taxonomize would allow it.

However, while it might seem strange, a major feature is the population statistics which already use P-Codes. Unless we resolve the P-Codes to Wikidata QIDs, we can't automate several features which we could otherwise get really fast. We're really aiming to get things very well integrated, not mere data hoarding.
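
A sketch of what that resolution step could look like, assuming a hand-maintained crosswalk CSV (the file name and its columns are hypothetical): once P-Codes map to QIDs, statistics keyed by P-Code can be joined to anything already keyed by QID.

```python
# Sketch: enrich P-Code keyed statistics with Wikidata QIDs via a
# hypothetical crosswalk file.
import csv

def load_pcode_to_qid(path: str = "pcode_wikidata_crosswalk.csv") -> dict:
    """Return {pcode: qid}; the 'pcode' and 'qid' column names are assumptions."""
    with open(path, newline="", encoding="utf-8") as handle:
        return {row["pcode"]: row["qid"] for row in csv.DictReader(handle)}

def attach_qids(rows: list, crosswalk: dict) -> list:
    """Copy statistics rows, adding the QID when the P-Code is known."""
    enriched_rows = []
    for row in rows:
        enriched = dict(row)
        enriched["qid"] = crosswalk.get(row.get("pcode", ""), "")
        enriched_rows.append(enriched)
    return enriched_rows
```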


fititnt commented Jul 30, 2022

The ./999999999/0/999999999_521850.py program is used as the data scraper for some thematic data. Regardless of the challenge pointed out here #45 (comment) about eventually needing to map 1:1 with P-Codes at subnational levels, the amount of thematic data for which we can already create tabular data is so high that it brings up the next issue:

To dos

Thematic data will need stricter graph mappings

Several places can describe nearly the same content. HXL could work with plain natural language, but to allow automated documentation (and things like semantic reasoning without a centralized server) we really need very well documented mappings.

This is one of the reasons why the slowdown is less about data scraping and getting the crawlers working, and more about... how to taxonomize the results. Things which are Wikidata properties could even use properties which would make sense if ingested back into Wikidata, but we're likely to deal with things that still need to be encoded and will need to refer to individual Q items.

The decision about naming things (e.g. the infixes to use for thematic data)

There are only two hard things in Computer Science: cache invalidation and naming things.

-- Phil Karlton

Regardless of how we use RDF or other mappings, since we don't rely on public URLs but on URNs with very well-defined meaning, we need to think carefully about how to organize the structural numbering. From the taxonomy alone it is already possible to infer what the data is about, and this simplifies a lot when the data is ingested into SQL databases (the #37), but we still need to think about the entrypoints.
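
Purely as an illustration of "predictable structural numbering" (the `urn:data` prefix and the helper below are hypothetical; only the `1603_9966_1` numbering comes from this thread), something like this keeps the identifiers machine-buildable:

```python
# Illustrative sketch: assemble URN-like identifiers from numeric parts.
# The "urn:data" prefix and infix scheme are placeholders, not the
# project's real structural numbering.
def build_urn(base_group: str, thematic_infix: str, suffix: str = "") -> str:
    parts = ["urn:data", base_group, thematic_infix]
    if suffix:
        parts.append(suffix)
    return ":".join(parts)

# One identifier per thematic dataset, predictable enough that a SQL loader
# (or a human) can infer what the data is about from the name alone.
print(build_urn("1603", "9966", "1"))  # -> urn:data:1603:9966:1
```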

BFO, RDF and how to package the datasets when in tabular format

For now, in addition to thinking about such numeric infixes, we're likely to think about how the final organization would look in terms of Basic Formal Ontology. For the sake of RDF and to maximize reuse for people who can't have centralized systems, this means thinking in ways that work both for basic usage (likely just common tables as entrypoints) and for advanced usage (triplestores, where the massive number of properties could easily blow past the column limits of PostgreSQL).

Likely one of the main features would be to offer one strongly suggested option of single inheritance which could still be reasonable. Hardcore ontologists could still reshape the data later, but having at least one way that we can make work with real data makes sense.
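
A minimal sketch of what "one strongly suggested single-inheritance chain" could look like when serialized, using placeholder IRIs (the EX namespace and class names are invented here, not an actual BFO alignment):

```python
# Sketch: a single-inheritance chain expressed as rdfs:subClassOf, using
# placeholder IRIs rather than a real BFO alignment.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

EX = Namespace("urn:example:")

g = Graph()
g.bind("ex", EX)
for cls in (EX.PopulationStatistic, EX.ThematicPopulationStatistic):
    g.add((cls, RDF.type, OWL.Class))
# Exactly one parent per class: ontologists can reshape this later, but a
# default chain keeps reasoning and tabular flattening simple.
g.add((EX.ThematicPopulationStatistic, RDFS.subClassOf, EX.PopulationStatistic))
print(g.serialize(format="turtle"))
```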

Non-goals of the to-dos

However, note that while this point needs documentation, for things that would break reasoning even in graph format for the final user (most notably when ingesting data lets contradictory facts co-exist, such as one fact stating that some administrative boundary is a sovereign country while this breaks other parts), we can simply not assume the user will ingest all data at the same time. Another case is when users ingest two or more models to explain the world (for example, different ways to express what is biological sex and what is gender identity) and this would break the reasoning because they would try to re-classify facts in different ways.

I think this limitation of not trying to make the structural taxonomy deal with such advanced inconsistencies (and the justification that we need to leave users to decide what is not trivial to decide) can actually simplify the overall structure. And, this is important, end users (or other ontologists) will likely be less inclined to complain about the structural taxonomy being less flexible, because the limitations would be aimed at things already unlikely to make sense to anyone opinionated about how to model the world. This approach would:

  • allow users with very strong views (or users who receive data from providers closer to the real world)
  • (while not the main goal) likely tolerate the idea that sometimes providers which do have strong views might have to produce something they don't agree with, because the target system receiving the data can't understand the context
    • A region which is not supposed to conduct foreign relations but does something trivial (likely the data simply doesn't have the context to explain itself) would look inconsistent if a provider deployed complex rules of international diplomacy over something as mundane as commercial trade (most likely actually the case for donations from other countries, or where the source or target is governmental)
      • One alternative here could be either to make the global model more flexible (but this would make it increasingly complex with exceptions) or... simply disable the rule in this case because of the massive amount of false positives.
      • The tempting top-level idea is way too academic and impractical in the real world: imagine asking everyone to add much more metadata to every data point just to make the model simpler. That's unrealistic. People would immediately refuse, sometimes not even because it is slower, but because the additional data they would need might not be accessible to them, since it would require information from other organizations (likely top-level validation, entrypoint by entrypoint).
        • I think it is fair, for example, that at least top organizations which would need to validate the data to make it context-aware should be able to disable their rule if it would make use in some contexts impractical.

fititnt added a commit that referenced this issue Jul 30, 2022
fititnt added a commit that referenced this issue Jul 30, 2022
…aybe we just use anonymous nodes and abuse turtle [] syntax all the way?
fititnt added a commit that referenced this issue Jul 30, 2022

fititnt commented Jul 30, 2022

Hum... the first early attempt, while it does store the statistical data in RDF, has this issue:


[Screenshot from 2022-07-30 12-22-45]


As expected, without any additional step compared to what we're doing, it will store the statistical data (great), but it loses the reference to... the years. Things like population (P1082) (meaning total population), female population (P1539), male population (P1540), urban, rural, households: for these we can have shared verbs. However, for things which are very structured, like dates and maybe other variants, we need to optimize how to add sufficient metadata to know the context.

Either the way https://sdmx.org/ does it or (more likely what we would do) the way Wikidata does it with Qualifiers https://www.wikidata.org/wiki/Help:Qualifiers could be our goal. However, even Wikidata tends not to have massive amounts of data as well structured as what we use in tabular formats.

To do

While it might need future review (likely based on making the data easier to query), in the short term we can adopt any strategy that at least stores the reference to the date of the statistics, so at least we can say that the data in .no1.tm.hxl.csv and .no1.owl.ttl are equivalent.
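
One possible short-term strategy, sketched with rdflib blank nodes (close to the "abuse turtle [] syntax" idea in the commit referenced below): keep the value and its reference year together on an observation node. The EX predicates and the QID are placeholders; P1082 is the real Wikidata property.

```python
# Sketch: keep the statistic and its reference year together on a blank node,
# so the year is not lost when the table is converted to RDF.
from rdflib import BNode, Graph, Literal, Namespace
from rdflib.namespace import XSD

WDT = Namespace("http://www.wikidata.org/prop/direct/")
WD = Namespace("http://www.wikidata.org/entity/")
EX = Namespace("urn:example:")

g = Graph()
g.bind("wdt", WDT)
g.bind("ex", EX)

place = WD.Q123456  # placeholder QID for some administrative boundary
observation = BNode()
g.add((place, EX.hasObservation, observation))
g.add((observation, EX.statisticalVariable, WDT.P1082))   # total population
g.add((observation, EX.value, Literal(1120000, datatype=XSD.integer)))
g.add((observation, EX.referenceYear, Literal(2020, datatype=XSD.gYear)))
print(g.serialize(format="turtle"))
```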

fititnt added a commit that referenced this issue Aug 3, 2022
fititnt added a commit that referenced this issue Aug 4, 2022
…) allow explaining which HXLTM -x- BCP47 attributes would become RDF main entrypoints
fititnt added a commit that referenced this issue Aug 4, 2022
fititnt added a commit to EticaAI/MDCIII-boostrapper that referenced this issue Aug 4, 2022

fititnt commented Aug 9, 2022

Hummm. The drafted 1603_9966_1 (from the World Bank) is great: it has population by year and other classifications (such as gender-or-sex, urban/rural, etc.), but for most cases we also need something far simpler (only population statistics from some recent year).

Eventually a more focused dataset (without series by year) should also be available. Maybe we place it alongside other variables.
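
A sketch of deriving that simpler "latest year only" table from the by-year series; the file name and column names here are assumptions, not the actual 1603_9966_1 layout.

```python
# Sketch: keep only the most recent year per place from a by-year series.
import pandas as pd

df = pd.read_csv("1603_9966_1.csv")  # assumed columns: pcode, year, population
latest = (
    df.sort_values("year")
      .groupby("pcode", as_index=False)
      .tail(1)
      .reset_index(drop=True)
)
latest.to_csv("population_latest_year.csv", index=False)
```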
