New data warehouse strategy (graph): RDF/SPARQL graph database populated with dictionaries data (experimental feature) #41

Open
fititnt opened this issue May 14, 2022 · 5 comments
Labels
librarium-formato librārium fōrmātō; /library format/@eng-Latn; Related to storage of entire referential data

Comments

@fititnt
Member

fititnt commented May 14, 2022


The way we organize the dictionaries' entry point has been very regular for some time, and after #38 there's no reason not to start practical tests.

To-dos for this minimum viable product

Export a format that is easy to parse

While #38 could be used to import into some graph database, it's not optimized for speed. So it's better to export at least one format that is easier and more compact to parse than the alternatives intended to be edited by hand.

Run actual tests on one or more graph databases

While the SQLite from #37 is quite useful for quick debugging, we need at least one or two tests that actually import into a graph database.

We also need to take into account ways to allow validation/integrity tests of the entire library once it is in a graph database. It would be easier to do it this way.

@fititnt fititnt added the librarium-formato librārium fōrmātō; /library format/@eng-Latn; Related to storage of entire referential data label May 14, 2022
fititnt added a commit that referenced this issue May 14, 2022
fititnt added a commit that referenced this issue May 14, 2022
fititnt added a commit that referenced this issue May 14, 2022
@fititnt fititnt changed the title New data warehouse strategy: SPARQL graph database populated with dictionaries data (experimental feature) New data warehouse strategy (graph): RDF/SPARQL graph database populated with dictionaries data (experimental feature) Jun 4, 2022
@fititnt
Member Author

fititnt commented Jun 4, 2022

We already have a proof of concept for converting specially crafted BCP47 language tags (similar to the HXL we've been using, to the point that there is a mapping between both), so this will eventually be as usable as, or actually much more usable than, the tabular alternative of #37.

@todo assume the BCP47 -r- extension actually mimics RDF-star

At this moment, every -g- part is a pair, so it is quite easy to split the parts by "-". However, sometimes we need to push in even more information to know what to do, and this is starting to become the common case, not the exception.

I think the best option would be for the BCP47 tabular heading versions to assume 3 parts by default instead of 2; then, if the last part is not necessary, we can use 0 to represent "self".
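A minimal sketch of the splitting described above: the subtags after the extension singleton are grouped into fixed-size chunks, pairs today and possibly triples later. The function name `split_extension`, its parameters, and the triple-style header in the last call (with "0" as filler) are illustrative assumptions, not the project's actual parser or an agreed syntax.

```python
def split_extension(tag, singleton="r", size=2):
    """Split a header tag into (language part, list of fixed-size groups)."""
    head, sep, tail = tag.partition(f"-{singleton}-")
    if not sep:
        # No such extension in this tag: nothing to group.
        return head, []
    parts = tail.split("-")
    if len(parts) % size != 0:
        raise ValueError(f"expected groups of {size}, got {len(parts)} subtags")
    groups = [tuple(parts[i:i + size]) for i in range(0, len(parts), size)]
    return head, groups


# Header taken from the TSV example in this comment (pairs).
lang, pairs = split_extension(
    "qcc-Zxxx-r-pSKOS-broader-sS-s2-yCSVWseparator-u007c-yPREFIX-unescothes"
)
# pairs == [("pSKOS", "broader"), ("sS", "s2"),
#           ("yCSVWseparator", "u007c"), ("yPREFIX", "unescothes")]

# Hypothetical triple-style header, with "0" standing for "self".
_, triples = split_extension("qcc-Zxxx-r-pSKOS-broader-0", size=3)
# triples == [("pSKOS", "broader", "0")]
```

With a fixed group size, a malformed header fails early instead of silently shifting every following group by one subtag.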

Current example (without mimicking RDF-star)

This example does not implement all the correct semantics. Also, the "bags" are not really different at all (the idea of having different groupings is only relevant when tables contain different data at different levels, like the COD-ABs, which have multiple concepts, but the endpoints would then break them into different tables).

Also, these converters can get pretty complicated. In fact, to avoid implementing full inference ourselves in Python, for every bag in a tabular format we export different triples and then join them with something like Apache Jena (riot).

unesco-thesaurus.bcp47g.tsv

qcc-Zxxx-r-sU2200-s1	qcc-Zxxx-r-sU2203-s2-yCSVWseparator-u007c-yPREFIX-unescothes	qcc-Zxxx-r-pSKOS-broader-sS-s2-yCSVWseparator-u007c-yPREFIX-unescothes	qcc-Zxxx-r-pSKOS-narrower-sS-s2-yCSVWseparator-u007c-yPREFIX-unescothes	qcc-Zxxx-r-pSKOS-related-sS-s2-yCSVWseparator-u007c-yPREFIX-unescothes	rus-Cyrl-r-pSKOS-prefLabel-sS-s1	arb-Arab-r-pSKOS-prefLabel-sS-s1	spa-Latn-r-pSKOS-prefLabel-sS-s1	qcc-Zxxx-r-pDCT-modified-txsd-datetime-sS-s1
1603:999:9	concept9			concept10	Политика в области образования	سياسة تربوية	Política educacional	2019-12-15T22:36:40Z
1603:999:10	concept10		concept4938|concept7597	concept9	Право на образование	حق في التعليم	Derecho a la educación	2019-12-15T13:26:49Z
1603:999:4938	concept4938	concept10		concept10	Возможности получения образования	فرص تربوية	Oportunidades educacionales	2019-12-15T22:36:42Z

unesco-thesaurus.rdf.ttl

Not a good example; the problem is not the tools, but the input data.

@prefix dct: <http://purl.org/dc/terms/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix wdata: <http://www.wikidata.org/wiki/Special:EntityData/> .
@prefix obo: <http://purl.obolibrary.org/obo/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix unescothes: <http://vocabularies.unesco.org/thesaurus/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

unescothes:concept10  skos:related  unescothes:concept9 ;
        skos:narrower  unescothes:concept7597 ;
        skos:narrower  unescothes:concept4938 ;
        rdf:type       rdfs:Class .

unescothes:concept9  skos:related  unescothes:concept10 ;
        rdf:type      rdfs:Class .

unescothes:concept4938
        skos:related  unescothes:concept10 ;
        skos:broader  unescothes:concept10 ;
        rdf:type      rdfs:Class .

<urn:1603:999:9>  skos:prefLabel  "سياسة تربوية"@arb-Arab ;
        skos:prefLabel  "Политика в области образования"@rus-Cyrl ;
        skos:prefLabel  "Política educacional"@spa-Latn ;
        rdf:type        rdfs:Class ;
        dct:modified    "2019-12-15T22:36:40Z" .

<urn:1603:999:10>  skos:prefLabel  "حق في التعليم"@arb-Arab ;
        skos:prefLabel  "Право на образование"@rus-Cyrl ;
        skos:prefLabel  "Derecho a la educación"@spa-Latn ;
        rdf:type        rdfs:Class ;
        dct:modified    "2019-12-15T13:26:49Z" .

<urn:1603:999:4938>  skos:prefLabel  "فرص تربوية"@arb-Arab ;
        skos:prefLabel  "Возможности получения образования"@rus-Cyrl ;
        skos:prefLabel  "Oportunidades educacionales"@spa-Latn ;
        rdf:type        rdfs:Class ;
        dct:modified    "2019-12-15T22:36:42Z" .
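A rough sketch of how one multi-valued cell from the TSV above could become the `skos:narrower` triples shown in the Turtle. It assumes that `yCSVWseparator-u007c` names the cell separator by its code point (U+007C, "|") and that `yPREFIX-unescothes` names the prefix for both subject and objects; that is a reading of the example, not documented behavior. The helper names are hypothetical.

```python
def decode_separator(code):
    """Turn a code such as 'u007c' into the character it names ('|')."""
    return chr(int(code[1:], 16))


def cell_to_triples(subject, predicate, cell, prefix, separator_code):
    """Emit one N-Triples-like line per non-empty value in the cell."""
    separator = decode_separator(separator_code)
    values = [value for value in cell.split(separator) if value]
    return [f"{prefix}:{subject} {predicate} {prefix}:{value} ." for value in values]


# Row for concept10 in the TSV above: its 'narrower' cell holds two values.
narrower = cell_to_triples(
    "concept10", "skos:narrower", "concept4938|concept7597", "unescothes", "u007c"
)
# narrower == [
#     "unescothes:concept10 skos:narrower unescothes:concept4938 .",
#     "unescothes:concept10 skos:narrower unescothes:concept7597 .",
# ]

# An empty cell (for example 'broader' on the same row) yields no triples.
broader = cell_to_triples("concept10", "skos:broader", "", "unescothes", "u007c")
```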

fititnt added a commit that referenced this issue Jun 5, 2022
fititnt added a commit that referenced this issue Jun 6, 2022
…espaces, common prefixes which need rewritting (use case: revert from HXL attributes to CamelCase RDF namespaces) started
fititnt added a commit that referenced this issue Jun 6, 2022
…ee to allow multiple information about subject of predicates with ||
fititnt added a commit that referenced this issue Jun 6, 2022
…s pass; AST not converted yet to || division
fititnt added a commit that referenced this issue Jun 6, 2022
…again; still need to refactor for grouped datasets
fititnt added a commit that referenced this issue Jun 6, 2022
fititnt added a commit that referenced this issue Jun 7, 2022
…nly); thanks Simone Torres de Souza! (refs https://mba.eci.ufmg.br/legal/bfo-pt/); ; Need translations review (source updated): BFO_0000142, BFO_0000147, BFO_0000146; needs new translation BFO_0000202, BFO_0000203; definitions not added; properties still need translation
fititnt added a commit that referenced this issue Jun 7, 2022
…0% sure yet; BFO_0000082, BFO_0000171 <-> BFO_0000124
fititnt added a commit that referenced this issue Jun 12, 2022
…ries, local only (time: 30m28,338s); before RDF relations
fititnt added a commit that referenced this issue Jun 12, 2022
fititnt added a commit that referenced this issue Jun 12, 2022
…coded list will make it work for common cases at sort term
fititnt added a commit that referenced this issue Jun 12, 2022
fititnt added a commit that referenced this issue Jun 12, 2022
…9999_54872.py --objectivum-formato=_temp_hxl_meta_in_json
fititnt added a commit that referenced this issue Jun 13, 2022
fititnt added a commit that referenced this issue Jun 13, 2022
fititnt added a commit that referenced this issue Jun 14, 2022
fititnt added a commit that referenced this issue Jun 19, 2022
@fititnt
Member Author

fititnt commented Jun 30, 2022

Okay. We managed to automate how users can load the tabular CSV data into SQL databases (description here: #37 (comment)).

We already export RDF triples in a more linguistically focused version and in others optimized for computational use. However, for cases where a user plans to work with a massive amount of data, graphical interfaces such as Protege may not scale. This obviously needs more testing.

Why SQL storage might be relevant here

It turns out that it is feasible to use R2RML (https://www.w3.org/TR/r2rml/) or compatible tools such as Ontop (https://ontop-vkg.org/) to:

  • Use SQL data storage as if it held the triples we're generating
  • Create Ontology-Based Data Access (e.g. the user can access data inside Protege despite it being stored elsewhere)
  • Create API endpoints

To allow such a feature, we would need to generate files such as R2RML mappings and make sure users have data that perfectly matches the configuration. The beauty here is that we can keep user documentation to a bare minimum while allowing state-of-the-art uses, and (this is important) interfaces such as those in Protege can be in the user's language.

Most of the underlying details on how to eventually reach this point are very boring to explain, even to advanced users who aren't familiar with ontology engineering, which brings us to...
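To make "generating files such as R2RML" a bit more concrete, here is a small sketch that writes a mapping for one SQL table with per-language label columns. The `rr:` terms come from the R2RML vocabulary; the function name, the table name, the column names, and the subject IRI template are placeholders invented for illustration, and the real generator may look quite different.

```python
def build_r2rml(map_name, table, subject_template, label_columns):
    """Return an R2RML mapping (Turtle) exposing label columns as skos:prefLabel.

    label_columns maps a SQL column name to the language tag of its values.
    """
    if not label_columns:
        raise ValueError("at least one label column is required")
    lines = [
        "@prefix rr: <http://www.w3.org/ns/r2rml#> .",
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        "",
        f"<#{map_name}> a rr:TriplesMap ;",
        f'    rr:logicalTable [ rr:tableName "{table}" ] ;',
        f'    rr:subjectMap [ rr:template "{subject_template}" ] ;',
    ]
    object_maps = [
        "    rr:predicateObjectMap [\n"
        "        rr:predicate skos:prefLabel ;\n"
        f'        rr:objectMap [ rr:column "{column}" ; rr:language "{language}" ]\n'
        "    ]"
        for column, language in label_columns.items()
    ]
    # Separate the predicate-object maps with ';' and close the subject with '.'.
    lines.append(" ;\n".join(object_maps) + " .")
    return "\n".join(lines) + "\n"


# Hypothetical table and columns; adjust to whatever the SQL export produces.
mapping = build_r2rml(
    "UnescoThesaurus",
    "unesco_thesaurus",
    "http://example.org/id/{id}",
    {"label_rus_cyrl": "rus-Cyrl", "label_spa_latn": "spa-Latn"},
)
print(mapping)
```

Because the mapping is derived from the same column metadata as the SQL export, the two stay in sync, which is the "data that perfectly matches the configuration" requirement above.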

Basic Formal Ontology path as foundational ontology

In any case, for future readers: we're going down the Basic Formal Ontology path as upper ontology; it is the most widely used and by far the most referenced in the sciences. At the same time, the decision making is not trivial: it doesn't tolerate abstractions/vagueness such as "concept" or "agent". However, the end result is culturally less likely to have divergences at all. The BFO is very, very realistic.

But what about references to Wikidata (and maybe others not on OBO Foundry)?

This might change later, but at least for non-structural content we may use other namespaces. For example, we use Wikidata P297 (https://www.wikidata.org/wiki/Special:EntityData/P297.ttl, https://www.wikidata.org/wiki/Property:P297) to mention that something is what HXL calls +v_iso2.

However, the default choices for organizing "the skeleton" would still be BFO, which means data integration would be less painful, as we really avoid patterns that would break reasoning. Again, the details would be boring, but in practice this means that what every introductory course on "how to use Protege" teaches for categorizing things (such as instances of other classes) is what we cannot do in ontologies designed to be used in production by groups that would disagree with each other.

fititnt added a commit that referenced this issue Aug 1, 2022
fititnt added a commit that referenced this issue Aug 1, 2022
@fititnt
Member Author

fititnt commented Aug 2, 2022

Screenshot from 2022-08-02 03-43-25

@fititnt
Member Author

fititnt commented Aug 4, 2022

Naming things is so hard.

Anyway, most of these groups will have some temporary number starting with 99. This already allows us to draft other tests. The "1603_9966" (@1603_{SPOP}()) becomes a prefix for what Basic Formal Ontology calls "object aggregate" (http://purl.obolibrary.org/obo/BFO_0000027).

While the idea is to be minimalist, at least some things (such as population statistics without further divisions, but allowing for year and place) need to have an entire base, @1603_{POP}().

Draft of organization

  • 1603_16(), http://purl.obolibrary.org/obo/BFO_0000029
    • This would contain every place on Earth by some logical numerical code. No plan yet for other schemas (such as the ones from Facebook, or inferences using URIs such as geo:lat,long?u=123)
  • @1603_{POP}(), http://purl.obolibrary.org/obo/BFO_0000027
    • Population, with a direct link to the exact 1603_16()
    • The number of base categories for division is very low, maybe only total population. This decision makes it very clear that @1603_{POP}() could be automated to generate others as needed
  • @1603_{SPOP}(), http://purl.obolibrary.org/obo/BFO_0000027
    • Here would be pretty much everything that is still the population of some linked 1603_16(), but is a small subset of @1603_{POP}()
      • Likely even sex/gender and other statistics that can still be taken from a census would be here, not in @1603_{POP}()
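The draft above can be written down as a small lookup table, which also makes visible that two groups point at the same BFO class and so cannot be told apart by class alone. The dictionary and function names are illustrative only; the group keys and IRIs are copied from the list.

```python
OBO = "http://purl.obolibrary.org/obo/"

# Group -> BFO class, as listed in the draft of organization.
GROUP_TO_BFO = {
    "1603_16": OBO + "BFO_0000029",
    "1603_{POP}": OBO + "BFO_0000027",
    "1603_{SPOP}": OBO + "BFO_0000027",
}


def groups_by_class(mapping):
    """Invert the table: BFO class IRI -> groups that use it, in listing order."""
    by_class = {}
    for group, iri in mapping.items():
        by_class.setdefault(iri, []).append(group)
    return by_class


shared = groups_by_class(GROUP_TO_BFO)
# shared[OBO + "BFO_0000027"] == ["1603_{POP}", "1603_{SPOP}"]
```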

Why this organization

At the moment, we are thinking in terms of ontology (as per BFO). The lower number of categories also makes it less likely that people would put data in other places.

Note that databases could have some sort of suffixes (likely to cope with over 1000 columns), but the way it would work would force users to put related things together.

Question: so HOW do we categorize further?

By properties.

For example, @1603_{SPOP}() is for now explained by properties and qualifiers. It's similar, but not equal, to https://www.wikidata.org/wiki/Help:Qualifiers.

Medium-term impact on tooling / automation

There are other groups of information to organize (and @1603_{QLTY}() is a big one). If things are placed where most people would have no reason to move them, then, because of how BFO works, the end result could also allow automated testing or inference even for data that is not strictly well documented. Simply being organized in some way (which need not even be fully explained) saves a lot of time.

It is likely easier to explain this by going further and doing the same with organizations (which would also need attributes to explain what they are) after qualities like @1603_{QLTY}().
