New data warehouse strategy (graph): RDF/SPARQL graph database populated with dictionaries data (experimental feature) #41

fititnt · 2022-05-14T14:59:44Z

Related issues
- New exported format: Simple Knowledge Organization System (SKOS) (Basic implementation or better); RDF on Turtle #38
- Organization strategy to deal with numeric namespace of dictionaries which are handled at administrative level #39
- Another strategy
  - New data warehouse strategy [tabular]: SQL database populated with dictionaries data (experimental feature) #37
Related software and concept
Related software
- List of software
- Some open source software
  - https://en.wikipedia.org/wiki/Wikibase (used by https://en.wikipedia.org/wiki/Wikidata)
  - https://en.wikipedia.org/wiki/Blazegraph (used by Wikidata/Wikibase)
  - https://en.wikipedia.org/wiki/Apache_Jena
  - https://en.wikipedia.org/wiki/RDFLib
    - https://rdflib.dev/
    - https://github.com/RDFLib

The way we organize the dictionaries' entry points has been very regular for some time already, and after #38 there's no reason not to start doing practical tests.

To Do's of this minimal viable product

Export a format easy to parse

While the output of #38 could be imported into a graph database, it is not optimized for speed. So it's better to export at least one format that is easier and more compact to parse than the alternatives intended to be edited by hand.
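As a rough illustration of what "easy to parse" could mean, here is a minimal sketch that writes statements as N-Triples (one complete triple per line, no prefixes, no nesting). N-Triples is my assumption for the example, not a decision already made in this issue; the `Statement` class and function names are hypothetical, and the sample values are borrowed from the UNESCO thesaurus test data used in this thread.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Statement:
    """Hypothetical container for one triple (not part of the existing tools)."""
    subject: str                 # absolute IRI
    predicate: str               # absolute IRI
    obj: str                     # absolute IRI or literal text
    obj_is_iri: bool = False
    lang: Optional[str] = None   # language tag, only used for literals


def escape_literal(text: str) -> str:
    """Escape backslash, double quote and line breaks inside a literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def serialize_ntriples(statements: Iterable[Statement]) -> Iterator[str]:
    """Yield one N-Triples line per statement."""
    for st in statements:
        if st.obj_is_iri:
            obj = f"<{st.obj}>"
        else:
            obj = f'"{escape_literal(st.obj)}"'
            if st.lang:
                obj += f"@{st.lang}"
        yield f"<{st.subject}> <{st.predicate}> {obj} ."


SKOS = "http://www.w3.org/2004/02/skos/core#"
example = [
    Statement("urn:1603:999:9", SKOS + "prefLabel",
              "Política educacional", lang="spa-Latn"),
    Statement("http://vocabularies.unesco.org/thesaurus/concept9",
              SKOS + "related",
              "http://vocabularies.unesco.org/thesaurus/concept10",
              obj_is_iri=True),
]
ntriples_lines = list(serialize_ntriples(example))
print("\n".join(ntriples_lines))
```

Because every line is self-contained, a file like this can be split, concatenated, or streamed without a full parser, which is the property that matters for bulk import.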

Do actual tests on one or more graph databases

While the SQLite database from #37 is quite useful for quick debugging, we need at least one or two tests that actually import the data into a graph database.

We also need to take into account ways to allow validation/integrity tests of the entire library as soon as it is in a graph database. It would be easier to do it this way.
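To make "integrity test" a bit more concrete, here is a minimal in-memory sketch of one possible check: every skos:narrower should have a matching skos:broader in the opposite direction, and vice versa. This particular rule is only an example I picked; on a real graph database the same check would more likely be a SPARQL query. The function name and the compact `prefix:name` strings are assumptions for illustration.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]

BROADER = "skos:broader"
NARROWER = "skos:narrower"


def missing_inverses(triples: Iterable[Triple]) -> List[Triple]:
    """Return the inverse broader/narrower triples that are absent."""
    known: Set[Triple] = set(triples)
    missing: List[Triple] = []
    for s, p, o in sorted(known):
        if p == NARROWER and (o, BROADER, s) not in known:
            missing.append((o, BROADER, s))
        elif p == BROADER and (o, NARROWER, s) not in known:
            missing.append((o, NARROWER, s))
    return missing


sample = [
    ("unescothes:concept10", NARROWER, "unescothes:concept4938"),
    ("unescothes:concept4938", BROADER, "unescothes:concept10"),
    ("unescothes:concept10", NARROWER, "unescothes:concept7597"),
]
report = missing_inverses(sample)
print(report)
```

With this sample, the only gap reported is the missing `skos:broader` from `concept7597` back to `concept10`.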

fititnt · 2022-06-04T21:08:50Z

We already have a proof of concept that converts specially crafted BCP47 language tags (similar to the HXL we've been using, to the point that there is a mapping between both), so this will eventually be as usable as, or actually much more usable than, the tabular alternative of #37.

@todo assume BCP47 -r- extension actually mimics RDF-Star

https://w3c.github.io/rdf-star/

At this moment, every -g- part is a pair, so it is quite easy to split the parts by "-". However, sometimes we need to push even more information to know what to do, and this is starting to become the common case, not the exception.

I think it would be best for the BCP47 tabular heading versions to assume 3 parts by default instead of 2; then, if the last part is not necessary, we can use 0 to represent "self".
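A minimal sketch of the splitting described above, parameterized by group size so the current pairs and the proposed groups of three can be compared. The function name is hypothetical, and the three-part tag in the usage example is invented only to illustrate the proposal (the exact position and meaning of the third part is not decided here).

```python
from typing import List, Tuple


def split_r_extension(tag: str, group_size: int = 2) -> Tuple[str, List[Tuple[str, ...]]]:
    """Split 'lang-r-a-b-c-d' into ('lang', [('a', 'b'), ('c', 'd')])."""
    base, separator, extension = tag.partition("-r-")
    if not separator:
        # No -r- extension at all: plain language tag.
        return tag, []
    parts = extension.split("-")
    if len(parts) % group_size != 0:
        raise ValueError(
            f"{tag!r}: {len(parts)} parts cannot be grouped by {group_size}"
        )
    groups = [
        tuple(parts[i:i + group_size])
        for i in range(0, len(parts), group_size)
    ]
    return base, groups


# Current behavior: every part is a pair.
pairs = split_r_extension("rus-Cyrl-r-pSKOS-prefLabel-sS-s1")

# Proposed behavior (hypothetical tag): groups of three, "0" meaning "self".
triples = split_r_extension("rus-Cyrl-r-pSKOS-prefLabel-0-sS-s1-0", group_size=3)

print(pairs)
print(triples)
```

A fixed group size keeps the split trivial; the cost is the padding `0` on every group that does not need the extra part.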

Current example (without mimicking RDF-star)

This example does not implement all the correct semantics. Also, the "bags" are not really different at all, because the idea of having different groupings is only relevant when tables contain different data at different levels, like the COD-ABs, which have multiple concepts, but the endpoints would then break them into different tables.

Also, these converters can get pretty complicated. In fact, to avoid implementing full inference ourselves in Python, for every bag in a tabular format we export different triples and then join them with something like Apache Jena (riot).

`unesco-thesaurus.bcp47g.tsv`

qcc-Zxxx-r-sU2200-s1	qcc-Zxxx-r-sU2203-s2-yCSVWseparator-u007c-yPREFIX-unescothes	qcc-Zxxx-r-pSKOS-broader-sS-s2-yCSVWseparator-u007c-yPREFIX-unescothes	qcc-Zxxx-r-pSKOS-narrower-sS-s2-yCSVWseparator-u007c-yPREFIX-unescothes	qcc-Zxxx-r-pSKOS-related-sS-s2-yCSVWseparator-u007c-yPREFIX-unescothes	rus-Cyrl-r-pSKOS-prefLabel-sS-s1	arb-Arab-r-pSKOS-prefLabel-sS-s1	spa-Latn-r-pSKOS-prefLabel-sS-s1	qcc-Zxxx-r-pDCT-modified-txsd-datetime-sS-s1
1603:999:9	concept9			concept10	Политика в области образования	سياسة تربوية	Política educacional	2019-12-15T22:36:40Z
1603:999:10	concept10		concept4938|concept7597	concept9	Право на образование	حق في التعليم	Derecho a la educación	2019-12-15T13:26:49Z
1603:999:4938	concept4938	concept10		concept10	Возможности получения образования	فرص تربوية	Oportunidades educacionales	2019-12-15T22:36:42Z

`unesco-thesaurus.rdf.ttl`

Not a good example, not because of the tools, but because of the input data.

@prefix dct: <http://purl.org/dc/terms/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix wdata: <http://www.wikidata.org/wiki/Special:EntityData/> .
@prefix obo: <http://purl.obolibrary.org/obo/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix unescothes: <http://vocabularies.unesco.org/thesaurus/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

unescothes:concept10  skos:related  unescothes:concept9 ;
        skos:narrower  unescothes:concept7597 ;
        skos:narrower  unescothes:concept4938 ;
        rdf:type       rdfs:Class .

unescothes:concept9  skos:related  unescothes:concept10 ;
        rdf:type      rdfs:Class .

unescothes:concept4938
        skos:related  unescothes:concept10 ;
        skos:broader  unescothes:concept10 ;
        rdf:type      rdfs:Class .

<urn:1603:999:9>  skos:prefLabel  "سياسة تربوية"@arb-Arab ;
        skos:prefLabel  "Политика в области образования"@rus-Cyrl ;
        skos:prefLabel  "Política educacional"@spa-Latn ;
        rdf:type        rdfs:Class ;
        dct:modified    "2019-12-15T22:36:40Z" .

<urn:1603:999:10>  skos:prefLabel  "حق في التعليم"@arb-Arab ;
        skos:prefLabel  "Право на образование"@rus-Cyrl ;
        skos:prefLabel  "Derecho a la educación"@spa-Latn ;
        rdf:type        rdfs:Class ;
        dct:modified    "2019-12-15T13:26:49Z" .

<urn:1603:999:4938>  skos:prefLabel  "فرص تربوية"@arb-Arab ;
        skos:prefLabel  "Возможности получения образования"@rus-Cyrl ;
        skos:prefLabel  "Oportunidades educacionales"@spa-Latn ;
        rdf:type        rdfs:Class ;
        dct:modified    "2019-12-15T22:36:42Z" .
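For readers trying to follow how a row of `unesco-thesaurus.bcp47g.tsv` turns into the statements above, here is a simplified sketch. It is not the actual converter: the reading of the heading keys (`sU…` marks the subject column of a group, `sS` points a predicate column at that group, `yCSVWseparator` gives the cell separator as a code point, `yPREFIX` turns values into prefixed names) is my interpretation of this one example, the function names are hypothetical, and things like `rdf:type` and literal escaping are left out.

```python
from typing import Dict, List, Tuple


def parse_heading(heading: str) -> Tuple[str, Dict[str, str]]:
    """Return (language tag, {key: value}) from 'lang-r-k1-v1-k2-v2'."""
    lang, _, extension = heading.partition("-r-")
    parts = extension.split("-")
    return lang, dict(zip(parts[0::2], parts[1::2]))


def row_to_triples(headings: List[str], row: List[str]) -> List[Tuple[str, str, str]]:
    parsed = [parse_heading(h) for h in headings]

    # First pass: find the subject for each group (s1, s2, ...).
    subjects: Dict[str, str] = {}
    for (_, keys), cell in zip(parsed, row):
        for key, group in keys.items():
            if key.startswith("sU"):
                prefix = keys.get("yPREFIX")
                subjects[group] = f"{prefix}:{cell}" if prefix else f"<urn:{cell}>"

    # Second pass: one triple per non-empty value in each predicate column.
    triples: List[Tuple[str, str, str]] = []
    for (lang, keys), cell in zip(parsed, row):
        predicate = next(
            (f"{k[1:].lower()}:{v}" for k, v in keys.items() if k.startswith("p")),
            None,
        )
        if predicate is None or not cell:
            continue
        subject = subjects[keys["sS"]]
        separator = keys.get("yCSVWseparator")  # e.g. 'u007c' is '|'
        values = cell.split(chr(int(separator[1:], 16))) if separator else [cell]
        for value in values:
            if "yPREFIX" in keys:
                obj = f"{keys['yPREFIX']}:{value}"
            elif lang.startswith("qcc"):
                obj = f'"{value}"'           # no linguistic content
            else:
                obj = f'"{value}"@{lang}'
            triples.append((subject, predicate, obj))
    return triples


SUFFIX = "-sS-s2-yCSVWseparator-u007c-yPREFIX-unescothes"
headings = [
    "qcc-Zxxx-r-sU2200-s1",
    "qcc-Zxxx-r-sU2203-s2-yCSVWseparator-u007c-yPREFIX-unescothes",
    "qcc-Zxxx-r-pSKOS-broader" + SUFFIX,
    "qcc-Zxxx-r-pSKOS-narrower" + SUFFIX,
    "qcc-Zxxx-r-pSKOS-related" + SUFFIX,
    "rus-Cyrl-r-pSKOS-prefLabel-sS-s1",
    "arb-Arab-r-pSKOS-prefLabel-sS-s1",
    "spa-Latn-r-pSKOS-prefLabel-sS-s1",
    "qcc-Zxxx-r-pDCT-modified-txsd-datetime-sS-s1",
]
row = ["1603:999:10", "concept10", "", "concept4938|concept7597", "concept9",
       "Право на образование", "حق في التعليم", "Derecho a la educación",
       "2019-12-15T13:26:49Z"]
result = row_to_triples(headings, row)
for triple in result:
    print(*triple, ".")
```

The two passes mirror the "bags" idea: subjects are resolved per group first, so a predicate column only needs to name the group it belongs to.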

…nly); thanks Simone Torres de Souza! (refs https://mba.eci.ufmg.br/legal/bfo-pt/); ; Need translations review (source updated): BFO_0000142, BFO_0000147, BFO_0000146; needs new translation BFO_0000202, BFO_0000203; definitions not added; properties still need translation

fititnt · 2022-06-30T08:36:30Z

Okay. We managed to automate how users can load the tabular data from CSVs into SQL databases (description here: #37 (comment)).

We already export RDF triples in a more linguistically focused version and in others optimized for computational use. However, for cases where a user plans to work with a massive amount of data, graphical interfaces such as Protege may not scale. This obviously needs more testing.

Why SQL storage might be relevant here

It turns out that it is feasible to use R2RML (https://www.w3.org/TR/r2rml/) or similar compatible tools such as ontop (https://ontop-vkg.org/) to:

- Use SQL data storage as if it were the triples we're generating
- Create Ontology Based Data Access (e.g. the user can access data inside Protege, despite it being stored elsewhere)
- Create API endpoints

To allow such a feature, we would need to generate files such as R2RML mappings and make sure users have data that perfectly matches the configuration. The beauty here is that we can keep user documentation at a bare minimum while allowing state-of-the-art uses and (this is important) interfaces such as Protege's can be in the user's language.
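As a sketch of what "generating files such as R2RML" could look like, the function below builds a small R2RML mapping (in Turtle) for one table. The `rr:` terms are from the R2RML vocabulary; everything else (function name, table name, column names, subject template) is hypothetical and would have to match whatever the SQL export of #37 really produces.

```python
from typing import Dict, Optional, Tuple


def build_r2rml(
    table: str,
    subject_template: str,
    columns: Dict[str, Tuple[str, Optional[str]]],
) -> str:
    """Build an R2RML mapping for one table.

    columns maps a SQL column name to (predicate, language tag or None).
    """
    if not columns:
        raise ValueError("at least one column mapping is required")
    lines = [
        "@prefix rr: <http://www.w3.org/ns/r2rml#> .",
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        "",
        f"<#{table}> a rr:TriplesMap ;",
        f'    rr:logicalTable [ rr:tableName "{table}" ] ;',
        f'    rr:subjectMap [ rr:template "{subject_template}" ] ;',
    ]
    maps = []
    for column, (predicate, lang) in columns.items():
        language = f' ; rr:language "{lang}"' if lang else ""
        maps.append(
            f"    rr:predicateObjectMap [ rr:predicate {predicate} ; "
            f'rr:objectMap [ rr:column "{column}"{language} ] ]'
        )
    lines.append(" ;\n".join(maps) + " .")
    return "\n".join(lines) + "\n"


# Hypothetical table and columns, for illustration only.
mapping = build_r2rml(
    table="unesco_thesaurus",
    subject_template="urn:{codicem}",
    columns={
        "rus_cyrl": ("skos:prefLabel", "rus-Cyrl"),
        "spa_latn": ("skos:prefLabel", "spa-Latn"),
    },
)
print(mapping)
```

Since the mapping is derived from the same metadata as the tables, it can be regenerated whenever the export changes, which is what keeps the data and the configuration in sync.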

Most of the underlying details on how to eventually reach this point are very boring to explain, even to advanced users who aren't familiar with ontology engineering, which brings us to...

Basic Formal Ontology path as foundational ontology

In any case, for future readers: we're taking the Basic Formal Ontology path as upper ontology; it is the most widely used and by far the most referenced in the sciences. At the same time, the decision making is not trivial: it doesn't tolerate abstractions/vagueness such as "concept" or "agent". However, the end result is culturally less likely to have divergences at all. BFO is very, very realistic.

But what about references to Wikidata (and maybe others not on OBO Foundry)?

https://obofoundry.org/

This might change later, but at least for non-structural content we may use other namespaces (for example, we use Wikidata P297, https://www.wikidata.org/wiki/Special:EntityData/P297.ttl , https://www.wikidata.org/wiki/Property:P297 , to mention that something is what HXL calls +v_iso2).

However, the default choices for organizing "the skeleton" would still be BFO, which means data integration would be less painful, as we really avoid patterns which would break reasoning. Again, the details would be boring, but in practice this means that what every introductory course on "how to use Protege" teaches for categorizing things (such as instances of other classes) is what we cannot do in ontologies designed to be used in production by groups which would disagree with each other.

fititnt · 2022-08-02T06:47:19Z

fititnt · 2022-08-04T16:05:59Z

Naming things is so hard.

Anyway, most of these groups will have some temporary number, starting with 99. This already allows drafting other tests. The "1603_9966" (@1603_{SPOP}()) becomes a prefix for what Basic Formal Ontology calls an "object aggregate" (http://purl.obolibrary.org/obo/BFO_0000027).

While the idea is to be minimalist, at least things that are fully (such as population statistics without further divisions, but allowing for year and place) need to have an entire base, @1603_{POP}().

Draft of organization

1603_16(), http://purl.obolibrary.org/obo/BFO_0000029
- This would contain every place on Earth by some logical numerical code. No plan yet for other schemas (such as the ones from Facebook, or inferences using URIs such as geo:lat,long?u=123)
@1603_{POP}(), http://purl.obolibrary.org/obo/BFO_0000027
- Population, with a direct link to the exact 1603_16()
- The number of base categories for division is very low. Maybe only total population. This decision makes it very clear that @1603_{POP}() could be automated to generate others as needed
@1603_{SPOP}(), http://purl.obolibrary.org/obo/BFO_0000027
- Here would be pretty much everything that is still the population of some linked 1603_16(), but is a small subset of @1603_{POP}()
  - Likely even sex/gender and other statistics that can still be taken from a census would be here, not in @1603_{POP}()
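To show how small this skeleton is, here is a sketch that keeps the draft as a plain mapping from group to BFO class and prints it as Turtle. Two assumptions to flag: the `urn:example:` IRIs are placeholders (the real identifiers are not settled here), and expressing the link as `rdfs:subClassOf` is my guess at how the draft would be encoded.

```python
from typing import Dict

OBO = "http://purl.obolibrary.org/obo/"

# Group name from the draft -> BFO class it is anchored to.
GROUPS: Dict[str, str] = {
    "1603_16": "BFO_0000029",
    "1603_POP": "BFO_0000027",
    "1603_SPOP": "BFO_0000027",
}


def skeleton_turtle(groups: Dict[str, str]) -> str:
    """Emit one rdfs:subClassOf statement per group (assumed encoding)."""
    lines = [
        f"@prefix obo: <{OBO}> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "",
    ]
    for name, bfo_class in groups.items():
        # urn:example: is a placeholder namespace, not a real identifier.
        lines.append(f"<urn:example:{name}> rdfs:subClassOf obo:{bfo_class} .")
    return "\n".join(lines) + "\n"


skeleton = skeleton_turtle(GROUPS)
print(skeleton)
```

Keeping the list this short is the point: further distinctions are meant to come from properties, not from adding more top-level groups.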

Why this organization

At the moment, we are thinking in terms of ontology (as per BFO). The lower number of categories also makes it less likely that people would put data in other places.

Note that databases could have some sort of suffixes (likely to cope with more than 1000 columns), but the way it would work would force users to keep related things together.

Question: so HOW to categorize further?

By properties.

For example, @1603_{SPOP}() is, for now, explained by properties and qualifiers. It's similar, but not equal, to https://www.wikidata.org/wiki/Help:Qualifiers.

Medium-term impact on tooling / automation

There are other groups of information to organize (and @1603_{QLTY}() is a big one). But if these are things that most people would have no reason to move away, then, because of the nature of how BFO works, the end result could also allow automated testing or inferences even for data which is not strictly well documented: because it is organized in some way (which need not even be fully explained), this saves a lot of time.

It is likely to be easier to explain this by going further and doing the same with organizations (which would also need attributes to explain what they are), after qualities like @1603_{QLTY}().

fititnt added the librarium-formato librārium fōrmātō; /library format/@eng-Latn; Related to storage of entire referential data label May 14, 2022

fititnt added a commit that referenced this issue May 14, 2022

1603_3_12_16_5305.sh (#41): started

db0711d

fititnt added a commit that referenced this issue May 14, 2022

1603_3_12_16_5305.sh (#41): p5305_sparql_endpoint_jena_install(), p53…

f63829e

…05_sparql_endpoint_jena_fuseki_install()

fititnt added a commit that referenced this issue May 14, 2022

1603_3_12_16_5305.sh (#41): p5305_sparql_endpoint_jena_fuseki_start()

570f4c4

fititnt added a commit that referenced this issue May 14, 2022

1603_3_12_16_5305.sh (#41): system install helpers for jena and jena-…

09656b9

…fuseki (alternative to local download)

fititnt added a commit that referenced this issue May 15, 2022

1603_3_12_16_5305.sh (#41): p5305_sparql_endpoint_jena_fuseki_start_s…

fb8745c

…ystem()

This was referenced May 15, 2022

Automate SPARQL query generation to Wikidata by items with P #40

Closed

[praeparātiō ex automatīs] MVP of idea of automation to pre-process external data to LSF internal file format #42

Open

fititnt added a commit that referenced this issue May 18, 2022

999999999_10263485.py (#41): bugfixes on the Datasus CNES automata

48a91fb

fititnt mentioned this issue May 18, 2022

MVP of [1603.45.16] /"Ontologia"."United Nations"."P"/@eng-Latn #2

Open

fititnt changed the title ~~New data warehouse strategy: SPARQL graph database populated with dictionaries data (experimental feature)~~ New data warehouse strategy (graph): RDF/SPARQL graph database populated with dictionaries data (experimental feature) Jun 4, 2022

fititnt added a commit that referenced this issue Jun 4, 2022

rdf+bcp47+hxl (#41): partial refactoring

e3945af

fititnt added a commit that referenced this issue Jun 4, 2022

rdf+bcp47+hxl (#41): partial refactoring... 2

801b936

fititnt added a commit that referenced this issue Jun 5, 2022

rdf+bcp47+hxl (#41): partial refactoring... 3, HXL tags generation

49dcdac

fititnt added a commit that referenced this issue Jun 5, 2022

rdf+bcp47+hxl (#41): partial refactoring; started roundtrip tests

a646fe6

fititnt added a commit that referenced this issue Jun 5, 2022

rdf+bcp47+hxl (#41): partial refactoring; rountrip drill with expecte…

c503bc9

…d results

fititnt added a commit that referenced this issue Jun 5, 2022

rdf+bcp47+hxl (#41): partial refactoring; bcp47-to-hxl-to-rdf.hxl.tsv…

dc27774

… reference ajusted

fititnt added a commit that referenced this issue Jun 6, 2022

rdf+bcp47+hxl (#41): RDF_NAMESPACES_PREFIX; for list of hardcoded nam…

8baab27

…espaces, common prefixes which need rewritting (use case: revert from HXL attributes to CamelCase RDF namespaces) started

fititnt added a commit that referenced this issue Jun 6, 2022

rdf+bcp47+hxl (#41): OBO Ontology (from HXL hashtags) draft

7323103

fititnt added a commit that referenced this issue Jun 6, 2022

rdf+bcp47+hxl (#41): partial refactoring; changing abstract syntax tr…

928f35b

…ee to allow multiple information about subject of predicates with ||

fititnt added a commit that referenced this issue Jun 6, 2022

rdf+bcp47+hxl (#41): partial refactoring; bcp47-to-hxl-to-rdf.sh test…

edfa4a0

…s pass; AST not converted yet to || division

fititnt added a commit that referenced this issue Jun 6, 2022

rdf+bcp47+hxl (#41): tests for individual bcp47/hxl hashtags working …

8fd5296

…again; still need to refactor for grouped datasets

fititnt added a commit that referenced this issue Jun 6, 2022

rdf+bcp47+hxl (#41): current refactoring done; starting metadata insp…

39bbaa8

…ection features

fititnt added a commit that referenced this issue Jun 6, 2022

rdf+bcp47+hxl (#41): MDCII.owl boostrapper draft

0722863

fititnt added a commit that referenced this issue Jun 6, 2022

rdf+bcp47+hxl (#41): MDCIII.owl; maybe copy directly BFO? Hummm

b69f02b

fititnt added a commit that referenced this issue Jun 6, 2022

rdf+bcp47+hxl (#41): MDCIII.owl; added directly BFO 2020 (refs https:…

cc3fbbd

…//basic-formal-ontology.org/bfo-2020.html)

fititnt added a commit that referenced this issue Jun 7, 2022

rdf+bcp47+hxl (#41): draft BFO relations for COD-AB like data; Not 10…

4a21825

…0% sure yet; BFO_0000082, BFO_0000171 <-> BFO_0000124

fititnt added a commit that referenced this issue Jun 7, 2022

rdf+bcp47+hxl (#41): RDF+HXL early test cases (administrative boundar…

c2d7669

…ies #39)

fititnt added a commit that referenced this issue Jun 12, 2022

rdf+bcp47+hxl (#41), admin-l (#39), pcodes (#2): full drill 140 count…

731b579

…ries, local only (time: 30m28,338s); before RDF relations

fititnt added a commit that referenced this issue Jun 12, 2022

rdf+bcp47+hxl (#41), admin-l (#39), pcodes (#2): HXL_HASH_ET_ATTRIBUT…

a4a3508

…A_AD_RDF started

fititnt added a commit that referenced this issue Jun 12, 2022

rdf+bcp47+hxl (#41), admin-l (#39), pcodes (#2): not perfect but hard…

21a9f6a

…coded list will make it work for common cases at sort term

fititnt added a commit that referenced this issue Jun 12, 2022

rdf+bcp47+hxl (#41), admin-l (#39), pcodes (#2): CodAbTabulae(); work…

4e00998

…around for duplicated items

fititnt added a commit that referenced this issue Jun 12, 2022

rdf+bcp47+hxl (#41), admin-l (#39), pcodes (#2): added draft of 99999…

ff6d2e8

…9999_54872.py --objectivum-formato=_temp_hxl_meta_in_json

fititnt added a commit that referenced this issue Jun 13, 2022

rdf+bcp47+hxl (#41), admin-l (#39), pcodes (#2): bash bootstrap_1603_…

1877474

…45_16__item_rdf() draft

fititnt added a commit that referenced this issue Jun 13, 2022

rdf+bcp47+hxl (#41), admin-l (#39), pcodes (#2): full drill (user 53m…

c0c954c

…36,514s real 37m35,515s)

fititnt added a commit that referenced this issue Jun 14, 2022

rdf+bcp47+hxl (#41): owl_index() helper created

fb815b9

fititnt added a commit that referenced this issue Jun 14, 2022

rdf+bcp47+hxl (#41), skos (#38): --numerordinatio-cum-antecessoribus …

5f0d308

…draft

fititnt added a commit that referenced this issue Jun 14, 2022

rdf+bcp47+hxl (#41), skos (#38): numerordinatio_cum_antecessoribus() …

819a31e

…draft

fititnt added a commit that referenced this issue Jun 14, 2022

rdf+bcp47+hxl (#41), skos (#38): numerordinatio_cum_antecessoribus() …

c907333

…1603 as first level

fititnt added a commit that referenced this issue Jun 14, 2022

rdf+bcp47+hxl (#41), skos (#38): humm... organize things is hard

24d9f1c

fititnt added a commit that referenced this issue Jun 14, 2022

rdf+bcp47+hxl (#41), skos (#38): ok, at least not breaking the reasoner

3c0d91c

fititnt added a commit that referenced this issue Jun 19, 2022

(#41): urn:mdciii:1603:1:1603[] dedicated numerordinatio (disk repres…

4914b0d

…entation: 1603/1/1603/1603_1_1603.owl)

fititnt mentioned this issue Jun 30, 2022

New data warehouse strategy [tabular]: SQL database populated with dictionaries data (experimental feature) #37

Open

fititnt added a commit that referenced this issue Jul 20, 2022

999999999_54872.py (#41) --objectivum-formato=_temp_hxlstandard_vocab…

e486a77

…_ix started

fititnt added a commit that referenced this issue Jul 20, 2022

999999999_54872.py (#41) OntologiaVocabularioHXL class started

4b50ac1

fititnt added a commit that referenced this issue Jul 20, 2022

999999999_54872.py (#41) 999999/1603/1/1603/13/1603_1_1603_13.owl.ttl…

bb1e631

… draft

fititnt mentioned this issue Jul 26, 2022

New data frontend strategy [map]: lightweight data layers for every entry point with location component (focus on non-binary static files) #48

Open

fititnt added a commit that referenced this issue Aug 1, 2022

999999999/0/MDCIII.simulato.owl (#41): started

f7e25ab

fititnt added a commit that referenced this issue Aug 1, 2022

999999999/0/MDCIII.simulato.owl (#41): drafted ABL 0-6 (#45, #39), in…

eba1d78

…dividual humans (#44)

fititnt added a commit that referenced this issue Aug 1, 2022

999999999/0/MDCIII.simulato.owl (#41): drafted collective of humans (…

c041b60

…population) bu ABL 1-3 (#43)

fititnt added a commit that referenced this issue Aug 1, 2022

999999999/0/MDCIII.simulato.owl (#41): Organization by ABL 1-3; Autho…

e5523c0

…rity control

fititnt added a commit that referenced this issue Aug 2, 2022

999999999/0/MDCIII.simulato.owl (#41) 1603_{PROC}_N_0()

44384c2

fititnt added a commit that referenced this issue Aug 2, 2022

999999999/0/MDCIII.simulato.owl (#41) Qualities (ABL 0-2)

c45074f

fititnt added a commit that referenced this issue Aug 2, 2022

999999999/0/MDCIII.simulato.owl (#41) subclasses

79fbde7

fititnt added a commit that referenced this issue Aug 3, 2022

999999999/0/MDCIII.simulato.owl (#41) additional subclasses

929943e
