New data warehouse strategy (graph): RDF/SPARQL graph database populated with dictionaries data (experimental feature) #41
…05_sparql_endpoint_jena_fuseki_install()
…fuseki (alternative to local download)
We do already have a proof of concept of converting specially crafted BCP47 language tags (similar to the HXL we've been using, to the point there is a mapping between both), so this eventually will be equally or actually much, much more usable than the tabular alternative of #37.

@todo: assume BCP47 -r- extension actually mimics RDF-Star

At this moment, every -g- part is a pair. This is quite easy to split into parts by "-". However, sometimes we need to push even more information to know what to do, and this is starting to become the common case, not the exception. I think the best would be to assume the BCP47 tabular heading versions default to 3 parts instead of 2, and then, if the last part is not necessary, we can use a filler.

Current example (without mimicking RDF-star)

This example does not implement all the correct semantics. Also, the "bags" are not really different at all (the idea of having different groupings is actually relevant only when tables contain different data with different levels, like the COD-ABs, which have multiple concepts, but then the endpoints would break into different tables). Also, these converters can get pretty, pretty complicated. In fact, to avoid implementing full inference ourselves in Python, for every bag in a tabular format we export different triples, and then join them with something like Apache Jena (riot).
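As a rough illustration of the 2-part vs. 3-part splitting above (a sketch only; the function name, filler value, and payloads are hypothetical, not the project's real parser):

```python
# Hypothetical sketch of the proposed splitting: -g- parts default to
# groups of 3 instead of 2, padding with a filler when the last part
# is not necessary. Not the project's real parser.

def split_g_extension(payload: str, group_size: int = 3, filler: str = "0"):
    """Split an extension payload like 'aa-11-x-bb-22' into tuples of
    `group_size` parts, padding short trailing groups with `filler`."""
    parts = payload.split("-")
    groups = []
    for i in range(0, len(parts), group_size):
        group = parts[i:i + group_size]
        group.extend([filler] * (group_size - len(group)))
        groups.append(tuple(group))
    return groups

# Pairs (current behaviour) vs. triples (proposed default):
print(split_g_extension("aa-11-bb-22", group_size=2))  # [('aa', '11'), ('bb', '22')]
print(split_g_extension("aa-11-x-bb-22"))              # [('aa', '11', 'x'), ('bb', '22', '0')]
```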
…espaces, common prefixes which need rewriting (use case: revert from HXL attributes to CamelCase RDF namespaces) started
…ee to allow multiple information about subject of predicates with ||
…s pass; AST not converted yet to || division
…again; still need to refactor for grouped datasets
…nly); thanks Simone Torres de Souza! (refs https://mba.eci.ufmg.br/legal/bfo-pt/); need translations review (source updated): BFO_0000142, BFO_0000147, BFO_0000146; needs new translation BFO_0000202, BFO_0000203; definitions not added; properties still need translation
…0% sure yet; BFO_0000082, BFO_0000171 <-> BFO_0000124
…ries, local only (time: 30m28,338s); before RDF relations
…coded list will make it work for common cases in the short term
…9999_54872.py --objectivum-formato=_temp_hxl_meta_in_json
…entation: 1603/1/1603/1603_1_1603.owl)
Okay. We managed to automate how users can load the tabular data from CSVs into SQL databases (description here: #37 (comment)). We already export RDF triples in a more linguistics-focused version and others optimized for computational use. However, for cases where a user is planning to work with a massive amount of data, graphical interfaces such as Protege may not scale. This obviously needs more testing.

Why SQL storage might be relevant here

It turns out that it is feasible to use R2RML (https://www.w3.org/TR/r2rml/) or similar compatible tools such as Ontop (https://ontop-vkg.org/) to map relational data to RDF.
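An R2RML mapping itself is a Turtle file consumed by tools like Ontop, but as a rough Python analogue of what such a mapping automates, here is a sketch (database, table, and namespace names are all hypothetical) that turns SQL rows into triples with rdflib:

```python
# Rough analogue of what an R2RML mapping automates: each row of a SQL
# table becomes RDF triples. Database, table, and namespace names are
# hypothetical; real mappings would live in an R2RML Turtle file.
import sqlite3
from rdflib import Graph, Literal, Namespace, RDF, RDFS

EX = Namespace("urn:example:")  # placeholder namespace

conn = sqlite3.connect("dictionaries.sqlite")  # hypothetical SQLite file
graph = Graph()
for code, label in conn.execute("SELECT code, label FROM country"):
    subject = EX[f"country/{code}"]            # template: urn:example:country/{code}
    graph.add((subject, RDF.type, EX.Country))
    graph.add((subject, RDFS.label, Literal(label)))

print(graph.serialize(format="turtle"))
```

The real gain of R2RML/Ontop over a hand-rolled loop like this is that the mapping stays declarative, and with Ontop the triples can even be queried virtually, without materializing them.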
To allow such a feature, what would be necessary is generating files such as the R2RML mappings and making sure users have data that perfectly matches the configuration. The beauty here is we can keep user documentation at a bare minimum while allowing state-of-the-art uses, and (this is important) interfaces such as what is in Protege can be in the user's language. Most of the underlying details on how to eventually reach this point are very, very boring to explain even for advanced users who wouldn't be familiar with ontology engineering, which brings us to...

Basic Formal Ontology path as foundational ontology

In any case, for future readers: we're going the Basic Formal Ontology path as upper ontology, which is the most used already and by far the most referenced in the sciences. But at the same time the decision making is not trivial: it doesn't tolerate abstractions/vagueness such as "concept" or "agent"; however, the end result is culturally less likely to have divergences at all. The BFO is very, very realistic.

But what about references to Wikidata (maybe others not on OBO Foundry)?

This might change later, but at least for non-structural content: for example, we use Wikidata P297, the ISO 3166-1 alpha-2 code (https://www.wikidata.org/wiki/Special:EntityData/P297.ttl, https://www.wikidata.org/wiki/Property:P297), to state that something is what HXL would call a country code. However, the default choices to organize "the skeleton" would still be BFO, which means data integration would be less painful, as we really avoid patterns which would break reasoning. Again, the details about this would be boring, but in practice this means that what every introductory course on "how to use Protege" teaches about categorizing things (such as making classes instances of other classes) is what we cannot do in ontologies designed to be used in production by groups which would disagree with each other.
Naming things is so hard. Anyway, most of these groups will have some temporary number, starting with 99. This already allows drafting other tests. The "1603_9966" (…). While the idea is to be minimalist, at least things that are fully … (such as population statistics without further divisions, but allowing for year and place) need to have an entire base.

Draft of organization
Why this organization

At the moment, we're thinking in terms of ontology (as per BFO). The lower number of categories also makes it less likely that people would put data in the wrong places. Note that databases could have some sort of suffixes (likely to allow coping with over 1000 columns), but the way it would work would force the user to keep related things together.

Question: but then HOW to categorize further?

By properties. For example, … (see the sketch below).

Impact on medium term on tooling / automation

There are other groups of information to organize (and …). It is likely to be easier to explain this by going further and doing it also with organizations (which would also need attributes to explain what they are) after qualities like the …
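A minimal rdflib sketch of "categorize by properties": instead of minting one more class per grouping, the distinguishing fact is attached to the individual as a property. Only P297 (ISO 3166-1 alpha-2 code, mentioned earlier) is a real identifier; the rest is hypothetical.

```python
# Sketch of "categorize by properties": rather than creating a subclass
# such as ex:CountryWithAlpha2Code, we assert the distinguishing fact as
# a property. Only P297 (ISO 3166-1 alpha-2 code) is a real identifier;
# the namespace and subject are hypothetical.
from rdflib import Graph, Literal, Namespace

WDT = Namespace("http://www.wikidata.org/prop/direct/")
EX = Namespace("urn:example:")

g = Graph()
g.add((EX["place/BR"], WDT["P297"], Literal("BR")))
```

Further categorization then becomes a query ("everything that has a P297 value") instead of a hand-maintained class tree, which is what keeps the skeleton of classes small.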
The way we organize the dictionaries entry point has already been very regular for some time, and after #38, there's no reason not to start doing practical tests.
To-dos of this minimum viable product
Export a format easy to parse
While #38 could be used to import into some graph database, it's not as optimized for speed. So, it's better we export at least one format that is easier and more compact to parse than the alternatives intended to be edited by hand.
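For example (a sketch; file names are hypothetical, and it assumes the hand-editable export is Turtle), line-based N-Triples is both compact to stream and trivial to parse:

```python
# Sketch: convert a hand-editable Turtle export into line-based N-Triples,
# which bulk loaders can stream and split trivially. File names hypothetical.
from rdflib import Graph

g = Graph()
g.parse("1603.ttl", format="turtle")
g.serialize(destination="1603.nt", format="nt", encoding="utf-8")
```

At larger scale, Apache Jena's riot (already mentioned above) does the same conversion without loading everything through Python.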
Do actual tests on one or more graph databases
While in #37 the SQLite is quite useful for quick debugging, we would need at least one or two tests actually importing into some graph database.
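A sketch of one such test, assuming a local Jena Fuseki (as in the commits above) with a hypothetical dataset named 1603, via the SPARQL Graph Store Protocol:

```python
# Sketch: load an N-Triples export into a local Jena Fuseki instance via
# the SPARQL Graph Store Protocol. The dataset name '1603' is hypothetical.
import requests

with open("1603.nt", "rb") as fp:
    response = requests.post(
        "http://localhost:3030/1603/data?default",  # default graph of dataset
        data=fp,
        headers={"Content-Type": "application/n-triples"},
    )
response.raise_for_status()
```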
We also need to somewhat take into account ways to potentially allow validation/integrity tests of the entire library as soon as it is in a graph database. It would be easier to do it this way.
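For instance (a sketch; the specific check is just an example, and the endpoint matches the hypothetical Fuseki dataset above), an integrity test could be a SPARQL query asserting that no subject is left without an rdf:type:

```python
# Sketch: one possible whole-library integrity test, run against the
# Fuseki endpoint loaded above: count subjects without any rdf:type.
import requests

QUERY = """
SELECT (COUNT(DISTINCT ?s) AS ?untyped) WHERE {
  ?s ?p ?o .
  FILTER NOT EXISTS { ?s a ?type }
}
"""
response = requests.post(
    "http://localhost:3030/1603/sparql",
    data={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
)
untyped = int(response.json()["results"]["bindings"][0]["untyped"]["value"])
assert untyped == 0, f"{untyped} subjects have no rdf:type"
```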