MVP of RDF/Turtle canonization/file formatting for generated dictionaries #46

fititnt · 2022-07-22T01:18:43Z

As we prepare more data to be shared, RDF triples may be easier to compare than the same data in 250+ column CSVs (where one change updates the entire row). However, tools vary in how they handle white space, line breaks, and so on, so we need to think about this to reduce noise in the diffs.

The https://json-ld.github.io/rdf-dataset-canonicalization/spec/ draft and several of the works and papers it mentions discuss this in depth. Some of them go as far as using digital signatures to assert that one RDF triple, or a group of them, really comes from a given source. That is out of scope for now, not only because of the lack of tooling, but because we really need to fix the file diffs first.

The MVP

The idea here (before we start generating very, very large RDF files, which will naturally evolve over time) is to create a tool, or documentation, that sets conventions for the Turtle output so that every generated file follows them.

Eventually this could be improved, but if we do not do this now, the repositories which receive updates will grow in size because of something that could have had a simple solution sooner.

fititnt · 2022-07-22T04:13:44Z

Changes were made here at EticaAI/lexicographi-sine-finibus (which is used by EticaAI/MDCIII-boostrapper, the automated agent).

And, oh my. This commit is triggering a massive update across all repos at @MDCIII!

Around this time, the batch of 1/10 each hour starts. Maybe eventually it could make sense to self-test the tools on how the files are formatted (files which would pass the validation tests, but would still show up as changes on GitHub).
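A minimal sketch of what such a self-test could look like: a file counts as "stable" when running the formatter over it changes nothing, so a commit would produce no diff. Both `is_format_stable` and `toy_formatter` are hypothetical names made up for this illustration; the toy formatter only tidies white space and stands in for whatever real Turtle formatter ends up being used.

```python
from typing import Callable


def is_format_stable(text: str, formatter: Callable[[str], str]) -> bool:
    """Return True when formatting `text` again would not change it."""
    return formatter(text) == text


def toy_formatter(text: str) -> str:
    """Hypothetical stand-in for a real Turtle formatter.

    It only strips trailing white space and leaves exactly one final newline.
    """
    lines = [line.rstrip() for line in text.splitlines()]
    return "\n".join(lines).strip("\n") + "\n"


# Same triple, but with trailing spaces and an extra blank line.
messy = "<urn:a> <urn:b> <urn:c> .  \n\n"
clean = toy_formatter(messy)

print(is_format_stable(messy, toy_formatter))
print(is_format_stable(clean, toy_formatter))
```

A check like this could run before the agent commits, so files that are valid but not yet in the agreed format are caught early.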

About the changes

I am still looking into how the "longturtle" format proposed by rdflib actually does its formatting (the main reference is this email thread: https://groups.google.com/g/rdflib-dev/c/EUW2fawv4mw). But at the moment, the changes seem reasonable. It is slower than rapper (sudo apt install raptor2-utils), but besides being written in Python (not a compiled language), rdfpipe --input-format=turtle --output-format=longturtle input.ttl > output.ttl is doing more changes.
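For reference, a small sketch of how the rdfpipe command quoted above could be called from a Python script. The function names are hypothetical and this is not necessarily how rdf_ttl_canonization.py does it; it assumes rdfpipe (installed with rdflib) is on the PATH and uses only the flags already shown in the command.

```python
import subprocess
from pathlib import Path
from typing import List


def build_rdfpipe_command(input_path: str) -> List[str]:
    """Build the same rdfpipe call quoted in the comment above."""
    return [
        "rdfpipe",
        "--input-format=turtle",
        "--output-format=longturtle",
        input_path,
    ]


def format_turtle_file(input_path: str, output_path: str) -> None:
    """Run rdfpipe and write its stdout to `output_path`.

    Assumption: rdfpipe is installed and available on the PATH.
    """
    result = subprocess.run(
        build_rdfpipe_command(input_path),
        check=True,           # raise if rdfpipe exits with an error
        capture_output=True,  # collect stdout instead of printing it
        text=True,
    )
    Path(output_path).write_text(result.stdout, encoding="utf-8")


# Example call (not executed here):
# format_turtle_file("input.ttl", "output.ttl")
```

Since the output is fully read before anything is written, this variant could also write back to the same file, which the shell redirect in the original command cannot safely do.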

Some points:

- The 4-space indentation is more obvious at first.
- It "closes" the entire rule with a ".".
  - The way this is done makes Turtle seem friendlier to how programming languages work than most documentation of RDF triples does (not an issue, just a comment); I was not aware this approach was possible.
- It enforces subject ordering (aka our URNs).
  - I am still looking at how this works with the way we differentiate universals (classes, collections, references to physical links), where we use () to set them apart from what would be the individuals. But for now, since we divide using ':', this seems to make the universals appear neatly before their individuals.
    - However, this is still not how, for example, Protege would do it (which would require somewhat more manual/arbitrary decisions), but the way this works seems to be predictable.
- Major point: when items with a language tag (e.g. skos:prefLabel "my term"@eng-Latn) have more than one value, each value goes on its own line, and they are sorted by language tag! (e.g. eng-Latn vs arb-Arab)
  - I find this default really nice. We really use a lot of linguistic content (some go over 250) and this makes them predictable. I tried another alternative which sorted by the value of the string rather than by its language, so "a term"@zzz-Latn vs "the term"@yyy-Latn would make too much noise in other implementations. But the default of rdfpipe --output-format=longturtle on rdflib 6.1.1 deals nicely with this.
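To illustrate the last point, here is a toy comparison of the two orderings using plain Python tuples. It is not rdflib's actual implementation; the helper names and the "mustalah" placeholder label are made up for the example. It only shows why sorting by language tag keeps line order stable when the text of a label is edited.

```python
from typing import List, Tuple

Label = Tuple[str, str]  # (text, language tag)

labels: List[Label] = [
    ("the term", "yyy-Latn"),
    ("a term", "zzz-Latn"),
    ("my term", "eng-Latn"),
    ("mustalah", "arb-Arab"),  # placeholder text, for illustration only
]


def sort_by_language(pairs: List[Label]) -> List[Label]:
    """Order by language tag first, then by text."""
    return sorted(pairs, key=lambda pair: (pair[1], pair[0]))


def sort_by_value(pairs: List[Label]) -> List[Label]:
    """Order by text first, then by language tag."""
    return sorted(pairs, key=lambda pair: (pair[0], pair[1]))


def languages(pairs: List[Label]) -> List[str]:
    """Return only the language tags, in their current order."""
    return [lang for _, lang in pairs]


# Simulate an edit: the zzz-Latn label text changes.
edited = [("zebra term", lang) if lang == "zzz-Latn" else (text, lang)
          for text, lang in labels]

# Sorted by language, the line order is identical before and after the edit.
print(languages(sort_by_language(labels)) == languages(sort_by_language(edited)))
# Sorted by value, the edited label moves, so the diff touches extra lines.
print(languages(sort_by_value(labels)) == languages(sort_by_value(edited)))
```

With language ordering, a diff for such an edit shows one changed line; with value ordering, the same edit also shows up as a moved line.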

fititnt added a commit that referenced this issue Jul 22, 2022

rdf_ttl_canonization.py (#46): started

ba82d1a

fititnt added a commit to EticaAI/MDCIII-boostrapper that referenced this issue Jul 22, 2022

RDF/Turtle (more) canonical file format (refs EticaAI/lexicographi-si…

66b4c14

…ne-finibus#46)

fititnt added a commit that referenced this issue Jul 22, 2022

rdf_ttl_canonization.py (#46): rdflib, using rdfpipe cli for formating

09657d0

fititnt added a commit that referenced this issue Jul 22, 2022

(#46): prefer PREFIX wdata: over p: for datasets

b1a0a6a

VladimirAlexiev mentioned this issue Aug 11, 2024

add more sorting options in longturtle serializer RDFLib/rdflib#2880

Open
