MVP of read access to Wikidata #3

Closed
fititnt opened this issue Jan 10, 2022 · 2 comments
Labels
reconciliatio-erga-verba reconciliātiō ergā verba; Lit. /reconciliation with respect to words/@eng-Latn; Term reconciliation

fititnt commented Jan 10, 2022

This issue is about a Minimal Viable Product with read-only access to Wikidata. One of the main advantages is that its content is already in the public domain, so this would allow generating external datasets for some vocabularies even when the original copyright holders would still need a long formal process to allow any type of re-publishable license.


Trivia: Wikidata actually allows extraction of label translations from Wikipedia's related terms, and these labels are explicitly public domain. This makes any care taken to keep very consistent mappings between our codes and Wikidata Q codes very relevant.

fititnt added a commit that referenced this issue Jan 21, 2022
…ed yet) query generation to extract translations

fititnt commented Jan 22, 2022

Fantastic!

We're using the curated table from #9 to know the language mappings. Wikipedia has over 300 languages (some obviously have more content than others), but given the way we use the data, it is not viable to skip building such intermediate tables.
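A minimal sketch of what such an intermediate table could look like, assuming a plain in-memory list. The names `LANGUAGE_TABLE`, `column_name` and `wikidata_tag_by_column` are hypothetical and not taken from the project's code. The rows are copied from the query output later in this comment, whose column names appear to combine an ISO 639-3 language code with an ISO 15924 script code.

```python
# Hypothetical excerpt of an intermediate table. Each row pairs a
# (language code, script code) with the language tag Wikidata uses on labels.
# The values are copied from the generated query in this thread.
LANGUAGE_TABLE = [
    ("ara", "arab", "ar"),
    ("ben", "beng", "bn"),
    ("lat", "latn", "la"),
    ("rus", "cyrl", "ru"),
    ("por", "latn", "pt"),
    ("eng", "latn", "en"),
]


def column_name(language, script):
    """Build a column name in the style seen in the generated query."""
    return f"item__rem__i_{language}__is_{script}"


def wikidata_tag_by_column(table):
    """Map each output column name to the Wikidata language tag it needs."""
    return {column_name(lang, script): tag for lang, script, tag in table}


mapping = wikidata_tag_by_column(LANGUAGE_TABLE)
```

With a lookup like this, adding a language would mean adding one row to the table rather than editing the query by hand.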

The current version uses only some hardcoded Q codes (so not all the ones compiled previously from https://docs.google.com/spreadsheets/d/1ih3ouvx_n8W5ntNcYBqoyZ2NRMdaA0LRg5F9mGriZm4/edit#gid=1894917893), but it already gives an idea:

fititnt@bravo:/workspace/git/EticaAI/multilingual-lexicography-automation/officinam$ ./999999999/0/1603_3_12.py --actionem-sparql

SELECT ?item ?item__rem__i_ara__is_arab ?item__rem__i_ben__is_beng ?item__rem__i_grc__is_grek ?item__rem__i_lat__is_latn ?item__rem__i_rus__is_cyrl ?item__rem__i_san__is_zzzz ?item__rem__i_por__is_latn ?item__rem__i_eng__is_latn ?item__rem__i_fra__is_latn ?item__rem__i_nld__is_latn ?item__rem__i_deu__is_latn ?item__rem__i_spa__is_latn ?item__rem__i_ita__is_latn ?item__rem__i_gle__is_latn
WHERE
{
  VALUES ?item { wd:Q1065 wd:Q82151 wd:Q125761 wd:Q7809 wd:Q386120 wd:Q61923 wd:Q7164 }
  OPTIONAL { ?item rdfs:label ?item__rem__i_ara__is_arab filter (lang(?item__rem__i_ara__is_arab) = "ar"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_ben__is_beng filter (lang(?item__rem__i_ben__is_beng) = "bn"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_grc__is_grek filter (lang(?item__rem__i_grc__is_grek) = "grc"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_lat__is_latn filter (lang(?item__rem__i_lat__is_latn) = "la"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_rus__is_cyrl filter (lang(?item__rem__i_rus__is_cyrl) = "ru"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_san__is_zzzz filter (lang(?item__rem__i_san__is_zzzz) = "sa"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_por__is_latn filter (lang(?item__rem__i_por__is_latn) = "pt"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_eng__is_latn filter (lang(?item__rem__i_eng__is_latn) = "en"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_fra__is_latn filter (lang(?item__rem__i_fra__is_latn) = "fr"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_nld__is_latn filter (lang(?item__rem__i_nld__is_latn) = "nl"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_deu__is_latn filter (lang(?item__rem__i_deu__is_latn) = "de"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_spa__is_latn filter (lang(?item__rem__i_spa__is_latn) = "es"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_ita__is_latn filter (lang(?item__rem__i_ita__is_latn) = "it"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_gle__is_latn filter (lang(?item__rem__i_gle__is_latn) = "ga"). }
}
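The shape of the query above is regular enough to be generated from two inputs: a list of Q codes and a list of (column name, language tag) pairs. The following is only a sketch of that idea, not the actual code of `1603_3_12.py`; the function name `build_label_query` and its parameters are assumptions made for illustration.

```python
def build_label_query(q_codes, columns):
    """Build a SPARQL query with one OPTIONAL label clause per column.

    q_codes: Wikidata item ids without prefix, such as "Q1065".
    columns: (variable name, Wikidata language tag) pairs.
    """
    variables = " ".join(f"?{name}" for name, _ in columns)
    values = " ".join(f"wd:{q_code}" for q_code in q_codes)
    lines = [
        f"SELECT ?item {variables}",
        "WHERE",
        "{",
        f"  VALUES ?item {{ {values} }}",
    ]
    for name, tag in columns:
        # OPTIONAL keeps the item in the result even when a label is missing.
        lines.append(
            f"  OPTIONAL {{ ?item rdfs:label ?{name} "
            f'filter (lang(?{name}) = "{tag}"). }}'
        )
    lines.append("}")
    return "\n".join(lines)


example_query = build_label_query(
    ["Q1065", "Q82151"],
    [("item__rem__i_ara__is_arab", "ar"), ("item__rem__i_lat__is_latn", "la")],
)
print(example_query)
```

The printed text can be pasted into the Wikidata Query Service, where the `wd:` and `rdfs:` prefixes are predefined.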


fititnt added a commit that referenced this issue Jan 22, 2022
…enerate raw SPARQL query and generate CSV/TSV directly
fititnt added a commit that referenced this issue Jan 22, 2022
…it more flexible; still need solve cases where multiple columns have Q codes
@fititnt fititnt added the reconciliatio-erga-verba reconciliātiō ergā verba; Lit. /reconciliation with respect to words/@eng-Latn; Term reconciliation label Feb 4, 2022

fititnt commented Feb 4, 2022

For a Minimal Viable Product, read access to Wikidata is already working nicely. This issue can be closed.
