Automate SPARQL query generation to Wikidata by items with P #40

Closed · fititnt opened this issue May 12, 2022 · 6 comments

fititnt commented May 12, 2022

One item from #39, P1585 https://www.wikidata.org/wiki/Property:P1585 //Dicionários de bases de dados espaciais do Brasil//@por-Latn ("dictionaries of Brazil's spatial databases"), is actually very well documented on Wikidata, so we would not need to fetch the Wikidata Q items one by one.

It's rare for something to be this perfect, but the idea here would be to add an option to ./999999999/0/1603_3_12.py to create the SPARQL query for us.

This will obviously need pagination. If with ~300 Wikidata Q items we already time out with over 250 languages on 1603_1_51 (for now using 5 batches), then with something around 5,700 items, well, this will be fun.
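
As a rough illustration of what such an option could do, here is a minimal sketch (not the actual 1603_3_12.py implementation; the function and parameter names are hypothetical) that templates the property lookup and slices a language list into pages to keep each query small:

# Minimal sketch only, not the real 1603_3_12.py code. Names are hypothetical;
# the language list would come from 1603_1_51 in practice.


def sparql_for_property(p_code, languages, divisions=18, page=1):
    """Build a SPARQL query for items holding property `p_code`, selecting
    rdfs:label only for one page (1-indexed) of the language list."""
    per_page = -(-len(languages) // divisions)  # ceiling division
    batch = languages[(page - 1) * per_page:page * per_page]

    label_vars = []
    label_blocks = []
    for lang in batch:
        var = "?label_" + lang.replace("-", "_")
        label_vars.append(var)
        label_blocks.append(
            f'  OPTIONAL {{ ?item rdfs:label {var} '
            f'FILTER(lang({var}) = "{lang}") . }}')

    return "\n".join([
        "SELECT ?item " + " ".join(label_vars) + " WHERE {",
        "  {",
        "    SELECT DISTINCT ?item WHERE {",
        f"      ?item p:{p_code} ?statement0 .",
        f"      ?statement0 ps:{p_code} _:anyValueP .",
        "    }",
        "  }",
        *label_blocks,
        "}",
    ])


print(sparql_for_property("P1585", ["pt", "en", "es", "ar"], divisions=2, page=1))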

fititnt commented May 12, 2022

Okay, this one is a generic query

SELECT DISTINCT ?item ?itemLabel WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
  {
    SELECT DISTINCT ?item WHERE {
      {
        ?item p:P1585 ?statement0.
        ?statement0 (ps:P1585) _:anyValueP1585.
        #FILTER(EXISTS { ?statement0 prov:wasDerivedFrom ?reference. })
      }
    }
  }
}

However, we already query the human languages (but we can work around it). Maybe this feature will be somewhat hardcoded, because implementing it at full potential would also mean reading 1603_1_7 and "understanding" what each P means.

fititnt commented May 12, 2022

Great. We managed to use a pre-processor to create the queries (--lingua-divisioni=18 and --lingua-paginae=1 are the paginators). I think the Brazilian cities may be one of those codices with over 200 languages.

Current working example

fititnt@bravo:/workspace/git/EticaAI/multilingual-lexicography-automation/officinam$ printf "P1585\n" | ./999999999/0/1603_3_12.py --actionem-sparql --de=P --query --lingua-divisioni=18 --lingua-paginae=1

SELECT (STRAFTER(STR(?item), "entity/") AS ?item__conceptum__codicem) ?item__rem__i_ara__is_arab ?item__rem__i_hye__is_armn ?item__rem__i_ben__is_beng ?item__rem__i_rus__is_cyrl ?item__rem__i_hin__is_deva ?item__rem__i_amh__is_ethi ?item__rem__i_kat__is_geor ?item__rem__i_grc__is_grek ?item__rem__i_guj__is_gujr ?item__rem__i_pan__is_guru ?item__rem__i_kan__is_knda ?item__rem__i_kor__is_hang ?item__rem__i_lzh__is_hant ?item__rem__i_heb__is_hebr ?item__rem__i_khm__is_khmr WHERE {
  {
    SELECT DISTINCT ?item WHERE {
      ?item p:P1585  ?statement0.
      ?statement0 (ps:P1585 ) _:anyValueP1585 .
    }
  }
  OPTIONAL { ?item rdfs:label ?item__rem__i_ara__is_arab filter (lang(?item__rem__i_ara__is_arab) = "ar"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_hye__is_armn filter (lang(?item__rem__i_hye__is_armn) = "hy"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_ben__is_beng filter (lang(?item__rem__i_ben__is_beng) = "bn"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_rus__is_cyrl filter (lang(?item__rem__i_rus__is_cyrl) = "ru"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_hin__is_deva filter (lang(?item__rem__i_hin__is_deva) = "hi"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_amh__is_ethi filter (lang(?item__rem__i_amh__is_ethi) = "am"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_kat__is_geor filter (lang(?item__rem__i_kat__is_geor) = "ka"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_grc__is_grek filter (lang(?item__rem__i_grc__is_grek) = "grc"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_guj__is_gujr filter (lang(?item__rem__i_guj__is_gujr) = "gu"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_pan__is_guru filter (lang(?item__rem__i_pan__is_guru) = "pa"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_kan__is_knda filter (lang(?item__rem__i_kan__is_knda) = "kn"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_kor__is_hang filter (lang(?item__rem__i_kor__is_hang) = "ko"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_lzh__is_hant filter (lang(?item__rem__i_lzh__is_hant) = "lzh"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_heb__is_hebr filter (lang(?item__rem__i_heb__is_hebr) = "he"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_khm__is_khmr filter (lang(?item__rem__i_khm__is_khmr) = "km"). }
  bind(xsd:integer(strafter(str(?item), 'Q')) as ?id_numeric) .
}
ORDER BY ASC (?id_numeric)

Screenshot 2022-05-12 00-37-02

Potentially annoying issue

Queries with this many items vary a lot in runtime. Even with pagination, they sometimes go over 40 seconds (dropping to around 5 seconds when cached). So I think we will definitely need some rudimentary way in the bash functions to check whether a query timed out and adjust the timings before trying again.
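
A minimal sketch of that retry idea (the real helper is planned as a bash function; the endpoint is the public Wikidata Query Service, and the timings and names here are only illustrative):

# Sketch only: retry with a growing timeout, roughly what the bash helper
# is meant to do. Timings, names and the User-Agent string are illustrative.
import time
import urllib.parse
import urllib.request

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def run_sparql_with_retry(query, attempts=3, base_timeout=45):
    """POST a query to the endpoint, enlarging the timeout and sleeping
    between attempts when the previous one failed or timed out."""
    data = urllib.parse.urlencode({"query": query, "format": "json"}).encode()
    for attempt in range(1, attempts + 1):
        request = urllib.request.Request(
            WIKIDATA_ENDPOINT, data=data,
            headers={"User-Agent": "1603-officinam-sketch/0.1"})
        try:
            with urllib.request.urlopen(request, timeout=base_timeout * attempt) as response:
                return response.read()
        except Exception:  # timeout, HTTP 429/500, ...
            if attempt == attempts:
                raise
            time.sleep(10 * attempt)  # wait longer before each new try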

Still to do

However, we may need to create more than one query, because this strategy (merging with other datasets) would require already knowing upfront which Wikidata Q is linked to which IBGE code.

fititnt added a commit that referenced this issue May 12, 2022
fititnt added a commit that referenced this issue May 12, 2022
fititnt added a commit that referenced this issue May 12, 2022
fititnt commented May 12, 2022

Current working example

Query that builds only the interlingual codes (which can be used as a key to merge the translations)

fititnt@bravo:/workspace/git/EticaAI/multilingual-lexicography-automation/officinam$ printf "P1585\n" | ./999999999/0/1603_3_12.py --actionem-sparql --de=P --query --ex-interlinguis

SELECT (?wikidata_p_value AS ?item__conceptum__codicem) (STRAFTER(STR(?item), "entity/") AS ?item__rem__i_qcc__is_zxxx__ix_wikiq) WHERE {
  {
    SELECT DISTINCT ?item WHERE {
      ?item p:P1585  ?statement0.
      ?statement0 (ps:P1585 ) _:anyValueP1585 .
    }
  }
  ?item wdt:P1585  ?wikidata_p_value . 
}
ORDER BY ASC (?wikidata_p_value)

Image of the generated CSV (proof of concept; 3 of 20 parts merged; still need to automate the rest and fix the column order)

Screenshot 2022-05-12 12-59-38
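
A minimal sketch of that merge step, assuming each language page was saved as its own CSV sharing the item__conceptum__codicem key column (the glob pattern and output path below are made up for illustration):

# Sketch only: outer-merge per-page CSVs on a shared key column. The file
# paths are hypothetical; only the key column name comes from the query above.
import csv
import glob


def merge_language_pages(pattern, key="item__conceptum__codicem"):
    """Merge every CSV matching `pattern` on the shared key column."""
    merged = {}
    columns = [key]
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            for name in reader.fieldnames or []:
                if name not in columns:
                    columns.append(name)
            for row in reader:
                merged.setdefault(row[key], {}).update(row)
    return columns, list(merged.values())


columns, rows = merge_language_pages("999999/temporary/P1585_pagina_*.csv")
with open("999999/temporary/P1585_merged.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.DictWriter(out, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)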

temp directory

We have not even merged the remaining 17 of the 20 language pages and the file size is already 1.6 MB. No idea how big this will be with all languages.

Screenshot 2022-05-12 13-03-43

fititnt added a commit that referenced this issue May 12, 2022
…s --cum-interlinguis=P1,P2... (MVP of generation of query adding more attributes)
fititnt added a commit that referenced this issue May 12, 2022
…s --cum-interlinguis=P1,P2... (bugfix; duplicated related atributes now concatenate with |)
fititnt added a commit that referenced this issue May 12, 2022
…ata_p_ex_linguis, (draft) wikidata_p_ex_totalibus
fititnt commented May 12, 2022

Performance issues

Hmm... now the issue is doing heavy optimization on the queries to mitigate the timeouts. The ones with over 5,000 concepts are the problem here, even when splitting the 1603_1_51 languages into 20 parts.

Maybe one strategy would be to allow removing the ORDER BY from the queries that deal with languages and do it client-side (i.e. sort with bash / HXL CLI tools).
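
A sketch of that client-side ordering, assuming a merged CSV whose item__conceptum__codicem column holds the Wikidata Q ids (the plan above mentions bash / HXL CLI tools; this only expresses the same numeric sort in Python):

# Sketch only: reproduce the query's ORDER BY ASC(?id_numeric) on the client,
# so the SPARQL itself no longer needs to sort. Paths are hypothetical.
import csv


def sort_by_numeric_qid(in_path, out_path, key="item__conceptum__codicem"):
    """Sort a CSV by the numeric part of its Wikidata Q id column."""
    with open(in_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames
        rows = sorted(reader, key=lambda row: int(row[key].lstrip("Q")))
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)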

fititnt added a commit that referenced this issue May 13, 2022
fititnt added a commit that referenced this issue May 13, 2022
… ordering on linguistic query to mitigate timeouts
fititnt added a commit that referenced this issue May 14, 2022
fititnt added a commit that referenced this issue May 15, 2022
fititnt commented May 15, 2022

bash wikidata_p_ex_totalibus()

The bash helper, while it still needs more testing, already somewhat deals with retrying. For something such as P1585 it now uses 1 + 20 queries.

However, later we obviously should get the data from primary sources (in the case of IBGE, I think https://servicodados.ibge.gov.br/api/docs/localidades does it) and use that as the primary reference, potentially validating the information from Wikidata.
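
A sketch of that cross-check, assuming the IBGE localidades API serves the municipality list at /api/v1/localidades/municipios as JSON objects with "id" and "nome" fields (an assumption to confirm against the linked docs):

# Sketch only: the endpoint path and JSON field names are assumptions taken
# from the IBGE "localidades" API documentation linked above.
import json
import urllib.request

IBGE_MUNICIPIOS = "https://servicodados.ibge.gov.br/api/v1/localidades/municipios"


def ibge_municipality_codes():
    """Return the set of official IBGE municipality codes as strings."""
    with urllib.request.urlopen(IBGE_MUNICIPIOS, timeout=60) as response:
        return {str(item["id"]) for item in json.load(response)}


def compare_with_wikidata(wikidata_codes):
    """Report codes that diverge between the primary source and Wikidata."""
    official = ibge_municipality_codes()
    print("On IBGE but missing on Wikidata:", sorted(official - set(wikidata_codes)))
    print("On Wikidata but unknown to IBGE:", sorted(set(wikidata_codes) - official))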

Generalizations

It turns out that we can start bootstrapping other tables (the ones already perfect on Wikidata) the same way as done with the IBGE municipalities, including translations in several languages!

However, the same ideal approach (rely on primary sources, then augment with Wikidata) would somewhat apply here too. Sometimes this may not be really relevant: for example, for something that is not strictly a place (like P6555 //identificador de Unidade Eleitoral brasileira//@por-Latn, the Brazilian electoral unit identifier), it would not really make sense for the end user to just print the municipalities.

Also, eventually we will need to think of this somewhat as an ontology, otherwise #41 would not be as efficient for general users.

Print screens

Screenshot 2022-05-15 15-46-41

Screenshot 2022-05-15 15-51-32

fititnt commented Jul 22, 2022

Already implemented and used in practice. Closing for now.

fititnt closed this as completed Jul 22, 2022