Automate SPARQL query generation to Wikidata by items with P #40

Closed · fititnt opened this issue May 12, 2022 · 6 comments

fititnt commented May 12, 2022

One item from #39, P1585 https://www.wikidata.org/wiki/Property:P1585 //Dicionários de bases de dados espaciais do Brasil//@por-Latn ("dictionaries of Brazil's spatial databases"), is actually very well documented on Wikidata, so we would not need to fetch the Wikidata Q items one by one.

It's rare for something to be this perfect, but the idea here would be to add an option to ./999999999/0/1603_3_12.py to create the SPARQL query for us.

This will obviously need pagination. If with ~300 Wikidata Q items we already time out with over 250 languages on 1603_1_51 (for now using 5 batches), then with something around 5,700 items, well, this will be fun.
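
As a rough illustration of what such an option could do, here is a minimal sketch (not the actual 1603_3_12.py implementation; the function and parameter names are hypothetical) that templates the property lookup and slices a language list into pages to keep each query small:

# Minimal sketch only, not the real 1603_3_12.py code. Names are hypothetical;
# the language list would come from 1603_1_51 in practice.


def sparql_for_property(p_code, languages, divisions=18, page=1):
    """Build a SPARQL query for items holding property `p_code`, selecting
    rdfs:label only for one page (1-indexed) of the language list."""
    per_page = -(-len(languages) // divisions)  # ceiling division
    batch = languages[(page - 1) * per_page:page * per_page]

    label_vars = []
    label_blocks = []
    for lang in batch:
        var = "?label_" + lang.replace("-", "_")
        label_vars.append(var)
        label_blocks.append(
            f'  OPTIONAL {{ ?item rdfs:label {var} '
            f'FILTER(lang({var}) = "{lang}") . }}')

    return "\n".join([
        "SELECT ?item " + " ".join(label_vars) + " WHERE {",
        "  {",
        "    SELECT DISTINCT ?item WHERE {",
        f"      ?item p:{p_code} ?statement0 .",
        f"      ?statement0 ps:{p_code} _:anyValueP .",
        "    }",
        "  }",
        *label_blocks,
        "}",
    ])


print(sparql_for_property("P1585", ["pt", "en", "es", "ar"], divisions=2, page=1))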

fititnt commented May 12, 2022

Okay, this one is a generic query

SELECT DISTINCT ?item ?itemLabel WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
  {
    SELECT DISTINCT ?item WHERE {
      {
        ?item p:P1585 ?statement0.
        ?statement0 (ps:P1585) _:anyValueP1585.
        #FILTER(EXISTS { ?statement0 prov:wasDerivedFrom ?reference. })
      }
    }
  }
}

However, we already query the human languages (but we can work around it). Maybe this feature will be somewhat hardcoded, because implementing it at full potential would also mean reading 1603_1_7 and "understanding" what each P means.

fititnt commented May 12, 2022

Great. We managed to use a pre-processor to create the queries (--lingua-divisioni=18 and --lingua-paginae=1 are the paginators). I think the Brazilian cities may be one of those codices with over 200 languages.

Current working example

fititnt@bravo:/workspace/git/EticaAI/multilingual-lexicography-automation/officinam$ printf "P1585\n" | ./999999999/0/1603_3_12.py --actionem-sparql --de=P --query --lingua-divisioni=18 --lingua-paginae=1

SELECT (STRAFTER(STR(?item), "entity/") AS ?item__conceptum__codicem) ?item__rem__i_ara__is_arab ?item__rem__i_hye__is_armn ?item__rem__i_ben__is_beng ?item__rem__i_rus__is_cyrl ?item__rem__i_hin__is_deva ?item__rem__i_amh__is_ethi ?item__rem__i_kat__is_geor ?item__rem__i_grc__is_grek ?item__rem__i_guj__is_gujr ?item__rem__i_pan__is_guru ?item__rem__i_kan__is_knda ?item__rem__i_kor__is_hang ?item__rem__i_lzh__is_hant ?item__rem__i_heb__is_hebr ?item__rem__i_khm__is_khmr WHERE {
  {
    SELECT DISTINCT ?item WHERE {
      ?item p:P1585  ?statement0.
      ?statement0 (ps:P1585 ) _:anyValueP1585 .
    }
  }
  OPTIONAL { ?item rdfs:label ?item__rem__i_ara__is_arab filter (lang(?item__rem__i_ara__is_arab) = "ar"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_hye__is_armn filter (lang(?item__rem__i_hye__is_armn) = "hy"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_ben__is_beng filter (lang(?item__rem__i_ben__is_beng) = "bn"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_rus__is_cyrl filter (lang(?item__rem__i_rus__is_cyrl) = "ru"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_hin__is_deva filter (lang(?item__rem__i_hin__is_deva) = "hi"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_amh__is_ethi filter (lang(?item__rem__i_amh__is_ethi) = "am"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_kat__is_geor filter (lang(?item__rem__i_kat__is_geor) = "ka"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_grc__is_grek filter (lang(?item__rem__i_grc__is_grek) = "grc"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_guj__is_gujr filter (lang(?item__rem__i_guj__is_gujr) = "gu"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_pan__is_guru filter (lang(?item__rem__i_pan__is_guru) = "pa"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_kan__is_knda filter (lang(?item__rem__i_kan__is_knda) = "kn"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_kor__is_hang filter (lang(?item__rem__i_kor__is_hang) = "ko"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_lzh__is_hant filter (lang(?item__rem__i_lzh__is_hant) = "lzh"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_heb__is_hebr filter (lang(?item__rem__i_heb__is_hebr) = "he"). }
  OPTIONAL { ?item rdfs:label ?item__rem__i_khm__is_khmr filter (lang(?item__rem__i_khm__is_khmr) = "km"). }
  bind(xsd:integer(strafter(str(?item), 'Q')) as ?id_numeric) .
}
ORDER BY ASC (?id_numeric)

Screenshot 2022-05-12 00-37-02

Potentially annoying issue

Queries with this many items vary a lot in runtime. Even with pagination, they sometimes go over 40 seconds (dropping to around 5 seconds when cached). So I think we will definitely need some rudimentary way in the bash functions to check whether a query timed out and adjust the timings before trying again.
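
A minimal sketch of that retry idea (the real helper is planned as a bash function; the endpoint is the public Wikidata Query Service, and the timings and names here are only illustrative):

# Sketch only: retry with a growing timeout, roughly what the bash helper
# is meant to do. Timings, names and the User-Agent string are illustrative.
import time
import urllib.parse
import urllib.request

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def run_sparql_with_retry(query, attempts=3, base_timeout=45):
    """POST a query to the endpoint, enlarging the timeout and sleeping
    between attempts when the previous one failed or timed out."""
    data = urllib.parse.urlencode({"query": query, "format": "json"}).encode()
    for attempt in range(1, attempts + 1):
        request = urllib.request.Request(
            WIKIDATA_ENDPOINT, data=data,
            headers={"User-Agent": "1603-officinam-sketch/0.1"})
        try:
            with urllib.request.urlopen(request, timeout=base_timeout * attempt) as response:
                return response.read()
        except Exception:  # timeout, HTTP 429/500, ...
            if attempt == attempts:
                raise
            time.sleep(10 * attempt)  # wait longer before each new try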

Still to do

However, we may need to create more than one query, because this strategy (merging with other datasets) would require already knowing upfront which Wikidata Q is linked to which IBGE code.

fititnt added a commit that referenced this issue May 12, 2022
fititnt added a commit that referenced this issue May 12, 2022
fititnt added a commit that referenced this issue May 12, 2022
fititnt commented May 12, 2022

Current working example

Query that builds only the interlingual codes (which can be used as a key to merge the translations)

fititnt@bravo:/workspace/git/EticaAI/multilingual-lexicography-automation/officinam$ printf "P1585\n" | ./999999999/0/1603_3_12.py --actionem-sparql --de=P --query --ex-interlinguis

SELECT (?wikidata_p_value AS ?item__conceptum__codicem) (STRAFTER(STR(?item), "entity/") AS ?item__rem__i_qcc__is_zxxx__ix_wikiq) WHERE {
  {
    SELECT DISTINCT ?item WHERE {
      ?item p:P1585  ?statement0.
      ?statement0 (ps:P1585 ) _:anyValueP1585 .
    }
  }
  ?item wdt:P1585  ?wikidata_p_value . 
}
ORDER BY ASC (?wikidata_p_value)

Image of the generated CSV (proof of concept; 3 of 20 parts merged; still need to automate the rest and fix the column order)

Screenshot 2022-05-12 12-59-38
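
A minimal sketch of that merge step, assuming each language page was saved as its own CSV sharing the item__conceptum__codicem key column (the glob pattern and output path below are made up for illustration):

# Sketch only: outer-merge per-page CSVs on a shared key column. The file
# paths are hypothetical; only the key column name comes from the query above.
import csv
import glob


def merge_language_pages(pattern, key="item__conceptum__codicem"):
    """Merge every CSV matching `pattern` on the shared key column."""
    merged = {}
    columns = [key]
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            for name in reader.fieldnames or []:
                if name not in columns:
                    columns.append(name)
            for row in reader:
                merged.setdefault(row[key], {}).update(row)
    return columns, list(merged.values())


columns, rows = merge_language_pages("999999/temporary/P1585_pagina_*.csv")
with open("999999/temporary/P1585_merged.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.DictWriter(out, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)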

temp directory

We have not even merged the remaining 17 of the 20 language pages and the file size is already 1.6 MB. No idea how big this will be with all languages.

Screenshot 2022-05-12 13-03-43

fititnt added a commit that referenced this issue May 12, 2022
…s --cum-interlinguis=P1,P2... (MVP of generation of query adding more attributes)
fititnt added a commit that referenced this issue May 12, 2022
…s --cum-interlinguis=P1,P2... (bugfix; duplicated related atributes now concatenate with |)
fititnt added a commit that referenced this issue May 12, 2022
…ata_p_ex_linguis, (draft) wikidata_p_ex_totalibus
fititnt commented May 12, 2022

Performance issues

Hmm... now the issue is doing heavy optimization on the queries to mitigate the timeouts. The ones with over 5,000 concepts are the problem here, even when splitting the 1603_1_51 languages into 20 parts.

Maybe one strategy would be to allow removing the ORDER BY from the queries that deal with languages and do it client-side (i.e. sort with bash / HXL CLI tools).
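
A sketch of that client-side ordering, assuming a merged CSV whose item__conceptum__codicem column holds the Wikidata Q ids (the plan above mentions bash / HXL CLI tools; this only expresses the same numeric sort in Python):

# Sketch only: reproduce the query's ORDER BY ASC(?id_numeric) on the client,
# so the SPARQL itself no longer needs to sort. Paths are hypothetical.
import csv


def sort_by_numeric_qid(in_path, out_path, key="item__conceptum__codicem"):
    """Sort a CSV by the numeric part of its Wikidata Q id column."""
    with open(in_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames
        rows = sorted(reader, key=lambda row: int(row[key].lstrip("Q")))
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)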

fititnt added a commit that referenced this issue May 13, 2022
fititnt added a commit that referenced this issue May 13, 2022
… ordering on linguistic query to mitigate timeouts
fititnt added a commit that referenced this issue May 14, 2022
fititnt added a commit that referenced this issue May 15, 2022
fititnt commented May 15, 2022

bash wikidata_p_ex_totalibus()

The bash helper, while it still needs more testing, already somewhat deals with retrying. For something such as P1585 it now uses 1 + 20 queries.

However, later we obviously should get the data from primary sources (in the case of IBGE, I think https://servicodados.ibge.gov.br/api/docs/localidades does it) and use that as the primary reference, potentially validating the information from Wikidata.
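
A sketch of that cross-check, assuming the IBGE localidades API serves the municipality list at /api/v1/localidades/municipios as JSON objects with "id" and "nome" fields (an assumption to confirm against the linked docs):

# Sketch only: the endpoint path and JSON field names are assumptions taken
# from the IBGE "localidades" API documentation linked above.
import json
import urllib.request

IBGE_MUNICIPIOS = "https://servicodados.ibge.gov.br/api/v1/localidades/municipios"


def ibge_municipality_codes():
    """Return the set of official IBGE municipality codes as strings."""
    with urllib.request.urlopen(IBGE_MUNICIPIOS, timeout=60) as response:
        return {str(item["id"]) for item in json.load(response)}


def compare_with_wikidata(wikidata_codes):
    """Report codes that diverge between the primary source and Wikidata."""
    official = ibge_municipality_codes()
    print("On IBGE but missing on Wikidata:", sorted(official - set(wikidata_codes)))
    print("On Wikidata but unknown to IBGE:", sorted(set(wikidata_codes) - official))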

Generalizations

It turns out that we can start bootstrapping other tables (the ones already perfect on Wikidata) the same way as done with the IBGE municipalities, including translations in several languages!

However, the same ideal approach (rely on primary sources, then augment with Wikidata) would somewhat apply here too. Sometimes this may not be really relevant: for example, for something that is not strictly a place (like P6555 //identificador de Unidade Eleitoral brasileira//@por-Latn, the Brazilian electoral unit identifier), it would not really make sense for the end user to just print the municipalities.

Also, eventually we will need to think of this somewhat as an ontology, otherwise #41 would not be as efficient for general users.

Print screens

Screenshot 2022-05-15 15-46-41

Screenshot 2022-05-15 15-51-32

fititnt commented Jul 22, 2022

Already implemented and used in practice. Closing for now.

fititnt closed this as completed Jul 22, 2022