
MVP of [1603.45.16] /"Ontologia"."United Nations"."P"/@eng-Latn #2

Open
fititnt opened this issue Jan 4, 2022 · 7 comments
Labels
dictionaria-specificis dictiōnāria specificīs; /specific group of dictionaries/@eng-Latn

Comments


fititnt commented Jan 4, 2022



This issue is about a minimal viable product for encoding all the publicly available P-Codes in numerordinatio. The scripts may need some cron job or manual upgrades over time, but this issue is mostly about having at least a first version.

Replacing ISO 3166-1 alpha-2 with UN M49

P-Codes are prefixed with 2-letter codes, which has the advantage of dealing with leading zeros. So, for P-Codes, leading letters make sense, and they also allow pure P-Codes to be used as programming variables. However, with how numerordinatio works, we can go fully numeric.

[1603.45.16] vs [1603.45.49]

In theory, [1603.45.16] could be a more specific version of [1603.45.49] (https://unstats.un.org/unsd/methodology/m49/) instead of having its own base namespace. This may change later.

Another point is that, depending on how the numerordinatio is done, the codes could have aliases.


Changes

  • [1603.45.15] renamed to [1603.45.16] (in the US-ASCII alphabet, which includes K, P is the 16th letter, not the 15th).
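As a quick sanity check of that renaming (plain Python, just illustrating the letter arithmetic):

# P is the 16th letter of the basic Latin (US-ASCII) alphabet, so the suffix is 16, not 15.
assert ord('P') - ord('A') + 1 == 16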
fititnt changed the title Jan 4, 2022
fititnt referenced this issue in EticaAI/lsf-cache Jan 5, 2022

fititnt commented Jan 5, 2022

While the https://drive.google.com/file/d/1jRshR0Mywd_w8r6W2njUFWv7oDVLgKQi/view?usp=sharing is not a new version (it is from around 8 months ago, so things are now likely more consistent), the sheet names are already not consistent (so it is not possible to just pile up the zip output, as some files would replace others). However, they likely still follow patterns.

Since each dataset's metadata can (and often will) be upgraded over time, ingestion of a more centralized version would need to be able to normalize more than one format at the same time, like "Admin", "Adm", "adm", and the combinations with a country prefix.
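As a rough illustration of that normalization (hypothetical Python, not an existing script in this repository; the pattern and function name are made up for the example):

import re
from typing import Optional

# Collapse the sheet-name variants seen in the preview below ("afg_adm2",
# "Admin3", "adm0", ...) into a canonical "admN" label, with the optional
# ISO 3166-1 alpha-3 prefix tolerated but discarded.
SHEET_RE = re.compile(r'^(?:(?P<iso3>[a-z]{3})_)?adm(?:in)?(?P<level>\d)$', re.IGNORECASE)

def normalize_sheet_name(raw: str) -> Optional[str]:
    """Return 'adm<level>' for any recognized variant, or None if unrecognized."""
    match = SHEET_RE.match(raw.strip())
    return None if match is None else 'adm' + match.group('level')

assert normalize_sheet_name('afg_adm2') == 'adm2'
assert normalize_sheet_name('Admin3') == 'adm3'
assert normalize_sheet_name('adm0') == 'adm0'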

I will copy the preview here, since https://github.com/EticaAI/ndata is likely to have its history wiped several times to save space.


___.csv

#meta,#meta+archivum,#meta+iso3,#meta+sheets+original,#meta+sheets+new
1603.45.16:,afg.xlsx,afg,afg_adm2 afg_adm1 afg_adm0 ,
1603.45.16:,ago.xlsx,ago,ago_adm3 ago_adm2 ago_adm1 ago_adm0 ,
1603.45.16:,arg.xlsx,arg,arg_adm2 arg_adm1 arg_adm0 ,
1603.45.16:,arm.xlsx,arm,arm_adm2 arm_adm1 arm_adm0 ,
1603.45.16:,aze.xlsx,aze,adm1 adm0 ,
1603.45.16:,bdi.xlsx,bdi,bdi_adm2 bdi_adm1 bdi_adm0 ,
1603.45.16:,ben.xlsx,ben,ben_adm2 ben_adm1 ben_adm0 ,
1603.45.16:,bfa.xlsx,bfa,bfa_adm3 bfa_adm2 bfa_adm1 bfa_adm0 ,
1603.45.16:,bgd.xlsx,bgd,bgd_adm4 bgd_adm3 bgd_adm2 bgd_adm1 bgd_adm0 ,
1603.45.16:,bgr.xlsx,bgr,bgr_adm2 bgr_adm1 bgr_adm0 ,
1603.45.16:,blr.xlsx,blr,blr_adm2 blr_adm1 blr_adm0 ,
1603.45.16:,bmu.xlsx,bmu,bmu_adm2 bmu_adm1 bmu_adm0 ,
1603.45.16:,bol.xlsx,bol,bol_adm3 bol_adm2 bol_adm1 bol_adm0 ,
1603.45.16:,bra.xlsx,bra,adm2 adm1 adm0 ,
1603.45.16:,btn.xlsx,btn,adm2 adm1 adm0 ,
1603.45.16:,caf.xlsx,caf,caf_adm4 caf_adm3 caf_adm2 caf_adm1 caf_adm0 ,
1603.45.16:,chl.xlsx,chl,adm3 adm2 adm1 adm0 ,
1603.45.16:,chn.xlsx,chn,adm2 adm1 adm0 ,
1603.45.16:,civ.xlsx,civ,civ_adm3 civ_adm2 civ_adm1 civ_adm0 ,
1603.45.16:,cmr.xlsx,cmr,cmr_adm3 cmr_adm2 cmr_adm1 cmr_adm0 ,
1603.45.16:,cod.xlsx,cod,cod_adm2 cod_adm1 cod_adm0 ,
1603.45.16:,cog.xlsx,cog,cog_adm2 cog_adm1 cog_adm0 ,
1603.45.16:,col.xlsx,col,col_adm2 col_adm1 col_adm0 ,
1603.45.16:,com.xlsx,com,com_adm3 com_adm2 com_adm1 com_adm0 ,
1603.45.16:,cpv.xlsx,cpv,cpv_adm2 cpv_adm1 cpv_adm0 ,
1603.45.16:,cri.xlsx,cri,adm2 adm1 adm0 ,
1603.45.16:,dji.xlsx,dji,dji_adm2 dji_adm1 dji_adm0 ,
1603.45.16:,dma.xlsx,dma,dma_adm1 dma_adm0 ,
1603.45.16:,dom.xlsx,dom,dom_adm4 dom_adm3 dom_adm2 dom_adm1 dom_adm0 ,
1603.45.16:,dza.xlsx,dza,dza_adm2 dza_adm1 dza_adm0 ,
1603.45.16:,ecu.xlsx,ecu,ecu_adm3 ecu_adm2 ecu_adm1 ecu_adm0 ,
1603.45.16:,egy.xlsx,egy,egy_adm3 egy_adm2 egy_adm1 egy_adm0 ,
1603.45.16:,eri.xlsx,eri,eri_adm2 eri_adm1 eri_adm0 ,
1603.45.16:,eth.xlsx,eth,adm3 adm2 adm1 adm0 ,
1603.45.16:,fji.xlsx,fji,fji_adm3 fji_adm2 fji_adm1 fji_adm0 ,
1603.45.16:,fsm.xlsx,fsm,fsm_adm2 fsm_adm1 fsm_adm0 ,
1603.45.16:,gab.xlsx,gab,gab_adm2 gab_adm1 gab_adm0 ,
1603.45.16:,geo.xlsx,geo,geo_adm2 geo_adm1 geo_adm0 ,
1603.45.16:,gha.xlsx,gha,gha_adm2 gha_adm1 gha_adm0 ,
1603.45.16:,gin.xlsx,gin,gin_adm3 gin_adm2 gin_adm1 gin_adm0 ,
1603.45.16:,gtm.xlsx,gtm,gtm_adm2 gtm_adm1 gtm_adm0 ,
1603.45.16:,guf.xlsx,guf,guf_adm2 guf_adm1 guf_adm0 ,
1603.45.16:,hnd.xlsx,hnd,adm2 adm1 adm0 ,
1603.45.16:,hti.xlsx,hti,hti_adm3 hti_adm2 hti_adm1 hti_adm0 ,
1603.45.16:,idn.xlsx,idn,idn_adm4 idn_adm3 idn_adm2 idn_adm1 idn_adm0 ,
1603.45.16:,irn.xlsx,irn,irn_adm2 irn_adm1 irn_adm0 ,
1603.45.16:,irq.xlsx,irq,irq_adm3 irq_adm2 irq_adm1 irq_adm0 ,
1603.45.16:,kaz.xlsx,kaz,kaz_adm2 kaz_adm1 kaz_adm0 ,
1603.45.16:,ken.xlsx,ken,ken_adm2 ken_adm1 ken_adm0 ,
1603.45.16:,kgz.xlsx,kgz,kgz_adm3 kgz_adm2 kgz_adm1 kgz_adm0 ,
1603.45.16:,khm.xlsx,khm,khm_adm3 khm_adm2 khm_adm1 khm_adm0 ,
1603.45.16:,kir.xlsx,kir,kir_adm2 kir_adm1 kir_adm0 ,
1603.45.16:,lao.xlsx,lao,lao_adm2 lao_adm1 lao_adm0 ,
1603.45.16:,lbn.xlsx,lbn,adm3 adm2 adm1 adm0 ,
1603.45.16:,lbr.xlsx,lbr,lbr_adm2 lbr_adm1 lbr_adm0 ,
1603.45.16:,lby.xlsx,lby,lby_adm2 lby_adm1 lby_adm0 ,
1603.45.16:,lca.xlsx,lca,lca_adm2 lca_adm1 lca_adm0 ,
1603.45.16:,lka.xlsx,lka,lka_adm4 lka_adm3 lka_adm2 lka_adm1 lka_adm0 ,
1603.45.16:,lso.xlsx,lso,lso_adm2 lso_adm1 lso_adm0 ,
1603.45.16:,mar.xlsx,mar,adm2 adm1 adm0 ,
1603.45.16:,mda.xlsx,mda,mda_adm1 mda_adm0 ,
1603.45.16:,mdg.xlsx,mdg,mdg_adm4 mdg_adm3 mdg_adm2 mdg_adm1 mdg_adm0 ,
1603.45.16:,mex.xlsx,mex,adm2 adm1 adm0 ,
1603.45.16:,mkd.xlsx,mkd,mkd_adm4 mkd_adm3 mkd_adm2 mkd_adm1 mkd_adm0 ,
1603.45.16:,mli.xlsx,mli,mli_adm3 mli_adm2 mli_adm1 mli_adm0 ,
1603.45.16:,mng.xlsx,mng,adm2 adm1 adm0 ,
1603.45.16:,moz.xlsx,moz,moz_adm3 moz_adm2 moz_adm1 moz_adm0 ,
1603.45.16:,mrt.xlsx,mrt,adm2 adm1 adm0 ,
1603.45.16:,mtq.xlsx,mtq,mtq_adm2 mtq_adm1 mtq_adm0 ,
1603.45.16:,mus.xlsx,mus,adm1 adm0 ,
1603.45.16:,mwi.xlsx,mwi,mwi_adm3 mwi_adm2 mwi_adm1 mwi_adm0 ,
1603.45.16:,nam.xlsx,nam,nam_adm2 nam_adm1 nam_adm0 ,
1603.45.16:,ner.xlsx,ner,ner_adm3 ner_adm2 ner_adm1 ner_adm0 ,
1603.45.16:,nga.xlsx,nga,nga_adm3 nga_adm2 nga_adm1 nga_adm0 ,
1603.45.16:,nic.xlsx,nic,adm2 adm1 adm0 ,
1603.45.16:,npl.xlsx,npl,adm2 adm1 adm0 ,
1603.45.16:,pak.xlsx,pak,pak_adm3 pak_adm2 pak_adm1 pak_adm0 ,
1603.45.16:,pan.xlsx,pan,adm3 adm2 adm1 adm0 ,
1603.45.16:,per.xlsx,per,adm3 adm2 adm1 adm0 ,
1603.45.16:,phl.xlsx,phl,adm3 adm2 adm1 adm0 ,
1603.45.16:,png.xlsx,png,png_adm3 png_adm2 png_adm1 png_adm0 ,
1603.45.16:,pry.xlsx,pry,adm2 adm1 adm0 ,
1603.45.16:,pse.xlsx,pse,adm2 adm1 adm0 ,
1603.45.16:,rwa.xlsx,rwa,rwa_adm4 rwa_adm3 rwa_adm2 rwa_adm1 rwa_adm0 ,
1603.45.16:,sdn.xlsx,sdn,adm2 adm1 adm0 ,
1603.45.16:,sen.xlsx,sen,sen_adm3 sen_adm2 sen_adm1 sen_adm0 ,
1603.45.16:,slb.xlsx,slb,slb_adm3 slb_adm2 slb_adm1 slb_adm0 ,
1603.45.16:,sle.xlsx,sle,sle_adm4 sle_adm3 sle_adm2 sle_adm1 sle_adm0 ,
1603.45.16:,slv.xlsx,slv,adm2 adm1 adm0 ,
1603.45.16:,som.xlsx,som,som_adm2 som_adm1 som_adm0 ,
1603.45.16:,ssd.xlsx,ssd,adm2 adm1 adm0 ,
1603.45.16:,stp.xlsx,stp,adm2 adm1 adm0 ,
1603.45.16:,swz.xlsx,swz,swz_adm2 swz_adm1 swz_adm0 ,
1603.45.16:,sxm.xlsx,sxm,sxm_adm2 sxm_adm1 sxm_adm0 ,
1603.45.16:,syc.xlsx,syc,adm3 adm2 adm1 adm0 ,
1603.45.16:,syr.xlsx,syr,Admin3 Admin2 Admin1 Admin0 ,
1603.45.16:,tcd.xlsx,tcd,adm3 adm2 adm1 adm0 ,
1603.45.16:,tgo.xlsx,tgo,Admin3 Admin2 Admin1 Admin0 ,
1603.45.16:,tha.xlsx,tha,tha_adm3 tha_adm2 tha_adm1 tha_adm0 ,
1603.45.16:,tls.xlsx,tls,adm3 adm2 adm1 adm0 ,
1603.45.16:,ton.xlsx,ton,ton_adm3 ton_adm2 ton_adm1 ton_adm0 ,
1603.45.16:,tur.xlsx,tur,tur_adm4 tur_adm3 tur_adm2 tur_adm1 tur_adm0 ,
1603.45.16:,tza.xlsx,tza,tza_adm3 tza_adm2 tza_adm1 tza_adm0 ,
1603.45.16:,uga.xlsx,uga,adm4 adm3 adm2 adm1 adm0 ,
1603.45.16:,ukr.xlsx,ukr,ukr_adm4 ukr_adm3 ukr_adm2 ukr_adm1 ukr_adm0 ,
1603.45.16:,ury.xlsx,ury,adm2 adm1 adm0 ,
1603.45.16:,uzb.xlsx,uzb,adm2 adm1 adm0 ,
1603.45.16:,ven.xlsx,ven,ven_adm3 ven_adm2 ven_adm1 ven_adm0 ,
1603.45.16:,vnm.xlsx,vnm,adm2 adm1 adm0 ,
1603.45.16:,vut.xlsx,vut,vut_adm2 vut_adm1 vut_adm0 ,
1603.45.16:,yem.xlsx,yem,yem_adm3 yem_adm2 yem_adm1 yem_adm0 ,
1603.45.16:,zaf.xlsx,zaf,adm4 adm3 adm2 adm1 adm0 ,
1603.45.16:,zmb.xlsx,zmb,adm2 adm1 adm0 ,
1603.45.16:,zwe.xlsx,zwe,zwe_adm3 zwe_adm2 zwe_adm1 zwe_adm0 ,


fititnt commented Jan 5, 2022

I think a pure POSIX-shell function could make a quick-and-dirty conversion from these headings to HXL, without needing more complex features.

meta-de-caput.uniq.txt
meta-de-caput.csv
meta-de-archivum.csv

Maybe we will not need a full table of languages to generate the terms. So, worst case scenario, they can be hardcoded.
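For illustration only, a Python sketch of the heading-to-HXL step (the actual quick-and-dirty version would be the POSIX-shell function mentioned above; the mapping dictionary here is invented, not the project's final hashtag convention):

import csv
import sys

# Purely illustrative heading -> HXL hashtag mapping; the real hashtags and
# attributes still need to be decided (see the note about validOn/validTo below).
HEADING_TO_HXL = {
    'ADM1_PCODE': '#adm1+code',
    'ADM1_EN': '#adm1+name',
    'ADM2_PCODE': '#adm2+code',
    'ADM2_EN': '#adm2+name',
    'validOn': '#date+valid_on',  # hashtag choice uncertain
    'validTo': '#date+valid_to',  # hashtag choice uncertain
}

def hxlate(in_path: str, out_path: str) -> None:
    """Copy a CSV, inserting an HXL hashtag row right after the header row."""
    with open(in_path, newline='') as src, open(out_path, 'w', newline='') as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader)
        writer.writerow(header)
        writer.writerow([HEADING_TO_HXL.get(column, '') for column in header])
        writer.writerows(reader)

if __name__ == '__main__':
    hxlate(sys.argv[1], sys.argv[2])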


fititnt commented Jan 5, 2022

An MVP of the HXLated result already exists.

(Screenshot: "Captura de tela de 2022-01-05 04-24-39", preview of the HXLated result)

Notes

I'm not 100% sure about the HXL hashtags for the raw headers validOn and validTo. The https://tools.humdata.org/examples/hxl/ page does have examples, but for data that uses P-Codes, not for the P-Code tables themselves.

fititnt referenced this issue in EticaAI/lsf-cache Jan 5, 2022
fititnt referenced this issue in EticaAI/lsf-cache Jan 6, 2022
fititnt transferred this issue from EticaAI/numerordinatio Jan 10, 2022

fititnt commented Jan 10, 2022

$ wc -l 1603/45/16/999/1603_45_16_1_15828996298662.hxl.csv 
432262 1603/45/16/999/1603_45_16_1_15828996298662.hxl.csv

$ ls -lha  999999/1603/45/16/hxl/ | wc -l
408
$ ls -lha  999999/1603/45/16/hxl/*_0* | wc -l
114
$ ls -lha  999999/1603/45/16/hxl/* | wc -l
404
$ ls -lha  999999/1603/45/16/hxl/*0* | wc -l
114
$ ls -lha  999999/1603/45/16/hxl/*1* | wc -l
114
$ ls -lha  999999/1603/45/16/hxl/*2* | wc -l
110
$ ls -lha  999999/1603/45/16/hxl/*3* | wc -l
53
$ ls -lha  999999/1603/45/16/hxl/*4* | wc -l
13
$ ls -lha  999999/1603/45/16/hxl/*5* | wc -l
0

Trivia: there exist at least 432,262 published place codes worldwide (admin levels 0 to 4; admin levels 5 and 6 are not attested).

A minimal non-compressed CSV with every code would be around 13 MB. Also, how they are flattened makes a difference in the space used. But the good thing is that we're far below GitHub's ideal maximum of 50 MB (the hard limit is 100 MB).
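A back-of-the-envelope check of that figure (the bytes-per-row value is an assumption for illustration; the real size depends on how the data is flattened):

rows = 432_262
bytes_per_row = 30  # assumed: one P-Code plus separators and a short parent reference
print(f'{rows * bytes_per_row / 1024 / 1024:.1f} MiB')  # ~12.4 MiB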

Conventions on how to use UN M49 private namespaces as reference for compiled results

On this topic

From Wikipedia: https://en.wikipedia.org/wiki/UN_M49#Private-use_codes_and_reserved_codes
Private-use codes and reserved codes
Beside the codes standardized above, the numeric codes 900 to 999 are reserved for private-use in ISO 3166-1 (under agreement by the UNSD) and in the UN M.49 standard. They may be used for any other groupings or subdivision of countries, territories and regions.

Some of these private-use codes may be found in some UN statistics reports and databases, for their own specific purpose. They are not portable across databases from third parties (except through private agreement), and may be changed without notice.

Note that the code 000 is reserved and not used for defining any region. It is used in absence of data, or for data in which no region (not even the World as a whole) is applicable. For unknown or unencoded regions, private-use codes should preferably be used.

For aggregated datasets related to world places, I believe we should start using the private namespaces for this and document the logic. This saves a lot of upfront drama with scripting.
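A trivial helper (sketch only; the function name is made up) reflecting the ranges quoted above:

def is_un_m49_private_use(code: int) -> bool:
    """True if the numeric code is in the 900-999 range reserved for private use."""
    return 900 <= code <= 999

# 000 is reserved for "no region applicable / absence of data", not for private use.
assert is_un_m49_private_use(999)
assert not is_un_m49_private_use(0)
assert not is_un_m49_private_use(76)  # a standardized country code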

On the logic of "population statistics"

I think aggregated population statistics are a different issue, but the sole major reason for the classic 1970s UN M49 (https://unstats.un.org/unsd/publication/SeriesM/Series_M49_(1970)_en-fr.pdf) was this type of statistics. Wikipedia says this is no longer used, but it makes total sense for us here.

However, I think population statistics are not a priority. But I know there is more than one dataset (and they are more automated), so at least at adm0 (country level) this would not be hard to automate. And we're already going for more detailed data, at least for countries such as Brazil, for which we may have additional sources.

Another priority would be to start mapping the P-Codes to Wikidata. Then things are going to be relevant.

fititnt added a commit that referenced this issue Jun 10, 2022
…DATA / HXL_ATTRIBUTES_AD_WIKIDATA mappings draft
fititnt added a commit that referenced this issue Jun 10, 2022
fititnt added a commit that referenced this issue Jun 11, 2022
fititnt added a commit that referenced this issue Jun 11, 2022
fititnt added a commit that referenced this issue Jun 11, 2022
…lementation (based on dictionary) of COD-AB like data to RDF+HXL
fititnt added a commit that referenced this issue Jun 11, 2022
…rk the original CSV/HXL/HXLTM exporter also save upper levels, so it make easier for make RDF relationship from the most detailed administrative region availible
fititnt added a commit that referenced this issue Jun 12, 2022
fititnt added a commit that referenced this issue Jun 12, 2022
…ries, local only (time: 30m28,338s); before RDF relations
fititnt added a commit that referenced this issue Jun 12, 2022
fititnt added a commit that referenced this issue Jun 12, 2022
…coded list will make it work for common cases at sort term
fititnt added a commit that referenced this issue Jun 12, 2022
fititnt added a commit that referenced this issue Jun 12, 2022
…9999_54872.py --objectivum-formato=_temp_hxl_meta_in_json
fititnt added a commit that referenced this issue Jun 13, 2022
fititnt added a commit that referenced this issue Jun 13, 2022

fititnt commented Jun 18, 2022

TL;DR: the way graph databases work means, in humanitarian jargon, that a single 10 MB to 200 MB RDF file(*) can hold the entire country-level data, but instead of Excel or SQL, users could use high-level interfaces such as Protégé and even get semantic reasoning. While some humanitarians might be most interested in the AI part of this, from our point of view this actually 1) helps allow the numeric codes we use to be in the local language of the people who actually work on that in their own country, and 2) means the initial very, very hard work to automate documentation will eventually use reasoning to validate itself.

(*): however, the "single file" with everything relevant to a region does not mean the public data should already be such a huge file, because people would likely have their own local data or want different public data. Either users add each piece of reference data or tools can merge the final result, but sensitive data does not need to leave the users' network.


Rationale behind [1603:16:{unm49}] prefix

While we could have started generating all the work weeks ago (and not gone further into RDF SKOS), after it became viable to load every dictionary we have into SQLite/PostgreSQL using well-formed CSVs (topic #37), my time trying to "organize things" also in graph format (topic #41) made me rethink the entire organization.

Under a graph format, some entry points would have a huge amount of connected data

While there are different ways to partition data, unless we employ a purely algorithmic one (i.e. some way to divide the Earth into equal blocks), most users will tend to organize or share data by administrative region. The humanitarian sector (at the international level) tends to use the higher levels, but places like Brazil would go much more specialized by region.

Side comment: ok, beyond this, there are also cases where data would be shared by region (for example, the coverage area of a hospital), which is likely to be more dynamic (and so the area would need to be shared too), but for the sake of both national and international interoperability it makes sense to make it as easy as possible for everyone, at a bare minimum, to have ways to use the codes closest to standard.

One implication of the "way to organize in graph format" is that while today we use something like [1603:45:16:76:2:3106200] to represent Belo Horizonte, removing the "45" makes the number shorter, and the amount of data attached to [1603:16:{unm49}] is very, very high. Actually, if we ignore the translations of place names we can get from Wikidata (places such as Rio de Janeiro can have over 200), pretty much every time someone uses the namespaces keyed by P-Codes they will be sharing final data, not data about CODs.
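Illustration only (the compiled form under [1603:16:{unm49}] is still an open design choice; the numbers come from the Belo Horizonte example above):

def numerordinatio(*parts) -> str:
    """Join numerordinatio components with ':' (illustrative helper)."""
    return ':'.join(str(part) for part in parts)

# 76 is the UN M49 numeric code for Brazil; 3106200 is the local code used for
# Belo Horizonte in the example above.
cod_ab_style = numerordinatio(1603, 45, 16, 76, 2, 3106200)  # '1603:45:16:76:2:3106200'
compiled_style = numerordinatio(1603, 16, 76, 2, 3106200)    # '1603:16:76:2:3106200', one level shorter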

On the idea that we might have other [1603:16:{unm49}] where "16" is a different nomenclature

After such a simplification of removing the 45, since the entire idea is to document/automate data interoperability (so, different from the HXL Standard, which only has vocabulary for aggregated data, we're optimizing for the >90% of use cases, e.g. sensitive data that should be processed offline), we will start to have namespaces for things that are not strictly places. For example:

  • Natural persons
  • Organizations

However, the way the world is organized means we should also plan ahead for how to partition the data which is not about places at all. Even if we consider Brazil alone, several government instances can have different entry points for the same concept (like a person), and even at admin0 ("country" level in the case of Brazil), for cases such as COVID-19 vaccines the entry points with public data could go over 170 million natural persons (>177,550,128). That is so much data that RDF wouldn't scale well compared to SQLite.

Ok. I'm aware it might sound "lazy" to recommend that people, unless someone else in their country proposes a better numeric namespace, use a key similar to what we do for administrative boundaries when in doubt, as soon as we later create a base namespace to suggest as a reference for storing disaggregated data (at least Persons and Organizations).

This "lazy" way somewhat also helps to simplify a lot tooling that could allow (like by user additional parameter" to infer that such organization or person have its data coordinated by respectively administrative region. This might not be the case, but at least users have some default suggestion that tools would work perfectly. In any case, it would still be possible to (at country level) different organizations have mappings of what one person or organization code inside them means on another level.

A different "1603" on something like  [1603:16:{unm49}] might be used for data that is not 100% factual (like anonymized, test data, simulated data, etc)

This makes far more sense when dealing with data about natural persons or organizations, but we might have a totally different global prefix for data that is an entirely different class of data.

In the context of data directly associated with places, [1603:16:{unm49}], this would mean (at worst case, so it is NOT used for real):

  • The country and its internal divisions are a fake country; this might be relevant in simulations about epidemics where people would complain if real regions were used (less relevant if it is country-level preparedness, but not if other regions are invited to discuss).
  • The places are real, but the associated data is fully simulated. Since the root 1603 would be different, tools would need humans to explicitly state what data from "production data / reference data" can be applied to this simulated place.

This topic alone would require an open issue here about how we will handle this type of data. However, there are some edge cases where what humanitarians use as production data is already a "simulation" or statistical inference based on real data (use case: a country's person-to-person census is too old), so even something like the persons living in an area can get quite complicated. For example, in the same way we would "get data from a more specific namespace" at [1603:45:16] (humanitarian P-Codes) to publish on [1603:16], since we're already preparing to allow ingesting data into tabular and graph databases, we have to somewhat assume users might not like our default choices, so things are likely to have more versions (not just statistical inference about population, whose entire methodology is different), and it makes sense to have different options.


fititnt commented Jun 29, 2022

Hmm... we will need some documented way to:

  1. Encode P-Codes by administrative level, which would also allow compatibility with RDF.
    1. However, there is a Wikidata property for UN M49, but not for P-Codes (see https://www.wikidata.org/wiki/Wikidata:Database_reports/List_of_properties/all).
    2. We can still use https://en.wikipedia.org/wiki/Place_code, which has the Q code https://www.wikidata.org/wiki/Q7200235, as something like part_of.
  2. For #admX+i_nnn+i_Nnnn without +altN, we can start encoding as skos:prefLabel.
  3. For #admX+i_nnn+i_Nnnn with +altN, we can start encoding as skos:altLabel, but since we can have several, it would be a good idea to merge the fields with some separator (a rough sketch follows below).

The current drafts already work for HXL / HXLTM, but without this change, extra hardcoded logic would be needed. Another issue is that the current use of +altN allows passing the validation for not having duplicate field names (which frictionless validate file.csv and databases complain about), but it "is not semantic".
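A minimal sketch of points 2 and 3 (assuming rdflib and a hypothetical urn:-style base IRI, neither of which is settled here; in RDF the +altN variants can simply become extra skos:altLabel triples rather than one merged field):

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import SKOS

BASE = Namespace('urn:1603:16:')  # hypothetical IRI scheme, for illustration only

graph = Graph()
graph.bind('skos', SKOS)

place = BASE['76:2:3106200']
graph.add((place, SKOS.prefLabel, Literal('Belo Horizonte', lang='pt')))  # the field without +altN
graph.add((place, SKOS.altLabel, Literal('BH', lang='pt')))               # an invented +altN variant

print(graph.serialize(format='turtle'))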


Edit:

  • A reason to encode P-Codes by administrative level is performance on graph databases. With over 450,000 P-Codes, if every one of them becomes directly linked to a single upper concept, it will not scale. Even when dumping a Turtle file, that would put those >450,000 codes potentially on a single line. Not good.
  • We could still "merge" (maybe with RML) some way to allow a global search without administrative level, but at least this issue would be restricted to fewer queries. If the user or API generating the query at least knows either the level OR that the prefix means a specific country, it could rewrite the query to a much, much smaller search space. Either the administrative level OR the prefix alone would already drastically reduce the performance issues (which only matter when the entire world is loaded; not really an issue for regional users). A rough sketch of this partitioning follows below.
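Sketch only (the row shape and the sample codes are invented for illustration): bucketing rows by administrative level, so a query that knows the level, or the country prefix, touches a much smaller partition.

from collections import defaultdict

def partition_by_level(rows):
    """rows: iterable of (pcode, admin_level, parent_pcode) tuples (illustrative shape)."""
    buckets = defaultdict(list)
    for pcode, level, parent in rows:
        buckets[level].append((pcode, parent))
    return buckets

sample = [('XA01', 1, 'XA'), ('XA01001', 2, 'XA01'), ('XB01', 1, 'XB')]  # made-up codes
for level, members in sorted(partition_by_level(sample).items()):
    print(f'adm{level}: {len(members)} codes')  # one output file / named graph per level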

fititnt added a commit that referenced this issue Jun 30, 2022
fititnt added a commit that referenced this issue Jul 12, 2022
…local numeric identifiers (brute force creation of IDs based on P-Codes may fail since some places have letters in the middle of P-Codes)
fititnt added a commit that referenced this issue Jul 12, 2022
… local numeric identifiers (brute force creation of IDs based on P-Codes may fail since some places have letters in the middle of P-Codes)
fititnt added a commit that referenced this issue Jul 24, 2022