Strategies to detect, then fix on external sources or report, issues with likely wrong codes themselves (not just translations) #6

Open
fititnt opened this issue Jan 11, 2022 · 0 comments
fititnt commented Jan 11, 2022

On the temporary namespace 999999 we're already downloading and pre-processing similar datasets. However, when reconciling data between sources, the best case is missing data; at worst, it is likely that even the codes (at least the ones from non-primary sources) are actually wrong. This is becoming clear as we turn this into a monorepo.

Please note that I'm not saying "Wikipedia is wrong" (actually Wikidata, which allows public domain reuse, including of translations, so we don't even need web scraping). But errors can happen in non-primary sources, such as thesauri or data providers reusing codes from others. Wikidata may then either reflect such an error, or the error may come from someone else sharing data about a concept (such as a location).

This type of exchange is also called "data roundtripping": https://diff.wikimedia.org/2019/12/13/data-roundtripping-a-new-frontier-for-glam-wiki-collaborations/. It is quite relevant for underfunded organizations that already exchange codes.

Potential actions outside here (medium to long term)

If Wikidata "is wrong", we can fix it there directly. There are even APIs that allow command line operations, but I believe this kind of change would still require human input. First, because sometimes the number of fixes isn't worth the time to automate; second, because sometimes there is more than one concept on Wikidata (such as administrative regions with disputed territory; these disputes show up even in translated labels).

However, what if other ontologies, dictionaries, or thesauri have inconsistencies? In simple cases it's missing data, and the fix could be as simple as sending an email with a link to a spreadsheet where they could update it. However, if it is not only missing data but potential errors in codes (such as data keyed by ISO 3166-1 alpha-2), then humans could compare and check against other sources.
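
As a minimal sketch of the comparison step above, the following groups codes by concept across sources and lists both disagreements and gaps for a human to review. The source names, concept identifiers, and the `find_code_conflicts` / `find_missing` helpers are hypothetical, not part of this repository.

```python
from collections import defaultdict

# Hypothetical input: {source_name: {concept_id: ISO 3166-1 alpha-2 code}}
sources = {
    "source_a": {"brazil": "BR", "mozambique": "MZ"},
    "source_b": {"brazil": "BZ", "mozambique": "mz"},  # "BZ" is Belize
    "source_c": {"mozambique": "MZ"},                  # Brazil is missing
}


def find_code_conflicts(sources):
    """Return concepts for which at least two sources disagree on the code."""
    by_concept = defaultdict(dict)
    for source_name, mapping in sources.items():
        for concept_id, code in mapping.items():
            # Normalize case and whitespace so "mz" and "MZ" are not flagged.
            by_concept[concept_id][source_name] = code.strip().upper()
    return {
        concept_id: codes
        for concept_id, codes in by_concept.items()
        if len(set(codes.values())) > 1
    }


def find_missing(sources):
    """Return, per source, the concepts that other sources have but it lacks."""
    all_concepts = set()
    for mapping in sources.values():
        all_concepts.update(mapping)
    missing = {}
    for source_name, mapping in sources.items():
        gap = sorted(all_concepts - set(mapping))
        if gap:
            missing[source_name] = gap
    return missing
```

The output only says that sources disagree, not which one is right; deciding that, and contacting the maintainers, remains a human task.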

Optimizations here (short to medium term)

At every stage we're already starting to check that files are not malformed for their type. This is especially relevant for input data from the outside world, but it can also catch problems in our own tools. It's not perfect and will not catch more specific human errors, but it helps with quality control.

Since the ideal is to run jobs from time to time, it is quite realistic that at some point a new version of a source (as happens with tabular data) will introduce malformed data. In that case the ideal is to keep using the old cached data until a human fixes it manually. One strategy we already use is to always prepare new datasets in temporary files and replace the final result only after checking that the format is at least valid (later the check could be stricter). This separation also helps us know which files actually need to be updated (and lets us use faster temporary directories), while letting file modification times signal which dependencies must be rebuilt.
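
The stage, validate, then replace strategy could look roughly like the sketch below. It assumes CSV input and a deliberately cheap validity check; `is_valid_csv`, `publish_if_valid`, and the `fetch` callback are hypothetical names, and the actual scripts in this repository may work differently.

```python
import csv
import filecmp
import os
import tempfile


def is_valid_csv(path, min_rows=1):
    """Cheap format check: a header exists and every row has as many columns."""
    try:
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            header = next(reader, None)
            if not header:
                return False
            rows = 0
            for row in reader:
                if len(row) != len(header):
                    return False
                rows += 1
            return rows >= min_rows
    except (OSError, UnicodeDecodeError, csv.Error):
        return False


def publish_if_valid(fetch, final_path):
    """Run fetch(temp_path), then replace final_path only if the result is valid.

    Returns "updated", "unchanged", or "rejected". On "unchanged" and
    "rejected" the old cached file (and its modification time) is untouched.
    """
    directory = os.path.dirname(os.path.abspath(final_path))
    # Assumption: the temporary file lives next to the final file so that
    # os.replace stays on one filesystem and is atomic.
    fd, temp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    try:
        fetch(temp_path)
        if not is_valid_csv(temp_path):
            return "rejected"
        if os.path.exists(final_path) and filecmp.cmp(
            temp_path, final_path, shallow=False
        ):
            return "unchanged"
        os.replace(temp_path, final_path)
        return "updated"
    finally:
        # After a successful replace the temporary path no longer exists.
        if os.path.exists(temp_path):
            os.remove(temp_path)
```

Leaving the final file untouched when the content is identical keeps its modification time stable, so a timestamp-based build step would only rebuild dependents after a real change. If a faster temporary directory on another filesystem is used instead, the final move is no longer guaranteed to be atomic.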


Additional notes

  • "Primary sources"
    • We must be aware that for very primary sources (think of an organization stating the code for an "English" word, like the name of a country), the lack of an exact explanation of what the code refers to makes primary sources non-falsifiable by design.
      • Such vagueness (in particular when such organizations do not even actually exchange data on that topic, such as ISO endorsing other codings) makes it quite convenient for ISO to do a lazy job.
      • If we want to reconcile translations such as the ones from Wikidata, this means potentially giving more help to the real primary sources, which explain their codes better, instead of wasting time with ISO (which, by the way, prohibits translations). The real primary sources are already more likely to welcome this.
  • Focus on points likely to be human error
    • Sometimes problems come from underlying political issues (such as territorial disputes). However, we are more concerned with human error, such as when labels in one language are very different or when a standard reuses another (such as ISO 3166-1 alpha-2) inconsistently. At least making maintainers aware lets them consider corrections, unless a change would break their internal systems.
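
One rough way to surface "labels in one language are very different" is a string similarity score between the labels two sources give for the same code. This is only a sketch: the `flag_divergent_labels` helper, the 0.5 threshold, and the sample data are assumptions, and a low score is a hint for human review rather than proof of an error.

```python
from difflib import SequenceMatcher


def label_similarity(a, b):
    """Similarity in [0, 1] between two labels, ignoring case."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def flag_divergent_labels(labels_a, labels_b, threshold=0.5):
    """Return (code, label_a, label_b) where same-language labels diverge.

    labels_a and labels_b map a code to its label in the same language.
    Only codes present in both sources are compared.
    """
    flagged = []
    for code in sorted(labels_a.keys() & labels_b.keys()):
        if label_similarity(labels_a[code], labels_b[code]) < threshold:
            flagged.append((code, labels_a[code], labels_b[code]))
    return flagged


# Hypothetical data: the second source attached the wrong label to "BZ".
source_a = {"BR": "Brazil", "BZ": "Belize"}
source_b = {"BR": "Brasil", "BZ": "Brazil"}
suspects = flag_divergent_labels(source_a, source_b)
```

Spelling variants such as "Brazil" and "Brasil" score high and pass, while a label attached to the wrong code tends to score low. Legitimately different names for the same place will also be flagged, which is one more reason to keep a human in the loop.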
@fititnt fititnt added the epic label Jan 11, 2022
@fititnt fititnt pinned this issue Jan 11, 2022