Re-publish the translation memories and translated list of words with reviewed language codes #5

Open
fititnt opened this issue Nov 20, 2021 · 0 comments


fititnt commented Nov 20, 2021

Note: this issue is different from #4.


Except for the Google datasets (which explicitly state their usage of BCP-47, https://tools.ietf.org/html/bcp47, and for which no potential error is known at the moment), both the work from Translators Without Borders (which may have been imported into some centralized tool before being exported to the TICO-19 repository, so this may actually not be a mistake from TWB) and the datasets provided by Facebook contain some non-standard language codes.

The ideal

One approach here is to also use BCP-47 on the work that has not yet been transformed to HXLTM. The way we encode the HXL attributes already adds ISO 639-3 and ISO 15924, but we need a better starting point.

This applies both to the content of the CSVs and TMXs and to the filenames.
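As a starting point, normalization could be a small lookup-plus-canonicalization step. This is only a sketch: the `NON_STANDARD_TO_BCP47` entries below are illustrative assumptions about the kind of codes involved, not a confirmed list of what actually appears in the TICO-19 files.

```python
# Minimal sketch: map non-standard language codes to BCP-47 tags, then
# apply the BCP-47 case convention (language lowercase, script Titlecase,
# region UPPERCASE). The mapping entries are hypothetical examples.
NON_STANDARD_TO_BCP47 = {
    "zh-CN": "zh-Hans",  # assumption: script, not country, is the real distinction
    "prs": "fa-AF",      # assumption: Dari expressed as macrolanguage + region
}

def normalize_tag(tag: str) -> str:
    """Return a BCP-47 tag with canonical casing."""
    tag = NON_STANDARD_TO_BCP47.get(tag, tag)
    parts = []
    for i, part in enumerate(tag.split("-")):
        if i == 0:
            parts.append(part.lower())   # primary language subtag
        elif len(part) == 4:
            parts.append(part.title())   # script subtag, e.g. "Hans"
        elif len(part) == 2:
            parts.append(part.upper())   # region subtag, e.g. "BR"
        else:
            parts.append(part.lower())
    return "-".join(parts)

print(normalize_tag("pt-br"))    # pt-BR
print(normalize_tag("zh-CN"))    # zh-Hans
```

The same function could be applied to filenames after splitting out the extension, which would cover both places where the codes appear.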

About the redundant country names

Some country codes seem to be relevant, but some are redundant. This is different from normalizing the language codes, since we cannot simply remove the countries either. Also, the codes without an associated country (with the suffix XX) tend to be exactly the ones that may have more variation: for example, the translated wordlists from Facebook seem to attach country codes to languages that are spoken mostly in a single country, while omitting them on the ones that actually have more variation.

On this point, we could try some checks based on which country codes the Unicode CLDR would consider redundant.
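A check in the spirit of the CLDR "likely subtags" data could look like the sketch below: if the most likely region for a bare language code reproduces the full tag, the region carries no extra information and can be dropped. The `LIKELY_REGION` table here is a tiny hand-copied sample for illustration, not the full CLDR data.

```python
# Tiny sample of CLDR-style likely-subtag data: language -> most likely region.
LIKELY_REGION = {
    "ja": "JP",  # Japanese -> Japan
    "ko": "KR",  # Korean -> South Korea
    "pt": "BR",  # per CLDR likely subtags, bare "pt" expands to pt-BR
}

def strip_redundant_region(tag: str) -> str:
    """Drop a language-REGION region subtag when it matches the likely one."""
    parts = tag.split("-")
    if len(parts) == 2 and LIKELY_REGION.get(parts[0]) == parts[1]:
        return parts[0]  # region is the likely one: redundant
    return tag           # keep distinguishing regions (e.g. pt-PT)

print(strip_redundant_region("ja-JP"))  # ja
print(strip_redundant_region("pt-PT"))  # pt-PT (kept: differs from likely BR)
```

In practice the table would be generated from the CLDR `likelySubtags` data rather than maintained by hand.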

Also, while I'm citing the Facebook datasets here, since the three big collaborators (Google, Facebook, TWB) use different language codes, each may actually be using what is common inside its own company. So, except for the cases covered by #4, this normalization step would need to be done after datasets from several collaborators are distributed in initiatives like TICO-19 in the future.
