Re-publish the translation memories and translated list of words with reviewed language codes #5

Open
fititnt opened this issue Nov 20, 2021 · 0 comments


fititnt commented Nov 20, 2021

Note: this issue is different from #4.


Except for the Google datasets (which explicitly state their usage of BCP-47, https://tools.ietf.org/html/bcp47, and for which no potential error is known at the moment), both the work from Translators Without Borders (which may have been imported into some centralized tool before being exported to the TICO-19 repository, so this may actually not be a mistake from TWB) and the datasets provided by Facebook contain some non-standard language codes.

The ideal

One approach here is to also use BCP-47 on the work that has not yet been transformed to HXLTM. The way we encode the HXL attributes already adds ISO 639-3 and ISO 15924, but we need a better starting point.

This applies both to the content of the CSVs and TMXs and to the filenames.
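As a starting point, normalization could be a small lookup-plus-canonicalization step. This is only a sketch: the `NON_STANDARD_TO_BCP47` entries below are illustrative assumptions about the kind of codes involved, not a confirmed list of what actually appears in the TICO-19 files.

```python
# Minimal sketch: map non-standard language codes to BCP-47 tags, then
# apply the BCP-47 case convention (language lowercase, script Titlecase,
# region UPPERCASE). The mapping entries are hypothetical examples.
NON_STANDARD_TO_BCP47 = {
    "zh-CN": "zh-Hans",  # assumption: script, not country, is the real distinction
    "prs": "fa-AF",      # assumption: Dari expressed as macrolanguage + region
}

def normalize_tag(tag: str) -> str:
    """Return a BCP-47 tag with canonical casing."""
    tag = NON_STANDARD_TO_BCP47.get(tag, tag)
    parts = []
    for i, part in enumerate(tag.split("-")):
        if i == 0:
            parts.append(part.lower())   # primary language subtag
        elif len(part) == 4:
            parts.append(part.title())   # script subtag, e.g. "Hans"
        elif len(part) == 2:
            parts.append(part.upper())   # region subtag, e.g. "BR"
        else:
            parts.append(part.lower())
    return "-".join(parts)

print(normalize_tag("pt-br"))    # pt-BR
print(normalize_tag("zh-CN"))    # zh-Hans
```

The same function could be applied to filenames after splitting out the extension, which would cover both places where the codes appear.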

About the redundant country names

Some country codes seem to be relevant, but some are redundant. This is different from normalizing the language codes, since we cannot simply remove the countries either. Also, the codes without an associated country (with the suffix XX) tend to be exactly the ones that may have more variation: for example, the translated wordlists from Facebook seem to attach country codes to languages that are spoken mostly in a single country, while omitting them on the ones that actually have more variation.

On this point, we could try some checks based on which country codes the Unicode CLDR would consider redundant.
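A check in the spirit of the CLDR "likely subtags" data could look like the sketch below: if the most likely region for a bare language code reproduces the full tag, the region carries no extra information and can be dropped. The `LIKELY_REGION` table here is a tiny hand-copied sample for illustration, not the full CLDR data.

```python
# Tiny sample of CLDR-style likely-subtag data: language -> most likely region.
LIKELY_REGION = {
    "ja": "JP",  # Japanese -> Japan
    "ko": "KR",  # Korean -> South Korea
    "pt": "BR",  # per CLDR likely subtags, bare "pt" expands to pt-BR
}

def strip_redundant_region(tag: str) -> str:
    """Drop a language-REGION region subtag when it matches the likely one."""
    parts = tag.split("-")
    if len(parts) == 2 and LIKELY_REGION.get(parts[0]) == parts[1]:
        return parts[0]  # region is the likely one: redundant
    return tag           # keep distinguishing regions (e.g. pt-PT)

print(strip_redundant_region("ja-JP"))  # ja
print(strip_redundant_region("pt-PT"))  # pt-PT (kept: differs from likely BR)
```

In practice the table would be generated from the CLDR `likelySubtags` data rather than maintained by hand.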

Also, while I'm citing the Facebook datasets here, since the three big collaborators (Google, Facebook, TWB) use different language codes, each may actually be using what is common inside its own company. So, except for the cases covered by #4, this normalization step would need to be done after datasets from several collaborators are distributed in initiatives like TICO-19 in the future.
