TICO-19 paper usage of "translated terminologies" vs the, de facto, "translation of list of words" on the published material #3

Open
fititnt opened this issue Nov 20, 2021 · 0 comments

fititnt commented Nov 20, 2021

It took me some time to realize this, but since we're importing to HXL and planning to export to everything else, including TBX, it needs to be addressed: the majority of "data rows" in the TICO-19 terminology donated in good faith by Google, and 100% of the TICO-19 terminology donated in good faith by Facebook, cannot be called terminology.

(This doesn't apply to the Translators Without Borders collaboration, or to the entries for which Google annotated at least part of the base minimum.)

It also cannot be called "translated terminology" (the term used on the TICO-19 website), because the source content would need to be terminology in the first place. The initial content was, in fact, closer to a "list of words". Also note that translating isolated words (or terms, which can be compositions of words) is much more complex than translating sentences. And arbitrarily preparing just the words, without any context of what they mean, doesn't make them terminology.

A fair name for these data rows is "translated list of words" or "translated wordlist" (not to be confused with the WordNet project). They also cannot simply be post-annotated after the translation (e.g. by us or someone else adding explanations where they are worth adding). To be called translated terminology, the translations would need to be reviewed again after the corrections.

"Translated wordlist" is tolerable, as it implies less strictly enforced quality control and is more aligned with the TICO-19 paper, considering what was, de facto, the shared data. By no means am I saying that such translated word lists are not useful (in fact, they're easier to bootstrap under urgency, and can be good enough for the first days, maybe weeks). But calling them terminology does no good: not for translators (and not for "low resource languages", where everyone was forced to translate terms from English), and not for users and consumers of the end material, who may assume the quality control of terminology when that is theoretically impossible to achieve with a mere word list.

When importing to HXLTM, we will need to split the content. But it makes sense to report this back to the online material at https://github.com/tico-19/tico-19.github.io. This still does not affect the TICO-19 paper (except maybe for the criticism it makes of translators' poor understanding while rushing too fast), but it does affect customers of the datasets not provided by Translators Without Borders.
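
As a rough illustration of the split mentioned above, here is a minimal sketch. It assumes, hypothetically, that each imported row is a dict and that an annotated row carries a non-empty `definition` or `context` field; those column names, the `split_rows` helper, and the sample rows are all invented for illustration and are not the actual TICO-19 or HXLTM layout.

```python
import csv
import io


def split_rows(rows, context_fields=("definition", "context")):
    """Separate rows that carry some annotation from bare words.

    `context_fields` are hypothetical column names; adjust them to
    whatever the real import uses.
    """
    terminology, wordlist = [], []
    for row in rows:
        # A row counts as annotated if any context field is non-empty.
        has_context = any((row.get(f) or "").strip() for f in context_fields)
        (terminology if has_context else wordlist).append(row)
    return terminology, wordlist


# Invented sample rows, only to show the behaviour.
SAMPLE = """source_term,definition
quarantine,Separation of people exposed to a contagious disease
mask,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
terminology, wordlist = split_rows(rows)
```

Rows landing in `wordlist` would then be exported under a "translated wordlist" label, while only the annotated (and re-reviewed) rows would be candidates for terminology formats such as TBX.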

fititnt added a commit that referenced this issue Nov 20, 2021
… they actually did not advertised as _translated terminology_; Google even explicitly still call it _Draft_