Investigate other language detection solutions #656

Open
violine1101 opened this issue May 16, 2021 · 6 comments
Labels
enhancement New feature or request

Comments

@violine1101
Member

The Problem

Currently, we're using dandelion.eu for language detection. This has the major disadvantage that we need to send all public bug reports to another server to have them analyzed there. Additionally, there's a limit on how many requests we can send to the Dandelion API.

Both of these things are suboptimal and it would be best to not rely on a third-party service for language detection. If we get rid of that dependency, we would be able to detect the language on private tickets as well, which would be very helpful.

We initially tried to do language detection directly in Arisa (see #60 and #104), but quickly noticed that the library we used (lingua) needed way too much memory.

We only have limited resources on our server, so we either need to be careful about memory usage or get a better server.

Possible Solutions

At the moment, I see the following possibilities:

  • lingua has been updated since then; we might want to try it again, perhaps its memory consumption has improved?
  • lingua is also available as a Rust crate, which might perform better and use less memory than the Kotlin version since it doesn't depend on the JVM.
  • whatlang is another Rust crate that should be sufficient for our use case and is supposedly very lightweight (see the sketch after this list).
  • Continue as is and don't do anything.
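
For reference, a minimal sketch of what using whatlang looks like, based on its documented API (crate version and example text are illustrative):

```rust
// Cargo.toml: whatlang = "0.12" (version illustrative)
use whatlang::detect;

fn main() {
    // "The game crashes when I enter the Nether." in German
    let text = "Das Spiel stürzt ab, wenn ich den Nether betrete.";

    // detect() returns None if the text cannot be classified at all
    if let Some(info) = detect(text) {
        println!("language:   {:?}", info.lang());        // e.g. Deu
        println!("confidence: {:.2}", info.confidence()); // 0.0 ..= 1.0
        // whatlang judges reliability itself, so the caller does not
        // have to interpret raw confidence values
        println!("reliable:   {}", info.is_reliable());
    }
}
```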

It seems to be fairly trivial to use Rust crates together with Kotlin, even though it introduces some complications in the build process.
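
To make that concrete, here is a hedged sketch of the Rust side of such a bridge (all names are hypothetical): a crate compiled as a cdylib exporting a C-ABI function, which Kotlin could then load through JNA or JNI.

```rust
// Hypothetical crate compiled with crate-type = ["cdylib"]
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

/// Detects the language of a NUL-terminated UTF-8 string and returns
/// its ISO 639-3 code (e.g. "eng"), or null if detection failed or was
/// unreliable. The caller must free the result via detect_language_free.
#[no_mangle]
pub extern "C" fn detect_language(text: *const c_char) -> *mut c_char {
    if text.is_null() {
        return std::ptr::null_mut();
    }
    let text = match unsafe { CStr::from_ptr(text) }.to_str() {
        Ok(t) => t,
        Err(_) => return std::ptr::null_mut(),
    };
    match whatlang::detect(text) {
        Some(info) if info.is_reliable() => {
            CString::new(info.lang().code()).unwrap().into_raw()
        }
        _ => std::ptr::null_mut(),
    }
}

/// Frees a string previously returned by detect_language.
#[no_mangle]
pub extern "C" fn detect_language_free(s: *mut c_char) {
    if !s.is_null() {
        unsafe { drop(CString::from_raw(s)) };
    }
}
```

On the Kotlin side, a JNA interface mapping detect_language to a String-returning method would complete the bridge; the build-process complication is mainly having to compile the Rust crate for every target platform.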

@violine1101 violine1101 added the enhancement New feature or request label May 16, 2021
@Marcono1234
Contributor

Marcono1234 commented Jun 13, 2021

I have been tinkering with Lingua recently, and with all the languages originally covered by #104, the memory usage for running Lingua on its own is ~100 MB with the models for all these languages preloaded. That is, memory usage will only increase temporarily during detection. When used with Arisa it would probably be higher, since the other modules of Arisa also take up some memory. Is that acceptable for use with Arisa? Also, no worries if that is still too high and Lingua is still not an option; it was definitely not wasted time for me.
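
For context, this is roughly the setup being measured, sketched with the lingua Rust crate for brevity (the Kotlin builder offers an analogous preloading option; the language set here is illustrative, not the #104 list):

```rust
// Cargo.toml: lingua = "1" (version illustrative)
use lingua::{Language, LanguageDetector, LanguageDetectorBuilder};
use lingua::Language::{English, French, German, Spanish};

fn main() {
    // Restricting the language set is what keeps the footprint small;
    // preloading loads all models eagerly at startup instead of lazily
    // on first use, so memory usage stays flat afterwards.
    let detector: LanguageDetector = LanguageDetectorBuilder::from_languages(&[
        English, French, German, Spanish,
    ])
    .with_preloaded_language_models()
    .build();

    let detected: Option<Language> =
        detector.detect_language_of("Das Spiel stürzt ab");
    println!("{:?}", detected); // Some(German)
}
```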

These Lingua changes can be found here (the Jitpack build should hopefully succeed, but I have not tested it). However, these changes are not in a state in which they can be submitted upstream (tests won't compile, Git history is messy), and I am not sure if they would even be accepted due to some extensive refactoring. Accuracy seems to be roughly the same as upstream.

I have been testing it on a few reports and the results seem to be fairly good. However, there are a few things to consider:

  • Languages not requested when building the LanguageDetector will not be detected, e.g. both BDS-13619 and MC-228681 are Swedish, but are detected as German. Similarly, BDS-13576, which is Italian, is detected as English.
    However, that is not an issue (unless we eventually want to use translated bot messages); it causes either false negatives, or true positives (where a wrong non-English language is detected, but the report is not in English anyway).
  • Reports written in the user's native language together with an English translation might not be detected correctly, especially for languages with logograms (Chinese, Japanese, Korean). Ideally we should treat them as English, but confidence values for the non-English language might be higher. We could check if the returned language is one of them and whether English is close enough, e.g. 0.8 confidence (and the next language is far enough away from English).
  • We should not rely on the most confident language alone, but also check the value for English, e.g. MCPE-130754 seems to have {GERMAN=1.0, ENGLISH=0.9995330130848736, ...}. (A sketch of this heuristic follows this list.)
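
A sketch of that heuristic, using the lingua Rust crate's API for illustration (the Kotlin fork behaves analogously; the 0.8 threshold is the example value from above, not a tuned constant):

```rust
use lingua::{Language, LanguageDetector};
use lingua::Language::{Chinese, English, Japanese, Korean};

/// If the most confident language uses logograms but English is close
/// behind, treat the text as English (it likely contains a translation).
fn effective_language(detector: &LanguageDetector, text: &str) -> Option<Language> {
    // (Language, confidence) pairs, sorted by descending confidence
    let confidences = detector.compute_language_confidence_values(text);
    let (top, _) = *confidences.first()?;

    if matches!(top, Chinese | Japanese | Korean) {
        let english = confidences
            .iter()
            .find(|(lang, _)| *lang == English)
            .map(|(_, c)| *c)
            .unwrap_or(0.0);
        if english >= 0.8 {
            return Some(English);
        }
    }
    Some(top)
}
```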

Additionally, the notes from #60 and #104 are likely still relevant (some of them are covered by the points above).

@urielsalis
Member

100 MB should be fine. Our main problem was that it was using more than 2 GB, which meant GitHub Actions couldn't run it.

@violine1101
Member Author

Hmm. I'm not sure whether it's really worth it to maintain a separate fork of lingua just for our purposes.

Regarding accuracy, I've tested the few examples from your comment with whatlang too, and it appeared to get those right. The additional advantage of whatlang is that we don't need to interpret the results ourselves: whatlang will straight up let us know whether its results can be considered reliable or not.

I haven't tinkered with whatlang too much yet, so I can't really say how much memory it would need compared to lingua, but to me it seems like a more straightforward approach that would also require less maintenance.

Nevertheless, getting lingua from gigabytes of memory usage down to only 100 MB is really impressive, good job!

@Marcono1234
Contributor

Marcono1234 commented Jun 13, 2021

whatlang will straight up let us know whether its results can be considered reliable or not

The issue with this is that we don't have a chance to find out if a report consists of multiple languages, so we have to trust Whatlang to pick the correct language. Additionally, it might not work well for languages with logograms (Chinese, Korean and Japanese) when the text also contains a few sentences of another language, preferring that other language. For example, MC-228001 is detected as English with a confidence of 100%. However, that might not actually be a problem because such reports are likely rare and it would only result in a false negative.

On the other hand, in some cases it seems to be more accurate than Lingua. For example, MC-212097 (including the summary!) is not reliably detected by Whatlang, while Lingua (at least with my changes) is rather certain that it is Italian ({ITALIAN=1.0, ENGLISH=0.9181598778136931, ...}). So with Lingua the report might have been resolved erroneously (though this might be a rare corner case).

Here is a query for some potentially interesting reports (also contains some short texts which are currently ignored by Arisa).

@violine1101
Member Author

violine1101 commented Jun 13, 2021

Yeah, that's true, whatlang takes a fairly naïve approach when it comes to mixed languages. For example, MC-228001 gets correctly detected as Japanese once there are as many Japanese characters as there are Latin characters. Very interesting that it jumps from 100% confidence in English to 100% confidence in Japanese if you do this.

I think this is actually an advantage: in case of mixed English/Chinese (for example MC-227856), whatlang will decide that it's English simply because the English text uses more characters. This avoids false positives, which is very good.

I've looked through your filter and from what I can tell, whatlang generally appears to make the right call. I don't have lingua set up to compare it, though.

Mixed languages actually appear relatively frequently on the bug tracker; people who aren't sure of their English skills will simply add the same text in their native tongue as well.

IMO the main advantage of changing language detection solutions is that we'll be able to expand the module to work on ticket updates as well, instead of just ticket creations. Then we can also make Arisa reopen bug reports once the reporter has translated them (currently we just tell them to file a new ticket). From that perspective, handling mixed languages in tickets will become more relevant, since it's likely that users will just append a translation to the description of the bug report.

I'd propose that we try implementing both libraries in the background, while still using Dandelion in the meantime, just to collect some data on where the three approaches differ in practice. Perhaps we could combine both, e.g. if lingua doesn't give crystal-clear results, let whatlang make the final call, or vice versa.
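
A rough sketch of what such a combination could look like, with both Rust crates (the margin and the fallback order are invented values purely for illustration):

```rust
use lingua::LanguageDetector;

/// Trust lingua only when its top candidate clearly beats the runner-up;
/// otherwise let whatlang make the final call, and only accept its answer
/// if whatlang itself considers the result reliable.
fn detect_combined(detector: &LanguageDetector, text: &str) -> Option<String> {
    let confidences = detector.compute_language_confidence_values(text);
    if let [(first, c1), (_, c2), ..] = confidences.as_slice() {
        // "Crystal clear": a comfortable margin between the two best guesses
        if c1 - c2 > 0.1 {
            return Some(format!("{:?}", first));
        }
    }
    // lingua was ambiguous; defer to whatlang
    match whatlang::detect(text) {
        Some(info) if info.is_reliable() => Some(info.lang().eng_name().to_string()),
        _ => None,
    }
}
```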

Edit: For things like MC-227773, I wonder if it'd make sense to exclude some phrases (e.g. the template in that report) from the text that gets sent over to lingua/whatlang/dandelion? That way we could get rid of some false negatives. (See the sketch at the end of this comment.)

Edit 2: Another interesting example: MC-227132. whatlang is very hesitant to detect this as French; you basically need to delete every English word for it to do so.

To keep in mind: what I said above in regards to Japanese/Chinese is not true for languages that don't use such "condensed" characters, like Turkish: MC-227029 doesn't get detected correctly. I'm wondering whether something like a minimum length would still be required (a sketch of both ideas is below)?
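
A sketch of both ideas, stripping template phrases before detection and skipping texts that end up too short (the phrases and the threshold are illustrative):

```rust
/// Remove known report-template boilerplate and require a minimum amount
/// of text before running language detection at all.
fn prepare_for_detection(text: &str) -> Option<String> {
    // In practice this list would come from the tracker's actual templates
    const TEMPLATE_PHRASES: &[&str] = &[
        "What I expected to happen was...:",
        "What actually happened was...:",
        "Steps to Reproduce:",
    ];

    let mut cleaned = text.to_string();
    for phrase in TEMPLATE_PHRASES {
        cleaned = cleaned.replace(phrase, "");
    }

    // Very short texts (especially in alphabetic scripts like Turkish)
    // are often misdetected, so skip detection entirely below a minimum
    let letters = cleaned.chars().filter(|c| c.is_alphabetic()).count();
    if letters < 20 {
        return None;
    }
    Some(cleaned)
}
```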

@urielsalis
Member

Btw you should PR your changes to Lingua :D
