
Improve performance and reduce memory consumption #101

Closed
Marcono1234 opened this issue May 31, 2021 · 14 comments

Marcono1234 commented May 31, 2021

As pointed out in #39 and #57, Lingua's great accuracy comes at the cost of high memory usage. This poses a problem for some projects trying to use Lingua.
In this issue I will try to highlight some main areas where performance can be improved; some of this is already covered by #98. Note that some of the proposed changes might decrease execution speed or require some larger refactoring.

Model files

  • Instead of storing the model data in JSON format, a binary format matching the in-memory format (see the "In-memory models" section below) could be used. This would have the advantage that:

    • Lookup maps such as Char2DoubleOpenHashMap could be created with the expected size, avoiding rehashing of the maps during deserialization.
    • Model file loading would be faster.
    • Model file sizes would be slightly smaller when encoding each frequency only once, followed by the number of ngrams which share this frequency, followed by the ngram values.

    Note that even though the fastutil maps are Serializable, using JDK serialization might introduce unnecessary overhead and would make this library dependent on the internal serialization format of the fastutil maps. Instead, the data could be written manually to a DataOutputStream; a minimal sketch follows below.
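A minimal sketch of such a manual writer, assuming the frequency-grouped layout described above (the function name and parameters are illustrative, not existing Lingua code):

import java.io.DataOutputStream
import java.io.File

// Sketch only: writes the model for one ngram length, grouping ngrams by frequency so that
// each frequency is stored once, followed by how many ngrams share it and the ngrams themselves.
fun writeBinaryModel(file: File, ngramsByFrequency: Map<Double, List<String>>) {
    DataOutputStream(file.outputStream().buffered()).use { out ->
        out.writeInt(ngramsByFrequency.size)          // number of distinct frequencies
        for ((frequency, ngrams) in ngramsByFrequency) {
            out.writeDouble(frequency)                // frequency written only once
            out.writeInt(ngrams.size)                 // how many ngrams share this frequency
            ngrams.forEach(out::writeChars)           // fixed-length ngrams, so no separator is needed
        }
    }
}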

Model file loading

  • Use a streaming JSON library. The currently used kotlinx-serialization-json does not seem to support streaming yet. Therefore, the complete model files are currently loaded as a String before being parsed, which is (likely) slow and requires large amounts of memory. Instead, a streaming JSON library such as https://github.com/square/moshi should be used; a rough sketch follows below.
    Note that this point becomes obsolete if a binary format (as described in the "Model files" section above) is used.
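For illustration, a streaming parse of the current model layout could look roughly like this, assuming Moshi's JsonReader on top of Okio (the function name and callback are made up for this example):

import com.squareup.moshi.JsonReader
import okio.buffer
import okio.source
import java.io.InputStream

// Sketch: stream the existing { "language": ..., "ngrams": { "num/denom": "ngram ngram ..." } }
// layout token by token instead of materializing the whole file as a String first.
fun readModel(input: InputStream, consume: (ngram: String, frequency: Double) -> Unit) {
    JsonReader.of(input.source().buffer()).use { reader ->
        reader.beginObject()
        while (reader.hasNext()) {
            if (reader.nextName() == "ngrams") {
                reader.beginObject()
                while (reader.hasNext()) {
                    val (numerator, denominator) = reader.nextName().split('/')
                    val frequency = numerator.toDouble() / denominator.toDouble()
                    reader.nextString().split(' ').forEach { ngram -> consume(ngram, frequency) }
                }
                reader.endObject()
            } else {
                reader.skipValue()
            }
        }
        reader.endObject()
    }
}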

In-memory models

  • The Object2DoubleOpenHashMap load factor can be increased from the default 0.75 to a higher value. This reduces memory usage but might slow down execution.

  • Ngrams can be encoded using primitives. Since this project uses only up to fivegrams (5 chars), most of the ngrams (and for some languages even ngrams of all lengths) can be encoded as JVM primitives using bitwise operations, e.g.:

    • Unigrams as Byte or Char
    • Bigrams as Short or Int
    • Trigrams as Int or Long
    • Quadrigrams as Int or Long
    • Fivegrams as Long or, in the worst case, as a String object. Note that at least for fivegrams the binary encoding should probably be offset based, so one char is the base code point and the remaining bits of the Long encode the offsets of the other chars relative to the base char. This allows encoding alphabets such as Georgian where each char needs more than Long.SIZE_BITS / 5 bits.

    This might even increase execution speed since it avoids hashCode() and equals(...) calls when looking up frequencies (the speed-up, if any, has to be tested though); see the sketch after this list.

  • Reduce frequency accuracy for in-memory models and model files from 64-bit Double to 32-bit. This can have a big impact on memory usage, saving more than 100MB with all models preloaded. However, instead of using a 32-bit Float to store the frequency, a custom 32-bit encoding can (and maybe should) be used since Float 'wastes' some bits for the sign (frequency will never be negative) and the exponent (frequency will never be >= 1.0), though this might decrease language detection speed due to the decoding overhead.

  • Remove Korean fivegrams (and quadrigrams?). The Korean language models are quite large; additionally, due to the large range of Korean code points, a great majority (> 1.000.000 fivegrams (?)) cannot be encoded with the primitive encoding approach outlined above. Chinese and Japanese don't seem to have quadrigram and fivegram models either; I am not sure if this is due to how the languages work, but maybe it would be acceptable to drop them for Korean as well, also because detection of Korean seems to be rather unambiguous.
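As a sketch of the primitive encoding idea referenced above (the class is made up for illustration, not Lingua's actual code), a trigram could be packed into a single Long like this:

// Sketch: pack a trigram's three 16-bit chars into one Long instead of allocating a String.
// A Kotlin value class keeps type safety without boxing. This only works as long as every
// char fits in 16 bits, i.e. there are no supplementary code points.
@JvmInline
value class PackedTrigram(val bits: Long) {

    fun toText(): String = String(
        charArrayOf(
            (bits and 0xFFFFL).toInt().toChar(),
            ((bits shr 16) and 0xFFFFL).toInt().toChar(),
            ((bits shr 32) and 0xFFFFL).toInt().toChar()
        )
    )

    companion object {
        fun of(ngram: String): PackedTrigram {
            require(ngram.length == 3)
            return PackedTrigram(
                ngram[0].code.toLong() or
                    (ngram[1].code.toLong() shl 16) or
                    (ngram[2].code.toLong() shl 32)
            )
        }
    }
}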

Runtime performance

  • Remove Alphabet. The Alphabet class can probably be removed; Character.UnicodeScript seems to be an exact substitute and might allow avoiding some indirection, e.g. looking up the UnicodeScript for a Char only once and then comparing it with the expected ones, instead of having each Alphabet look up the UnicodeScript (see the small sketch after this list).
  • Avoid creation of Ngram objects. Similar to the primitive encoding described in "In-memory models" above, the Ngram objects created as part of splitting up the text can be avoided as well (with a different encoding). A Kotlin inline class can be used to still get type safety and have some convenience functions. Primitive encoding can only support trigrams reliably without too much overhead / too complicated an encoding, but that is probably fine because, since d0f7a7c, at most trigrams will be used for longer texts.
  • Instead of accessing the lazy frequency lookup in every iteration, it might be faster to access it once at the beginning and then use it directly (though this could also be premature optimization).
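A small sketch of the UnicodeScript idea (the function is illustrative only):

// Sketch: resolve the script of a char once and compare it against the expected scripts,
// instead of having every Alphabet constant perform its own UnicodeScript lookup.
fun isLatinOrCyrillic(char: Char): Boolean {
    val script = Character.UnicodeScript.of(char.code)
    return script == Character.UnicodeScript.LATIN || script == Character.UnicodeScript.CYRILLIC
}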

Conclusion

With some or all of these suggestions applied, memory usage can be reduced and execution speed can be increased without affecting accuracy. However, some of the suggestions might be premature optimization, and they only work for 16-bit Chars but not for supplementary code points (> 16 bits) (though the current implementation, mainly Ngram creation, seems to have that limitation as well).

I have implemented some of these optimizations and some other minor improvements in https://github.com/Marcono1234/lingua/tree/experimental/performance. However, these changes are pretty experimental: The Git history is not very nice to look at; in some commits I fixed bugs I introduced before or reverted changes again. Additionally, the unit tests and the model file writing are broken. Some of the changes might also be premature optimization. Though maybe it is interesting nonetheless: it appears the memory usage with all languages being preloaded went down to about 640 MB (Edit: 920 MB, I made a mistake in the binary encoding) on AdoptOpenJDK 11.

Marcono1234 mentioned this issue Jun 1, 2021
pemistahl added this to the Lingua 1.2.0 milestone Nov 25, 2021
pemistahl changed the title from "Improve performance" to "Improve performance and reduce memory consumption" Nov 26, 2021

Lundez commented Dec 27, 2021

I also noted the large memory usage, and how large the .jar is.
I see two alternatives:

  1. Allow us to have a slim jar where we download the resources from GitHub dynamically and cache them locally in the filesystem (like ~/.lingua or something)
  2. Use a trie structure instead of storing every ngram separately per value
  • This would reduce the file size considerably by re-using the same letters for multiple ngrams (see the sketch below)
  • This would also mean that JSON isn't really a good fit; rather, some kind of custom newline-based format or special chars should be used.
  • Trading readability/simplicity for space in the .jar is something I think makes sense, and you'd trade it anyhow if you move to a binary format.
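As a rough illustration of the trie idea (this class is made up for the comment, not an actual implementation):

// Minimal sketch of a character trie for ngram frequencies: shared prefixes
// ("ab", "abc", "abd") are stored once instead of once per ngram.
class NgramTrie {
    private class Node {
        val children = HashMap<Char, Node>()
        var frequency: Double? = null   // set only for nodes that terminate an ngram
    }

    private val root = Node()

    fun put(ngram: String, frequency: Double) {
        var node = root
        for (char in ngram) node = node.children.getOrPut(char) { Node() }
        node.frequency = frequency
    }

    fun get(ngram: String): Double? {
        var node = root
        for (char in ngram) node = node.children[char] ?: return null
        return node.frequency
    }
}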

matthewlowry commented

Hey folks, just want to chime in on this and say that option (1) mentioned by @Lundez is a great idea that would greatly benefit some people, I'm sure, but for others it would be a complete deal breaker, making the library unusable.

For the production environments we deploy Lingua into, having the library try to fetch things from the Internet at runtime and having it write to some local filesystem location would make the library unusable for us. Both of these things would fail because of the locked-down deployment environment.


fvasco commented Apr 1, 2022

Hello,
I am new to Lingua, but we are interested in this product.
Unfortunately, the memory issue is a no-go for us.

So I will try to contribute to this task; I hope this is helpful.
Let me remind you that the JVM is a very complex project: a memory issue is easy to spot, whereas a performance issue can be really hard to fix, so I suggest adding some benchmarks using JMH.
Moreover, using less memory helps performance (it reduces memory lookups).

Instead of storing the model data in JSON format, a binary format could be used matching the in-memory format

I absolutely agree.
This project contains the code AND the data.
There is no real reason to deploy both in the same JAR.
A different project can translate the JSON into a binary format and save it in a JAR.
This project can simply declare an external dependency.

Instead the data could be written manually to a DataOutputStream

A proprietary format can lead to the best performance.
It should be possible to consider the Protobuf file format, but the data format is simple, so a custom serialization routine should be enough.

Object2DoubleOpenHashMap load factor can be increased from the default 0.75 to a higher value.

The gain should be irrelevant.

Ngrams can be encoded using primitives.

The JVM stores fields sequentially, so 1 Int field or 4 Byte fields require the same amount of memory.
Defining Ngram as a sealed class with five implementations, instead, forces the JVM to switch from a monomorphic to a slower polymorphic invocation. Moreover, a sealed class cannot be inlined.
Switching to a value class may be enough.
Finally, String's equals and hashCode are already well optimized.

Reduce frequency accuracy for in-memory models and model files from 64-bit Double to 32-bit

I agree with this.

I elaborated on your consideration; maybe 16-bit precision can be enough (the smallest positive frequency is ~2.3E-41).

The following POC contains only bitwise operators, so conversions should require few CPU cycles.

@JvmInline
value class Frequency(val rawBits: Short) {

    // Keep Float bits 29..14: the sign bit and the top exponent bit are always 0 for
    // positive frequencies < 1.0, and the lowest 14 mantissa bits are dropped.
    constructor(float: Float) : this(float.toBits().shr(14).toShort())

    // Mask with 0xFFFF to undo the sign extension of the Short before shifting back.
    fun toFloat() = Float.fromBits(rawBits.toInt().and(0xFFFF).shl(14))
    override fun toString() = toFloat().toString()
}

This class requires the same memory as a Float, but it can reduce the memory usage when embedded in another instance (i.e. as a non-nullable field) due to Java's object alignment.
Maybe a simple Float is enough.

Instead of accessing lazy frequency lookup in every iteration...

lazy requires a memory barrier; generally, this may impact performance on a multi-CPU architecture.
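A small sketch of the hoisting idea mentioned in the issue (the class and names are illustrative only):

// Sketch: read the lazily initialized lookup once, outside the loop, instead of going
// through the lazy delegate (and its memory barrier) on every iteration.
class FrequencyStore(loader: () -> Map<String, Double>) {
    private val lookup: Map<String, Double> by lazy(loader)

    fun sumFrequencies(ngrams: List<String>): Double {
        val frequencies = lookup              // single lazy access here
        return ngrams.sumOf { frequencies[it] ?: 0.0 }
    }
}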


Marcono1234 commented Apr 1, 2022

Object2DoubleOpenHashMap load factor can be increased from the default 0.75 to a higher value.

The gain should be irrelevant.

Are you sure about this? These maps also make up a (large?) portion of the memory usage, especially when using a custom binary encoding. It has been some time since I last experimented with this, but if I recall correctly, increasing the load factor reduced memory usage, possibly at the cost of slower lookups. It would probably still be beneficial to investigate this further.
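For reference, a sketch of what setting the load factor would look like with fastutil (the expected size here is a made-up placeholder):

import it.unimi.dsi.fastutil.objects.Object2DoubleOpenHashMap

// Sketch: the fastutil constructor takes the expected size and the load factor directly.
// A higher load factor (e.g. 0.9f instead of the default 0.75f) means fewer, fuller buckets:
// less memory, but potentially longer probe sequences on lookup.
val expectedNgramCount = 100_000 // placeholder value
val frequencies = Object2DoubleOpenHashMap<String>(expectedNgramCount, 0.9f)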


fvasco commented Apr 2, 2022

Are you sure about this?

No, I am not.
I think that inlining the Strings can reduce the required memory significantly.

I created a POC for your future investigation.
This is a dense map of sorted ngrams; lookup is performed using binary search.
This kind of map can only contain ngrams of the same length, but it is trivial to create a class that handles multiple lengths using composition.

// Dense map of sorted ngrams of a single length: all ngrams are concatenated into one
// CharArray, with their frequencies stored in a parallel FloatArray at the same index.
class RelativeFrequencies(val ngramLength: Int) {

    private var ngrams = charArrayOf()
    private var frequencies = floatArrayOf()

    val size get() = frequencies.size

    // Note: data must be supplied in ascending ngram order, since get() relies on binary search.
    fun load(count: Int, data: Iterator<Pair<String, Float>>) {
        ngrams = CharArray(count * ngramLength)
        frequencies = FloatArray(count)
        repeat(count) { i ->
            val (ngram, frequency) = data.next()
            val base = i * ngramLength
            ngram.forEachIndexed { p, c -> ngrams[base + p] = c }
            frequencies[i] = frequency
        }
    }

    // Binary search over the sorted ngrams; returns null when the ngram is unknown.
    operator fun get(ngram: String): Float? {
        var low = 0
        var high = size - 1
        while (low <= high) {
            val middle = (low + high) / 2
            val diff = compareNgram(middle, ngram)
            if (diff < 0) low = middle + 1
            else if (diff > 0) high = middle - 1
            else return frequencies[middle]
        }
        return null
    }

    private fun compareNgram(pos: Int, ngram: String): Int {
        val base = pos * ngramLength
        repeat(ngramLength) { i ->
            val diff = ngrams[base + i].compareTo(ngram[i])
            if (diff != 0) return diff
        }
        return 0
    }
}

fun main() {
    val rf = RelativeFrequencies(ngramLength = 2)
    rf.load(count = 3, listOf("aa" to 0.1F, "ab" to 0.2F, "bb" to 0.3F).iterator())
    println(rf["aa"])
    println(rf["ab"])
    println(rf["bb"])
    println(rf["zz"])
}


fvasco commented Apr 3, 2022

I made a little test to understand the memory requirements of my previous POC; I attach it below.
The compressed data size (JSON vs. binary) is almost the same, but the deserialization time is different.

All languages should require about 370 MB of RAM.

import kotlinx.serialization.json.Json
import kotlinx.serialization.json.jsonObject
import kotlinx.serialization.json.jsonPrimitive
import java.io.DataInputStream
import java.io.DataOutputStream
import java.io.File
import java.util.*
import java.util.concurrent.Callable
import java.util.concurrent.ForkJoinPool

fun main() {
    // content of https://github.com/pemistahl/lingua/tree/main/src/main/resources/language-models
    val jsonModelsDir = File("language-models")
    require(jsonModelsDir.isDirectory)
    val datModelsDir = File("language-models.dat")
    // serialize to binary
    if (datModelsDir.mkdir()) translateModel(jsonModelsDir, datModelsDir)

    // read en from binary
    val en = DataInputStream(File(datModelsDir, "en").inputStream().buffered())
        .use(RelativeFrequencies.Companion::read)
    println("${en["pc"]} == ${2868.0/1994813.0}")
}


// Translates each language's five JSON model files into one binary file per language:
// for each ngram length, an Int count, then the sorted ngram chars, then the frequencies.
fun translateModel(jsonModelsDir: File, datModelsDir: File) {
    val executor = ForkJoinPool.commonPool()
    jsonModelsDir.listFiles().forEach { dir ->
        val out = DataOutputStream(File(datModelsDir, dir.name).outputStream().buffered())
        listOf("unigrams.json", "bigrams.json", "trigrams.json", "quadrigrams.json", "fivegrams.json")
            .map { File(dir, it).readText(Charsets.UTF_8) }
            .map { content ->
                executor.submit(Callable {
                    val json = Json.parseToJsonElement(content).jsonObject
                    val map = TreeMap<String, Float>()
                    json.getValue("ngrams").jsonObject.forEach { (key, value) ->
                        val frequency = key.split('/').let { (n, d) -> (n.toDouble() / d.toDouble()).toFloat() }
                        value.jsonPrimitive.content.split(' ').forEach { ngram ->
                            map[ngram] = frequency
                        }
                    }
                    return@Callable map
                })
            }.map { it.get() }
            .forEach { map ->
                out.writeInt(map.size)
                map.keys.forEach(out::writeChars)
                map.values.forEach(out::writeFloat)
            }
        out.close()
    }
}


class RelativeFrequencies private constructor(private val data: Array<Entries>) {

    operator fun get(ngram: String): Float = data[ngram.length - 1][ngram]

    private class Entries(private val chars: CharArray, private val frequencies: FloatArray) {

        val size get() = frequencies.size

        operator fun get(ngram: String): Float {
            var low = 0
            var high = size - 1
            while (low <= high) {
                val middle = (low + high) / 2
                val diff = compareNgram(middle, ngram)
                if (diff < 0) low = middle + 1
                else if (diff > 0) high = middle - 1
                else return frequencies[middle]
            }
            return 0F
        }

        private fun compareNgram(pos: Int, ngram: String): Int {
            val base = pos * ngram.length
            repeat(ngram.length) { i ->
                val diff = chars[base + i].compareTo(ngram[i])
                if (diff != 0) return diff
            }
            return 0
        }
    }

    companion object {
        fun read(dataInputStream: DataInputStream): RelativeFrequencies {
            val data = Array(5) { n ->
                val size = dataInputStream.readInt()
                val chars = CharArray(size * (n + 1)) { dataInputStream.readChar() }
                val float = FloatArray(size) { dataInputStream.readFloat() }
                Entries(chars, float)
            }
            return RelativeFrequencies(data)
        }
    }
}


fvasco commented Apr 4, 2022

I created a pull request (#127) containing my previous considerations.
The changes are compatible with the current branch, so there is no need to delay them until the 1.2 version.


fvasco commented Jun 8, 2022

It is nice to read that; can you share your results, please?
How do your changes impact performance?

pemistahl (Owner) commented

@fvasco Below are the results of @sigpwned's benchmark on my local machine.

Lingua 1.1.1

Benchmark                                             Mode  Cnt           Score            Error   Units
DetectBenchmark.detect                               thrpt    9           0,226 ±          0,015   ops/s
DetectBenchmark.detect:·gc.alloc.rate                thrpt    9        1440,158 ±         71,897  MB/sec
DetectBenchmark.detect:·gc.alloc.rate.norm           thrpt    9  6941870628,148 ±  739778357,281    B/op
DetectBenchmark.detect:·gc.churn.G1_Eden_Space       thrpt    9        1447,234 ±        284,984  MB/sec
DetectBenchmark.detect:·gc.churn.G1_Eden_Space.norm  thrpt    9  6980486940,444 ± 1565341146,012    B/op
DetectBenchmark.detect:·gc.churn.G1_Old_Gen          thrpt    9           0,004 ±          0,011  MB/sec
DetectBenchmark.detect:·gc.churn.G1_Old_Gen.norm     thrpt    9       19778,074 ±      52401,285    B/op
DetectBenchmark.detect:·gc.count                     thrpt    9          35,000                   counts
DetectBenchmark.detect:·gc.time                      thrpt    9         123,000                       ms

Lingua 1.2.1 - high accuracy mode

Benchmark                                                 Mode  Cnt           Score           Error   Units
DetectBenchmark.detect                                   thrpt    9           0,224 ±         0,012   ops/s
DetectBenchmark.detect:·gc.alloc.rate                    thrpt    9        1101,767 ±        55,676  MB/sec
DetectBenchmark.detect:·gc.alloc.rate.norm               thrpt    9  5340373454,815 ±  18262910,503    B/op
DetectBenchmark.detect:·gc.churn.G1_Eden_Space           thrpt    9        1071,996 ±        50,790  MB/sec
DetectBenchmark.detect:·gc.churn.G1_Eden_Space.norm      thrpt    9  5196975672,889 ± 149634866,361    B/op
DetectBenchmark.detect:·gc.churn.G1_Old_Gen              thrpt    9           0,005 ±         0,017  MB/sec
DetectBenchmark.detect:·gc.churn.G1_Old_Gen.norm         thrpt    9       23758,815 ±     86756,686    B/op
DetectBenchmark.detect:·gc.churn.G1_Survivor_Space       thrpt    9           0,128 ±         0,644  MB/sec
DetectBenchmark.detect:·gc.churn.G1_Survivor_Space.norm  thrpt    9      621378,370 ±   3132558,155    B/op
DetectBenchmark.detect:·gc.count                         thrpt    9          27,000                  counts
DetectBenchmark.detect:·gc.time                          thrpt    9          99,000                      ms

Lingua 1.2.1 - low accuracy mode

Benchmark                                             Mode  Cnt           Score           Error   Units
DetectBenchmark.detect                               thrpt    9           0,346 ±         0,034   ops/s
DetectBenchmark.detect:·gc.alloc.rate                thrpt    9         598,360 ±        57,249  MB/sec
DetectBenchmark.detect:·gc.alloc.rate.norm           thrpt    9  1889947916,222 ±  20526931,530    B/op
DetectBenchmark.detect:·gc.churn.G1_Eden_Space       thrpt    9         601,178 ±        80,553  MB/sec
DetectBenchmark.detect:·gc.churn.G1_Eden_Space.norm  thrpt    9  1897573034,667 ± 130989635,302    B/op
DetectBenchmark.detect:·gc.churn.G1_Old_Gen          thrpt    9           0,001 ±         0,001  MB/sec
DetectBenchmark.detect:·gc.churn.G1_Old_Gen.norm     thrpt    9        2965,333 ±      1849,230    B/op
DetectBenchmark.detect:·gc.count                     thrpt    9         101,000                  counts
DetectBenchmark.detect:·gc.time                      thrpt    9          58,000                      ms

The new low accuracy mode is the most performant one as it uses trigrams only. But the accuracy is worse for short text.


sigpwned commented Jun 9, 2022

This is super interesting work, @pemistahl. I'm not terribly familiar with the build and release process you use for lingua, but would it be useful to have a set of benchmarks integrated into the build and/or CI/CD for the library? If so, I can certainly take a look at how that might work.

pemistahl (Owner) commented

@sigpwned Sure, feel free to open a PR. Having a benchmark integrated into the project is not a bad idea. I find JMH a bit complicated, so any help is appreciated.
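For illustration, such a benchmark could look roughly like this, assuming Lingua's public API and plain JMH annotations (the class and sample text are made up; the actual DetectBenchmark referenced above may differ):

import com.github.pemistahl.lingua.api.Language
import com.github.pemistahl.lingua.api.LanguageDetectorBuilder
import org.openjdk.jmh.annotations.Benchmark
import org.openjdk.jmh.annotations.Scope
import org.openjdk.jmh.annotations.State

// Sketch of a JMH benchmark: the detector is built once per benchmark state,
// and the benchmark method returns its result so JMH does not eliminate the call.
@State(Scope.Benchmark)
open class DetectBenchmarkSketch {

    private val detector = LanguageDetectorBuilder.fromAllLanguages()
        .withPreloadedLanguageModels()
        .build()

    @Benchmark
    fun detect() = detector.detectLanguageOf("languages are awesome")
}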

Marcono1234 (Contributor, Author) commented

Over the past months I have been tinkering with Lingua performance optimizations, and I think I have now got it to a somewhat maintainable state:
https://github.com/Marcono1234/tiny-lingua

Would be great if you could give it a try and let me know what you think!

The changes are pretty extensive and the code is not easily maintainable anymore so I assume you, @pemistahl, won't be interested in most of these changes. But if you notice any particular change you want to include in Lingua, feel free to let me know (or apply the change yourself). I might also try to backport some of the less extensive changes to Lingua (if you are interested).

pemistahl (Owner) commented

Hi @Marcono1234, thank you for all the hard work you have done now and over the past months. You have been of great help to this project. :) May I ask you what motivates you to contribute so much to my work? Is it just for fun, for your own learning purposes or because you need a language detector for your own work?

I will gladly study your changes and certainly backport some of them. But if some of the code turns out to be too complicated to maintain, I will most probably refrain from merging it.

Of all your changes, the binary format has the greatest effect, I suppose. Am I right? This is something I would probably adopt. JSON is obviously not the best fit here.

Marcono1234 (Contributor, Author) commented

May I ask you what motivates you to contribute so much to my work? Is it just for fun, for your own learning purposes or because you need a language detector for your own work?

Originally I became aware of Lingua and started tinkering around with it because its memory usage was too high for our use case (mojira/arisa-kt#656 (comment)). But I think soon afterwards it was mostly for my own learning purposes and for experimenting with Kotlin. Then I stopped for multiple months and continued with it this year after the discussion here became more active again.

Though as a warning: my fork is definitely still experimental and might contain bugs, and it most likely contains premature optimizations. Some of the code also originates from the time when Lingua had more language model files (before you removed them in e9757c7), so the code might be redundant or could be simplified (Marcono1234#1).

Of all your changes, the binary format has the greatest effect, I suppose. Am I right? This is something I would probably adopt. JSON is obviously not the best fit here.

That would mostly address loading speed and size of the JAR, but not detection speed or runtime memory consumption (which is certainly a good start nonetheless). And it can probably be implemented without having to adjust any of the language detection code. However, you can most likely not reuse much of the code from my fork because it is strongly tied to how the language models are represented in memory.

Interestingly, the JAR file of my fork is not much smaller despite using the binary encoding (69.7 MB compared to Lingua's 76.7 MB). Most likely the ZIP compression of the JAR file accounts for most of the redundancy in the JSON. 7-Zip says the language-models folder in the JAR of my fork is ~100 MB, whereas Lingua's is ~200 MB. One advantage of the current structure of Lingua's JSON models is that grouping ngrams by frequency value reduces redundancy. For the binary models that could allow decreasing the size further, but I did not implement it for my fork because it would most likely make loading the models more complicated in my specific case. But for Lingua I think grouping by frequency should still be possible.
