-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fast double parser #935
Fast double parser #935
Conversation
…ParserOptions, plus tests
@@ -19,24 +19,31 @@ import kotlin.annotation.AnnotationTarget.VALUE_PARAMETER | |||
* {@include [Indent]} | |||
* | |||
*/ | |||
@ExcludeFromSources |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These documentation interfaces can be excluded from the sources as they are only ever accessed from kdoc.
8c81603
to
fdcaf15
Compare
private val logger = KotlinLogging.logger {} | ||
|
||
// (lowercase) strings that are recognized to represent infinity and NaN in doubles in all locales | ||
private val INFINITIES = arrayOf("∞", "inf", "infinity", "infty") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How did you collect this information? Are some standards somewhere available?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already had "inf" and "infinity". I added "infty" because it's used in the math world, like LaTeX, and finally, I supposed the infinity symbol itself should also be recognizable. Any others we should add?
* | ||
* If [ParserOptions.useFastDoubleParser] is enabled, it will try to parse the input with an _EXPERIMENTAL_ | ||
* fast double parser, [FastDoubleParser](https://github.com/wrandelshofer/FastDoubleParser). | ||
* If not, or if it fails, it will use [NumberFormat] to parse the input. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great approach!
.withPlusSign(setOf('+')) | ||
.withDecimalSeparator(DecimalFormatSymbols::getDecimalSeparator.fromLocalWithFallBack()) | ||
.withGroupingSeparator(DecimalFormatSymbols::getGroupingSeparator.fromLocalWithFallBack()) | ||
.withExponentSeparator(DecimalFormatSymbols::getExponentSeparator.fromLocalWithFallBack()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does it mean set(get.fromLocalWithFallBack)
construction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can read it as a sentence :) "::getDecimalSeparator.fromLocalWithFallBack()
". This means it calls localDecimalFormatSymbols.getDecimalSeparator()
first and then adds the fallback character from fallbackDecimalFormatSymbols.getDecimalSeparator()
. All these characters are then provided to the NumberFormatSymbols
builder here :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made it work on the function reference so that the function can be called from different DecimalFormatSymbols
instances
in NANS -> Double.NaN | ||
|
||
else -> { | ||
// not thread safe; must be created here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is thread safety important here? Or this is a reason to create an Instance of Number Format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deephaven holds a single instance of this FastDoubleParser class but can parse columns in parallel. Previously I put numberFormat
outside the function, but this caused bugs, so I put it inside here and it works fine :). I put the warning here because you might assume you could save an object creation by putting val numberFormat = NumberFormat.getInstance(locale)
inside the class and reusing it, but no, you cannot.
} | ||
}.also { | ||
if (it == null) { | ||
logger.debug { "Could not parse '$this' as Double with NumberFormat with locale '$locale'." } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably we could also add some logs about experimental fast double parsing like a marker for our logs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
KLogger already provides a marker. It recognizes the file/class it's called in. So this will look something like: DEBUG org.jetbrains.kotlinx.dataframe.impl.io.FastDoubleParser - Could not parse 'hello' as Double with NumberFormat with locale 'en_US'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can just search for "FastDoubleParser" :)
"100123.35", | ||
"-204,235.23", | ||
"1.234e3", | ||
"3e-04", // failed with old double parser |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will it fail, if we disable fast parsing? Or they aligned now in the terms of supported functionality, locales?
And one just is faster than another?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nope, it will fail in the old one. We cannot make the old one behave the same, as it's not supported by NumberFormat.
|
||
@Test | ||
fun `can fast parse german locale`() { | ||
val parser = FastDoubleParser(ParserOptions(locale = Locale.GERMANY, useFastDoubleParser = true)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Vielen Dank im Voraus!
|
||
@Test | ||
fun `can fast parse french locale`() { | ||
val parser = FastDoubleParser(ParserOptions(locale = Locale.FRANCE, useFastDoubleParser = true)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started to think: could we write a test, which just generates checks for all possible locales and constants are obtained from Double values -> String -> parseDouble again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"all" locales? XD I can try, but that's like hundreds of locales and may be system-dependent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, there are some locales like Arabic where NumberFormat itself fails to add "-" in front of negative numbers :/ It's fixed in later Java versions, but not 8 yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay this was actually a really good idea! It made me catch two separate bugs which I reported to FastDoubleParser.
- Minus signs in RTL languages, like Arabic, are (in Java 8) marked with the Arabic letter mark instead of, you know, a minus sign... FastDoubleParser cannot handle this, because the character is a format symbol, not a normal one.
NumberFormat
does handle it, so to align our behavior with theirs, I now manually check for format symbol minus signs and replace them with '-' before parsing. - In Finnish and other langauges, they use
exponentSeparator='e'
andnan="epäluku"
, which breaks as both contain the letter 'e'. I've reported this and luckily our fallback method catches it.
fdcaf15
to
57d492f
Compare
…r "-" in RTL languages with NumberFormat
# Conflicts: # core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/util/deprecationMessages.kt
Thanks for the review :) I'll merge it for now so I can continue with the csv parser |
Fixes #607
Created under csv umbrella issue: #827
Introduces FastDoubleParser as optional alternative double parser. It can be enabled as experimental option (for now) in
ParserOptions
.The configurable version of this double parser was created for us :) and the way I implemented it, it can solve some of the locale-fallback-issues we were facing before.
Check the tests I created in this PR for more some examples.
It's also faster than
NumberFormat
!FastDoubleParser is also used by Deephaven CSV, which helps with our new csv implementation.