Fixes nulls in framecols and improves column creation situation #925

Merged on Oct 31, 2024 (16 commits)

Collaborator

@Jolanrensen Jolanrensen commented Oct 17, 2024

Fixes #593

This PR has turned into a collection of a couple smaller fixes:

  • A debug=true check has been added to FrameColumnImpl to catch cases where nulls leak into the values, as happened before.
  • DataColumn.Companion cleaning
    • All functions gained KDocs explaining these functions do no conversions or checks and just instantiate the columns
    • create() was renamed to createUnsafe() for the same reason. Together with .toColumnKind(), giving the type AnyFrame? will now instantiate a value column instead of a frame column.
    • createFrameColumn() with startIndices has been deprecated. This is not the "normal" way to instantiate a FrameColumn and thus belongs somewhere else, in this case, in chunked.kt
  • createColumn() and guessColumnType() have been merged into createColumnGuessingType()
    • All cases of createColumn were already covered by guessColumnType() aside from Iterable<DataColumn> -> ColumnGroup
    • An allColsMakesColGroup argument was added to createColumnGuessingType() to optionally keep this behavior. It is used by columnOf, for instance, but doesn't make sense everywhere. guessValueType() was modified to handle this too.
    • We now have just one place where type guessing/unifying happens
    • I added tests to check the behavior is unchanged.
  • dataFrameOf constructors had some fixes to align behavior with other column creation functions, added tests
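The idea of a single type-guessing entry point with an opt-in flag can be sketched as follows. This is a toy model, not the library's code: ToyColumn, ToyFrame, GuessedKind and guessKind are hypothetical names, and the exact rules (including how nulls are treated) are assumptions made for illustration. Only the flag name allColsMakesColGroup comes from the description above.

```kotlin
// Toy stand-ins, NOT the DataFrame library's types.
data class ToyColumn(val name: String, val values: List<Any?>)
data class ToyFrame(val columns: List<ToyColumn>)

enum class GuessedKind { VALUE, GROUP, FRAME }

/**
 * One place that decides which kind of column a list of values becomes.
 * The rules below are assumptions for this sketch.
 */
fun guessKind(values: List<Any?>, allColsMakesColGroup: Boolean = false): GuessedKind =
    when {
        // Nothing to inspect: fall back to a plain value column.
        values.all { it == null } -> GuessedKind.VALUE
        // Only when the caller opts in does "all columns" mean "column group".
        allColsMakesColGroup && values.all { it is ToyColumn } -> GuessedKind.GROUP
        // In this sketch, every element must be a frame to get a frame column.
        values.all { it is ToyFrame } -> GuessedKind.FRAME
        else -> GuessedKind.VALUE
    }
```

With the flag off, a list of ToyColumn instances is treated like any other list of objects and yields VALUE. With the flag on, the same list yields GROUP, which mirrors how columnOf is described as opting in.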

TODO: check whether other uses of DataColumn.create(Unsafe) need to be swapped to createWithTypeInference/createColumnGuessingType

@Jolanrensen Jolanrensen force-pushed the nulls-in-framecols branch 2 times, most recently from 115033a to 48fc9ff Compare October 18, 2024 15:16
…from before the PR. Added allColsMakesColGroup argument for createColumnGuessingType() and guessValueType() so the old behavior of createColumn() is now controlled in the same place as all other conversions
@Jolanrensen Jolanrensen marked this pull request as ready for review October 21, 2024 12:49
# Conflicts:
#	core/src/test/kotlin/org/jetbrains/kotlinx/dataframe/api/constructors.kt
#	plugins/kotlin-dataframe/src/org/jetbrains/kotlinx/dataframe/plugin/impl/DataFrameAdapter.kt
Contributor

Generated sources will be updated after merging this PR.
Please inspect the changes in here.

…e check from guessing type. May be unexpected behavior. Fixed tests
@Jolanrensen Jolanrensen requested review from zaleslaw and koperagen and removed request for zaleslaw October 28, 2024 12:26
sequenceOf(columnOf("a"), columnOf(1)),
allColsMakesRow = true,
) shouldBe typeOf<DataRow<*>>()
guessValueType(
Collaborator

Split this into two guessValueType functions?

Collaborator Author

Then it would need to be split into four, because we also have listifyValues: Boolean. I think in this case it's good to have all the values -> column type logic of the entire library handled by one function, guessValueType(). This makes it easier to debug and reason about.

// Checks for nulls in the `values` list.
// This only runs with `kotlin.dataframe.debug=true` in gradle.properties.
if (BuildConfig.DEBUG) {
require(!values.anyNull()) { "FrameColumn cannot contain null values." }
Collaborator

@zaleslaw zaleslaw Oct 30, 2024

What would happen if we applied this check outside DEBUG mode as well?

Collaborator Author

Then each time we create a frame column, we'll have to check all values for nulls, which at worst is O(n). For efficiency it may be best to at least have a zero-checks path available for library calls.
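The trade-off can be sketched like this: the O(n) null scan is only paid when a debug flag is on, and the normal path performs no checks. ToyDebug and ToyFrameColumn are hypothetical stand-ins; the PR itself uses BuildConfig.DEBUG, driven by kotlin.dataframe.debug=true in gradle.properties.

```kotlin
// Hypothetical stand-in for the build-time debug flag used in the PR.
object ToyDebug {
    var enabled: Boolean = false
}

// Toy column; values are typed as nullable here to model a null "leaking in".
class ToyFrameColumn(val name: String, val values: List<Any?>) {
    init {
        if (ToyDebug.enabled) {
            // O(n) scan over all values, skipped entirely when the flag is off.
            require(values.none { it == null }) { "FrameColumn cannot contain null values." }
        }
    }
}
```

With ToyDebug.enabled set to false, constructing a column with a null value succeeds silently. With it set to true, the same call throws an IllegalArgumentException at construction time instead of failing later.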

* Returns the value type of the given [values] sequence.
* Returns the guessed value type of the given [values] sequence.
*
* This function analyzes all [values] once and returns the expected column type.
Collaborator

Do the tests cover the case where a single column, read from a file or built from a List, contains values of different types?

Collaborator Author

*
* This is generally safe, as [T] can be inferred, and more efficient than [createWithTypeInference].
*
* Be careful when casting occurs; Values in [values] are NOT checked to adhere to the given/inferred type [T],
Collaborator

Is it unsafe because of the type? To me, createUnsafe reads like a reference to something like sun.misc.Unsafe. Could we rename it to something like createWithoutTypeChecking or createAndFailFast?

Collaborator Author

createAndFailFast

haha, another title could be createRaw, but createWithoutTypeChecking might also be clear. It's unsafe in the sense that it solely uses the given type T to decide which type of column to instantiate. It doesn't check any of the values, which makes it fast, but, well, unsafe.

Collaborator Author

@Jolanrensen Jolanrensen Oct 30, 2024

Hmm, on the other hand, then we can have DataColumn.createWithoutTypeInference(..., infer = Infer.Type). That's gonna be confusing haha. We need a title that says it will choose createValueColumn, createColumnGroup, or createFrameColumn based on the type T

Collaborator Author

@Jolanrensen Jolanrensen Oct 30, 2024

Alright, I think I'm gonna go with:

  • createByInference
  • createByType
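What "by type" means in this thread can be sketched as a dispatch on the declared type T alone, without looking at any value. The code below is a toy model: SketchFrame, SketchRow, ColumnKind, kindFor and kindByType are hypothetical names, and the mapping of a row type to a column group is an assumption for illustration. Only typeOf and KType come from the Kotlin standard library.

```kotlin
import kotlin.reflect.KType
import kotlin.reflect.typeOf

// Toy stand-ins for a frame type and a row type, NOT the library's classes.
class SketchFrame
class SketchRow

enum class ColumnKind { VALUE, GROUP, FRAME }

// Decide the column kind from the declared type only.
fun kindFor(type: KType): ColumnKind = when {
    // A nullable type, such as SketchFrame?, becomes a value column.
    type.isMarkedNullable -> ColumnKind.VALUE
    type.classifier == SketchFrame::class -> ColumnKind.FRAME
    type.classifier == SketchRow::class -> ColumnKind.GROUP
    else -> ColumnKind.VALUE
}

// `values` is deliberately never inspected: fast, but nothing is verified.
inline fun <reified T> kindByType(values: List<T>): ColumnKind = kindFor(typeOf<T>())
```

Calling kindByType<Any>(listOf(SketchFrame())) yields VALUE even though every element is a frame, which is the sense in which the by-type path is fast but unchecked. An inference-based variant would instead have to walk the values.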

Collaborator

@zaleslaw zaleslaw left a comment

I've suggested a rename; please also answer some of my questions before merging.

@Jolanrensen
Collaborator Author

@zaleslaw Thanks! I'll have a look. I'll also check the binary compatibility now that the plugin is merged. I suspect I broke some stuff.

@Jolanrensen
Collaborator Author

Alright, thanks for the reviews :) they were most helpful! I think I've covered everything now, so I'll merge so we won't get further out of sync with other branches. Let me know if I missed something major, then I'll fix it :)

@Jolanrensen Jolanrensen merged commit 2c68cf4 into master Oct 31, 2024
6 checks passed
Successfully merging this pull request may close these issues.

NPE in FrameColumnImpl.schema property
3 participants