Rewrite Java API Table.readJSON to return the output from libcudf read_json directly #17180

Open
wants to merge 3 commits into
base: branch-24.12

Contributor

@ttnghia ttnghia commented Oct 25, 2024

With this PR, Table.readJSON returns the output of libcudf read_json directly. It no longer reorders the columns to match the input schema, and it no longer generates all-null columns for schema fields that do not appear in the JSON data. libcudf read_json already does both, so the Java side no longer has to.
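
For readers unfamiliar with the post-processing being removed, here is a minimal sketch of the idea in plain Java. It is not the cuDF implementation: lists of strings stand in for GPU columns, and `SchemaAlignSketch`, `alignToSchema`, and all parameter names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Illustrative only: plain Java collections stand in for GPU columns.
// None of these names are part of the cuDF Java API.
final class SchemaAlignSketch {
  /**
   * Returns the columns in schema order. A schema name that is absent
   * from `parsed` yields a column holding `rowCount` nulls.
   */
  static List<List<String>> alignToSchema(List<String> schemaNames,
                                          Map<String, List<String>> parsed,
                                          int rowCount) {
    List<List<String>> out = new ArrayList<>();
    for (String name : schemaNames) {
      List<String> col = parsed.get(name);
      if (col == null) {
        // Field requested by the schema but missing from the data.
        col = new ArrayList<>(Collections.<String>nCopies(rowCount, null));
      }
      out.add(col);
    }
    return out;
  }
}
```

Once the reader itself returns columns in schema order with missing fields already filled, a step like this on the caller's side becomes redundant.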

Depends on:

Partially contributes to NVIDIA/spark-rapids#11560.

Note that because the Java API Table.readJSON no longer needs to generate all-null columns, its function signature changes, which will break the spark-rapids build.

@ttnghia ttnghia added the feature request, 3 - Ready for Review, Java, Spark, and breaking labels Oct 25, 2024
@ttnghia ttnghia self-assigned this Oct 25, 2024
@ttnghia ttnghia requested a review from a team as a code owner October 25, 2024 04:58
@ttnghia ttnghia requested a review from revans2 October 25, 2024 05:03
Contributor

@revans2 revans2 left a comment

I love this change. My biggest problem is that this is a breaking change.

I put a number of changes into the Spark plugin to provide the emptyRowCount, and fully removing those changes is not simple. So please either make sure that this PR provides some backward compatibility, so we can make the change in a few steps, or have the other PR ready to go so that there is very little downtime between the two.

This reverts commit a82fdb699a13008b878deaab18ae85a440cf05af.
Signed-off-by: Nghia Truong <[email protected]>
@ttnghia ttnghia added the non-breaking label and removed the breaking label Oct 25, 2024
@ttnghia
Contributor Author

ttnghia commented Oct 25, 2024

Changed to non-breaking, as the old Java methods are not removed in this PR. We can remove them later, once all the plugin code has been adapted.
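
One common way to stage a change like this is to keep the old overload as a deprecated method next to the new one. The sketch below only illustrates that pattern and makes no claim about how this PR's code is written: `JsonReaderSketch` and both `read` methods are hypothetical, arrays of strings stand in for tables, and `emptyRowCount` is borrowed from the discussion above purely as an example of an argument the new path no longer needs.

```java
// Hypothetical names throughout; this is not the cuDF Java API.
final class JsonReaderSketch {
  /** New entry point: hands back the reader's output unchanged. */
  static String[] read(String[] readerOutput) {
    return readerOutput;
  }

  /**
   * Old entry point, kept so existing callers still compile.
   * The extra argument is accepted but ignored, because the reader
   * output is assumed to already contain every schema column.
   */
  @Deprecated
  static String[] read(String[] readerOutput, int emptyRowCount) {
    return read(readerOutput);
  }
}
```

Callers can then move to the one-argument form at their own pace, and the deprecated overload can be deleted in a later release.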

@ttnghia
Contributor Author

ttnghia commented Oct 25, 2024

After this (with #17029), the overhead of reordering columns is significantly reduced (above is before this, and below is with this):

image
