Rewrite Java API `Table.readJSON` to return the output from libcudf `read_json` directly #17180

ttnghia · 2024-10-25T04:58:02Z

With this PR, Table.readJSON returns the output of libcudf read_json directly. The Java side no longer needs to reorder the columns to match the input schema, nor to generate all-null columns for schema entries that do not exist in the JSON data. libcudf read_json already does both, so we no longer have to.
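To make the removed post-processing concrete, here is a minimal sketch of the two steps described above: reordering columns to follow the schema, and filling schema columns that are missing from the data with nulls. It is a plain-Java model, not the cudf code: columns are ordinary lists rather than device columns, and the class name `SchemaAlignSketch`, the method `alignToSchema`, and its parameters are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model only: "columns" are lists of boxed values, not cudf columns.
// All names here are hypothetical and do not appear in the cudf Java API.
final class SchemaAlignSketch {
    private SchemaAlignSketch() {}

    /**
     * Returns the parsed columns in schema order. Any column named in the
     * schema but absent from the parsed data is filled with rowCount nulls.
     */
    static Map<String, List<Object>> alignToSchema(
            List<String> schemaNames,
            Map<String, List<Object>> parsed,
            int rowCount) {
        // LinkedHashMap keeps insertion order, so the result follows the schema.
        Map<String, List<Object>> aligned = new LinkedHashMap<>();
        for (String name : schemaNames) {
            List<Object> column = parsed.get(name);
            if (column == null) {
                // Requested by the schema but not present in the JSON data.
                column = new ArrayList<>(Collections.<Object>nCopies(rowCount, null));
            }
            aligned.put(name, column);
        }
        return aligned;
    }

    public static void main(String[] args) {
        // The reader produced "b" before "a", and never saw "c".
        Map<String, List<Object>> parsed = new LinkedHashMap<>();
        parsed.put("b", Arrays.<Object>asList("x", "y"));
        parsed.put("a", Arrays.<Object>asList(1, 2));

        Map<String, List<Object>> out =
                alignToSchema(Arrays.asList("a", "b", "c"), parsed, 2);
        System.out.println(out); // prints {a=[1, 2], b=[x, y], c=[null, null]}
    }
}
```

Note that the fill step needs a row count from the caller, since a column that never appeared in the data carries no length of its own. Once the reader itself returns columns in schema order with missing ones already filled, neither the loop nor that extra argument is needed on the Java side.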

Depends on:

add optional column_order to schema_element #17029

Partially contributes to NVIDIA/spark-rapids#11560.

Note that because the Java API Table.readJSON no longer needs to generate all-null columns, its function signature changes, which will break the spark-rapids build.

revans2

I love this change. My biggest problem is that this is a breaking change.

I put a number of changes into the Spark Plugin to provide the emptyRowCount, and fully removing those changes is not simple. So please either make sure that this provides some backwards compatibility so we can make the change in a few steps, or have the other PR ready to go so that there is very little downtime between the two.

ttnghia · 2024-10-25T18:24:28Z

Changed to non-breaking, as the old Java methods are not removed in this PR. We can remove them later, once all the plugin code has been adapted.
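One common way to keep such a change non-breaking is to leave the old overload in place, mark it `@Deprecated`, and let it forward to the new one. The sketch below illustrates only that pattern. It assumes the old overload can simply ignore its extra argument, which this PR does not confirm; the class `JsonReaderShim`, both `readJson` signatures, and the placeholder body are hypothetical and are not the actual cudf methods.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a reader API; this is not ai.rapids.cudf.Table.
final class JsonReaderShim {
    private JsonReaderShim() {}

    /** New entry point: returns whatever the (modelled) native reader produced. */
    static List<String> readJson(List<String> schemaNames, String json) {
        // Placeholder for the native call: this model just echoes the schema
        // order back as the resulting column names.
        return new ArrayList<>(schemaNames);
    }

    /**
     * Old entry point, kept so existing callers still compile.
     * The extra argument is accepted but ignored (an assumption of this sketch).
     *
     * @deprecated use {@link #readJson(List, String)} instead.
     */
    @Deprecated
    static List<String> readJson(List<String> schemaNames, String json, int emptyRowCount) {
        return readJson(schemaNames, json);
    }
}
```

With this shape, callers that still pass the extra argument get a deprecation warning rather than a build failure, and the old overload can be deleted once no caller remains.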

ttnghia · 2024-10-25T19:56:46Z

After this (with #17029), the overhead of reordering columns is significantly reduced (above is before this, and below is with this):

Remove gatherJSONColumns

4d9a9e0

ttnghia added the labels feature request (New feature or request), 3 - Ready for Review (Ready for review by team), Java (Affects Java cuDF API), Spark (Functionality that helps Spark RAPIDS), and breaking (Breaking change) on Oct 25, 2024

ttnghia self-assigned this Oct 25, 2024

ttnghia requested a review from a team as a code owner October 25, 2024 04:58

ttnghia mentioned this pull request Oct 25, 2024

Fix build error due to changing the signature of cudf Table.readJSON NVIDIA/spark-rapids#11655

Draft

ttnghia requested a review from revans2 October 25, 2024 05:03

revans2 approved these changes Oct 25, 2024

ttnghia added 2 commits October 25, 2024 08:56

Revert "Auxiliary commit to revert individual files from 4d9a9e0"

764a7a2

This reverts commit a82fdb699a13008b878deaab18ae85a440cf05af.

Deprecate Java methods

6e978cc

Signed-off-by: Nghia Truong <[email protected]>

revans2 approved these changes Oct 25, 2024

ttnghia added the label non-breaking (Non-breaking change) and removed the label breaking (Breaking change) on Oct 25, 2024

ttnghia mentioned this pull request Oct 25, 2024

add optional column_order to schema_element #17029

Draft

3 tasks

revans2 approved these changes Oct 25, 2024
