
Reading multi-source compressed JSONL files #17161

Open
shrshi wants to merge 11 commits into branch-24.12

Conversation

@shrshi (Contributor) commented Oct 23, 2024

Description

Addresses #17068
Addresses #12299

This PR introduces a new datasource for compressed inputs which enables batching and byte-range reading of multi-source JSONL files using the reallocate-and-retry policy. Moreover, instead of relying on a 4:1 compression-ratio heuristic, the device buffer size is computed exactly for the GZIP, ZIP, and SNAPPY compression types; for the remaining types, the files are first decompressed and then batched.
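
As an aside on the reallocate-and-retry policy mentioned above, here is a minimal host-side sketch of the idea using zlib's one-shot uncompress() API. This is illustration only: the PR itself sizes device buffers through cuIO's internal codecs, and inflate_with_retry is a hypothetical name.

// Minimal sketch of reallocate-and-retry using zlib (zlib-wrapped data, not
// raw gzip). Illustrative only; the PR works on device buffers via cuIO.
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>
#include <zlib.h>

std::vector<uint8_t> inflate_with_retry(std::vector<uint8_t> const& src,
                                        size_t estimated_size)
{
  std::vector<uint8_t> out(estimated_size ? estimated_size : 1);
  for (;;) {
    uLongf dst_len = out.size();
    int const rc  = uncompress(out.data(), &dst_len, src.data(), src.size());
    if (rc == Z_OK) {
      out.resize(dst_len);  // shrink to the exact decompressed size
      return out;
    }
    if (rc != Z_BUF_ERROR) { throw std::runtime_error("decompression failed"); }
    out.resize(out.size() * 2);  // estimate was too small: reallocate and retry
  }
}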

TODO: Reuse existing JSON tests but with an additional compression parameter to verify correctness.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code) label on Oct 23, 2024
@shrshi added the cuIO (cuIO issue), improvement (Improvement / enhancement to an existing function), and non-breaking (Non-breaking change) labels on Oct 24, 2024
@@ -41,6 +42,56 @@ namespace cudf::io::json::detail {

namespace {

class compressed_host_buffer_source final : public datasource {
Contributor:

Interesting approach! Could we wrap a datasource instead of passing a host buffer?

@shrshi (Contributor, Author) replied on Oct 24, 2024:

Yes, that was the other approach I was considering. In that case we can make an owning buffer a member of the compressed_host_buffer_source class. I think it will have a similar memory requirement, but the interface to compressed_host_buffer_source will be cleaner.
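
For illustration, a hedged sketch of that alternative, with a stand-in datasource interface so the snippet is self-contained; the member names mirror the PR's snippets, but the bodies are my assumption, not the actual implementation:

// Stand-in for cudf::io::datasource, reduced to the two methods used below.
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

struct datasource {
  virtual ~datasource() = default;
  virtual size_t size() const = 0;
  virtual size_t host_read(size_t offset, size_t size, uint8_t* dst) = 0;
};

enum class compression_type { NONE, GZIP, ZIP, SNAPPY };

class compressed_host_buffer_source final : public datasource {
 public:
  compressed_host_buffer_source(std::unique_ptr<datasource> src, compression_type comptype)
    : _source{std::move(src)}, _comptype{comptype}
  {
    // Read the compressed bytes once up front; owning them here keeps the
    // interface clean while the memory requirement stays about the same.
    _ch_buffer.resize(_source->size());
    _source->host_read(0, _ch_buffer.size(), _ch_buffer.data());
  }

  size_t size() const override { return _decompressed_size; }           // sized elsewhere
  size_t host_read(size_t offset, size_t size, uint8_t* dst) override;  // elided

 private:
  std::unique_ptr<datasource> _source;  // wrapped source
  compression_type _comptype;
  std::vector<uint8_t> _ch_buffer;      // owning buffer of compressed bytes
  size_t _decompressed_size{0};
};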

@shrshi marked this pull request as ready for review October 25, 2024 18:21
@shrshi requested a review from a team as a code owner October 25, 2024 18:21
@@ -560,5 +560,69 @@ size_t decompress(compression_type compression,
}
}

size_t estimate_uncompressed_size(compression_type compression, host_span<uint8_t const> src)
Contributor:

Is this an estimate or the exact size?
Have you measured the performance impact of the additional work we do to get the decompressed size?

@shrshi (Contributor, Author) replied:

Good point: this function returns the exact size, so I'll rename it to get_uncompressed_size.
I'll post the performance impact of this change shortly.
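
For context on why these formats allow an exact answer: gzip's 8-byte trailer ends with ISIZE, the uncompressed length mod 2^32 (RFC 1952); raw Snappy data starts with the uncompressed length as a varint; and ZIP records uncompressed sizes in its file headers. A sketch of the gzip case (illustrative, not the PR's code):

#include <cstddef>
#include <cstdint>

// Read gzip's ISIZE trailer: the uncompressed size mod 2^32, stored
// little-endian in the final four bytes (RFC 1952). Assumes src_size >= 4
// and a single-member stream.
size_t gzip_uncompressed_size(uint8_t const* src, size_t src_size)
{
  uint8_t const* p = src + src_size - 4;
  return static_cast<size_t>(p[0]) | (static_cast<size_t>(p[1]) << 8) |
         (static_cast<size_t>(p[2]) << 16) | (static_cast<size_t>(p[3]) << 24);
}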

@vuule (Contributor) left a comment:

I love the approach!
Some suggestions; the main one is the "memoization" in the new source type.

if (comptype == compression_type::GZIP || comptype == compression_type::ZIP ||
    comptype == compression_type::SNAPPY) {
  _decompressed_ch_buffer_size = estimate_uncompressed_size(_comptype, _ch_buffer);
  _decompressed_buffer.resize(0);
Contributor:

I think _decompressed_buffer is already empty at this point


size_t host_read(size_t offset, size_t size, uint8_t* dst) override
{
  auto decompressed_hbuf = decompress(_comptype, _ch_buffer);
Contributor:

I think we always want to save the result in _decompressed_buffer. We cannot guarantee any particular read pattern, and saving the decompressed buffer on the first read avoids repeated calls to decompress.
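
A sketch of the suggested memoization, reusing the member names from the snippets above and assuming decompress returns the decompressed bytes as a host vector (needs <algorithm> and <cstring>):

size_t host_read(size_t offset, size_t size, uint8_t* dst) override
{
  if (_decompressed_buffer.empty()) {
    // Decompress once, on the first read; later reads hit the cache.
    _decompressed_buffer = decompress(_comptype, _ch_buffer);
  }
  if (offset >= _decompressed_buffer.size()) { return 0; }
  auto const count = std::min(size, _decompressed_buffer.size() - offset);
  std::memcpy(dst, _decompressed_buffer.data() + offset, count);
  return count;
}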

}
// in create_batched_cudf_table, we need the compressed source size to actually be the
// uncompressed source size for correct batching
return create_batched_cudf_table(compressed_sources, reader_opts, stream, mr);
Contributor:

I'm not sure what the right name for this one is :)
So far I've got read_json_impl and read_uncompressed_json (since we wrap the compressed sources and pass those).


device_span<char> ingest_raw_input(device_span<char> buffer,
                                   host_span<std::unique_ptr<datasource>> sources,
                                   compression_type compression,
Contributor:

This parameter is now unused (as it should be).

Labels

  • cuIO: cuIO issue
  • improvement: Improvement / enhancement to an existing function
  • libcudf: Affects libcudf (C++/CUDA) code
  • non-breaking: Non-breaking change