Reading multi-source compressed JSONL files #17161
base: branch-24.12
Conversation
@@ -41,6 +42,56 @@ namespace cudf::io::json::detail {

namespace {

class compressed_host_buffer_source final : public datasource {
Interesting approach! Could we wrap a datasource instead of passing a host buffer?
Yes, that was the other approach I was thinking of. We can have an owning buffer as a member of the compressed_host.. class in that case. I think it will have a similar memory requirement, but the interface to the compressed_host_buffer_source class will be cleaner.
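For concreteness, here is a rough, self-contained sketch of the wrapping variant being discussed. The simplified datasource base, the helper declarations, and all member names are illustrative stand-ins, not the PR's actual code:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <memory>
#include <vector>

// Simplified stand-in for cudf::io::datasource, reduced to what this sketch needs.
struct datasource {
  virtual ~datasource()                                              = default;
  virtual size_t host_read(size_t offset, size_t size, uint8_t* dst) = 0;
  virtual size_t size() const                                        = 0;
};

enum class compression_type { NONE, GZIP, ZIP, SNAPPY };

// Hypothetical stand-ins for the decompression utilities this PR relies on.
size_t get_uncompressed_size(compression_type comptype, std::vector<uint8_t> const& src);
std::vector<uint8_t> decompress(compression_type comptype, std::vector<uint8_t> const& src);

class compressed_host_buffer_source final : public datasource {
 public:
  // Wrap another datasource: copy its bytes into an owning member buffer.
  compressed_host_buffer_source(std::unique_ptr<datasource> src, compression_type comptype)
    : _comptype{comptype}, _ch_buffer(src->size())
  {
    src->host_read(0, _ch_buffer.size(), _ch_buffer.data());
  }

  // Naive read: decompresses on every call (see the memoization thread below).
  size_t host_read(size_t offset, size_t size, uint8_t* dst) override
  {
    auto const decompressed = decompress(_comptype, _ch_buffer);
    if (offset >= decompressed.size()) { return 0; }
    auto const count = std::min(size, decompressed.size() - offset);
    std::memcpy(dst, decompressed.data() + offset, count);
    return count;
  }

  // Report the *uncompressed* size so batching logic sees the true data size.
  size_t size() const override { return get_uncompressed_size(_comptype, _ch_buffer); }

 private:
  compression_type _comptype;
  std::vector<uint8_t> _ch_buffer;  // owning copy of the compressed input
};
```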
@@ -560,5 +560,69 @@ size_t decompress(compression_type compression,
  }
}

size_t estimate_uncompressed_size(compression_type compression, host_span<uint8_t const> src)
Is this an estimate or the exact size?
Have you measured the performance impact from the additional work we do to get the decompressed size?
Good point, this function returns the exact size. I'll change the name to get_uncompressed_size.
I'll post the performance impact of this change shortly.
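For context on why the size can be exact here: per RFC 1952, a gzip stream ends with a four-byte ISIZE footer holding the uncompressed size mod 2^32, little-endian. A minimal sketch of reading it, with gzip_uncompressed_size as a hypothetical helper rather than this PR's function:

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Read the ISIZE footer: the last four bytes of a gzip stream store the
// uncompressed size mod 2^32, little-endian (RFC 1952).
size_t gzip_uncompressed_size(std::vector<uint8_t> const& src)
{
  if (src.size() < 4) { throw std::invalid_argument("truncated gzip stream"); }
  auto const* p = src.data() + src.size() - 4;
  return static_cast<size_t>(p[0]) | (static_cast<size_t>(p[1]) << 8) |
         (static_cast<size_t>(p[2]) << 16) | (static_cast<size_t>(p[3]) << 24);
}
```

SNAPPY similarly stores the uncompressed length as a varint at the start of the stream, and ZIP records it in its file headers, which is why the exact size is cheap for these three formats. One caveat: ISIZE is modulo 2^32 and describes only the last member of a multi-member gzip file.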
I love the approach!
Some suggestions; the main one is the "memoization" in the new source type.
if (comptype == compression_type::GZIP || comptype == compression_type::ZIP ||
    comptype == compression_type::SNAPPY) {
  _decompressed_ch_buffer_size = estimate_uncompressed_size(_comptype, _ch_buffer);
  _decompressed_buffer.resize(0);
I think _decompressed_buffer is already empty at this point.
size_t host_read(size_t offset, size_t size, uint8_t* dst) override
{
  auto decompressed_hbuf = decompress(_comptype, _ch_buffer);
I think we always want to save the result in _decompressed_buffer. We cannot guarantee any read pattern, and saving the decompressed buffer on the first read avoids repeated calls to decompress.
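A concrete reading of this suggestion, assuming the _decompressed_buffer member shown in the diff above; this is a suggestion sketch, not the PR's code:

```cpp
size_t host_read(size_t offset, size_t size, uint8_t* dst) override
{
  // Decompress once, on the first read; every later read (whatever the
  // pattern) is served from the cached buffer instead of re-decompressing.
  if (_decompressed_buffer.empty()) { _decompressed_buffer = decompress(_comptype, _ch_buffer); }
  if (offset >= _decompressed_buffer.size()) { return 0; }
  auto const count = std::min(size, _decompressed_buffer.size() - offset);
  std::memcpy(dst, _decompressed_buffer.data() + offset, count);
  return count;
}
```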
}
// in create_batched_cudf_table, we need the compressed source size to actually be the
// uncompressed source size for correct batching
return create_batched_cudf_table(compressed_sources, reader_opts, stream, mr);
I'm not sure what the right name for this one is :) So far I've got read_json_impl and read_uncompressed_json (since we wrap compressed sources and pass that).
device_span<char> ingest_raw_input(device_span<char> buffer,
                                   host_span<std::unique_ptr<datasource>> sources,
                                   compression_type compression,
This parameter is now unused (as it should be)
Description
Addresses #17068
Addresses #12299
This PR introduces a new datasource for compressed inputs, which enables batching and byte-range reading of multi-source JSONL files using the reallocate-and-retry policy. Moreover, instead of using a 4:1 compression-ratio heuristic, the device buffer size is computed exactly for GZIP, ZIP, and SNAPPY compression types. For the remaining compression types, the files are first decompressed and then batched.
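A minimal sketch of the reallocate-and-retry idea for formats whose uncompressed size cannot be read from the stream: start from the heuristic guess and grow the output buffer until decompression fits. try_decompress_into is a hypothetical stand-in for the actual decompression call, and the growth policy here is illustrative:

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical: returns the decompressed byte count, or nullopt if dst is too small.
std::optional<size_t> try_decompress_into(std::vector<uint8_t> const& src,
                                          std::vector<uint8_t>& dst);

std::vector<uint8_t> decompress_with_retry(std::vector<uint8_t> const& src)
{
  // Start from the 4:1 heuristic (with a small floor so empty inputs still grow).
  std::vector<uint8_t> dst(std::max<size_t>(src.size() * 4, 4096));
  while (true) {
    if (auto const n = try_decompress_into(src, dst)) {
      dst.resize(*n);  // shrink to the actual decompressed size
      return dst;
    }
    dst.resize(dst.size() * 2);  // too small: reallocate and retry
  }
}
```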
TODO: Reuse existing JSON tests but with an additional compression parameter to verify correctness.
Checklist