Reading multi-source compressed JSONL files #17161
base: branch-24.12
Conversation
@@ -41,6 +42,56 @@ namespace cudf::io::json::detail {

namespace {

class compressed_host_buffer_source final : public datasource {
Interesting approach! Could we wrap a datasource instead of passing a host buffer?
Yes, that was the other approach I was thinking of. We can have an owning buffer as a member of the compressed_host.. class in that case. I think it will have a similar memory requirement, but the interface to the compressed_host_buffer_source class will be cleaner.
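For concreteness, here is a rough, self-contained sketch of the wrapping variant being discussed. The simplified datasource base, the helper declarations, and all member names are illustrative stand-ins, not the PR's actual code:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <memory>
#include <vector>

// Simplified stand-in for cudf::io::datasource, reduced to what this sketch needs.
struct datasource {
  virtual ~datasource()                                              = default;
  virtual size_t host_read(size_t offset, size_t size, uint8_t* dst) = 0;
  virtual size_t size() const                                        = 0;
};

enum class compression_type { NONE, GZIP, ZIP, SNAPPY };

// Hypothetical stand-ins for the decompression utilities this PR relies on.
size_t get_uncompressed_size(compression_type comptype, std::vector<uint8_t> const& src);
std::vector<uint8_t> decompress(compression_type comptype, std::vector<uint8_t> const& src);

class compressed_host_buffer_source final : public datasource {
 public:
  // Wrap another datasource: copy its bytes into an owning member buffer.
  compressed_host_buffer_source(std::unique_ptr<datasource> src, compression_type comptype)
    : _comptype{comptype}, _ch_buffer(src->size())
  {
    src->host_read(0, _ch_buffer.size(), _ch_buffer.data());
  }

  // Naive read: decompresses on every call (see the memoization thread below).
  size_t host_read(size_t offset, size_t size, uint8_t* dst) override
  {
    auto const decompressed = decompress(_comptype, _ch_buffer);
    if (offset >= decompressed.size()) { return 0; }
    auto const count = std::min(size, decompressed.size() - offset);
    std::memcpy(dst, decompressed.data() + offset, count);
    return count;
  }

  // Report the *uncompressed* size so batching logic sees the true data size.
  size_t size() const override { return get_uncompressed_size(_comptype, _ch_buffer); }

 private:
  compression_type _comptype;
  std::vector<uint8_t> _ch_buffer;  // owning copy of the compressed input
};
```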
@@ -560,5 +560,69 @@ size_t decompress(compression_type compression,
  }
}

size_t estimate_uncompressed_size(compression_type compression, host_span<uint8_t const> src)
Is this an estimate or the exact size?
Have you measured the performance impact from the additional work we do to get the decompressed size?
Good point, this function returns the exact size. I'll change the name to get_uncompressed_size.
I'll post the performance impact of this change shortly.
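For context on why the size can be exact here: per RFC 1952, a gzip stream ends with a four-byte ISIZE footer holding the uncompressed size mod 2^32, little-endian. A minimal sketch of reading it, with gzip_uncompressed_size as a hypothetical helper rather than this PR's function:

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Read the ISIZE footer: the last four bytes of a gzip stream store the
// uncompressed size mod 2^32, little-endian (RFC 1952).
size_t gzip_uncompressed_size(std::vector<uint8_t> const& src)
{
  if (src.size() < 4) { throw std::invalid_argument("truncated gzip stream"); }
  auto const* p = src.data() + src.size() - 4;
  return static_cast<size_t>(p[0]) | (static_cast<size_t>(p[1]) << 8) |
         (static_cast<size_t>(p[2]) << 16) | (static_cast<size_t>(p[3]) << 24);
}
```

SNAPPY similarly stores the uncompressed length as a varint at the start of the stream, and ZIP records it in its file headers, which is why the exact size is cheap for these three formats. One caveat: ISIZE is modulo 2^32 and describes only the last member of a multi-member gzip file.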
I love the approach!
Some suggestions; the main one is the "memoization" in the new source type.
if (comptype == compression_type::GZIP || comptype == compression_type::ZIP ||
    comptype == compression_type::SNAPPY) {
  _decompressed_ch_buffer_size = estimate_uncompressed_size(_comptype, _ch_buffer);
  _decompressed_buffer.resize(0);
I think _decompressed_buffer is already empty at this point.
size_t host_read(size_t offset, size_t size, uint8_t* dst) override
{
  auto decompressed_hbuf = decompress(_comptype, _ch_buffer);
I think we always want to save the result in _decompressed_buffer. We cannot guarantee any read pattern, and saving the decompressed buffer on the first read avoids repeated calls to decompress.
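A concrete reading of this suggestion, assuming the _decompressed_buffer member shown in the diff above; this is a suggestion sketch, not the PR's code:

```cpp
size_t host_read(size_t offset, size_t size, uint8_t* dst) override
{
  // Decompress once, on the first read; every later read (whatever the
  // pattern) is served from the cached buffer instead of re-decompressing.
  if (_decompressed_buffer.empty()) { _decompressed_buffer = decompress(_comptype, _ch_buffer); }
  if (offset >= _decompressed_buffer.size()) { return 0; }
  auto const count = std::min(size, _decompressed_buffer.size() - offset);
  std::memcpy(dst, _decompressed_buffer.data() + offset, count);
  return count;
}
```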
}
// in create_batched_cudf_table, we need the compressed source size to actually be the
// uncompressed source size for correct batching
return create_batched_cudf_table(compressed_sources, reader_opts, stream, mr);
I'm not sure what the right name for this one is :) So far I've got read_json_impl and read_uncompressed_json (since we wrap compressed sources and pass that).
device_span<char> ingest_raw_input(device_span<char> buffer,
                                   host_span<std::unique_ptr<datasource>> sources,
                                   compression_type compression,
This parameter is now unused (as it should be)
Description
Addresses #17068
Addresses #12299
This PR introduces a new datasource for compressed inputs, which enables batching and byte-range reading of multi-source JSONL files using the reallocate-and-retry policy. Moreover, instead of using a 4:1 compression-ratio heuristic, the device buffer size is computed exactly for GZIP, ZIP, and SNAPPY compression types. For the remaining compression types, the files are first decompressed and then batched.
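A minimal sketch of the reallocate-and-retry idea for formats whose uncompressed size cannot be read from the stream: start from the heuristic guess and grow the output buffer until decompression fits. try_decompress_into is a hypothetical stand-in for the actual decompression call, and the growth policy here is illustrative:

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical: returns the decompressed byte count, or nullopt if dst is too small.
std::optional<size_t> try_decompress_into(std::vector<uint8_t> const& src,
                                          std::vector<uint8_t>& dst);

std::vector<uint8_t> decompress_with_retry(std::vector<uint8_t> const& src)
{
  // Start from the 4:1 heuristic (with a small floor so empty inputs still grow).
  std::vector<uint8_t> dst(std::max<size_t>(src.size() * 4, 4096));
  while (true) {
    if (auto const n = try_decompress_into(src, dst)) {
      dst.resize(*n);  // shrink to the actual decompressed size
      return dst;
    }
    dst.resize(dst.size() * 2);  // too small: reallocate and retry
  }
}
```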
TODO: Reuse existing JSON tests but with an additional compression parameter to verify correctness.
Checklist