Add an example to demonstrate multithreaded `read_parquet` pipelines #16828

mhaseeb123 · 2024-09-18T02:53:53Z

Description

Closes #16717. This PR adds a new example to read multiple parquet files using multiple threads.
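A minimal sketch of the general pattern, using only the standard library and not taken from the example's source: input files are assigned round-robin to worker threads, and each thread reads its share. `read_one` and `read_all` are hypothetical names; `read_one` stands in for a real `read_parquet` call and just returns a fake row count.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a per-file parquet read. A real pipeline would
// call the reader here; this returns the path length as a fake "row count".
std::size_t read_one(std::string const& path) { return path.size(); }

// Reads all files with `thread_count` threads. File i is handled by thread
// (i % thread_count). Results are returned in input order.
std::vector<std::size_t> read_all(std::vector<std::string> const& files,
                                  std::size_t thread_count)
{
  assert(thread_count > 0);
  std::vector<std::size_t> results(files.size());
  std::vector<std::thread> workers;
  workers.reserve(thread_count);
  for (std::size_t tid = 0; tid < thread_count; ++tid) {
    workers.emplace_back([&, tid] {
      // Each thread writes only to its own slots, so no locking is needed.
      for (std::size_t i = tid; i < files.size(); i += thread_count) {
        results[i] = read_one(files[i]);
      }
    });
  }
  for (auto& w : workers) { w.join(); }
  return results;
}
```

Calling `read_all(files, 4)` on a list of paths returns one result per file, in the same order as the input list.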

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

cpp/examples/parquet_io/parquet_io.cpp

cpp/examples/parquet_io/parquet_io_multithreaded.cpp

GregoryKimball · 2024-09-26T22:43:15Z

Thanks @mhaseeb123 for putting this together. After a little hacking I was able to generate a nice pipelined profile. I removed the concats and the writes and only kept the first read. I also had to add an initialization for thread_count to prevent garbage thread values.

Here is the nice pipelining (profile screenshot):

From this command:
/nfs/nsight-systems-2022.5.1/bin/nsys profile -t nvtx,cuda,osrt -f true --cuda-memory-usage=true --cuda-um-cpu-page-faults=true --cuda-um-gpu-page-faults=true --gpu-metrics-device=4 --output=/nfs/20240924_iothreads/prof2_4 --env-var KVIKIO_COMPAT_MODE=on,KVIKIO_NTHREADS=8 ./parquet_io_multithreaded /raid/tpch/gqe-dbgen-1/lineitem/lineitem_1.snappy.parquet,/raid/tpch/gqe-dbgen-1/part/part_1.snappy.parquet,/raid/tpch/gqe-dbgen-1/orders/orders_1.snappy.parquet,/raid/tpch/gqe-dbgen-1/lineitem/lineitem_1.snappy.parquet,/raid/tpch/gqe-dbgen-1/part/part_1.snappy.parquet,/raid/tpch/gqe-dbgen-1/orders/orders_1.snappy.parquet,/raid/tpch/gqe-dbgen-1/lineitem/lineitem_1.snappy.parquet,/raid/tpch/gqe-dbgen-1/part/part_1.snappy.parquet,/raid/tpch/gqe-dbgen-1/orders/orders_1.snappy.parquet,/raid/tpch/gqe-dbgen-1/lineitem/lineitem_1.snappy.parquet,/raid/tpch/gqe-dbgen-1/part/part_1.snappy.parquet,/raid/tpch/gqe-dbgen-1/orders/orders_1.snappy.parquet,/raid/tpch/gqe-dbgen-1/lineitem/lineitem_1.snappy.parquet,/raid/tpch/gqe-dbgen-1/part/part_1.snappy.parquet,/raid/tpch/gqe-dbgen-1/orders/orders_1.snappy.parquet,/raid/tpch/gqe-dbgen-1/lineitem/lineitem_1.snappy.parquet,/raid/tpch/gqe-dbgen-1/part/part_1.snappy.parquet,/raid/tpch/gqe-dbgen-1/orders/orders_1.snappy.parquet /raid/tmp DEFAULT SNAPPY 4 YES

Based on this command, you can see why I would like a "repetitions" CLI argument 😆
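The command above lists the same three files six times. A rough sketch of what a "repetitions" argument could do is shown here; `repeat_inputs` is a hypothetical helper, not the name used in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: repeat the list of input paths `repetitions` times,
// so {a, b} with repetitions = 3 becomes {a, b, a, b, a, b}.
std::vector<std::string> repeat_inputs(std::vector<std::string> const& paths,
                                       std::size_t repetitions)
{
  assert(repetitions > 0);  // assumption: the CLI requires at least one pass
  std::vector<std::string> out;
  out.reserve(paths.size() * repetitions);
  for (std::size_t r = 0; r < repetitions; ++r) {
    out.insert(out.end(), paths.begin(), paths.end());
  }
  return out;
}
```

With a helper like this, the command would only need the three distinct paths plus a repetition count of 6.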

GregoryKimball · 2024-09-26T23:08:44Z

Thanks @mhaseeb123 for the continued development and discussions around this example.

I would also like to consider introducing an io_type CLI argument similar to the way the nvbenchmarks work. We could take the file information and copy it to pageable host buffers, pinned host buffers, or device buffers, based on the io_type requested by the runner.
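One way such an argument could be parsed is sketched below. The enumerator names and the `parse_io_type` function are assumptions made for illustration; they are not confirmed to match the benchmarks or the final example.

```cpp
#include <algorithm>
#include <cctype>
#include <optional>
#include <string>

// Hypothetical enumeration of where the parquet bytes live before the read.
enum class io_type { FILEPATH, HOST_BUFFER, PINNED_BUFFER, DEVICE_BUFFER };

// Parses a CLI string case-insensitively. Returns std::nullopt for unknown
// names so the caller can print usage or fall back to a default.
std::optional<io_type> parse_io_type(std::string name)
{
  std::transform(name.begin(), name.end(), name.begin(), [](unsigned char c) {
    return static_cast<char>(std::toupper(c));
  });
  if (name == "FILEPATH") { return io_type::FILEPATH; }
  if (name == "HOST_BUFFER") { return io_type::HOST_BUFFER; }
  if (name == "PINNED_BUFFER") { return io_type::PINNED_BUFFER; }
  if (name == "DEVICE_BUFFER") { return io_type::DEVICE_BUFFER; }
  return std::nullopt;
}
```

The runner would then copy the file contents into the matching kind of buffer before timing the reads.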
Also I'm seeing poor pipelining in the first set of read_parquet calls, as each thread launches its IO at the same time, and we end up starting compute later than if each thread completed IO one at a time. I'll open a separate issue about this topic. ([FEA] Add synchronization for IO between read_parquet calls on different threads #16936)
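To illustrate the idea of completing IO one thread at a time, here is a standard-library sketch in which a mutex serializes the IO phase while the compute phase stays unguarded. It is only a toy model of the scheduling idea, with sleeps as stand-ins for IO and decode. It does not reflect how #16936 was or will be addressed, and `run_with_serialized_io` is a hypothetical name.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Runs `thread_count` workers; each does an IO phase, then a compute phase.
// The IO phase is serialized with a mutex; the compute phase is not, so one
// thread's compute can overlap the next thread's IO.
// Returns the largest number of IO phases observed running at the same time.
int run_with_serialized_io(std::size_t thread_count)
{
  std::mutex io_mutex;
  int in_io     = 0;  // both counters are only touched while holding io_mutex
  int max_in_io = 0;

  auto worker = [&] {
    {
      std::lock_guard<std::mutex> lock(io_mutex);  // one IO phase at a time
      ++in_io;
      max_in_io = std::max(max_in_io, in_io);
      std::this_thread::sleep_for(std::chrono::milliseconds(1));  // fake IO
      --in_io;
    }
    // Fake decompress/decode work, free to overlap with another thread's IO.
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  };

  std::vector<std::thread> workers;
  for (std::size_t i = 0; i < thread_count; ++i) { workers.emplace_back(worker); }
  for (auto& w : workers) { w.join(); }
  return max_in_io;
}
```

Because the first thread finishes its IO without competing for bandwidth, its compute can start earlier than when all threads launch IO at once.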

mhaseeb123 · 2024-09-27T01:09:57Z

> Based on this command, you can see why I would like a "repetitions" CLI argument 😆

OMG, that command line is such a monstrosity. All done in the update, though. Still working on the io_type.

vuule

the final batch of nits :)

cpp/examples/parquet_io/io_source.cpp

cpp/examples/parquet_io/io_source.hpp

cpp/examples/parquet_io/common_utils.cpp

cpp/examples/parquet_io/parquet_io_multithreaded.cpp


copy-pr-bot · 2024-10-04T23:47:59Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

cpp/examples/parquet_io/parquet_io_multithreaded.cpp


GregoryKimball · 2024-10-11T00:07:34Z

@lamarrr Would you please share your review?

bdice

CI changes look fine to me.

cpp/examples/parquet_io/parquet_io_multithreaded.cpp

mhaseeb123 · 2024-10-11T18:53:30Z

/merge

Add the new multithreaded parquet example

ff2480b

mhaseeb123 self-assigned this Sep 18, 2024

github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Sep 18, 2024

mhaseeb123 added the 2 - In Progress Currently a work in progress label Sep 18, 2024

github-actions bot added the CMake CMake build issue label Sep 18, 2024

mhaseeb123 added non-breaking Non-breaking change feature request New feature or request labels Sep 18, 2024

mhaseeb123 and others added 8 commits September 18, 2024 02:55

Set the default output path to the current path

d06f7f2

Style fix

c13a408

Use stream pool for parquet write as well

12adeeb

Add more details to the example

a8ae50a

Minor improvements

6679f89

Minor improvement

e04602c

Minor improvements

21ce7c7

Merge branch 'branch-24.10' into fea-parquet-multithreaded-example

b649530

mhaseeb123 changed the title ~~Add an example to demostrate read/write parquet using multiple threads.~~ Add an example to demonstrate read/write parquet using multiple threads. Sep 19, 2024

mhaseeb123 and others added 3 commits September 19, 2024 00:36

Move the vector to concatenate tables

b8b8bb9

Minor improvement

188ce11

Merge branch 'branch-24.10' into fea-parquet-multithreaded-example

1827654

mhaseeb123 commented Sep 23, 2024

cpp/examples/parquet_io/parquet_io.cpp (outdated, resolved)

mhaseeb123 changed the base branch from branch-24.10 to branch-24.12 September 23, 2024 21:41

mhaseeb123 added 2 commits September 23, 2024 14:41

Merge branch 'branch-24.12' into fea-parquet-multithreaded-example

e14916f

Merge branch 'branch-24.12' into fea-parquet-multithreaded-example

a528eb3

GregoryKimball reviewed Sep 26, 2024

cpp/examples/parquet_io/parquet_io_multithreaded.cpp (outdated, resolved)

GregoryKimball reviewed Sep 26, 2024

cpp/examples/parquet_io/parquet_io_multithreaded.cpp (outdated, resolved)

GregoryKimball reviewed Sep 26, 2024

cpp/examples/parquet_io/parquet_io_multithreaded.cpp (outdated, resolved)

Make multithreaded parquet io example more sophisticated

990f2bb

vuule reviewed Oct 4, 2024


mhaseeb123 added 2 commits October 4, 2024 23:47

Nits from code reviews

1a04409

Merge branch 'fea-parquet-multithreaded-example' of https://github.co…

2fb523f

…m/mhaseeb123/cudf into fea-parquet-multithreaded-example

Merge branch 'branch-24.12' into fea-parquet-multithreaded-example

8be6710

mhaseeb123 requested a review from vuule October 7, 2024 18:25

mhaseeb123 added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Oct 7, 2024

vuule approved these changes Oct 7, 2024


Merge branch 'branch-24.12' into fea-parquet-multithreaded-example

2a6db5d

mhaseeb123 requested a review from davidwendt October 7, 2024 20:53

mhaseeb123 added 3 commits October 7, 2024 14:35

Merge branch 'branch-24.12' into fea-parquet-multithreaded-example

cc6242c

Minor arg setting

3a59027

Merge branch 'branch-24.12' into fea-parquet-multithreaded-example

60e5d75

davidwendt reviewed Oct 10, 2024

cpp/examples/parquet_io/parquet_io_multithreaded.cpp (outdated, resolved)

mhaseeb123 added 4 commits October 11, 2024 00:05

Adjust spacing

7cfd7ae

Merge branch 'branch-24.12' into fea-parquet-multithreaded-example

d1fbad8

Apply suggestion

d9102f0

Merge branch 'fea-parquet-multithreaded-example' of https://github.co…

174e6c9

…m/mhaseeb123/cudf into fea-parquet-multithreaded-example

GregoryKimball requested a review from lamarrr October 11, 2024 00:09

bdice approved these changes Oct 11, 2024


davidwendt approved these changes Oct 11, 2024


lamarrr approved these changes Oct 11, 2024

cpp/examples/parquet_io/parquet_io_multithreaded.cpp (outdated, resolved)

mhaseeb123 added 2 commits October 11, 2024 18:48

Minor

b61f18e

Merge branch 'branch-24.12' into fea-parquet-multithreaded-example

803e8c9

mhaseeb123 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Oct 11, 2024

rapids-bot bot merged commit be1dd32 into rapidsai:branch-24.12 Oct 11, 2024
101 checks passed
