
cross-compiled CUDA builds running out of disk space #1114

Closed
h-vetinari opened this issue Jul 10, 2023 · 20 comments · Fixed by #1122

Comments

@h-vetinari
Member

I don't know what change caused this (perhaps something in the CUDA setup...), but for about a month now, cross-compiling CUDA has consistently blown through the disk space of the Azure workers, failing the job in a way that often cannot be restarted.

I've tried fixing this in various ways (#1075, #1081, 7c26712, a8ca8f7, 555a42c). The problem exists on both ppc & aarch; for ppc at least, the various fixes seem to have mostly settled things, but aarch is still failing 9 times out of 10.

(Note: the only reason I disabled aws-sdk-cpp is that jobs started failing again after migrating to a new version that had some more features enabled and a footprint of around 40MB; this is being tackled in conda-forge/google-cloud-cpp-feedstock#141.)

CC @conda-forge/cuda-compiler @jakirkham @isuruf

PS. By chance I've seen that qt also has disk space problems, and worked around this by disabling the use of precompiled headers. Arrow has an option ARROW_USE_PRECOMPILED_HEADERS, but it's already off by default.

@h-vetinari
Member Author

However, it does kinda seem related to PCHs, in that the failure looks like:

```
[73/134] Building CXX object CMakeFiles/gandiva.dir/cmake_pch.hxx.gch
FAILED: CMakeFiles/gandiva.dir/cmake_pch.hxx.gch 
$BUILD_PREFIX/bin/aarch64-conda-linux-gnu-c++ -Dgandiva_EXPORTS -I$SRC_DIR/python/pyarrow/src -I$SRC_DIR/python/build/temp.linux-aarch64-cpython-311/pyarrow/src -isystem $PREFIX/include/python3.11 -isystem /home/conda/feedstock_root/build_artifacts/apache-arrow_1688985855110/_build_env/venv/lib/python3.11/site-packages/numpy/core/include -Wno-noexcept-type  -Wall -fno-semantic-interposition -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O3 -pipe -isystem $PREFIX/include -fdebug-prefix-map=$SRC_DIR=/usr/local/src/conda/pyarrow-12.0.1 -fdebug-prefix-map=$PREFIX=/usr/local/src/conda-prefix -isystem /usr/local/cuda/targets/sbsa-linux/include -fdiagnostics-color=always  -fno-omit-frame-pointer -Wno-unused-variable -Wno-maybe-uninitialized -O3 -DNDEBUG -O2 -ftree-vectorize -std=c++17 -fPIC -Winvalid-pch -x c++-header -include $SRC_DIR/python/build/temp.linux-aarch64-cpython-311/CMakeFiles/gandiva.dir/cmake_pch.hxx -MD -MT CMakeFiles/gandiva.dir/cmake_pch.hxx.gch -MF CMakeFiles/gandiva.dir/cmake_pch.hxx.gch.d -o CMakeFiles/gandiva.dir/cmake_pch.hxx.gch -c $SRC_DIR/python/build/temp.linux-aarch64-cpython-311/CMakeFiles/gandiva.dir/cmake_pch.hxx.cxx
In file included from $SRC_DIR/python/pyarrow/src/arrow/python/platform.h:28,
                 from $SRC_DIR/python/pyarrow/src/arrow/python/pch.h:24,
                 from $SRC_DIR/python/build/temp.linux-aarch64-cpython-311/CMakeFiles/gandiva.dir/cmake_pch.hxx:5,
                 from <command-line>:
$PREFIX/include/python3.11/datetime.h:264:1: fatal error: cannot write PCH file: No space left on device
  264 | }
      | ^
compilation terminated.
```
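(Editor's note, not from the thread: for anyone debugging a similar ENOSPC failure, a rough sketch of how one might inspect where the agent's disk space is going before the job dies. `BUILD_DIR` is a placeholder, not the actual feedstock layout.)

```shell
# Sketch: find what is eating disk space when a build dies with ENOSPC.
# BUILD_DIR is illustrative; point it at the actual build tree on the agent.
BUILD_DIR="${BUILD_DIR:-.}"
df -h "$BUILD_DIR"                                   # free space on the volume
du -sh "$BUILD_DIR"/* 2>/dev/null | sort -rh | head  # largest entries first
find "$BUILD_DIR" -name '*.gch' -exec du -h {} +     # any precompiled headers
```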

CC @kou @pitrou

@h-vetinari
Member Author

This is now permanently blowing up our cross-compiled CUDA builds (both aarch & PPC) on 12.x & 11.x. On 10.x at least, the build passes (with the ~same fixes as mentioned in the OP, in particular with google-cloud-cpp disabled).

@isuruf
Member

isuruf commented Jul 11, 2023

Arrow has an option ARROW_USE_PRECOMPILED_HEADERS, but it's already off by default.

You are looking at arrow C++ sources, but the error is in pyarrow.

@jakirkham
Member

My hunch is that something has changed about the Azure images, which causes the amount of stuff included in them to increase (not exactly sure what changed)

Have seen this in a couple other cross-compilation CUDA builds. Though I think that is coincidental as there is simply more stuff being downloaded in those cases. Have seen disk space issues in at least one job that doesn't do any cross-compilation (though is CUDA related)

Have poked around a little bit with du & tree in this PR ( conda-forge/cudatoolkit-feedstock#93 ), but haven't had a lot of time to do it (and haven't yet found anything that would be easy to remove). Though maybe that is a good starting point for anyone wanting to investigate this further

@h-vetinari
Member Author

You are looking at arrow C++ sources, but the error is in pyarrow.

I was just collecting potentially related information; that particular option is already off by default anyway, so it wasn't a serious candidate.

@h-vetinari
Member Author

h-vetinari commented Jul 14, 2023

Have seen this in a couple other cross-compilation CUDA builds. Though I think that is coincidental as there is simply more stuff being downloaded in those cases.

Yeah, the cross-compilation infra for CUDA 11 needs to download and unpack a bunch of artefacts (see conda-forge/conda-forge-ci-setup-feedstock#210). Would it make sense to try to move these builds to CUDA 12? Having any builds restricted to CUDA >=12 would still be better than having no builds at all.

@h-vetinari
Member Author

Would it make sense to try to move these builds to CUDA 12? Having any builds restricted to CUDA >=12 would still be better than having no builds at all.

Giving this a shot in #1120

@jakirkham
Member

Sure that seems like a reasonable approach 👍

Happy to look over things there if you need another pair of eyes 🙂

@isuruf
Member

isuruf commented Jul 19, 2023

I was just collecting potentially related information; that particular option is already off by default anyway, so it wasn't a serious candidate.

If you look at pyarrow sources, you'll see that it's not 'already off by default anyway'.

@h-vetinari
Member Author

If you look at pyarrow sources, you'll see that it's not 'already off by default anyway'.

Can you be more specific about what you're referring to? I gave a direct link to an option that's off by default (and I didn't claim it applied to pyarrow either...). In pyarrow, I don't find anything using the substring PRECOMPILED_HEADER, and the only occurrence of pch does not have a switch.

@isuruf
Member

isuruf commented Jul 20, 2023

and the only occurrence for pch does not have a switch.

Exactly. Turn that off with a patch, and this issue will probably go away.

@h-vetinari
Member Author

That falls under the category "not obvious to me": I can't tell if things are still expected to work without this (given that there's no option to toggle), and I'm not in the habit of patching out things I don't understand (for example, I'm confused why headers, something pretty lightweight, would blow through the disk space).

But I'm happy to try it, thanks for the pointer.

@isuruf
Member

isuruf commented Jul 20, 2023

Precompiled headers are not lightweight. They are heavy.

```shell
$ cat pch.h
#include <stdio.h>
$ g++ pch.h -o pch.h.gch
$ file pch.h.gch 
pch.h.gch: GCC precompiled header (version 014) for C++
$ ls -alh pch.h.gch
-rw-rw-r-- 1 isuru isuru 2.2M Jul 20 14:49 pch.h.gch
```

@jakirkham
Member

Maybe we should ask someone from the Arrow team to chime in?

@h-vetinari
Member Author

2.2M

Everything is relative of course, but I don't think 2.2MB will be the reason we're running out of disk space on the agent.

Maybe we should ask someone from the Arrow team to chime in?

Sure. I think it's more the "fault" of our infra than of arrow itself, but it would be good to check whether removing pyarrow's precompiled headers is viable and what impact that would have. Hoping you could weigh in @kou @pitrou @jorisvandenbossche @assignUser

@isuruf
Member

isuruf commented Jul 20, 2023

Everything is relative of course, but I don't think 2.2MB will be the reason for us running out of disk-space on the agent.

That's just a simple C header generating 2.2MB. Template heavy C++ headers can go up to several GBs.

@h-vetinari
Member Author

OK, thanks; now I can finally see why this would be related. I still don't know why it would blow up so hard, but that's something I can investigate later.

@assignUser

My hunch is that something has changed about the Azure images, which causes the amount of stuff included in them to increase (not exactly sure what changed)

I can add to that suspicion: some of the space-heavy arrow doc builds we run on Azure have started failing due to lack of space recently, and we don't understand why.

Regarding the PCHs: my understanding is that they are useful for speeding up repeated re-builds (e.g. local development), which is not really the case here, if I recall the CI setup correctly (a matrix build where each job only builds once?). So it should be fine to patch them out, but there should probably also be an arrow issue to add an option for PCHs in pyarrow? @jorisvandenbossche

@h-vetinari
Member Author

h-vetinari commented Jul 21, 2023

Thanks for the info @assignUser!

For now, even patching out the PCHs didn't work (see #1122); we've also removed some unnecessary caching in our images, likewise to no avail.

I'm now looking at trimming some more fat from our cross-CUDA setup, which I noticed does not delete the CUDA .rpm files; those amount to roughly 2GB.
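(Editor's note, not from the thread: the cleanup idea can be sketched roughly as below. `CUDA_RPM_DIR` is a hypothetical path; the real location depends on the ci-setup scripts.)

```shell
# Sketch: once the cross-compilation CUDA packages have been unpacked, the
# downloaded .rpm archives are dead weight and can be deleted to reclaim space.
CUDA_RPM_DIR="${CUDA_RPM_DIR:-/tmp/cuda-rpms}"  # hypothetical path
du -sh "$CUDA_RPM_DIR" 2>/dev/null || true      # how much the archives occupy
find "$CUDA_RPM_DIR" -name '*.rpm' -delete      # drop them after extraction
```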

@jakirkham
Member

Indeed thanks for the feedback! 🙏

My hunch is that something has changed about the Azure images, which causes the amount of stuff included in them to increase (not exactly sure what changed)

I can add to that suspicion as some of the space heavy arrow doc builds we run on azure have started failing due to lack of space recently and we don't understand why.

Here are some ideas of things we might remove from the Azure images ( conda-forge/conda-smithy#1747 )
