Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve test compilation speed and space usage by linking tests against libopen_spiel.so #1126

Closed

Conversation

jthemphill
Copy link

@jthemphill jthemphill commented Oct 16, 2023

This is an alternative approach to #1124. Compared to the status quo, it cuts over 500MB off of the compiled build artifacts, cuts three minutes off the single-threaded full build time, and removes a lot of thinking about which libraries you need to link your game or test against.

Before:

$ make clean ; time make -j8
real    5m25.136s
user    34m56.492s
sys     2m8.369s

$ du -sh games
629M    games

After:

$ ninja clean ; time ninja -j8
[757/757] Linking CXX shared module python/pyspiel.so

real    4m35.704s
user    31m10.678s
sys     1m51.133s

$ du -sh
111M    .

$ du -sh games
34M     games

Previously, we conditionally generated libopen_spiel.so. Now, we always generate it. And when creating a new executable, you'll want to use add_open_spiel_executable() instead of add_executable(...${OPEN_SPIEL_OBJECTS}).

@jthemphill jthemphill changed the title Link all game dependencies into libgames.a Improve game test compilation speed and space usage by building libgames.so Oct 17, 2023
@jthemphill
Copy link
Author

It turns out a number of tests (battleship_test, dark_chess_test , universal_poker_test, etc...) depend on running a game's static initializers so that the game is registered in the OpenSpiel engine. It also turns out that linking against a .a static library will cause none of these initializers to run😑 Changing libgames.a to a shared library, libgames.so, fixes the problem and further reduces the size of each test to 900kB each. The tests now depend on libgames.so existing at runtime, but... they're tests, they don't have to be portable.

@lanctot
Copy link
Collaborator

lanctot commented Oct 17, 2023

The failed wheels test is our fault from an update yesterday, we'll fix that. Apologies.

@lanctot
Copy link
Collaborator

lanctot commented Oct 17, 2023

The failed wheels test is our fault from an update yesterday, we'll fix that. Apologies.

Ok, fixed. You can pull from master to get the changes, and your next merge should at least fix the wheels test.

@jthemphill
Copy link
Author

@lanctot: If you agree with the direction I've taken here, I think this is ready for another CI run. If you disagree and don't want a shared library here, please let me know!

@lanctot
Copy link
Collaborator

lanctot commented Oct 20, 2023

@lanctot: If you agree with the direction I've taken here, I think this is ready for another CI run. If you disagree and don't want a shared library here, please let me know!

Hey @jthemphill sorry I've had a busy week and haven't had a chance to take a deep look yet. Was going to mention, too, that I'm about to go on vacation for 10 days. So I'll get back to you start of November.

If this doesn't lead to any internal maintenance issues (we have some internal code built on top of OpenSpiel that is not open-sourced) nor affects the binary wheels that we distribute for Python users, or e.g. breaks our native Windows build, then I am hugely in favor of the change. I just need to be sure it doesn't disrupt anything.

Thanks for doing this! We don't normally get contributions that take this form.. takes some rare skills to first notice it and even more so to fix it.

@lanctot
Copy link
Collaborator

lanctot commented Oct 20, 2023

BTW so far it's looking like that wouldn't be the case, so I'm not worried! But I just have to make sure before we import it.

@jthemphill
Copy link
Author

Thank you very much, and have an awesome vacation!

add_test(ultimate_tic_tac_toe_test ultimate_tic_tac_toe_test)
target_link_libraries(games PRIVATE bridge_double_dummy_solver)

macro(add_game_executable)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

much cleaner
such one line
so neat!
wow

:)

Copy link
Author

@jthemphill jthemphill Oct 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this PR lands, I have another speedup in the pipeline which precompiles spiel.h and friends. This macro makes it easy to add target_precompile_headers("${_NAME}" REUSE_FROM open_spiel_core) to each target😉

@jthemphill
Copy link
Author

jthemphill commented Oct 21, 2023

nfg_games_test failed CI on macos-12 with error

Spiel Fatal Error: /Users/runner/work/open_spiel/open_spiel/open_spiel/games/nfg_game/nfg_game_test.cc:62 CHECK_TRUE(matrix_game != nullptr)

But the test passes locally, and stepping through the code, it's not clear why this would be a nullptr at all, on any OS. Opening a PR to debug this issue further.

@jthemphill
Copy link
Author

jthemphill commented Oct 21, 2023

I think nfg_game_test failed because dynamic_cast<MatrixGame>() failed, and I think that's because matrix_game.cc was linked in twice...

# Built with `-fsanitize=address -fsanitize=undefined`
$ games/nfg_game_test 
=================================================================
==43819==ERROR: AddressSanitizer: odr-violation (0x55b4d98283e0):
  [1] size=320 'vtable for open_spiel::matrix_game::MatrixState' /home/jhemphill/oss/open_spiel/open_spiel/matrix_game.cc
  [2] size=320 'vtable for open_spiel::matrix_game::MatrixState' /home/jhemphill/oss/open_spiel/open_spiel/matrix_game.cc
These globals were registered at these points:
  [1]:
    #0 0x55b4d95957da in __asan_register_globals (/home/jhemphill/oss/open_spiel/build/games/nfg_game_test+0x1da7da) (BuildId: 6d7a0b8c5cad382e91a67328dba4709b33c28695)
    #1 0x55b4d968af8b in asan.module_ctor matrix_game.cc

  [2]:
    #0 0x55b4d95957da in __asan_register_globals (/home/jhemphill/oss/open_spiel/build/games/nfg_game_test+0x1da7da) (BuildId: 6d7a0b8c5cad382e91a67328dba4709b33c28695)
    #1 0x7fc0deeeedeb in asan.module_ctor matrix_game.cc

==43819==HINT: if you don't care about these errors you may set ASAN_OPTIONS=detect_odr_violation=0
SUMMARY: AddressSanitizer: odr-violation: global 'vtable for open_spiel::matrix_game::MatrixState' at /home/jhemphill/oss/open_spiel/open_spiel/matrix_game.cc
==43819==ABORTING

...and I that's because open_spiel_core is implicitly linked into every single target we define, via link_libraries():

link_libraries(open_spiel_core

From the docs:

Link libraries to all targets added later. Specify libraries or flags to use when linking any targets created later in the current directory or below by commands such as add_executable() or add_library(). Note: The target_link_libraries() command should be preferred whenever possible. Library dependencies are chained automatically, so directory-wide specification of link libraries is rarely needed.

I'll replace this link_libraries() call with target_link_libraries() calls in the correct places, which will clear the way for creating new libraries without ODR violations.

@jthemphill jthemphill changed the title Improve game test compilation speed and space usage by building libgames.so Improve test compilation speed and space usage by linking tests against libopen_spiel.so Oct 29, 2023
@jthemphill jthemphill force-pushed the games-ball-of-mud branch 2 times, most recently from fa417b7 to b85bd80 Compare October 29, 2023 23:03
@jthemphill
Copy link
Author

jthemphill commented Nov 2, 2023

Can you rerun the failed action? The failure complains about a sha256 mismatch in jaxlib.whl, but I downloaded it, ran a sha256, and found that the sha256 on my machine matches the one they expect.

https://github.com/google-deepmind/open_spiel/actions/runs/6686304408/job/18167290661?pr=1126

@lanctot
Copy link
Collaborator

lanctot commented Nov 5, 2023

I started the test from the last commit but.. yeah, just as I suspected, they are now failing because of a JAX compatibility issue. I suspect what happened was that they finally got rid of Python 3.7 (or some other change) on the test machines just very recently because everything was working fine as of two days ago.

The real fix is rather involved and will take me a week or two given my schedule, but I have a short-term fix for now that temporarily disables all the JAX tests (#1134) that I'll put through on Monday.

But the other thing is.. this PR is seemingly getting quite complicated with the external library dependencies.. do you think it is still worth the effort? You don't anticipate it causing any issues down the road?

@jthemphill
Copy link
Author

this PR is seemingly getting quite complicated with the external library dependencies.. do you think it is still worth the effort? You don't anticipate it causing any issues down the road?

I think the core idea of linking against a shared library works and works well. This PR's transformation is mechanical - every instance of add_executable() becomes add_open_spiel_executable(), and then every link to ${OPEN_SPIEL_OBJECTS} and $<TARGET_OBJECTS:test> can be removed, because those are already in the open_spiel shared library. That's all the PR is now - I abandoned more complex approaches from before.

I wish I had locally tested more by enabling all the different compilation flags and compiling with each of the libraries, instead of relying on CI to throw the PR back to me every time I missed some compilation flag combination. I think CI will pass this time if the jaxlib wheel issue is fixed, but I understand that the size and CI churn of this PR feels risky. I think deciding if you want to go forward with this is your call as the maintainer.

@jthemphill jthemphill closed this Nov 15, 2023
@lanctot
Copy link
Collaborator

lanctot commented Nov 15, 2023

Closed on purpose? (The issues with the jax tests were fixed last week.)

Did I forget to press the button? Apologies, I've been busy fixing something else for an impending release in the next few days. I was planning to get back to this eventually, just got a bit busy with work and other maintenance stuff.

@jthemphill
Copy link
Author

Yeah, I was clearing my PR queue, and based on your last comment I just wasn't sure if you want to go forward with this PR in this form.

I'm still willing to make this green if you agree with the direction.

@jthemphill jthemphill reopened this Nov 15, 2023
@lanctot
Copy link
Collaborator

lanctot commented Nov 15, 2023

Yeah, I was clearing my PR queue, and based on your last comment I just wasn't sure if you want to go forward with this PR in this form.

Ahh, yeah fair enough. But: don't read too much into it yet. I tend to adopt "if it ain't broke, don't fix it" as a default because I need to minimize time spent on maintenance because I have very little time to devote to it and just mostly can't afford the time to fix things when they break.

I'm still willing to make this green if you agree with the direction.

Yeah, apologies for the lack of response up to now. I've just had a few other things I had to tend to (vacation, sickness, work travel, impending release fixes). But I'll get to it soon, promise!

@jthemphill
Copy link
Author

Thank you for all the time you've spent on this PR, and I hope you and your family are well. I know how hard it is to maintain an OSS project within a large company, and I don't want you to feel pressured to accept a PR if you think doing so will make your job harder!

@lanctot
Copy link
Collaborator

lanctot commented Apr 4, 2024

Hi @jthemphill,

Thanks for the effort you put into this. Unfortunately with withering resources dedicated to maintenance, it is unrealistic that we will be able to import this, so I will close it for now.

If you work on it more, please feel free to reopen.

@lanctot lanctot closed this Apr 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants