
[RFC] Mesa: Investigate if using Clang with ThinLTO improves the performance #327

Open
ptr1337 opened this issue Aug 31, 2024 · 6 comments
ptr1337 commented Aug 31, 2024

mesa is currently compiled without LTO due to bugs in GCC 14.
We should consider enabling ThinLTO and building mesa with Clang.

Someone needs to benchmark with AMD and Intel cards to see whether this improves performance.
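For reference, a minimal sketch of how such a build could be configured through Meson's generic built-in LTO options instead of raw compiler flags (untested illustration, not a proposal for the PKGBUILD):

# Sketch: configure a Clang + ThinLTO mesa build via Meson's b_lto options
CC=clang CXX=clang++ CC_LD=lld CXX_LD=lld \
  meson setup build \
    -D buildtype=release \
    -D b_lto=true \
    -D b_lto_mode=thin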

1Naim changed the title from "[RCF] Mesa: Investigate if using Clang with ThinLTO improves the performance" to "[RFC] Mesa: Investigate if using Clang with ThinLTO improves the performance" on Aug 31, 2024
ptr1337 commented Aug 31, 2024

Seems problematic to compile with ThinLTO.

ld.lld: error: version script assignment of 'global' to symbol 'driver_descriptor' failed: symbol not defined
clang++: error: linker command failed with exit code 1 (use -v to see invocation)

This can be mitigated with -Wl,--undefined-version, also see here:
https://gitlab.freedesktop.org/mesa/mesa/-/issues/8003
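A tiny stand-alone reproduction of that linker behaviour, with made-up file and symbol names, just to illustrate what the flag changes:

# A version script that names a symbol the objects never define is a hard
# error for recent lld by default; --undefined-version restores the old,
# permissive behaviour.
cat > lib.c <<'EOF'
void present_symbol(void) {}
EOF
cat > lib.map <<'EOF'
DRIVER_1 {
  global: present_symbol; driver_descriptor;
  local: *;
};
EOF
clang -shared -fPIC -fuse-ld=lld -Wl,--version-script=lib.map lib.c -o lib.so
# fails: 'driver_descriptor' is not defined
clang -shared -fPIC -fuse-ld=lld -Wl,--version-script=lib.map \
  -Wl,--undefined-version lib.c -o lib.so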

ptr1337 commented Aug 31, 2024

diff --git a/mesa/mesa/PKGBUILD b/mesa/mesa/PKGBUILD
index b46a1df..39d27c8 100644
--- a/mesa/mesa/PKGBUILD
+++ b/mesa/mesa/PKGBUILD
@@ -214,10 +214,23 @@ build() {
     -D vulkan-layers=device-select,intel-nullhw,overlay
   )
 
+  # Set Clang as compiler
+  export AR=llvm-ar
+  export CC=clang
+  export CXX=clang++
+  export NM=llvm-nm
+  export RANLIB=llvm-ranlib
+
+  export CFLAGS+=" -flto=thin"
+  export CXXFLAGS+=" -flto=thin"
+  export LDFLAGS+=" -Wl,--undefined-version -fuse-ld=lld"
   # Build only minimal debug info to reduce size
   CFLAGS+=" -g1"
   CXXFLAGS+=" -g1"
 
+  # LTO needs more open files
+  ulimit -n 4096
+
   # Inject subproject packages
   export MESON_PACKAGE_CACHE_DIR="$srcdir"
 

Diff to get ThinLTO correctly working. Will provide some test packages.
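Once the test packages are installed, a quick spot-check (the driver path is illustrative and differs between mesa versions) can confirm the library really came out of Clang and was linked with lld, since both record themselves in the .comment section:

# Compiler and linker identification strings embedded in the driver
readelf -p .comment /usr/lib/dri/radeonsi_dri.so | sort -u
# expect "clang version ..." and "Linker: LLD ..." entries rather than GCC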

ptr1337 commented Aug 31, 2024

@1Naim @SoulHarsh007
Testing Packages can be found here:
https://archive.cachyos.org/mesa-thinlto/

1Naim commented Aug 31, 2024

With a Radeon 660M (iGPU), I did not notice any performance gains from ThinLTO. However, there is a sizeable reduction in package size (most notably in lib32-mesa; approx. 28 MB in total).

With that said, this needs to be tested on much more hardware and in more environments so we can be sure there are no issues with this build. An example of issues when building with LTO can be seen in mesa/#8003; do note that that build used GCC + LTO, so the issues might not necessarily be present when building with Clang + LTO.
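For anyone who wants to reproduce the size comparison, assuming the package names match the regular repo ones:

# Installed size of the test builds vs. the stock packages
pacman -Qi mesa lib32-mesa | grep -E '^(Name|Installed Size)'
# or compare the package files themselves before installing
du -h mesa-*.pkg.tar.zst lib32-mesa-*.pkg.tar.zst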

1Naim commented Aug 31, 2024

Seems problematic to compile with ThinLTO.

ld.lld: error: version script assignment of 'global' to symbol 'driver_descriptor' failed: symbol not defined
clang++: error: linker command failed with exit code 1 (use -v to see invocation)

This can be mitigated with -Wl,--undefined-version, also see here: https://gitlab.freedesktop.org/mesa/mesa/-/issues/8003

FYI this error comes from building with the llvm stack in general, not just when building with ThinLTO.

ms178 commented Sep 29, 2024

That linked bug report brings back some memories. :) The error you encountered should still be reported upstream, though, as there was some work to fix building Mesa without having to use -Wl,--undefined-version - see MR 25551 and MR 26268.

I've been compiling Mesa with Clang + full LTO (and occasionally ThinLTO) for quite a while with more aggressive flags (see below). Recently, the performance difference against an optimized GCC-14 build hasn't been that big any longer; Clang was still a bit better overall on my hardware (Haswell-EP/Raptor Lake and a 6950 XT). For me personally, the biggest advantage of using Clang is for PGO, as the instrumented Mesa is very slow with GCC. Compile times were faster with GCC, though.

I was also experimenting with Polly. While that didn't improve the average FPS, it did have some positive effects on perceived smoothness. Unfortunately, the Polly flags I used caused some bugs that impacted Chrome's VA-API and website rendering.

My Clang flags without PGO and Polly, just for reference; feel free to pick and test the ones you like:

export CC=clang
export CXX=clang++
export CC_LD=lld
export CXX_LD=lld
export AR=llvm-ar
export NM=llvm-nm
export STRIP=llvm-strip
export OBJCOPY=llvm-objcopy
export OBJDUMP=llvm-objdump
export READELF=llvm-readelf
export RANLIB=llvm-ranlib
export HOSTCC=clang
export HOSTCXX=clang++
export HOSTAR=llvm-ar
export CPPFLAGS="-D_FORTIFY_SOURCE=0"
export CFLAGS="-O3 -march=native -mtune=native -mllvm -inline-threshold=1500 -mllvm -extra-vectorizer-passes -mllvm -enable-cond-stores-vec -mllvm -slp-vectorize-hor-store -mllvm -enable-loopinterchange -mllvm -enable-loop-distribute -mllvm -enable-unroll-and-jam -mllvm -enable-loop-flatten -mllvm -unroll-runtime-multi-exit -mllvm -aggressive-ext-opt -mllvm -enable-interleaved-mem-accesses -mllvm -enable-masked-interleaved-mem-accesses -fno-math-errno -fno-trapping-math -falign-functions=32 -funroll-loops -fno-semantic-interposition -fcf-protection=none -mharden-sls=none -fomit-frame-pointer -mprefer-vector-width=256 -flto -fwhole-program-vtables -fsplit-lto-unit -mllvm -adce-remove-loops -mllvm -enable-ext-tsp-block-placement=1 -mllvm -enable-gvn-hoist -mllvm -enable-dfa-jump-thread -fdata-sections -ffunction-sections -fno-unique-section-names -fsplit-machine-functions -fno-plt -mtls-dialect=gnu2 -w"
export CXXFLAGS="${CFLAGS} -Wp,-U_GLIBCXX_ASSERTIONS"
export LDFLAGS="-Wl,--lto-CGO3 -Wl,--gc-sections -Wl,--icf=all -Wl,--lto-O3,-O3,-Bsymbolic-functions,--as-needed -fcf-protection=none -mharden-sls=none -Wl,-mllvm -Wl,-extra-vectorizer-passes -Wl,-mllvm -Wl,-enable-cond-stores-vec -Wl,-mllvm -Wl,-slp-vectorize-hor-store -Wl,-mllvm -Wl,-enable-loopinterchange -Wl,-mllvm -Wl,-enable-loop-distribute -Wl,-mllvm -Wl,-enable-unroll-and-jam -Wl,-mllvm -Wl,-enable-loop-flatten -Wl,-mllvm -Wl,-unroll-runtime-multi-exit -Wl,-mllvm -Wl,-aggressive-ext-opt -Wl,-mllvm -Wl,-enable-interleaved-mem-accesses -Wl,-mllvm -Wl,-enable-masked-interleaved-mem-accesses -march=native -flto -fwhole-program-vtables -fuse-ld=lld -Wl,-zmax-page-size=0x200000 -Wl,-mllvm -Wl,-adce-remove-loops -Wl,-mllvm -Wl,-enable-ext-tsp-block-placement=1 -Wl,-mllvm -Wl,-enable-gvn-hoist=1 -Wl,-mllvm -Wl,-enable-dfa-jump-thread=1 -Wl,--push-state -Wl,-whole-archive -lmimalloc -Wl,--pop-state -lpthread -lstdc++ -lm -ldl -Wl,-z,now -Wl,-z,relro -Wl,-z,pack-relative-relocs -Wl,--hash-style=gnu -Wl,--undefined-version"
export CCLDFLAGS="$LDFLAGS"
export CXXLDFLAGS="$LDFLAGS"
export ASFLAGS="-D__AVX__=1 -D__AVX2__=1 -D__FMA__=1"
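Since the biggest advantage I mentioned is Clang's usable PGO, here is the shape of the generic Clang PGO cycle (profile directory and workloads are placeholders, not my exact setup):

# 1) Instrumented build: run representative games/workloads afterwards
export CFLAGS+=" -fprofile-generate=/tmp/mesa-pgo"
export CXXFLAGS+=" -fprofile-generate=/tmp/mesa-pgo"
export LDFLAGS+=" -fprofile-generate=/tmp/mesa-pgo"
# ... build, install, play ...

# 2) Merge the raw profiles and rebuild using them
llvm-profdata merge -output=/tmp/mesa.profdata /tmp/mesa-pgo/*.profraw
export CFLAGS+=" -fprofile-use=/tmp/mesa.profdata"
export CXXFLAGS+=" -fprofile-use=/tmp/mesa.profdata"
export LDFLAGS+=" -fprofile-use=/tmp/mesa.profdata"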
