How to make sure TDW runs with GPUs on a Linux server? #549

Open
LuZeking opened this issue Apr 10, 2023 · 15 comments

Comments

@LuZeking

Hi,

I followed the remote Linux part of install.md, and all the steps seemed to work. But when I use the X servers to run a simulation, only the CPU is used even though I have 2 GPUs, which makes the simulation very slow.

e.g. nvidia-smi:
image

xorg config files:
`# xorg-1-tdw.conf
# nvidia-xconfig: X configuration file generated by nvidia-xconfig
# nvidia-xconfig: version 515.86.01

Section "Files"
EndSection

Section "InputDevice"
# generated from default
Identifier "Mouse0"
Driver "mouse"
Option "Protocol" "auto"
Option "Device" "/dev/psaux"
Option "Emulate3Buttons" "no"
Option "ZAxisMapping" "4 5"
EndSection

Section "InputDevice"
# generated from default
Identifier "Keyboard0"
Driver "kbd"
EndSection

Section "Monitor"
Identifier "Monitor0"
VendorName "Unknown"
ModelName "Unknown"
Option "DPMS"
EndSection

Section "Device"
Identifier "Device0"
Driver "nvidia"
VendorName "NVIDIA Corporation"
BoardName "NVIDIA TITAN RTX"
BusID "PCI:179:0:0"
EndSection

# xorg-2-tdw.conf
# nvidia-xconfig: X configuration file generated by nvidia-xconfig
# nvidia-xconfig: version 515.86.01

Section "Files"
EndSection

Section "InputDevice"
# generated from default
Identifier "Mouse0"
Driver "mouse"
Option "Protocol" "auto"
Option "Device" "/dev/psaux"
Option "Emulate3Buttons" "no"
Option "ZAxisMapping" "4 5"
EndSection

Section "InputDevice"
# generated from default
Identifier "Keyboard0"
Driver "kbd"
EndSection

Section "Monitor"
Identifier "Monitor0"
VendorName "Unknown"
ModelName "Unknown"
Option "DPMS"
EndSection

Section "Device"
Identifier "Device1"
Driver "nvidia"
VendorName "NVIDIA Corporation"
BoardName "NVIDIA TITAN RTX"
BusID "PCI:23:0:0"
EndSection
`
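One thing worth ruling out with the two files above is a mismatch between the BusID values and the actual GPUs. xorg.conf writes the PCI address in decimal, while nvidia-smi and lspci print it in hexadecimal, so the two are easy to misread. The sketch below converts one form to the other; `to_xorg_busid` is a hypothetical helper name, and bash is assumed.

```shell
#!/usr/bin/env bash
# Convert a PCI address as printed by nvidia-smi or lspci (hex, e.g.
# "00000000:B3:00.0" or "B3:00.0") into the decimal form used by the
# BusID option in xorg.conf (e.g. "PCI:179:0:0").
# to_xorg_busid is a hypothetical helper name, not part of any tool.
to_xorg_busid() {
  local addr="$1" bus dev fn
  # Drop an optional leading PCI domain such as "00000000:".
  case "$addr" in
    *:*:*) addr="${addr#*:}" ;;
  esac
  bus="${addr%%:*}"                   # "B3"
  dev="${addr#*:}"; dev="${dev%%.*}"  # "00"
  fn="${addr##*.}"                    # "0"
  printf 'PCI:%d:%d:%d\n' "0x$bus" "0x$dev" "0x$fn"
}

# Possible usage on the server (requires nvidia-smi):
#   nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader |
#     while read -r id; do to_xorg_busid "$id"; done
```

If the printed values match the BusID lines in the xorg files, the files point at the right cards and the problem lies elsewhere.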

Finally, I checked the Player.log:
image
image

I have been struggling with this and have no idea how to solve it. Could you help me?

@alters-mit
Member

@LuZeking In your xorg file, the NVIDIA device is Device1. Try changing it to Device0 or, better yet, export DISPLAY=:1.0

@LuZeking
Author

> @LuZeking In your xorg file, the NVIDIA device is Device1. Try changing it to Device0 or, better yet, export DISPLAY=:1.0

Thanks for your kind reply :)! But I am a bit confused, since I lack experience with this kind of setup. I have 2 GPUs; as shown above in the 2 xorg files, one is set as Identifier Device0 and the other as Identifier Device1. Do you mean that both should be changed to Identifier Device0?

Moreover, setting export DISPLAY=:1.0 also didn't work for me. On the remote Linux HPC, echo $DISPLAY gives localhost:10.0, while on my laptop the VcXsrv display number is set to 0. Only this combination launches the simulation window successfully (but still without GPUs). If I change either $DISPLAY on the remote machine or the display number on the laptop to anything else, the window doesn't even launch.

Did I configure something wrong?

@LuZeking
Author

Or could the reason be that my CUDA and cuDNN versions don't meet the TDW requirements? I see that the example in install.md uses CUDA 9.0.

@alters-mit
Member

CUDA is probably irrelevant.

Try this:

  1. Make backups of your xorg files.
  2. Kill the X server
  3. sudo nvidia-xconfig -a --use-display-device=None --virtual=256x256
  4. sudo /usr/bin/X :0&
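
The four steps above can be sketched as a small script. This is only a sketch under assumptions: the `.bak` backup name, the use of `pkill -x Xorg` to stop the server, and the helper names `run` and `reset_headless_x` are all illustrative choices, and bash is assumed. By default it only prints the commands so they can be reviewed first.

```shell
#!/usr/bin/env bash
# Dry-run by default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

# run is a hypothetical helper: print the command in dry-run mode, else run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

reset_headless_x() {
  local conf="${1:-/etc/X11/xorg.conf}"
  # 1. Back up the existing config (the .bak suffix is an arbitrary choice).
  run sudo cp "$conf" "$conf.bak"
  # 2. Stop the running X server. Assumption: it runs as a plain Xorg
  #    process; if a display manager owns it, stop that service instead.
  run sudo pkill -x Xorg
  # 3. Regenerate the config for all GPUs with no physical display attached.
  run sudo nvidia-xconfig -a --use-display-device=None --virtual=256x256
  # 4. Start X on display :0 (in the background when really executing).
  if [ "$DRY_RUN" = "1" ]; then
    run sudo /usr/bin/X :0
  else
    sudo /usr/bin/X :0 &
  fi
}

# Review the commands first:   reset_headless_x
# Then actually execute them:  DRY_RUN=0 reset_headless_x
```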

@LuZeking
Author

Hello Alters,
Thanks for your suggestions. I have tried these commands. Step 3 created one big /etc/X11/xorg.conf. After step 4, I ran the simulation again; unfortunately, it was still running on the CPU rather than the GPUs.

The /etc/X11/xorg.conf content is:
` Driver "kbd"
EndSection

Section "Monitor"
Identifier "Monitor0"
VendorName "Unknown"
ModelName "Unknown"
Option "DPMS"
EndSection

Section "Monitor"
Identifier "Monitor1"
VendorName "Unknown"
ModelName "Unknown"
Option "DPMS"
EndSection

Section "Device"
Identifier "Device0"
Driver "nvidia"
VendorName "NVIDIA Corporation"
BoardName "NVIDIA TITAN RTX"
BusID "PCI:179:0:0"
EndSection

Section "Device"
Identifier "Device1"
Driver "nvidia"
VendorName "NVIDIA Corporation"
BoardName "NVIDIA TITAN RTX"
BusID "PCI:23:0:0"
EndSection

Section "Screen"
Identifier "Screen0"
Device "Device0"
Monitor "Monitor0"
DefaultDepth 24
Option "UseDisplayDevice" "None"
SubSection "Display"
Virtual 256 256
Depth 24
EndSubSection
EndSection

Section "Screen"
Identifier "Screen1"
Device "Device1"
Monitor "Monitor1"
DefaultDepth 24
Option "UseDisplayDevice" "None"
SubSection "Display"
Virtual 256 256
Depth 24
EndSubSection
EndSection`
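
For reference, each Screen section in the config above binds one screen to one GPU device. A quick way to list those bindings is sketched below; `list_screens` is a hypothetical helper name. If a single X server is started as :0 with this file, Screen0 and Screen1 would normally be addressed as :0.0 and :0.1, assuming the ServerLayout section (not shown in the excerpt) lists them in that order.

```shell
#!/usr/bin/env bash
# Print "<screen identifier> <device identifier>" for every Screen section
# of an xorg.conf file. list_screens is a hypothetical helper name.
list_screens() {
  awk '
    # Entering a Screen section: reset the collected fields.
    /^[ \t]*Section[ \t]+"Screen"/ { in_screen = 1; id = ""; dev = ""; next }
    in_screen && /^[ \t]*Identifier/ { gsub(/"/, "", $2); id = $2 }
    in_screen && /^[ \t]*Device/     { gsub(/"/, "", $2); dev = $2 }
    # EndSubSection does not match here, only the closing EndSection does.
    in_screen && /^[ \t]*EndSection/ { print id, dev; in_screen = 0 }
  ' "$1"
}

# Possible usage:
#   list_screens /etc/X11/xorg.conf
```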

Could you find anything else that may cause this problem?

@alters-mit
Member

Try these options:

./TDW.x86_64 -force-glcore

or

./TDW.x86_64 -force-glcore45

or

./TDW.x86_64 -force-device-index 0

@LuZeking
Author

Hi Esther, Thank you for your kindness and help:). Unfortunately, the commands you suggested still failed. Do you have any other suggestions?

@alters-mit
Member

Can you send me the exact shell command you're using to launch TDW.x86_64? Maybe it's just not formatted correctly...

@alters-mit
Member

Also, try these:

  1. DISPLAY=:1.0 ./TDW.x86_64
  2. DISPLAY=:0.1 ./TDW.x86_64 (I don't think this will work but it's worth trying)
  3. DISPLAY=:1.0 ./TDW.x86_64 -force-device-index 1
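
Before trying the DISPLAY values above, it may help to check which local X displays exist on the server at all. A local X server normally creates a Unix socket named X<n> under /tmp/.X11-unix for display :<n>, so listing that directory shows the candidates. The sketch below assumes that default socket location; `list_local_displays` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print one ":<n>" per local X server socket found in the given directory
# (default: /tmp/.X11-unix). list_local_displays is a hypothetical helper.
list_local_displays() {
  local dir="${1:-/tmp/.X11-unix}" sock
  for sock in "$dir"/X*; do
    # If nothing matches, the glob stays literal and -e fails: skip it.
    [ -e "$sock" ] || continue
    printf ':%s\n' "${sock##*/X}"
  done
}

# Possible usage:
#   list_local_displays
```

If :0 or :1 is not in the output, no local X server is listening on that display, and launching the build with that DISPLAY value is expected to fail.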

@alters-mit
Member

If that doesn't work, please send the list of devices and screens on the server. I'm not sure how exactly to do this but this post might help. I'm interested in seeing if it outputs the same device/screen indices that are in the xorg file. https://askubuntu.com/a/123096

@LuZeking
Author

Hi Esther, thank you for such detailed suggestions! I have tried them all, but unfortunately it still doesn't work. The simulation window launched (still without GPUs) only when I set DISPLAY as follows:

  1. DISPLAY=localhost:10.0 ./tdw/TDW.x86_64
  2. DISPLAY=localhost:10.0 ./tdw/TDW.x86_64 -force-device-index 1

Then I ran xrandr --query to list the connected screens:
Screen 0: minimum 0 x 0, current 3840 x 1152, maximum 32767 x 32767
default connected primary 3840x1152+0+0 1016mm x 304mm
   3840x1152 0.00

Does this mean the device/screen indices are the same as those configured in the xorg file?
I also noticed that the window only launches when DISPLAY is set to localhost:10.0 and the laptop's VcXsrv display number is set to 0. Setting DISPLAY to :0.0 or :1.0 results in a launch failure.
Do you have any other suggestions or ideas?

@alters-mit
Member

Hi @LuZeking Sorry I've been slow to respond.

I wonder why the display includes localhost. Can you tell me more about your setup? What machine is the controller on? What machine is the build running on? Are you trying to do X11 forwarding?

You might also have better luck with our Docker container.
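
A likely reason for the localhost prefix: when a session is opened with SSH X11 forwarding, sshd by default sets DISPLAY to localhost:10.0 for the first forwarded display, and X requests are tunneled to the X server on the client machine instead of a local X server. The sketch below classifies a DISPLAY value on that basis. It is a heuristic only, and `display_kind` and its labels are hypothetical names.

```shell
#!/usr/bin/env bash
# Classify a DISPLAY value (default: the current $DISPLAY).
# display_kind and its output labels are hypothetical, heuristic names.
display_kind() {
  local d="${1:-${DISPLAY:-}}"
  case "$d" in
    "")          echo "unset" ;;
    :*|unix:*)   echo "local" ;;             # X server on this machine
    localhost:*) echo "likely-forwarded" ;;  # typical of SSH X11 forwarding
    *)           echo "remote" ;;            # some other host over TCP
  esac
}

# Possible usage on the server:
#   display_kind            # classifies the current $DISPLAY
#   display_kind :0.0       # prints "local"
```

A result of likely-forwarded would mean the xorg files on the server are not involved in that session at all.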

@LuZeking
Author

Hi Esther, thanks for your reply.

My setup is Windows 11 on my laptop and Ubuntu 20.04/22.04 on the remote high-performance computer (HPC) managed by Slurm (I tried 2 different HPCs with Ubuntu 20.04/22.04).

The controller is running on my laptop, and the build is running on the HPC. And yes, I am trying to do X11 forwarding, so that the build uses the GPUs on the HPC while the TDW window is shown on my laptop through X11.

BTW, in https://github.com/threedworld-mit/tdw/blob/master/Documentation/lessons/remote/x11_forwarding.md, I only found the macOS setup. Does this mean it won't work on Windows? Actually, it works for a simple controller that prints "hello world", but a complex one like ur5 that needs a GPU is too slow or just crashes.

I will also try the Docker container soon. It needs root permission, so I need to ask the HPC admin for help.

@alters-mit
Member

Sorry for not responding to this for a while.

I had not realized you're trying to forward the X11 port. We only have Mac instructions because we haven't tried doing it yet on Windows. One of our users provided the Mac instructions. If you manage to find a solution for Windows, we can add it to the documentation.

In the meantime, can't you just run the TDW window on the HPC's own X?

@deathpoker

Hello, I connected a monitor directly to the HPC and ran the TDW window on the HPC's own X, but Player.log shows that the CPU is used, not the GPU. I did not use the nohup command to create a new virtual monitor, because using a virtual monitor makes my monitor go black. Here's my Player.log:

Mono path[0] = '/data1/user/lpy/TDW/TDW_Data/Managed'
Mono config path = '/data1/user/lpy/TDW/TDW_Data/MonoBleedingEdge/etc'
Preloaded 'libOni.so'
Preloaded 'libaudiopluginresonanceaudio.so'
Unable to preload the following plugins:
libflexUtils.so
Display 0 'VGA 24"': 1920x1080 (primary device).
Desktop is 1920 x 1080 @ 60 Hz
Initialize engine version: 2020.3.24f1 (79c78de19888)
Plugins: Couldn't open OculusXRPlugin, error: OculusXRPlugin: cannot open shared object file: No such file or directory
[Subsystems] Discovering subsystems at path /data1/user/lpy/TDW/TDW_Data/UnitySubsystems
[Subsystems] No descriptors matched for examples in UnitySubsystems/OculusXRPlugin/UnitySubsystemsManifest.json.
[Subsystems] 1 'inputs' descriptors matched in UnitySubsystems/OculusXRPlugin/UnitySubsystemsManifest.json
[Subsystems] 1 'displays' descriptors matched in UnitySubsystems/OculusXRPlugin/UnitySubsystemsManifest.json
[Subsystems] No descriptors matched for meshings in UnitySubsystems/OculusXRPlugin/UnitySubsystemsManifest.json.
[Subsystems] No descriptors matched for examples in UnitySubsystems/WindowsMRXRSDK/UnitySubsystemsManifest.json.
[Subsystems] 1 'inputs' descriptors matched in UnitySubsystems/WindowsMRXRSDK/UnitySubsystemsManifest.json
[Subsystems] 1 'displays' descriptors matched in UnitySubsystems/WindowsMRXRSDK/UnitySubsystemsManifest.json
[Subsystems] 1 'meshings' descriptors matched in UnitySubsystems/WindowsMRXRSDK/UnitySubsystemsManifest.json
Forcing GfxDevice: OpenGL Core
GfxDevice: creating device client; threaded=1
Renderer: llvmpipe (LLVM 12.0.0, 256 bits)
Vendor: Mesa/X.org
Version: 4.5 (Core Profile) Mesa 21.2.6
GLES: 0
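
The Renderer line in the log above is the quickest indicator: llvmpipe is Mesa's CPU rasterizer, so this build is rendering in software. A minimal sketch for checking a Player.log automatically is below; `player_log_renderer` and `is_software_renderer` are hypothetical helper names, and the path to Player.log has to be supplied by the caller.

```shell
#!/usr/bin/env bash
# Print the value of the first "Renderer:" line in a Unity Player.log.
# Both function names are hypothetical helpers.
player_log_renderer() {
  sed -n 's/^[[:space:]]*Renderer:[[:space:]]*//p' "$1" | head -n 1
}

# Succeed if the renderer string names a Mesa software rasterizer.
is_software_renderer() {
  case "$1" in
    *llvmpipe*|*softpipe*) return 0 ;;
    *) return 1 ;;
  esac
}

# Possible usage (replace the path with the location of your Player.log):
#   r=$(player_log_renderer /path/to/Player.log)
#   if is_software_renderer "$r"; then
#     echo "CPU rendering: $r"
#   else
#     echo "Renderer: $r"
#   fi
```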
