[render_vtk] Remove X server requirement #21700

Open
jwnimmer-tri opened this issue Jul 9, 2024 · 27 comments
Labels: component: geometry perception (How geometry appears in color, depth, and label images, via the RenderEngine API); priority: high; type: feature request

Comments

@jwnimmer-tri
Collaborator

(This was split out of #21050.)

Is your feature request related to a problem? Please describe.

There's no real reason for an X server to be a strict requirement for this renderer. The requirement makes it harder to deploy to headless environments such as containers.

Describe the solution you'd like

Build VTK and RenderEngineVtk with EGL support, and provide a RenderEngineVtkParams config option to select EGL instead of GLX.

Describe alternatives you've considered

Users can run under xvfb, but that renders without hardware acceleration and would be a bottleneck.

@jwnimmer-tri added the labels "type: feature request" and "component: geometry perception" on Jul 9, 2024
@jwnimmer-tri
Collaborator Author

See #21050 (comment) for a WIP idea.

@jwnimmer-tri
Collaborator Author

Build VTK and RenderEngineVtk with EGL support, ...

What we're looking for from Kitware is this part. It's OK to hard-code the rendering to EGL during that initial prototyping.

... and provide a RenderEngineVtkParams config option to select EGL instead of GLX.

If this part is too high of a learning curve for Kitware people, TRI people can collaborate on adding the params flag and wiring it up.

@sankhesh
Contributor

Thank you for clarifying, @jwnimmer-tri. I hope to have the initial build available for prototyping soon. We can then iterate over the flags required for switching between engines.

@jwnimmer-tri
Collaborator Author

jwnimmer-tri commented Aug 7, 2024

With EGL, setting show_window = True will be a no-op, because our EGL will always be offscreen.

Then, get all the tests passing and open a draft PR.

For the RenderEngineVtkParams config flag for which backend to use, we should allow it to be blank, in which case Drake gets to choose a default (for now -- GLX). On macOS, the new param must be blank because neither GLX nor EGL is wired up for macOS within Drake. (We won't have a "Cocoa" string option.)
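
The selection rule described in this comment can be sketched as follows. This is only an illustration in Python: the helper names `resolve_backend` and `effective_show_window` and the platform strings are hypothetical and not part of Drake. Only the allowed values (blank, "GLX", "EGL"), the GLX default, the macOS restriction, and the show_window no-op come from the comment.

```python
def resolve_backend(backend: str, platform: str) -> str:
    """Hypothetical helper: maps the user-facing param to a concrete backend.

    Returns "" on macOS, where neither GLX nor EGL is wired up.
    """
    if backend not in ("", "GLX", "EGL"):
        raise ValueError(f"Unknown backend: {backend!r}")
    if platform == "macos":
        if backend != "":
            raise ValueError("On macOS the backend param must be blank")
        return ""  # No "Cocoa" string option is offered.
    # Blank lets Drake choose the default (for now, GLX).
    return backend or "GLX"


def effective_show_window(backend: str, show_window: bool) -> bool:
    """With EGL, rendering is always offscreen, so show_window is a no-op."""
    return show_window and backend != "EGL"
```

Under these assumptions, `resolve_backend("", "linux")` picks GLX, and asking for `show_window=True` with EGL quietly yields no window rather than an error.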

@jwnimmer-tri
Collaborator Author

This issue is becoming ever-more important to TRI. Now that we're back after the holiday, anything we can do to help move this faster would be appreciated.

@sankhesh
Contributor

sankhesh commented Sep 5, 2024

@jwnimmer-tri Trying to track down why the test SimpleClone fails with EGL. Based on the renderings, the base_case yields expected renders as in the image below:

[Image: w_2]

However, things fall apart after Clone() is called. The clone renderer has the right background color and I also verified that the sphere actor gets added to the renderer but it doesn't get rendered for some reason. The rendered image for the clone looks like:

[Image: w_5]

Any insight as to why there might be a difference in the clone renderer?

@jwnimmer-tri
Collaborator Author

I think TRI would be happy to help give ideas and/or debug it (and I'll CC @SeanCurtis-TRI, who will probably have better answers than me), but surely we can't do that in the absence of any code? Please open a (draft) pull request with the code, so we have somewhere to start from. It's fine if the code is unfinished / unclean for now.

@SeanCurtis-TRI
Contributor

Without having seen the code, the most probable cause is that there is some property in an actor (or mapper or some such thing) that isn't currently being copied. The VTK artifacts don't have clean copy semantics (as far as I know). As such, the cloning mechanism inside Drake explicitly constructs new instances, mapping a subset of values over into the clone. If some property important for EGL is omitted in that explicit creation/copy act, then the cloned version will have insufficient data. So, you should look at the cloning code and see if there is some obviously missing property.
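
To illustrate the failure mode being hypothesized here, the following is a toy Python sketch of a clone that constructs a new instance and copies only a subset of fields. The `Actor` class and its field names (including `backend_hint`) are invented for illustration and are not VTK or Drake types.

```python
from dataclasses import dataclass


@dataclass
class Actor:
    # Stand-in for a VTK prop; the field names are invented for illustration.
    color: tuple = (1.0, 1.0, 1.0)
    visible: bool = True
    backend_hint: str = ""  # Imagine a property that only matters for EGL.


def clone_actor(source: Actor) -> Actor:
    """Builds a new instance and copies an explicit subset of fields."""
    clone = Actor()
    clone.color = source.color
    clone.visible = source.visible
    # backend_hint is not copied, so the clone silently keeps the default.
    return clone


def missing_fields(source: Actor, clone: Actor) -> list:
    """Lists the fields whose values differ between source and clone."""
    return [name for name in vars(source)
            if getattr(source, name) != getattr(clone, name)]
```

A field-by-field comparison like `missing_fields` is one way to look for an "obviously missing property" after cloning; in real VTK objects the relevant state may not be exposed as simple attributes.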

@sankhesh
Contributor

sankhesh commented Sep 5, 2024

I've pushed my debug branch here: master...sankhesh:drake:egl-rebased

Yes, I also think that there's something weird going on with the ShallowCopy. I realized that there are two props in the scene: a plane and a sphere. I did a couple of experiments and found that, for some odd reason, the order in which the cloned props are added to the viewport matters. In the latest changes on my branch referenced above, I tricked it into adding the sphere after the plane, and that seems to work. This is abnormal behavior for VTK. The order of prop addition shouldn't matter in a 3D view. And I know that this is not an EGL or vtkEGLRenderWindow limitation, since the baseline case renders the two props fine.

The branch above also writes out the rendered views as png files so I could introspect.

@SeanCurtis-TRI
Contributor

Let me know if you definitively want me to explore. I'll wait til I hear from you to avoid duplication.

@sankhesh
Contributor

sankhesh commented Sep 5, 2024

Yes please take a look. I'm trying to decipher the vtk usage as I debug but it would be much faster for you. If you don't see anything major, I can try reproducing it with a minimal example outside of drake.

@SeanCurtis-TRI
Contributor

I typically do the "reproduce outside of Drake" when wrestling with VTK. I find it very helpful to convince myself that my problems are born of Drake and not VTK (it usually is, but I've been able to submit enough fixes to VTK to remain convinced it's a worthwhile endeavor).

I'll take a look at your branch later today.

@SeanCurtis-TRI
Contributor

Ok, I've pulled your branch. From the earlier post, it looks like things have moved a bit. Obviously, now the sphere appears. But it isn't illuminated in the same way. I'm getting the following images. Can you confirm that you're getting the same?

Baseline:
[Image: w_0]

Clone:
[Image: w_1]

@SeanCurtis-TRI
Contributor

BTW My first guess is that this isn't a material or geometry problem. I'm guessing it's a light problem. I'll let you know how it goes.

@SeanCurtis-TRI
Contributor

Hmmmm....when I reconfigure the render engine to have a directional light shining from the side, the two images differ even more:

[Image: baseline]

[Image: clone]

So, something more fundamental is happening. So, what would account for the darker rim and the reversed direction?

@SeanCurtis-TRI
Contributor

SeanCurtis-TRI commented Sep 5, 2024

And yet another perturbation -- if I displace the camera a bit (so the sphere is no longer centered), I get:

Baseline
[Image: baseline]

Clone
[Image: clone]

The mystery deepens.

  • Note: I also turned the camera diagonally, but that was an uninteresting change. It didn't cause anything to manifest that the "straight-from-the-side" version hadn't already shown.

@SeanCurtis-TRI
Contributor

SeanCurtis-TRI commented Sep 5, 2024

The face culling we're seeing in the clone happens in the depth image as well. (Presumably it happens in the label image too, but because we can see through to the other side, it doesn't show up in the rendered label image.)

The curious thing is that as the faces get "removed" from the image, what is revealed is the correct rendering. (Same for the depth image.) So, it's like the mapper is getting something backwards...all normals flipped? But then as the relationship between camera and geometry changes, some faces get culled away and we see through to the correct faces. (Note: I tried toggling two-sided shading in the renderers, and it made no difference to the rendered output.)

@SeanCurtis-TRI
Contributor

If I delete the original engine prior to rendering, the cloned rendering comes out exactly right. The clone shares some data with the original; either it shares too much or not enough.

@SeanCurtis-TRI
Contributor

FYI, this feels like it might be related to #20002: clones fighting each other. The fight with the EGL renderer may be different from the fight with the OpenGL renderer, but fights are still happening? It may simply be that the test that is failing now was insufficiently sophisticated to reveal the failure before. Furthermore, our usage may also contribute to not falling into the trap: even though Drake has an original RenderEngine (the model) and a clone (in the context) active during simulation, we only ever render on one of those render engines, so we don't observe conflict.

@SeanCurtis-TRI
Contributor

Ok, it's definitely interference between the two render engines.

I changed the test to do the following:

  1. Create original render engine.
  2. Clone it.
  3. Render from clone
  4. Render from original
  5. Render from original again
  6. Render from clone again

The first rendering is correct, the second is messed up, the third is correct again. Surprisingly, the fourth is also correct.

The same thing happens if I flip the render engines in steps 3-6.

So, it seems like one RenderEngine is updating the VTK state for itself, the other accepts it (in some sense) even though it isn't quite appropriate for it. But after one rendering, it gets a chance to correct for itself. But those corrections appear to stick? I'm not sure if that's completely true or if there's some sequence of further renderings that would show further corruption of the image.
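
The render sequence from steps 3-6 can be expressed as a small harness. This is a minimal Python sketch, assuming each engine is wrapped in a zero-argument callable that returns a pixel buffer; `check_interleaved` and the `Image` alias are hypothetical names, not part of Drake or VTK.

```python
from typing import Callable, List, Sequence

Image = Sequence[int]  # Placeholder pixel buffer type for this sketch.


def check_interleaved(
    render_original: Callable[[], Image],
    render_clone: Callable[[], Image],
    expected: Image,
) -> List[bool]:
    """Renders clone, original, original, clone and reports which match."""
    order = [render_clone, render_original, render_original, render_clone]
    # Each callable is invoked in turn, so any state shared between the two
    # engines is exercised in exactly the order listed in the steps above.
    return [list(render()) == list(expected) for render in order]
```

With such a harness, the outcome reported in this comment would read as `[True, False, True, True]`, and swapping the two callables corresponds to flipping the render engines in steps 3-6.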

I'm going to try this same test on the original OpenGL implementation to see if we get a similar outcome. Stay tuned...

@SeanCurtis-TRI
Contributor

Nope. It doesn't happen with the OpenGL-backed RenderEngineGl. So, while we know there's some form of interference of the other kind (#20002), it doesn't have the same source as what's affecting us with the EGL-backed implementation.

Let me know if you'd like me to push my current experiment to your branch.

@sankhesh
Contributor

sankhesh commented Sep 6, 2024

Hi @SeanCurtis-TRI, thank you for your help investigating this. I am still trying to wrap my head around the rendering pipeline in Drake. My assumption was that each run uses a single render engine, so when I run the VTK tests, VTK is solely responsible for rendering. Based on your last comment about interference, that understanding seems wrong. How do two render engines interact with each other?

@sankhesh
Contributor

sankhesh commented Sep 6, 2024

Ok, I've pulled your branch. From the earlier post, it looks like things have moved a bit. Obviously, now the sphere appears. But it isn't illuminated in the same way.

Note that in the latest change, I trick the logic to add the cloned sphere after the cloned plane to the renderer. Without this trick, the sphere doesn't show up. I did a poor job explaining this in #21700 (comment) but essentially, there is something fundamentally flawed with the cloned renderer itself. The order of actor addition shouldn't determine depth/visibility/occlusion - point coordinates and opacity should.

@sankhesh
Contributor

sankhesh commented Sep 6, 2024

Let me know if you'd like me to push my current experiment to your branch.

Yes, please feel free to.

@SeanCurtis-TRI
Contributor

  • RE: pushing my current experiment to your branch
    • I don't believe I have permissions. I've tried pushing and get the following response:
 ! [remote rejected]       egl-rebased -> egl-rebased (permission denied)
error: failed to push some refs to 'github.com:sankhesh/drake.git'

So, the two independent pipelines (which can be configured independently), nevertheless share data. We've got two possible problems:

  1. Something is being persisted within the data (vtkActor, vtkMapper, vtkSomething (TM)). (I've seen this in the past where a render pass would mutate the lights while doing its own work, messing up calculations for other render passes).
  2. Drake's tracking of those items is creating the entanglement.

I would recommend taking the operations on the VTK code out of Drake and running them directly against VTK. Build the original pipeline, clone it in the same way, perform some renderings, and see if they interfere. It'll be tedious. The best case scenario is that you observe interference outside of Drake (then we know it's something in VTK).

If we don't observe interference, there are two possibilities:

  • The extracted test didn't extract enough.
  • Drake is interfering with VTK's natural operations.

I'd propose we do the "extraction experiment" first and see if we're lucky. If it doesn't reveal a corresponding issue, then we can look at the test and persuade all parties that it captured everything Drake is doing with VTK. After that, we can look at what Drake is doing on top of it.

@sankhesh
Contributor

@SeanCurtis-TRI Just a quick update that I am still looking into this issue, and I think it is just a manifestation of #20002. With EGL it seems to happen all the time.

@jwnimmer-tri
Collaborator Author

jwnimmer-tri commented Oct 14, 2024

The PR #22025 mostly finishes this work. After that, only a few loose ends remain:

Projects
Status: In Progress