TestMLPotential.py fails with nnpops implementation #25

Closed
dominicrufa opened this issue Mar 3, 2022 · 68 comments

@dominicrufa
Contributor

dominicrufa commented Mar 3, 2022

I'm not too familiar with Torch tracebacks, but it seems like Torch isn't robust to tensors being placed on different devices:

ERROR: testCreateMixedSystem (__main__.TestMLPotential)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/lila/home/rufad/github/openmm-ml/test/TestMLPotential.py", line 27, in testCreateMixedSystem
    mixedEnergy = mixedContext.getState(getEnergy=True).getPotentialEnergy().value_in_unit(unit.kilojoules_per_mole)
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/openmm/openmm.py", line 14580, in getState
    state = _openmm.Context_getState(self, types, enforcePeriodicBox, groups_mask)
openmm.OpenMMException: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/openmmml/models/anipotential/___torch_mangle_14.py", line 36, in forward
      _5 = torch.mul(boxvectors1, 10.)
      pbc0 = self.pbc
      _6, energy1, = (model0).forward(_4, _5, pbc0, )
                      ~~~~~~~~~~~~~~~ <--- HERE
      energy = energy1
    energyScale = self.energyScale
  File "code/__torch__/NNPOps/OptimizedTorchANI.py", line 19, in forward
    species_aevs = (aev_computer).forward(species_coordinates0, cell, pbc, )
    neural_networks = self.neural_networks
    species_energies = (neural_networks).forward(species_aevs, )
                        ~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    energy_shifter = self.energy_shifter
    species_energies0 = (energy_shifter).forward(species_energies, None, None, )
  File "code/__torch__/NNPOps/BatchedNN.py", line 10, in forward
    species_aev: Tuple[Tensor, Tensor]) -> __torch__.NNPOps.EnergyShifter.SpeciesEnergies:
    _0 = getattr(self, "0")
    return (_0).forward(species_aev, )
            ~~~~~~~~~~~ <--- HERE
  def __len__(self: __torch__.NNPOps.BatchedNN.TorchANIBatchedNN) -> int:
    return 1
  File "code/__torch__/NNPOps/BatchedNN.py", line 33, in forward
    layer0_weights = self.layer0_weights
    layer0_biases = self.layer0_biases
    vectors0 = ops.NNPOpsBatchedNN.BatchedLinear(vectors, layer0_weights, layer0_biases)
               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    vectors1 = __torch__.torch.nn.functional.celu(vectors0, 0.10000000000000001, False, )
    layer2_weights = self.layer2_weights

Traceback of TorchScript, original code (most recent call last):
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/openmmml-1.0-py3.9.egg/openmmml/models/anipotential.py", line 135, in forward
                    self.pbc = self.pbc.to(positions.device)
                    boxvectors = boxvectors.to(torch.float32)
                    _, energy = self.model((self.species, positions), cell=10.0*boxvectors, pbc=self.pbc)
                                ~~~~~~~~~~ <--- HERE

                return energy * self.energyScale # Hartree --> kJ/mol
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/NNPOps/OptimizedTorchANI.py", line 53, in forward
        species_coordinates = self.species_converter(species_coordinates)
        species_aevs = self.aev_computer(species_coordinates, cell=cell, pbc=pbc)
        species_energies = self.neural_networks(species_aevs)
                           ~~~~~~~~~~~~~~~~~~~~ <--- HERE
        species_energies = self.energy_shifter(species_energies)

  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/NNPOps/BatchedNN.py", line 122, in forward
    def forward(self, species_aev: Tuple[Tensor, Tensor]) -> SpeciesEnergies:
        return self[0].forward(species_aev)
               ~~~~~~~~~~~~~~~ <--- HERE
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/NNPOps/BatchedNN.py", line 99, in forward
        vectors = aev.unsqueeze(-2).unsqueeze(-1)

        vectors = batchedLinear(vectors, self.layer0_weights, self.layer0_biases) # Linear 0
                  ~~~~~~~~~~~~~ <--- HERE
        vectors = F.celu(vectors, alpha=0.1)                                      # CELU   1
        vectors = batchedLinear(vectors, self.layer2_weights, self.layer2_biases) # Linear 2
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper__bmm)


----------------------------------------------------------------------
Ran 1 test in 32.967s

FAILED (errors=1)

@peastman, any idea what is going wrong here? Or perhaps @raimis knows what is wrong.

Alternatively, if I try to run this without GPUs, it throws a runtime error:

======================================================================
ERROR: testCreateMixedSystem (__main__.TestMLPotential)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/lila/home/rufad/github/openmm-ml/test/TestMLPotential.py", line 17, in testCreateMixedSystem
    mixedSystem = potential.createMixedSystem(pdb.topology, mmSystem, mlAtoms, interpolate=False)
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/openmmml-1.0-py3.9.egg/openmmml/mlpotential.py", line 265, in createMixedSystem
    self._impl.addForces(topology, newSystem, atomList, forceGroup, **args)
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/openmmml-1.0-py3.9.egg/openmmml/models/anipotential.py", line 91, in addForces
    model = OptimizedTorchANI(model, species).to(device)
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 899, in to
    return self._apply(convert)
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 570, in _apply
    module._apply(fn)
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 616, in _apply
    self._buffers[key] = fn(buf)
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 897, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/torch/cuda/__init__.py", line 214, in _lazy_init
    torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available

Do we generally want to make this package robust to all platform types, or only to CUDA?

@dominicrufa
Contributor Author

Also, when I add the interpolate=True argument to createMixedSystem and equip the system to a Context, it fails with

Traceback (most recent call last):
  File "/lila/home/rufad/nnpops/run.py", line 45, in <module>
    context.getState(getEnergy=True).getPotentialEnergy()
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/openmm/openmm.py", line 14580, in getState
    state = _openmm.Context_getState(self, types, enforcePeriodicBox, groups_mask)
openmm.OpenMMException: Error invoking kernel: CUDA_ERROR_INVALID_HANDLE (400)

when I call context.getState(getEnergy=True).getPotentialEnergy(), which I don't know how to debug; however, when I leave interpolate=False, I can pull the state and the potential energy without issue.

@peastman
Member

peastman commented Mar 4, 2022

It looks like a case where the model is on one device and the input tensor is on a different one.
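
As a minimal sketch of that failure mode (an illustration only, not code from openmm-ml or NNPOps; it assumes a CUDA device is available), a module whose parameters were left on the CPU raises exactly this error when handed a CUDA tensor, and moving the whole module to the input's device fixes it:

import torch

model = torch.nn.Linear(3, 1)              # parameters and buffers live on the CPU
positions = torch.ones(3, device='cuda')   # input lives on cuda:0

try:
    model(positions)                        # RuntimeError: ... cuda:0 and cpu ...
except RuntimeError as err:
    print(err)

model.to(positions.device)                  # moves every parameter and buffer to the GPU
print(model(positions))                     # now succeeds

In the traceback above, it looks like the weights used by BatchedLinear are the tensors that stayed on the CPU while the positions arrived on cuda:0.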

Alternatively, if I try to run this without GPUs, it throws a runtime error:

What exactly does that mean? Are you running it on a computer without a GPU? Or do you mean there is a GPU, but you're specifying the CPU platform when you create your context? The PyTorch plugin is supposed to work with all platforms.

@dominicrufa
Contributor Author

What exactly does that mean? Are you running it on a computer without a GPU? Or do you mean there is a GPU, but you're specifying the CPU platform when you create your context? The PyTorch plugin is supposed to work with all platforms.

Running on a computer without a GPU.

It looks like a case where the model is on one device and the input tensor is on a different one.

Right. I am just trying to figure out why this is the case and how to fix it. The platform is Reference in the test. Is this an error you see if you run it locally?

@peastman
Member

peastman commented Mar 4, 2022

TestMLPotential.py passes when I run it locally. Perhaps the problem is that you have CUDA installed (so PyTorch tries to use it), but you don't have any CUDA compatible GPUs (so it fails when it tries)?
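
If that is the cause, a hypothetical guard (not the current openmm-ml code) would be to pick the device from what PyTorch actually reports rather than assuming CUDA is usable, so a CUDA-enabled PyTorch build on a GPU-less node falls back to the CPU:

import torch

# Hypothetical device selection; falls back to the CPU when no usable GPU is present.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# e.g., in addForces (sketch only):
# model = OptimizedTorchANI(model, species).to(device)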

@dominicrufa
Contributor Author

dominicrufa commented Mar 7, 2022

Sorry, I should clarify: I am trying to run TestMLPotential with the nnpops mixin here and am observing the aforementioned errors. I'm not sure if this is an edge case, nor how to solve it.

@dominicrufa
Contributor Author

but you don't have any CUDA compatible GPUs (so it fails when it tries)?

I definitely have both, and I can make the nnpops implementation run without observing this issue. It only appears if I set interpolate=True in createMixedSystem.

@peastman
Member

peastman commented Mar 7, 2022

I think we're talking about different things, since a few different errors are described above. I was referring to the No CUDA GPUs are available error.

@dominicrufa
Contributor Author

I think we're talking about different things, since a few different errors are described above. I was referring to the No CUDA GPUs are available error.

Yes, I don't disagree that this is the case. This is not the blocking issue, just a passing observation (which you correctly clarified).

I am primarily concerned with integrating the nnpops-equipped TorchANI force with createMixedSystem using the interpolate=True argument; the above issue is not a concern, since I cannot test nnpops without a GPU anyway.

@dominicrufa
Contributor Author

dominicrufa commented Mar 7, 2022

#!/usr/bin/env python
import torch
import torchani
from NNPOps import OptimizedTorchANI
from openmmtools.testsystems import HostGuestExplicit
from openmmml.mlpotential import MLPotential
from simtk import openmm, unit
import time
import numpy as np
from simtk.openmm import LangevinMiddleIntegrator

temperature = 298.15 * unit.kelvin
frictionCoeff = 1. / unit.picosecond
stepSize = 1. * unit.femtoseconds
hgv = HostGuestExplicit(constraints=None)

potential = MLPotential('ani2x')
system = potential.createMixedSystem(hgv.topology, system=hgv.system, atoms=list(range(126, 156)), implementation='nnpops', interpolate=True)
print(f"done making system")
_int = LangevinMiddleIntegrator(temperature, frictionCoeff, stepSize)
context = openmm.Context(system, _int)
context.setPositions(hgv.positions)
# query and print out the global parameters:
swig_params = context.getParameters()
print(f"context parameters:")
for i in swig_params:
    print(i, swig_params[i])
context.getState(getEnergy=True).getPotentialEnergy()

@peastman, if I reduce the problem to this code snippet and pull main into this PR (so that I can use nnpops on GPU), then the snippet works with interpolate=False, but not with True. If I set it to True, I see:

Traceback (most recent call last):
  File "/lila/home/rufad/nnpops/run.py", line 45, in <module>
    context.getState(getEnergy=True).getPotentialEnergy()
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/openmm/openmm.py", line 14580, in getState
    state = _openmm.Context_getState(self, types, enforcePeriodicBox, groups_mask)
openmm.OpenMMException: Error invoking kernel: CUDA_ERROR_INVALID_HANDLE (400)

and I'm not sure how to debug this.

@dominicrufa changed the title from "TestMLPotential.py fails" to "TestMLPotential.py fails with nnpops implementation" on Mar 7, 2022
@peastman
Member

peastman commented Mar 7, 2022

Let me make sure I understand. This error happens when all of the following are true:

  • You use interpolate=True.
  • You use the optimized implementation from NNPOps.
  • You use the CUDA platform.

If any one of those is not true, it works. Is that correct?

How does this relate to the original problem you described up at the top? That one produced an exception about tensors being on different devices, while this one produces CUDA_ERROR_INVALID_HANDLE.

@dominicrufa
Contributor Author

@peastman

Let me make sure I understand. This error happens when all of the following are true:

Correct.

If any one of those is not true, it works. Is that correct?

I don't know; I haven't tried all of the permutations. However, I need the latter two points to be true for my use cases. When all three are true, it fails; when only the latter two are true, it works.

How does this relate to the original problem you described up at the top? That one produced an exception about tensors being on different devices, while this one produces CUDA_ERROR_INVALID_HANDLE.

The precise error is not the point; after more digging (and showing this), I realized there is an edge case associated with running an nnpops-implemented System with the interpolate argument. Perhaps these should be different issues? The main thing is that I cannot seem to use these two functionalities together, which is a prerequisite for performing the energy-matching assertion in the test you wrote.

@peastman
Member

peastman commented Mar 8, 2022

Here's the error I get when running your example.

Traceback (most recent call last):
  File "test.py", line 21, in <module>
    context = openmm.Context(system, _int)
  File "/home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/openmm/openmm.py", line 5125, in __init__
    this = _openmm.new_Context(*args)
openmm.OpenMMException: Unknown device: 87. If you have recently updated the caffe2.proto file to add a new device type, did you forget to update the DeviceTypeName() function to reflect such recent changes?
Exception raised from DeviceTypeName at /tmp/pip-req-build-d1tk7kuo/c10/core/DeviceType.cpp:42 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6a (0x7f39ccd45dba in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xd8 (0x7f39ccd42338 in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::DeviceTypeName[abi:cxx11](c10::DeviceType, bool) + 0x309 (0x7f39ccd22169 in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #3: torch::jit::Unpickler::readInstruction() + 0x1d53 (0x7f3a141cb0a3 in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #4: torch::jit::Unpickler::run() + 0xa9 (0x7f3a141cb599 in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #5: torch::jit::Unpickler::parse_ivalue() + 0x2f (0x7f3a141cb7cf in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #6: torch::jit::readArchiveAndTensors(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<std::function<c10::StrongTypePtr (c10::QualifiedName const&)> >, c10::optional<std::function<c10::intrusive_ptr<c10::ivalue::Object, c10::detail::intrusive_target_default_null_type<c10::ivalue::Object> > (c10::StrongTypePtr, c10::IValue)> >, c10::optional<c10::Device>, caffe2::serialize::PyTorchStreamReader&) + 0x42c (0x7f3a1416faac in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x325cda5 (0x7f3a1416fda5 in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x32603cb (0x7f3a141733cb in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #9: torch::jit::load(std::shared_ptr<caffe2::serialize::ReadAdapterInterface>, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0x1c0 (0x7f3a14174560 in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #10: torch::jit::load(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0xc7 (0x7f3a141812f7 in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #11: TorchPlugin::TorchForceImpl::initialize(OpenMM::ContextImpl&) + 0x65 (0x7f398e74b1e5 in /usr/local/openmm/lib/libOpenMMTorch.so)
frame #12: OpenMM::ContextImpl::initialize() + 0x422 (0x7f39909b6a52 in /usr/local/openmm/lib/libOpenMM.so.7.7)
frame #13: OpenMM::Context::Context(OpenMM::System const&, OpenMM::Integrator&, OpenMM::ContextImpl&) + 0xf8 (0x7f39909b1228 in /usr/local/openmm/lib/libOpenMM.so.7.7)
frame #14: OpenMM::ContextImpl::createLinkedContext(OpenMM::System const&, OpenMM::Integrator&) + 0x31 (0x7f39909b4341 in /usr/local/openmm/lib/libOpenMM.so.7.7)
frame #15: OpenMM::CustomCVForceImpl::initialize(OpenMM::ContextImpl&) + 0x3b2 (0x7f39909c5482 in /usr/local/openmm/lib/libOpenMM.so.7.7)
frame #16: OpenMM::ContextImpl::initialize() + 0x422 (0x7f39909b6a52 in /usr/local/openmm/lib/libOpenMM.so.7.7)
frame #17: OpenMM::Context::Context(OpenMM::System const&, OpenMM::Integrator&) + 0x78 (0x7f39909b0fa8 in /usr/local/openmm/lib/libOpenMM.so.7.7)
frame #18: <unknown function> + 0x159676 (0x7f3990fca676 in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/openmm/_openmm.cpython-38-x86_64-linux-gnu.so)
<omitting python frames>
frame #36: __libc_start_main + 0xe7 (0x7f3a4fcf1c87 in /lib/x86_64-linux-gnu/libc.so.6)

Notice the message "Unknown device: 87" near the top. Each time I run it, there's a different number. That makes me think it might be a problem with uninitialized memory somewhere. I'm not sure where it's getting the number from though. The error happens in the first line of TorchForceImpl::initialize():

module = torch::jit::load(owner.getFile());

@peastman
Member

peastman commented Mar 8, 2022

The above was using the main branch, so it actually wasn't using the NNPOps optimized version. Strange...

@dominicrufa
Contributor Author

dominicrufa commented Mar 8, 2022

That's especially strange; I haven't encountered that. (I unintentionally closed the issue.) I can't tell if this is a version issue, but all of my packages come from conda:
omm_dev.txt
I'm going to play around with this a bit more before I give up.

@peastman
Member

peastman commented Mar 8, 2022

I think this may be an issue with incompatible versions of pytorch. Investigating...

@peastman
Member

peastman commented Mar 8, 2022

I was compiling OpenMM-Torch against a version of libtorch downloaded from https://pytorch.org, and I think it was incompatible with the one from conda. I needed to do that because the conda version was missing the CMake files needed to compile against it. I updated to the newest conda package (PyTorch 1.10.0), and now it does include the CMake files. But when I try to compile against it, all the test cases fail to build with the errors

/home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so: undefined reference to `std::__cxx11::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >::basic_ostringstream()@GLIBCXX_3.4.26'
/home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so: undefined reference to `std::__cxx11::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >::basic_stringstream()@GLIBCXX_3.4.26'
collect2: error: ld returned 1 exit status

@jchodera
Member

jchodera commented Mar 9, 2022

The packages installed to build pytorch can differ from the packages installed to run it when you just conda install the package. Is it possible that you need to install some of those to build things with pytorch?

@peastman
Member

peastman commented Mar 9, 2022

I don't think so. The link errors refer to standard C++ functions. Usually that indicates a binary incompatibility of some sort, either libraries were compiled with different ABIs or different versions of libstdc++.

@jchodera
Member

I was thinking that it might be trying to use your system libraries instead of the conda-forge built libraries installed via the packages that appear in the build: dependencies but not in the run: dependencies.

@dominicrufa
Contributor Author

@peastman: I was playing around with the nnpops implementation and discovered that the error thrown here might somehow be a consequence of placing the TorchForce into a CustomCVForce, as you did here.

If I set interpolate=False and replace your ANIForce implementation with

class ANIForce(torch.nn.Module):

    def __init__(self, model, species, atoms):
        super(ANIForce, self).__init__()
        self.model = model
        self.species = species
        self.energyScale = torchani.units.hartree2kjoulemol(1)

        if atoms is None:
            self.indices = None
        else:
            self.indices = torch.tensor(atoms, dtype=torch.int64)

        self.model = model
        self.pbc = torch.tensor([True, True, True], dtype=torch.bool)

    def forward(self, positions, boxvectors: Optional[torch.Tensor] = None, scale: Optional[torch.Tensor] = None):

and add a scale GlobalParameter like this:

force = openmmtorch.TorchForce(filename)
force.setForceGroup(forceGroup)
if topology.getPeriodicBoxVectors() is not None:
    force.setUsesPeriodicBoundaryConditions(True)
force.addGlobalParameter('scale', 1.)
system.addForce(force)

I can manipulate the global parameter and make calls to state.getPotentialEnergy() without seeing the openmm.OpenMMException: Error invoking kernel: CUDA_ERROR_INVALID_HANDLE (400).
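
For reference, here is a hypothetical sketch of how the forward body could consume that scale parameter (the actual body is not shown above; this continues the ANIForce snippet, assumes import torch and from typing import Optional as in anipotential.py, and omits unit conversion and reshaping of positions):

    def forward(self, positions, boxvectors: Optional[torch.Tensor] = None, scale: Optional[torch.Tensor] = None):
        # hypothetical body, mirroring the quoted anipotential.py forward
        if self.indices is not None:
            positions = positions[self.indices]
        if boxvectors is None:
            _, energy = self.model((self.species, positions))
        else:
            self.pbc = self.pbc.to(positions.device)
            _, energy = self.model((self.species, positions), cell=10.0*boxvectors.to(torch.float32), pbc=self.pbc)
        energy = energy * self.energyScale              # Hartree --> kJ/mol
        if scale is not None:
            energy = scale * energy                     # controlled by the 'scale' global parameter
        return energy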

I'm not sure how easy it would be to find the root cause of the CustomCVForce error, but I wonder if the createMixedSystem function here might be modified to not place the TorchForce into the CustomCVForce, and just leave it as a separate force (with the scale GlobalParameter still equipped). It would be a temporary workaround, but functionally I don't think it would be any different.

Your thoughts?

@jchodera
Member

@peastman: Since it will take a while to establish why putting a TorchForce inside a CustomCVForce throws an OpenMMException, could you make the change @dominicrufa suggests now so we can start using openmm-ml while this is being debugged?

@peastman
Member

@dominicrufa could you post the output of conda list in your environment? Also, what are CUDA_SDK_ROOT_DIR and CUDA_TOOLKIT_ROOT_DIR set to in CMake?

@dominicrufa
Contributor Author

@peastman, my conda list is in this comment.

Also, what are CUDA_SDK_ROOT_DIR and CUDA_TOOLKIT_ROOT_DIR set to in CMake?

If you are referring to my OpenMM installation, I am using a nightly build from conda-forge; I'm not building from source.

@peastman
Member

I'm referring to the OpenMM-Torch plugin. Do you build it from source or install with conda?

@dominicrufa
Contributor Author

Conda. Everything is installed with conda. openmm-torch will pin the conda-forge release of OpenMM. Once everything but OpenMM is installed, you have to force-install the omnia-dev version of OpenMM so it plays nicely with openmm-torch.

@jchodera
Member

Is there an issue with the build environments of openmm from omnia-dev not being fully matched with the conda-forge build infrastructure? Or do we think this issue is independent of build version incompatibilities?

@dominicrufa
Contributor Author

@peastman, is this specifically what is throwing the openmm.OpenMMException: Error invoking kernel: CUDA_ERROR_INVALID_HANDLE (400) error? Would I still be seeing this if the context were being restored correctly?

@peastman
Member

Correct. If I manually restore the context, the error goes away. But if I then follow with a second energy evaluation, we get a CUDA error inside PyTorch.

@dominicrufa
Contributor Author

@peastman, if it is indeed a PyTorch bug, would it make more sense to use this hack in the meantime, since the time horizon for the PyTorch bugfix is unknown? I only say this because this issue is blocking for me. If you'd prefer to avoid the hack, I'll open a PR applying it (for reference's sake) as a temporary workaround that I can integrate into my downstream workflow.

@jchodera
Member

It looks to me like this may involve a bug in PyTorch. It seems to be messing up the CUDA context.

Perhaps the NVIDIA folks like @dmclark17 might be able to help us here since it involves a few community codes?

@dmclark17

Sure—I can do some investigating and try to reproduce on my end.

I've tried various ways of restoring the context. They fix the CUDA error coming from OpenMM code, but then lead to CUDA errors in PyTorch code.

I'm still getting up to speed on how contexts are being handled here—have you tried popping the current context before the PyTorch code and then pushing it afterwards?

@peastman
Member

Contexts are handled by the ContextSelector class. It pushes the context in its constructor and pops it in the destructor. To use it, you create an instance as a local variable. The context is current from that line to the end of the enclosing block.
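
Schematically, the pattern is an RAII guard along these lines (a simplified stand-in for illustration, not OpenMM's actual ContextSelector implementation):

#include <cuda.h>

class ScopedContext {
public:
    explicit ScopedContext(CUcontext ctx) { cuCtxPushCurrent(ctx); }   // constructor pushes the context
    ~ScopedContext() { CUcontext popped; cuCtxPopCurrent(&popped); }   // destructor pops it again
};

void compute(CUcontext ctx) {
    ScopedContext selector(ctx);   // the context is current from here...
    // ... launch kernels ...
}                                  // ...to the end of the enclosing block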

Here is the method where the problem occurs.

https://github.com/openmm/openmm-torch/blob/84f7d884ec0d9d72a57a769046bdddd1d62b8fc2/platforms/cuda/src/CudaTorchKernels.cpp#L80-L158

There are ContextSelectors to set the context for two short blocks, one in lines 97-101 and another in lines 145-150. It does not set a context at the point where the PyTorch model is invoked (either line 114 or 119). And usually that works.

But it fails when the TorchForce is inside a CustomCVForce. In that case, this whole method is called from https://github.com/openmm/openmm/blob/c7af17c8ba2b6c3667e5126b494d1972b1b6d254/platforms/common/src/CommonKernels.cpp#L5389. The invoking method has already placed a context onto the stack, and PyTorch removes it.

This does suggest a workaround: possibly we could modify the implementation of CustomCVForce to not have a context set when it calls calcForcesAndEnergy(). That might work as long as nothing at an even higher level has set a context. But of course, the whole point of having a stack of contexts is so that you don't have to worry about that.
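
A rough sketch of that workaround idea (illustration only, not the actual change that went into OpenMM; innerCompute is a hypothetical stand-in for the nested TorchForce evaluation):

#include <cuda.h>

void invokeNestedForce(void (*innerCompute)()) {
    CUcontext saved = nullptr;
    cuCtxPopCurrent(&saved);       // leave no context current for the nested call
    innerCompute();                // PyTorch can then manage its own context
    if (saved != nullptr)
        cuCtxPushCurrent(saved);   // restore the caller's context afterwards
}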

@peastman
Member

The workaround is in openmm/openmm#3533.

@dmclark17

Thanks for the explanation!

I'm trying to create a standalone reproducer to make sure I understand and can communicate the issue. I am loading in a simple model that multiplies an input tensor by two. I created it using the following:

import torch

class TestModule(torch.nn.Module):
    def forward(self, input):
        return 2 * input

module = torch.jit.script(TestModule())
module.save('model.pt')

The C++ code looks like this:

#include <cuda.h>

#include <torch/torch.h>
#include <torch/script.h>

#include <stdio.h>

void printContext(const char *msg) {
  CUcontext context;
  CUresult res = cuCtxGetCurrent(&context);
  printf("Context %d. Code %d. %s\n", context, res, msg);
}

int main() {
  cuInit(0);

  CUcontext ctx, myContext;
  CUdevice dev;
  CUresult res;

  cuDeviceGet(&dev, 0);
  cuCtxCreate(&ctx, CU_CTX_SCHED_SPIN, dev);
  printContext("After creation");

  torch::jit::script::Module module = torch::jit::load("../model.pt");
  module.to(at::kCUDA);
  printContext("After loading torchscript");

  std::vector<torch::jit::IValue> inputs;
  inputs.push_back(torch::ones({1,}).to(at::kCUDA));
  at::Tensor output = module.forward(inputs).toTensor();
  printContext("After run");
}

I am seeing the following output:

Context 1471272896. Code 0. After creation
Context 1471272896. Code 0. After loading torchscript
Context 1471272896. Code 0. After run

In this case, it doesn't seem like PyTorch is changing the context. On the other hand, if there isn't a current context when the JIT module is executed, it seems like PyTorch creates a new context and leaves it on the stack. It doesn't seem like this is the expected behavior, given the error observed with OpenMM-Torch. Do you have any ideas on how to make the example more realistic? Thanks!

@peastman
Member

If you move the lines that load the module up to the top of main(), you can reproduce the problem. That matches what happens in OpenMM: the module gets loaded while creating the System, and cuInit() gets called later when you create the Context. The following version also adds a call to cuCtxPushCurrent() to even better match what happens in the real code.

int main() {
  torch::jit::script::Module module = torch::jit::load("../model.pt");
  module.to(at::kCUDA);

  cuInit(0);

  CUcontext ctx, myContext;
  CUdevice dev;
  CUresult res;

  cuDeviceGet(&dev, 0);
  cuCtxCreate(&ctx, CU_CTX_SCHED_SPIN, dev);
  printContext("After creation");

  cuCtxPushCurrent(ctx);
  printContext("After push");

  std::vector<torch::jit::IValue> inputs;
  inputs.push_back(torch::ones({1,}).to(at::kCUDA));
  at::Tensor output = module.forward(inputs).toTensor();
  printContext("After run");
}

Here is the output I get.

Context 319176160. Code 0. After creation
Context 319176160. Code 0. After push
Context 319165568. Code 0. After run

@dmclark17

Interesting—I am not able to reproduce that on my end; with that ordering, I am seeing:

Context 255. Code 3. After loading torchscript. Expected error code 3 for not initialized
Context 1472714976. Code 0. After creation
Context 1472714976. Code 0. After push
Context 1472714976. Code 0. After run

I think I'm using the CUDA toolkit that conda installed with PyTorch—I'm not sure if that could be causing the difference.

@peastman
Member

What version of PyTorch do you have? I was testing with 1.9.1.

@dmclark17

I am using 1.11.0 and linking to the libtorch that comes with the conda installation. I will try using 1.9.1!

@dominicrufa
Contributor Author

@peastman, can you merge the workaround into OpenMM's main, or are we anticipating a PyTorch bug fix?

@peastman
Member

Merged. We should still figure out what's going on with PyTorch, but it should fix the immediate problem.

What version of PyTorch were you using when you encountered the problem?

@dominicrufa
Contributor Author

pytorch                   1.10.0          cuda112py39h3ad47f5_1    conda-forge
pytorch-gpu               1.10.0          cuda112py39h0bbbad9_1    conda-forge

@peastman, were you able to see the problem with nnpops equipped, specifically?
If so, would you be able to push your modifications and commit to main of this repo? Otherwise, I can do it if you can review it afterward.

@peastman
Member

were you able to see the problem with nnpops equipped, specifically?

Yes.

would you be able to push your modifications and commit to main of this repo?

I didn't make any changes to code in this repo.

@dominicrufa
Contributor Author

@peastman, which pull request did you use to reproduce the problem?

@peastman
Member

The one you said to use, #21.

@dominicrufa
Contributor Author

Right, yes. Sorry for the confusion. I think it just needs to be rebased onto main and merged so that the TorchANI force can be equipped with nnpops, but I don't have write permissions to that PR. I can pull it into my PR and rebase/request a merge into main if you'd prefer.

@dmclark17

What version of PyTorch do you have? I was testing with 1.9.1.

I am able to reproduce the issue with 1.9.0:

Context 255. Code 3. After loading torchscript. Expected error code 3 for not initialized
Context 1470519632. Code 0. After creation
Context 1470519632. Code 0. After push
Context 1470509040. Code 0. After run

I am not seeing anything about CUDA contexts in the 1.11.0 release notes.

@dmclark17

I've been looking into the difference between PyTorch 1.9 and 1.11, and it seems like 1.9 calls cudaSetDevice(0) when the JIT module is invoked, which initializes the primary context. However, this API call is absent in 1.11, which explains why the standalone example doesn't reproduce the issue there. I'll see if I can find the responsible code change.

Would it be possible to try to reproduce the original bug with PyTorch 1.11 to see if it is fixed? I need to use #21 to reproduce, correct?

@jchodera
Member

@dominicrufa: Was this fixed?

@dominicrufa
Contributor Author

Closing, as this is fixed in main.
