TestMLPotential.py fails with nnpops implementation #25

Closed
dominicrufa opened this issue Mar 3, 2022 · 68 comments

@dominicrufa
Contributor

dominicrufa commented Mar 3, 2022

I'm not too familiar with Torch tracebacks, but it seems like Torch isn't robust to tensors being placed on different devices:

ERROR: testCreateMixedSystem (__main__.TestMLPotential)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/lila/home/rufad/github/openmm-ml/test/TestMLPotential.py", line 27, in testCreateMixedSystem
    mixedEnergy = mixedContext.getState(getEnergy=True).getPotentialEnergy().value_in_unit(unit.kilojoules_per_mole)
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/openmm/openmm.py", line 14580, in getState
    state = _openmm.Context_getState(self, types, enforcePeriodicBox, groups_mask)
openmm.OpenMMException: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/openmmml/models/anipotential/___torch_mangle_14.py", line 36, in forward
      _5 = torch.mul(boxvectors1, 10.)
      pbc0 = self.pbc
      _6, energy1, = (model0).forward(_4, _5, pbc0, )
                      ~~~~~~~~~~~~~~~ <--- HERE
      energy = energy1
    energyScale = self.energyScale
  File "code/__torch__/NNPOps/OptimizedTorchANI.py", line 19, in forward
    species_aevs = (aev_computer).forward(species_coordinates0, cell, pbc, )
    neural_networks = self.neural_networks
    species_energies = (neural_networks).forward(species_aevs, )
                        ~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    energy_shifter = self.energy_shifter
    species_energies0 = (energy_shifter).forward(species_energies, None, None, )
  File "code/__torch__/NNPOps/BatchedNN.py", line 10, in forward
    species_aev: Tuple[Tensor, Tensor]) -> __torch__.NNPOps.EnergyShifter.SpeciesEnergies:
    _0 = getattr(self, "0")
    return (_0).forward(species_aev, )
            ~~~~~~~~~~~ <--- HERE
  def __len__(self: __torch__.NNPOps.BatchedNN.TorchANIBatchedNN) -> int:
    return 1
  File "code/__torch__/NNPOps/BatchedNN.py", line 33, in forward
    layer0_weights = self.layer0_weights
    layer0_biases = self.layer0_biases
    vectors0 = ops.NNPOpsBatchedNN.BatchedLinear(vectors, layer0_weights, layer0_biases)
               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    vectors1 = __torch__.torch.nn.functional.celu(vectors0, 0.10000000000000001, False, )
    layer2_weights = self.layer2_weights

Traceback of TorchScript, original code (most recent call last):
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/openmmml-1.0-py3.9.egg/openmmml/models/anipotential.py", line 135, in forward
                    self.pbc = self.pbc.to(positions.device)
                    boxvectors = boxvectors.to(torch.float32)
                    _, energy = self.model((self.species, positions), cell=10.0*boxvectors, pbc=self.pbc)
                                ~~~~~~~~~~ <--- HERE

                return energy * self.energyScale # Hartree --> kJ/mol
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/NNPOps/OptimizedTorchANI.py", line 53, in forward
        species_coordinates = self.species_converter(species_coordinates)
        species_aevs = self.aev_computer(species_coordinates, cell=cell, pbc=pbc)
        species_energies = self.neural_networks(species_aevs)
                           ~~~~~~~~~~~~~~~~~~~~ <--- HERE
        species_energies = self.energy_shifter(species_energies)

  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/NNPOps/BatchedNN.py", line 122, in forward
    def forward(self, species_aev: Tuple[Tensor, Tensor]) -> SpeciesEnergies:
        return self[0].forward(species_aev)
               ~~~~~~~~~~~~~~~ <--- HERE
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/NNPOps/BatchedNN.py", line 99, in forward
        vectors = aev.unsqueeze(-2).unsqueeze(-1)

        vectors = batchedLinear(vectors, self.layer0_weights, self.layer0_biases) # Linear 0
                  ~~~~~~~~~~~~~ <--- HERE
        vectors = F.celu(vectors, alpha=0.1)                                      # CELU   1
        vectors = batchedLinear(vectors, self.layer2_weights, self.layer2_biases) # Linear 2
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper__bmm)


----------------------------------------------------------------------
Ran 1 test in 32.967s

FAILED (errors=1)

@peastman, any idea what is going wrong here? Or perhaps @raimis knows what is wrong.

Alternatively, if I try to run this without GPUs, it throws a runtime error:

======================================================================
ERROR: testCreateMixedSystem (__main__.TestMLPotential)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/lila/home/rufad/github/openmm-ml/test/TestMLPotential.py", line 17, in testCreateMixedSystem
    mixedSystem = potential.createMixedSystem(pdb.topology, mmSystem, mlAtoms, interpolate=False)
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/openmmml-1.0-py3.9.egg/openmmml/mlpotential.py", line 265, in createMixedSystem
    self._impl.addForces(topology, newSystem, atomList, forceGroup, **args)
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/openmmml-1.0-py3.9.egg/openmmml/models/anipotential.py", line 91, in addForces
    model = OptimizedTorchANI(model, species).to(device)
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 899, in to
    return self._apply(convert)
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 570, in _apply
    module._apply(fn)
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 616, in _apply
    self._buffers[key] = fn(buf)
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/torch/nn/modules/module.py", line 897, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/torch/cuda/__init__.py", line 214, in _lazy_init
    torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available

Do we generally want to make this package robust to all platform types, or only to CUDA?

@dominicrufa
Contributor Author

Also, when I add the interpolate=True argument to createMixedSystem and equip the system to a Context, it fails with

Traceback (most recent call last):
  File "/lila/home/rufad/nnpops/run.py", line 45, in <module>
    context.getState(getEnergy=True).getPotentialEnergy()
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/openmm/openmm.py", line 14580, in getState
    state = _openmm.Context_getState(self, types, enforcePeriodicBox, groups_mask)
openmm.OpenMMException: Error invoking kernel: CUDA_ERROR_INVALID_HANDLE (400)

when I call context.getState(getEnergy=True).getPotentialEnergy(), which I don't know how to debug; however, when I leave interpolate=False, I can pull the state and the potential energy without issue.

@peastman
Member

peastman commented Mar 4, 2022

It looks like a case where the model is on one device and the input tensor is on a different one.
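
As a minimal sketch of that failure mode (an illustration only, not code from openmm-ml or NNPOps; it assumes a CUDA device is available), a module whose parameters were left on the CPU raises exactly this error when handed a CUDA tensor, and moving the whole module to the input's device fixes it:

import torch

model = torch.nn.Linear(3, 1)              # parameters and buffers live on the CPU
positions = torch.ones(3, device='cuda')   # input lives on cuda:0

try:
    model(positions)                        # RuntimeError: ... cuda:0 and cpu ...
except RuntimeError as err:
    print(err)

model.to(positions.device)                  # moves every parameter and buffer to the GPU
print(model(positions))                     # now succeeds

In the traceback above, it looks like the weights used by BatchedLinear are the tensors that stayed on the CPU while the positions arrived on cuda:0.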

Alternatively, if I try to run this without GPUs, it throws a runtime error:

What exactly does that mean? Are you running it on a computer without a GPU? Or do you mean there is a GPU, but you're specifying the CPU platform when you create your context? The PyTorch plugin is supposed to work with all platforms.

@dominicrufa
Contributor Author

What exactly does that mean? Are you running it on a computer without a GPU? Or do you mean there is a GPU, but you're specifying the CPU platform when you create your context? The PyTorch plugin is supposed to work with all platforms.

Running on a computer without a GPU.

It looks like a case where the model is on one device and the input tensor is on a different one.

Right. I am just trying to figure out why this is the case and how to fix it. The platform is Reference in the test. Is this an error you see if you run it locally?

@peastman
Member

peastman commented Mar 4, 2022

TestMLPotential.py passes when I run it locally. Perhaps the problem is that you have CUDA installed (so PyTorch tries to use it), but you don't have any CUDA compatible GPUs (so it fails when it tries)?
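
If that is the cause, a hypothetical guard (not the current openmm-ml code) would be to pick the device from what PyTorch actually reports rather than assuming CUDA is usable, so a CUDA-enabled PyTorch build on a GPU-less node falls back to the CPU:

import torch

# Hypothetical device selection; falls back to the CPU when no usable GPU is present.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# e.g., in addForces (sketch only):
# model = OptimizedTorchANI(model, species).to(device)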

@dominicrufa
Contributor Author

dominicrufa commented Mar 7, 2022

Sorry, I should clarify: I am trying to run TestMLPotential with the nnpops mixin here and am observing the aforementioned errors. I'm not sure if this is an edge case, nor how to solve it.

@dominicrufa
Contributor Author

but you don't have any CUDA compatible GPUs (so it fails when it tries)?

I definitely have both, and I can make the nnpops implementation run without observing this issue. It only appears if I set interpolate=True in createMixedSystem.

@peastman
Member

peastman commented Mar 7, 2022

I think we're talking about different things, since a few different errors are described above. I was referring to the No CUDA GPUs are available error.

@dominicrufa
Contributor Author

I think we're talking about different things, since a few different errors are described above. I was referring to the No CUDA GPUs are available error.

Yes, I don't disagree that this is the case. This is not the blocking issue, just a passing observation (which you correctly clarified).

I am primarily concerned with integrating the nnpops-equipped TorchANI force with createMixedSystem using the interpolate=True argument; the above issue is not a concern, since I cannot test nnpops without a GPU anyway.

@dominicrufa
Contributor Author

dominicrufa commented Mar 7, 2022

#!/usr/bin/env python
import torch
import torchani
from NNPOps import OptimizedTorchANI
from openmmtools.testsystems import HostGuestExplicit
from openmmml.mlpotential import MLPotential
from simtk import openmm, unit
import time
import numpy as np
from simtk.openmm import LangevinMiddleIntegrator

temperature = 298.15 * unit.kelvin
frictionCoeff = 1. / unit.picosecond
stepSize = 1. * unit.femtoseconds
hgv = HostGuestExplicit(constraints=None)

potential = MLPotential('ani2x')
system = potential.createMixedSystem(hgv.topology, system=hgv.system, atoms=list(range(126, 156)), implementation='nnpops', interpolate=True)
print(f"done making system")
_int = LangevinMiddleIntegrator(temperature, frictionCoeff, stepSize)
context = openmm.Context(system, _int)
context.setPositions(hgv.positions)
# query and print out the global parameters:
swig_params = context.getParameters()
print(f"context parameters:")
for i in swig_params:
    print(i, swig_params[i])
context.getState(getEnergy=True).getPotentialEnergy()

@peastman, if I reduce the problem to this code snippet and pull main into this PR (so that I can use nnpops on GPU), then the snippet works with interpolate=False, but not with True. If I set it to True, I see:

Traceback (most recent call last):
  File "/lila/home/rufad/nnpops/run.py", line 45, in <module>
    context.getState(getEnergy=True).getPotentialEnergy()
  File "/home/rufad/miniconda3/envs/openmm-dev/lib/python3.9/site-packages/openmm/openmm.py", line 14580, in getState
    state = _openmm.Context_getState(self, types, enforcePeriodicBox, groups_mask)
openmm.OpenMMException: Error invoking kernel: CUDA_ERROR_INVALID_HANDLE (400)

and I'm not sure how to debug this.

@dominicrufa changed the title from "TestMLPotential.py fails" to "TestMLPotential.py fails with nnpops implementation" on Mar 7, 2022
@peastman
Member

peastman commented Mar 7, 2022

Let me make sure I understand. This error happens when all of the following are true:

  • You use interpolate=True.
  • You use the optimized implementation from NNPOps.
  • You use the CUDA platform.

If any one of those is not true, it works. Is that correct?

How does this relate to the original problem you described up at the top? That one produced an exception about tensors being on different devices, while this one produces CUDA_ERROR_INVALID_HANDLE.

@dominicrufa
Contributor Author

@peastman

Let me make sure I understand. This error happens when all of the following are true:

Correct.

If any one of those is not true, it works. Is that correct?

I don't know; I haven't tried all of the permutations. However, I need the latter two points to be true for my use cases. When all three are true, it fails; when only the latter two are true, it works.

How does this relate to the original problem you described up at the top? That one produced an exception about tensors being on different devices, while this one produces CUDA_ERROR_INVALID_HANDLE.

The precise error is not the point; after more digging (and showing this), I realized there is an edge case associated with running an nnpops-implemented System with the interpolate argument. Perhaps these should be different issues? The main thing is that I cannot seem to use these two functionalities together, which is a prerequisite for performing the energy-matching assertion in the test you wrote.

@peastman
Member

peastman commented Mar 8, 2022

Here's the error I get when running your example.

Traceback (most recent call last):
  File "test.py", line 21, in <module>
    context = openmm.Context(system, _int)
  File "/home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/openmm/openmm.py", line 5125, in __init__
    this = _openmm.new_Context(*args)
openmm.OpenMMException: Unknown device: 87. If you have recently updated the caffe2.proto file to add a new device type, did you forget to update the DeviceTypeName() function to reflect such recent changes?
Exception raised from DeviceTypeName at /tmp/pip-req-build-d1tk7kuo/c10/core/DeviceType.cpp:42 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6a (0x7f39ccd45dba in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xd8 (0x7f39ccd42338 in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::DeviceTypeName[abi:cxx11](c10::DeviceType, bool) + 0x309 (0x7f39ccd22169 in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #3: torch::jit::Unpickler::readInstruction() + 0x1d53 (0x7f3a141cb0a3 in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #4: torch::jit::Unpickler::run() + 0xa9 (0x7f3a141cb599 in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #5: torch::jit::Unpickler::parse_ivalue() + 0x2f (0x7f3a141cb7cf in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #6: torch::jit::readArchiveAndTensors(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<std::function<c10::StrongTypePtr (c10::QualifiedName const&)> >, c10::optional<std::function<c10::intrusive_ptr<c10::ivalue::Object, c10::detail::intrusive_target_default_null_type<c10::ivalue::Object> > (c10::StrongTypePtr, c10::IValue)> >, c10::optional<c10::Device>, caffe2::serialize::PyTorchStreamReader&) + 0x42c (0x7f3a1416faac in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x325cda5 (0x7f3a1416fda5 in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x32603cb (0x7f3a141733cb in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #9: torch::jit::load(std::shared_ptr<caffe2::serialize::ReadAdapterInterface>, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0x1c0 (0x7f3a14174560 in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #10: torch::jit::load(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0xc7 (0x7f3a141812f7 in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #11: TorchPlugin::TorchForceImpl::initialize(OpenMM::ContextImpl&) + 0x65 (0x7f398e74b1e5 in /usr/local/openmm/lib/libOpenMMTorch.so)
frame #12: OpenMM::ContextImpl::initialize() + 0x422 (0x7f39909b6a52 in /usr/local/openmm/lib/libOpenMM.so.7.7)
frame #13: OpenMM::Context::Context(OpenMM::System const&, OpenMM::Integrator&, OpenMM::ContextImpl&) + 0xf8 (0x7f39909b1228 in /usr/local/openmm/lib/libOpenMM.so.7.7)
frame #14: OpenMM::ContextImpl::createLinkedContext(OpenMM::System const&, OpenMM::Integrator&) + 0x31 (0x7f39909b4341 in /usr/local/openmm/lib/libOpenMM.so.7.7)
frame #15: OpenMM::CustomCVForceImpl::initialize(OpenMM::ContextImpl&) + 0x3b2 (0x7f39909c5482 in /usr/local/openmm/lib/libOpenMM.so.7.7)
frame #16: OpenMM::ContextImpl::initialize() + 0x422 (0x7f39909b6a52 in /usr/local/openmm/lib/libOpenMM.so.7.7)
frame #17: OpenMM::Context::Context(OpenMM::System const&, OpenMM::Integrator&) + 0x78 (0x7f39909b0fa8 in /usr/local/openmm/lib/libOpenMM.so.7.7)
frame #18: <unknown function> + 0x159676 (0x7f3990fca676 in /home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/openmm/_openmm.cpython-38-x86_64-linux-gnu.so)
<omitting python frames>
frame #36: __libc_start_main + 0xe7 (0x7f3a4fcf1c87 in /lib/x86_64-linux-gnu/libc.so.6)

Notice the message "Unknown device: 87" near the top. Each time I run it, there's a different number. That makes me think it might be a problem with uninitialized memory somewhere. I'm not sure where it's getting the number from though. The error happens in the first line of TorchForceImpl::initialize():

module = torch::jit::load(owner.getFile());

@peastman
Member

peastman commented Mar 8, 2022

The above was using the main branch, so it actually wasn't using the NNPOps optimized version. Strange...

@dominicrufa
Contributor Author

dominicrufa commented Mar 8, 2022

That's especially strange; I haven't encountered that. (I unintentionally closed the issue.) I can't tell if this is a version issue, but all of my packages come from conda:
omm_dev.txt
I'm going to play around with this a bit more before I give up.

@peastman
Member

peastman commented Mar 8, 2022

I think this may be an issue with incompatible versions of pytorch. Investigating...

@peastman
Member

peastman commented Mar 8, 2022

I was compiling OpenMM-Torch against a version of libtorch downloaded from https://pytorch.org, and I think it was incompatible with the one from conda. I needed to do that because the conda version was missing the CMake files needed to compile against it. I updated to the newest conda package (PyTorch 1.10.0), and now it does include the CMake files. But when I try to compile against it, all the test cases fail to build with the errors

/home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so: undefined reference to `std::__cxx11::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >::basic_ostringstream()@GLIBCXX_3.4.26'
/home/peastman/miniconda3/envs/openmm/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so: undefined reference to `std::__cxx11::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >::basic_stringstream()@GLIBCXX_3.4.26'
collect2: error: ld returned 1 exit status

@jchodera
Member

jchodera commented Mar 9, 2022

The packages installed to build pytorch can differ from the packages installed to run it when you just conda install the package. Is it possible that you need to install some of those to build things with pytorch?

@peastman
Member

peastman commented Mar 9, 2022

I don't think so. The link errors refer to standard C++ functions. Usually that indicates a binary incompatibility of some sort, either libraries were compiled with different ABIs or different versions of libstdc++.

@jchodera
Member

I was thinking that it might be trying to use your system libraries instead of the conda-forge built libraries installed via the packages that appear in the build: dependencies but not in the run: dependencies.

@dominicrufa
Contributor Author

@peastman: I was playing around with the nnpops implementation and discovered that the error thrown here might somehow be a consequence of placing the TorchForce into a CustomCVForce, as you did here.

If I set interpolate=False and replace your ANIForce implementation with

class ANIForce(torch.nn.Module):

    def __init__(self, model, species, atoms):
        super(ANIForce, self).__init__()
        self.model = model
        self.species = species
        self.energyScale = torchani.units.hartree2kjoulemol(1)

        if atoms is None:
            self.indices = None
        else:
            self.indices = torch.tensor(atoms, dtype=torch.int64)

        self.model = model
        self.pbc = torch.tensor([True, True, True], dtype=torch.bool)

    def forward(self, positions, boxvectors: Optional[torch.Tensor] = None, scale: Optional[torch.Tensor] = None):

and add a scale GlobalParameter like this:

force = openmmtorch.TorchForce(filename)
force.setForceGroup(forceGroup)
if topology.getPeriodicBoxVectors() is not None:
    force.setUsesPeriodicBoundaryConditions(True)
force.addGlobalParameter('scale', 1.)
system.addForce(force)

I can manipulate the global parameter and make calls to state.getPotentialEnergy() without seeing the openmm.OpenMMException: Error invoking kernel: CUDA_ERROR_INVALID_HANDLE (400).
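
For reference, here is a hypothetical sketch of how the forward body could consume that scale parameter (the actual body is not shown above; this continues the ANIForce snippet, assumes import torch and from typing import Optional as in anipotential.py, and omits unit conversion and reshaping of positions):

    def forward(self, positions, boxvectors: Optional[torch.Tensor] = None, scale: Optional[torch.Tensor] = None):
        # hypothetical body, mirroring the quoted anipotential.py forward
        if self.indices is not None:
            positions = positions[self.indices]
        if boxvectors is None:
            _, energy = self.model((self.species, positions))
        else:
            self.pbc = self.pbc.to(positions.device)
            _, energy = self.model((self.species, positions), cell=10.0*boxvectors.to(torch.float32), pbc=self.pbc)
        energy = energy * self.energyScale              # Hartree --> kJ/mol
        if scale is not None:
            energy = scale * energy                     # controlled by the 'scale' global parameter
        return energy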

I'm not sure how easy it would be to find the root cause of the CustomCVForce error, but I wonder if the createMixedSystem function here might be modified to not place the TorchForce into the CustomCVForce, and just leave it as a separate force (with the scale GlobalParameter still equipped). It would be a temporary workaround, but functionally I don't think it would be any different.

Your thoughts?

@jchodera
Member

@peastman: Since it will take a while to establish why putting a TorchForce inside a CustomCVForce throws an OpenMMException, could you make the change @dominicrufa suggests now so we can start using openmm-ml while this is being debugged?

@peastman
Member

@dominicrufa could you post the output of conda list in your environment? Also, what are CUDA_SDK_ROOT_DIR and CUDA_TOOLKIT_ROOT_DIR set to in CMake?

@dominicrufa
Contributor Author

@peastman, my conda list is in this comment.

Also, what are CUDA_SDK_ROOT_DIR and CUDA_TOOLKIT_ROOT_DIR set to in CMake?

If you are referring to my OpenMM installation, I am using a nightly build from conda-forge; I'm not building from source.

@peastman
Member

I'm referring to the OpenMM-Torch plugin. Do you build it from source or install with conda?

@dominicrufa
Contributor Author

Conda. Everything is installed with conda. openmm-torch will pin the conda-forge release of OpenMM. Once everything but OpenMM is installed, you have to force-install the omnia-dev version of OpenMM so it plays nicely with openmm-torch.

@jchodera
Member

Is there an issue with the build environments of openmm from omnia-dev not being fully matched with the conda-forge build infrastructure? Or do we think this issue is independent of build version incompatibilities?

@dominicrufa
Contributor Author

@peastman, is this specifically what is throwing the openmm.OpenMMException: Error invoking kernel: CUDA_ERROR_INVALID_HANDLE (400) error? Would I still be seeing this if the context were being restored correctly?

@peastman
Member

Correct. If I manually restore the context, the error goes away. But if I then follow with a second energy evaluation, we get a CUDA error inside PyTorch.

@dominicrufa
Contributor Author

@peastman, if it is indeed a PyTorch bug, would it make more sense to use this hack in the meantime, since the time horizon for the PyTorch bugfix is unknown? I only say this because this issue is blocking for me. If you'd prefer to avoid the hack, I'll open a PR applying it (for reference's sake) as a temporary workaround that I can integrate into my downstream workflow.

@jchodera
Member

It looks to me like this may involve a bug in PyTorch. It seems to be messing up the CUDA context.

Perhaps the NVIDIA folks like @dmclark17 might be able to help us here since it involves a few community codes?

@dmclark17

Sure—I can do some investigating and try to reproduce on my end.

I've tried various ways of restoring the context. They fix the CUDA error coming from OpenMM code, but then lead to CUDA errors in PyTorch code.

I'm still getting up to speed on how contexts are being handled here—have you tried popping the current context before the PyTorch code and then pushing it afterwards?

@peastman
Member

Contexts are handled by the ContextSelector class. It pushes the context in its constructor and pops it in the destructor. To use it, you create an instance as a local variable. The context is current from that line to the end of the enclosing block.
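
Schematically, the pattern is an RAII guard along these lines (a simplified stand-in for illustration, not OpenMM's actual ContextSelector implementation):

#include <cuda.h>

class ScopedContext {
public:
    explicit ScopedContext(CUcontext ctx) { cuCtxPushCurrent(ctx); }   // constructor pushes the context
    ~ScopedContext() { CUcontext popped; cuCtxPopCurrent(&popped); }   // destructor pops it again
};

void compute(CUcontext ctx) {
    ScopedContext selector(ctx);   // the context is current from here...
    // ... launch kernels ...
}                                  // ...to the end of the enclosing block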

Here is the method where the problem occurs.

https://github.com/openmm/openmm-torch/blob/84f7d884ec0d9d72a57a769046bdddd1d62b8fc2/platforms/cuda/src/CudaTorchKernels.cpp#L80-L158

There are ContextSelectors to set the context for two short blocks, one in lines 97-101 and another in lines 145-150. It does not set a context at the point where the PyTorch model is invoked (either line 114 or 119). And usually that works.

But it fails when the TorchForce is inside a CustomCVForce. In that case, this whole method is called from https://github.com/openmm/openmm/blob/c7af17c8ba2b6c3667e5126b494d1972b1b6d254/platforms/common/src/CommonKernels.cpp#L5389. The invoking method has already placed a context onto the stack, and PyTorch removes it.

This does suggest a workaround: possibly we could modify the implementation of CustomCVForce to not have a context set when it calls calcForcesAndEnergy(). That might work as long as nothing at an even higher level has set a context. But of course, the whole point of having a stack of contexts is so that you don't have to worry about that.
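
A rough sketch of that workaround idea (illustration only, not the actual change that went into OpenMM; innerCompute is a hypothetical stand-in for the nested TorchForce evaluation):

#include <cuda.h>

void invokeNestedForce(void (*innerCompute)()) {
    CUcontext saved = nullptr;
    cuCtxPopCurrent(&saved);       // leave no context current for the nested call
    innerCompute();                // PyTorch can then manage its own context
    if (saved != nullptr)
        cuCtxPushCurrent(saved);   // restore the caller's context afterwards
}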

@peastman
Member

The workaround is in openmm/openmm#3533.

@dmclark17

Thanks for the explanation!

I'm trying to create a standalone reproducer to make sure I understand and can communicate the issue. I am loading in a simple model that multiplies an input tensor by two. I created it using the following:

import torch

class TestModule(torch.nn.Module):
    def forward(self, input):
        return 2 * input

module = torch.jit.script(TestModule())
module.save('model.pt')

The C++ code looks like this:

#include <cuda.h>

#include <torch/torch.h>
#include <torch/script.h>

#include <stdio.h>

void printContext(const char *msg) {
  CUcontext context;
  CUresult res = cuCtxGetCurrent(&context);
  printf("Context %d. Code %d. %s\n", context, res, msg);
}

int main() {
  cuInit(0);

  CUcontext ctx, myContext;
  CUdevice dev;
  CUresult res;

  cuDeviceGet(&dev, 0);
  cuCtxCreate(&ctx, CU_CTX_SCHED_SPIN, dev);
  printContext("After creation");

  torch::jit::script::Module module = torch::jit::load("../model.pt");
  module.to(at::kCUDA);
  printContext("After loading torchscript");

  std::vector<torch::jit::IValue> inputs;
  inputs.push_back(torch::ones({1,}).to(at::kCUDA));
  at::Tensor output = module.forward(inputs).toTensor();
  printContext("After run");
}

I am seeing the following output:

Context 1471272896. Code 0. After creation
Context 1471272896. Code 0. After loading torchscript
Context 1471272896. Code 0. After run

In this case, it doesn't seem like PyTorch is changing the context. On the other hand, if there isn't a current context when the JIT module is executed, it seems like PyTorch creates a new context and leaves it on the stack. It doesn't seem like this is the expected behavior, given the error observed with OpenMM-Torch. Do you have any ideas on how to make the example more realistic? Thanks!

@peastman
Member

If you move the lines that load the module up to the top of main(), you can reproduce the problem. That matches what happens in OpenMM: the module gets loaded while creating the System, and cuInit() gets called later when you create the Context. The following version also adds a call to cuCtxPushCurrent() to even better match what happens in the real code.

int main() {
  torch::jit::script::Module module = torch::jit::load("../model.pt");
  module.to(at::kCUDA);

  cuInit(0);

  CUcontext ctx, myContext;
  CUdevice dev;
  CUresult res;

  cuDeviceGet(&dev, 0);
  cuCtxCreate(&ctx, CU_CTX_SCHED_SPIN, dev);
  printContext("After creation");

  cuCtxPushCurrent(ctx);
  printContext("After push");

  std::vector<torch::jit::IValue> inputs;
  inputs.push_back(torch::ones({1,}).to(at::kCUDA));
  at::Tensor output = module.forward(inputs).toTensor();
  printContext("After run");
}

Here is the output I get.

Context 319176160. Code 0. After creation
Context 319176160. Code 0. After push
Context 319165568. Code 0. After run

@dmclark17

Interesting—I am not able to reproduce that on my end; with that ordering, I am seeing:

Context 255. Code 3. After loading torchscript. Expected error code 3 for not initialized
Context 1472714976. Code 0. After creation
Context 1472714976. Code 0. After push
Context 1472714976. Code 0. After run

I think I'm using the CUDA toolkit that conda installed with PyTorch—I'm not sure if that could be causing the difference.

@peastman
Member

What version of PyTorch do you have? I was testing with 1.9.1.

@dmclark17

I am using 1.11.0 and linking to the libtorch that comes with the conda installation. I will try using 1.9.1!

@dominicrufa
Contributor Author

@peastman, can you merge the workaround into OpenMM's main, or are we anticipating a PyTorch bug fix?

@peastman
Member

Merged. We should still figure out what's going on with PyTorch, but it should fix the immediate problem.

What version of PyTorch were you using when you encountered the problem?

@dominicrufa
Contributor Author

pytorch                   1.10.0          cuda112py39h3ad47f5_1    conda-forge
pytorch-gpu               1.10.0          cuda112py39h0bbbad9_1    conda-forge

@peastman, were you able to see the problem with nnpops equipped, specifically?
If so, would you be able to push your modifications and commit to main of this repo? Otherwise, I can do it if you can review it afterward.

@peastman
Member

were you able to see the problem with nnpops equipped, specifically?

Yes.

would you be able to push your modifications and commit to main of this repo?

I didn't make any changes to code in this repo.

@dominicrufa
Contributor Author

@peastman, which pull request did you use to reproduce the problem?

@peastman
Member

The one you said to use, #21.

@dominicrufa
Contributor Author

Right, yes. Sorry for the confusion. I think it just needs to be rebased onto main and merged so that the TorchANI force can be equipped with nnpops, but I don't have write permissions to that PR. I can pull it into my PR and rebase/request a merge into main if you'd prefer.

@dmclark17

What version of PyTorch do you have? I was testing with 1.9.1.

I am able to reproduce the issue with 1.9.0:

Context 255. Code 3. After loading torchscript. Expected error code 3 for not initialized
Context 1470519632. Code 0. After creation
Context 1470519632. Code 0. After push
Context 1470509040. Code 0. After run

I am not seeing anything about CUDA contexts in the 1.11.0 release notes.

@dmclark17

I've been looking into the difference between PyTorch 1.9 and 1.11, and it seems like 1.9 calls cudaSetDevice(0) when the JIT module is invoked, which initializes the primary context. However, this API call is absent in 1.11, which explains why the standalone example doesn't reproduce the issue there. I'll see if I can find the responsible code change.

Would it be possible to try to reproduce the original bug with PyTorch 1.11 to see if it is fixed? I need to use #21 to reproduce, correct?

@jchodera
Member

@dominicrufa: Was this fixed?

@dominicrufa
Contributor Author

Closing, as this is fixed in main.
