TestMLPotential.py fails #27

Closed
wiederm opened this issue May 16, 2022 · 21 comments · Fixed by #28

@wiederm

wiederm commented May 16, 2022

Sorry for the cross-package issue. I think this might involve openmm-torch, but I get the error when executing the test script of openmm-ml, so I am posting here. Running the test script, I get:

======================================================================
ERROR: testCreateMixedSystem (__main__.TestMLPotential)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mwieder/openmm-ml/test/TestMLPotential.py", line 19, in testCreateMixedSystem
    mixedContext = mm.Context(mixedSystem, mm.VerletIntegrator(0.001), platform)
  File "/data/shared/software/python_env/anaconda3/envs/rew/lib/python3.9/site-packages/openmm/openmm.py", line 16230, in __init__
    _openmm.Context_swiginit(self, _openmm.new_Context(*args))
openmm.OpenMMException: Specified a Platform for a Context which does not support all required kernels

I am not sure where the problem originates. I built openmm-torch from source against the nightly build of OpenMM, and it seemed to pass all the necessary tests. But when running make PythonInstall I get a lot of warnings (although it completes successfully):

[100%] Generating TorchPluginWrapper.cpp
/data/shared/software/python_env/anaconda3/envs/rew/include/swig/OpenMMSwigHeaders.i:2242: Warning 314: 'None' is a python keyword, renaming to '_None'
/data/shared/software/python_env/anaconda3/envs/rew/include/swig/OpenMMSwigHeaders.i:496: Warning 453: Can't apply (std::vector< double > &OUTPUT). No typemaps are defined.
/data/shared/software/python_env/anaconda3/envs/rew/include/swig/OpenMMSwigHeaders.i:503: Warning 453: Can't apply (OpenMM::Context &OUTPUT). No typemaps are defined.
/data/shared/software/python_env/anaconda3/envs/rew/include/swig/OpenMMSwigHeaders.i:538: Warning 453: Can't apply (std::vector< double > &OUTPUT). No typemaps are defined.
...

Is this expected?

@peastman
Member

Which platform are you using?

This probably means a plugin is failing to load, most likely because a dependent library can't be found. What is the value of Platform.getPluginLoadFailures()?
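
A minimal sketch of the check suggested above, assuming OpenMM is importable as `openmm`. `platform_report` is a hypothetical helper name introduced here for illustration; the `Platform` calls are static methods of the OpenMM Python API. If OpenMM is not installed, the helper simply returns `None`.

```python
def platform_report():
    """Collect the loaded platforms and any plugin load failures.

    Returns None when OpenMM cannot be imported in this environment.
    """
    try:
        import openmm as mm
    except ImportError:
        return None
    # Names of every platform that was loaded successfully.
    platforms = [
        mm.Platform.getPlatform(i).getName()
        for i in range(mm.Platform.getNumPlatforms())
    ]
    # One error string per plugin library that could not be loaded.
    failures = list(mm.Platform.getPluginLoadFailures())
    return {"platforms": platforms, "failures": failures}


report = platform_report()
if report is None:
    print("OpenMM is not importable in this environment")
else:
    print("Platforms:", report["platforms"])
    for message in report["failures"]:
        print("Load failure:", message)
```

An empty list of failures suggests that all plugins in the plugin directory loaded; a non-empty list usually names the shared library that could not be resolved.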

@wiederm
Author

wiederm commented May 16, 2022

I am using the CUDA platform.
The output is:

'Error loading library /data/shared/software/python_env/anaconda3/envs/rew/lib/plugins/libOpenMMTorchCUDA.so: libtorch.so: cannot open shared object file: No such file or directory', 'Error loading library /data/shared/software/python_env/anaconda3/envs/rew/lib/plugins/libOpenMMTorchOpenCL.so: libtorch.so: cannot open shared object file: No such file or directory', 'Error loading library /data/shared/software/python_env/anaconda3/envs/rew/lib/plugins/libOpenMMTorchReference.so: libtorch.so: cannot open shared object file: No such file or directory'

and
ldd libOpenMMTorchCUDA.so
shows

/data/shared/software/python_env/anaconda3/envs/rew/lib/plugins/libOpenMMTorchCUDA.so: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /data/shared/software/python_env/anaconda3/envs/rew/lib/libOpenMM.so.7.7)
/data/shared/software/python_env/anaconda3/envs/rew/lib/plugins/libOpenMMTorchCUDA.so: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.26' not found (required by /data/shared/software/python_env/anaconda3/envs/rew/lib/libOpenMM.so.7.7)
	linux-vdso.so.1 (0x00007ffe07194000)
	libcudart.so.10.2 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.10.2 (0x00007f2e3f3bc000)
	libOpenMM.so.7.7 => not found
	libOpenMMCUDA.so => not found
	libOpenMMTorch.so (0x00007f2e3f1b2000)
	libtorch.so => not found
	libtorch_cpu.so => not found
	libtorch_cuda.so => not found
	libc10.so => not found
[...]

It seems a few shared objects can't be found.
I will investigate!
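
As a rough illustration of reading the `ldd` output above, the following sketch extracts the libraries that the dynamic loader could not resolve. `missing_libraries` is a hypothetical helper, and the sample text is a shortened copy of the output quoted in this comment.

```python
import re


def missing_libraries(ldd_output):
    """Return library names that ldd reports as 'not found' (order kept, no duplicates)."""
    missing = []
    for line in ldd_output.splitlines():
        # Matches lines such as "    libtorch.so => not found".
        match = re.match(r"\s*(\S+)\s+=>\s+not found\s*$", line)
        if match and match.group(1) not in missing:
            missing.append(match.group(1))
    return missing


sample = """\
    linux-vdso.so.1 (0x00007ffe07194000)
    libcudart.so.10.2 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.10.2 (0x00007f2e3f3bc000)
    libOpenMM.so.7.7 => not found
    libtorch.so => not found
    libc10.so => not found
"""
print(missing_libraries(sample))
```

Each name in the result is a library whose directory is neither on the default search path nor in LD_LIBRARY_PATH.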

@peastman
Member

See the last paragraph of openmm/openmm-torch#67. You need to add the pytorch lib directory to your LD_LIBRARY_PATH. Assuming you installed it with conda, that's probably something like /data/shared/software/python_env/anaconda3/envs/rew/lib/python3.9/site-packages/torch/lib.
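
One way to find that directory without guessing the path is sketched below. It assumes PyTorch is installed as the `torch` package in the active environment and ships its shared libraries in a `lib` subdirectory; `torch_lib_dir` is a hypothetical helper name. The script only prints an `export` line, because LD_LIBRARY_PATH generally needs to be set in the shell before the Python process starts.

```python
import importlib.util
import os
from pathlib import Path


def torch_lib_dir():
    """Return <site-packages>/torch/lib if PyTorch is installed, else None."""
    # find_spec locates the package on disk without importing it.
    spec = importlib.util.find_spec("torch")
    if spec is None or not spec.submodule_search_locations:
        return None
    lib = Path(next(iter(spec.submodule_search_locations))) / "lib"
    return lib if lib.is_dir() else None


lib = torch_lib_dir()
if lib is None:
    print("PyTorch was not found in this environment")
else:
    current = os.environ.get("LD_LIBRARY_PATH", "")
    # Prepend the torch lib directory to whatever is already set.
    value = ":".join([str(lib)] + ([current] if current else []))
    print(f'export LD_LIBRARY_PATH="{value}"')
```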

@wiederm
Author

wiederm commented May 16, 2022

That solved it! Thank you for your help!

@wiederm wiederm closed this as completed May 16, 2022
@wiederm
Author

wiederm commented May 16, 2022

If I change the device from Reference to CUDA I see the following error:

  File "/data/shared/software/python_env/anaconda3/envs/rew/lib/python3.9/site-packages/torchani/aev.py", line 114, in compute_shifts
    inv_distances = reciprocal_cell.norm(2, -1)
    num_repeats = torch.ceil(cutoff * inv_distances).to(torch.long)
    num_repeats = torch.where(pbc, num_repeats, num_repeats.new_zeros(()))
                  ~~~~~~~~~~~ <--- HERE
    r1 = torch.arange(1, num_repeats[0].item() + 1, device=cell.device)
    r2 = torch.arange(1, num_repeats[1].item() + 1, device=cell.device)
RuntimeError: Expected condition, x and y to be on the same device, but condition is on cpu and x and y are on cuda:0 and cuda:0 respectively


----------------------------------------------------------------------
Ran 1 test in 26.320s

FAILED (errors=1)

I think this is consistent with what has been reported here, and a fix has been merged here as far as I can tell.
Is that fix included in the omnia dev build?

@wiederm wiederm reopened this May 16, 2022
@peastman
Member

Yes, the fix ought to be in the latest dev build.

@wiederm
Author

wiederm commented May 16, 2022

Just to make sure I am doing this correctly:
I installed the dev build with:
conda install -c omina-dev openmm
and this is the version that is installed:

 # Name                    Version                   Build  Channel
openmm                    7.8             py39_cuda102_debug_1    omnia-dev
openmmml                  1.0                      pypi_0    pypi
openmmtorch               1.0                      pypi_0    pypi

@peastman
Member

omnia-dev, not omina-dev. But otherwise, yes. The dev builds are broken at the moment, so the most recent one is from a few weeks ago. But that should still have the fix.

@wiederm
Author

wiederm commented May 17, 2022

I think the fix might not be in the omnia-dev build I am using.
As far as I can tell, the openmm-7.8 build for py39 and cuda102 was uploaded 2 months ago (March 17), while the fix was merged on March 28.

@wiederm
Author

wiederm commented May 17, 2022

I tried to install the linux-64/openmm-7.8-py39_cuda110_1.tar.bz2 build with conda, but the usual commands do not select it.
For example, conda install -c omnia-dev openmm cudatoolkit=11.0 still tries to install py39_cuda102_debug_1.
Am I missing something here?

@peastman
Member

It looks like for the last couple of months, it was only creating dev builds for CUDA 11. We really need to get them building again.

@wiederm
Author

wiederm commented May 17, 2022

I have now compiled the openMM master branch and openmm-torch from source, but the error is still the same:

  File "/data/shared/software/python_env/anaconda3/envs/rew/lib/python3.9/site-packages/torchani/aev.py", line 114, in compute_shifts
    inv_distances = reciprocal_cell.norm(2, -1)
    num_repeats = torch.ceil(cutoff * inv_distances).to(torch.long)
    num_repeats = torch.where(pbc, num_repeats, num_repeats.new_zeros(()))
                  ~~~~~~~~~~~ <--- HERE
    r1 = torch.arange(1, num_repeats[0].item() + 1, device=cell.device)
    r2 = torch.arange(1, num_repeats[1].item() + 1, device=cell.device)
RuntimeError: Expected condition, x and y to be on the same device, but condition is on cpu and x and y are on cuda:0 and cuda:0 respectively

When I compile from source and install using make install and make PythonInstall, it takes the current state of the master branch, right?

@peastman
Member

It ought to have the fix. Make sure you're really using the version you compiled, and that conda hasn't installed another copy automatically.

@wiederm
Author

wiederm commented May 17, 2022

I think it is using the compiled version. I was careful not to install anything that would bring in openMM as a dependency. Also, the package build/channel tags indicate pypi, which I guess was used in make PythonInstall.
conda list openmm returns:

# packages in environment at /data/shared/software/python_env/anaconda3/envs/rew:
#
# Name                    Version                   Build  Channel
openmm                    7.7.0                    pypi_0    pypi
openmmml                  1.0                      pypi_0    pypi
openmmtorch               1.0                      pypi_0    pypi

I also double-checked that the correct openMM version is loaded in the script and it all points to the correct conda environment. Is there anything else that I can check?

@wiederm
Author

wiederm commented May 18, 2022

I did some double-checking just to make sure that I am not using a different OpenMM version behind the scenes.
With the conda environment in which I installed OpenMM from source activated, openmm.__path__ points to /data/shared/software/python_env/anaconda3/envs/rew/lib/python3.9/site-packages/openmm. That's the correct path in the environment.
The file /data/shared/software/python_env/anaconda3/envs/rew/lib/python3.9/site-packages/openmm/version.py has the correct git_revision hash: fb0360604800bba836be24cd6e8adce8b22b258a (https://github.com/openmm/openmm/tree/fb0360604800bba836be24cd6e8adce8b22b258a).
I also incremented the version number to 7.7.1 in the Makefile and after compiling and installing I got the updated version number when calling openmm.__version__.
I think this all indicates that I am using the compiled openMM version, right?
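
The checks described above can be approximated with the standard library alone. This is a sketch, not a substitute for inspecting version.py: `describe_install` is a hypothetical helper, and the assumption that the import names are `openmm`, `openmmtorch`, and `openmmml` is taken from the package names shown in this thread.

```python
import importlib.util
from importlib import metadata


def describe_install(package, dist_name=None):
    """Report where a package would be imported from and its installed version.

    Returns None if the package cannot be found on the import path.
    """
    spec = importlib.util.find_spec(package)
    if spec is None:
        return None
    try:
        version = metadata.version(dist_name or package)
    except metadata.PackageNotFoundError:
        # No distribution metadata under that name.
        version = None
    return {"origin": spec.origin, "version": version}


for name in ("openmm", "openmmtorch", "openmmml"):
    print(name, describe_install(name))
```

If the reported origin lies outside the active environment, a second copy of the package is probably shadowing the one that was compiled.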

@peastman
Member

That sounds like you have the right version. I'd like to see if I can reproduce it. What versions of Pytorch and CUDA are you using?

@wiederm
Author

wiederm commented May 20, 2022

To make matters a bit simpler I am now using the conda openmm_dev openMM package, but the error is still the same. I have confirmed that mm.__path__ points to the correct conda environment and the full_version tag is 7.7.0.dev-109f6b2.
I am using cudatoolkit=11.3 and pytorch=1.10; the exported conda environment and the pytest error report are attached. The libtorch C++ library I am using is libtorch-cxx11-abi-shared-with-deps-1.10.0%2Bcu113.zip

env_error.zip

@peastman
Member

It's working for me. What version of the OpenMM-ML code are you using? I'm testing with the latest code from the main branch.

Can you post the complete output of running the test?

@wiederm
Author

wiederm commented May 20, 2022

Yes, I am also testing with the latest code from the main branch.
I am installing with pip install git+https://github.com/openmm/openmm-ml.git.

And, just to make sure we are talking about the same thing: the test runs fine on the Reference or CPU platform, but changing to CUDA returns the described error.

The full output is:

(rew-test) [mwieder@a7srv5 test 💡 ](main)$ pytest TestMLPotential.py 
============================================================================================================ test session starts =============================================================================================================
platform linux -- Python 3.9.12, pytest-7.1.2, pluggy-1.0.0
rootdir: /home/mwieder/openmm-ml
collected 1 item                                                                                                                                                                                                                             

TestMLPotential.py F                                                                                                                                                                                                                   [100%]

================================================================================================================== FAILURES ==================================================================================================================
___________________________________________________________________________________________________ TestMLPotential.testCreateMixedSystem ____________________________________________________________________________________________________

self = <TestMLPotential.TestMLPotential testMethod=testCreateMixedSystem>

    def testCreateMixedSystem(self):
        pdb = app.PDBFile('alanine-dipeptide-explicit.pdb')
        ff = app.ForceField('amber14-all.xml', 'amber14/tip3pfb.xml')
        mmSystem = ff.createSystem(pdb.topology, nonbondedMethod=app.PME)
        potential = MLPotential('ani2x')
        mlAtoms = [a.index for a in next(pdb.topology.chains()).atoms()]
        mixedSystem = potential.createMixedSystem(pdb.topology, mmSystem, mlAtoms, interpolate=False)
        interpSystem = potential.createMixedSystem(pdb.topology, mmSystem, mlAtoms, interpolate=True)
        # platform = mm.Platform.getPlatformByName('Reference')
        platform = mm.Platform.getPlatformByName('CUDA')
        mmContext = mm.Context(mmSystem, mm.VerletIntegrator(0.001), platform)
        mixedContext = mm.Context(mixedSystem, mm.VerletIntegrator(0.001), platform)
        interpContext = mm.Context(interpSystem, mm.VerletIntegrator(0.001), platform)
        mmContext.setPositions(pdb.positions)
        mixedContext.setPositions(pdb.positions)
        interpContext.setPositions(pdb.positions)
        mmEnergy = mmContext.getState(getEnergy=True).getPotentialEnergy().value_in_unit(unit.kilojoules_per_mole)
>       mixedEnergy = mixedContext.getState(getEnergy=True).getPotentialEnergy().value_in_unit(unit.kilojoules_per_mole)

TestMLPotential.py:31: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <openmm.openmm.Context; proxy of <Swig Object of type 'OpenMM::Context *' at 0x7fb7a2b5cd50> >, getPositions = False, getVelocities = False, getForces = False, getEnergy = True, getParameters = False
getParameterDerivatives = False, getIntegratorParameters = False, enforcePeriodicBox = False, groups = -1

    def getState(self, getPositions=False, getVelocities=False,
                 getForces=False, getEnergy=False, getParameters=False,
                 getParameterDerivatives=False, getIntegratorParameters=False,
                 enforcePeriodicBox=False, groups=-1):
        """Get a State object recording the current state information stored in this context.
    
        Parameters
        ----------
        getPositions : bool=False
            whether to store particle positions in the State
        getVelocities : bool=False
            whether to store particle velocities in the State
        getForces : bool=False
            whether to store the forces acting on particles in the State
        getEnergy : bool=False
            whether to store potential and kinetic energy in the State
        getParameters : bool=False
            whether to store context parameters in the State
        getParameterDerivatives : bool=False
            whether to store parameter derivatives in the State
        getIntegratorParameters : bool=False
            whether to store integrator parameters in the State
        enforcePeriodicBox : bool=False
            if false, the position of each particle will be whatever position
            is stored in the Context, regardless of periodic boundary conditions.
            If true, particle positions will be translated so the center of
            every molecule lies in the same periodic box.
        groups : set={0,1,2,...,31}
            a set of indices for which force groups to include when computing
            forces and energies. The default value includes all groups. groups
            can also be passed as an unsigned integer interpreted as a bitmask,
            in which case group i will be included if (groups&(1<<i)) != 0.
        """
        try:
    # is the input integer-like?
            groups_mask = int(groups)
        except TypeError:
            if isinstance(groups, set):
    # nope, okay, then it should be an set
                groups_mask = functools.reduce(operator.or_,
                        ((1<<x) & 0xffffffff for x in groups))
            else:
                raise TypeError('%s is neither an int nor set' % groups)
        if groups_mask >= 0x80000000:
            groups_mask -= 0x100000000
        types = 0
        if getPositions:
            types += State.Positions
        if getVelocities:
            types += State.Velocities
        if getForces:
            types += State.Forces
        if getEnergy:
            types += State.Energy
        if getParameters:
            types += State.Parameters
        if getParameterDerivatives:
            types += State.ParameterDerivatives
        if getIntegratorParameters:
            types += State.IntegratorParameters
>       state = _openmm.Context_getState(self, types, enforcePeriodicBox, groups_mask)
E       openmm.OpenMMException: The following operation failed in the TorchScript interpreter.
E       Traceback of TorchScript, serialized code (most recent call last):
E         File "code/__torch__/openmmml/models/anipotential/___torch_mangle_14.py", line 34, in forward
E             _6 = torch.mul(boxvectors1, 10.)
E             pbc = self.pbc
E             _7, energy1, = (model0).forward(_5, _6, pbc, )
E                             ~~~~~~~~~~~~~~~ <--- HERE
E             energy = energy1
E           energyScale = self.energyScale
E         File "code/__torch__/torchani/models.py", line 32, in forward
E             pass
E           aev_computer = self.aev_computer
E           species_aevs = (aev_computer).forward(species_coordinates0, cell, pbc, )
E                           ~~~~~~~~~~~~~~~~~~~~~ <--- HERE
E           neural_networks = self.neural_networks
E           species_energies = (neural_networks).forward(species_aevs, None, None, )
E         File "code/__torch__/torchani/aev.py", line 68, in forward
E               ops.prim.RaiseException("AssertionError: ")
E               cell3, pbc0 = _1, _1
E             shifts = _0(cell3, pbc0, 5.0999999999999996, )
E                      ~~ <--- HERE
E             triu_index0 = self.triu_index
E             aev1 = __torch__.torchani.aev.compute_aev(species, coordinates, triu_index0, (self).constants(), (7, 16, 112, 32, 896), (cell3, shifts), )
E         File "code/__torch__/torchani/aev.py", line 163, in compute_shifts
E         num_repeats = torch.to(_34, 4)
E         _35 = torch.new_zeros(num_repeats, annotate(List[int], []))
E         num_repeats0 = torch.where(pbc, num_repeats, _35)
E                        ~~~~~~~~~~~ <--- HERE
E         _36 = torch.item(torch.select(num_repeats0, 0, 0))
E         r1 = torch.arange(1, torch.add(_36, 1), dtype=None, layout=None, device=ops.prim.device(cell))
E       
E       Traceback of TorchScript, original code (most recent call last):
E         File "/data/shared/software/python_env/anaconda3/envs/rew-test/lib/python3.9/site-packages/openmmml/models/anipotential.py", line 111, in forward
E                       else:
E                           boxvectors = boxvectors.to(torch.float32)
E                           _, energy = self.model((self.species, 10.0*positions.unsqueeze(0)), cell=10.0*boxvectors, pbc=self.pbc)
E                                       ~~~~~~~~~~ <--- HERE
E                       return self.energyScale*energy
E         File "/data/shared/software/python_env/anaconda3/envs/rew-test/lib/python3.9/site-packages/torchani/models.py", line 106, in forward
E                   raise ValueError(f'Unknown species found in {species_coordinates[0]}')
E           
E               species_aevs = self.aev_computer(species_coordinates, cell=cell, pbc=pbc)
E                              ~~~~~~~~~~~~~~~~~ <--- HERE
E               species_energies = self.neural_networks(species_aevs)
E               return self.energy_shifter(species_energies)
E         File "/data/shared/software/python_env/anaconda3/envs/rew-test/lib/python3.9/site-packages/torchani/aev.py", line 532, in forward
E                   assert (cell is not None and pbc is not None)
E                   cutoff = max(self.Rcr, self.Rca)
E                   shifts = compute_shifts(cell, pbc, cutoff)
E                            ~~~~~~~~~~~~~~ <--- HERE
E                   aev = compute_aev(species, coordinates, self.triu_index, self.constants(), self.sizes, (cell, shifts))
E           
E         File "/data/shared/software/python_env/anaconda3/envs/rew-test/lib/python3.9/site-packages/torchani/aev.py", line 114, in compute_shifts
E           inv_distances = reciprocal_cell.norm(2, -1)
E           num_repeats = torch.ceil(cutoff * inv_distances).to(torch.long)
E           num_repeats = torch.where(pbc, num_repeats, num_repeats.new_zeros(()))
E                         ~~~~~~~~~~~ <--- HERE
E           r1 = torch.arange(1, num_repeats[0].item() + 1, device=cell.device)
E           r2 = torch.arange(1, num_repeats[1].item() + 1, device=cell.device)
E       RuntimeError: Expected condition, x and y to be on the same device, but condition is on cpu and x and y are on cuda:0 and cuda:0 respectively

/data/shared/software/python_env/anaconda3/envs/rew-test/lib/python3.9/site-packages/openmm/openmm.py:9028: OpenMMException
------------------------------------------------------------------------------------------------------------ Captured stdout call ------------------------------------------------------------------------------------------------------------
/data/shared/software/python_env/anaconda3/envs/rew-test/lib/python3.9/site-packages/torchani/resources/
/data/shared/software/python_env/anaconda3/envs/rew-test/lib/python3.9/site-packages/torchani/resources/
------------------------------------------------------------------------------------------------------------- Captured log call --------------------------------------------------------------------------------------------------------------
WARNING  root:__init__.py:5 Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
============================================================================================================== warnings summary ==============================================================================================================
test/TestMLPotential.py::TestMLPotential::testCreateMixedSystem
  /data/shared/software/python_env/anaconda3/envs/rew-test/lib/python3.9/site-packages/torchani/__init__.py:55: UserWarning: Dependency not satisfied, torchani.ase will not be available
    warnings.warn("Dependency not satisfied, torchani.ase will not be available")

test/TestMLPotential.py::TestMLPotential::testCreateMixedSystem
  /data/shared/software/python_env/anaconda3/envs/rew-test/lib/python3.9/site-packages/torch/functional.py:1069: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  /home/conda/feedstock_root/build_artifacts/pytorch-recipe_1645049332358/work/aten/src/ATen/native/TensorShape.cpp:2156.)
    return _VF.cartesian_prod(tensors)  # type: ignore[attr-defined]

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================================================================== short test summary info ===========================================================================================================
FAILED TestMLPotential.py::TestMLPotential::testCreateMixedSystem - openmm.OpenMMException: The following operation failed in the TorchScript interpreter.
======================================================================================================= 1 failed, 2 warnings in 28.61s =======================================================================================================

@peastman
Member

Found it! This actually turned out to be unrelated to the fix in openmm/openmm#3533. The problem was that when we created the module, we didn't register species and pbc as parameters. Because of that, when we called to(device) on it to move the module to the GPU, those two didn't get moved.

The fix is in #28.
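
The mechanism can be reproduced in a few lines of PyTorch. This is a sketch under stated assumptions, not the change made in #28: the class names are hypothetical, and `register_buffer` is used here as one way of making a tensor known to the module (the actual fix may register the tensors differently). When no GPU is available, the sketch falls back to PyTorch's "meta" device so that the difference is still visible.

```python
import torch


class Unregistered(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Plain tensor attribute: Module.to() does not know about it.
        self.pbc = torch.tensor([True, True, True])


class Registered(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Registered buffer: moved together with the module by Module.to().
        self.register_buffer("pbc", torch.tensor([True, True, True]))


# Use the GPU if there is one; otherwise the "meta" device stands in for it.
target = "cuda" if torch.cuda.is_available() else "meta"

plain = Unregistered().to(target)
fixed = Registered().to(target)

print("plain attribute stays on:", plain.pbc.device.type)
print("registered buffer moves to:", fixed.pbc.device.type)
```

With the plain attribute left on the CPU, an operation such as torch.where(pbc, x, y) with x and y on the GPU mixes devices, which matches the RuntimeError in the traceback above.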

@wiederm
Author

wiederm commented May 20, 2022

Thank you very much for your help and the quick fix!

@wiederm wiederm closed this as completed May 20, 2022