BUG: checkpointing netgen-generated meshes deadlocks in parallel #3783

Open
pefarrell opened this issue Oct 1, 2024 · 0 comments

Describe the bug

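Saving a netgen-generated mesh to a checkpoint with CheckpointFile.save_mesh deadlocks in parallel (reproduced on 2 MPI processes). The same script with a UnitSquareMesh does not hang.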

Steps to Reproduce
Steps to reproduce the behavior:

  1. Save the following code as demo.py:
# Run with mpiexec -n 2 ...

from firedrake import *
from netgen.geom2d import SplineGeometry


use_netgen = True
if use_netgen:
    geo = SplineGeometry()
    geo.AddRectangle((0, 0), (1, 1))
    ngmesh = geo.GenerateMesh(maxh=1.0)
    mesh = Mesh(ngmesh)
else:
    mesh = UnitSquareMesh(1, 1)
with CheckpointFile("temp.h5", "w") as f:
    print("saving...", flush=True)
    f.save_mesh(mesh)
    print("done", flush=True)
# hangs if use_netgen
  2. Run mpiexec -n 2 python demo.py
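
Since the failure mode is a hang rather than an error, it can help to run the reproducer under a timeout so the deadlock is reported automatically. The following is only a sketch: the helper name run_with_timeout, the REPRO_CMD list and the 60 second limit are my own assumptions, not part of Firedrake, and it assumes the script is saved as demo.py with mpiexec on the PATH.

```python
import subprocess
import sys

# Assumed command line for the reproducer (demo.py in the current directory).
REPRO_CMD = ["mpiexec", "-n", "2", sys.executable, "demo.py"]


def run_with_timeout(cmd, timeout_s):
    """Run cmd and classify the outcome as 'ok', 'failed' or 'hang'."""
    try:
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child it launched before re-raising.
        return "hang"
    return "ok" if result.returncode == 0 else "failed"
```

Calling run_with_timeout(REPRO_CMD, 60) should then return "hang" with use_netgen = True. Note that only the mpiexec launcher is killed directly on timeout; whether the MPI ranks themselves are cleaned up depends on the MPI implementation.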

Expected behavior
I expected the code to terminate successfully.

Error message
Here are the backtraces for the two processes.

#0  MPID_nem_queue_empty (qhead=0x141412fff180)
    at /home/farrellp/git/install-scripts/firedrake/firedrake-dev-20240828-mpich/src/petsc/linux-gnu-c-opt/externalpackages/mpich-4.2.2/src/mpl/include/mpl_atomic_c11.h:104
#1  MPID_nem_mpich_blocking_recv (completions=<optimised out>, in_fbox=<synthetic pointer>, cell=<synthetic pointer>) at ./src/mpid/ch3/channels/nemesis/include/mpid_nem_inline.h:936
#2  MPIDI_CH3I_Progress (progress_state=progress_state@entry=0x7fff170644c4, is_blocking=is_blocking@entry=1) at src/mpid/ch3/channels/nemesis/src/ch3_progress.c:354
#3  0x000014141b43c847 in MPIR_Wait_state (request_ptr=request_ptr@entry=0x14141b6cf4a0 <MPIR_Request_direct>, status=status@entry=0x1, state=state@entry=0x7fff170644c4) at src/mpi/request/request_impl.c:736
#4  0x000014141b43c9cb in MPIR_Wait_impl (status=0x1, request_ptr=0x14141b6cf4a0 <MPIR_Request_direct>) at src/mpi/request/request_impl.c:760
#5  MPID_Wait (status=0x1, request_ptr=0x14141b6cf4a0 <MPIR_Request_direct>) at ./src/mpid/ch3/include/mpidpost.h:267
#6  MPIR_Wait (request_ptr=request_ptr@entry=0x14141b6cf4a0 <MPIR_Request_direct>, status=status@entry=0x1) at src/mpi/request/request_impl.c:779
#7  0x000014141b3f8c6b in MPIC_Wait (request_ptr=0x14141b6cf4a0 <MPIR_Request_direct>) at src/mpi/coll/helper_fns.c:90
#8  0x000014141b3f963e in MPIC_Sendrecv (sendbuf=sendbuf@entry=0x0, sendcount=sendcount@entry=0, sendtype=sendtype@entry=1275068685, dest=dest@entry=1, sendtag=sendtag@entry=1, recvbuf=recvbuf@entry=0x0, 
    recvcount=0, recvtype=1275068685, source=1, recvtag=1, comm_ptr=0x589aeb176908, status=0x7fff17064560, errflag=MPIR_ERR_NONE) at src/mpi/coll/helper_fns.c:307
#9  0x000014141b37506f in MPIR_Barrier_intra_dissemination (comm_ptr=0x589aeb176908, errflag=errflag@entry=MPIR_ERR_NONE) at src/mpi/coll/barrier/barrier_intra_k_dissemination.c:30
#10 0x000014141b37558c in MPIR_Barrier_intra_k_dissemination (comm=comm@entry=0x589aeb176908, k=<optimised out>, errflag=errflag@entry=MPIR_ERR_NONE) at src/mpi/coll/barrier/barrier_intra_k_dissemination.c:63
#11 0x000014141b3db08e in MPIR_Barrier_allcomm_auto (comm_ptr=comm_ptr@entry=0x589aeb176908, errflag=errflag@entry=MPIR_ERR_NONE) at src/mpi/coll/mpir_coll.c:27
#12 0x000014141b3db18b in MPIR_Barrier_impl (comm_ptr=0x589aeb176908, errflag=errflag@entry=MPIR_ERR_NONE) at src/mpi/coll/mpir_coll.c:85
#13 0x000014141b3db369 in MPID_Barrier (errflag=MPIR_ERR_NONE, comm=<optimised out>) at ./src/mpid/ch3/include/mpid_coll.h:20
#14 0x000014141b3756d5 in MPIR_Barrier_intra_smp (comm_ptr=comm_ptr@entry=0x589aeb176430, errflag=errflag@entry=MPIR_ERR_NONE) at src/mpi/coll/barrier/barrier_intra_smp.c:17
#15 0x000014141b3db07b in MPIR_Barrier_allcomm_auto (comm_ptr=comm_ptr@entry=0x589aeb176430, errflag=errflag@entry=MPIR_ERR_NONE) at src/mpi/coll/mpir_coll.c:39
#16 0x000014141b3db18b in MPIR_Barrier_impl (comm_ptr=comm_ptr@entry=0x589aeb176430, errflag=errflag@entry=MPIR_ERR_NONE) at src/mpi/coll/mpir_coll.c:85
#17 0x000014141b3db369 in MPID_Barrier (errflag=MPIR_ERR_NONE, comm=0x589aeb176430) at ./src/mpid/ch3/include/mpid_coll.h:20
#18 0x000014141b25f123 in internal_Barrier (comm=-1006632938) at src/binding/c/c_binding.c:7439
#19 PMPI_Barrier (comm=-1006632938) at src/binding/c/c_binding.c:7487
#20 0x00001414193073ac in H5AC__rsp__dist_md_write__flush (f=0x589aeb0ee390) at H5ACmpio.c:1702
#21 H5AC__run_sync_point (sync_point_op=<optimised out>, f=0x589aeb0ee390) at H5ACmpio.c:2164
#22 H5AC__run_sync_point (f=0x589aeb0ee390, sync_point_op=<optimised out>) at H5ACmpio.c:2099
#23 0x000014141930855f in H5AC__flush_entries (f=f@entry=0x589aeb0ee390) at H5ACmpio.c:2307
#24 0x000014141906fee8 in H5AC_dest (f=f@entry=0x589aeb0ee390) at H5AC.c:527
#25 0x000014141910d9b0 in H5F__dest (f=f@entry=0x589aeb0ee390, flush=flush@entry=true) at H5Fint.c:1275
#26 0x000014141910e7c3 in H5F_try_close (f=0x589aeb0ee390, was_closed=was_closed@entry=0x0) at H5Fint.c:2180
#27 0x000014141910eafc in H5F__close_cb (f=<optimised out>) at H5Fint.c:2009
#28 0x0000141419186d68 in H5I_dec_ref (id=72057594037927936) at H5I.c:1254
#29 H5I_dec_ref (id=72057594037927936) at H5I.c:1219
#30 0x0000141419186f14 in H5I_dec_app_ref (id=id@entry=72057594037927936) at H5I.c:1299
#31 0x000014141910e532 in H5F__close (file_id=file_id@entry=72057594037927936) at H5Fint.c:1951
#32 0x0000141419103cf2 in H5Fclose (file_id=72057594037927936) at H5F.c:674
#33 0x000014141b9f9d11 in PetscViewerFileClose_HDF5 (viewer=0x589aeb031e80) at /home/farrellp/git/install-scripts/firedrake/firedrake-dev-20240828-mpich/src/petsc/src/sys/classes/viewer/impls/hdf5/hdf5v.c:107
#34 0x000014141b9fa06c in PetscViewerDestroy_HDF5 (viewer=0x589aeb031e80) at /home/farrellp/git/install-scripts/firedrake/firedrake-dev-20240828-mpich/src/petsc/src/sys/classes/viewer/impls/hdf5/hdf5v.c:126
#35 0x000014141ba0a784 in PetscViewerDestroy (viewer=0x1413f79f4078) at /home/farrellp/git/install-scripts/firedrake/firedrake-dev-20240828-mpich/src/petsc/src/sys/classes/viewer/interface/view.c:101
#36 0x000014141df1d664 in __pyx_pf_8petsc4py_5PETSc_6Viewer_6destroy (__pyx_v_self=0x1413f79f4040) at src/petsc4py/PETSc.c:124417
#37 __pyx_pw_8petsc4py_5PETSc_6Viewer_7destroy (__pyx_v_self=0x1413f79f4040, __pyx_args=<optimised out>, __pyx_nargs=<optimised out>, __pyx_kwds=<optimised out>) at src/petsc4py/PETSc.c:58858
#38 0x0000589ae688355e in ?? ()
#39 0x0000589ae684b45c in _PyEval_EvalFrameDefault ()
#40 0x0000589ae68629fc in _PyFunction_Vectorcall ()
#41 0x0000589ae684b45c in _PyEval_EvalFrameDefault ()
#42 0x0000589ae68707f1 in ?? ()
#43 0x0000589ae684b26d in _PyEval_EvalFrameDefault ()
#44 0x0000589ae68479c6 in ?? ()
#45 0x0000589ae693d256 in PyEval_EvalCode ()
#46 0x0000589ae6968108 in ?? ()
#47 0x0000589ae69619cb in ?? ()
#48 0x0000589ae6967e55 in ?? ()
#49 0x0000589ae6967338 in _PyRun_SimpleFileObject ()
#50 0x0000589ae6966f83 in _PyRun_AnyFileObject ()
#51 0x0000589ae6959a5e in Py_RunMain ()
#52 0x0000589ae693002d in Py_BytesMain ()
#53 0x000014141ec29d90 in __libc_start_call_main (main=main@entry=0x589ae692fff0, argc=argc@entry=2, argv=argv@entry=0x7fff17065978) at ../sysdeps/nptl/libc_start_call_main.h:58
#54 0x000014141ec29e40 in __libc_start_main_impl (main=0x589ae692fff0, argc=2, argv=0x7fff17065978, init=<optimised out>, fini=<optimised out>, rtld_fini=<optimised out>, stack_end=0x7fff17065968) at ../csu/libc-start.c:392
#55 0x0000589ae692ff25 in _start ()

and

#0  MPID_nem_queue_empty (qhead=0xa6ce0fff200)
    at /home/farrellp/git/install-scripts/firedrake/firedrake-dev-20240828-mpich/src/petsc/linux-gnu-c-opt/externalpackages/mpich-4.2.2/src/mpl/include/mpl_atomic_c11.h:104
#1  MPID_nem_mpich_blocking_recv (completions=<optimised out>, in_fbox=<synthetic pointer>, cell=<synthetic pointer>) at ./src/mpid/ch3/channels/nemesis/include/mpid_nem_inline.h:936
#2  MPIDI_CH3I_Progress (progress_state=progress_state@entry=0x7ffd78cdc0d4, is_blocking=is_blocking@entry=1) at src/mpid/ch3/channels/nemesis/src/ch3_progress.c:354
#3  0x00000a6ce963c847 in MPIR_Wait_state (request_ptr=request_ptr@entry=0xa6ce98cf4a0 <MPIR_Request_direct>, status=status@entry=0x1, state=state@entry=0x7ffd78cdc0d4) at src/mpi/request/request_impl.c:736
#4  0x00000a6ce963c9cb in MPIR_Wait_impl (status=0x1, request_ptr=0xa6ce98cf4a0 <MPIR_Request_direct>) at src/mpi/request/request_impl.c:760
#5  MPID_Wait (status=0x1, request_ptr=0xa6ce98cf4a0 <MPIR_Request_direct>) at ./src/mpid/ch3/include/mpidpost.h:267
#6  MPIR_Wait (request_ptr=request_ptr@entry=0xa6ce98cf4a0 <MPIR_Request_direct>, status=status@entry=0x1) at src/mpi/request/request_impl.c:779
#7  0x00000a6ce95f8c6b in MPIC_Wait (request_ptr=0xa6ce98cf4a0 <MPIR_Request_direct>) at src/mpi/coll/helper_fns.c:90
#8  0x00000a6ce95f90f1 in MPIC_Recv (buf=buf@entry=0x7ffd78cdc3a8, count=count@entry=8, datatype=datatype@entry=1275068685, source=<optimised out>, tag=tag@entry=2, comm_ptr=comm_ptr@entry=0x631691c26a20, 
    status=0x7ffd78cdc200) at src/mpi/coll/helper_fns.c:198
#9  0x00000a6ce95762c9 in MPIR_Bcast_intra_binomial (buffer=buffer@entry=0x7ffd78cdc3a8, count=count@entry=8, datatype=datatype@entry=1275068685, root=root@entry=0, comm_ptr=comm_ptr@entry=0x631691c26a20, 
    errflag=MPIR_ERR_NONE) at src/mpi/coll/bcast/bcast_intra_binomial.c:97
#10 0x00000a6ce95dbf2c in MPIR_Bcast_allcomm_auto (buffer=buffer@entry=0x7ffd78cdc3a8, count=count@entry=8, datatype=datatype@entry=1275068685, root=root@entry=0, comm_ptr=0x631691c26a20, errflag=MPIR_ERR_NONE)
    at src/mpi/coll/mpir_coll.c:324
#11 0x00000a6ce95dc061 in MPIR_Bcast_impl (buffer=buffer@entry=0x7ffd78cdc3a8, count=count@entry=8, datatype=datatype@entry=1275068685, root=root@entry=0, comm_ptr=comm_ptr@entry=0x631691c26a20, 
    errflag=errflag@entry=MPIR_ERR_NONE) at src/mpi/coll/mpir_coll.c:421
#12 0x00000a6ce95dc2f9 in MPID_Bcast (errflag=MPIR_ERR_NONE, comm=0x631691c26a20, root=0, datatype=1275068685, count=8, buffer=0x7ffd78cdc3a8) at ./src/mpid/ch3/include/mpid_coll.h:30
#13 0x00000a6ce945ff6f in internal_Bcast (comm=-1006632948, root=0, datatype=1275068685, count=8, buffer=<optimised out>) at src/binding/c/c_binding.c:7708
#14 PMPI_Bcast (buffer=buffer@entry=0x7ffd78cdc3a8, count=count@entry=8, datatype=datatype@entry=1275068685, root=root@entry=0, comm=-1006632948) at src/binding/c/c_binding.c:7759
#15 0x00000a6ce7314bef in H5FD_mpio_truncate (dxpl_id=<optimised out>, closing=128, _file=0x631691814f80) at H5FDmpio.c:2023
#16 H5FD_mpio_truncate (_file=_file@entry=0x631691814f80, dxpl_id=<optimised out>, closing=closing@entry=false) at H5FDmpio.c:1979
#17 0x00000a6ce7124b61 in H5FD_truncate (file=0x631691814f80, closing=closing@entry=false) at H5FD.c:1580
#18 0x00000a6ce710bd5c in H5F__flush_phase2 (f=f@entry=0x631691c10e00, closing=closing@entry=false) at H5Fint.c:1846
#19 0x00000a6ce710e37a in H5F__flush_phase2 (closing=false, f=0x631691c10e00) at H5Fint.c:1825
#20 H5F__flush (f=f@entry=0x631691c10e00) at H5Fint.c:1904
#21 0x00000a6ce7103a84 in H5Fflush (object_id=object_id@entry=72057594037927936, scope=scope@entry=H5F_SCOPE_LOCAL) at H5F.c:638
#22 0x00000a6cce378a9c in __pyx_f_4h5py_4defs_H5Fflush (__pyx_v_object_id=72057594037927936, __pyx_v_scope=H5F_SCOPE_LOCAL)
    at /home/farrellp/git/install-scripts/firedrake/firedrake-dev-20240828-mpich/src/h5py/h5py/defs.c:14175
#23 0x00000a6ccba7f202 in __pyx_pf_4h5py_3h5f_6flush (__pyx_v_obj=0xa6cc56348b0, __pyx_v_obj=0xa6cc56348b0, __pyx_self=<optimised out>, __pyx_v_scope=0)
    at /home/farrellp/git/install-scripts/firedrake/firedrake-dev-20240828-mpich/src/h5py/h5py/h5f.c:7587
#24 __pyx_pw_4h5py_3h5f_7flush (__pyx_self=<optimised out>, __pyx_args=<optimised out>, __pyx_nargs=1, __pyx_kwds=<optimised out>)
    at /home/farrellp/git/install-scripts/firedrake/firedrake-dev-20240828-mpich/src/h5py/h5py/h5f.c:7554
#25 0x00000a6ccd7ac8c1 in __Pyx_PyObject_Call (kw=0xa6cc57decc0, arg=0xa6cc5609840, func=0xa6cca6b4a00)
    at /home/farrellp/git/install-scripts/firedrake/firedrake-dev-20240828-mpich/src/h5py/h5py/_objects.c:14294
#26 __pyx_pf_4h5py_8_objects_9with_phil_wrapper (__pyx_v_kwds=0xa6cc56277c0, __pyx_v_args=0xa6cc5609840, __pyx_self=<optimised out>)
    at /home/farrellp/git/install-scripts/firedrake/firedrake-dev-20240828-mpich/src/h5py/h5py/_objects.c:6419
#27 __pyx_pw_4h5py_8_objects_9with_phil_1wrapper (__pyx_self=<optimised out>, __pyx_args=0xa6cc5609840, __pyx_kwds=<optimised out>)
    at /home/farrellp/git/install-scripts/firedrake/firedrake-dev-20240828-mpich/src/h5py/h5py/_objects.c:6330
#28 0x000063168db13a7b in _PyObject_MakeTpCall ()
#29 0x000063168db0c629 in _PyEval_EvalFrameDefault ()
#30 0x000063168db1d9fc in _PyFunction_Vectorcall ()
#31 0x000063168db0645c in _PyEval_EvalFrameDefault ()
#32 0x000063168db1d9fc in _PyFunction_Vectorcall ()
#33 0x000063168db0645c in _PyEval_EvalFrameDefault ()
#34 0x000063168db2b7f1 in ?? ()
#35 0x000063168db0626d in _PyEval_EvalFrameDefault ()
#36 0x000063168db029c6 in ?? ()
#37 0x000063168dbf8256 in PyEval_EvalCode ()
#38 0x000063168dc23108 in ?? ()
#39 0x000063168dc1c9cb in ?? ()
#40 0x000063168dc22e55 in ?? ()
#41 0x000063168dc22338 in _PyRun_SimpleFileObject ()
#42 0x000063168dc21f83 in _PyRun_AnyFileObject ()
#43 0x000063168dc14a5e in Py_RunMain ()
#44 0x000063168dbeb02d in Py_BytesMain ()
#45 0x00000a6cece29d90 in __libc_start_call_main (main=main@entry=0x63168dbeaff0, argc=argc@entry=2, argv=argv@entry=0x7ffd78cdd278) at ../sysdeps/nptl/libc_start_call_main.h:58
#46 0x00000a6cece29e40 in __libc_start_main_impl (main=0x63168dbeaff0, argc=2, argv=0x7ffd78cdd278, init=<optimised out>, fini=<optimised out>, rtld_fini=<optimised out>, stack_end=0x7ffd78cdd268) at ../csu/libc-start.c:392
#47 0x000063168dbeaf25 in _start ()
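
Reading the two traces side by side: the first process is waiting in PMPI_Barrier underneath H5Fclose (reached via PetscViewerDestroy), while the second is waiting in PMPI_Bcast underneath H5Fflush (reached via h5py). So the two ranks appear to be blocked in different collective calls. Below is a small sketch for pulling that summary out of gdb backtraces; the helper names are my own, and the sample frames are a handful of lines copied from the traces above.

```python
import re

# Matches gdb frame lines such as
#   "#19 PMPI_Barrier (comm=...) at ..."
#   "#32 0x0000141419103cf2 in H5Fclose (file_id=...) at ..."
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+ in )?(\S+) \(")


def frame_functions(backtrace):
    """Return the function names in a gdb backtrace, innermost frame first."""
    names = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            names.append(match.group(2))
    return names


def summarize(backtrace):
    """Return (first PMPI_* frame, first public H5F* API frame), or None where absent."""
    names = frame_functions(backtrace)
    mpi_call = next((n for n in names if n.startswith("PMPI_")), None)
    # Public HDF5 file calls look like H5Fclose; internal ones contain "_" or "D".
    hdf5_call = next((n for n in names if re.fullmatch(r"H5F[a-z]+", n)), None)
    return mpi_call, hdf5_call


# A few frames copied from the first trace.
FIRST_TRACE = """
#19 PMPI_Barrier (comm=-1006632938) at src/binding/c/c_binding.c:7487
#20 0x00001414193073ac in H5AC__rsp__dist_md_write__flush (f=0x589aeb0ee390) at H5ACmpio.c:1702
#31 0x000014141910e532 in H5F__close (file_id=file_id@entry=72057594037927936) at H5Fint.c:1951
#32 0x0000141419103cf2 in H5Fclose (file_id=72057594037927936) at H5F.c:674
"""

# A few frames copied from the second trace.
SECOND_TRACE = """
#14 PMPI_Bcast (buffer=buffer@entry=0x7ffd78cdc3a8, count=count@entry=8, datatype=datatype@entry=1275068685, root=root@entry=0, comm=-1006632948) at src/binding/c/c_binding.c:7759
#17 0x00000a6ce7124b61 in H5FD_truncate (file=0x631691814f80, closing=closing@entry=false) at H5FD.c:1580
#20 H5F__flush (f=f@entry=0x631691c10e00) at H5Fint.c:1904
#21 0x00000a6ce7103a84 in H5Fflush (object_id=object_id@entry=72057594037927936, scope=scope@entry=H5F_SCOPE_LOCAL) at H5F.c:638
"""

print(summarize(FIRST_TRACE))
print(summarize(SECOND_TRACE))
```

This only summarises where each rank is stuck; it does not say why one rank reaches the file close while the other is still flushing.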

Environment:

  • OS: Ubuntu 22.04
  • Python version: 3.10.12
  • Output of firedrake-status
Firedrake Configuration:
   package_manager: False
   minimal_petsc: False
   mpicc: /home/farrellp/git/install-scripts/firedrake/firedrake-dev-20240828-mpich/src/petsc/linux-gnu-c-opt/bin/mpicc
   mpicxx: /home/farrellp/git/install-scripts/firedrake/firedrake-dev-20240828-mpich/src/petsc/linux-gnu-c-opt/bin/mpicxx
   mpif90: /home/farrellp/git/install-scripts/firedrake/firedrake-dev-20240828-mpich/src/petsc/linux-gnu-c-opt/bin/mpif90
   mpiexec: /home/farrellp/git/install-scripts/firedrake/firedrake-dev-20240828-mpich/src/petsc/linux-gnu-c-opt/bin/mpiexec
   disable_ssh: False
   honour_petsc_dir: True
   with_parmetis: True
   slepc: True
   packages: ['git+ssh://github.com/firedrakeproject/Irksome.git#egg=Irksome']
   honour_pythonpath: False
   opencascade: False
   torch: False
   petsc_int_type: int32
   cache_dir: /home/farrellp/git/install-scripts/firedrake/firedrake-dev-20240828-mpich/.cache
   complex: False
   remove_build_files: False
   with_blas: None
   netgen: True
Additions:
   None
Environment:
   PYTHONPATH: None
   PETSC_ARCH: linux-gnu-c-opt
   PETSC_DIR: /home/farrellp/local/firedrake/firedrake-dev-20240828-mpich/src/petsc
Status of components:
---------------------------------------------------------------------------
|Package             |Branch                        |Revision  |Modified  |
---------------------------------------------------------------------------
|FInAT               |master                        |5914471   |False     |
|Irksome             |master                        |ed30518   |False     |
|PyOP2               |master                        |5f18075f  |False     |
|fiat                |master                        |8aec150   |False     |
|firedrake           |master                        |e035709e3 |False     |
|h5py                |firedrake                     |db2ab02f  |False     |
|libsupermesh        |master                        |c07de13   |False     |
|loopy               |main                          |d9876d83  |False     |
|ngsPETSc            |main                          |7afc820   |False     |
|petsc               |firedrake                     |3a8a2250205|False     |
|pyadjoint           |master                        |92121af   |False     |
|pytest-mpi          |main                          |f2566a1   |False     |
|slepc               |firedrake                     |ef1ce425c |False     |
|tsfc                |master                        |9f42f95   |False     |
|ufl                 |master                        |c1a8afb1  |False     |
---------------------------------------------------------------------------
  • Any relevant environment variables or modifications [eg: PYOP2_DEBUG=1]

Additional Info
N/A
