I encountered the following problem while training a diffusion model on the CIFAR-10 dataset. Has anyone else encountered this problem, and how can it be solved?
A process has executed an operation involving a call
to the fork() system call to create a child process.
As a result, the libfabric EFA provider is operating in
a condition that could result in memory corruption or
other system errors.
For the libfabric EFA provider to work safely when fork()
is called, you will need to set the following environment
variable:
RDMAV_FORK_SAFE
However, setting this environment variable can result in
signficant performance impact to your application due to
increased cost of memory registration.
You may want to check with your application vendor to see
if an application-level alternative (of not using fork)
exists.
Your job will now abort.
python:699378 terminated with signal 6 at PC=7f8740a96a7c SP=7ffe9c925f50. Backtrace:
/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f8740a96a7c]
/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f8740a42476]
/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f8740a287f3]
/lib/x86_64-linux-gnu/libfabric.so.1(+0x76b4e)[0x7f873a321b4e]
/lib/x86_64-linux-gnu/libc.so.6(+0xeafb8)[0x7f8740aeafb8]
/lib/x86_64-linux-gnu/libc.so.6(__libc_fork+0x71)[0x7f8740aea781]
python(+0x27257e)[0x55d9043c957e]
python(+0x185c51)[0x55d9042dcc51]
python(_PyEval_EvalFrameDefault+0x4910)[0x55d90432af30]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x868)[0x55d904326e88]
python(_PyObject_FastCallDictTstate+0x569)[0x55d9043049b9]
python(_PyObject_Call_Prepend+0x6a)[0x55d904304d6a]
python(+0x1adf26)[0x55d904304f26]
python(_PyObject_MakeTpCall+0x2f5)[0x55d9042a0ab5]
python(_PyEval_EvalFrameDefault+0x47bc)[0x55d90432addc]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x4910)[0x55d90432af30]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x4910)[0x55d90432af30]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x868)[0x55d904326e88]
python(+0x1a58f6)[0x55d9042fc8f6]
python(_PyObject_FastCallDictTstate+0x30b)[0x55d90430475b]
python(_PyObject_Call_Prepend+0x6a)[0x55d904304d6a]
python(+0x1adf26)[0x55d904304f26]
python(_PyObject_MakeTpCall+0x2f5)[0x55d9042a0ab5]
python(_PyEval_EvalFrameDefault+0x47bc)[0x55d90432addc]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x868)[0x55d904326e88]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(+0x18fbc7)[0x55d9042e6bc7]
python(+0x18fc3d)[0x55d9042e6c3d]
python(+0x23d931)[0x55d904394931]
python(PyObject_GetIter+0x16)[0x55d9042a1b76]
python(_PyEval_EvalFrameDefault+0x66a7)[0x55d90432ccc7]
python(+0x240615)[0x55d904397615]
python(+0x191f43)[0x55d9042e8f43]
python(+0x185d61)[0x55d9042dcd61]
python(_PyEval_EvalFrameDefault+0x4b6)[0x55d904326ad6]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x868)[0x55d904326e88]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x4b6)[0x55d904326ad6]
python(+0x1a579a)[0x55d9042fc79a]
python(_PyEval_EvalCodeWithName+0x4b)[0x55d9042fd14b]
python(PyEval_EvalCodeEx+0x44)[0x55d9042fd194]
python(PyEval_EvalCode+0x1c)[0x55d9042fd1bc]
python(+0x2525cd)[0x55d9043a95cd]
python(+0x276196)[0x55d9043cd196]
python(+0x120091)[0x55d904277091]
python(PyRun_SimpleFileExFlags+0x1c1)[0x55d9043d3ee1]
python(Py_RunMain+0x398)[0x55d9043d45b8]
python(Py_BytesMain+0x39)[0x55d9043d4729]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f8740a29d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f8740a29e40]
python(+0x203667)[0x55d90435a667]
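For anyone hitting the same abort, here is a minimal sketch of the two options the message itself describes: setting RDMAV_FORK_SAFE, or avoiding fork() at the application level. It rests on assumptions I have not verified against this repository: that the variable only needs to be present in the environment before mpi4py (or anything else that loads libfabric) is imported, that the value "1" is acceptable, and that the fork comes from Python's multiprocessing (the PyObject_GetIter frame just before __libc_fork hints at worker processes being started when an iterator is created). The helper name use_spawn_start_method is hypothetical.

```python
import os

# Option 1 (assumption: must run before mpi4py / libfabric is loaded).
# The error message asks for RDMAV_FORK_SAFE to be set; "1" is an assumed value.
# setdefault keeps any value already exported in the shell.
os.environ.setdefault("RDMAV_FORK_SAFE", "1")

import multiprocessing as mp


def use_spawn_start_method() -> str:
    """Option 2: start child processes with 'spawn' instead of fork().

    Hypothetical helper. Call it once, early in the entry script, before
    any worker processes are created.
    """
    mp.set_start_method("spawn", force=True)
    return mp.get_start_method()


if __name__ == "__main__":
    print("RDMAV_FORK_SAFE =", os.environ["RDMAV_FORK_SAFE"])
    print("start method    =", use_spawn_start_method())
```

The same variable can also be exported in the launch script instead of in Python. As the message warns, option 1 may cost performance because memory registration becomes more expensive. If the training loop uses a PyTorch DataLoader (an assumption on my part), passing num_workers=0 or multiprocessing_context="spawn" is another way to keep it from forking.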
OS: Ubuntu 22.04
GPU: RTX 3090
Python: 3.10
mpi4py: 3.5.1
train_bash.sh