Train error on ubuntu 22.04 #118

Open
youyuanyi opened this issue Nov 1, 2023 · 1 comment

Comments

@youyuanyi

OS: Ubuntu 22.04
GPU: RTX 3090
Python 3.10
mpi4py: 3.5.1
train_bash.sh

#!/bin/bash

MODEL_FLAGS="--image_size 32 --num_channels 128 --num_res_blocks 3 --learn_sigma True --dropout 0.3 --class_cond True "
DIFFUSION_FLAGS="--diffusion_steps 4000 --noise_schedule cosine"
TRAIN_FLAGS="--lr 1e-4 --batch_size 128"

# Train
python scripts/image_train.py --data_dir ../data $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS

I encountered the following problem while training a diffusion model on the CIFAR-10 dataset. Has anyone else encountered this problem, and how can it be solved?

A process has executed an operation involving a call
to the fork() system call to create a child process.

As a result, the libfabric EFA provider is operating in
a condition that could result in memory corruption or
other system errors.

For the libfabric EFA provider to work safely when fork()
is called, you will need to set the following environment
variable:
          RDMAV_FORK_SAFE

However, setting this environment variable can result in
signficant performance impact to your application due to
increased cost of memory registration.

You may want to check with your application vendor to see
if an application-level alternative (of not using fork)
exists.

Your job will now abort.

python:699378 terminated with signal 6 at PC=7f8740a96a7c SP=7ffe9c925f50.  Backtrace:
/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f8740a96a7c]
/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f8740a42476]
/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f8740a287f3]
/lib/x86_64-linux-gnu/libfabric.so.1(+0x76b4e)[0x7f873a321b4e]
/lib/x86_64-linux-gnu/libc.so.6(+0xeafb8)[0x7f8740aeafb8]
/lib/x86_64-linux-gnu/libc.so.6(__libc_fork+0x71)[0x7f8740aea781]
python(+0x27257e)[0x55d9043c957e]
python(+0x185c51)[0x55d9042dcc51]
python(_PyEval_EvalFrameDefault+0x4910)[0x55d90432af30]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x868)[0x55d904326e88]
python(_PyObject_FastCallDictTstate+0x569)[0x55d9043049b9]
python(_PyObject_Call_Prepend+0x6a)[0x55d904304d6a]
python(+0x1adf26)[0x55d904304f26]
python(_PyObject_MakeTpCall+0x2f5)[0x55d9042a0ab5]
python(_PyEval_EvalFrameDefault+0x47bc)[0x55d90432addc]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x4910)[0x55d90432af30]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x4910)[0x55d90432af30]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x868)[0x55d904326e88]
python(+0x1a58f6)[0x55d9042fc8f6]
python(_PyObject_FastCallDictTstate+0x30b)[0x55d90430475b]
python(_PyObject_Call_Prepend+0x6a)[0x55d904304d6a]
python(+0x1adf26)[0x55d904304f26]
python(_PyObject_MakeTpCall+0x2f5)[0x55d9042a0ab5]
python(_PyEval_EvalFrameDefault+0x47bc)[0x55d90432addc]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x868)[0x55d904326e88]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(+0x18fbc7)[0x55d9042e6bc7]
python(+0x18fc3d)[0x55d9042e6c3d]
python(+0x23d931)[0x55d904394931]
python(PyObject_GetIter+0x16)[0x55d9042a1b76]
python(_PyEval_EvalFrameDefault+0x66a7)[0x55d90432ccc7]
python(+0x240615)[0x55d904397615]
python(+0x191f43)[0x55d9042e8f43]
python(+0x185d61)[0x55d9042dcd61]
python(_PyEval_EvalFrameDefault+0x4b6)[0x55d904326ad6]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x868)[0x55d904326e88]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x4b6)[0x55d904326ad6]
python(+0x1a579a)[0x55d9042fc79a]
python(_PyEval_EvalCodeWithName+0x4b)[0x55d9042fd14b]
python(PyEval_EvalCodeEx+0x44)[0x55d9042fd194]
python(PyEval_EvalCode+0x1c)[0x55d9042fd1bc]
python(+0x2525cd)[0x55d9043a95cd]
python(+0x276196)[0x55d9043cd196]
python(+0x120091)[0x55d904277091]
python(PyRun_SimpleFileExFlags+0x1c1)[0x55d9043d3ee1]
python(Py_RunMain+0x398)[0x55d9043d45b8]
python(Py_BytesMain+0x39)[0x55d9043d4729]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f8740a29d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f8740a29e40]
python(+0x203667)[0x55d90435a667]
@DailyVy

DailyVy commented Mar 19, 2024

I have the same problem.
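
The abort message in the log names one workaround itself: setting the RDMAV_FORK_SAFE environment variable before the process starts. A minimal sketch of that, assuming the value 1 is acceptable (the libibverbs documentation says any value enables fork-safe mode) and that the training command is the one from train_bash.sh. This is untested on the setup above, and the log warns that it may cost performance through slower memory registration.

```shell
#!/bin/bash

# Assumption: RDMAV_FORK_SAFE only needs to be set; 1 is an arbitrary value.
export RDMAV_FORK_SAFE=1

# Confirm that a child process inherits the variable, as python would.
bash -c 'echo "RDMAV_FORK_SAFE=${RDMAV_FORK_SAFE}"'

# Then launch training exactly as in train_bash.sh, for example:
# python scripts/image_train.py --data_dir ../data $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS
```

The export has to come before the python line so that the variable is already in the environment when libfabric is loaded.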
