
[Question]Threading error after last train #284

Open
GarryJAY502 opened this issue Nov 12, 2024 · 6 comments

@GarryJAY502

❓ Question

Hi,
I have a question: in the last stage of training, there was an error when using batchgenerators. I noticed that someone mentioned this issue before. Is there a solution now?

Exception in thread Thread-3:
Traceback (most recent call last):
File "/home/liuyvjie/opt/miniforge3/envs/nndet_venv/lib/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/home/liuyvjie/opt/miniforge3/envs/nndet_venv/lib/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/home/liuyvjie/opt/miniforge3/envs/nndet_venv/lib/python3.9/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 92, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the print"
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Exception in thread Thread-4:
Traceback (most recent call last):
File "/home/liuyvjie/opt/miniforge3/envs/nndet_venv/lib/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/home/liuyvjie/opt/miniforge3/envs/nndet_venv/lib/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/home/liuyvjie/opt/miniforge3/envs/nndet_venv/lib/python3.9/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 92, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the print"
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

What is the purpose of the batchgenerators package, and will this error affect my training process and the saving of the result output?

@mibaumgartner
Collaborator

Dear @GarryJAY502 ,

batchgenerators is used for augmentation and data loading in nnDetection and is therefore essential for proper functionality.

As the message already indicates: "RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message". The passage you posted does not contain the actual error; please provide the full error message.

Best,
Michael
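The pattern behind this error can be illustrated with a minimal sketch (this is not batchgenerators' actual implementation; `worker`, `results_loop`, and `demo` are hypothetical names): a consumer loop polls its background workers and raises the generic RuntimeError once one of them has died, which is why the real traceback is printed by the worker itself, earlier in the log.

```python
import queue
import threading

def worker(q):
    # Hypothetical data-loading worker: produces one batch, then crashes.
    # Its own traceback ("Exception in thread ...") is printed first --
    # that is the "actual error message" the RuntimeError points to.
    q.put("batch-0")
    raise ValueError("simulated crash inside a background worker")

def results_loop(q, workers):
    # Consumer loop: drain results; once the queue stays empty and a
    # worker has died, give up with the generic error seen in this issue.
    results = []
    while True:
        try:
            results.append(q.get(timeout=0.2))
        except queue.Empty:
            if not all(w.is_alive() for w in workers):
                raise RuntimeError(
                    "One or more background workers are no longer alive. "
                    "Exiting. Please check the print statements above for "
                    "the actual error message"
                )

def demo():
    q = queue.Queue()
    t = threading.Thread(target=worker, args=(q,))
    t.start()
    t.join()  # the worker has crashed by now
    try:
        results_loop(q, [t])
    except RuntimeError as e:
        return str(e)

print(demo())
```

In the real library the workers are separate processes rather than threads, but the pattern is the same: the RuntimeError is only the messenger, and the root cause is in the worker's own traceback further up in the output.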

@mibaumgartner mibaumgartner self-assigned this Nov 12, 2024
@GarryJAY502
Author

Thanks, Michael.

[Screenshot 2024-11-12 10:21:43] The error message only contains the part shown in the figure, without any further details. This happened after the last epoch of training. What tasks will nnDetection perform after this?

@mibaumgartner
Collaborator

Dear @GarryJAY502 ,

that is indeed curious and may be a problem within batchgenerators which might not shut down the workers correctly in combination with pytorch lightning.

nnDetection does not use batchgenerators after the training anymore. After training the empirical parameters need to be determined and whole patient inference is performed to give the final validation results.

Best,
Michael

@GarryJAY502
Author


Thanks, Michael.
Does this mean that I can ignore this error and continue with the next task? Is the folder model/Task100_LymphNodes/RetinaUNetV001-D3V001_3d still usable, e.g. for running nndet_consolidate or nndet_predict?

@mibaumgartner
Collaborator

If the training runs through completely, you can continue. The screenshot you posted only shows epoch 1, which is definitely not sufficient; the full schedule contains 60 epochs.
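For reference, continuing after a complete training run would look roughly like the following. This is only a sketch: the command names (nndet_consolidate, nndet_predict) are the ones mentioned in this thread, but the arguments shown are placeholders and any flags are assumptions, so check each command's --help before running.

```shell
# <task> and <model> are placeholders, e.g. Task100_LymphNodes and
# RetinaUNetV001-D3V001_3d; exact arguments and flags are assumptions.
nndet_consolidate <task> <model>   # collect fold results, determine empirical parameters
nndet_predict <task> <model>       # whole-patient inference
```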

@GarryJAY502
Author

Thank you very much. I had just set epoch=1 to reproduce this error and try to debug it.
