Hi nnUNet team,
I've been getting RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message, but have not been able to figure out why. It has been previously referenced in #2514, #2516 and #2297, but I can't tell whether those were ever resolved. I'm trying to train inside a Docker container built on top of the most recent PyTorch image, pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel, using an NVIDIA A100-SXM4-80GB GPU and cuDNN version 90100.
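For reference, this is roughly how the container and training are launched. It is a simplified sketch rather than the exact command: the image name and host paths are placeholders, while the dataset ID (888), configuration (3d_fullres), fold (0) and plans (nnUNetResEncUNetLPlans) match the log below.

```bash
# Simplified sketch of the launch command; image name and host paths are placeholders.
docker run --rm -it --gpus all \
    -v /path/to/nnUNet_raw:/data/nnUNet_raw \
    -v /path/to/nnUNet_preprocessed:/data/nnUNet_preprocessed \
    -v /path/to/nnUNet_results:/data/nnUNet_results \
    -e nnUNet_raw=/data/nnUNet_raw \
    -e nnUNet_preprocessed=/data/nnUNet_preprocessed \
    -e nnUNet_results=/data/nnUNet_results \
    -e nnUNet_compile=true \
    my-nnunet-image \
    nnUNetv2_train 888 3d_fullres 0 -p nnUNetResEncUNetLPlans
```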
Below are the generated log messages.
Any lead would be appreciated. Thanks in advance.
Alejandro
Matplotlib created a temporary cache directory at /tmp/matplotlib-hxupd6_c because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Using device: cuda:0
/workspace/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py:164: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
self.grad_scaler = GradScaler() if self.device.type == 'cuda' else None
#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################
2024-11-11 12:16:22.413313: do_dummy_2d_data_aug: False
2024-11-11 12:16:22.419833: Using splits from existing split file: /data/nnUNet_preprocessed/Dataset888_EnhancedT1ce/splits_final.json
2024-11-11 12:16:22.420686: The split file contains 5 splits.
2024-11-11 12:16:22.420724: Desired fold for training: 0
2024-11-11 12:16:22.420754: This split has 1097 training and 275 validation cases.
using pin_memory on device 0
using pin_memory on device 0
Exception in thread Thread-3 (results_loop):
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
self.run()
File "/opt/conda/lib/python3.11/threading.py", line 982, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
raise e
File "/opt/conda/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Exception in thread Thread-2 (results_loop):
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
self.run()
File "/opt/conda/lib/python3.11/threading.py", line 982, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
raise e
File "/opt/conda/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
2024-11-11 12:16:43.921363: Using torch.compile...
/opt/conda/lib/python3.11/site-packages/torch/optim/lr_scheduler.py:62: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate.
warnings.warn(
This is the configuration used by this training:
Configuration name: 3d_fullres
{'data_identifier': 'nnUNetPlans_3d_fullres', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 11, 'patch_size': [160, 192, 160], 'median_image_size_in_voxels': [140.0, 170.0, 137.0], 'spacing': [1.0, 1.0, 1.0], 'normalization_schemes': ['ZScoreNormalization'], 'use_mask_for_norm': [True], 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'architecture': {'network_class_name': 'dynamic_network_architectures.architectures.unet.ResidualEncoderUNet', 'arch_kwargs': {'n_stages': 6, 'features_per_stage': [32, 64, 128, 256, 320, 320], 'conv_op': 'torch.nn.modules.conv.Conv3d', 'kernel_sizes': [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]], 'strides': [[1, 1, 1], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2]], 'n_blocks_per_stage': [1, 3, 4, 6, 6, 6], 'n_conv_per_stage_decoder': [1, 1, 1, 1, 1], 'conv_bias': True, 'norm_op': 'torch.nn.modules.instancenorm.InstanceNorm3d', 'norm_op_kwargs': {'eps': 1e-05, 'affine': True}, 'dropout_op': None, 'dropout_op_kwargs': None, 'nonlin': 'torch.nn.LeakyReLU', 'nonlin_kwargs': {'inplace': True}}, '_kw_requires_import': ['conv_op', 'norm_op', 'dropout_op', 'nonlin']}, 'batch_dice': False}
These are the global plan.json settings:
{'dataset_name': 'Dataset888_EnhancedT1ce', 'plans_name': 'nnUNetResEncUNetLPlans', 'original_median_spacing_after_transp': [1.0, 1.0, 1.0], 'original_median_shape_after_transp': [140, 170, 137], 'image_reader_writer': 'SimpleITKIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'nnUNetPlannerResEncL', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 1905559.25, 'mean': 2247.761180218517, 'median': 709.0, 'min': 0.0, 'percentile_00_5': 0.0, 'percentile_99_5': 9960.0, 'std': 26064.039749429445}}}
2024-11-11 12:16:44.753634: unpacking dataset...
2024-11-11 12:16:50.472351: unpacking done...
2024-11-11 12:16:50.473873: Unable to plot network architecture: nnUNet_compile is enabled!
2024-11-11 12:16:50.481694:
2024-11-11 12:16:50.481765: Epoch 0
2024-11-11 12:16:50.481896: Current learning rate: 0.01
Traceback (most recent call last):
File "/opt/conda/bin/nnUNetv2_train", line 8, in <module>
sys.exit(run_training_entry())
^^^^^^^^^^^^^^^^^^^^
File "/workspace/nnUNet/nnunetv2/run/run_training.py", line 275, in run_training_entry
run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
File "/workspace/nnUNet/nnunetv2/run/run_training.py", line 211, in run_training
nnunet_trainer.run_training()
File "/workspace/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1370, in run_training
train_outputs.append(self.train_step(next(self.dataloader_train)))
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in __next__
item = self.__get_next_item()
^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
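If it helps with debugging, I can rerun with the background workers disabled so that the actual worker exception is printed directly instead of the generic message above. As far as I understand, the number of data augmentation workers is controlled by the nnUNet_n_proc_DA environment variable, and setting it to 0 makes nnU-Net fall back to single-threaded data loading:

```bash
# Rerun single-threaded so the real worker error surfaces (same dataset/configuration/fold as above).
nnUNet_n_proc_DA=0 nnUNetv2_train 888 3d_fullres 0 -p nnUNetResEncUNetLPlans
```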