Hi nnUNet team,
I've been getting RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message, but have not been able to figure out why. It has been previously referenced in #2514, #2516 and #2297, but I can't tell whether those were ever resolved. I'm trying to train inside a Docker container built on top of the most recent PyTorch image, pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel, using an NVIDIA A100-SXM4-80GB GPU and cuDNN version 90100.
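For reference, this is roughly how the container and training are launched. It is a simplified sketch rather than the exact command: the image name and host paths are placeholders, while the dataset ID (888), configuration (3d_fullres), fold (0) and plans (nnUNetResEncUNetLPlans) match the log below.

```bash
# Simplified sketch of the launch command; image name and host paths are placeholders.
docker run --rm -it --gpus all \
    -v /path/to/nnUNet_raw:/data/nnUNet_raw \
    -v /path/to/nnUNet_preprocessed:/data/nnUNet_preprocessed \
    -v /path/to/nnUNet_results:/data/nnUNet_results \
    -e nnUNet_raw=/data/nnUNet_raw \
    -e nnUNet_preprocessed=/data/nnUNet_preprocessed \
    -e nnUNet_results=/data/nnUNet_results \
    -e nnUNet_compile=true \
    my-nnunet-image \
    nnUNetv2_train 888 3d_fullres 0 -p nnUNetResEncUNetLPlans
```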
Below are the generated log messages.
Any lead would be appreciated. Thanks in advance.
Alejandro
Matplotlib created a temporary cache directory at /tmp/matplotlib-hxupd6_c because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Using device: cuda:0
/workspace/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py:164: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
self.grad_scaler = GradScaler() if self.device.type == 'cuda' else None
#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################
2024-11-11 12:16:22.413313: do_dummy_2d_data_aug: False
2024-11-11 12:16:22.419833: Using splits from existing split file: /data/nnUNet_preprocessed/Dataset888_EnhancedT1ce/splits_final.json
2024-11-11 12:16:22.420686: The split file contains 5 splits.
2024-11-11 12:16:22.420724: Desired fold for training: 0
2024-11-11 12:16:22.420754: This split has 1097 training and 275 validation cases.
using pin_memory on device 0
using pin_memory on device 0
Exception in thread Thread-3 (results_loop):
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
self.run()
File "/opt/conda/lib/python3.11/threading.py", line 982, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
raise e
File "/opt/conda/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Exception in thread Thread-2 (results_loop):
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
self.run()
File "/opt/conda/lib/python3.11/threading.py", line 982, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
raise e
File "/opt/conda/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
2024-11-11 12:16:43.921363: Using torch.compile...
/opt/conda/lib/python3.11/site-packages/torch/optim/lr_scheduler.py:62: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate.
warnings.warn(
This is the configuration used by this training:
Configuration name: 3d_fullres
{'data_identifier': 'nnUNetPlans_3d_fullres', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 11, 'patch_size': [160, 192, 160], 'median_image_size_in_voxels': [140.0, 170.0, 137.0], 'spacing': [1.0, 1.0, 1.0], 'normalization_schemes': ['ZScoreNormalization'], 'use_mask_for_norm': [True], 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'architecture': {'network_class_name': 'dynamic_network_architectures.architectures.unet.ResidualEncoderUNet', 'arch_kwargs': {'n_stages': 6, 'features_per_stage': [32, 64, 128, 256, 320, 320], 'conv_op': 'torch.nn.modules.conv.Conv3d', 'kernel_sizes': [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]], 'strides': [[1, 1, 1], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2]], 'n_blocks_per_stage': [1, 3, 4, 6, 6, 6], 'n_conv_per_stage_decoder': [1, 1, 1, 1, 1], 'conv_bias': True, 'norm_op': 'torch.nn.modules.instancenorm.InstanceNorm3d', 'norm_op_kwargs': {'eps': 1e-05, 'affine': True}, 'dropout_op': None, 'dropout_op_kwargs': None, 'nonlin': 'torch.nn.LeakyReLU', 'nonlin_kwargs': {'inplace': True}}, '_kw_requires_import': ['conv_op', 'norm_op', 'dropout_op', 'nonlin']}, 'batch_dice': False}
These are the global plan.json settings:
{'dataset_name': 'Dataset888_EnhancedT1ce', 'plans_name': 'nnUNetResEncUNetLPlans', 'original_median_spacing_after_transp': [1.0, 1.0, 1.0], 'original_median_shape_after_transp': [140, 170, 137], 'image_reader_writer': 'SimpleITKIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'nnUNetPlannerResEncL', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 1905559.25, 'mean': 2247.761180218517, 'median': 709.0, 'min': 0.0, 'percentile_00_5': 0.0, 'percentile_99_5': 9960.0, 'std': 26064.039749429445}}}
2024-11-11 12:16:44.753634: unpacking dataset...
2024-11-11 12:16:50.472351: unpacking done...
2024-11-11 12:16:50.473873: Unable to plot network architecture: nnUNet_compile is enabled!
2024-11-11 12:16:50.481694:
2024-11-11 12:16:50.481765: Epoch 0
2024-11-11 12:16:50.481896: Current learning rate: 0.01
Traceback (most recent call last):
File "/opt/conda/bin/nnUNetv2_train", line 8, in <module>
sys.exit(run_training_entry())
^^^^^^^^^^^^^^^^^^^^
File "/workspace/nnUNet/nnunetv2/run/run_training.py", line 275, in run_training_entry
run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
File "/workspace/nnUNet/nnunetv2/run/run_training.py", line 211, in run_training
nnunet_trainer.run_training()
File "/workspace/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1370, in run_training
train_outputs.append(self.train_step(next(self.dataloader_train)))
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in __next__
item = self.__get_next_item()
^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
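If it helps with debugging, I can rerun with the background workers disabled so that the actual worker exception is printed directly instead of the generic message above. As far as I understand, the number of data augmentation workers is controlled by the nnUNet_n_proc_DA environment variable, and setting it to 0 makes nnU-Net fall back to single-threaded data loading:

```bash
# Rerun single-threaded so the real worker error surfaces (same dataset/configuration/fold as above).
nnUNet_n_proc_DA=0 nnUNetv2_train 888 3d_fullres 0 -p nnUNetResEncUNetLPlans
```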