RuntimeError: One or more background workers are no longer alive. #54

Open
kaident-tr opened this issue Jul 24, 2024 · 4 comments
kaident-tr commented Jul 24, 2024

Hi all, when I start training in a Windows environment I get the error below. Even though I tried the solution from MIC-DKFZ/nnUNet#1343 in the original nnUNet repository (setting the environment variable OMP_NUM_THREADS=1), the problem persists.
Thank you in advance for your help!
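
For reference, setting the variable in Windows cmd before launching looks roughly like this (dataset 701 and the 2d configuration are taken from the log below; fold 0 is only an example):

```
:: Limit OpenMP threads for this session, then launch training.
set OMP_NUM_THREADS=1
nnUNetv2_train 701 2d 0
```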

This is the configuration used by this training:
Configuration name: 2d
{'data_identifier': 'nnUNetPlans_2d', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 14, 'patch_size': [512, 448], 'median_image_size_in_voxels': [512.0, 512.0], 'spacing': [0.7958984971046448, 0.7958984971046448], 'normalization_schemes': ['CTNormalization'], 'use_mask_for_norm': [False], 'UNet_class_name': 'PlainConvUNet', 'UNet_base_num_features': 32, 'n_conv_per_stage_encoder': [2, 2, 2, 2, 1, 1, 1], 'n_conv_per_stage_decoder': [2, 2, 2, 2, 1, 1], 'num_pool_per_axis': [6, 6], 'pool_op_kernel_sizes': [[1, 1], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2]], 'conv_kernel_sizes': [[3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3]], 'unet_max_num_features': 512, 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'batch_dice': True}

These are the global plan.json settings:
{'dataset_name': 'Dataset701_AbdomenCT', 'plans_name': 'nnUNetPlans', 'original_median_spacing_after_transp': [2.5, 0.7958984971046448, 0.7958984971046448], 'original_median_shape_after_transp': [97, 512, 512], 'image_reader_writer': 'SimpleITKIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'ExperimentPlanner', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 3071.0, 'mean': 97.29691314697266, 'median': 118.0, 'min': -1024.0, 'percentile_00_5': -958.0, 'percentile_99_5': 270.0, 'std': 137.85003662109375}}}

2024-07-24 17:20:43.049483: unpacking dataset...
2024-07-24 17:20:43.598747: unpacking done...
2024-07-24 17:20:43.599747: do_dummy_2d_data_aug: False
2024-07-24 17:20:43.666747: Unable to plot network architecture:
2024-07-24 17:20:43.666747: No module named 'hiddenlayer'
2024-07-24 17:20:43.759725:
2024-07-24 17:20:43.760716: Epoch 0
2024-07-24 17:20:43.761715: Current learning rate: 0.01
using pin_memory on device 0
Traceback (most recent call last):
File "\?\C:\ProgramData\Anaconda3\envs\umamba\Scripts\nnUNetv2_train-script.py", line 33, in
sys.exit(load_entry_point('nnunetv2', 'console_scripts', 'nnUNetv2_train')())
File "f:\u-mamba-main\umamba\nnunetv2\run\run_training.py", line 268, in run_training_entry
run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
File "f:\u-mamba-main\umamba\nnunetv2\run\run_training.py", line 204, in run_training
nnunet_trainer.run_training()
File "f:\u-mamba-main\umamba\nnunetv2\training\nnUNetTrainer\nnUNetTrainer.py", line 1258, in run_training
train_outputs.append(self.train_step(next(self.dataloader_train)))
File "f:\u-mamba-main\umamba\nnunetv2\training\nnUNetTrainer\nnUNetTrainer.py", line 900, in train_step
output = self.network(data)
File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "f:\u-mamba-main\umamba\nnunetv2\nets\UMambaBot_2d.py", line 432, in forward
skips[-1] = self.mamba_layer(skips[-1])
File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\torch\amp\autocast_mode.py", line 16, in decorate_autocast
return func(*args, **kwargs)
File "f:\u-mamba-main\umamba\nnunetv2\nets\UMambaBot_2d.py", line 61, in forward
x_mamba = self.mamba(x_norm)
File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\mamba_ssm\modules\mamba_simple.py", line 146, in forward
out = mamba_inner_fn(
File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\mamba_ssm\ops\selective_scan_interface.py", line 317, in mamba_inner_fn
return MambaInnerFn.apply(xz, conv1d_weight, conv1d_bias, x_proj_weight, delta_proj_weight,
File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\torch\autograd\function.py", line 539, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\torch\cuda\amp\autocast_mode.py", line 113, in decorate_fwd
return fwd(*args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\mamba_ssm\ops\selective_scan_interface.py", line 187, in forward
conv1d_out = causal_conv1d_cuda.causal_conv1d_fwd(
TypeError: causal_conv1d_fwd(): incompatible function arguments. The following argument types are supported:
1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: Optional[torch.Tensor], arg3: Optional[torch.Tensor], arg4: bool) -> torch.Tensor

Invoked with: tensor([[[-0.3531, -0.3256, -0.5120, ..., -0.3845, -0.3780, -0.2731],
[-0.1226, 0.0515, 0.0443, ..., -0.0484, -0.0954, 0.2243],
[ 0.2591, 0.4765, 0.4899, ..., 0.2762, 0.2085, 0.1601],
...,
[-0.4706, 0.0122, -0.0670, ..., -0.6855, -1.0694, -0.7547],
[ 0.2710, 0.6020, 0.5813, ..., 0.0339, 0.0822, 0.5069],
[-0.0817, 0.1549, 0.1879, ..., -0.1216, -0.4358, -0.3873]],

    [[-0.7350, -0.6563, -0.6970,  ..., -0.5548, -0.2491, -0.3194],
     [-0.3465, -0.6268, -0.4854,  ...,  0.2556,  0.1076,  0.1940],
     [ 0.0645,  0.5889,  0.7408,  ...,  0.4412,  0.1118,  0.2022],
     ...,
     [-0.7669, -0.8219, -0.9606,  ..., -0.6517, -0.6021, -0.7447],
     [ 0.6877,  0.3808,  0.4204,  ...,  0.2805,  0.3491,  0.3867],
     [ 0.1577,  0.0902,  0.0191,  ..., -0.5127, -0.3992, -0.4217]],

    [[-0.6899, -0.6800, -0.7939,  ..., -0.2452, -0.2823, -0.2156],
     [-0.2452, -0.2569, -0.4180,  ...,  0.2565,  0.3105,  0.2020],
     [ 0.4328,  0.6825,  0.6242,  ...,  0.2382,  0.2548,  0.2945],
     ...,
     [-0.5348, -0.4934, -0.6218,  ..., -0.8466, -0.8843, -0.9299],
     [ 0.1885,  0.4097,  0.3503,  ...,  0.5430,  0.5202,  0.5581],
     [-0.4576, -0.3852, -0.5572,  ..., -0.4343, -0.5026, -0.4852]],

    ...,

    [[-0.3982, -0.6243, -0.6702,  ..., -0.2997, -0.0544, -0.6496],
     [-0.3635, -0.3576, -0.4177,  ...,  0.1261,  0.1114,  0.0181],
     [ 0.3839,  0.7153,  0.7155,  ...,  0.2303,  0.1457, -0.1998],
     ...,
     [-0.6408, -0.5035, -0.6167,  ..., -0.6473, -0.4699, -0.2966],
     [ 0.3132,  0.4346,  0.4209,  ...,  0.0756,  0.2835,  0.2599],
     [-0.2990, -0.3384, -0.4100,  ...,  0.0843, -0.1040, -0.0645]],

    [[-0.4619, -0.7534, -0.7760,  ..., -0.5952, -0.3705, -0.3551],
     [-0.1528, -0.3495, -0.3650,  ...,  0.0889,  0.2627,  0.0885],
     [ 0.5250,  0.7301,  0.7312,  ...,  0.2815,  0.2979,  0.2394],
     ...,
     [-0.6124, -0.5625, -0.6515,  ..., -0.4177, -0.9805, -0.9586],
     [ 0.3327,  0.3848,  0.4037,  ...,  0.0295,  0.4747,  0.5617],
     [-0.3875, -0.3905, -0.4910,  ..., -0.0437, -0.5517, -0.5322]],

    [[-0.5744, -0.5597, -0.6744,  ..., -0.4591, -0.5266, -0.3234],
     [-0.2457, -0.3103, -0.3841,  ...,  0.0146,  0.0279,  0.0058],
     [ 0.5145,  0.6709,  0.6334,  ...,  0.0854,  0.1010,  0.3496],
     ...,
     [-0.6111, -0.6036, -0.6492,  ..., -0.6807, -0.6825, -0.8804],
     [ 0.2965,  0.4934,  0.4702,  ...,  0.5427,  0.5108,  0.7819],
     [-0.3857, -0.3858, -0.3655,  ..., -0.4994, -0.5220, -0.0722]]],
   device='cuda:0', requires_grad=True), tensor([[ 0.2771, -0.4502,  0.2234,  0.4393],
    [-0.2371,  0.0904,  0.3013,  0.2585],
    [-0.2705,  0.0695,  0.4170, -0.1234],
    ...,
    [ 0.3458, -0.2377, -0.4476,  0.1447],
    [ 0.4869,  0.3001, -0.4930,  0.0575],
    [ 0.4755, -0.2672,  0.3849, -0.0855]], device='cuda:0',
   requires_grad=True), Parameter containing:

tensor([-0.0066, -0.3897, 0.1920, ..., 0.1256, -0.0983, -0.4903],
device='cuda:0', requires_grad=True), None, None, None, True
Exception in thread Thread-4 (results_loop):
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\umamba\lib\threading.py", line 1016, in _bootstrap_inner
self.run()
File "C:\ProgramData\Anaconda3\envs\umamba\lib\threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\batchgenerators\dataloading\nondet_multi_threaded_augmenter.py", line 125, in results_loop
raise e
File "C:\ProgramData\Anaconda3\envs\umamba\lib\site-packages\batchgenerators\dataloading\nondet_multi_threaded_augmenter.py", line 103, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
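
Looking at the traceback, the crash itself is the TypeError from causal_conv1d_cuda.causal_conv1d_fwd being called with more arguments than the installed extension accepts, which I suspect is a version mismatch between mamba-ssm and causal-conv1d rather than a problem with the background workers. A quick sketch for checking what is actually installed (both packages expose __version__ as far as I know):

```
# Inspect the installed versions that mamba_ssm links against
pip show mamba-ssm causal-conv1d
python -c "import mamba_ssm, causal_conv1d; print(mamba_ssm.__version__, causal_conv1d.__version__)"
```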

Saul62 commented Jul 26, 2024

I'm running into the same problem. Have you solved it yet?

qxxfd commented Jul 27, 2024

I'm running into the same problem as well. Have you solved it yet?
CUDA_VISIBLE_DEVICES=1 nnUNetv2_train 11 3d_fullres 0

############################
INFO: You are using the old nnU-Net default plans. We have updated our recommendations. Please consider using those instead! Read more here: https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/resenc_presets.md
############################

Using device: cuda:0

#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################

2024-07-28 01:12:55.969214: do_dummy_2d_data_aug: True
2024-07-28 01:12:55.970464: Using splits from existing split file: /gpfs/share/home/2301210659/tools/nnunet_v2/dataset/nnUNet_preprocessed/Dataset011_T-tubule/splits_final.json
2024-07-28 01:12:55.971022: The split file contains 5 splits.
2024-07-28 01:12:55.971232: Desired fold for training: 0
2024-07-28 01:12:55.971411: This split has 4 training and 1 validation cases.
using pin_memory on device 0
Exception in background worker 3:
local variable 'region_labels' referenced before assignment
Traceback (most recent call last):
File "/gpfs/share/home/2301210659/.conda/envs/nnunetv2/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 53, in producer
item = next(data_loader)
File "/gpfs/share/home/2301210659/.conda/envs/nnunetv2/lib/python3.9/site-packages/batchgenerators/dataloading/data_loader.py", line 126, in next
return self.generate_train_batch()
File "/gpfs/share/home/2301210659/tools/nnunet_v2/nnUNet/nnunetv2/training/dataloading/data_loader_3d.py", line 61, in generate_train_batch
tmp = self.transforms(**{'image': data_all[b], 'segmentation': seg_all[b]})
File "/gpfs/share/home/2301210659/.conda/envs/nnunetv2/lib/python3.9/site-packages/batchgeneratorsv2/transforms/base/basic_transform.py", line 18, in call
return self.apply(data_dict, **params)
File "/gpfs/share/home/2301210659/.conda/envs/nnunetv2/lib/python3.9/site-packages/batchgeneratorsv2/transforms/utils/compose.py", line 13, in apply
data_dict = t(**data_dict)
File "/gpfs/share/home/2301210659/.conda/envs/nnunetv2/lib/python3.9/site-packages/batchgeneratorsv2/transforms/base/basic_transform.py", line 18, in call
return self.apply(data_dict, **params)
File "/gpfs/share/home/2301210659/.conda/envs/nnunetv2/lib/python3.9/site-packages/batchgeneratorsv2/transforms/base/basic_transform.py", line 67, in apply
data_dict['segmentation'] = self._apply_to_segmentation(data_dict['segmentation'], **params)
File "/gpfs/share/home/2301210659/.conda/envs/nnunetv2/lib/python3.9/site-packages/batchgeneratorsv2/transforms/utils/seg_to_regions.py", line 17, in _apply_to_segmentation
if isinstance(region_labels, int) or len(region_labels) == 1:
UnboundLocalError: local variable 'region_labels' referenced before assignment
Traceback (most recent call last):
File "/gpfs/share/home/2301210659/.conda/envs/nnunetv2/bin/nnUNetv2_train", line 8, in
sys.exit(run_training_entry())
File "/gpfs/share/home/2301210659/tools/nnunet_v2/nnUNet/nnunetv2/run/run_training.py", line 275, in run_training_entry
run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
File "/gpfs/share/home/2301210659/tools/nnunet_v2/nnUNet/nnunetv2/run/run_training.py", line 211, in run_training
nnunet_trainer.run_training()
File "/gpfs/share/home/2301210659/tools/nnunet_v2/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1362, in run_training
self.on_train_start()
File "/gpfs/share/home/2301210659/tools/nnunet_v2/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 903, in on_train_start
self.dataloader_train, self.dataloader_val = self.get_dataloaders()
File "/gpfs/share/home/2301210659/tools/nnunet_v2/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 696, in get_dataloaders
_ = next(mt_gen_train)
File "/gpfs/share/home/2301210659/.conda/envs/nnunetv2/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in next
item = self.__get_next_item()
File "/gpfs/share/home/2301210659/.conda/envs/nnunetv2/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
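
In case it helps: the generic "background workers are no longer alive" message only mirrors whatever failed in a worker (here the UnboundLocalError for region_labels). One way I know of to get that error raised directly in the main process is to disable the background workers, assuming this nnU-Net version honors the nnUNet_n_proc_DA variable:

```
# Run data augmentation in the main process so the real exception
# appears in the main traceback instead of the worker message.
export nnUNet_n_proc_DA=0
CUDA_VISIBLE_DEVICES=1 nnUNetv2_train 11 3d_fullres 0
```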

kaident-tr (Author)

> I'm running into the same problem as well. Have you solved it yet? CUDA_VISIBLE_DEVICES=1 nnUNetv2_train 11 3d_fullres 0 […]

I have solved my problem. I think many different issues ultimately end up as "One or more background workers...", so try to trace back to the traceback printed above it for the actual cause. In my case the fix was reinstalling the required packages. You could also try Python 3.10 (I saw that the authors recommend 3.10).
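
Roughly the reinstall steps I mean, following the install section of the U-Mamba README (the exact torch / causal-conv1d / mamba-ssm versions should be taken from the README itself; the ones below are only what I remember it listing):

```
# Recreate the environment with Python 3.10, as the authors recommend.
conda create -n umamba python=3.10 -y
conda activate umamba
pip install torch==2.0.1 torchvision==0.15.2   # check the README for the matching CUDA index URL
pip install "causal-conv1d>=1.2.0"
pip install mamba-ssm --no-cache-dir
git clone https://github.com/bowang-lab/U-Mamba
cd U-Mamba/umamba
pip install -e .
```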

@AyacodeYa

Hi everyone, maybe this can help you. #56 (comment)
