
nndet prep_train fails using local Dockerfile #38

Closed

alexandreroutier opened this issue Feb 28, 2023 · 3 comments

@alexandreroutier

Hello,

I built your Docker version of the nnDetection network locally:

cd src/picai_baseline/nndetection/training_docker/
docker build . --tag joeranbosma/picai_nndetection:latest
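
To keep the full build output for later inspection, the log can also be saved while building (a sketch; build.log is just an illustrative filename):

docker build . --tag joeranbosma/picai_nndetection:latest 2>&1 | tee build.log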

The preprocessing steps worked well, but when running the nndet prep_train Docker command:

docker run --cpus=6 --gpus='"device=0"' -it --rm \
        -v /workdir:/workdir \
        joeranbosma/picai_nndetection:latest nndet prep_train \
        Task2203_picai_baseline /workdir/ \
        --custom_split /workdir/nnUNet_raw_data/Task2203_picai_baseline/splits.json \
        --fold 0

I get the following error message:

=============
== PyTorch ==
=============

NVIDIA Release 20.12 (build 17950526)
PyTorch Version 1.8.0a0+1606899

Container image Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.

Copyright (c) 2014-2020 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU                      (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015      Google Inc.
Copyright (c) 2015      Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

NVIDIA Deep Learning Profiler (dlprof) Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.      

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.       

NOTE: MOFED driver for multi-node communication was not detected.
      Multi-node communication performance may be reduced.       

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
      insufficient for PyTorch.  NVIDIA recommends the use of the following flags:
      nvidia-docker run --ipc=host ...

[#] Creating plans and preprocessing data
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 567, in _build_master
    ws.require(__requires__)
  File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 884, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 775, in resolve
    raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (packaging 20.4 (/opt/conda/lib/python3.8/site-packages), Requirement.parse('packaging>20.9'), {'shap'})

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/bin/nndet_prep", line 33, in <module>
    sys.exit(load_entry_point('nndet', 'console_scripts', 'nndet_prep')())
  File "/opt/conda/bin/nndet_prep", line 25, in importlib_load_entry_point
    return next(matches).load()
  File "/opt/conda/lib/python3.8/importlib/metadata.py", line 77, in load
    module = import_module(match.group('module'))
  File "/opt/conda/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 783, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/opt/code/nnDetection/scripts/preprocess.py", line 36, in <module>
    from nndet.planning import DatasetAnalyzer
  File "/opt/code/nnDetection/nndet/planning/__init__.py", line 2, in <module>
    from nndet.planning.experiment import PLANNER_REGISTRY
  File "/opt/code/nnDetection/nndet/planning/experiment/__init__.py", line 6, in <module>
    from nndet.planning.experiment.v001 import D3V001
  File "/opt/code/nnDetection/nndet/planning/experiment/v001.py", line 6, in <module>
    from nndet.ptmodule import MODULE_REGISTRY
  File "/opt/code/nnDetection/nndet/ptmodule/__init__.py", line 3, in <module>
    from nndet.ptmodule.base_module import LightningBaseModule
  File "/opt/code/nnDetection/nndet/ptmodule/base_module.py", line 24, in <module>
    import pytorch_lightning as pl
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/__init__.py", line 20, in <module>
    from pytorch_lightning import metrics  # noqa: E402
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/metrics/__init__.py", line 15, in <module>
    from pytorch_lightning.metrics.classification import (  # noqa: F401
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/metrics/classification/__init__.py", line 14, in <module>
    from pytorch_lightning.metrics.classification.accuracy import Accuracy  # noqa: F401
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/metrics/classification/accuracy.py", line 16, in <module>
    from torchmetrics import Accuracy as _Accuracy
  File "/opt/conda/lib/python3.8/site-packages/torchmetrics/__init__.py", line 14, in <module>
    from torchmetrics import functional  # noqa: E402
  File "/opt/conda/lib/python3.8/site-packages/torchmetrics/functional/__init__.py", line 14, in <module>
    from torchmetrics.functional.audio.pit import permutation_invariant_training, pit, pit_permutate
  File "/opt/conda/lib/python3.8/site-packages/torchmetrics/functional/audio/__init__.py", line 14, in <module>
    from torchmetrics.functional.audio.pit import permutation_invariant_training, pit, pit_permutate  # noqa: F401
  File "/opt/conda/lib/python3.8/site-packages/torchmetrics/functional/audio/pit.py", line 24, in <module>
    from torchmetrics.utilities.imports import _SCIPY_AVAILABLE
  File "/opt/conda/lib/python3.8/site-packages/torchmetrics/utilities/imports.py", line 22, in <module>
    from pkg_resources import DistributionNotFound, get_distribution
  File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3239, in <module>
    def _initialize_master_working_set():
  File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3222, in _call_aside
    f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3251, in _initialize_master_working_set
    working_set = WorkingSet._build_master()
  File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 569, in _build_master
    return cls._build_from_requirements(__requires__)
  File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 582, in _build_from_requirements
    dists = ws.resolve(reqs, Environment())
  File "/opt/conda/lib/python3.8/site-packages/pkg_resources/__init__.py", line 775, in resolve
    raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (packaging 20.4 (/opt/conda/lib/python3.8/site-packages), Requirement.parse('packaging>20.9'), {'shap'})
Traceback (most recent call last):
  File "/usr/local/bin/nndet", line 369, in <module>
    action(sys.argv[2:])
  File "/usr/local/bin/nndet", line 148, in nndet_prep_train
    subprocess.check_call(cmd)
  File "/opt/conda/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['nndet_prep', '2203']' returned non-zero exit status 1.

However, if I use the version from Docker Hub:

docker pull joeranbosma/picai_nndetection:latest

the docker run ... nndet prep_train command runs without issue. I believe something goes wrong when installing the dependencies in the Dockerfile, but I was not able to find the cause.
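
One way to narrow this down could be to diff the packages installed in the two images (a sketch; picai_nndetection:local is a hypothetical tag to keep the local build apart from the Docker Hub image):

docker run --rm joeranbosma/picai_nndetection:latest pip3 freeze > hub.txt    # packages in the Docker Hub image
docker build . --tag picai_nndetection:local
docker run --rm picai_nndetection:local pip3 freeze > local.txt               # packages in the local build
diff hub.txt local.txt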

Best,
Alexandre

@joeranbosma
Collaborator

Hello Alexandre,

Thanks for pointing out this issue! You are indeed correct; the build throws the following error:

ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.

We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.

tensorboard 2.12.0 requires protobuf>=3.19.6, but you'll have protobuf 3.14.0 which is incompatible.
requests 2.24.0 requires urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1, but you'll have urllib3 1.26.14 which is incompatible.
torchmetrics 0.11.1 requires torch>=1.8.1, but you'll have torch 1.8.0a0+1606899 which is incompatible.
shap 0.41.0 requires packaging>20.9, but you'll have packaging 20.4 which is incompatible.
docker 6.0.1 requires requests>=2.26.0, but you'll have requests 2.24.0 which is incompatible.
mlxtend 0.21.0 requires scikit-learn>=1.0.2, but you'll have scikit-learn 0.23.2 which is incompatible.

I've seen this error before: it's caused by updated dependencies (e.g., mlxtend 0.21.0) that conflict with one another. I was able to resolve it by pinning a number of dependencies to the exact versions used in joeranbosma/picai_nndetection:latest:

mlxtend==0.19.0
tensorboard==2.11.0
requests==2.28.1
torchmetrics==0.7.3
docker==6.0.1
packaging==20.4
mlflow==1.30.0

I've updated the repository such that the latest version builds successfully: https://github.com/DIAGNijmegen/picai_baseline/tree/main/src/picai_baseline/nndetection/training_docker

Could you try again with the updated requirements.txt?
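
When retrying, a clean rebuild avoids any chance of reusing cached layers with the old requirements (a sketch):

cd src/picai_baseline/nndetection/training_docker/
docker build . --no-cache --tag joeranbosma/picai_nndetection:latest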

@alexandreroutier
Author

Hi @joeranbosma,

Sorry for my late answer; I ran into issues after updating the requirements. I had to take the following extra steps:

  • Add SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True because of the sklearn vs. scikit-learn installation conflict.
  • "Downgrade" the version of nnDetection. The error regarding meshgrid (TypeError: meshgrid() got an unexpected keyword argument 'indexing') is due to an outdated version of PyTorch, as explained e.g. in lucidrains/deep-daze#194. Newer commits of nnDetection solved this PyTorch issue at some point. However, since the Dockerfile uses an old PyTorch base image, this creates a discrepancy between the old version of PyTorch and the latest version of nnDetection. I don't remember exactly how, but I chose the 1044ace5340b2a07bf9f9d5f92681f712cc0d2b4 commit from nnDetection. I believe this matches the version you had on Docker Hub, but I am not 100% sure.

In the end, I added:

 # Install mibaumgartner code
 COPY ./requirements.txt .
+ENV SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True
 RUN pip3 install -r requirements.txt  \
   && pip3 install hydra-core --upgrade --pre \
   && pip3 install git+https://github.com/mibaumgartner/pytorch_model_summary.git

 # Install nnDetection
 RUN git clone https://github.com/MIC-DKFZ/nnDetection /opt/code/nnDetection
+RUN cd /opt/code/nnDetection \
+    && git checkout 1044ace5340b2a07bf9f9d5f92681f712cc0d2b4
 COPY ./consolidate.py /opt/code/nnDetection/scripts/consolidate.py
 COPY ./predict.py /opt/code/nnDetection/scripts/predict.py

to the Dockerfile, and the meshgrid error disappeared. However, I faced another error during training. My computer was having memory issues at the same time, so I didn't have the opportunity to investigate further.
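
For anyone verifying this fix, a quick sanity check of the rebuilt image could look like the following (a sketch; torch.meshgrid only gained the indexing argument in PyTorch 1.10, which is why the old base image's 1.8.0a0 build breaks with newer nnDetection code):

docker run --rm joeranbosma/picai_nndetection:latest \
    python -c "import torch; print(torch.__version__)"    # expect the 1.8.0a0 base-image build
docker run --rm joeranbosma/picai_nndetection:latest \
    git -C /opt/code/nnDetection rev-parse HEAD           # expect 1044ace5340b2a07bf9f9d5f92681f712cc0d2b4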

Alexandre

@joeranbosma
Collaborator

This indeed did the trick! This fix will be integrated in #48.

joeranbosma added a commit that referenced this issue May 15, 2023

- Incorporate bugfix detailed in #38.
- Bugfix detection map generation from bounding boxes (5a33f23)