This project is an end-to-end deep learning pipeline for lung cancer detection using 3D CT scans. It begins by identifying potential nodules in the lungs through segmentation, followed by filtering these candidates using a cascade of specialized CNN classifiers to assess malignancy.
The first step involves loading the raw 3D CT scan data and preprocessing it so that it is suitable for the subsequent steps. This includes:

- Transforming XYZ (continuous, patient-space) coordinates to IRC (discrete, voxel-index) coordinates, using the direction matrix and voxel dimensions attached in the metadata files (sketched below).
- Unifying the XYZ coordinates of the candidate nodules and the annotations (ground-truth nodule information).
- Standardizing the voxel intensity (density) values to fit the context of our problem.
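As a rough illustration of the first and third items: the coordinate transform undoes the scan's origin offset and direction matrix and divides out the voxel size, while intensity standardization clips raw densities to a plausible range. Below is a minimal sketch, assuming `origin_xyz`, `voxel_size_xyz`, and the 3×3 `direction` matrix come from the scan's metadata (e.g. as read by SimpleITK); the names are illustrative, not the repo's exact API:

```python
import numpy as np

def xyz2irc(coord_xyz, origin_xyz, voxel_size_xyz, direction):
    """Map a continuous patient-space (X, Y, Z) point to a discrete
    (Index, Row, Col) voxel address, using the scan's metadata."""
    # Undo the offset and rotation, then divide out the per-axis voxel size.
    cri = np.linalg.inv(direction) @ (np.asarray(coord_xyz) - origin_xyz)
    cri = cri / voxel_size_xyz
    # Round to the nearest voxel and flip (C, R, I) -> (I, R, C).
    return tuple(int(round(v)) for v in reversed(cri))

def standardize_hu(ct_voxels, lo=-1000.0, hi=1000.0):
    """Clip raw density values to a plausible Hounsfield-unit range so that
    outliers (air padding, metal) don't dominate the intensity scale."""
    return np.clip(ct_voxels, lo, hi).astype(np.float32)
```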
Relevant files:

- `utils/common_utils/util.py`: Contains general utility functions for data transformations.
- `utils/CT.py`: Handles loading and preprocessing of CT scan data.
This step utilizes a U-Net architecture to segment the lung region and detect candidate areas that potentially look like nodules. It generates a heatmap of regions of interest, allowing the pipeline to focus on candidate nodules while discarding irrelevant anatomy. All components of this subsystem are implemented from scratch, including:
- The U-Net model architecture.
- A custom dataset with dataset balancing, GPU-based augmentation, and disk pre-caching for accelerated training (see the sketch after this list).
- A fully fledged command-line application for training that parses command-line arguments, provides a full-featured `--help` command, and is easy to run in a wide variety of environments.
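To give a flavor of the GPU-based augmentation mentioned above, here is a minimal sketch of random flips and offsets implemented as an `nn.Module`, so the warping runs on the same device as the training batch. It illustrates the technique, not the repo's exact implementation:

```python
import torch
from torch import nn
import torch.nn.functional as F

class SegAugment(nn.Module):
    """Random flip + small offset applied to a batch of 2D CT slices
    and their masks, entirely on the GPU."""
    def __init__(self, flip=True, offset=0.1):
        super().__init__()
        self.flip, self.offset = flip, offset

    def forward(self, img, mask):
        # Build one random 2D affine transform (3x3; last row unused).
        t = torch.eye(3, device=img.device)
        for i in range(2):
            if self.flip and torch.rand(1).item() > 0.5:
                t[i, i] *= -1                                  # mirror this axis
            t[i, 2] = self.offset * (2 * torch.rand(1).item() - 1)  # small shift
        theta = t[:2].unsqueeze(0).expand(img.size(0), -1, -1)
        grid = F.affine_grid(theta, list(img.size()), align_corners=False)
        aug_img = F.grid_sample(img, grid, padding_mode='border',
                                align_corners=False)
        aug_mask = F.grid_sample(mask.float(), grid, padding_mode='border',
                                 align_corners=False) > 0.5
        return aug_img, aug_mask
```

Because the warp operates on whole batches already resident on the GPU, it adds almost no per-epoch CPU overhead.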
Relevant files:

- `utils/Unet.py`: Implements the U-Net architecture used for segmentation.
- `utils/seg_dset.py`: Defines the custom dataset class for segmentation, including data loading and balancing.
- `utils/seg_dset_utils.py`: Provides utilities for dataset augmentation.
- `utils/seg_precaching.py`: Implements disk pre-caching functionality to speed up the training process.
- `seg_training.py`: Implements the command-line application for segmentation model training.
Group the detected regions from the segmentation heatmap into clusters that represent potential nodules. Each cluster is identified by the coordinates of its center, providing a targeted approach for further analysis. This step acts as the bridge between the segmentation model and the CNN classifiers.
To achieve this, we employ a simple connected-components algorithm for grouping suspected nodule voxels into chunks, which will then be passed to the classification step.
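Concretely, the grouping can be done with `scipy.ndimage`: label connected blobs in the thresholded heatmap, then take each blob's center of mass as the candidate's center. A minimal sketch, assuming `seg_mask` is a boolean 3D array obtained by thresholding the segmentation output:

```python
import numpy as np
import scipy.ndimage as ndi

def group_candidates(seg_mask):
    """Group suspected nodule voxels into blobs and return one center
    coordinate (in IRC index space) per blob."""
    labeled, blob_count = ndi.label(seg_mask)        # merge touching voxels
    centers_irc = ndi.center_of_mass(
        seg_mask.astype(np.float32),
        labels=labeled,
        index=np.arange(1, blob_count + 1),          # one center per blob
    )
    return centers_irc
```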
Relevant file:

- `nodule_analysis_pipline.py`: Implements the entire pipeline, from segmentation to nodule grouping, analysis, and diagnosis.
After grouping the potential nodules detected by the segmentation model, the next step is to screen out any false positives using a CNN-based binary classification network. This step aims to refine the detection by distinguishing between actual nodules and non-nodules. All components for this subsystem are implemented from scratch, including:
- A custom dataset with data balancing, GPU-based augmentation, and disk pre-caching to accelerate training.
- A fully-featured command-line application for training the classifier, with functionality for fine-tuning the malignancy classifier in the following step.
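For orientation, a classifier like this typically stacks a few 3D convolutional blocks over small patches cropped around each candidate's center. A minimal sketch of one such block (illustrative, not necessarily the exact architecture in `utils/NoduleClassifier.py`):

```python
import torch.nn as nn

class ConvBlock3d(nn.Module):
    """Two 3x3x3 convolutions followed by a pooling step that halves
    each spatial dimension of the candidate patch."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(2),
        )

    def forward(self, x):
        return self.block(x)
```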
Relevant files:

- `utils/NoduleClassifier.py`: Implements the CNN architecture used for nodule classification.
- `utils/classifier_dset.py`: Defines the custom dataset class for nodule classification, including data loading and balancing.
- `utils/classifier_dset_utils.py`: Provides utilities for dataset augmentation.
- `utils/nodule_precaching.py`: Implements disk pre-caching functionality to speed up the training process.
- `classifier_training.py`: Implements the command-line application for training the classification model.
In the final step, the classification results are aggregated to make a comprehensive diagnosis. The system evaluates each identified nodule using a dedicated CNN classifier to determine whether it is malignant or benign. This malignancy classifier is fine-tuned from the nodule classifier to enhance accuracy in distinguishing between benign and malignant nodules. The entire pipeline culminates in generating a report that flags patients as suspicious if at least one nodule is identified as potentially malignant.
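The fine-tuning itself boils down to loading the nodule classifier's weights and retraining only the last block(s), which is what the `--finetune` / `--finetune-depth` flags shown later control. A minimal sketch, assuming `NoduleClassifier` can be constructed with defaults and that the checkpoint stores weights under a `model_state` key (both assumptions; adjust to the actual checkpoint layout):

```python
import torch
from utils.NoduleClassifier import NoduleClassifier  # repo's classifier class

def load_for_finetuning(checkpoint_path, finetune_depth=1):
    model = NoduleClassifier()  # assumes default constructor arguments
    ckpt = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])  # assumed checkpoint key

    # Freeze everything except the last `finetune_depth` top-level blocks,
    # so training updates the head while the early features stay intact.
    blocks = list(model.children())
    for block in blocks[: len(blocks) - finetune_depth]:
        for param in block.parameters():
            param.requires_grad = False
    return model
```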
Relevant files:

- `utils/classifier_dset.py`: Defines the custom dataset class for malignancy classification, including data loading and balancing.
- `classifier_training.py`: Implements the command-line application for training the malignancy classification model.
- `nodule_analysis_pipline.py`: Implements the entire pipeline, from segmentation to nodule grouping, analysis, and diagnosis.
This project uses the LUNA 2016 (LUng Nodule Analysis) dataset, which consists of 3D CT scans labeled with lung nodule annotations. The dataset is part of a challenge aimed at improving nodule detection algorithms through standardized evaluation. It includes 10 subsets of scans for tasks like Nodule Detection (NDET) and False Positive Reduction (FPRED).
For more details, visit the LUNA 2016 Grand Challenge.
To set up the Lung Nodule Analysis System, follow these steps:
- **Clone the Repository**

  ```bash
  git clone [email protected]:joyou159/Lung-Nodule-Analysis-System.git
  cd Lung-Nodule-Analysis-System
  ```

- **Install Dependencies**

  ```bash
  pip install -r requirements.txt
  ```

- **Run Inference**

  ```bash
  # check --help to get all possible command-line arguments
  python nodule_analysis_app.py "subject's series_uid"
  ```
For training, I utilized the Kaggle platform to develop both the classification and segmentation models. Fortunately, half of the dataset, specifically subsets 0 through 4, is hosted on Kaggle, which allowed me to set up a notebook and begin working directly.
- Classification Models
- Segmentation Model
- Analysis Pipeline
Each notebook works as a standalone module, so feel free to investigate each one individually.
During the training process, I employed TensorBoard to monitor model metrics and performance while identifying any potential issues. You can explore these insights by launching the `tensorboard.ipynb` notebook after specifying the runs directory.
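For example, TensorBoard can be launched from a notebook cell with the standard magics (assuming the logs are written under a local `runs/` directory; point `--logdir` at your actual runs location):

```python
# In a Jupyter/Kaggle notebook cell:
%load_ext tensorboard
%tensorboard --logdir runs
```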
**Note:** Due to limited on-disk space on Kaggle (≈ 19.5 GB) and the compute quota, the pipeline models were trained on only one subset of the LUNA 2016 dataset instead of the entire dataset. This limitation was necessary to make on-disk dataset pre-caching, and thus high-speed training, feasible. As a result, the models' performance may be somewhat lower than if they had been trained on the complete dataset. If you have sufficient computational resources, we encourage you to train the models on the entire dataset, as this would significantly enhance performance.
Before retraining these models, ensure that you adjust the file paths in the source code:
- Update the path to `malignancy_annotations\annotations_for_malignancy.csv` to match your local setup.
- Set the dataset directory paths as needed to reflect your environment.
- **Classification Models**

  ```bash
  # Pre-caching to speed up the training process.
  # Include more subsets in --subsets-included if possible.
  python utils/nodule_precaching.py \
      --num-workers 4 \
      --batch-size 64 \
      --subsets-included 0
  ```

  ```bash
  # Use --help to get more details about each argument.
  # Train the nodule classifier; include more subsets if possible.
  python classifier_training.py \
      --num-workers 4 \
      --batch-size 128 \
      --epochs 10 \
      --subsets-included 0 \
      --augmented \
      --augment-flip \
      --augment-offset \
      --augment-scale \
      --augment-rotate \
      --augment-noise
  ```

  ```bash
  # Use --help to get more details about each argument.
  # Fine-tune the malignancy classifier; include more subsets if possible.
  python classifier_training.py \
      --num-workers 4 \
      --batch-size 64 \
      --malignant \
      --dataset MalignantLunaDataset \
      --finetune nodule_cls_model_path \
      --finetune-depth 1 \
      --epochs 10 \
      --subsets-included 0
  ```
- **Segmentation Model**

  ```bash
  # Pre-caching to speed up the training process.
  # Include more subsets in --subsets-included if possible.
  python utils/seg_precaching.py \
      --num-workers 4 \
      --batch-size 64 \
      --subsets-included 0
  ```

  ```bash
  # Try --help to get more details about each argument.
  # Train the segmentation model; include more subsets if possible.
  python seg_training.py \
      --num-workers 4 \
      --batch-size 128 \
      --epochs 10 \
      --subsets-included 0 \
      --augmented \
      --augment-flip \
      --augment-offset \
      --augment-scale \
      --augment-rotate \
      --augment-noise
  ```
This project is guided by *Deep Learning with PyTorch* by Eli Stevens, Luca Antiga, and Thomas Viehmann, published by Manning Publications. The concepts and techniques discussed in Part 2 of the book have been particularly influential in shaping the implementation and design of this pipeline.