
Transition Guide from 0.3.x to 0.4.0


Transition to Delira 0.4.0

Author: Justus Schock (@justusschock)

Delira 0.4.0 offers a couple of new features, which are essential for future developments. The core of all changes is the unified training and prediction API.

For this, we completely rewrote the trainer as a unified class and introduced a Predictor class. Unfortunately, these changes break backward compatibility quite a bit. Let's first see how we trained and defined a model in delira 0.3.2:

Old Training API: Delira 0.3.x

Let's assume we just want to train a very simple network consisting of 3 fully connected layers in PyTorch.

Model

Our model definition would probably look something like this:

from delira.models import AbstractPyTorchNetwork
import torch
import logging

class SimpleNet(AbstractPyTorchNetwork):
    def __init__(self, num_inputs=32, num_outputs=10, num_hidden=64):
        super().__init__()
        
        # build our actual network, just use some linear layers and relus here
        self.fc1 = torch.nn.Linear(num_inputs, num_hidden)
        self.fc2 = torch.nn.Linear(num_hidden, num_hidden)
        self.fc3 = torch.nn.Linear(num_hidden, num_outputs)
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        # pass our tensor x through all the layers
        return self.fc3(self.relu(self.fc2(self.relu(self.fc1(x)))))

    @staticmethod
    def closure(model, data_dict, optimizers, criterions=None, metrics=None, fold=0, **kwargs):
        # initialize variables
        if criterions is None:
            criterions = {}
        if metrics is None:
            metrics = {}
        assert (optimizers and criterions) or not optimizers, \
            "Criterion dict cannot be empty if optimizers are passed"

        loss_vals = {}
        metric_vals = {}
        total_loss = 0

        # choose suitable context manager
        if optimizers:
            context_man = torch.enable_grad
        else:
            context_man = torch.no_grad

        with context_man():
            # obtain predictions from network
            inputs = data_dict.pop("data")
            preds = model(inputs)

            # calculate losses
            if data_dict:
                for key, crit_fn in criterions.items():
                    _loss_val = crit_fn(preds, *data_dict.values())
                    loss_vals[key] = _loss_val.item()
                    total_loss += _loss_val

                # calculate metrics
                with torch.no_grad():
                    for key, metric_fn in metrics.items():
                        metric_vals[key] = metric_fn(preds, *data_dict.values()).item()

        # backpropagation
        if optimizers:
            optimizers["default"].zero_grad()
            total_loss.backward()
            optimizers["default"].step()

        # log values
        for key, val in {**metric_vals, **loss_vals}.items():
            logging.info({"scalar": {"name": key, "value": val}})

        return metric_vals, loss_vals, [preds.detach()]

    @staticmethod
    def prepare_batch(batch: dict, input_device, output_device):
        return_dict = {"data": torch.from_numpy(batch.pop("data")).to(input_device).to(torch.float)}

        for key, val in batch.items():
            return_dict[key] = torch.from_numpy(val).to(output_device).to(torch.float)

        return return_dict
        

For simplicity, we did not log predictions, did not enable mixed precision via APEX, and left out all the docstrings. We also implemented the prepare_batch method in a way that lets us use a simple L1 error for training.

Dataset

To train a model, we also need a dataset, which will provide the actual data. For this example we use an artificial dataset, which creates random arrays as input and output:

from delira.data_loading import AbstractDataset
import numpy as np

class RandomDataset(AbstractDataset):
    def __init__(self, length, num_inputs, num_outputs):
        super().__init__(None, None)
        # set attributes for length, number of inputs and number of outputs
        self._length = length
        self._num_inputs = num_inputs
        self._num_outputs = num_outputs

    def __getitem__(self, index):
        # sample random data
        input_data = np.random.rand(self._num_inputs)
        output_data = np.random.rand(self._num_outputs)
        return {"data": input_data, "label": output_data}

    def get_sample_from_index(self, index):
        return self.__getitem__(index)

    def __len__(self):
        return self._length

Preparations for training

Next we set up our hyperparameters for training as well as the model kwargs:

import torch
from delira.training import Parameters
params = Parameters(fixed_params={
    "model": {
        "num_inputs": 5,
        "num_hidden": 20,
        "num_outputs": 10
    },
    "training": {
        "batch_size": 64, # batchsize to use
        "num_epochs": 10, # number of epochs to train
        "optimizer_cls": torch.optim.Adam, # optimization algorithm to use
        "optimizer_params": {'lr': 1e-3}, # initialization parameters for this algorithm
        "losses": {"L1": torch.nn.L1Loss()}, # the loss function
        "lr_sched_cls": None,  # the learning rate scheduling algorithm to use
        "lr_sched_params": {}, # the corresponding initialization parameters
        "metrics": {"MSE": torch.nn.MSELoss()} # and some evaluation metrics
    }
})

After we created our parameters and thereby defined our actual training, we just need to create instances of our dataset for training and validation and wrap it into datamanagers:

from delira.data_loading import BaseDataManager, SequentialSampler, RandomSampler
dset_train = RandomDataset(5000, 5, 10)
dset_val = RandomDataset(500, 5, 10)

manager_train = BaseDataManager(dset_train, params.nested_get("batch_size"),
                                transforms=None, sampler_cls=RandomSampler,
                                n_process_augmentation=4)
manager_val = BaseDataManager(dset_val, params.nested_get("batch_size"),
                              transforms=None, sampler_cls=SequentialSampler,
                              n_process_augmentation=4)

For simplicity, we omitted any transforms here. Now we just have to create an Experiment and run it:

from delira.training import PyTorchExperiment
from delira.training.train_utils import create_optims_default_pytorch

experiment = PyTorchExperiment(params, SimpleNet,
                               name="ClassificationExample",
                               save_path="./tmp/delira_Experiments",
                               optim_builder=create_optims_default_pytorch,
                               gpu_ids=[0])
experiment.save()

model = experiment.run(manager_train, manager_val)

And that's it. Training should start now. The PyTorchExperiment internally creates a PyTorchNetworkTrainer and calls its train method. If we wanted to do the same with TensorFlow, we would have to use a TfExperiment instead, which creates and calls a TfNetworkTrainer. Sounds simple, right? Theoretically it is that simple, with just one problem:

So far, we had an AbstractTrainer specifying an interface that was implemented by both the PyTorchNetworkTrainer and the TfNetworkTrainer, and the same goes for the experiment classes. These implementations were not coupled and shared almost no code. Thus, the behavior of the two trainers, although providing the same API, could differ slightly or, in the worst case, be completely different. Additionally, there was a lot of code duplication inside the framework. To solve this, we implemented a BaseNetworkTrainer and a BaseExperiment, which contain most of the training and experiment code in a backend-agnostic way; the backend-specific classes (PyTorchNetworkTrainer, TfNetworkTrainer, PyTorchExperiment and TfExperiment) just add the backend-specific parts like seeding, switching between training and validation mode, and backend-specific options for initialization and potential speedups.

There's another problem: if we wanted to test a model on an extra test dataset, the way to go was to partially initialize a full trainer with dummy values and then call its predict function. This is extremely inefficient in terms of memory and computation time, as the trainer setup usually involves checking many conditions and eventually wrapping classes in other classes or changing arguments based on these conditions, and it is also not framework-agnostic.

To solve this, we created a new base class for the BaseExperiment: the Predictor. The Predictor provides the basic functionality for prediction and metric calculation; the BaseExperiment extends it with everything training-related.
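To make this concrete, here is a rough sketch of how standalone evaluation could look once the new hierarchy (Predictor → BaseExperiment → PyTorchExperiment) is in place. Note that the test call below only illustrates the concept; the exact name and signature of the evaluation method may differ, so please check the 0.4.0 API reference:

from sklearn.metrics import mean_squared_error

# hypothetical sketch: evaluate a trained model on a held-out set
# without setting up a dummy trainer (exact signature may differ!)
dset_test = RandomDataset(500, 5, 10)
manager_test = BaseDataManager(dset_test, params.nested_get("batch_size"),
                               transforms=None, sampler_cls=SequentialSampler,
                               n_process_augmentation=4)
results = experiment.test(model, manager_test,
                          metrics={"MSE": mean_squared_error})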

While doing so, we also solved another problem: delira's data_loading package was designed with huge datasets in mind, which cannot be stored completely in RAM, and therefore provides BaseLazyDataset and friends. When predicting from a dataset, however, we previously cached all predictions for one epoch. Although we only cached predictions from the validation/test set, which is usually much smaller than the actual training set, even these sets could cause out-of-memory errors.

To solve this, the Predictor now returns a generator when predicting from a datamanager, instead of caching everything internally.
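The pattern itself is plain Python: a generator yields one batch of predictions at a time, so at no point does a full epoch of predictions live in memory. A minimal illustration of the idea (not delira's actual implementation):

def predict_generator(model, batches):
    # yields one batch of predictions at a time instead of caching all of them
    for batch in batches:
        yield model(batch["data"])

# only one batch of predictions is held in memory at any point:
# for preds in predict_generator(model, batches):
#     do_something_with(preds)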

Now let's have a look at the new API and the changes:

New API: Delira 0.4.0

With the new API we only need to change a few minor things to make the training work again:

Model

First let's have a look at our new network definition:

from delira.models import AbstractPyTorchNetwork
import torch
import logging

class SimpleNet(AbstractPyTorchNetwork):
    def __init__(self, num_inputs=32, num_outputs=10, num_hidden=64):
        super().__init__()
        
        # build our actual network, just use some linear layers and relus here
        self.fc1 = torch.nn.Linear(num_inputs, num_hidden)
        self.fc2 = torch.nn.Linear(num_hidden, num_hidden)
        self.fc3 = torch.nn.Linear(num_hidden, num_outputs)
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        # pass our tensor x through all the layers
        out = self.fc3(self.relu(self.fc2(self.relu(self.fc1(x)))))
        # NOTE: models must now return a dict, which may contain an
        # arbitrary number of elements of arbitrary types
        return {"pred": out}

    # NOTE: criterions were renamed to losses
    @staticmethod
    def closure(model, data_dict, optimizers, losses=None, metrics=None, fold=0, **kwargs):
        # initialize variables
        if losses is None:
            losses = {}
        if metrics is None:
            metrics = {}
        assert (optimizers and losses) or not optimizers, \
            "Loss dict cannot be empty if optimizers are passed"

        loss_vals = {}
        metric_vals = {}
        total_loss = 0

        # choose suitable context manager
        if optimizers:
            context_man = torch.enable_grad
        else:
            context_man = torch.no_grad

        with context_man():
            # obtain predictions from network
            inputs = data_dict.pop("data")
            preds = model(inputs)

            # calculate losses
            if data_dict:
                for key, crit_fn in losses.items():
                    # NOTE: to access the actual tensor, we need to index the resulting dict
                    _loss_val = crit_fn(preds["pred"], *data_dict.values())
                    loss_vals[key] = _loss_val.item()
                    total_loss += _loss_val

                # calculate metrics
                with torch.no_grad():
                    for key, metric_fn in metrics.items():
                        metric_vals[key] = metric_fn(preds["pred"], *data_dict.values()).item()

        # backpropagation
        if optimizers:
            optimizers["default"].zero_grad()
            total_loss.backward()
            optimizers["default"].step()

        # log values
        for key, val in {**metric_vals, **loss_vals}.items():
            logging.info({"scalar": {"name": key, "value": val}})

        # NOTE: closure returns only dicts now
        return metric_vals, loss_vals, {k: v.detach() for k, v in preds.items()}

    @staticmethod
    def prepare_batch(batch: dict, input_device, output_device):
        return_dict = {"data": torch.from_numpy(batch.pop("data")).to(input_device).to(torch.float)}

        for key, val in batch.items():
            return_dict[key] = torch.from_numpy(val).to(output_device).to(torch.float)

        return return_dict
        

Altogether, the major changes inside the model were to make everything a dict (which is needed during training and prediction) and to rename criterions to losses for conformity between the backends.
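A quick sanity check of the new return convention (plain PyTorch, independent of delira):

import torch

net = SimpleNet(num_inputs=5, num_outputs=10, num_hidden=20)
out = net(torch.rand(4, 5))   # a batch of 4 samples with 5 features each
print(type(out))              # <class 'dict'>
print(out["pred"].shape)      # torch.Size([4, 10])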

Dataset

The new training API in delira 0.4.0 does not change the dataset API, meaning that our dataset from before works as is. (The dataset API changed from delira 0.3.1 to delira 0.3.2, as you can see here.)

Training Preparation

For the training preparation, the names of our hyperparameters have changed a bit. We now have to use val_metrics and train_metrics instead of metrics: currently, the train_metrics are still calculated within the closure and thus operate on the backend's tensor class, while the val_metrics operate on numpy arrays, since the validation is no longer done inside the closure but automatically inside the trainer/predictor. The next major release (0.5.0) will probably re-unite the metrics and move the calculation of the train metrics outside the closure.

Our new hyperparameters are:

import torch
from sklearn.metrics import mean_squared_error
from delira.training import Parameters
params = Parameters(fixed_params={
    "model": {
        "num_inputs": 5,
        "num_hidden": 20,
        "num_outputs": 10
    },
    "training": {
        "batch_size": 64, # batchsize to use
        "num_epochs": 10, # number of epochs to train
        "optimizer_cls": torch.optim.Adam, # optimization algorithm to use
        "optimizer_params": {'lr': 1e-3}, # initialization parameters for this algorithm
        "losses": {"L1": torch.nn.L1Loss()}, # the loss function
        "lr_sched_cls": None,  # the learning rate scheduling algorithm to use
        "lr_sched_params": {}, # the corresponding initialization parameters
        # NOTE: The name and the argument of the metrics have changed!
        "val_metrics": {"MSE": mean_squared_error} # and some evaluation metrics
    }
})
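Because the val_metrics now receive numpy arrays rather than backend tensors, any numpy-based function can be plugged in. For instance, a hand-written mean absolute error would work just as well as the sklearn metric (the argument order below follows sklearn's (y_true, y_pred) convention; double-check the order delira passes in your version):

import numpy as np

def mae(y_true, y_pred):
    # works purely on numpy arrays, no backend tensors involved
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

# drop-in replacement inside the parameters above:
# "val_metrics": {"MAE": mae}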

Creating the datasets and wrapping them into datamanagers remains the same as before.

Training

Our experiment class now accepts an additional key_mapping argument, which is a dict. This dict defines the mapping from the keys in our batch dict to the argument names our model accepts when it is called. Since calling a PyTorch model executes its forward, and our forward has one parameter x, we need to define the mapping as key_mapping={"x": "data"}, which means that the value stored under the data key of the batch dict will be passed as x to our model. This is necessary because the trainer automatically validates the network if a validation set is given. For this reason, the closure should not contain any data-dependent operations that are necessary for both prediction and training; such operations should be moved to the preprocessing or into the model's forward instead.
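Conceptually, the trainer applies this mapping roughly as follows when it calls the model during validation (illustrative only, not delira's actual code):

import torch

net = SimpleNet(num_inputs=5, num_outputs=10, num_hidden=20)
key_mapping = {"x": "data"}
batch = {"data": torch.rand(4, 5), "label": torch.rand(4, 10)}

# each forward argument is looked up in the batch dict via the mapping
kwargs = {arg: batch[key] for arg, key in key_mapping.items()}
preds = net(**kwargs)   # equivalent to net(x=batch["data"])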

The new training now looks like:

from delira.training import PyTorchExperiment
from delira.training.train_utils import create_optims_default_pytorch

experiment = PyTorchExperiment(params, SimpleNet,
                               name="ClassificationExample",
                               save_path="./tmp/delira_Experiments",
                               optim_builder=create_optims_default_pytorch,
                               key_mapping={"x": "data"},
                               gpu_ids=[0])
experiment.save()

model = experiment.run(manager_train, manager_val)

Hopefully the changes from delira 0.3.2 to delira 0.4.0 are now clear to you.

If there are any questions left, feel free to contact us. The best way to do so is via our Slack community or by opening an issue in this repo.
