Final plan
- Ashwani
  - Attention visualisation
  - Splitting the datasets in ONMT training
- Ben
  - Write scripts to extract features from data, e.g. sentence length, magnitude of numbers
- Filippo
  - Running the model and gathering statistics on loss, BLEU, decomposed by difficulty, features
    - Loss is cross-entropy - need to add to `metrics.py` calculations
Reference paper: Out-of-Distribution Generalization via Risk Extrapolation, anonymous authors, 2020
Our aim is to generalise to test data that is clearly different from, but a natural extrapolation of, the training data. There are particular features that need to be extrapolated, e.g. the magnitude of numbers or the length of formulas.
Now consider the plot below. It's from a specific example, but in general, read the x axis as the relevant feature that we want to generalise over (such as number magnitude) and the y axis as performance.
The dots represent performance on different training environments. In our case, the different environments could be "addition problems with smaller numbers" and "addition problems with larger numbers". To clarify: this does not have to correspond to the different maths modules of our dataset. I think splits within each module makes more sense here.
Now, suppose that performance on both training environments is pretty good, but the larger-magnitude environment is worse: the blue data. What tends to happen is, for an extreme extrapolation (the right end of the plot), performance will be much worse. I hypothesise that this is basically what's happening on our extrapolation set.
We can address this by reducing the variance of performance between training environments. This corresponds to bringing the two points closer together in the plot (green and orange data). This leads to better performance on the extreme extrapolation, because we remove dependence on the feature (the x axis), flattening out the performance.
This way, we can get better test performance (green and orange, right-hand side) even though average training performance may worsen (green and orange, left-hand side). This is OK: when you reduce overfitting, training is worse but test is better. But there is a trade-off, as indicated by the beta hyperparameter.
Analyse the statistics of the dataset
- This is similar to something we already discussed in the interim report, which would be good to do regardless:

  > From a very preliminary analysis on this task [arithmetic__add_or_sub_big], there appears to be some correlation between the complexity of the question (i.e. the number of digits that need to be added and the amount of text in the question) and the correctness of the result. In future we could test this systematically, measuring the variables over the whole dataset using a post-processing program.
- For each module, there is a corresponding extrapolation set that increases some feature. The features can pretty much be summarised as "big", "longer", or "more". We need to measure these features precisely and split the training datasets by them; 2 or 3 partitions would probably be enough (see the sketch after this list).
- It's possible that the "easy", "medium" and "hard" difficulty splits are already suitable as the different environments. But we should verify this by measuring the relevant features for each difficulty. If the correlation is too low, we should use the method in the above point.
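
The sketch below shows one way the measurement and splitting could look. It is only a sketch: the file names, the one-question-per-line format and the choice of features are assumptions, not decisions.

```python
# Hypothetical feature-extraction/splitting script. File names, the
# one-question-per-line format and the feature choices are all assumptions.
import re

def features(question):
    """Measure the candidate "big" / "longer" / "more" features of a question."""
    numbers = [int(n) for n in re.findall(r"\d+", question)]
    return {
        "length": len(question.split()),           # "longer"
        "max_magnitude": max(numbers, default=0),  # "big"
        "num_numbers": len(numbers),               # "more"
    }

def split_by_feature(questions, feature, n_bins=3):
    """Partition questions into n_bins environments of increasing feature value."""
    ranked = sorted(questions, key=lambda q: features(q)[feature])
    size = -(-len(ranked) // n_bins)  # ceiling division
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]

if __name__ == "__main__":
    with open("arithmetic__add_or_sub_big.src") as f:  # hypothetical file name
        questions = [line.strip() for line in f if line.strip()]
    for i, env in enumerate(split_by_feature(questions, "max_magnitude")):
        with open(f"env_{i}.src", "w") as out:         # hypothetical output names
            out.write("\n".join(env) + "\n")
```

The same script could also just dump the measured features, so we can check how well they correlate with the existing difficulty splits (previous bullet).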
Check the variance in performance between the split datasets
- Before using the method, we first need to verify that performance actually differs significantly between the splits (e.g. smaller numbers vs. larger numbers). If there is little difference, there will be little variance to minimise in the first place. A rough check is sketched below.
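
As a concrete (toy) illustration of that check, something like the following could be run on the evaluation outputs for each split; `sacrebleu` is just one convenient BLEU implementation, and the toy data stands in for real model outputs.

```python
# Toy illustration of the per-split check; real hypotheses/references/losses
# would come from evaluating the trained model on each split.
import statistics
import sacrebleu  # assumption: any BLEU implementation would do

def split_report(name, hypotheses, references, losses):
    """Print BLEU and mean cross-entropy for one split; return the mean loss."""
    bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
    mean_loss = statistics.mean(losses)
    print(f"{name}: BLEU = {bleu:.2f}, mean loss = {mean_loss:.4f}")
    return mean_loss

splits = {  # made-up numbers purely to show the shape of the comparison
    "small_numbers": (["12", "7"], ["12", "8"], [0.3, 0.9]),
    "large_numbers": (["1302", "77"], ["1352", "78"], [1.1, 1.4]),
}
mean_losses = [split_report(n, h, r, l) for n, (h, r, l) in splits.items()]
# If this is already tiny, there is little variance for the penalty to reduce.
print("variance between splits:", statistics.pvariance(mean_losses))
```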
Train on the regularised loss
- From the paper, the regularised (V-REx) objective is `sum_e R_e(θ) + β · Var({R_1(θ), ..., R_m(θ)})`, where `R_e(θ)` is the expected loss on environment e. Expected loss is approximated by the sample average of the loss over the training points; that sounds fancy, but it's just the usual meaning of "loss" in deep learning. We are already optimising the first term, so we just need to measure the variance between the per-environment losses and set the hyperparameter beta (see the sketch after this list).
- According to the authors, the regularisation term should only be applied partway through training, so we need to choose when. This is discussed in section 4.2.1 of the paper. They suggest:

  > Stability penalties should be applied around when traditional overfitting begins, to ensure that the model has learned predictive features, and that penalties still give meaningful training signals.

  This still isn't too hard: we would need to measure when the model starts to overfit (perhaps this already happens for the baseline). Hopefully their heuristic works for us, but there is a risk that it does not; in that case we could try tuning the schedule ourselves, though this would be costly.
- It requires modifications to OpenNMT: splitting training into the different datasets and then combining their results to get the loss. I have not yet investigated this. Given the simplicity of the method I am assuming it will be feasible. But we should investigate first.
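
To make the objective concrete, here is a minimal PyTorch sketch of the combined loss plus a crude on/off schedule for beta. It is illustrative only: all names are assumptions, the schedule shape may differ from the paper's, and it says nothing yet about how to wire this into OpenNMT's training loop, which is the open question above.

```python
import torch

def vrex_loss(env_losses, beta):
    """V-REx-style objective: sum of per-environment risks plus beta times their variance.

    env_losses: one scalar tensor per environment (mean cross-entropy on that
    environment's batch); beta: the trade-off hyperparameter.
    """
    risks = torch.stack(env_losses)
    return risks.sum() + beta * risks.var(unbiased=False)

def beta_schedule(step, penalty_start, beta_final):
    """Crude on/off schedule: no penalty until penalty_start, full penalty after.

    penalty_start is what we would choose by watching when overfitting begins;
    the paper's actual schedule may differ.
    """
    return beta_final if step >= penalty_start else 0.0

# Toy usage: pretend these are the mean losses on two environments at one step.
small = torch.tensor(0.42, requires_grad=True)
large = torch.tensor(0.97, requires_grad=True)
beta = beta_schedule(step=5000, penalty_start=2000, beta_final=10.0)
loss = vrex_loss([small, large], beta)
loss.backward()
print(loss.item(), small.grad, large.grad)
```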
Tokenisation/preprocessing
- English whole words
  - Very straightforward, worth a try
- Polish notation
  - Somewhat challenging to write a conversion script (see the sketch after this list)
  - The conversion is context-sensitive
  - Not all problems are applicable
  - May need to handle different cases even for the problems that are applicable
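
As a starting point for the conversion script, the pure-expression case is easy; the sketch below leans on Python's own parser and handles only bare arithmetic expressions. Pulling the expression out of the surrounding English (the context-sensitive part) is not addressed here.

```python
# Minimal sketch: bare arithmetic expression -> Polish (prefix) notation.
# It does not locate expressions inside English text, which is the hard part.
import ast

OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/",
       ast.Pow: "**", ast.USub: "-"}

def to_polish(node):
    if isinstance(node, ast.Expression):
        return to_polish(node.body)
    if isinstance(node, ast.BinOp):
        return f"{OPS[type(node.op)]} {to_polish(node.left)} {to_polish(node.right)}"
    if isinstance(node, ast.UnaryOp):
        return f"{OPS[type(node.op)]} {to_polish(node.operand)}"
    if isinstance(node, ast.Constant):
        return str(node.value)
    raise ValueError(f"unsupported construct: {ast.dump(node)}")

print(to_polish(ast.parse("-5 + 3*2", mode="eval")))  # "+ - 5 * 3 2"
```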
Attention visualisation
- Try using existing visualisation code on our model (a minimal standalone sketch follows this list)
- Check many examples carefully to see if we can interpret the attention mechanism in a meaningful way
  - If there is a meaningful interpretation: great, we can highlight it in the report, and maybe it will give insight into potential improvements.
  - If there is no meaningful interpretation: we still report the results, but with only a brief discussion.
- Ashwani seems well-positioned to work on this
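
If the existing visualisation code turns out to be awkward to reuse, a standalone heatmap is only a few lines. The sketch below assumes we can dump a (target length × source length) matrix of attention weights per example, e.g. from OpenNMT's attention debug output if available, or a forward hook; the shapes, names and toy data are assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_attention(attn, src_tokens, tgt_tokens, path="attention.png"):
    """Heatmap of one attention matrix: rows = target tokens, columns = source tokens."""
    fig, ax = plt.subplots()
    ax.imshow(attn, cmap="viridis", aspect="auto")
    ax.set_xticks(range(len(src_tokens)))
    ax.set_xticklabels(src_tokens, rotation=90)
    ax.set_yticks(range(len(tgt_tokens)))
    ax.set_yticklabels(tgt_tokens)
    fig.tight_layout()
    fig.savefig(path)

# Toy example with made-up weights (rows sum to 1, like softmaxed attention).
src = ["What", "is", "5", "+", "3", "?"]
tgt = ["8"]
plot_attention(np.random.dirichlet(np.ones(len(src)), size=len(tgt)), src, tgt)
```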