Ben project log
Installing mathematics dataset
- Example generate command:
python -m mathematics_dataset.generate --filter=linear_1d
- Get an error from
python -m mathematics_dataset.generate
:train/calculus__differentiate
Traceback (most recent call last):
  File "/home/ben/miniconda3/envs/nlp/lib/python3.7/site-packages/sympy/core/compatibility.py", line 419, in as_int
    raise TypeError
TypeError

During handling of the above exception, another exception occurred:
...
- `sympy` is 1.5.1, their `setup.py` says `sympy>=1.2`
- Downgraded to 1.2
- Now it works
How does BERT work?
- Some or all of it is an encoder. It encodes a sequence of tokens into some tensor.
- Maybe the decoder is the part that is fine-tuned and the encoder is just the pretrained part that ships
Is `transformers` the right library?
- So far it isn't really meant for "seq2seq" tasks, which is what we're doing here
- `fairseq` is for seq2seq but gets a worse rap
https://github.com/pytorch/fairseq/blob/master/fairseq/tasks/translation.py
fairseq translation example
- Read through it. It extends the general training API so it's not what I was looking for.
Installing fairseq from source so we can use examples
Executing "training a new model" preprocessing
- https://fairseq.readthedocs.io/en/latest/getting_started.html#training-a-new-model
- Preprocessing looks complicated
- I need to know what the final format should be, and how I can get there
- The mathematics dataset repo may have insight
- They have a pregenerated data tar
- Files are `.txt`
Preprocessing math data
- Use `fairseq-preprocess` to pre-process and binarise text files, e.g.
fairseq-preprocess \
--trainpref math/train-easy --validpref math/valid-easy --testpref math/interpolate \
--source-lang question --target-lang answer \
--destdir math-bin --dataset-impl raw
- Character-level example: https://fairseq.readthedocs.io/en/latest/tutorial_classifying_names.html
- Data space-separates the characters. Will we be able to do this? We want spaces to be recognised, but for the model to still be character-level.
- Tentative plan to preprocess math data
- Original file format is `difficulty/subject.txt`
- Each `.txt` has a series of pairs of lines of text. The first line in a pair is the question - space-separated words. The second line in a pair is the answer - space-separated words (not sure if it's ever more than one word).
- Separate out questions from answers (see the sketch after this list)
- Read each text file line-by-line
- Every first line written to source file
- Every second line written to target file
- Prepare for tokenization
- Space-separate
- Tokenize
- Organise the data files
- Access the data files (e.g. by subclassing `FairseqTask`)
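A rough sketch of this plan in Python. The file names and helper functions here are placeholders for illustration, not the project's actual scripts; the underscore trick is the one noted under Mechanics further down.

def to_char_tokens(line):
    # Keep spaces recognisable by mapping them to underscores,
    # then space-separate the characters for character-level tokenisation.
    return " ".join(line.replace(" ", "_"))

def split_questions_answers(in_path, src_path, tgt_path):
    # Alternating lines: questions go to the source file, answers to the target file.
    with open(in_path) as fin, open(src_path, "w") as fsrc, open(tgt_path, "w") as ftgt:
        for i, line in enumerate(fin):
            line = line.rstrip("\n")
            out = fsrc if i % 2 == 0 else ftgt
            out.write(to_char_tokens(line) + "\n")

split_questions_answers("algebra__linear_1d.txt",
                        "algebra__linear_1d_src.txt",
                        "algebra__linear_1d_tgt.txt")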
How to build upon original work
- Use multi-modal data: Tokenise English by words, but symbols by characters
Ideas
- It's common to use pretrained word embeddings, or at least train an embedding as the first module of the model. What can we do with math embeddings? Is an Embedding module already part of standard architectures?
Learning
- Attention
- Focus on a particular part of the input at a given step of the forward pass
- Encoder passes every hidden state to the decoder, rather than just the last one
- For each decoder step, a weight (softmaxed attention score) is assigned to each hidden state
- The context vector for a decoder step is the weighted sum of encoder hidden states (see the sketch after this list)
- Multi-headed: expands ability to focus on different positions, and gives multiple "representation subspaces"
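A minimal PyTorch sketch of the weighted-sum step above. The shapes and the dot-product scoring are illustrative only, not a claim about any particular architecture.

import torch

enc_states = torch.randn(10, 512)       # one encoder hidden state per source position
dec_state = torch.randn(512)            # decoder hidden state at the current step
scores = enc_states @ dec_state         # attention scores (dot-product scoring, for illustration)
weights = torch.softmax(scores, dim=0)  # softmaxed attention weights, one per encoder state
context = weights @ enc_states          # context vector = weighted sum of encoder states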
Mechanics
- We can guarantee that spaces are included in the character tokenisation by converting them to underscores
Spinning up a minimal working example with OpenNMT
- Preprocess
- OpenNMT command default
onmt_preprocess -train_src data/src-train.txt -train_tgt data/tgt-train.txt -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt -save_data data/demo
- Split our data: `python split_dataset.py`
- SRC file is ~19 MiB, TGT file is ~1.6 MiB
- Modifying `split_dataset.py` to further split into train and valid
- Adapting preprocess command
onmt_preprocess -train_src data/mathematics_dataset-v1.0/train-easy-split/algebra__linear_1d_src_train.txt -train_tgt data/mathematics_dataset-v1.0/train-easy-split/algebra__linear_1d_tgt_train.txt -valid_src data/mathematics_dataset-v1.0/train-easy-split/algebra__linear_1d_src_valid.txt -valid_tgt data/mathematics_dataset-v1.0/train-easy-split/algebra__linear_1d_tgt_valid.txt -save_data data/demo/demo
- Executed
- OK, we should also add an option for character-level splitting to the script. Or it might be better to write a separate script that overwrites the files.
- Underscores for spaces:
line = line.replace(' ', '_')
- Space separation:
line = ' '.join(line)
- The above ops do not affect single-character lines
- Training
- OpenNMT command default
onmt_train -data data/demo -save_model demo-model
- Our command
onmt_train -data data/demo/demo -save_model demo-model -train_steps 250
- Inference
- OpenNMT command default
onmt_translate -model demo-model_XYZ.pt -src data/src-test.txt -output pred.txt -replace_unk -verbose
- Our command
onmt_translate -model demo-model_step_250.pt -src data/mathematics_dataset-v1.0/interpolate-split/algebra__linear_1d_src_test.txt -output data/demo/pred.txt -replace_unk -verbose
- 1000 step model: predicts 4 every time!
- 250 step model: predicts 3 every time!
- 3 is only about 7% of the validation set answers.
- Either it needs more training, or it's not working at all (a quick check of the predictions is sketched below)
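A quick way to verify the degenerate-output claim, assuming the predictions are one answer per line in `pred.txt`:

from collections import Counter

with open("data/demo/pred.txt") as f:
    counts = Counter(line.strip() for line in f)

print(counts.most_common(5))  # a degenerate model shows one answer dominating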
Diagnosing degenerate model
- Last time we found with a default command that the model (apparently) learns a degenerate solution of the same output (a single digit) no matter what the input is.
- Possible causes
- The data is not being preprocessed as expected (e.g. character-level tokens)
- Overfitting / model too large
- But how could it achieve ~55% training accuracy with the same single digit?
- Something wrong with transfer to the test/inference regime
- Is the data in the same format?
- It would be good to see input/output examples during training, to check that it is in fact outputting the same digit
- Unknowns
- Whether the stdout during training reports training or validation accuracy
- Most likely training, because validation is reported at end of epoch (but how is an epoch defined? The number of epochs argument is deprecated)
- Use the `-valid_steps` argument: perform validation every X steps
- The first thing I'm going to try is a smaller model. The model previously used was ~10M parameters, which I think is very excessive relative to the single file of training data.
- Assume each model parameter is FP32 -> 4 bytes.
`algebra__linear_1d_src_train.txt` is 35770113 bytes. To memorise the data you thus need ~9M parameters.
- Reducing RNN hidden state size from 500 to 50, which reduces the model to ~300k parameters.
- I think a critical oversight I may have had is just how few training steps were performed. Sounds like training batch size is dynamic with "sents"---sentences?---but validation batch size is 32. So 250 steps is not nearly enough to cover 600k samples (600k/32 = 18750)
- Now trying 10k train steps with validation every 1k steps.
- Validation accuracy % is (46 vs. 64) then (60 vs. 62) then (62 vs 62). So if this model ends up still outputting the same digit, the accuracy metric is not doing what I think it's doing.
- Yep, tested inference on checkpoint 5000 and it always outputs 2.
- Woah, hang on a tick. Checkpoint 10000 outputs different digits!
- So it initially learns to output the same digit, then diversifies? I've never seen that kind of learning. From experience with image generation nets, once it learns to output the same thing every time there's no hope. Maybe outputting the same digit is just a product of the parameters still being somewhat random and small from initialisation? I need to understand sequence models better...
- 5000 performance:
PRED AVG SCORE: -2.1865, PRED PPL: 8.9042
- 10000 performance:
PRED AVG SCORE: -2.0108, PRED PPL: 7.4691
- 15000:
PRED AVG SCORE: -1.7671, PRED PPL: 5.8539
- Validation accuracy 54%
- 20000:
PRED AVG SCORE: -1.9983, PRED PPL: 7.3768
- Validation accuracy 68%, yet worse...
- Although the output diversifies, it still doesn't show any sign of understanding the input (i.e. getting correct answers). This highlights the need to change the performance metrics to suit the task, and/or change the preprocessing.
- Is it because there are start and end tokens, and they are included in the accuracy? Or perhaps the space character is included?
- The inference predictions in `pred.txt` don't have any whitespace
- It would be insightful to try a set of problems that require longer answers!
- In the DeepMind paper, `calculus__differentiate` is one of the best-performing modules: P(correct) ~= 93% for the Simple LSTM. It has long answers with symbols, e.g. `8*d**3 - 70*d`
- We need to look into how accuracy is measured, because it is clearly not what we expect. Is there a way to have custom metrics?
Next
- Try `calculus__differentiate` data
Reviewing `calculus__differentiate` size 100 model
- Latest checkpoint 85000
- It has learned syntax well
- Occasionally it produces an answer with partial mathematical correctness
- The correctness is more in terms of getting some of the correct digits in order, but producing too few digits (based on BLEU evaluation, length ratio is very low, e.g. ~20% at step 85000)
- Length ratio was due to a mismatch in the BLEU command; ignore
- Made-up example to illustrate the kind of thing it does: `Differentiate 814530*x**2 -> 16291*x`. So it does some kind of doubling of the coefficient, but misses some of the digits.
- Run BLEU evaluation
perl /home/ben/projects/nlp/OpenNMT-py/tools/multi-bleu.perl <path/to/reference/file> < <path/to/prediction/file>
- Create smaller validation data (1000 lines)
head -1000 data/mathematics_dataset-v1.0/train-easy-split/calculus__differentiate_tgt_valid.txt > data/mathematics_dataset-v1.0/train-easy-split/calculus__differentiate_tgt_valid_1000.txt
- We don't think `onmt_translate` compares to the test answers - you don't specify a reference file. It just computes the model's log likelihood (PRED SCORE) and perplexity (PPL)
- We ran the BLEU Perl command wrong -- mismatching src with tgt. It turns out that BLEU is pretty good relative to the amount of training and model size. So we are basically ready to run on the cluster!
Testing transformer command
- OpenNMT recommended (for Google WMT replication)
python train.py -data /tmp/de2/data -save_model /tmp/extra \
-layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 \
-encoder_type transformer -decoder_type transformer -position_encoding \
-train_steps 200000 -max_generator_batches 2 -dropout 0.1 \
-batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 \
-optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \
-max_grad_norm 0 -param_init 0 -param_init_glorot \
-label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 \
-world_size 4 -gpu_ranks 0 1 2 3
- Leave out `world_size` for non-parallel, and leave out `gpu_ranks` for non-GPU
- This results in 44208683 parameters (44M)
- Saxton et al. (2019) parameters
- Model
- Transformer
- embedding size: 512
- heads: 8
- ff size: 2048
- Optimiser
- Adam
- learning rate: 6e-4
- beta1: 0.9 (default)
- beta2: 0.995
- epsilon: 1e-9 (default)
- Training
- batch size: 1024
- hardware: 8x NVIDIA P100
- batches: 500k
- absolute gradient value clipping: 0.1
- Switching to using a config file based on the `config-transformer-base-4GPU.yml` example in the OpenNMT-py source: `onmt_train -config config/config-transformer-base-4GPU.yml`
- Set according to Saxton settings above
- Also a small version for initial testing: `config-transformer-small.yml`
Reading deep learning for symbolic mathematics paper
- Polish notation is interesting. It would be a challenging text processing problem to convert all relevant expressions in the mathematics dataset to Polish notation. We could manually select a subset of problems that have easily convertible notation.
- Would it make more sense for the sequence model to have input in the opposite order to Polish notation? So that the first operations that need to be computed come first.
Where is the reported ONMT accuracy metric defined
- `utils.statistics.Statistics.accuracy()`: `return 100 * (self.n_correct / self.n_words)`
- As expected
Training small transformer
- Note that batch size (AFAICT) is the number of question-answer pairs, rather than tokens (which would be specified by an explicit setting in config)
- Validation accuracy was much lower (IIRC ~50% compared to ~65%) after 1000 steps; running again with more frequent monitoring
- Still improving after 500 steps
Tutorial
- 10am Minto
- Tutor is a computer vision specialist
- Keep a google doc shared with tutor for progress
- Tutor email: [email protected]
Reviewing DeepMind paper
- Curriculum learning: train over many (all) topics, but not all at once. Find the best order of topics to train for the best final performance.
- Focus on how to improve extrapolation, even if the dataset/model are very limited compared to this paper.
- Compare embedding numeracy with a language model e.g. BERT. Is numeracy much better given we train on explicitly mathematical data? Is extrapolation better?
- I recall a Kaggle competition on predicting the next value in a sequence of numbers. There is such a task under `train-hard/algebra__sequence_next_term.txt` and `train-hard/algebra__sequence_nth_term.txt`. The best Kaggle results could have good insight.
Reviewing DeepMind paper
- Differentiable Neural Computer did not work well - but why? In principle it has good capability, by storing intermediate results in the memory bank
- Simple LSTM: one-hot encoding input
- For LSTM models: additional steps added (encoder or decoder?) with zero input, to allow further computations before outputting the answer
- "we observed that increasing the number of “thinking” steps (as defined above) from 0 up to 16 increased the performance."
- Adaptive Computation Time does not help performance
- Relational memory core (RMC) also tested, but is not better than LSTMs, and best hyperparameter setting yielded 1 memory slot (this somewhat defeats the purpose of RMC).
- "perhaps it is hard for the RMC to learn to use slots for manipulating mathematical entities"
- Perhaps notably, Attentional RMC with bidir LSTM encoder gives the best extrapolation performance besides Transformer, but only 1% above next best of Attentional LSTM with bidir LSTM encoder
- Models predict answers autoregressively, i.e. predict the current token knowing the previously predicted tokens
- What about refining the prediction once it is complete? Like multiple passes over the answer. I'm skeptical that it would help though.
- What do humans do to check their answer? I check that I have understood the question correctly. I check the answer to see if it is sane. Then I might go through my steps of working, checking for local validity.
- AFAICT they do not use beam search, but they don't mention it so maybe they do by default. Beam search is very common.
- Results
- "RMCs were more data efficient but trained more slowly"
- "LSTMs had better asymptotic performance"
- "attentional LSTM and the simple LSTM have similar performance...We speculate that the attentional model is not learning to algorithmically parse the question"
- Something to investigate further
- "the Transformer has various advantages over LSTM architectures, such as (1) doing more calculations with the same number of parameters, (2) having a shallower architecture (with better gradient propagation), and (3) having an internal "memory" that is sequential, which is more pre-disposed to mathematical objects like sequences of digits."
- "Overall it seems that magnitude is easy for neural networks to learn."
- Add/subtract and multiply/divide are good separately (>90%), but poor together (~50%)
- Evidence that models learn relatively shallow tricks, rather than algebraic/algorithmic manipulation
- Transformer is much better at polynomial manipulation, attributed to parallel sequential architecture being able to hold multiple coefficients in memory
- Models cannot add ones together correctly for n >= 7 ones
- May be relying on the numbers being different to align subsums
- We could test this further by trying other repeated numbers, and a mix of repeated and distinct numbers
- Totally consistent question phrasing induces fragility, e.g. the presence of a full stop is the difference between correct and incorrect answers
- Extrapolation: "models completely failed to add together more numbers than seen during training, which agrees with the suspicion that models have learnt to add numbers in parallel rather than calculating subsums"
- So there are competing hypotheses about whether it adds subsums or adds in parallel
- In general this paper speculates several reasons for its results. A good research question could be to test one of these speculations.
- Idea: visualising attention for these models
Reading symbolic calculus paper
- How much did they test functions that are rare under their generation methods? This would indicate generalisation
- They simplify expressions to their shortest unique form, and replace coefficients on like terms with a single coefficient on a single term
- Transformer
- 8 heads
- 6 layers
- 512 units
- Adam
- lr 1e-4
- batch size 256 (equations)
- Inference: beam search with early stopping
- Beam width: 1, 10, 50
- Output is correct if at least one in beam is correct
- This is considered OK when the solution can be easily verified, e.g. differentiating back the integral
- No constraints enforced on output (it tends to learn syntax very well)
- Solutions evaluated by comparing to reference solution (in simplest form) in SymPy
- Not clear that they actually use SymPy; they say "we can"
Current summary of potential research objectives
- Use beam search with early stopping on DeepMind dataset
- Tokenise English as words
- What improves extrapolation (even relatively, at the cost of interpolation performance)
- Visualise attention of trained model to analyse reasoning
- Probe hidden states of trained model and/or generate counterfactual training examples to analyse reasoning
- Does Polish notation improve performance on DeepMind dataset (where applicable)?
- Use insight from "Physics as Inverse Graphics" to improve extrapolation
- Apply neurosymbolic model
Reading Wallace et al.
- Adds further evidence for NNs being bad at extrapolation
- The probe model cannot predict numbers outside the training range from embeddings
- Big accuracy drops when question-answering dataset is modified with bigger numbers, or conversion from digits to words
Reading Do et al.
- The difficulty with multi-step problems can be seen as non-smooth loss - small changes in input can give a big change in the output which isn't correlated in a direct way
- Hypothesis: supervision on intermediate steps smooths the loss to improve learning and in turn performance
- Data: DeepMind Mathematics, limited to particular hard problems
- We could adopt this approach
- Evaluating and simplifying polynomials, evaluating arithmetic expressions using order of operations, finding polynomial roots, and finding remainders.
- Augment with intermediate steps
- Oh...this is a real amateur paper. They don't even have results for their method. Let's hope that's not us!
- At least they open-source it
Setting up on cluster
- Bridge from local to mlp via DICE
ssh -N -L localhost:3306:mlp:3306 s1000116@dice
- `-N` is to not execute a remote command, useful for just forwarding ports
- Can't get this to work
- Set up the repo and env on cluster
Reviewing run-model.sh
- Currently the expected file structure appears to be
project-dir
    config
        config1.yml
        config2.yml
        ...
    exp
        exp1
            data.train.0.pt
            data.valid.0.pt
            data.vocab.pt
        exp2
        ...
    train.sh
- Considerations for data pipeline
- Organise experiment directories by data then by model, or model then data, or everything in the same directory with distinct file names?
- The minimal file structure for the node would just have the necessary files for the particular experiment, i.e.
project-dir
    config.yml
    data.train.0.pt
    data.valid.0.pt
    data.vocab.pt
- To do this, we would copy the filled-in config template into `project-dir` as the generically named `config.yml`.
- Perhaps instead of storing the training data binaries with each experiment, we keep them under `data/bin` by category, copy the relevant binaries into `project-dir` when executing, and delete them from `project-dir` afterwards.
Research question
- Scope: neural (particularly seq2seq) models to solve diverse high-school level worded mathematics problems
- Part 1: what are the mechanisms underlying worse extrapolation?
- Part 2: how can extrapolation performance be improved?
- Procedure:
- Train baseline Transformer on problems that have corresponding extrapolation data
- This is small relative to the full DeepMind dataset; a Transformer of modest size is expected to be sufficient
- Analyse trained Transformer
- Probe outputs for different edge cases
- Visualise attention
- Compare failure
- Problem: the failure modes may not be interpretable or distinguishable
- Based on insights, implement a modification that is expected to improve extrapolation
- Different preprocessing
- SymPy
- Tokenization
- Hard attention
- An extra module after the transformer - look into the broader literature on extrapolation
- Regularisation
- Whether or not the extrapolation difference is interpretable (see above), we need to design something here, something original in some way (it can be just slightly novel)
- Could focus around the edges of extrapolation, because that could be easier
Ashwani's cluster directory structure
project-dir
config
config1.yml
...
exp
calculus__differentiate
data
processed_data # don't need this
data.train.pt
...
logs
log.txt
model
model_step_10000.pt
inference.sh
train.sh
Running on cluster
- Tar command for project directory:
tar -czvf project-dir.tar.gz project-dir
- SLURM job 740897
- Reduce to 4 hours to try to get running quickly: 740898
- New config data subdir: 740899/740900/740901
- Prints out many message like this between training progress (e.g. 5-6 times per 100 training steps):
[2020-02-05 14:31:57,248 INFO] Loading dataset from exp/calculus__differentiate/data/data.train.0.pt [2020-02-05 14:32:00,935 INFO] number of examples: 79137
- Yet tokens per second is fast: e.g. `207651/41274 tok/s` (first number is source, second is target)
- Ashwani reported a similar magnitude in his run: `356742/68644 tok/s`. This is roughly 10x what I got in my laptop CPU run.
- Ashwani's run took 1836 seconds for 1000 steps; on my laptop, 1628 seconds.
- So despite having an order of magnitude higher token processing rate, the total time is about the same between 4 GPUs and 1 CPU. This is highly suspicious.
- This suggests there is a bottleneck other than the model execution. The other main source of computation we have thought of is preparing the data file(s).
- `calculus__differentiate/data.train.0.pt` is 12860868 bytes (~13 MB)
- For comparison: the zipped CIFAR-10 is 163 MB. So raw size should not be a problem.
- Model memory?
- Using a ~250k parameter Transformer
- Separate issue: where is the job accessing and saving data?
- Is there a way to access a node's directories outside slurm?
- The slurm `.out` prints that it saved a model checkpoint: `[2020-02-05 14:52:25,099 INFO] Saving checkpoint exp/calculus__differentiate/model/model_step_2000.pt`
- But this is not in `disk/scratch`, nor my home directory.
- Also, I renamed `exp/calculus__differentiate/data` (the directory given in `train.sh`) to `exp/calculus__differentiate/data1` just as a test. Yet it reports finding a file `exp/calculus__differentiate/data/data.train.0.pt`. I don't know where this could be, or if it somehow infers the directory by regex.
Working out cluster execution
- Let's print out the device
python -c "import torch;print('DEVICE:', torch.cuda.current_device())"
- Job: 773829, 773890
- Prints `0`, which indicates a valid CUDA device
- It throws an error because of `data1` now... weird
- Changed `data1` to `data` under `project-dir`, and re-tar-ed it
- Moved more device info into Python script: 774285
- No error for data now
- Perhaps because the files stayed on scratch? But if I `ls /disk/scratch/s1000116`, there is nothing... But why the error now? Maybe because scratch is deleted periodically, and it stayed there long enough.
, there is nothing...But why the error now? Maybe because scratch is deleted periodically, and it stayed there long enough. - http://computing.help.inf.ed.ac.uk/cluster-tips
Each node has its own local scratch space, and each node can only access its own scratch space. Scratch space is faster than the distributed filesystem because it's always local to the machine.
- It's odd that `data.train.0.pt` is ~12 MiB while `data.valid.0.pt` is ~15 MiB. This is also the case on local. Does it contain the complete dataset? Seems suspicious.
- Default `-shard_size` is 1000000
- Specify `-shard_size 0`: same result
- Specify `-shard_size 10000`: roughly the same total sizes, but train is split into 60 shards of size ~210KB while valid is split into 7 shards of size ~2.3MB (except for the last)
- Naturally I should check the source files...
- `src_train`: ~85.7 MB
- `src_valid`: ~9.5 MB
- `tgt_train`: ~19.0 MB
- `tgt_valid`: ~2.1 MB
- We should try a smaller batch size. By my calculation it's not as big a memory load as e.g. 100 CIFAR images, but it's plausible that it's making a difference. After all, why would DeepMind limit to a batch size of 1024 on 8x P100 GPUs, if they could easily go larger?
- Batch size 16: 775978
- Almost never see the loading printout! But much slower overall: ~8k src tok/sec vs. ~180k (though that's very high variance: saw 70k and 300k)
- 1000 steps in 366 seconds -> 16000/366 = 43.72 samples/second
- Compare batch size 1024: 1000 steps in 931 seconds -> 1024000/931 = 1100 samples/second
- Factor: ~25, which is close to the tok/sec ratio of 22.5
- Suggests that the data bottleneck (if there is one) is no different in total
- I'm assuming 1 step is 1 batch.
- Also added `export CUDA_VISIBLE_DEVICES=0,1,2,3` to `run-model.sh` for this
- Without it: 776374; about the same, so slurm is probably already doing this under the hood, or it's unnecessary anyway
- Batch size 128: 776780
- 1000 steps in 370 seconds -> 128000/370 = 346 samples/second
- Suggests diminishing returns or some optimum between 128 and 1024
- Batch size 256: 777060
- 1000 steps in 415 seconds -> 256000/415 = 617 samples/second
- Batch size 512: 777631
- 1000 steps in 568 seconds -> 512000/568 = 901 samples/second
- So it's diminishing returns, but not negative returns. 1024 is still best.
Recovering 512 batch size 10000-step run
- Tar file names were not consistent at the end of `run-model.sh`: `saved` vs. `saved_base`. I have made them both `saved`
- The saved tar preserves the full directory structure from root. This seems overkill; just use the working directory.
- We should use early stopping
- Log files are split into chunks; a number is appended, e.g. `log.txt.1`. So don't bother using `.txt`
- Note `-shuffle` is not implemented for `onmt_preprocess` -- we have to do it ourselves. But the data should be essentially shuffled anyway; it's just that it will be in a consistent order between experiments.
-- have to do it ourselves. But data should be essentially shuffled anyway; it's just that it will be in a consistent order between experiments. - Validation perplexity is lowest at 1000 steps (4.5) and increases except for 2000-3000
- Compare to training perplexity which starts at 6.9 and ends at 1.14.
- Validation accuracy peaks at 5000 steps (this is ~4 epochs)
Preparing revised experiment
- Early stopping criteria: `ppl` (perplexity) or `accuracy`
- `ppl`
- Early stopping patience: 3
Combined dataset
- Extrapolation files
algebra__polynomial_roots_big.txt arithmetic__add_or_sub_big.txt arithmetic__add_sub_multiple_longer.txt arithmetic__div_big.txt arithmetic__mixed_longer.txt arithmetic__mul_big.txt arithmetic__mul_div_multiple_longer.txt comparison__closest_more.txt comparison__kth_biggest_more.txt comparison__sort_more.txt measurement__conversion.txt numbers__place_value_big.txt numbers__round_number_big.txt probability__swr_p_level_set_more_samples.txt probability__swr_p_sequence_more_samples.txt
- Corresponding training modules
algebra__polynomial_roots arithmetic__add_or_sub arithmetic__add_sub_multiple arithmetic__div arithmetic__mixed arithmetic__mul arithmetic__mul_div_multiple comparison__closest comparison__kth_biggest comparison__sort measurement__conversion numbers__place_value numbers__round_number probability__swr_p_level_set probability__swr_p_sequence
- 15 modules vs. 56: ~27%
- 500k -> 134k batches. Let's say 100k.
- Model: very roughly 25% capacity relative to 30M model, if it fits. One-half settings gives 5.6M parameters which would do.
- Command to shuffle two files in the same way
paste -d '|' src_train.txt tgt_train.txt | shuf | awk -v FS="|" '{ print $1 > "src_train_shuf.txt" ; print $2 > "tgt_train_shuf.txt" }'
- Checked that problems don't contain the `|` character: `grep -rn './' -e '|'`
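A Python equivalent of the paste/shuf pipeline above, in case the `|` delimiter ever becomes a problem. File names here are illustrative.

import random

with open("src_train.txt") as f_src, open("tgt_train.txt") as f_tgt:
    pairs = list(zip(f_src, f_tgt))  # keep question/answer lines aligned

random.shuffle(pairs)

with open("src_train_shuf.txt", "w") as f_src, open("tgt_train_shuf.txt", "w") as f_tgt:
    for s, t in pairs:
        f_src.write(s)
        f_tgt.write(t)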
Attempting baseline
- Unpack tar into specific directory (and remove 1 layer of folder nesting)
tar -xzvf file.tar.gz -C folder --strip-components=1
- 790283 (typo in macro), 790287
- Out of memory
- Will try quarter FF size (512), everything else half: 3,988,779 parameters
- 790298
- Out of memory
- Quarter hidden size (128): 1,406,123 parameters
- I think it is more important to preserve number of heads at 4, and number of layers at 3. This is based on intuitions that (1) number of heads is important to the relative success of the Transformer, (2) number of layers important to allow the model to perform multi-step reasoning.
- Interactive session
- Looks OK - 30 steps
- Has stagnated for about 30 minutes (30 steps was for validation and it hasn't reached that yet). Not sure if this is a problem on my end, or the cluster is just overloaded.
- Full experiment: 790599
- Forgot to recompress project-dir
- Full experiment: 790626
- At 300 steps
- Timed out, forgot about 1 hour limit I put for testing!
- New full experiment: 792747
- Interesting note, worth watching: number of parameters on cluster is slightly different to local. Cluster: 1,409,200; local: 1,406,123.
- Using PGR-Standard partition (probably not supposed to). Got GeForce RTX 2080 Ti which is 11GB. Much better!
- P100s are 12GB or 16GB but I presume DeepMind have the best, so 16GB (which is conservative anyway)
- Seems like these 2080s could handle the quarter-size model.
- We should use a smaller validation set. It seems to be a big bottleneck at this scale.
- Training 1000 steps is taking 5-6 minutes, while validation is taking ~50 minutes!
- Resuming from checkpoint 3000, 66667-size validation set: 794676
Reviewing extrapolation baseline
- Early stopping at 14000 (best perplexity at 11000)
- Is this best? We should train it longer with more patience, just to see
- Running step-11000 predictions on truncated validation set (66667 samples)
- We should also run on inference and extrapolation, and then compare to DeepMind
- Validation average sentence BLEU: 28.45%
- Validation binary accuracy: 23.21%
- Validation corpus BLEU: 59.81%
- Why so different?
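One plausible reason (an assumption on my part, since our metrics.py may compute these differently): averaged sentence BLEU penalises every short answer that lacks higher-order n-gram matches, while corpus BLEU pools n-gram counts over the whole file before taking the ratio, so it is far less harsh on short character-level answers. A minimal sketch with NLTK:

from nltk.translate.bleu_score import sentence_bleu, corpus_bleu, SmoothingFunction

refs = [list("8*d**3 - 70*d")]  # character-tokenised reference answer
hyp = list("8*d**3 - 7*d")      # character-tokenised prediction

smooth = SmoothingFunction().method1
print(sentence_bleu(refs, hyp, smoothing_function=smooth))    # scored per sentence, then averaged over the set
print(corpus_bleu([refs], [hyp], smoothing_function=smooth))  # n-gram counts pooled over the corpus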
Extrapolation set
- 15 modules
- 20,000 samples per module
- 300,000 samples total
Resume extrapolation baseline
- 802891
- 803054
Reading SATNet paper
- Main contribution: a MAXSAT layer
- MAXSAT is a generalisation of SAT: find the maximum number of clauses you can make true by some assignment to the variables. SAT is all the clauses.
- Differentiable
- Input: vector of bits or probabilities
- Transformation: MAXSAT SDP relaxation
- Output: vector of bits or probabilities
- Input of probabilities means that MAXSAT layer can interface with softmax
- They use this to combine a ConvNet with a SATNet to learn Sudoku
Each cell-wise probabilistic output of this convolutional layer is then fed as logical input to the SATNet layer, along with an input mask
- Does this mean every probability at each cell? So 9x9x10 probabilities in total? Or the maximum probability at each cell, giving 9x9 probabilities in total?
- The Transformer uses softmax
- In self and context attention
- In final generator (LogSoftmax)
- It probably does not make theoretical sense to combine the Transformer softmax with the intended functionality of MAXSAT layer. You would want to frame it as a constraint satisfaction problem somehow.
Baseline progress
- Resumed with ppl early stopping, patience of 9, and saving 10 checkpoints: 803054
- I'm assuming it does not save the best model no matter what - you have to ensure the checkpoint range exceeds the patience. But this could be wrong.
- Stopped at 74000 steps
- Best at 65000 steps: accuracy 81.95% ppl 2.273
- Accuracy is still improving: 82.03% at step 74000. Really not sure which metric is better to use for early stopping.
- Resuming with accuracy early stopping, patience of 9: 809323
Scaling laws for neural language models https://arxiv.org/pdf/2001.08361.pdf
- Performance penalty for varying model size (N parameters) relative to dataset size (D samples) is predictable: N^0.74/D.
- Saxton: 56 modules, 30M parameters
- Us: 15 modules => ~5M parameters to get no comparative penalty. This is much more than the ~1.4M we used last. Suggests we should either reduce data or increase model size, but the latter probably isn't feasible since we were hitting memory limits.
- Us: 1.4M parameters => 5% of the full dataset. 15 modules would be ~27%, so need to cut that down by a factor of ~5.4. Either limit to a single difficulty (1/3) or cut math topics or cut the samples per topic. Perhaps a mix of all three is best.
Running evaluation on higher-patience baseline
- Last time: resuming with accuracy early stopping, patience of 9: 809323
- Modified the inference interpolation script for my details
- Checkpoint 100000
- Interpolation: 827879, 827880, 827881, 827882, 827886, 827898
- Extrapolation: 828140
- In saved prediction directory: `for f in *.tar.gz; do tar xzvf $f --strip=6; done`
- Wrote a script to run `metrics.py` on all the prediction files in one go: `run_metrics.sh`
- Results compared to the 14,000 checkpoint
- Interpolation
- Accuracy: 0.21 to 0.44 (0.23)
- Sentence BLEU: 0.27 to 0.32 (0.05)
- Corpus BLEU: 0.34 to 0.41 (0.07)
- Extrapolation
- Accuracy: 0.10 to 0.26 (0.16)
- Sentence BLEU: 0.22 to 0.27 (0.05)
- Corpus BLEU: 0.27 to 0.33 (0.06)
- Big improvement
- The int-exp accuracy ratio increased: 48% to 59%
- But absolute gap increased: 0.11 to 0.18
- Which is the more relevant metric?
- Disproportionate improvement in binary accuracy relative to BLEU
Informatics VPN (for transferring to and from cluster)
- Start root terminal:
sudo -i
- Start OpenVPN:
openvpn --config /home/ben/scripts/Informatics-InfNets-AT.ovpn
- Transfer to local: `bash scripts/transfer_data_mlp_to_local.sh -s s1000116 -m /home/s1000116/experiments/extrapolation_baseline/results.txt -l /home/ben/projects/mlp-project`
Updated thoughts on LM scaling laws
- Scaling Laws for Neural Language Models is not specific to our topic, but provides several useful heuristics to guide training of models like ours. For example, the penalty for mismatching model parameters N with dataset size D is predictable as ~N^0.74/D.
- Assuming Saxton et al. (our seed paper) as a gold standard, N=30M and D=112M examples. We are using a ~1.4M parameter model - about the biggest we can go on the cluster. So the heuristic suggests we train on ~12M examples to get comparable performance. We have instead trained on 30M examples.
- Some of the modules achieved virtually 0 accuracy on our baseline, so we could cut these from the dataset as one way of approaching 12M. On the other hand, it is not essential for our research question to get better absolute performance - we are just interested in improving performance relative to our baseline. Leaving the baseline as-is would save time.
- NMT-GAN
Investigating invariant risk minimisation
- Downloaded the code, running CMNIST experiment
- IRM is less complicated than I thought - basically just add a penalty as the norm of the gradients.
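My reading of the IRMv1 penalty in that code, as a sketch (the real CMNIST script differs in details such as device handling and how environments are batched):

import torch
import torch.nn.functional as F

def irm_penalty(logits, y):
    # Gradient of the per-environment risk w.r.t. a dummy scale on the logits,
    # squared and summed - the "norm of the gradients" penalty noted above.
    scale = torch.ones(1, requires_grad=True)
    loss = F.binary_cross_entropy_with_logits(logits * scale, y)
    grad = torch.autograd.grad(loss, [scale], create_graph=True)[0]
    return (grad ** 2).sum()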
Understanding S4.2.1 of risk extrapolation paper
- Key point is that the scale of the penalty needs to be scaled as a function of training time. Specifically, it should be scaled up (like a step function) when the model begins to overfit. At least that is the claim in this paper, because this coincides with peak performance on CMNIST.
- Overfitting is considered as when the gap between training and validation performance begins to increase significantly
- Waterfall schedule
- Desjardins et al. (2015)
In both cases, learning rates were decreased using a "waterfall" annealing schedule, which divided the learning rate by 10 when the validation error failed to improve after a set number of evaluations.
- This paper:
increasing the relative weight of the penalty term after 100 epochs of training (using a so-called "waterfall" schedule (Desjardins et al., 2015)) is critically important to performance on the colored MNIST task
- Desjardins et al. (2015)
- In IRM, this is where they apply the step-change in scale:
penalty_weight = (flags.penalty_weight if step >= flags.penalty_anneal_iters else 1.0)
loss += penalty_weight * train_penalty
- `flags.penalty_weight` is 91257 and `flags.penalty_anneal_iters` is 190 (epochs)
- To be clear: the penalty term is weighted by 1 for the first 100 epochs, then weighted by 10000 for the rest of training
- The problem for us is, the train-valid gap is present from the beginning. The gap in perplexity (proportional to loss) actually starts higher at nearly 2.0, then decreases and stabilises at roughly 1.0 for most of training. Meanwhile, the accuracy gap starts at roughly 2%, increases, and flattens out at roughly 8-9%. All the while, validation accuracy improves on average throughout the 100 epochs.
- This seems like a qualitatively different regime to the CMNIST domain. By one interpretation, since there is always a generalisation gap, we could argue that the strong penalty be applied from the beginning. By another interpretation, we haven't reached the overfitting regime yet. And by a third interpretation this won't work at all...
- We could test the second interpretation by running from checkpoint 100 to, say, 200. If we find that it does start to overfit in a qualitatively different way, we may have to train our models longer...
- Overfitting in a qualitatively different way means: either validation accuracy decreases significantly, or training accuracy increases while validation accuracy stays the same.
Trying V-REx in original CMNIST code
- Simply set `train_penalty` to the variance of the loss, i.e. `torch.stack([envs[0]['nll'], envs[1]['nll']]).var()`
- This will not necessarily work well out of the box - it depends on sensitivity to hyperparameters
- It works! Results for one trial:
IRM (ours): Flags: grayscale_model: False hidden_dim: 390 l2_regularizer_weight: 0.00110794568 lr: 0.0004898536566546834 n_restarts: 1 penalty_anneal_iters: 190 penalty_weight: 91257.18613115903 steps: 501 Restart 0 step train nll train acc train penalty test acc 0 0.67671 0.53322 4.83176e-06 0.48310 100 0.38461 0.85098 0.00921 0.10160 200 0.88005 0.46266 0.00216 0.82160 300 0.60410 0.68430 2.66043e-10 0.69870 400 0.60134 0.68728 8.43274e-10 0.69840 500 0.59862 0.69084 8.92941e-10 0.69990 Final train acc (mean/std across restarts so far): 0.69084 0.0 Final test acc (mean/std across restarts so far): 0.6999 0.0
Running baseline to test overfitting
- 100k more steps (i.e. to 200k total)
- 830755
Exploratory data analysis
- Within each question, measure average float order of magnitude, minimum, or maximum?
- Arithmetically, the minimum indicates how many digits need to be added together (give or take a carry)
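A small sketch of that measurement (the regex and the log10 convention are my assumptions, not a fixed decision):

import math
import re

def float_magnitudes(question):
    # Order of magnitude (base-10 log) of every number appearing in a question string.
    nums = [abs(float(x)) for x in re.findall(r"-?\d+\.?\d*", question)]
    return [math.log10(n) for n in nums if n > 0]

mags = float_magnitudes("What is 814530 divided by 0.25?")
print(min(mags), max(mags), sum(mags) / len(mags))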
ONMT mod
- There is an `_accum_batches()` function where batches are formed
- Batches are currently a list. Modify this to be a list of lists: `[[batch1_dataset1, batch1_dataset2, batch1_dataset3], ...]`
- Model is executed and loss computed in `_gradient_accumulation()`
- TODO: work out how to implement the loss part
- Ashwani will work out how to go from dataset to batches
batches = [[batch1_dataset1, batch1_dataset2, batch1_dataset3], ...]
for batch in batches:
    losses = []
    for dataset_batch in batch:
        output = model(dataset_batch)
        standard_loss = standard_loss_fn(output, target)
        penalty_loss = penalty_loss_fn(output, target)  # IRM
        loss = standard_loss + beta * penalty_loss
        losses.append(loss)
    avg_loss = sum(losses) / len(losses)
    avg_loss.backward()
batches = [[batch1_dataset1, batch1_dataset2, batch1_dataset3], ...]
for batch in batches:
    losses = []
    for dataset_batch in batch:
        output = model(dataset_batch)
        standard_loss = standard_loss_fn(output, target)
        loss = standard_loss
        losses.append(loss)
    losses = torch.stack(losses)  # keep as a tensor so gradients flow (np.array would break autograd)
    var_loss = losses.var()
    avg_loss = losses.mean()
    total_loss = avg_loss + beta * var_loss
    total_loss.backward()
- Training loop: `onmt.trainer.train`
- Iterates batches from `Trainer._accum_batches(train_iter)`
- This divvies up batches into bags (I think this is a `list` of `torchtext.data.Batch`)
- Batches are passed to `Trainer._gradient_accumulation(...)`
- Batches are iterated
- L364-5: `outputs, attns = self.model(src, tgt, src_lengths, bptt=bptt, with_align=self.with_align)`
- Loss function `Trainer.train_loss` is an argument: `onmt.utils.loss.LossComputeBase`
- This is implemented as `onmt.utils.loss.NMTLossCompute`
- This is passed the `criterion` argument, which is the actual loss function
- `criterion` is specified in `build_loss_compute()`
- We are currently using a positive label smoothing parameter (0.1), so it selects `LabelSmoothingLoss` (L39)
- `LabelSmoothingLoss.forward` is key
Creating `IRMLoss` class
- Subclassing `LabelSmoothingLoss`
- Need to use raw logits to compute the penalty: `use_raw_logits = isinstance(criterion, (SparsemaxLoss, IRMLoss))`
- But `LabelSmoothingLoss` uses probabilities - so how do I get both?
- Well, we use `model.generator[0]` to get raw logits
- `model.generator` is a `Sequential` (see `model_builder.build_base_model`)
- I think we will need to assume the form of `model.generator` and replicate this manually in `IRMLoss`. That way, we can run the penalty on the logits, and `LabelSmoothingLoss.forward` on the probabilities
- The two conditions for it being standard `LogSoftmax` are `not model_opt.copy_attn` and `not model_opt.generator_function == "sparsemax"`. I am confident these are both true.
- From `opts.py`:
group.add('--copy_attn', '-copy_attn', action="store_true",
          help='Train copy attention layer.')
...
group.add('--generator_function', '-generator_function', default="softmax",
          choices=["softmax", "sparsemax"],
          help="Which function to use for generating "
               "probabilities over the target vocabulary (choices: "
               "softmax, sparsemax)")
- Can't find anywhere these arguments are set by force
- Source uses weight decay, not sure if it's important for us
- Can do this manually at `optimizers.py` L53 (`torch.optim.Adam` takes a `weight_decay` arg)
- OK, so this is a separate issue from implementing the IRM loss
- Hang on, maybe we don't need logits
- It's unclear. Phi is just a "data representation". It could be logits or probabilities.
- However, they use logits in their implementation. And the key thing is that we compute the gradient of the loss with respect to the classifier `w`. The classifier is just a scalar, but it scales the logits, not the probabilities. That is important for the gradient computation, I think.
- The conservative assumption is to keep using logits.
- Need a way of knowing the current training step, to schedule the penalty weight
- Option 1: accumulate an integer internally to the loss class
- Easy
- Not robust: loss may get called outside of the training progression, which would make the accumulator invalid.
- But this should be OK. `Trainer.train_loss` is separate from `Trainer.valid_loss`, which builds the loss with `train=False` and thus `valid_loss` will just be NLL. `Trainer.train_loss` is exclusively called in the standard training loop.
- Still an issue: `train_loss` is called for each batch, for each timestep.
- Hack: `self.train_loss.criterion.step += 1` every time `_gradient_accumulation` is called.
Loss in main training loop
- There seems to be a way to avoid modifying code: use the `--accum_count` command line argument.
group.add('--accum_count', '-accum_count', type=int, nargs='+', default=[1],
          help="Accumulate gradient this many times. "
               "Approximately equivalent to updating "
               "batch_size * accum_count batches at once. "
               "Recommended for Transformer.")
- This will sum the gradients over `accum_count` number of batches.
- As of now, the multi-dataset functionality is an alternating yield: `[batch_d1_1, batch_d2_1, batch_d3_1, batch_d1_2, batch_d2_2, batch_d3_2, ...]`
- By using `accum_count` (e.g. 3), `_accum_batches` will group this into lists: `[[batch_d1_1, batch_d2_1, batch_d3_1], ...]`
- One of these lists is passed to `_gradient_accumulation`. The loss for each batch (for each timestep) is computed, and gradients are accumulated by `loss.backward()` (this is only computing the gradients, not updating the parameters). Once all batches in the list are done, the parameters are updated.
- Therefore this is equivalent to averaging the loss over the different datasets (give or take a constant averaging factor; see the toy check below).
Command line arguments
- Which `opts` are passed to `build_loss_compute`?
- I think it's `opts.config_opts(parser), opts.model_opts(parser), opts.train_opts(parser)` (from `bin.train._get_parser`)
- Includes `data_ids` and `accum_count` in `train_opts`
Testing
- Start with a basic input-output test to check for bugs
- OK
- CMNIST replication
- Recording logits, targets, penalties, and loss for
- Step 50 (before penalty weight is increased)
- Step 100 (when penalty weight is increased)
- Need to temporarily replace `base_loss` with the CMNIST one: `binary_cross_entropy_with_logits`
- Running
- Penalties match perfectly for step 50 env 0, step 50 env 1, step 100 env 0
- Penalty mismatch for step 100 env 1: `6.6927e-05`
- Loss mismatch
- Removed weight decay, redoing values
- Running
- Complete match for step 50
- Complete match for step 100, except the least (5th) significant digit on loss differs by 1
- I think we can safely take this as rounding error
- Output
step 0 env 0 true penalty tensor(1.4635e-08) actual penalty tensor(1.4635e-08, grad_fn=<SumBackward0>) env loss tensor(5.6895e-06, grad_fn=<DivBackward0>) env 1 true penalty tensor(5.2920e-08) actual penalty tensor(5.2920e-08, grad_fn=<SumBackward0>) env loss tensor(1.1093e-05, grad_fn=<DivBackward0>) true loss tensor(1.6782e-05) tensor(1.6782e-05, grad_fn=<AddBackward0>) step 1 env 0 true penalty tensor(1.4924e-09) actual penalty tensor(1.4924e-09, grad_fn=<SumBackward0>) env loss tensor(9.0516e-10, grad_fn=<DivBackward0>) env 1 true penalty tensor(1.3743e-08) actual penalty tensor(1.3743e-08, grad_fn=<SumBackward0>) env loss tensor(7.3900e-09, grad_fn=<DivBackward0>) true loss tensor(8.2952e-09) tensor(8.2951e-09, grad_fn=<AddBackward0>)
Setting up test for IRM
- Data
- Single module
- OK but not easy for interpolation
- Hard but not hopeless for extrapolation
- Involves easily measurable extrapolated feature
-
arithmetic__div
(0.68) ->arithmetic__div_big
(0.53) -
arithmetic__mul
(0.47) ->arithmetic__mul_big
(0.32) -
comparison__closest
(0.57) ->comparison__closest_more
(0.30)- Larger gap
-
comparison__sort
(0.98) ->comparison__sort_more
(0.48)- Huge gap!
- Choosing this
- 3 difficulties: easy, medium, hard (E, M, H)
-
.pt
files separated by difficulty:data.train.E.0.pt
,data.train.M.0.pt
, ...
- Single module
- Model
- Small
- Using heuristic: N^0.74 / D
- Saxton et al.: NG=30M, DG=112M -> RG = NG^0.74 / DG = 0.110624598
- Our baseline: NB=1.4M, DB=30M -> RB = NB^0.74 / DB = 0.042757617
- This to Saxton: D=6M -> N = (D * RG)^(1/0.74) = 575K
- This to our baseline: D=6M -> N = (D * RB)^(1/0.74) = 159K
- Aiming for generalisation to our baseline, so aiming for 159K.
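A quick script reproducing these numbers (N and D in millions, following the heuristic above):

def matched_params(d_millions, ratio, exponent=0.74):
    # N such that N**exponent / D equals the reference ratio
    return (d_millions * ratio) ** (1 / exponent)

r_saxton = 30 ** 0.74 / 112      # ~0.1106
r_baseline = 1.4 ** 0.74 / 30    # ~0.0428

print(matched_params(6, r_saxton))    # ~0.58 -> ~575K parameters
print(matched_params(6, r_baseline))  # ~0.16 -> ~159K parameters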
- Splitting dataset
- Modified
merge_for_processing.py
to save filenames as:os.path.join(output_folder, 'merged_'+task+'_'+f_type+'.txt')
e.g.merged_comparison__sort_src_test.txt
- My merge command for this dataset:
python scripts/merge_for_processing.py -i comparison__sort -f ./data -o ./data/train-merged
- Modified
- Preprocessing data
- Modified
config-preprocess.yml
to be a template taking{{task}}
as a variable. Use merged file for validation since that requires a single file. - Modified
preprocess.sh
to useconfig-preprocess.yml
, filling in template, and copying the filled version asconfig-preprocess.{{task}}.yml
, e.g.config-preprocess.comparison_sort.yml
- Command:
bash scripts/preprocess.sh comparison__sort
- Modified
Running test for IRM
- Parameters
- 238163
word_vec_size: 64
rnn_size: 64
layers: 2
transformer_ff: 256
heads: 4
- 91443
word_vec_size: 32
rnn_size: 32
layers: 3
transformer_ff: 128
heads: 4
- 201667
word_vec_size: 48
rnn_size: 48
layers: 3
transformer_ff: 192
heads: 4
- Close enough
- 238163
- It has loaded [E, M, H] datasets before the first step
- Laptop is freezing up
- Process killed, probably due to excessive memory
- One of the train
.pt
files is about 90MB - Confirmed that it reaches 100% RAM once all datasets are loaded
- If I close all programs (except vscode and system monitor) it uses about 80%
- One of the train
- It prints out that it loads a batch from E, M, H each step (I am printing out every training step)
- Does this mean the examples per step are actually 3*1024? This is important so we are sure to run the same number of steps as the baseline.
- I think it is 3*1024. It is printing out the batch size attribute of each batch.
- So should we set batch size to 341? Keep it at 1024 for now
- Keeping at 1024 means we ought to train only for 33300 steps though
- Validation batch size of 32 means this new printout is excessive
- Setting
valid_batch_size
andmax_generator_batches
to 1024 - With 200K examples, there are about 195 batches to get through. Therefore validating every 1000 steps seems reasonable.
- Setting
- It's choosing
LabelSmoothingLoss
. This means all conditions are met for that but not IRM.- Oh, of course. It checks for
LabelSmoothingLoss
conditions first. So I need to put IRMLoss within.
- Oh, of course. It checks for
- IRMLoss now active
- xent is now huge (~1e3 initially) but decreasing
- Perplexity overflows because it is the exponential of xent
- It would be good if we separated out the base loss and penalty for logging purposes
- Accuracy is increasing which is a good sign
- Nevermind, acc started decreasing at step 8. I suspect this is due to the penalty.
- This isn't necessarily bad - in the proper experiment we should only apply this late in training
- Validation perplexity and accuracy are better because it doesn't use IRM
- Accuracy at step 3, 6, 9: 13.7, 33.2, 25.2
Full validation set
- Suppose we want validation to take 20% of the interval spent training
- For the full dataset, with the full 10%, there would be 2930 batches to get through
- So we would have to validate every 15K steps, which is too sparse
- If we want to validate every 1000 steps, we should have 200 batches in validation
- If we want to validate every 5000 steps, we should have 1000 batches in validation
Running baseline for IRM test
- 10+ minutes per 100 steps on laptop
- Moving to cluster
- JOBID=832983
- PGR-Standard
- I forgot to rebuild OpenNMT on the cluster. This is OK for the baseline, because we are still using `accum_steps: 3`, but we must build before IRM.
Reviewing baseline progress
- Cancelled due to time limit
- Reached step 6600/33300
- Validation accuracy at 6000: 98.468
- The main bottleneck is loading data. From step 6100 to 6200, loading data takes ~37 minutes.
- Would splitting into smaller shards help?
- It gets through training steps in bursts: from 6200 to 6600 in 9 minutes. About 140 seconds per 100 steps on average (PGR-Standard).
- Oh, just realised that 33300 steps doesn't make sense for this. The batch size is fixed, and the dataset is much smaller, so we shouldn't be running 100K batches (which is what 33.3K corresponds to for 3 corpora).
- 15 modules, 100000 steps
- 1 module, 6700 steps. So we basically did enough! But accuracy is still improving. So 10000 seems reasonable to try
- We should also reduce patience to 3 given the batches per step is tripled.
Modifying baseline configuration
- Updating config file: patience 9->3, checkpoints 10->4, train steps 33300 -> 10000
- Splitting data smaller
- Try 1/3 of the default shard size, to fit with the 3 datasets
- 1000000 -> 333334
- Now 2 shards per difficulty
- Syncing project-dir
- Actually I don't think the old OpenNMT is quite OK for the baseline, because it does not implement the batch-by-batch alternation of the datasets. Instead it is within-batch (I think). This probably doesn't make much difference, but we should rerun for the sake of control. I'm pretty sure the amount of data per step is the same though, due to gradient accumulation.
- Installing OpenNMT-py fork
Rerun of baseline (from beginning)
- JOBID=833134
- Everything seems to be in order
- Initial load of data was much faster - about 2.5 minutes compared to 10 minutes
- Expect this to take about 4 hours
Setting up IRM
- Only need to modify config? Also set up a new experiment folder
- Modifying
config-transformer-small-risk.yml
- Uncomment risk arguments
-
risk_anneal_steps: 5000
(half oftrain_steps
) -
risk_penalty_weight: 10000.0
(no idea if this is suitable, just going by default in CMNIST code) - I think for now, 4 HP settings will suit:
- steps: 5000, 8000 (baseline was still not overfitting or converged at 6600)
- weight: 10000.0, 100.0 (minimum considered in CMNIST HP sweep was 1e2, max 1e6)
Running IRM
- Hmm, the experiment files get transferred back as `project-dir.saved.tar.gz`. How can we rename or direct this when running multiple experiments?
- Just change `PROJECT_FOLDER` in `scripts/run-model.sh`. It will use this in the save name as well.
- steps=5000, weight=10000.0 : JOBID=833140
- steps=5000, weight=100.0 :
- steps=8000, weight=10000.0 : JOBID=833159
- JOBID=833158, JOBID=833157, JOBID=833155
- Accuracy is poor (15-20%) and getting worse. We need to wait longer, but this could mean either it doesn't work or we need to start with the penalty much lower, or zero.
- Loss has started to increase at 500
- Loss decrease again at 700, but accuracy still getting worse
- No improvement after 4000 steps, final validation accuracy 5.6%
- steps=8000, weight=100.0 : JOBID=833160
- Error from OpenNMT saying address already in use. I'm guessing this is due to the baseline. But does this only happen on the same machine?
- Probably. Both this and 8000-10000 were running on damnii11. 8000-10000 is still going.
Comparing typical magnitudes of penalty and base loss in CMNIST example
- Gives an idea of whether our initial penalty weight should be less than 1.0
- We are recording the base loss and scaled penalty, before overall scaling, for one environment
- Ours
- Tiny test at step 1:
```
[2020-03-06 23:17:10,993 INFO] Latest base loss: 175480336.0
[2020-03-06 23:17:10,993 INFO] Latest penalty: 526372640.0
```
- Tiny test at step 5:
```
[2020-03-06 23:17:25,258 INFO] Latest base loss: 14517791.0
[2020-03-06 23:17:25,259 INFO] Latest penalty: 43507492.0
```
- Same order of magnitude
- Penalty is about 3x larger
- This relationship extends to the penalty weight increase, steps 6-10 (30,000x rather than 3x)
- I think onmt does not normalise raw loss by number of words until LossCompute and print-out:
```python
# statistics.py
def update(...):
    ...
    self.loss += stat.loss
    ...

def xent(self):
    """ compute cross entropy """
    return self.loss / self.n_words
```
- Currently `normalization: sents`, so the actual loss is normalized by batch size. However, the print-out for `xent` seems to be invariably normalized by the total number of tokens in the batch that are not padding.
- How many chars per sentence on average? `wc` for medium is 599999 lines, 24027660 words. Space-separated, so half that is 12013830. That gives 20.023083372 tokens per line. There are 1024 lines per batch, giving 20503.637372729 tokens per batch. Normalization accumulates over datasets, so multiply by 3, giving 61510.912118187 tokens for normalization.
- Tiny step 5 would have had total loss 58025283 if we assume the same loss per partition (not a great assumption). Normalizing by the above gives 943, but `xent: 196.03`. Same order of magnitude, but the loss would have to differ a lot between datasets. Plausible, since that loss seems to be for the hard dataset (the last one loaded).
- I'm suspicious of the ~3x relationship between base loss and penalty.
- LabelSmoothingLoss uses a `sum` reduction for KL-divergence
- CMNIST
- `mean_nll` uses `binary_cross_entropy_with_logits`, which automatically takes the mean. There are 25000 examples per dataset, so this is a large normalization factor.
- `penalty` uses the same `mean_nll` function, so it is also based on this normalized loss (a sketch of this penalty is at the end of this list).
- First few steps:

```
Base loss: 6.7325e-01  Penalty : 2.3887e-04
Base loss: 6.6687e-01  Penalty : 4.7711e-04
Base loss: 7.2201e-01  Penalty : 1.0817e-03
0 0.67006 0.58408 0.00036 0.41900
Base loss: 6.0852e-01  Penalty : 4.4062e-03
Base loss: 5.7582e-01  Penalty : 9.8305e-03
Base loss: 8.3714e-01  Penalty : 2.5737e-02
Base loss: 5.6095e-01  Penalty : 6.8704e-03
Base loss: 5.0142e-01  Penalty : 2.0315e-02
Base loss: 9.7255e-01  Penalty : 1.0545e-01
Base loss: 5.2897e-01  Penalty : 4.5915e-03
Base loss: 4.4080e-01  Penalty : 2.4361e-02
Base loss: 1.1352e+00  Penalty : 2.8289e-01
Base loss: 5.1328e-01  Penalty : 5.1451e-04
Base loss: 3.9398e-01  Penalty : 2.0211e-02
Base loss: 1.3303e+00  Penalty : 6.1654e-01
Base loss: 5.1454e-01  Penalty : 2.2197e-03
Base loss: 3.6207e-01  Penalty : 1.1126e-02
Base loss: 1.5557e+00  Penalty : 1.1603e+00
Base loss: 5.2920e-01  Penalty : 1.5913e-02
Base loss: 3.4508e-01  Penalty : 3.3535e-03
Base loss: 1.7844e+00  Penalty : 1.8739e+00
```

- First few steps after penalty jump:

```
4.4544e-01 4.4479e-01 4.4416e-01 3.0675e-01 3.0671e-01 3.0673e-01
Base loss: 4.4544e-01  Penalty : 2.2744e+01
Base loss: 3.0675e-01  Penalty : 6.1256e+01
Base loss: 1.4705e+00  Penalty : 9.5397e+03
100 0.37610 0.85030 0.00420 0.10000
Base loss: 4.4479e-01  Penalty : 2.1740e+01
Base loss: 3.0671e-01  Penalty : 6.1806e+01
Base loss: 1.4663e+00  Penalty : 9.4390e+03
Base loss: 4.4416e-01  Penalty : 2.0874e+01
Base loss: 3.0673e-01  Penalty : 6.2090e+01
Base loss: 1.4621e+00  Penalty : 9.3380e+03
```

- Initially: penalty ~1-3 orders of magnitude lower
- After penalty jump: penalty ~2 orders of magnitude higher
- Should we have been normalizing loss by tokens?
- I think we figured `batch_type` and `normalization` went together. `batch_type: sents` makes sense (default). But now that I understand length normalization, maybe `normalization: tokens` would be better. Anyway, water under the bridge.
- What's the difference when you normalise the loss inside the penalty vs. outside?
- A factor of N, where N is the normalisation factor. Because the gradients are squared, one factor of N remains hanging. (See the numeric check at the end of this list.)
- This suggests that (unless we change the normalization to match) we should set our penalty weight to `1/batch_size` times their penalty weight, for all training. So the initial weight would be `1/batch_size` and the final weight would be `10000/batch_size`.
- Example: tiny step 5
- Base 1.4517791e07, penalty 4.3507492e07, total 5.8025283e07
- If initial penalty weight is 1e-3, this becomes: base 1.4517791e07, penalty 4.3507492e04, total 1.4561298e07. Penalty has less than 1% effect on loss. This seems reasonable.
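For concreteness, the CMNIST-style penalty being discussed looks roughly like this (a sketch from memory of the IRM reference code, lightly renamed; not our ONMT implementation):

```python
import torch
from torch import autograd, nn

def mean_nll(logits, y):
    # reduction defaults to 'mean', so the risk is already normalised by the
    # number of examples in the environment (25000 per dataset in CMNIST).
    return nn.functional.binary_cross_entropy_with_logits(logits, y)

def irm_penalty(logits, y):
    # Squared gradient of the normalised risk w.r.t. a dummy scale of 1.0.
    scale = torch.tensor(1.0, requires_grad=True)
    loss = mean_nll(logits * scale, y)
    grad = autograd.grad(loss, [scale], create_graph=True)[0]
    return torch.sum(grad ** 2)

# Toy usage: one "environment" of 8 examples
logits = torch.randn(8, 1)
labels = torch.randint(0, 2, (8, 1)).float()
print(mean_nll(logits, labels).item(), irm_penalty(logits, labels).item())
```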
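And a toy numeric check of the factor-of-N point (a hypothetical stand-in loss, nothing to do with the real model):

```python
import torch

# Normalising the loss by N *inside* the penalty (before taking the gradient)
# leaves the squared-gradient penalty a factor of N smaller than normalising
# the already-squared penalty outside.
w = torch.tensor([2.0], requires_grad=True)
loss = (3.0 * w).sum()      # stand-in unnormalised loss
N = 1024.0                  # stand-in batch-size normalisation

g_in = torch.autograd.grad(loss / N, w, create_graph=True)[0]
penalty_in = (g_in ** 2).sum()          # grad of (loss / N), then squared

g_raw = torch.autograd.grad(loss, w, create_graph=True)[0]
penalty_out = (g_raw ** 2).sum() / N    # squared grad, then divided by N

print((penalty_out / penalty_in).item())  # == N
```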
Updating loss implementation to have lower initial penalty
- Allow the `risk_penalty_weight` argument to be a list, so we can specify initial and final values.
- Testing
- Doesn't match what I expect. Penalty is still higher or on-par with base for the first 5 steps.
- Penalty decreases relative to base, and overall. If I move the anneal step to 11, then the penalty is down to ~10^1-10^3 while the base loss remains about 51054
- Yet overall, loss is down 3 orders of magnitude.
- It could be the feedback loop of updating the parameters that causes this difference.
- I can't see how base loss could be downscaled directly, therefore it seems to be the feedback loop.
- So if we account for the normalisation, the orders of magnitude are more in line with CMNIST now.
- But wait. Should we be going by our batch size or the CMNIST batch size to get a reasonable penalty weight? I think it should be CMNIST, which is 25000. That means it's more like `4e-5` and `4e-1` for the weights! Or if we go with their best penalty weight of ~90000, it's about `4e0`.
- But note that I recorded the values above with weight 10000.
- In a 20-step tiny experiment, the validation accuracy after the penalty jump actually increased, from 33 to 35%. That's nice, though not an indication the method is working as desired.
- Let's compromise and round, so: [1e-4, 1e0]
- After the penalty weight is increased, (at least for the 10 steps I am running), the loss fluctuates quite wildly. Normally this would suggest the learning rate is too high. CMNIST results above don't fluctuate so much, though env 1 goes down and env 2 goes up. What to do? The REx paper did say they dropped the learning rate at the same time as increasing the penalty weight.
- Penalty goes down about 5 orders of magnitude in CMNIST from start of jump, to end
- REx paper says "learning rate is simultaneously decreased proportionally [to penalty weight increase]"
- Decreasing proportionally...that's a big decrease!
- Hang on though...this could just be referring to the normalisation when the penalty weight is greater than 1. That's equivalent to decreasing the learning rate proportionally.
- I tried decreasing the learning rate proportionally, and it still fluctuates a lot. Understandably, accuracy takes less of a hit. But I'm not sure this is what we want.
Running IRM test
- Penalty weight [0.0001, 1.0]
- JOBID=833276
- xent is never crazy. But it still fluctuates, relative to the magnitude ~1e-1
- Accuracy decreases but not a lot. As we know, small differences in token accuracy can transfer to large differences in binary accuracy.
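- To make that last point concrete (a back-of-the-envelope illustration assuming independent token errors, not a measured relationship): with per-token accuracy p and answers of about k tokens that must all be correct, exact-match accuracy is roughly p^k, e.g. 0.98^10 ≈ 0.82.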
Checking small IRM experiment
- Seems to be in order
- Still have the varying xent issue
- Could it be that the base loss is the main thing moving around while the penalty is decreasing? Let's check the CMNIST example.
Running inference for small IRM experiment
- Command
```bash
onmt_translate \
    -model /home/ben/projects/mlp-project/backup/risk_test/risk_test_baseline/experiments/risk_test_baseline/model/model_step_8000.pt \
    -src dataset/interpolate-split/comparison__sort_src_test.txt \
    -output experiments/risk_test_baseline/pred_interpolation_comparison__sort_8000.txt \
    -replace_unk -verbose
```
- No smoothing on BLEU
- Selecting checkpoint for IRM
- Around where the lowest loss was?
- 16-17000 steps seems like the lowest average loss -> step 17000
- But should we go with the best average over the 1000 steps, or the best at the checkpoint steps?
- The top two at 1000x steps are 12000 (0.20) and 19000 (0.18). 19000 seems like too much additional training to ask for.
- Decision: 12000
- Baseline
  - interpolate: Average sentence BLEU: 0.9922, Corpus BLEU: 0.9907, Average accuracy: 0.9137
  - extrapolate: Average sentence BLEU: 0.8947, Corpus BLEU: 0.8966, Average accuracy: 0.2481
- IRM
  - interpolate: Average sentence BLEU: 0.9423, Corpus BLEU: 0.9342, Average accuracy: 0.7055
  - extrapolate: Average sentence BLEU: 0.3740, Corpus BLEU: 0.2783, Average accuracy: 0.0026
- Bugger...
- IRM on extrapolation outputs a lot of single-digit answers. This makes me think of length normalisation.
Post mortem on first IRM experiment
- Things to check
- How the number of numbers in `comparison__sort` varies for easy, medium, hard, interpolate, extrapolate (should have done this first!)
- Whether a different weight penalty improves the result
- The fact that it completely failed on extrapolation suggests that the model is penalised too much
- 1e-2 seems reasonable. If it is too small, there won't be as much degradation of performance.
- It seems OK to keep the initial weight at 1e-4 since CMNIST only varies the final weight, and we get almost as-good results up until the switch.
- The normalization of the loss
- Whether `normalization: tokens` fixes it (currently `sents`)
- How the gradient accumulation works (are my assumptions correct?) (see the quick check after this list)
- For example, is adding the gradients equivalent to adding the losses? I assume so, because the gradient of a sum is the sum of the gradients
- See "Distributive properties" here: https://en.wikipedia.org/wiki/Vector_calculus_identities
- Make sure I understand the distributed nature of this operation
- Things to change
- Print out the base loss and penalty separately after `report_every` steps (currently we just print the total)
- Alternatives
- Risk extrapolation
- This may also work with gradient accumulation - check the equations
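Quick sanity check of the gradient-of-a-sum point above (a toy linear example, not the ONMT code):

```python
import torch

# Accumulating gradients over two mini-batches should equal the gradient of
# the summed loss, because the gradient of a sum is the sum of the gradients.
w = torch.tensor([1.0, -2.0], requires_grad=True)
x1, x2 = torch.tensor([3.0, 1.0]), torch.tensor([0.5, 4.0])

(w * x1).sum().backward()          # first "accumulation" step
(w * x2).sum().backward()          # second step, gradients add up in w.grad
accumulated = w.grad.clone()

w.grad = None
((w * x1).sum() + (w * x2).sum()).backward()   # single combined loss
print(torch.allclose(accumulated, w.grad))      # True
```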
Changing IRM penalty weight
- Changes to config
- Final penalty weight of 0.01
- Save 13 checkpoints (see below)
- If we had saved checkpoint 8000 we would have been able to resume from there, since there is no change in the initial penalty weight. Oh well, maybe next time.
- Modifying ONMT to print base loss and penalty every `report_every` steps
- We should print for each difficulty, which means we need to keep a list (a sketch is below)
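A minimal sketch of what that per-difficulty reporting could look like (hypothetical helper, not the actual ONMT classes):

```python
from collections import defaultdict

class RiskReporter:
    """Hypothetical helper: keep per-difficulty lists, flush every report_every steps."""

    def __init__(self, report_every=100):
        self.report_every = report_every
        self.base_losses = defaultdict(list)
        self.penalties = defaultdict(list)

    def record(self, difficulty, base_loss, penalty):
        self.base_losses[difficulty].append(base_loss)
        self.penalties[difficulty].append(penalty)

    def maybe_report(self, step):
        if step % self.report_every != 0:
            return
        for difficulty in sorted(self.base_losses):
            base, pen = self.base_losses[difficulty], self.penalties[difficulty]
            if not base:
                continue
            print(f"step {step} [{difficulty}] "
                  f"base {sum(base) / len(base):.2f} "
                  f"penalty {sum(pen) / len(pen):.2f}")
            base.clear()
            pen.clear()

# Usage sketch
reporter = RiskReporter(report_every=2)
reporter.record("easy", 2200.0, 80.0)
reporter.record("hard", 5100.0, 160.0)
reporter.maybe_report(step=2)
```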
Reviewing dataset analysis
- Need to figure out how I'd like it to be processed and presented
- I deleted the split data on my laptop so I have just redone that
Testing extra ONMT reporting
- Experiment name: `risk_test_tiny_reporting`
- It all fluctuates, even before penalty increase (though on average, it seems to decrease)
- Trying to reconcile the `xent` value with the loss and penalty
- It can't be a final vs. average issue, because we are reporting at every step
- `xent` is `self.loss / self.n_words`
- `n_words` is output/target words
- I'm going to print out `self.loss` as well so we can see exactly what's going on
- By adding the base losses and weighted penalties, I get 50396. The reported loss is 53675. Close, but more than a rounding error.
- Wait no, the loss is printed after training is reported. So we compare to 78632. That's very different!
- Oh wait, the penalty is already scaled. It just starts off large.
- Ok, it makes sense now.
- Although, I'm not sure why it prints the loss to be added three times each...
- Whoops, I was saving in `risk_test_tiny`
Preparing full dataset
- First: merge the data
- Gah, I merged all the modules. I just need the fifteen.
- Next: preprocess
- Modifying `config-preprocess.yml`
- `bash scripts/preprocess.sh merged`
- It may be that OpenNMT shuffles the data. But what about the shards?
- See here in `onmt.utils.parse.py`:

```python
@classmethod
def validate_preprocess_args(cls, opt):
    ...
    assert opt.shuffle == 0, \
        "-shuffle is not implemented. Please shuffle \
        your data before pre-processing."
```
- Ok, so we need to shuffle:
```bash
paste -d '|' src_train.txt tgt_train.txt | shuf | awk -v FS="|" '{ print $1 > "src_train_shuf.txt" ; print $2 > "tgt_train_shuf.txt" }'
```
- We need to create a reduced validation set
- So we should shuffle to ensure all tasks are represented
- How big? Currently it is 3000015. I think 270k is reasonable - 1% of training.
- Ok, that's done. 270000 samples.
Idea for what causes differences in performance
- Similarity of the question and answer
- Could measure this by passing the answer over the question, counting the characters that match at each position in the question, then average
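A rough sketch of that overlap measure (hypothetical helper; this is one interpretation of "passing the answer over the question", and the exact definition would need tuning):

```python
def answer_question_overlap(question: str, answer: str) -> float:
    # Slide the answer across the question; at each offset count the characters
    # that match position-for-position, then average over all offsets.
    if not question or not answer or len(answer) > len(question):
        return 0.0
    offsets = range(len(question) - len(answer) + 1)
    matches = [
        sum(q == a for q, a in zip(question[i:i + len(answer)], answer))
        for i in offsets
    ]
    return sum(matches) / len(matches)

print(answer_question_overlap("Sort -3, 325, 32 in descending order", "325"))
```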
Reduced penalty experiment
- 1e-2 max
- JOBID=840834
Running full model
- Writing the config file
- Copy the actual baseline config
- One-third `train_steps`
- But we want to train longer
- 120000 -> 40000
- No early stopping
- Penalty weight [0.0001, 0.01]
- Make sure we save enough checkpoints -> 14 = 1 + (40000 - 27000) / 1000
- When to apply the penalty?
- Validation accuracy is pretty much flat after 80000 steps. So I choose that.
- Rounding to 27000 steps for the 1/3 conversion
- Transferred to cluster and project-dir set up
- JOBID=840910
Checking small experiment with smaller weight
- Now it seems like the penalty is too small to influence the learning
- I haven't evaluated it, but it doesn't look promising to me because the penalty fluctuates a lot.
- TODO
- In the IRM CMNIST example, they normalised the loss by the penalty weight. I thought that we didn't have to do this, because our penalty weight doesn't exceed 1.0 at the moment (the condition is still in the code though). But I think this is wrong. If we don't normalise, and we want the penalty to become much more significant than the base loss, the magnitude of the loss should be kept at a similar scale for training stability. I thought "but dividing by penalty weight of 1.0 has no effect". But what we actually need to do is divide by the ratio of the initial and final penalty weight: e.g. 1.0 / 0.0001 = 10000.0
- No, that isn't right either. If the base loss is 3000 and the increased penalty is 1200, we don't want to normalise by 10000.0.
- Earlier application of penalty: maybe the model's representations are already too entrenched by step 8000?
- Resume the baseline to 20000 steps. Even though it looked like it was overfitting, it's possible it could improve again. We would have full confirmation this way.
Byte-pair encoding
- The lua scripts are for the old OpenNMT. For OpenNMT-py, we just have `bpe_pipeline.sh`
- Not sure if input is tokenized already or not
- There is a `.src` file in `data` which looks like a names dataset, space-separated characters. But this is not tokenized, just prepared for tokenization. So I think we give the raw file.
- Need to run split, but with whole word tokenization
- This is the default
- Done
- Setting up `bpe_pipeline.sh` config
Full model experiment
- 23000 steps
- I'm curious what the average base losses are for each dataset
- Average easy base loss: 2238.85
- Average medium base loss: 3491.49
- Average hard base loss: 5013.37
- Note that at any specific training step this ordering does not necessarily hold by any means. It moves around a lot.
- Also, each data point is for one step (every 100 of them).
- 32500 steps
- Like the smaller experiment, penalty jumps around a lot. I don't quite know what to expect but it seems like a bad sign.
- I think a good thing to try is rewind training to where the penalty increases, apply an increased penalty weight (0.1, or 1.0; I'm thinking 1.0 so IRM dominates in a 10-100x way), but scale down the learning rate in proportion. So 0.000006 instead of 0.0006.
- As a hack we can make the 3rd item in the penalty weights option the learning rate factor, e.g. 100, then divide by this quantity
- Implementing option to reduce learning rate
- It will be the last value in the list of `penalty_weight`
- I assume scaling it on the inside is equivalent, but maybe there are subtleties with the gradient and optimizer (e.g. momentum) calculations that make this invalid. I will look at the options
```python
elif opt.start_decay_steps is not None:
    return functools.partial(
        exponential_decay,
        rate=opt.learning_rate_decay,
        decay_steps=opt.decay_steps,
        start_step=opt.start_decay_steps)
```

```python
def exponential_decay(step, rate, decay_steps, start_step=0):
    """A standard exponential decay, scaling the learning rate by
    :obj:`rate` every :obj:`decay_steps` steps.
    """
    return rate ** (max(step - start_step + decay_steps, 0) // decay_steps)
```
- Note it returns the scale to multiply the base learning rate by
- Is this default?
- Yes: `learning_rate_decay=0.5, start_decay_steps=50000, decay_steps=10000`
- Ah crap. This means baseline had LR reduced at 50000, and IRM should have the equivalent 16667, but we didn't specify that.
- On checking the Adam update rule, there is a term that squares the gradient, which would square the inner learning rate. This seems problematic, so I should avoid this method.
- Default schedule adjusted for triple batch
- 16667:2,20000:4,23333:8,26667:16,30000:32,33333:64,36667:128
- Suppose the following schedule: 0.5, 20000, 1000. With 1.0 penalty weight.
- 22000:32,23000:64,24000:128,25000:256,26000:512,27000:1024
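Quick check that the adjusted settings reproduce the triple-batch divisor schedule above (re-using the `exponential_decay` arithmetic quoted earlier; this checks the function, not exactly when ONMT applies the new rate):

```python
def exponential_decay(step, rate, decay_steps, start_step=0):
    return rate ** (max(step - start_step + decay_steps, 0) // decay_steps)

# Assumed adjusted settings: rate=0.5, start_decay_steps=16667, decay_steps=3333
for step in (16667, 20000, 23333, 26667, 30000, 33333, 36667):
    scale = exponential_decay(step, rate=0.5, decay_steps=3333, start_step=16667)
    print(step, int(round(1 / scale)))  # divisors 2, 4, 8, 16, 32, 64, 128
```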
Reviewing dataset stats
- Some significant differences in performance
- The key stat varies between modules (key stat = stat that gives biggest difference in performance between splits across that stat)
- The key stat is sometimes expected, sometimes surprising. For example, number length is key for division (expected). Sentence length appears key for sorting on extrapolation, while number of numbers isn't really valid because it's either 10 or 11. Something seems fishy about that, because sentence length doesn't matter nearly as much in interpolation.
- We could try unifying the interpolation and extrapolation to get a smooth overall picture. But that already assumes the stats we are measuring are responsible for the difference in performance.
- We need to measure all these stats on train-easy, medium, hard to validate their use as a proxy
Length normalization
- It seems like a good idea to use. I wish we had done it for the baseline.
- Some of the modules (in extrapolation) vary in performance just by sentence length. Token normalisation may improve this.
- Do Saxton et al. use length normalization?
- "We minimize the sum of log probabilities of the correct character"
- No active mention of it in the paper
- As far as we know, besides the settings they explicitly specify, they match the original Transformer
- The original Transformer paper makes no mention of it, and I haven't found any reference to it in the tensor2tensor hyperparameter file (`transformer.py`)
- Therefore, we can say with about 90% confidence: no
- The only way we are wrong is if they are not telling the whole truth when they say "sum of log probabilities"
Transferring first full IRM experiment files
- Lowest xent after penalty:
- 0.11 : 30900 (first; occurs many times)
- 0.10: 38200, 39600
- So the final checkpoint of 40000 is a reasonable choice
Learning rate adjustment
- Given the restrictions on standard learning rate scheduling, I've changed my mind and I'm going to keep the rescaling within the loss class.
- Testing rescaling in tiny test again...OK
- Testing learning rate decay config setting...
- `start_decay_steps` set to 5 but it went from 0.0006 to 0.0003 at step 4
- If I set it to 6, it halves at 5. So a consistent 1-behind. Oh well, I'll just work around that.
- Oh goodness, I forgot I wrote a function to modify the learning rate directly!
- Ok, looking good now.
- Now, let's consider the learning rate divisor schedule again:
- 16667:2,20000:4,23333:8,26667:16,30000:32,33333:64,36667:128
- And where the penalty was at with weight 1e-2
- Average: 106 (std 370)
- Side note: although everything fluctuates a lot, on average the hard penalty (163) is about double the easy (74) and medium (81). Easy has higher std (277) than medium (242).
- Ok, so the learning rate will be 16x smaller already, if we follow the original schedule
- Then, if we changed the final penalty weight to be 1e0, that would bring the effective multiplier to 100/16 = 6.25. So then it would be reasonable to add a further scaling factor of 10, reducing the effective multiplier to 0.625.
- Ok, let's write the config.
- `learning_rate_decay`: 0.5 -> 0.5
- `start_decay_steps`: 50000 -> 16667
- `decay_steps`: 10000 -> 3333
- `risk_penalty_weight`: [0.0001, 0.01] -> [0.0001, 1.0, 10.0]
Running full model IRM experiment with learning rate decay
- JOBID=846369
Number segmentation
- Maybe https://github.com/OpenNMT/Tokenizer
- `pip install pyonmttok`
- Seems to work as desired with this setting:
```python
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True, segment_numbers=True)
```
- "aggressive" rather than "conservative" to force numbers to be segmented
```
['Sort', '-■', '3', '■,', '3■', '2■', '5', '■,', '3■', '2', '■,', '4■', '5', '■,', '-■', '1', '■,', '0', 'in', 'descending', 'order']
['Let', '-■', 'm', '■*', '■■', '2', '■/■', '3', '-', '1■', '9■', '9■', '5■', '0■', '1', '■■', 'm', '■/■', '3', '-', '3■', '9■', '8■', '9■', '9■', '8', '■/■', '3', '=', '0', '■.', 'Calculate', 'm', '■.']
['What', 'is', 'prob', 'of', 'picking', '2', 'j', 'and', '2', 'c', 'when', 'four', 'letters', 'picked', 'without', 'replacement', 'from', '{■', 'x', '■:', '1', '■,', 'f', '■:', '1■', '4', '■,', 'j', '■:', '2', '■,', 'c', '■:', '2', '■}', '■?']
['What', 'is', '(■', '8', '-', '(', '■-■', '5', '-', '-■', '1■', '0', '■)', '■)', '+', '-■', '9', '+', '-■', '1', '+', '2■', '3■', '0', '+', '-■', '2■', '4■', '4', '■?']
```
- The special characters above are due to `joiner_annotate=True`. It gives more information about where the symbol is occurring (good), but increases vocab size (bad).
- Remember to shuffle this data too, once merged
Updated full IRM experiment
- Going fine
Number segmentation
- How much would the joiner annotation increase vocab?
- Using `vocab.py` on the full dataset under `dataset/bin/merged`
- Why don't `{` and `}` show up?
- Source (non-alpha): 20
- Target (non-alpha): 18
- Left-side annotation: `?`, `,`, `}`, `)`, `:`, `*`, `.`
- Right-side annotation: `[0-9]`, `-`, `{`, `(`
- Middle annotation: `-`, `/`, `*`, `.`
- `+` and `=` are not annotated because they are always middle
- Total: ~24 additional. Roughly doubles the non-word vocab size.
- How would joiner annotation help?
- Learn different representation for symbol depending on its role relative to adjacent tokens.
- Don't know if this matters much for Transformer, given it has self-attention
- Decision: `joiner_annotate=False`
- `split_dataset.sh`
- `merge_for_preprocessing.sh`
- `shuffle.sh`
  - Modified to use `ns` postfix
- Truncation of validation files to `small` using `head -27000`
  - Now part of `shuffle.sh`
- Update `config-preprocess.yml`
  - Modified to use `ns` postfix
- `preprocess.sh`
  - Modified to title `merged-ns`
- `vocab.py`
- Source: enormous
- Ah crap, it's all those sequences of letters for probability
- Target: 44
- Same as before except `e` is included and `_` is excluded. Maybe because of the different validation truncation? That's its own issue, by the way...
- How can we segment those letter sequences?
- Exploit the fact they are always (?) preceded by "from"
- From the generating code, can confirm this always holds and the match is even longer: "letters picked without replacement from"
- This is the same for both probability modules
- Ok, I've figured out how to find and separate that part (a rough sketch is at the end of this section).
- What about when it's not a stream of letters? It will match the next letter; that's bad.
- So we need to match full stop or question mark then shave it off once matched
- What about for `sequence`, when the event is also a string of letters?
- Always preceded by 'prob of sequence '. So same technique. Except the sequence can be followed by a space as well as a full stop or question mark.
- Testing
- Oops! Need to catch when it doesn't find the prefix. That returns -1 so it works as an index even though you don't want it to.
- Looks like it works a charm now!
- Exploit the fact they are always (?) preceded by "from"
- Rerunning preprocessing...
- Source vocab 279, that's a good sign.
- "Calculat". Ah crap. When the event sequence is "t e" it replaces "Calculate" with "Calculat e".
- 60 cases in easy
- 45 cases in medium
- 68 cases in hard
- 3 cases in valid
- "Wh"=3, "Wha"=204. Another probability mistake.
- "fr"=9, "fro"=379
- "picke"
- "pr", "pro"
- "replaceme", "replacemen"
- There are more. And could be tiny ones, e.g. "i" for "in" or "is"
- Need to change the algorithm to replace at the original position.
- Ok fixed that.
- "Calculat". Ah crap. When the event sequence is "t e" it replaces "Calculate" with "Calculat e".
- Rerunning preprocessing...
- Source vocab 257
- Target vocab 44
- Good to go, no weird words found
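Roughly how I'd sketch the find-and-separate step (hypothetical code, not the actual preprocessing script; it bakes in the two gotchas above: the -1 returned by find() and replacing at the original position):

```python
# Known prefixes that a raw letter sequence directly follows.
PREFIXES = [
    "letters picked without replacement from ",
    "prob of sequence ",
]

def segment_letter_sequence(question: str) -> str:
    for prefix in PREFIXES:
        start = question.find(prefix)
        if start == -1:
            # find() returns -1 when the prefix is absent; without this check
            # -1 would silently work as an index and corrupt the string.
            continue
        seq_start = start + len(prefix)
        seq_end = seq_start
        # Scan up to (but not including) the terminator: '.', '?' or a space.
        while seq_end < len(question) and question[seq_end] not in ".? ":
            seq_end += 1
        candidate = question[seq_start:seq_end]
        if not candidate.isalpha():
            continue  # e.g. the "{x: 1, f: 14, ...}" form; leave it alone
        # Replace at the original position (not with str.replace elsewhere).
        question = question[:seq_start] + " ".join(candidate) + question[seq_end:]
    return question

print(segment_letter_sequence(
    "What is prob of sequence tfqq when four letters picked without replacement from {q: 2, t: 1, f: 2}?"))
```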
Running number segmentation experiment
- Updating medium config
- Done
- But we need to rebuild the data again. I want it to be as controlled as possible with the baseline, so no data ids.
- Didn't get it set up today
Updated IRM experiment progress
- Achieves reasonable accuracy before penalty increase, again
- Loss is going down steady and significantly after penalty increase!
- If anything the penalty is too low; about the same magnitude as the base loss even with the 100x increase in weight. Could go up another order, but this may require a counterbalance in learning rate rescaling, given that it is currently stable.
Updated IRM experiment progress
- Complete
- Continued to lower xent, but it became less steady at around 0.55-0.60. Reaches about 0.48 average in the last 1000 steps, bouncing around 0.46-0.50.
- Penalty ends up usually on-par or lower order of magnitude than base loss. So I would like to try 10x increased penalty weight, 10x decreased learning rate. We can resume from step 27000.
- Am I running enough steps to expect improvement?
- The CMNIST example kicks in the penalty at 190 steps, and virtually all improvement is made by step 300. 27000 vs. 40000 steps is pretty close to this proportion of steps.
Running IRM experiment with higher penalty
- Penalty weight: 1.0 -> 10.0
- Learning rate counterweight: 10.0 -> 100.0
- Resume from: 27000
- JOBID=849937
- The learning rate wasn't adjusting because it starts at 27001 (849936), so I modified ONMT to check whether the step is greater than the threshold (rather than exactly equal) and apply a toggle switch so it is set only once.
Number segmentation
- Concatenating train difficulty files
- Reshuffling so difficulties are not segregated
- Preprocessing
- Compressing
- Transferring
- Updating config
- Expectation: slower to learn initially, then marginally better final performance, because it can be more efficient (fewer tokens per sentence, which helps the attention mechanism).
- Story for why it is worse: the increased vocabulary, especially with some rare words, impedes the ability to improve upon the baseline.
- JOBID=849949
Number segmentation experiment
- Crap, wrong learning rate schedule
- Correct schedule: JOBID=851155
Baseline with difficulty-split dataset and triple batches
- I think it's important to try this, to see how much results vary simply due to the change in data handling and increased validation set size
- JOBID=851433
- NOTE: THIS WILL BE NAMED `project-dir-irm`!!!
Implementing risk extrapolation
- Assuming that if I just call `penalty.backward()` on its own, it will add to the overall gradients
- Testing
- Assumption false! (workaround sketched at the end of this section)

```
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
```
- Penalty is ~10^7 after jump in this early stage (step 10-20)
- Since it's directly derived from the losses, and losses go to ~10^3 by the time we increase the penalty in the full experiment, variance is expected to be ~10^6. That's ~1000x what it was for IRM. We were already rescaling LR down by 10x for that IRM. So scale down by 10,000x?
- Note: because we aren't doing truncated backprop, the loop over the target sequence is actually just one iteration, with the entire sequence processed in that iteration
-
JOBID=853660
- Bugger, I left early stopping active. Oh well, we will have to extract the model files into the experiment directory, change the config, and resume
- Realised that the 3-batch baseline, based on the most recent config, was probably set to resume from checkpoint 27000, when I wanted it to start from scratch. So I've cancelled that to make room for this experiment.
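A minimal sketch of the workaround (an assumed REx/V-REx-style objective with a variance-of-risks penalty, toy model, not the ONMT integration): compute the per-environment risks, combine them, and call backward() once on the combined loss, which avoids the double-backward error.

```python
import torch

w = torch.randn(4, requires_grad=True)

def risk(w, batch):
    # Stand-in per-environment risk (mean squared error on a toy linear model).
    x, y = batch
    return ((x @ w - y) ** 2).mean()

envs = [
    (torch.randn(8, 4), torch.randn(8)),   # stand-ins for easy / medium / hard
    (torch.randn(8, 4), torch.randn(8)),
    (torch.randn(8, 4), torch.randn(8)),
]

penalty_weight = 1.0
risks = torch.stack([risk(w, batch) for batch in envs])
total = risks.mean() + penalty_weight * risks.var()
total.backward()   # single backward pass; gradients include the variance term
print(risks.detach(), w.grad.norm().item())
```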
REx experiment
- Ok, a few problems
- The loss printed out for `xent` doesn't include the variance, because variance is not included in the stats object
- Variance fluctuates a lot, in the range ~10^5-10^6. But that's just every 100th step. It would be good to get the aggregate over 100 steps.
- As we knew yesterday, early stopping was active. So it quit at 31k steps
- Accuracy does not get as high by step 27000
- This suggests REx has too strong an influence initially. I will decrease weights by 10x, to 1e-5 and 1e-1.
- JOBID=853924
Reviewing REx experiment
- Validation accuracy max is 79.9194 at 33k steps
- Odd that even with a 1e-5 weight on the penalty, the optimization is this much worse (1-2% lower accuracy compared to the baseline or IRM)
- After penalty is applied, xent jumps from 0.29 to 8.29. Then it hovers between 7.8 and 9.5, with no apparent trend.
- This seems bad. The learning rate is tiny but loss still moves around significantly, unstable.
- I don't think it's worth continuing this method unless we have a sentence length-based data split.
Splitting data by length
- How to do this? (a rough sketch is at the end of this section)
- Measure the length of each sequence
- Sort the lengths (preserving index)
- Divide the lengths into three partitions
- Write out the partitions to separate folders
- What can `split_dataset.py` do at the moment?
- Can take an input folder - but we need multiple folders
- How about we specify an input folder template, plus a flag to activate length splitting, then the length splitting function combines the data across difficulties?
- Alternatively, we merge the modules for each difficulty first. Then we run the length splitting on the difficulty-merged modules, writing out to new folders. Finally, we merge the modules together into one file per length partition.
- I like this because we can work with the data already split and tokenized.
- Ok, first step is to merge difficulties. Can we use `merge_for_processing.py` for that?
- Not as-is. Need an interface like `python merge_difficulty.py train-<difficulty>-split comparison__sort`
- The script replaces `<difficulty>` with each difficulty
- But this can still work with `merge_files()`; we just need to change the interface above that
- We will need to shuffle before splitting into training and validation.
- Or will we? Putting the highest lengths into validation could provide a better indication of how it will do on extrapolation.
- Script done. Merges from the original question-answer alternating files. Had to iterate over pairs of lines. Tested on one module and it works.
- The validation set will end up being the longer sentences of each partition. So it will still be a mix, but longer on average.
- Now to split by question-answer, and tokenize by character
- We ought to test character-level first because this controls the experiment as much as possible.
- If this experiment is successful and we have time, we can combine IRM with number segmentation in the hope of maximum gain.
- Now to merge
- Done
- Now to shuffle
- Done
- Now to preprocess
- Done
- JOBID=855461
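The length split sketched under "How to do this?" above, as hypothetical code (file names and layout are my assumptions, not the real scripts):

```python
from pathlib import Path

def split_by_length(src_lines, tgt_lines, out_root, n_parts=3):
    # Sort indices by question length, cut into n_parts equal partitions,
    # and write each partition's questions/answers to its own folder.
    order = sorted(range(len(src_lines)), key=lambda i: len(src_lines[i]))
    part_size = -(-len(order) // n_parts)          # ceil division
    for p in range(n_parts):
        idx = order[p * part_size:(p + 1) * part_size]
        out_dir = Path(out_root) / f"length-{p}"
        out_dir.mkdir(parents=True, exist_ok=True)
        (out_dir / "src.txt").write_text("\n".join(src_lines[i] for i in idx) + "\n")
        (out_dir / "tgt.txt").write_text("\n".join(tgt_lines[i] for i in idx) + "\n")

# Usage sketch with toy data
split_by_length(["a b", "a b c d e", "a"], ["1", "2", "3"], "length-split")
```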
Wrapping up IRM length-split experiment
- Checking progress
- Validation accuracy peaked at 77.85% at 28000 steps
- Validation accuracy converged on 77.71%
- Not a great loss (0.14%). If it were above 0.5%, I'd be concerned.
- Best normalised training loss was 0.28 at step 27000. This jumped to 2.17 at step 27100, then converged to about 0.41.
- After the penalty is applied, loss seems to be more volatile, but still converges steadily overall.
- Checkpoint with lowest training loss was 37000 at 0.38. Validation accuracy decreased thereafter, but was also higher beforehand. This seems like a reasonable bet as the best checkpoint.
- Copying experiment archive
- Done
- Decided to wait for results on length-split before combining with NS. Because maybe length-split is even worse. In that case the best bet would be combining without length-split.
Note while I remember: run the baseline with token normalization. I think this is really important in expectation, in case it solves the problem with length dependence.
- Replaced `data.valid.0.pt` with the 270k-line version that I've been using for later experiments. This could give an error due to incompatibility with the rest of the data files, but I want to test whether it works.
- I imagine if the new validation data had some new vocabulary, this would cause an error. Or it may fail silently by being an unknown token. This is extremely unlikely given the size of the training data.
- JOBID=858514
- JOBID=857452
Checking token normalization experiment
- Darn, it failed because I didn't include `gpu_test.py`. Assumptions! Always do it the same way.
Setting up IRM-ns experiment
- What do we need to do with data?
- Take the NS data: `train-easy-split-ns` etc.
- Merge the NS data by difficulty
- Keep the merged validation data the same
- Shuffle training data
- Preprocess
- Given that the length split gave results no better, and probably worse, than the difficulty split, I will stick with the difficulty split.
- Ok, merge...done
- Shuffle...done
- Preprocess...
- Writing config
- Which penalty weight setting is best?
- Checking the results
- The choice is between 20200312 and 20200314
- Interpolation: 20200314 is better in 4/15 cases
- Extrapolation: 20200314 is better in 8/15 cases
- We care more about extrapolation, so I will go with 20200314
- Fetching config of 20200314
- Penalty weight: `[0.0001, 1.0, 10.0]`
- Now just need to compress data, rsync to cluster, extract, and rsync to project dir...Done
- JOBID=858535
Just recording this in case it matters
- There exists `~/projects/mlp-project/mlp-project/dataset/merged-valid-split` with `merged_src_valid.txt` and `merged_tgt_valid.txt`, saved March 21, 15:40. (For other people reading this, I changed the directory of this project to somewhere else.)
- I am pretty sure this was a mistaken output directory which I fixed, and the correct data in this directory was ultimately used.
- Besides, it was the validation set for the length split experiment, which ultimately does not affect performance (besides deciding when to stop training, but we didn't do that).