Ben project log
Installing mathematics dataset
- Example generate command:
python -m mathematics_dataset.generate --filter=linear_1d
- Get an error from
python -m mathematics_dataset.generate
:train/calculus__differentiate
Traceback (most recent call last):
  File "/home/ben/miniconda3/envs/nlp/lib/python3.7/site-packages/sympy/core/compatibility.py", line 419, in as_int
    raise TypeError
TypeError

During handling of the above exception, another exception occurred:
...
- `sympy` is 1.5.1, their `setup.py` says `sympy>=1.2`
- Downgraded to 1.2
- Now it works
How does BERT work?
- Some or all of it is an encoder. It encodes a sequence of tokens into some tensor.
- Maybe the decoder is the part that is fine-tuned and the encoder is just the pretrained part that ships
Is `transformers` the right library?
- So far it isn't really meant for "seq2seq" tasks, which is what we're doing here
- `fairseq` is for seq2seq but gets a worse rap
https://github.com/pytorch/fairseq/blob/master/fairseq/tasks/translation.py
fairseq translation example
- Read through it. It extends the general training API so it's not what I was looking for.
Installing fairseq from source so we can use examples
Executing "training a new model" preprocessing
- https://fairseq.readthedocs.io/en/latest/getting_started.html#training-a-new-model
- Preprocessing looks complicated
- I need to know what the final format should be, and how I can get there
- The mathematics dataset repo may have insight
- They have a pregenerated data tar
- Files are `.txt`
Preprocessing math data
- Use `fairseq-preprocess` to pre-process and binarise text files, e.g.
fairseq-preprocess \
--trainpref math/train-easy --validpref math/valid-easy --testpref math/interpolate \
--source-lang question --target-lang answer \
--destdir math-bin --dataset-impl raw
- Character-level example: https://fairseq.readthedocs.io/en/latest/tutorial_classifying_names.html
- Data space-separates the characters. Will we be able to do this? We want spaces to be recognised, but for the model to still be character-level.
- Tentative plan to preprocess math data
- Original file format is `difficulty/subject.txt`
- Each `.txt` has a series of pairs of lines of text. The first line in a pair is the question - space-separated words. The second line in a pair is the answer - space-separated words (not sure if it's ever more than one word).
- Separate out questions from answers (see the sketch after this list)
- Read each text file line-by-line
- Every first line written to source file
- Every second line written to target file
- Prepare for tokenization
- Space-separate
- Tokenize
- Organise the data files
- Access the data files (e.g. by subclassing `FairseqTask`)
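A rough sketch of this plan in Python. The file names and helper functions here are placeholders for illustration, not the project's actual scripts; the underscore trick is the one noted under Mechanics further down.

def to_char_tokens(line):
    # Keep spaces recognisable by mapping them to underscores,
    # then space-separate the characters for character-level tokenisation.
    return " ".join(line.replace(" ", "_"))

def split_questions_answers(in_path, src_path, tgt_path):
    # Alternating lines: questions go to the source file, answers to the target file.
    with open(in_path) as fin, open(src_path, "w") as fsrc, open(tgt_path, "w") as ftgt:
        for i, line in enumerate(fin):
            line = line.rstrip("\n")
            out = fsrc if i % 2 == 0 else ftgt
            out.write(to_char_tokens(line) + "\n")

split_questions_answers("algebra__linear_1d.txt",
                        "algebra__linear_1d_src.txt",
                        "algebra__linear_1d_tgt.txt")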
How to build upon original work
- Use multi-modal data: Tokenise English by words, but symbols by characters
Ideas
- It's common to use pretrained word embeddings, or at least train an embedding as the first module of the model. What can we do with math embeddings? Is an Embedding module already part of standard architectures?
Learning
- Attention
- Focus on a particular part of the input at a given step of the forward pass
- Encoder passes every hidden state to the decoder, rather than just the last one
- For each decoder step, a weight (softmaxed attention score) is assigned to each hidden state
- The context vector for a decoder step is the weighted sum of encoder hidden states (see the sketch after this list)
- Multi-headed: expands ability to focus on different positions, and gives multiple "representation subspaces"
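A minimal PyTorch sketch of the weighted-sum step above. The shapes and the dot-product scoring are illustrative only, not a claim about any particular architecture.

import torch

enc_states = torch.randn(10, 512)       # one encoder hidden state per source position
dec_state = torch.randn(512)            # decoder hidden state at the current step
scores = enc_states @ dec_state         # attention scores (dot-product scoring, for illustration)
weights = torch.softmax(scores, dim=0)  # softmaxed attention weights, one per encoder state
context = weights @ enc_states          # context vector = weighted sum of encoder states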
Mechanics
- We can guarantee that spaces are included in the character tokenisation by converting them to underscores
Spinning up a minimal working example with OpenNMT
- Preprocess
- OpenNMT command default
onmt_preprocess -train_src data/src-train.txt -train_tgt data/tgt-train.txt -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt -save_data data/demo
- Split our data: `python split_dataset.py`
- SRC file is ~19 MiB, TGT file is ~1.6 MiB
- Modifying `split_dataset.py` to further split into train and valid
- Adapting preprocess command
onmt_preprocess -train_src data/mathematics_dataset-v1.0/train-easy-split/algebra__linear_1d_src_train.txt -train_tgt data/mathematics_dataset-v1.0/train-easy-split/algebra__linear_1d_tgt_train.txt -valid_src data/mathematics_dataset-v1.0/train-easy-split/algebra__linear_1d_src_valid.txt -valid_tgt data/mathematics_dataset-v1.0/train-easy-split/algebra__linear_1d_tgt_valid.txt -save_data data/demo/demo
- Executed
- OK, we should also add an option for character-level splitting to the script. Or it might be better to write a separate script that overwrites the files.
- Underscores for spaces:
line = line.replace(' ', '_')
- Space separation:
line = ' '.join(line)
- The above ops do not affect single-character lines
- Training
- OpenNMT command default
onmt_train -data data/demo -save_model demo-model
- Our command
onmt_train -data data/demo/demo -save_model demo-model -train_steps 250
- Inference
- OpenNMT command default
onmt_translate -model demo-model_XYZ.pt -src data/src-test.txt -output pred.txt -replace_unk -verbose
- Our command
onmt_translate -model demo-model_step_250.pt -src data/mathematics_dataset-v1.0/interpolate-split/algebra__linear_1d_src_test.txt -output data/demo/pred.txt -replace_unk -verbose
- 1000 step model: predicts 4 every time!
- 250 step model: predicts 3 every time!
- 3 is only about 7% of the validation set answers.
- Either it needs more training, or it's not working at all (a quick check of the predictions is sketched below)
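A quick way to verify the degenerate-output claim, assuming the predictions are one answer per line in `pred.txt`:

from collections import Counter

with open("data/demo/pred.txt") as f:
    counts = Counter(line.strip() for line in f)

print(counts.most_common(5))  # a degenerate model shows one answer dominating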
Diagnosing degenerate model
- Last time we found with a default command that the model (apparently) learns a degenerate solution of the same output (a single digit) no matter what the input is.
- Possible causes
- The data is not being preprocessed as expected (e.g. character-level tokens)
- Overfitting / model too large
- But how could it achieve ~55% training accuracy with the same single digit?
- Something wrong with transfer to the test/inference regime
- Is the data in the same format?
- It would be good to see input/output examples during training, to check that it is in fact outputting the same digit
- Unknowns
- Whether the stdout during training reports training or validation accuracy
- Most likely training, because validation is reported at end of epoch (but how is an epoch defined? The number of epochs argument is deprecated)
- Use the `-valid_steps` argument: perform validation every X steps
- The first thing I'm going to try is a smaller model. The model previously used was ~10M parameters, which I think is very excessive relative to the single file of training data.
- Assume each model parameter is FP32 -> 4 bytes.
`algebra__linear_1d_src_train.txt` is 35770113 bytes. To memorise the data you thus need ~9M parameters.
- Reducing RNN hidden state size from 500 to 50, which reduces the model to ~300k parameters.
- I think a critical oversight I may have had is just how few training steps were performed. Sounds like training batch size is dynamic with "sents"---sentences?---but validation batch size is 32. So 250 steps is not nearly enough to cover 600k samples (600k/32 = 18750)
- Now trying 10k train steps with validation every 1k steps.
- Validation accuracy % is (46 vs. 64) then (60 vs. 62) then (62 vs 62). So if this model ends up still outputting the same digit, the accuracy metric is not doing what I think it's doing.
- Yep, tested inference on checkpoint 5000 and it always outputs 2.
- Woah, hang on a tick. Checkpoint 10000 outputs different digits!
- So it initially learns to output the same digit, then diversifies? I've never seen that kind of learning. From experience with image generation nets, once it learns to output the same thing every time there's no hope. Maybe outputting the same digit is just a product of the parameters still being somewhat random and small from initialisation? I need to understand sequence models better...
- 5000 performance:
PRED AVG SCORE: -2.1865, PRED PPL: 8.9042
- 10000 performance:
PRED AVG SCORE: -2.0108, PRED PPL: 7.4691
- 15000:
PRED AVG SCORE: -1.7671, PRED PPL: 5.8539
- Validation accuracy 54%
- 20000:
PRED AVG SCORE: -1.9983, PRED PPL: 7.3768
- Validation accuracy 68%, yet worse...
- Although the output diversifies, it still doesn't show any sign of understanding the input (i.e. getting correct answers). This highlights the need to change the performance metrics to suit the task, and/or change the preprocessing.
- Is it because there are start and end tokens, and they are included in the accuracy? Or perhaps the space character is included?
- The inference predictions in `pred.txt` don't have any whitespace
- It would be insightful to try a set of problems that require longer answers!
- In the DeepMind paper, `calculus__differentiate` is one of the best-performing modules: P(correct) ~= 93% for the Simple LSTM. It has long answers with symbols, e.g. `8*d**3 - 70*d`
- We need to look into how accuracy is measured, because it is clearly not what we expect. Is there a way to have custom metrics?
Next
- Try `calculus__differentiate` data
Reviewing `calculus__differentiate` size 100 model
- Latest checkpoint 85000
- It has learned syntax well
- Occasionally it produces an answer with partial mathematical correctness
- The correctness is more in terms of getting some of the correct digits in order, but producing too few digits (based on BLEU evaluation, length ratio is very low, e.g. ~20% at step 85000)
- Length ratio was due to a mismatch in the BLEU command; ignore
- Made-up example to illustrate the kind of thing it does: `Differentiate 814530*x**2 -> 16291*x`. So it does some kind of doubling of the coefficient, but misses some of the digits.
- Run BLEU evaluation
perl /home/ben/projects/nlp/OpenNMT-py/tools/multi-bleu.perl <path/to/reference/file> < <path/to/prediction/file>
- Create smaller validation data (1000 lines)
head -1000 data/mathematics_dataset-v1.0/train-easy-split/calculus__differentiate_tgt_valid.txt > data/mathematics_dataset-v1.0/train-easy-split/calculus__differentiate_tgt_valid_1000.txt
- We don't think `onmt_translate` compares to the test answers - you don't specify a reference file. It just computes the model's log likelihood (PRED SCORE) and perplexity (PPL)
- We ran the BLEU Perl command wrong -- mismatching src with tgt. It turns out that BLEU is pretty good relative to the amount of training and model size. So we are basically ready to run on the cluster!
Testing transformer command
- OpenNMT recommended (for Google WMT replication)
python train.py -data /tmp/de2/data -save_model /tmp/extra \
-layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 \
-encoder_type transformer -decoder_type transformer -position_encoding \
-train_steps 200000 -max_generator_batches 2 -dropout 0.1 \
-batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 \
-optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \
-max_grad_norm 0 -param_init 0 -param_init_glorot \
-label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 \
-world_size 4 -gpu_ranks 0 1 2 3
- Leave out `world_size` for non-parallel, and leave out `gpu_ranks` for non-GPU
- This results in 44208683 parameters (44M)
- Saxton et al. (2019) parameters
- Model
- Transformer
- embedding size: 512
- heads: 8
- ff size: 2048
- Optimiser
- Adam
- learning rate: 6e-4
- beta1: 0.9 (default)
- beta2: 0.995
- epsilon: 1e-9 (default)
- Training
- batch size: 1024
- hardware: 8x NVIDIA P100
- batches: 500k
- absolute gradient value clipping: 0.1
- Switching to using a config file based on the `config-transformer-base-4GPU.yml` example in the OpenNMT-py source: `onmt_train -config config/config-transformer-base-4GPU.yml`
- Set according to Saxton settings above
- Also a small version for initial testing: `config-transformer-small.yml`
Reading deep learning for symbolic mathematics paper
- Polish notation is interesting. It would be a challenging text processing problem to convert all relevant expressions in the mathematics dataset to Polish notation. We could manually select a subset of problems that have easily convertible notation.
- Would it make more sense for the sequence model to have input in the opposite order to Polish notation? So that the first operations that need to be computed come first.
Where is the reported ONMT accuracy metric defined
- `utils.statistics.Statistics.accuracy()`: `return 100 * (self.n_correct / self.n_words)`
- As expected
Training small transformer
- Note that batch size (AFAICT) is the number of question-answer pairs, rather than tokens (which would be specified by an explicit setting in config)
- Validation accuracy was much lower (IIRC ~50% compared to ~65%) after 1000 steps; running again with more frequent monitoring
- Still improving after 500 steps
Tutorial
- 10am Minto
- Tutor is a computer vision specialist
- Keep a google doc shared with tutor for progress
- Tutor email: [email protected]
Reviewing DeepMind paper
- Curriculum learning: train over many (all) topics, but not all at once. Find the best order of topics to train for the best final performance.
- Focus on how to improve extrapolation, even if the dataset/model are very limited compared to this paper.
- Compare embedding numeracy with a language model e.g. BERT. Is numeracy much better given we train on explicitly mathematical data? Is extrapolation better?
- I recall a Kaggle competition on predicting the next value in a sequence of numbers. There is such a task under `train-hard/algebra__sequence_next_term.txt` and `train-hard/algebra__sequence_nth_term.txt`. The best Kaggle results could have good insight.
Reviewing DeepMind paper
- Differentiable Neural Computer did not work well - but why? In principle it has good capability, by storing intermediate results in the memory bank
- Simple LSTM: one-hot encoding input
- For LSTM models: additional steps added (encoder or decoder?) with zero input, to allow further computations before outputting the answer
- "we observed that increasing the number of “thinking” steps (as defined above) from 0 up to 16 increased the performance."
- Adaptive Computation Time does not help performance
- Relational memory core (RMC) also tested, but is not better than LSTMs, and best hyperparameter setting yielded 1 memory slot (this somewhat defeats the purpose of RMC).
- "perhaps it is hard for the RMC to learn to use slots for manipulating mathematical entities"
- Perhaps notably, Attentional RMC with bidir LSTM encoder gives the best extrapolation performance besides Transformer, but only 1% above next best of Attentional LSTM with bidir LSTM encoder
- Models predict answers autoregressively, i.e. predict the current token knowing the previously predicted tokens
- What about refining the prediction once it is complete? Like multiple passes over the answer. I'm skeptical that it would help though.
- What do humans do to check their answer? I check that I have understood the question correctly. I check the answer to see if it is sane. Then I might go through my steps of working, checking for local validity.
- AFAICT they do not use beam search, but they don't mention it so maybe they do by default. Beam search is very common.
- Results
- "RMCs were more data efficient but trained more slowly"
- "LSTMs had better asymptotic performance"
- "attentional LSTM and the simple LSTM have similar performance...We speculate that the attentional model is not learning to algorithmically parse the question"
- Something to investigate further
- "the Transformer has various advantages over LSTM architectures, such as (1) doing more calculations with the same number of parameters, (2) having a shallower architecture (with better gradient propagation), and (3) having an internal "memory" that is sequential, which is more pre-disposed to mathematical objects like sequences of digits."
- "Overall it seems that magnitude is easy for neural networks to learn."
- Add/subtract and multiply/divide are good separately (>90%), but poor together (~50%)
- Evidence that models learn relatively shallow tricks, rather than algebraic/algorithmic manipulation
- Transformer is much better at polynomial manipulation, attributed to parallel sequential architecture being able to hold multiple coefficients in memory
- Models cannot add ones together correctly for n >= 7 ones
- May be relying on the numbers being different to align subsums
- We could test this further by trying other repeated numbers, and a mix of repeated and distinct numbers
- Totally consistent question phrasing induces fragility, e.g. the presence of a full stop is the difference between correct and incorrect answers
- Extrapolation: "models completely failed to add together more numbers than seen during training, which agrees with the suspicion that models have learnt to add numbers in parallel rather than calculating subsums"
- So there are competing hypotheses about whether it adds subsums or adds in parallel
- In general this paper speculates several reasons for its results. A good research question could be to test one of these speculations.
- Idea: visualising attention for these models
Reading symbolic calculus paper
- How much did they test functions that are rare under their generation methods? This would indicate generalisation
- They simplify expressions to their shortest unique form, and replace coefficients on like terms with a single coefficient on a single term
- Transformer
- 8 heads
- 6 layers
- 512 units
- Adam
- lr 1e-4
- batch size 256 (equations)
- Inference: beam search with early stopping
- Beam width: 1, 10, 50
- Output is correct if at least one in beam is correct
- This is considered OK when the solution can be easily verified, e.g. differentiating back the integral
- No constraints enforced on output (it tends to learn syntax very well)
- Solutions evaluated by comparing to reference solution (in simplest form) in SymPy
- Not clear that they actually use SymPy; they say "we can"
Current summary of potential research objectives
- Use beam search with early stopping on DeepMind dataset
- Tokenise English as words
- What improves extrapolation (even relatively, at the cost of interpolation performance)
- Visualise attention of trained model to analyse reasoning
- Probe hidden states of trained model and/or generate counterfactual training examples to analyse reasoning
- Does Polish notation improve performance on DeepMind dataset (where applicable)?
- Use insight from "Physics as Inverse Graphics" to improve extrapolation
- Apply neurosymbolic model
Reading Wallace et al.
- Adds further evidence for NNs being bad at extrapolation
- The probe model cannot predict numbers outside the training range from embeddings
- Big accuracy drops when question-answering dataset is modified with bigger numbers, or conversion from digits to words
Reading Do et al.
- The difficulty with multi-step problems can be seen as non-smooth loss - small changes in input can give a big change in the output which isn't correlated in a direct way
- Hypothesis: supervision on intermediate steps smooths the loss to improve learning and in turn performance
- Data: DeepMind Mathematics, limited to particular hard problems
- We could adopt this approach
- Evaluating and simplifying polynomials, evaluating arithmetic expressions using order of operations, finding polynomial roots, and finding remainders.
- Augment with intermediate steps
- Oh...this is a real amateur paper. They don't even have results for their method. Let's hope that's not us!
- At least they open-source it
Setting up on cluster
- Bridge from local to mlp via DICE
ssh -N -L localhost:3306:mlp:3306 s1000116@dice
- `-N` is to not execute a remote command, useful for just forwarding ports
- Can't get this to work
- Set up the repo and env on cluster
Reviewing run-model.sh
- Currently the expected file structure appears to be
project-dir
    config
        config1.yml
        config2.yml
        ...
    exp
        exp1
            data.train.0.pt
            data.valid.0.pt
            data.vocab.pt
        exp2
        ...
    train.sh
- Considerations for data pipeline
- Organise experiment directories by data then by model, or model then data, or everything in the same directory with distinct file names?
- The minimal file structure for the node would just have the necessary files for the particular experiment, i.e.
project-dir
    config.yml
    data.train.0.pt
    data.valid.0.pt
    data.vocab.pt
- To do this, we would copy the filled-in config template into `project-dir` as the generically named `config.yml`.
- Perhaps instead of storing the training data binaries with each experiment, we keep them under `data/bin` by category, copy the relevant binaries into `project-dir` when executing, and delete them from `project-dir` afterwards.
Research question
- Scope: neural (particularly seq2seq) models to solve diverse high-school level worded mathematics problems
- Part 1: what are the mechanisms underlying worse extrapolation?
- Part 2: how can extrapolation performance be improved?
- Procedure:
- Train baseline Transformer on problems that have corresponding extrapolation data
- This is small relative to the full DeepMind dataset; a Transformer of modest size is expected to be sufficient
- Analyse trained Transformer
- Probe outputs for different edge cases
- Visualise attention
- Compare failure
- Problem: the failure modes may not be interpretable or distinguishable
- Based on insights, implement a modification that is expected to improve extrapolation
- Different preprocessing
- SymPy
- Tokenization
- Hard attention
- An extra module after the transformer - look into the broader literature on extrapolation
- Regularisation
- Whether or not the extrapolation difference is interpretable (see above), we need to design something here, something original in some way (it can be just slightly novel)
- Could focus around the edges of extrapolation, because that could be easier
Ashwani's cluster directory structure
project-dir
config
config1.yml
...
exp
calculus__differentiate
data
processed_data # don't need this
data.train.pt
...
logs
log.txt
model
model_step_10000.pt
inference.sh
train.sh
Running on cluster
- Tar command for project directory:
tar -czvf project-dir.tar.gz project-dir
- SLURM job 740897
- Reduce to 4 hours to try to get running quickly: 740898
- New config data subdir: 740899/740900/740901
- Prints out many message like this between training progress (e.g. 5-6 times per 100 training steps):
[2020-02-05 14:31:57,248 INFO] Loading dataset from exp/calculus__differentiate/data/data.train.0.pt [2020-02-05 14:32:00,935 INFO] number of examples: 79137
- Yet tokens per second is fast: e.g. `207651/41274 tok/s` (first number is source, second is target)
- Ashwani reported a similar magnitude in his run: `356742/68644 tok/s`. This is roughly 10x what I got in my laptop CPU run.
- Ashwani's run took 1836 seconds for 1000 steps; on my laptop, 1628 seconds.
- So despite having an order of magnitude higher token processing rate, the total time is about the same between 4 GPUs and 1 CPU. This is highly suspicious.
- This suggests there is a bottleneck other than the model execution. The other main source of computation we have thought of is preparing the data file(s).
- `calculus__differentiate/data.train.0.pt` is 12860868 bytes (~13 MB)
- For comparison: the zipped CIFAR-10 is 163 MB. So raw size should not be a problem.
- Model memory?
- Using a ~250k parameter Transformer
- Separate issue: where is the job accessing and saving data?
- Is there a way to access a node's directories outside slurm?
- The slurm `.out` prints that it saved a model checkpoint: `[2020-02-05 14:52:25,099 INFO] Saving checkpoint exp/calculus__differentiate/model/model_step_2000.pt`
- But this is not in `disk/scratch`, nor my home directory.
- Also, I renamed `exp/calculus__differentiate/data` (the directory given in `train.sh`) to `exp/calculus__differentiate/data1` just as a test. Yet it reports finding a file `exp/calculus__differentiate/data/data.train.0.pt`. I don't know where this could be, or if it somehow infers the directory by regex.
Working out cluster execution
- Let's print out the device
python -c "import torch;print('DEVICE:', torch.cuda.current_device())"
- Job: 773829, 773890
- Prints `0`, which indicates a valid CUDA device
- It throws an error because of `data1` now... weird
- Changed `data1` to `data` under `project-dir`, and re-tar-ed it
- Moved more device info into Python script: 774285
- No error for data now
- Perhaps because the files stayed on scratch? But if I `ls /disk/scratch/s1000116`, there is nothing... But why the error now? Maybe because scratch is deleted periodically, and it stayed there long enough.
, there is nothing...But why the error now? Maybe because scratch is deleted periodically, and it stayed there long enough. - http://computing.help.inf.ed.ac.uk/cluster-tips
Each node has its own local scratch space, and each node can only access its own scratch space. Scratch space is faster than the distributed filesystem because it's always local to the machine.
- It's odd that `data.train.0.pt` is ~12 MiB while `data.valid.0.pt` is ~15 MiB. This is also the case on local. Does it contain the complete dataset? Seems suspicious.
- Default `-shard_size` is 1000000
- Specify `-shard_size 0`: same result
- Specify `-shard_size 10000`: roughly the same total sizes, but train is split into 60 shards of size ~210KB while valid is split into 7 shards of size ~2.3MB (except for the last)
- Naturally I should check the source files...
- `src_train`: ~85.7 MB
- `src_valid`: ~9.5 MB
- `tgt_train`: ~19.0 MB
- `tgt_valid`: ~2.1 MB
- We should try a smaller batch size. By my calculation it's not as big a memory load as e.g. 100 CIFAR images, but it's plausible that it's making a difference. After all, why would DeepMind limit to a batch size of 1024 on 8x P100 GPUs, if they could easily go larger?
- Batch size 16: 775978
- Almost never see the loading printout! But much slower overall: ~8k src tok/sec vs. ~180k (though that's very high variance: saw 70k and 300k)
- 1000 steps in 366 seconds -> 16000/366 = 43.72 samples/second
- Compare batch size 1024: 1000 steps in 931 seconds -> 1024000/931 = 1100 samples/second
- Factor: ~25, which is close to the tok/sec ratio of 22.5
- Suggests that the data bottleneck (if there is one) is no different in total
- I'm assuming 1 step is 1 batch.
- Also added `export CUDA_VISIBLE_DEVICES=0,1,2,3` to `run-model.sh` for this
- Without it: 776374; about the same, so slurm is probably already doing this under the hood, or it's unnecessary anyway
- Batch size 128: 776780
- 1000 steps in 370 seconds -> 128000/370 = 346 samples/second
- Suggests diminishing returns or some optimum between 128 and 1024
- Batch size 256: 777060
- 1000 steps in 415 seconds -> 256000/415 = 617 samples/second
- Batch size 512: 777631
- 1000 steps in 568 seconds -> 512000/568 = 901 samples/second
- So it's diminishing returns, but not negative returns. 1024 is still best.
Recovering 512 batch size 10000-step run
- Tar file names were not consistent at the end of `run-model.sh`: `saved` vs. `saved_base`. I have made them both `saved`
- The saved tar preserves the full directory structure from root. This seems overkill; just use the working directory.
- We should use early stopping
- Log files are split into chunks; a number is appended, e.g. `log.txt.1`. So don't bother using `.txt`
- Note `-shuffle` is not implemented for `onmt_preprocess` -- we have to do it ourselves. But the data should be essentially shuffled anyway; it's just that it will be in a consistent order between experiments.
-- have to do it ourselves. But data should be essentially shuffled anyway; it's just that it will be in a consistent order between experiments. - Validation perplexity is lowest at 1000 steps (4.5) and increases except for 2000-3000
- Compare to training perplexity which starts at 6.9 and ends at 1.14.
- Validation accuracy peaks at 5000 steps (this is ~4 epochs)
Preparing revised experiment
- Early stopping criteria: `ppl` (perplexity) or `accuracy`
- `ppl`
- Early stopping patience: 3
Combined dataset
- Extrapolation files
algebra__polynomial_roots_big.txt arithmetic__add_or_sub_big.txt arithmetic__add_sub_multiple_longer.txt arithmetic__div_big.txt arithmetic__mixed_longer.txt arithmetic__mul_big.txt arithmetic__mul_div_multiple_longer.txt comparison__closest_more.txt comparison__kth_biggest_more.txt comparison__sort_more.txt measurement__conversion.txt numbers__place_value_big.txt numbers__round_number_big.txt probability__swr_p_level_set_more_samples.txt probability__swr_p_sequence_more_samples.txt
- Corresponding training modules
algebra__polynomial_roots arithmetic__add_or_sub arithmetic__add_sub_multiple arithmetic__div arithmetic__mixed arithmetic__mul arithmetic__mul_div_multiple comparison__closest comparison__kth_biggest comparison__sort measurement__conversion numbers__place_value numbers__round_number probability__swr_p_level_set probability__swr_p_sequence
- 15 modules vs. 56: ~27%
- 500k -> 134k batches. Let's say 100k.
- Model: very roughly 25% capacity relative to 30M model, if it fits. One-half settings gives 5.6M parameters which would do.
- Command to shuffle two files in the same way
paste -d '|' src_train.txt tgt_train.txt | shuf | awk -v FS="|" '{ print $1 > "src_train_shuf.txt" ; print $2 > "tgt_train_shuf.txt" }'
- Checked that problems don't contain the `|` character: `grep -rn './' -e '|'`
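A Python equivalent of the paste/shuf pipeline above, in case the `|` delimiter ever becomes a problem. File names here are illustrative.

import random

with open("src_train.txt") as f_src, open("tgt_train.txt") as f_tgt:
    pairs = list(zip(f_src, f_tgt))  # keep question/answer lines aligned

random.shuffle(pairs)

with open("src_train_shuf.txt", "w") as f_src, open("tgt_train_shuf.txt", "w") as f_tgt:
    for s, t in pairs:
        f_src.write(s)
        f_tgt.write(t)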
Attempting baseline
- Unpack tar into specific directory (and remove 1 layer of folder nesting)
tar -xzvf file.tar.gz -C folder --strip-components=1
- 790283 (typo in macro), 790287
- Out of memory
- Will try quarter FF size (512), everything else half: 3,988,779 parameters
- 790298
- Out of memory
- Quarter hidden size (128): 1,406,123 parameters
- I think it is more important to preserve number of heads at 4, and number of layers at 3. This is based on intuitions that (1) number of heads is important to the relative success of the Transformer, (2) number of layers important to allow the model to perform multi-step reasoning.
- Interactive session
- Looks OK - 30 steps
- Has stagnated for about 30 minutes (30 steps was for validation and it hasn't reached that yet). Not sure if this is a problem on my end, or the cluster is just overloaded.
- Full experiment: 790599
- Forgot to recompress project-dir
- Full experiment: 790626
- At 300 steps
- Timed out, forgot about 1 hour limit I put for testing!
- New full experiment: 792747
- Interesting note, worth watching: number of parameters on cluster is slightly different to local. Cluster: 1,409,200; local: 1,406,123.
- Using PGR-Standard partition (probably not supposed to). Got GeForce RTX 2080 Ti which is 11GB. Much better!
- P100s are 12GB or 16GB but I presume DeepMind have the best, so 16GB (which is conservative anyway)
- Seems like these 2080s could handle the quarter-size model.
- We should use a smaller validation set. It seems to be a big bottleneck at this scale.
- Training 1000 steps is taking 5-6 minutes, while validation is taking ~50 minutes!
- Resuming from checkpoint 3000, 66667-size validation set: 794676
Reviewing extrapolation baseline
- Early stopping at 14000 (best perplexity at 11000)
- Is this best? We should train it longer with more patience, just to see
- Running step-11000 predictions on truncated validation set (66667 samples)
- We should also run on inference and extrapolation, and then compare to DeepMind
- Validation average sentence BLEU: 28.45%
- Validation binary accuracy: 23.21%
- Validation corpus BLEU: 59.81%
- Why so different?
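One plausible reason (an assumption on my part, since our metrics.py may compute these differently): averaged sentence BLEU penalises every short answer that lacks higher-order n-gram matches, while corpus BLEU pools n-gram counts over the whole file before taking the ratio, so it is far less harsh on short character-level answers. A minimal sketch with NLTK:

from nltk.translate.bleu_score import sentence_bleu, corpus_bleu, SmoothingFunction

refs = [list("8*d**3 - 70*d")]  # character-tokenised reference answer
hyp = list("8*d**3 - 7*d")      # character-tokenised prediction

smooth = SmoothingFunction().method1
print(sentence_bleu(refs, hyp, smoothing_function=smooth))    # scored per sentence, then averaged over the set
print(corpus_bleu([refs], [hyp], smoothing_function=smooth))  # n-gram counts pooled over the corpus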
Extrapolation set
- 15 modules
- 20,000 samples per module
- 300,000 samples total
Resume extrapolation baseline
- 802891
- 803054
Reading SATNet paper
- Main contribution: a MAXSAT layer
- MAXSAT is a generalisation of SAT: find the maximum number of clauses you can make true by some assignment to the variables. SAT is all the clauses.
- Differentiable
- Input: vector of bits or probabilities
- Transformation: MAXSAT SDP relaxation
- Output: vector of bits or probabilities
- Input of probabilities means that MAXSAT layer can interface with softmax
- They use this to combine a ConvNet with a SATNet to learn Sudoku
Each cell-wise probabilistic output of this convolutional layer is then fed as logical input to the SATNet layer, along with an input mask
- Does this mean every probability at each cell? So 9x9x10 probabilities in total? Or the maximum probability at each cell, giving 9x9 probabilities in total?
- The Transformer uses softmax
- In self and context attention
- In final generator (LogSoftmax)
- It probably does not make theoretical sense to combine the Transformer softmax with the intended functionality of MAXSAT layer. You would want to frame it as a constraint satisfaction problem somehow.
Baseline progress
- Resumed with ppl early stopping, patience of 9, and saving 10 checkpoints: 803054
- I'm assuming it does not save the best model no matter what - you have to ensure the checkpoint range exceeds the patience. But this could be wrong.
- Stopped at 74000 steps
- Best at 65000 steps: accuracy 81.95% ppl 2.273
- Accuracy is still improving: 82.03% at step 74000. Really not sure which metric is better to use for early stopping.
- Resuming with accuracy early stopping, patience of 9: 809323
Scaling laws for neural language models https://arxiv.org/pdf/2001.08361.pdf
- Performance penalty for varying model size (N parameters) relative to dataset size (D samples) is predictable: N^0.74/D.
- Saxton: 56 modules, 30M parameters
- Us: 15 modules => ~5M parameters to get no comparative penalty. This is much more than the ~1.4M we used last. Suggests we should either reduce data or increase model size, but the latter probably isn't feasible since we were hitting memory limits.
- Us: 1.4M parameters => 5% of the full dataset. 15 modules would be ~27%, so need to cut that down by a factor of ~5.4. Either limit to a single difficulty (1/3) or cut math topics or cut the samples per topic. Perhaps a mix of all three is best.
Running evaluation on higher-patience baseline
- Last time: resuming with accuracy early stopping, patience of 9: 809323
- Modified the inference interpolation script for my details
- Checkpoint 100000
- Interpolation: 827879, 827880, 827881, 827882, 827886, 827898
- Extrapolation: 828140
- In saved prediction directory: `for f in *.tar.gz; do tar xzvf $f --strip=6; done`
- Wrote a script to run `metrics.py` on all the prediction files in one go: `run_metrics.sh`
- Results compared to the 14,000 checkpoint
- Interpolation
- Accuracy: 0.21 to 0.44 (0.23)
- Sentence BLEU: 0.27 to 0.32 (0.05)
- Corpus BLEU: 0.34 to 0.41 (0.07)
- Extrapolation
- Accuracy: 0.10 to 0.26 (0.16)
- Sentence BLEU: 0.22 to 0.27 (0.05)
- Corpus BLEU: 0.27 to 0.33 (0.06)
- Big improvement
- The int-exp accuracy ratio increased: 48% to 59%
- But absolute gap increased: 0.11 to 0.18
- Which is the more relevant metric?
- Disproportionate improvement in binary accuracy relative to BLEU
Informatics VPN (for transferring to and from cluster)
- Start root terminal:
sudo -i
- Start OpenVPN:
openvpn --config /home/ben/scripts/Informatics-InfNets-AT.ovpn
- Transfer to local: `bash scripts/transfer_data_mlp_to_local.sh -s s1000116 -m /home/s1000116/experiments/extrapolation_baseline/results.txt -l /home/ben/projects/mlp-project`
Updated thoughts on LM scaling laws
- Scaling Laws for Neural Language Models is not specific to our topic, but provides several useful heuristics to guide training of models like ours. For example, the penalty for mismatching model parameters N with dataset size D is predictable as ~N^0.74/D.
- Assuming Saxton et al. (our seed paper) as a gold standard, N=30M and D=112M examples. We are using a ~1.4M parameter model - about the biggest we can go on the cluster. So the heuristic suggests we train on ~12M examples to get comparable performance. We have instead trained on 30M examples.
- Some of the modules achieved virtually 0 accuracy on our baseline, so we could cut these from the dataset as one way of approaching 12M. On the other hand, it is not essential for our research question to get better absolute performance - we are just interested in improving performance relative to our baseline. Leaving the baseline as-is would save time.
- NMT-GAN
Investigating invariant risk minimisation
- Downloaded the code, running CMNIST experiment
- IRM is less complicated than I thought - basically just add a penalty as the norm of the gradients.
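My reading of the IRMv1 penalty in that code, as a sketch (the real CMNIST script differs in details such as device handling and how environments are batched):

import torch
import torch.nn.functional as F

def irm_penalty(logits, y):
    # Gradient of the per-environment risk w.r.t. a dummy scale on the logits,
    # squared and summed - the "norm of the gradients" penalty noted above.
    scale = torch.ones(1, requires_grad=True)
    loss = F.binary_cross_entropy_with_logits(logits * scale, y)
    grad = torch.autograd.grad(loss, [scale], create_graph=True)[0]
    return (grad ** 2).sum()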
Understanding S4.2.1 of risk extrapolation paper
- Key point is that the scale of the penalty needs to be scaled as a function of training time. Specifically, it should be scaled up (like a step function) when the model begins to overfit. At least that is the claim in this paper, because this coincides with peak performance on CMNIST.
- Overfitting is considered as when the gap between training and validation performance begins to increase significantly
- Waterfall schedule
- Desjardins et al. (2015)
In both cases, learning rates were decreased using a "waterfall" annealing schedule, which divided the learning rate by 10 when the validation error failed to improve after a set number of evaluations.
- This paper:
increasing the relative weight of the penalty term after 100 epochs of training (using a so-called "waterfall" schedule (Desjardins et al., 2015)) is critically important to performance on the colored MNIST task
- Desjardins et al. (2015)
- In IRM, this is where they apply the step-change in scale:
penalty_weight = (flags.penalty_weight if step >= flags.penalty_anneal_iters else 1.0)
loss += penalty_weight * train_penalty
- `flags.penalty_weight` is 91257 and `flags.penalty_anneal_iters` is 190 (epochs)
- To be clear: the penalty term is weighted by 1 for the first 100 epochs, then weighted by 10000 for the rest of training
- The problem for us is, the train-valid gap is present from the beginning. The gap in perplexity (proportional to loss) actually starts higher at nearly 2.0, then decreases and stabilises at roughly 1.0 for most of training. Meanwhile, the accuracy gap starts at roughly 2%, increases, and flattens out at roughly 8-9%. All the while, validation accuracy improves on average throughout the 100 epochs.
- This seems like a qualitatively different regime to the CMNIST domain. By one interpretation, since there is always a generalisation gap, we could argue that the strong penalty be applied from the beginning. By another interpretation, we haven't reached the overfitting regime yet. And by a third interpretation this won't work at all...
- We could test the second interpretation by running from checkpoint 100 to, say, 200. If we find that it does start to overfit in a qualitatively different way, we may have to train our models longer...
- Overfitting in a qualitatively different way means: either validation accuracy decreases significantly, or training accuracy increases while validation accuracy stays the same.
Trying V-REx in original CMNIST code
- Simply set `train_penalty` to the variance of the loss, i.e. `torch.stack([envs[0]['nll'], envs[1]['nll']]).var()`
- This will not necessarily work well out of the box - it depends on sensitivity to hyperparameters
- It works! Results for one trial:
IRM (ours): Flags: grayscale_model: False hidden_dim: 390 l2_regularizer_weight: 0.00110794568 lr: 0.0004898536566546834 n_restarts: 1 penalty_anneal_iters: 190 penalty_weight: 91257.18613115903 steps: 501 Restart 0 step train nll train acc train penalty test acc 0 0.67671 0.53322 4.83176e-06 0.48310 100 0.38461 0.85098 0.00921 0.10160 200 0.88005 0.46266 0.00216 0.82160 300 0.60410 0.68430 2.66043e-10 0.69870 400 0.60134 0.68728 8.43274e-10 0.69840 500 0.59862 0.69084 8.92941e-10 0.69990 Final train acc (mean/std across restarts so far): 0.69084 0.0 Final test acc (mean/std across restarts so far): 0.6999 0.0
Running baseline to test overfitting
- 100k more steps (i.e. to 200k total)
- 830755
Exploratory data analysis
- Within each question, measure average float order of magnitude, minimum, or maximum?
- Arithmetically, the minimum indicates how many digits need to be added together (give or take a carry)
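A small sketch of that measurement (the regex and the log10 convention are my assumptions, not a fixed decision):

import math
import re

def float_magnitudes(question):
    # Order of magnitude (base-10 log) of every number appearing in a question string.
    nums = [abs(float(x)) for x in re.findall(r"-?\d+\.?\d*", question)]
    return [math.log10(n) for n in nums if n > 0]

mags = float_magnitudes("What is 814530 divided by 0.25?")
print(min(mags), max(mags), sum(mags) / len(mags))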
ONMT mod
- There is an `_accum_batches()` function where batches are formed
- Batches are currently a list. Modify this to be a list of lists: `[[batch1_dataset1, batch1_dataset2, batch1_dataset3], ...]`
- Model is executed and loss computed in `_gradient_accumulation()`
- TODO: work out how to implement the loss part
- Ashwani will work out how to go from dataset to batches
batches = [[batch1_dataset1, batch1_dataset2, batch1_dataset3], ...]
for batch in batches:
    losses = []
    for dataset_batch in batch:
        output = model(dataset_batch)
        standard_loss = standard_loss_fn(output, target)
        penalty_loss = penalty_loss_fn(output, target)  # IRM
        loss = standard_loss + beta * penalty_loss
        losses.append(loss)
    avg_loss = sum(losses) / len(losses)
    avg_loss.backward()
batches = [[batch1_dataset1, batch1_dataset2, batch1_dataset3], ...]
for batch in batches:
    losses = []
    for dataset_batch in batch:
        output = model(dataset_batch)
        standard_loss = standard_loss_fn(output, target)
        loss = standard_loss
        losses.append(loss)
    losses = torch.stack(losses)  # keep as a tensor so gradients flow (np.array would break autograd)
    var_loss = losses.var()
    avg_loss = losses.mean()
    total_loss = avg_loss + beta * var_loss
    total_loss.backward()
- Training loop: `onmt.trainer.train`
- Iterates batches from `Trainer._accum_batches(train_iter)`
- This divvies up batches into bags (I think this is a `list` of `torchtext.data.Batch`)
- Batches are passed to `Trainer._gradient_accumulation(...)`
- Batches are iterated
- L364-5: `outputs, attns = self.model(src, tgt, src_lengths, bptt=bptt, with_align=self.with_align)`
- Loss function `Trainer.train_loss` is an argument: `onmt.utils.loss.LossComputeBase`
- This is implemented as `onmt.utils.loss.NMTLossCompute`
- This is passed the `criterion` argument, which is the actual loss function
- `criterion` is specified in `build_loss_compute()`
- We are currently using a positive label smoothing parameter (0.1), so it selects `LabelSmoothingLoss` (L39)
- `LabelSmoothingLoss.forward` is key
Creating `IRMLoss` class
- Subclassing `LabelSmoothingLoss`
- Need to use raw logits to compute the penalty: `use_raw_logits = isinstance(criterion, (SparsemaxLoss, IRMLoss))`
- But `LabelSmoothingLoss` uses probabilities - so how do I get both?
- Well, we use `model.generator[0]` to get raw logits
- `model.generator` is a `Sequential` (see `model_builder.build_base_model`)
- I think we will need to assume the form of `model.generator` and replicate this manually in `IRMLoss`. That way, we can run the penalty on the logits, and `LabelSmoothingLoss.forward` on the probabilities
- The two conditions for it being standard `LogSoftmax` are `not model_opt.copy_attn` and `not model_opt.generator_function == "sparsemax"`. I am confident these are both true.
- From `opts.py`:
group.add('--copy_attn', '-copy_attn', action="store_true",
          help='Train copy attention layer.')
...
group.add('--generator_function', '-generator_function', default="softmax",
          choices=["softmax", "sparsemax"],
          help="Which function to use for generating "
               "probabilities over the target vocabulary (choices: "
               "softmax, sparsemax)")
- Can't find anywhere these arguments are set by force
- Source uses weight decay, not sure if it's important for us
- Can do this manually at `optimizers.py` L53 (`torch.optim.Adam` takes a `weight_decay` arg)
- OK, so this is a separate issue from implementing the IRM loss
- Hang on, maybe we don't need logits
- It's unclear. Phi is just a "data representation". It could be logits or probabilities.
- However, they use logits in their implementation. And the key thing is that we compute the gradient of the loss with respect to the classifier `w`. The classifier is just a scalar, but it scales the logits, not the probabilities. That is important for the gradient computation, I think.
- The conservative assumption is to keep using logits.
- Need a way of knowing the current training step, to schedule the penalty weight
- Option 1: accumulate an integer internally to the loss class
- Easy
- Not robust: loss may get called outside of the training progression, which would make the accumulator invalid.
- But this should be OK. `Trainer.train_loss` is separate from `Trainer.valid_loss`, which builds the loss with `train=False` and thus `valid_loss` will just be NLL. `Trainer.train_loss` is exclusively called in the standard training loop.
- Still an issue: `train_loss` is called for each batch, for each timestep.
- Hack: `self.train_loss.criterion.step += 1` every time `_gradient_accumulation` is called.
Loss in main training loop
- There seems to be a way to avoid modifying code: use the `--accum_count` command line argument.
group.add('--accum_count', '-accum_count', type=int, nargs='+', default=[1],
          help="Accumulate gradient this many times. "
               "Approximately equivalent to updating "
               "batch_size * accum_count batches at once. "
               "Recommended for Transformer.")
- This will sum the gradients over `accum_count` number of batches.
- As of now, the multi-dataset functionality is an alternating yield: `[batch_d1_1, batch_d2_1, batch_d3_1, batch_d1_2, batch_d2_2, batch_d3_2, ...]`
- By using `accum_count` (e.g. 3), `_accum_batches` will group this into lists: `[[batch_d1_1, batch_d2_1, batch_d3_1], ...]`
- One of these lists is passed to `_gradient_accumulation`. The loss for each batch (for each timestep) is computed, and gradients are accumulated by `loss.backward()` (this is only computing the gradients, not updating the parameters). Once all batches in the list are done, the parameters are updated.
- Therefore this is equivalent to averaging the loss over the different datasets (give or take a constant averaging factor; see the toy check below).
Command line arguments
- Which `opts` are passed to `build_loss_compute`?
- I think it's `opts.config_opts(parser), opts.model_opts(parser), opts.train_opts(parser)` (from `bin.train._get_parser`)
- Includes `data_ids` and `accum_count` in `train_opts`
Testing
- Start with a basic input-output test to check for bugs
- OK
- CMNIST replication
- Recording logits, targets, penalties, and loss for
- Step 50 (before penalty weight is increased)
- Step 100 (when penalty weight is increased)
- Need to temporarily replace `base_loss` with the CMNIST one: `binary_cross_entropy_with_logits`
- Running
- Penalties match perfectly for step 50 env 0, step 50 env 1, step 100 env 0
- Penalty mismatch for step 100 env 1: `6.6927e-05`
- Loss mismatch
- Removed weight decay, redoing values
- Running
- Complete match for step 50
- Complete match for step 100, except the least (5th) significant digit on loss differs by 1
- I think we can safely take this as rounding error
- Output
step 0 env 0 true penalty tensor(1.4635e-08) actual penalty tensor(1.4635e-08, grad_fn=<SumBackward0>) env loss tensor(5.6895e-06, grad_fn=<DivBackward0>) env 1 true penalty tensor(5.2920e-08) actual penalty tensor(5.2920e-08, grad_fn=<SumBackward0>) env loss tensor(1.1093e-05, grad_fn=<DivBackward0>) true loss tensor(1.6782e-05) tensor(1.6782e-05, grad_fn=<AddBackward0>) step 1 env 0 true penalty tensor(1.4924e-09) actual penalty tensor(1.4924e-09, grad_fn=<SumBackward0>) env loss tensor(9.0516e-10, grad_fn=<DivBackward0>) env 1 true penalty tensor(1.3743e-08) actual penalty tensor(1.3743e-08, grad_fn=<SumBackward0>) env loss tensor(7.3900e-09, grad_fn=<DivBackward0>) true loss tensor(8.2952e-09) tensor(8.2951e-09, grad_fn=<AddBackward0>)
Setting up test for IRM
- Data
- Single module
- OK but not easy for interpolation
- Hard but not hopeless for extrapolation
- Involves easily measurable extrapolated feature
-
arithmetic__div
(0.68) ->arithmetic__div_big
(0.53) -
arithmetic__mul
(0.47) ->arithmetic__mul_big
(0.32) -
comparison__closest
(0.57) ->comparison__closest_more
(0.30)- Larger gap
-
comparison__sort
(0.98) ->comparison__sort_more
(0.48)- Huge gap!
- Choosing this
- 3 difficulties: easy, medium, hard (E, M, H)
-
.pt
files separated by difficulty:data.train.E.0.pt
,data.train.M.0.pt
, ...
- Single module
- Model
- Small
- Using heuristic: N^0.74 / D
- Saxton et al.: NG=30M, DG=112M -> RG = NG^0.74 / DG = 0.110624598
- Our baseline: NB=1.4M, DB=30M -> RB = NB^0.74 / DB = 0.042757617
- This to Saxton: D=6M -> N = (D * RG)^(1/0.74) = 575K
- This to our baseline: D=6M -> N = (D * RB)^(1/0.74) = 159K
- Aiming for generalisation to our baseline, so aiming for 159K.
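A quick script reproducing these numbers (N and D in millions, following the heuristic above):

def matched_params(d_millions, ratio, exponent=0.74):
    # N such that N**exponent / D equals the reference ratio
    return (d_millions * ratio) ** (1 / exponent)

r_saxton = 30 ** 0.74 / 112      # ~0.1106
r_baseline = 1.4 ** 0.74 / 30    # ~0.0428

print(matched_params(6, r_saxton))    # ~0.58 -> ~575K parameters
print(matched_params(6, r_baseline))  # ~0.16 -> ~159K parameters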
- Splitting dataset
- Modified
merge_for_processing.py
to save filenames as:os.path.join(output_folder, 'merged_'+task+'_'+f_type+'.txt')
e.g.merged_comparison__sort_src_test.txt
- My merge command for this dataset:
python scripts/merge_for_processing.py -i comparison__sort -f ./data -o ./data/train-merged
- Modified
- Preprocessing data
- Modified
config-preprocess.yml
to be a template taking{{task}}
as a variable. Use merged file for validation since that requires a single file. - Modified
preprocess.sh
to useconfig-preprocess.yml
, filling in template, and copying the filled version asconfig-preprocess.{{task}}.yml
, e.g.config-preprocess.comparison_sort.yml
- Command:
bash scripts/preprocess.sh comparison__sort
- Modified
Running test for IRM
- Parameters
- 238163
word_vec_size: 64
rnn_size: 64
layers: 2
transformer_ff: 256
heads: 4
- 91443
word_vec_size: 32
rnn_size: 32
layers: 3
transformer_ff: 128
heads: 4
- 201667
word_vec_size: 48
rnn_size: 48
layers: 3
transformer_ff: 192
heads: 4
- Close enough
- 238163
- It has loaded [E, M, H] datasets before the first step
- Laptop is freezing up
- Process killed, probably due to excessive memory
- One of the train
.pt
files is about 90MB - Confirmed that it reaches 100% RAM once all datasets are loaded
- If I close all programs (except vscode and system monitor) it uses about 80%
- One of the train
- It prints out that it loads a batch from E, M, H each step (I am printing out every training step)
- Does this mean the examples per step are actually 3*1024? This is important so we are sure to run the same number of steps as the baseline.
- I think it is 3*1024. It is printing out the batch size attribute of each batch.
- So should we set batch size to 341? Keep it at 1024 for now
- Keeping at 1024 means we ought to train only for 33300 steps though
- Validation batch size of 32 means this new printout is excessive
- Setting
valid_batch_size
andmax_generator_batches
to 1024 - With 200K examples, there are about 195 batches to get through. Therefore validating every 1000 steps seems reasonable.
- Setting
- It's choosing
LabelSmoothingLoss
. This means all conditions are met for that but not IRM.- Oh, of course. It checks for
LabelSmoothingLoss
conditions first. So I need to put IRMLoss within.
- Oh, of course. It checks for
- IRMLoss now active
- xent is now huge (~1e3 initially) but decreasing
- Perplexity overflows because it is the exponential of xent
- It would be good if we separated out the base loss and penalty for logging purposes
- Accuracy is increasing which is a good sign
- Nevermind, acc started decreasing at step 8. I suspect this is due to the penalty.
- This isn't necessarily bad - in the proper experiment we should only apply this late in training
- Validation perplexity and accuracy are better because it doesn't use IRM
- Accuracy at step 3, 6, 9: 13.7, 33.2, 25.2
Full validation set
- Suppose we want validation to take 20% of the interval spent training
- For the full dataset, with the full 10%, there would be 2930 batches to get through
- So we would have to validate every 15K steps, which is too sparse
- If we want to validate every 1000 steps, we should have 200 batches in validation
- If we want to validate every 5000 steps, we should have 1000 batches in validation
Running baseline for IRM test
- 10+ minutes per 100 steps on laptop
- Moving to cluster
- JOBID=832983
- PGR-Standard
- I forgot to rebuild OpenNMT on the cluster. This is OK for the baseline, because we are still using `accum_steps: 3`, but we must build before IRM.
Reviewing baseline progress
- Cancelled due to time limit
- Reached step 6600/33300
- Validation accuracy at 6000: 98.468
- The main bottleneck is loading data. From step 6100 to 6200, loading data takes ~37 minutes.
- Would splitting into smaller shards help?
- It gets through training steps in bursts: from 6200 to 6600 in 9 minutes. About 140 seconds per 100 steps on average (PGR-Standard).
- Oh, just realised that 33300 steps doesn't make sense for this. The batch size is fixed, and the dataset is much smaller, so we shouldn't be running 100K batches (which is what 33.3K corresponds to for 3 corpora).
- 15 modules, 100000 steps
- 1 module, 6700 steps. So we basically did enough! But accuracy is still improving. So 10000 seems reasonable to try
- We should also reduce patience to 3 given the batches per step is tripled.
Modifying baseline configuration
- Updating config file: patience 9->3, checkpoints 10->4, train steps 33300 -> 10000
- Splitting data smaller
- Try 1/3 of the default shard size, to fit with the 3 datasets
- 1000000 -> 333334
- Now 2 shards per difficulty
- Syncing project-dir
- Actually I don't think the old OpenNMT is quite OK for the baseline, because it does not implement the batch-by-batch alternation of the datasets. Instead it is within-batch (I think). This probably doesn't make much difference, but we should rerun for the sake of control. I'm pretty sure the amount of data per step is the same though, due to gradient accumulation.
- Installing OpenNMT-py fork
Rerun of baseline (from beginning)
- JOBID=833134
- Everything seems to be in order
- Initial load of data was much faster - about 2.5 minutes compared to 10 minutes
- Expect this to take about 4 hours
Setting up IRM
- Only need to modify config? Also set up a new experiment folder
- Modifying
config-transformer-small-risk.yml
- Uncomment risk arguments
-
risk_anneal_steps: 5000
(half oftrain_steps
) -
risk_penalty_weight: 10000.0
(no idea if this is suitable, just going by default in CMNIST code) - I think for now, 4 HP settings will suit:
- steps: 5000, 8000 (baseline was still not overfitting or converged at 6600)
- weight: 10000.0, 100.0 (minimum considered in CMNIST HP sweep was 1e2, max 1e6)
Running IRM
- Hmm, the experiment files get transferred back as `project-dir.saved.tar.gz`. How can we rename or direct this when running multiple experiments?
- Just change `PROJECT_FOLDER` in `scripts/run-model.sh`. It will use this in the save name as well.
- steps=5000, weight=10000.0 : JOBID=833140
- steps=5000, weight=100.0 :
- steps=8000, weight=10000.0 : JOBID=833159
- JOBID=833158, JOBID=833157, JOBID=833155
- Accuracy is poor (15-20%) and getting worse. We need to wait longer, but this could mean either it doesn't work or we need to start with the penalty much lower, or zero.
- Loss has started to increase at 500
- Loss decrease again at 700, but accuracy still getting worse
- No improvement after 4000 steps, final validation accuracy 5.6%
- steps=8000, weight=100.0 : JOBID=833160
- Error from OpenNMT saying address already in use. I'm guessing this is due to the baseline. But does this only happen on the same machine?
- Probably. Both this and 8000-10000 were running on damnii11. 8000-10000 is still going.
Comparing typical magnitudes of penalty and base loss in CMNIST example
- Gives an idea of whether our initial penalty weight should be less than 1.0
- We are recording the base loss and scaled penalty, before overall scaling, for one environment
- Ours
- Tiny test at step 1:
```
[2020-03-06 23:17:10,993 INFO] Latest base loss: 175480336.0
[2020-03-06 23:17:10,993 INFO] Latest penalty: 526372640.0
```
- Tiny test at step 5:
```
[2020-03-06 23:17:25,258 INFO] Latest base loss: 14517791.0
[2020-03-06 23:17:25,259 INFO] Latest penalty: 43507492.0
```
- Same order of magnitude
- Penalty is about 3x larger
- This relationship extends to the penalty weight increase, steps 6-10 (30,000x rather than 3x)
- I think onmt does not normalise raw loss by number of words until LossCompute and print-out:
```python
# statistics.py
def update(...):
    ...
    self.loss += stat.loss
    ...

def xent(self):
    """ compute cross entropy """
    return self.loss / self.n_words
```
- Currently `normalization: sents`, so the actual loss is normalized by batch size. However, the print-out for `xent` seems to be invariably normalized by the total number of tokens in the batch that are not padding.
- How many chars per sentence on average? `wc` for medium is 599999 lines, 24027660 words. Space-separated, so half that is 12013830. That gives 20.023083372 tokens per line. There are 1024 lines per batch, giving 20503.637372729 tokens per batch. Normalization accumulates over datasets, so multiply by 3, giving 61510.912118187 tokens for normalization.
- Tiny step 5 would have had total loss 58025283 if we assume the same loss per partition (not a great assumption). Normalizing by the above gives 943, but `xent: 196.03`. Same order of magnitude, but the loss would have to differ a lot between datasets. Plausible, since that loss seems to be for the hard dataset (the last one loaded).
- I'm suspicious of the ~3x relationship between base loss and penalty.
- LabelSmoothingLoss uses a `sum` reduction for KL-divergence
- CMNIST
- `mean_nll` uses `binary_cross_entropy_with_logits`, which automatically takes the mean. There are 25000 examples per dataset, so this is a large normalization factor.
- `penalty` uses the same `mean_nll` function, so it is also based on this normalized loss (a sketch of this penalty is at the end of this list).
- First few steps:

```
Base loss: 6.7325e-01  Penalty : 2.3887e-04
Base loss: 6.6687e-01  Penalty : 4.7711e-04
Base loss: 7.2201e-01  Penalty : 1.0817e-03
0 0.67006 0.58408 0.00036 0.41900
Base loss: 6.0852e-01  Penalty : 4.4062e-03
Base loss: 5.7582e-01  Penalty : 9.8305e-03
Base loss: 8.3714e-01  Penalty : 2.5737e-02
Base loss: 5.6095e-01  Penalty : 6.8704e-03
Base loss: 5.0142e-01  Penalty : 2.0315e-02
Base loss: 9.7255e-01  Penalty : 1.0545e-01
Base loss: 5.2897e-01  Penalty : 4.5915e-03
Base loss: 4.4080e-01  Penalty : 2.4361e-02
Base loss: 1.1352e+00  Penalty : 2.8289e-01
Base loss: 5.1328e-01  Penalty : 5.1451e-04
Base loss: 3.9398e-01  Penalty : 2.0211e-02
Base loss: 1.3303e+00  Penalty : 6.1654e-01
Base loss: 5.1454e-01  Penalty : 2.2197e-03
Base loss: 3.6207e-01  Penalty : 1.1126e-02
Base loss: 1.5557e+00  Penalty : 1.1603e+00
Base loss: 5.2920e-01  Penalty : 1.5913e-02
Base loss: 3.4508e-01  Penalty : 3.3535e-03
Base loss: 1.7844e+00  Penalty : 1.8739e+00
```

- First few steps after penalty jump:

```
4.4544e-01 4.4479e-01 4.4416e-01 3.0675e-01 3.0671e-01 3.0673e-01
Base loss: 4.4544e-01  Penalty : 2.2744e+01
Base loss: 3.0675e-01  Penalty : 6.1256e+01
Base loss: 1.4705e+00  Penalty : 9.5397e+03
100 0.37610 0.85030 0.00420 0.10000
Base loss: 4.4479e-01  Penalty : 2.1740e+01
Base loss: 3.0671e-01  Penalty : 6.1806e+01
Base loss: 1.4663e+00  Penalty : 9.4390e+03
Base loss: 4.4416e-01  Penalty : 2.0874e+01
Base loss: 3.0673e-01  Penalty : 6.2090e+01
Base loss: 1.4621e+00  Penalty : 9.3380e+03
```

- Initially: penalty ~1-3 orders of magnitude lower
- After penalty jump: penalty ~2 orders of magnitude higher
- Should we have been normalizing loss by tokens?
- I think we figured `batch_type` and `normalization` went together. `batch_type: sents` makes sense (default). But now that I understand length normalization, maybe `normalization: tokens` would be better. Anyway, water under the bridge.
- What's the difference when you normalise the loss inside the penalty vs. outside?
- A factor of N, where N is the normalisation factor. Because the gradients are squared, one factor of N remains hanging. (See the numeric check at the end of this list.)
- This suggests that (unless we change the normalization to match) we should set our penalty weight to `1/batch_size` times their penalty weight, for all training. So the initial weight would be `1/batch_size` and the final weight would be `10000/batch_size`.
- Example: tiny step 5
- Base 1.4517791e07, penalty 4.3507492e07, total 5.8025283e07
- If initial penalty weight is 1e-3, this becomes: base 1.4517791e07, penalty 4.3507492e04, total 1.4561298e07. Penalty has less than 1% effect on loss. This seems reasonable.
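For concreteness, the CMNIST-style penalty being discussed looks roughly like this (a sketch from memory of the IRM reference code, lightly renamed; not our ONMT implementation):

```python
import torch
from torch import autograd, nn

def mean_nll(logits, y):
    # reduction defaults to 'mean', so the risk is already normalised by the
    # number of examples in the environment (25000 per dataset in CMNIST).
    return nn.functional.binary_cross_entropy_with_logits(logits, y)

def irm_penalty(logits, y):
    # Squared gradient of the normalised risk w.r.t. a dummy scale of 1.0.
    scale = torch.tensor(1.0, requires_grad=True)
    loss = mean_nll(logits * scale, y)
    grad = autograd.grad(loss, [scale], create_graph=True)[0]
    return torch.sum(grad ** 2)

# Toy usage: one "environment" of 8 examples
logits = torch.randn(8, 1)
labels = torch.randint(0, 2, (8, 1)).float()
print(mean_nll(logits, labels).item(), irm_penalty(logits, labels).item())
```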
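And a toy numeric check of the factor-of-N point (a hypothetical stand-in loss, nothing to do with the real model):

```python
import torch

# Normalising the loss by N *inside* the penalty (before taking the gradient)
# leaves the squared-gradient penalty a factor of N smaller than normalising
# the already-squared penalty outside.
w = torch.tensor([2.0], requires_grad=True)
loss = (3.0 * w).sum()      # stand-in unnormalised loss
N = 1024.0                  # stand-in batch-size normalisation

g_in = torch.autograd.grad(loss / N, w, create_graph=True)[0]
penalty_in = (g_in ** 2).sum()          # grad of (loss / N), then squared

g_raw = torch.autograd.grad(loss, w, create_graph=True)[0]
penalty_out = (g_raw ** 2).sum() / N    # squared grad, then divided by N

print((penalty_out / penalty_in).item())  # == N
```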
Updating loss implementation to have lower initial penalty
- Allow the `risk_penalty_weight` argument to be a list, so we can specify initial and final values.
- Testing
- Doesn't match what I expect. Penalty is still higher or on-par with base for the first 5 steps.
- Penalty decreases relative to base, and overall. If I move the anneal step to 11, then the penalty is down to ~10^1-10^3 while the base loss remains about 51054
- Yet overall, loss is down 3 orders of magnitude.
- It could be the feedback loop of updating the parameters that causes this difference.
- I can't see how base loss could be downscaled directly, therefore it seems to be the feedback loop.
- So if we account for the normalisation, the orders of magnitude are more in line with CMNIST now.
- But wait. Should we be going by our batch size or the CMNIST batch size to get a reasonable penalty weight? I think it should be CMNIST, which is 25000. That means it's more like `4e-5` and `4e-1` for the weights! Or if we go with their best penalty weight of ~90000, it's about `4e0`.
- But note that I recorded the values above with weight 10000.
- In a 20-step tiny experiment, the validation accuracy after the penalty jump actually increased, from 33 to 35%. That's nice, though not an indication the method is working as desired.
- Let's compromise and round, so: [1e-4, 1e0]
- After the penalty weight is increased, (at least for the 10 steps I am running), the loss fluctuates quite wildly. Normally this would suggest the learning rate is too high. CMNIST results above don't fluctuate so much, though env 1 goes down and env 2 goes up. What to do? The REx paper did say they dropped the learning rate at the same time as increasing the penalty weight.
- Penalty goes down about 5 orders of magnitude in CMNIST from start of jump, to end
- REx paper says "learning rate is simultaneously decreased proportionally [to penalty weight increase]"
- Decreasing proportionally...that's a big decrease!
- Hang on though...this could just be referring to the normalisation when the penalty weight is greater than 1. That's equivalent to decreasing the learning rate proportionally.
- I tried decreasing the learning rate proportionally, and it still fluctuates a lot. Understandably, accuracy takes less of a hit. But I'm not sure this is what we want.
Running IRM test
- Penalty weight [0.0001, 1.0]
- JOBID=833276
- xent is never crazy. But it still fluctuates, relative to the magnitude ~1e-1
- Accuracy decreases but not a lot. As we know, small differences in token accuracy can transfer to large differences in binary accuracy.
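- To make that last point concrete (a back-of-the-envelope illustration assuming independent token errors, not a measured relationship): with per-token accuracy p and answers of about k tokens that must all be correct, exact-match accuracy is roughly p^k, e.g. 0.98^10 ≈ 0.82.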
Checking small IRM experiment
- Seems to be in order
- Still have the varying xent issue
- Could it be that the base loss is the main thing moving around while the penalty is decreasing? Let's check the CMNIST example.
Running inference for small IRM experiment
- Command
```bash
onmt_translate \
    -model /home/ben/projects/mlp-project/backup/risk_test/risk_test_baseline/experiments/risk_test_baseline/model/model_step_8000.pt \
    -src dataset/interpolate-split/comparison__sort_src_test.txt \
    -output experiments/risk_test_baseline/pred_interpolation_comparison__sort_8000.txt \
    -replace_unk -verbose
```
- No smoothing on BLEU
- Selecting checkpoint for IRM
- Around where the lowest loss was?
- 16-17000 steps seems like the lowest average loss -> step 17000
- But should we go with the best average over the 1000 steps, or the best at the checkpoint steps?
- The top two at 1000x steps are 12000 (0.20) and 19000 (0.18). 19000 seems like too much additional training to ask for.
- Decision: 12000
- Baseline
  - interpolate: Average sentence BLEU: 0.9922, Corpus BLEU: 0.9907, Average accuracy: 0.9137
  - extrapolate: Average sentence BLEU: 0.8947, Corpus BLEU: 0.8966, Average accuracy: 0.2481
- IRM
  - interpolate: Average sentence BLEU: 0.9423, Corpus BLEU: 0.9342, Average accuracy: 0.7055
  - extrapolate: Average sentence BLEU: 0.3740, Corpus BLEU: 0.2783, Average accuracy: 0.0026
- Bugger...
- IRM on extrapolation outputs a lot of single-digit answers. This makes me think of length normalisation.
Post mortem on first IRM experiment
- Things to check
- How the number of numbers in `comparison__sort` varies for easy, medium, hard, interpolate, extrapolate (should have done this first!)
- Whether a different weight penalty improves the result
- The fact that it completely failed on extrapolation suggests that the model is penalised too much
- 1e-2 seems reasonable. If it is too small, there won't be as much degradation of performance.
- It seems OK to keep the initial weight at 1e-4 since CMNIST only varies the final weight, and we get almost as-good results up until the switch.
- The normalization of the loss
- Whether `normalization: tokens` fixes it (currently `sents`)
- How the gradient accumulation works (are my assumptions correct?) (see the quick check after this list)
- For example, is adding the gradients equivalent to adding the losses? I assume so, because the gradient of a sum is the sum of the gradients
- See "Distributive properties" here: https://en.wikipedia.org/wiki/Vector_calculus_identities
- Make sure I understand the distributed nature of this operation
- Things to change
- Print out the base loss and penalty separately after `report_every` steps (currently we just print the total)
- Alternatives
- Risk extrapolation
- This may also work with gradient accumulation - check the equations
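Quick sanity check of the gradient-of-a-sum point above (a toy linear example, not the ONMT code):

```python
import torch

# Accumulating gradients over two mini-batches should equal the gradient of
# the summed loss, because the gradient of a sum is the sum of the gradients.
w = torch.tensor([1.0, -2.0], requires_grad=True)
x1, x2 = torch.tensor([3.0, 1.0]), torch.tensor([0.5, 4.0])

(w * x1).sum().backward()          # first "accumulation" step
(w * x2).sum().backward()          # second step, gradients add up in w.grad
accumulated = w.grad.clone()

w.grad = None
((w * x1).sum() + (w * x2).sum()).backward()   # single combined loss
print(torch.allclose(accumulated, w.grad))      # True
```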
Changing IRM penalty weight
- Changes to config
- Final penalty weight of 0.01
- Save 13 checkpoints (see below)
- If we had saved checkpoint 8000 we would have been able to resume from there, since there is no change in the initial penalty weight. Oh well, maybe next time.
- Modifying ONMT to print base loss and penalty every `report_every` steps
- We should print for each difficulty, which means we need to keep a list (a sketch is below)
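A minimal sketch of what that per-difficulty reporting could look like (hypothetical helper, not the actual ONMT classes):

```python
from collections import defaultdict

class RiskReporter:
    """Hypothetical helper: keep per-difficulty lists, flush every report_every steps."""

    def __init__(self, report_every=100):
        self.report_every = report_every
        self.base_losses = defaultdict(list)
        self.penalties = defaultdict(list)

    def record(self, difficulty, base_loss, penalty):
        self.base_losses[difficulty].append(base_loss)
        self.penalties[difficulty].append(penalty)

    def maybe_report(self, step):
        if step % self.report_every != 0:
            return
        for difficulty in sorted(self.base_losses):
            base, pen = self.base_losses[difficulty], self.penalties[difficulty]
            if not base:
                continue
            print(f"step {step} [{difficulty}] "
                  f"base {sum(base) / len(base):.2f} "
                  f"penalty {sum(pen) / len(pen):.2f}")
            base.clear()
            pen.clear()

# Usage sketch
reporter = RiskReporter(report_every=2)
reporter.record("easy", 2200.0, 80.0)
reporter.record("hard", 5100.0, 160.0)
reporter.maybe_report(step=2)
```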
Reviewing dataset analysis
- Need to figure out how I'd like it to be processed and presented
- I deleted the split data on my laptop so I have just redone that
Testing extra ONMT reporting
- Experiment name: `risk_test_tiny_reporting`
- It all fluctuates, even before penalty increase (though on average, it seems to decrease)
- Trying to reconcile the `xent` value with the loss and penalty
- It can't be a final vs. average issue, because we are reporting at every step
- `xent` is `self.loss / self.n_words`
- `n_words` is output/target words
- I'm going to print out `self.loss` as well so we can see exactly what's going on
- By adding the base losses and weighted penalties, I get 50396. The reported loss is 53675. Close, but more than a rounding error.
- Wait no, the loss is printed after training is reported. So we compare to 78632. That's very different!
- Oh wait, the penalty is already scaled. It just starts off large.
- Ok, it makes sense now.
- Although, I'm not sure why it prints the loss to be added three times each...
- Whoops, I was saving in `risk_test_tiny`
Preparing full dataset
- First: merge the data
- Gah, I merged all the modules. I just need the fifteen.
- Next: preprocess
- Modifying `config-preprocess.yml`
- `bash scripts/preprocess.sh merged`
- It may be that OpenNMT shuffles the data. But what about the shards?
- See here in `onmt.utils.parse.py`:

```python
@classmethod
def validate_preprocess_args(cls, opt):
    ...
    assert opt.shuffle == 0, \
        "-shuffle is not implemented. Please shuffle \
        your data before pre-processing."
```
- Ok, so we need to shuffle:
```bash
paste -d '|' src_train.txt tgt_train.txt | shuf | awk -v FS="|" '{ print $1 > "src_train_shuf.txt" ; print $2 > "tgt_train_shuf.txt" }'
```
- We need to create a reduced validation set
- So we should shuffle to ensure all tasks are represented
- How big? Currently it is 3000015. I think 270k is reasonable - 1% of training.
- Ok, that's done. 270000 samples.
Idea for what causes differences in performance
- Similarity of the question and answer
- Could measure this by passing the answer over the question, counting the characters that match at each position in the question, then average
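A rough sketch of that overlap measure (hypothetical helper; this is one interpretation of "passing the answer over the question", and the exact definition would need tuning):

```python
def answer_question_overlap(question: str, answer: str) -> float:
    # Slide the answer across the question; at each offset count the characters
    # that match position-for-position, then average over all offsets.
    if not question or not answer or len(answer) > len(question):
        return 0.0
    offsets = range(len(question) - len(answer) + 1)
    matches = [
        sum(q == a for q, a in zip(question[i:i + len(answer)], answer))
        for i in offsets
    ]
    return sum(matches) / len(matches)

print(answer_question_overlap("Sort -3, 325, 32 in descending order", "325"))
```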
Reduced penalty experiment
- 1e-2 max
- JOBID=840834
Running full model
- Writing the config file
- Copy the actual baseline config
- One-third `train_steps`
- But we want to train longer
- 120000 -> 40000
- No early stopping
- Penalty weight [0.0001, 0.01]
- Make sure we save enough checkpoints -> 14 = 1 + (40000 - 27000) / 1000
- When to apply the penalty?
- Validation accuracy is pretty much flat after 80000 steps. So I choose that.
- Rounding to 27000 steps for the 1/3 conversion
- Transferred to cluster and project-dir set up
- JOBID=840910
Checking small experiment with smaller weight
- Now it seems like the penalty is too small to influence the learning
- I haven't evaluated it, but it doesn't look promising to me because the penalty fluctuates a lot.
- TODO
- In the IRM CMNIST example, they normalised the loss by the penalty weight. I thought that we didn't have to do this, because our penalty weight doesn't exceed 1.0 at the moment (the condition is still in the code though). But I think this is wrong. If we don't normalise, and we want the penalty to become much more significant than the base loss, the magnitude of the loss should be kept at a similar scale for training stability. I thought "but dividing by penalty weight of 1.0 has no effect". But what we actually need to do is divide by the ratio of the initial and final penalty weight: e.g. 1.0 / 0.0001 = 10000.0
- No, that isn't right either. If the base loss is 3000 and the increased penalty is 1200, we don't want to normalise by 10000.0.
- Earlier application of penalty: maybe the model's representations are already too entrenched by step 8000?
- Resume the baseline to 20000 steps. Even though it looked like it was overfitting, it's possible it could improve again. We would have full confirmation this way.
Byte-pair encoding
- The lua scripts are for the old OpenNMT. For OpenNMT-py, we just have `bpe_pipeline.sh`
- Not sure if input is tokenized already or not
- There is a `.src` file in `data` which looks like a names dataset, space-separated characters. But this is not tokenized, just prepared for tokenization. So I think we give the raw file.
- Need to run split, but with whole word tokenization
- This is the default
- Done
- Setting up `bpe_pipeline.sh` config
Full model experiment
- 23000 steps
- I'm curious what the average base losses are for each dataset
- Average easy base loss: 2238.85
- Average medium base loss: 3491.49
- Average hard base loss: 5013.37
- Note that at any specific training step this ordering does not necessarily hold by any means. It moves around a lot.
- Also, each data point is for one step (every 100 of them).
- 32500 steps
- Like the smaller experiment, penalty jumps around a lot. I don't quite know what to expect but it seems like a bad sign.
- I think a good thing to try is rewind training to where the penalty increases, apply an increased penalty weight (0.1, or 1.0; I'm thinking 1.0 so IRM dominates in a 10-100x way), but scale down the learning rate in proportion. So 0.000006 instead of 0.0006.
- As a hack we can make the 3rd item in the penalty weights option the learning rate factor, e.g. 100, then divide by this quantity
- Implementing option to reduce learning rate
- It will be the last value in the list of `penalty_weight`
- I assume scaling it on the inside is equivalent, but maybe there are subtleties with the gradient and optimizer (e.g. momentum) calculations that make this invalid. I will look at the options
```python
elif opt.start_decay_steps is not None:
    return functools.partial(
        exponential_decay,
        rate=opt.learning_rate_decay,
        decay_steps=opt.decay_steps,
        start_step=opt.start_decay_steps)
```

```python
def exponential_decay(step, rate, decay_steps, start_step=0):
    """A standard exponential decay, scaling the learning rate by
    :obj:`rate` every :obj:`decay_steps` steps.
    """
    return rate ** (max(step - start_step + decay_steps, 0) // decay_steps)
```
- Note it returns the scale to multiply the base learning rate by
- Is this default?
- Yes: `learning_rate_decay=0.5, start_decay_steps=50000, decay_steps=10000`
- Ah crap. This means baseline had LR reduced at 50000, and IRM should have the equivalent 16667, but we didn't specify that.
- On checking the Adam update rule, there is a term that squares the gradient, which would square the inner learning rate. This seems problematic, so I should avoid this method.
- Default schedule adjusted for triple batch
- 16667:2,20000:4,23333:8,26667:16,30000:32,33333:64,36667:128
- Suppose the following schedule: 0.5, 20000, 1000. With 1.0 penalty weight.
- 22000:32,23000:64,24000:128,25000:256,26000:512,27000:1024
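Quick check that the adjusted settings reproduce the triple-batch divisor schedule above (re-using the `exponential_decay` arithmetic quoted earlier; this checks the function, not exactly when ONMT applies the new rate):

```python
def exponential_decay(step, rate, decay_steps, start_step=0):
    return rate ** (max(step - start_step + decay_steps, 0) // decay_steps)

# Assumed adjusted settings: rate=0.5, start_decay_steps=16667, decay_steps=3333
for step in (16667, 20000, 23333, 26667, 30000, 33333, 36667):
    scale = exponential_decay(step, rate=0.5, decay_steps=3333, start_step=16667)
    print(step, int(round(1 / scale)))  # divisors 2, 4, 8, 16, 32, 64, 128
```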
Reviewing dataset stats
- Some significant differences in performance
- The key stat varies between modules (key stat = stat that gives biggest difference in performance between splits across that stat)
- The key stat is sometimes expected, sometimes surprising. For example, number length is key for division (expected). Sentence length appears key for sorting on extrapolation, while number of numbers isn't really valid because it's either 10 or 11. Something seems fishy about that, because sentence length doesn't matter nearly as much in interpolation.
- We could try unifying the interpolation and extrapolation to get a smooth overall picture. But that already assumes the stats we are measuring are responsible for the difference in performance.
- We need to measure all these stats on train-easy, medium, hard to validate their use as a proxy
Length normalization
- It seems like a good idea to use. I wish we had done it for the baseline.
- Some of the modules (in extrapolation) vary in performance just by sentence length. Token normalisation may improve this.
- Do Saxton et al. use length normalization?
- "We minimize the sum of log probabilities of the correct character"
- No active mention of it in the paper
- As far as we know, besides the settings they explicitly specify, they match the original Transformer
- The original Transformer paper makes no mention of it, and I haven't found any reference to it in the tensor2tensor hyperparameter file (`transformer.py`)
- Therefore, we can say with about 90% confidence: no
- The only way we are wrong is if they are not telling the whole truth when they say "sum of log probabilities"
Transferring first full IRM experiment files
- Lowest xent after penalty:
- 0.11 : 30900 (first; occurs many times)
- 0.10: 38200, 39600
- So the final checkpoint of 40000 is a reasonable choice
Learning rate adjustment
- Given the restrictions on standard learning rate scheduling, I've changed my mind and I'm going to keep the rescaling within the loss class.
- Testing rescaling in tiny test again...OK
- Testing learning rate decay config setting...
- `start_decay_steps` set to 5 but it went from 0.0006 to 0.0003 at step 4
- If I set it to 6, it halves at 5. So a consistent 1-behind. Oh well, I'll just work around that.
- Oh goodness, I forgot I wrote a function to modify the learning rate directly!
- Ok, looking good now.
- Now, let's consider the learning rate divisor schedule again:
- 16667:2,20000:4,23333:8,26667:16,30000:32,33333:64,36667:128
- And where the penalty was at with weight 1e-2
- Average: 106 (std 370)
- Side note: although everything fluctuates a lot, on average the hard penalty (163) is about double the easy (74) and medium (81). Easy has higher std (277) than medium (242).
- Ok, so the learning rate will be 16x smaller already, if we follow the original schedule
- Then, if we changed the final penalty weight to be 1e0, that would bring the effective multiplier to 100/16 = 6.25. So then it would be reasonable to add a further scaling factor of 10, reducing the effective multiplier to 0.625.
- Ok, let's write the config.
- `learning_rate_decay`: 0.5 -> 0.5
- `start_decay_steps`: 50000 -> 16667
- `decay_steps`: 10000 -> 3333
- `risk_penalty_weight`: [0.0001, 0.01] -> [0.0001, 1.0, 10.0]
Running full model IRM experiment with learning rate decay
- JOBID=846369
Number segmentation
- Maybe https://github.com/OpenNMT/Tokenizer
- `pip install pyonmttok`
- Seems to work as desired with this setting:
```python
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True, segment_numbers=True)
```
- "aggressive" rather than "conservative" to force numbers to be segmented
```
['Sort', '-■', '3', '■,', '3■', '2■', '5', '■,', '3■', '2', '■,', '4■', '5', '■,', '-■', '1', '■,', '0', 'in', 'descending', 'order']
['Let', '-■', 'm', '■*', '■■', '2', '■/■', '3', '-', '1■', '9■', '9■', '5■', '0■', '1', '■■', 'm', '■/■', '3', '-', '3■', '9■', '8■', '9■', '9■', '8', '■/■', '3', '=', '0', '■.', 'Calculate', 'm', '■.']
['What', 'is', 'prob', 'of', 'picking', '2', 'j', 'and', '2', 'c', 'when', 'four', 'letters', 'picked', 'without', 'replacement', 'from', '{■', 'x', '■:', '1', '■,', 'f', '■:', '1■', '4', '■,', 'j', '■:', '2', '■,', 'c', '■:', '2', '■}', '■?']
['What', 'is', '(■', '8', '-', '(', '■-■', '5', '-', '-■', '1■', '0', '■)', '■)', '+', '-■', '9', '+', '-■', '1', '+', '2■', '3■', '0', '+', '-■', '2■', '4■', '4', '■?']
```
- The special characters above are due to `joiner_annotate=True`. It gives more information about where the symbol is occurring (good), but increases vocab size (bad).
- Remember to shuffle this data too, once merged
Updated full IRM experiment
- Going fine
Number segmentation
- How much would the joiner annotation increase vocab?
- Using `vocab.py` on the full dataset under `dataset/bin/merged`
- Why don't `{` and `}` show up?
- Source (non-alpha): 20
- Target (non-alpha): 18
- Left-side annotation: `?`, `,`, `}`, `)`, `:`, `*`, `.`
- Right-side annotation: `[0-9]`, `-`, `{`, `(`
- Middle annotation: `-`, `/`, `*`, `.`
- `+` and `=` are not annotated because they are always middle
- Total: ~24 additional. Roughly doubles the non-word vocab size.
- How would joiner annotation help?
- Learn different representation for symbol depending on its role relative to adjacent tokens.
- Don't know if this matters much for Transformer, given it has self-attention
- Decision: `joiner_annotate=False`
- `split_dataset.sh`
- `merge_for_preprocessing.sh`
- `shuffle.sh`
  - Modified to use `ns` postfix
- Truncation of validation files to `small` using `head -27000`
  - Now part of `shuffle.sh`
- Update `config-preprocess.yml`
  - Modified to use `ns` postfix
- `preprocess.sh`
  - Modified to title `merged-ns`
- `vocab.py`
- Source: enormous
- Ah crap, it's all those sequences of letters for probability
- Target: 44
- Same as before except `e` is included and `_` is excluded. Maybe because of the different validation truncation? That's its own issue, by the way...
- How can we segment those letter sequences?
- Exploit the fact they are always (?) preceded by "from"
- From the generating code, can confirm this always holds and the match is even longer: "letters picked without replacement from"
- This is the same for both probability modules
- Ok, I've figured out how to find and separate that part (a rough sketch is at the end of this section).
- What about when it's not a stream of letters? It will match the next letter; that's bad.
- So we need to match full stop or question mark then shave it off once matched
- What about for `sequence`, when the event is also a string of letters?
- Always preceded by 'prob of sequence '. So same technique. Except the sequence can be followed by a space as well as a full stop or question mark.
- Testing
- Oops! Need to catch when it doesn't find the prefix. That returns -1 so it works as an index even though you don't want it to.
- Looks like it works a charm now!
- Exploit the fact they are always (?) preceded by "from"
- Rerunning preprocessing...
- Source vocab 279, that's a good sign.
- "Calculat". Ah crap. When the event sequence is "t e" it replaces "Calculate" with "Calculat e".
- 60 cases in easy
- 45 cases in medium
- 68 cases in hard
- 3 cases in valid
- "Wh"=3, "Wha"=204. Another probability mistake.
- "fr"=9, "fro"=379
- "picke"
- "pr", "pro"
- "replaceme", "replacemen"
- There are more. And could be tiny ones, e.g. "i" for "in" or "is"
- Need to change the algorithm to replace at the original position.
- Ok fixed that.
- "Calculat". Ah crap. When the event sequence is "t e" it replaces "Calculate" with "Calculat e".
- Rerunning preprocessing...
- Source vocab 257
- Target vocab 44
- Good to go, no weird words found
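Roughly how I'd sketch the find-and-separate step (hypothetical code, not the actual preprocessing script; it bakes in the two gotchas above: the -1 returned by find() and replacing at the original position):

```python
# Known prefixes that a raw letter sequence directly follows.
PREFIXES = [
    "letters picked without replacement from ",
    "prob of sequence ",
]

def segment_letter_sequence(question: str) -> str:
    for prefix in PREFIXES:
        start = question.find(prefix)
        if start == -1:
            # find() returns -1 when the prefix is absent; without this check
            # -1 would silently work as an index and corrupt the string.
            continue
        seq_start = start + len(prefix)
        seq_end = seq_start
        # Scan up to (but not including) the terminator: '.', '?' or a space.
        while seq_end < len(question) and question[seq_end] not in ".? ":
            seq_end += 1
        candidate = question[seq_start:seq_end]
        if not candidate.isalpha():
            continue  # e.g. the "{x: 1, f: 14, ...}" form; leave it alone
        # Replace at the original position (not with str.replace elsewhere).
        question = question[:seq_start] + " ".join(candidate) + question[seq_end:]
    return question

print(segment_letter_sequence(
    "What is prob of sequence tfqq when four letters picked without replacement from {q: 2, t: 1, f: 2}?"))
```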
Running number segmentation experiment
- Updating medium config
- Done
- But we need to rebuild the data again. I want it to be as controlled as possible with the baseline, so no data ids.
- Didn't get it set up today
Updated IRM experiment progress
- Achieves reasonable accuracy before penalty increase, again
- Loss is going down steady and significantly after penalty increase!
- If anything the penalty is too low; about the same magnitude as the base loss even with the 100x increase in weight. Could go up another order, but this may require a counterbalance in learning rate rescaling, given that it is currently stable.
Updated IRM experiment progress
- Complete
- Continued to lower xent, but it became less steady at around 0.55-0.60. Reaches about 0.48 average in the last 1000 steps, bouncing around 0.46-0.50.
- Penalty ends up usually on-par or lower order of magnitude than base loss. So I would like to try 10x increased penalty weight, 10x decreased learning rate. We can resume from step 27000.
- Am I running enough steps to expect improvement?
- The CMNIST example kicks in the penalty at 190 steps, and virtually all improvement is made by step 300. 27000 vs. 40000 steps is pretty close to this proportion of steps.
Running IRM experiment with higher penalty
- Penalty weight: 1.0 -> 10.0
- Learning rate counterweight: 10.0 -> 100.0
- Resume from: 27000
- JOBID=849937
- The learning rate wasn't adjusting because it starts at 27001 (849936), so I modified ONMT to check whether the step is greater than the threshold (rather than exactly equal) and apply a toggle switch so it is set only once.
Number segmentation
- Concatenating train difficulty files
- Reshuffling so difficulties are not segregated
- Preprocessing
- Compressing
- Transferring
- Updating config
- Expectation: slower to learn initially, then marginally better final performance, because it can be more efficient (fewer tokens per sentence, which helps the attention mechanism).
- Story for why it is worse: the increased vocabulary, especially with some rare words, impedes the ability to improve upon the baseline.
- JOBID=849949
Number segmentation experiment
- Crap, wrong learning rate schedule
- Correct schedule: JOBID=851155
Baseline with difficulty-split dataset and triple batches
- I think it's important to try this, to see how much results vary simply due to the change in data handling and increased validation set size
- JOBID=851433
- NOTE: THIS WILL BE NAMED `project-dir-irm`!!!
Implementing risk extrapolation
- Assuming that if I just call `penalty.backward()` on its own, it will add to the overall gradients
- Testing
- Assumption false! (workaround sketched at the end of this section)

```
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
```
- Penalty is ~10^7 after jump in this early stage (step 10-20)
- Since it's directly derived from the losses, and losses go to ~10^3 by the time we increase the penalty in the full experiment, variance is expected to be ~10^6. That's ~1000x what it was for IRM. We were already rescaling LR down by 10x for that IRM. So scale down by 10,000x?
- Note: because we aren't doing truncated backprop, the loop over the target sequence is actually just one iteration, with the entire sequence processed in that iteration
-
JOBID=853660
- Bugger, I left early stopping active. Oh well, we will have to extract the model files into the experiment directory, change the config, and resume
- Realised that the 3-batch baseline, based on the most recent config, was probably set to resume from checkpoint 27000, when I wanted it to start from scratch. So I've cancelled that to make room for this experiment.
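A minimal sketch of the workaround (an assumed REx/V-REx-style objective with a variance-of-risks penalty, toy model, not the ONMT integration): compute the per-environment risks, combine them, and call backward() once on the combined loss, which avoids the double-backward error.

```python
import torch

w = torch.randn(4, requires_grad=True)

def risk(w, batch):
    # Stand-in per-environment risk (mean squared error on a toy linear model).
    x, y = batch
    return ((x @ w - y) ** 2).mean()

envs = [
    (torch.randn(8, 4), torch.randn(8)),   # stand-ins for easy / medium / hard
    (torch.randn(8, 4), torch.randn(8)),
    (torch.randn(8, 4), torch.randn(8)),
]

penalty_weight = 1.0
risks = torch.stack([risk(w, batch) for batch in envs])
total = risks.mean() + penalty_weight * risks.var()
total.backward()   # single backward pass; gradients include the variance term
print(risks.detach(), w.grad.norm().item())
```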
REx experiment
- Ok, a few problems
- The loss printed out for `xent` doesn't include the variance, because variance is not included in the stats object
- Variance fluctuates a lot, in the range ~10^5-10^6. But that's just every 100th step. It would be good to get the aggregate over 100 steps.
- As we knew yesterday, early stopping was active. So it quit at 31k steps
- Accuracy does not get as high by step 27000
- This suggests REx has too strong an influence initially. I will decrease weights by 10x, to 1e-5 and 1e-1.
- JOBID=853924
Reviewing REx experiment
- Validation accuracy max is 79.9194 at 33k steps
- Odd that even with a 1e-5 weight on the penalty, the optimization is this much worse (1-2% lower accuracy compared to the baseline or IRM)
- After penalty is applied, xent jumps from 0.29 to 8.29. Then it hovers between 7.8 and 9.5, with no apparent trend.
- This seems bad. The learning rate is tiny but loss still moves around significantly, unstable.
- I don't think it's worth continuing this method unless we have a sentence length-based data split.
Splitting data by length
- How to do this? (a rough sketch is at the end of this section)
- Measure the length of each sequence
- Sort the lengths (preserving index)
- Divide the lengths into three partitions
- Write out the partitions to separate folders
- What can `split_dataset.py` do at the moment?
- Can take an input folder - but we need multiple folders
- How about we specify an input folder template, plus a flag to activate length splitting, then the length splitting function combines the data across difficulties?
- Alternatively, we merge the modules for each difficulty first. Then we run the length splitting on the difficulty-merged modules, writing out to new folders. Finally, we merge the modules together into one file per length partition.
- I like this because we can work with the data already split and tokenized.
- Ok, first step is to merge difficulties. Can we use `merge_for_processing.py` for that?
- Not as-is. Need an interface like `python merge_difficulty.py train-<difficulty>-split comparison__sort`
- The script replaces `<difficulty>` with each difficulty
- But this can still work with `merge_files()`; we just need to change the interface above that
- We will need to shuffle before splitting into training and validation.
- Or will we? Putting the highest lengths into validation could provide a better indication of how it will do on extrapolation.
- Script done. Merges from the original question-answer alternating files. Had to iterate over pairs of lines. Tested on one module and it works.
- The validation set will end up being the longer sentences of each partition. So it will still be a mix, but longer on average.
- Now to split by question-answer, and tokenize by character
- We ought to test character-level first because this controls the experiment as much as possible.
- If this experiment is successful and we have time, we can combine IRM with number segmentation in the hope of maximum gain.
- Now to merge
- Done
- Now to shuffle
- Done
- Now to preprocess
- Done
- JOBID=855461
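The length split sketched under "How to do this?" above, as hypothetical code (file names and layout are my assumptions, not the real scripts):

```python
from pathlib import Path

def split_by_length(src_lines, tgt_lines, out_root, n_parts=3):
    # Sort indices by question length, cut into n_parts equal partitions,
    # and write each partition's questions/answers to its own folder.
    order = sorted(range(len(src_lines)), key=lambda i: len(src_lines[i]))
    part_size = -(-len(order) // n_parts)          # ceil division
    for p in range(n_parts):
        idx = order[p * part_size:(p + 1) * part_size]
        out_dir = Path(out_root) / f"length-{p}"
        out_dir.mkdir(parents=True, exist_ok=True)
        (out_dir / "src.txt").write_text("\n".join(src_lines[i] for i in idx) + "\n")
        (out_dir / "tgt.txt").write_text("\n".join(tgt_lines[i] for i in idx) + "\n")

# Usage sketch with toy data
split_by_length(["a b", "a b c d e", "a"], ["1", "2", "3"], "length-split")
```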
Wrapping up IRM length-split experiment
- Checking progress
- Validation accuracy peaked at 77.85% at 28000 steps
- Validation accuracy converged on 77.71%
- Not a great loss (0.14%). If it were above 0.5%, I'd be concerned.
- Best normalised training loss was 0.28 at step 27000. This jumped to 2.17 at step 27100, then converged to about 0.41.
- After the penalty is applied, loss seems to be more volatile, but still converges steadily overall.
- Checkpoint with lowest training loss was 37000 at 0.38. Validation accuracy decreased thereafter, but was also higher beforehand. This seems like a reasonable bet as the best checkpoint.
- Copying experiment archive
- Done
- Decided to wait for results on length-split before combining with NS. Because maybe length-split is even worse. In that case the best bet would be combining without length-split.
Note while I remember: run the baseline with token normalization. I think this is really important in expectation, in case it solves the problem with length dependence.
- Replaced `data.valid.0.pt` with the 270k-line version that I've been using for later experiments. This could give an error due to incompatibility with the rest of the data files, but I want to test whether it works.
- I imagine if the new validation data had some new vocabulary, this would cause an error. Or it may fail silently by being an unknown token. This is extremely unlikely given the size of the training data.
- JOBID=858514
- JOBID=857452
Checking token normalization experiment
- Darn, it failed because I didn't include `gpu_test.py`. Assumptions! Always do it the same way.
Setting up IRM-ns experiment
- What do we need to do with data?
- Take the NS data: `train-easy-split-ns` etc.
- Merge the NS data by difficulty
- Keep the merged validation data the same
- Shuffle training data
- Preprocess
- Given that the length split gave results no better, and probably worse, than the difficulty split, I will stick with the difficulty split.
- Ok, merge...done
- Shuffle...done
- Preprocess...
- Writing config
- Which penalty weight setting is best?
- Checking the results
- The choice is between 20200312 and 20200314
- Interpolation: 20200314 is better in 4/15 cases
- Extrapolation: 20200314 is better in 8/15 cases
- We care more about extrapolation, so I will go with 20200314
- Fetching config of 20200314
- Penalty weight: `[0.0001, 1.0, 10.0]`
- Now just need to compress data, rsync to cluster, extract, and rsync to project dir...Done
- JOBID=858535
Just recording this in case it matters
- There exists `~/projects/mlp-project/mlp-project/dataset/merged-valid-split` with `merged_src_valid.txt` and `merged_tgt_valid.txt`, saved March 21, 15:40. (For other people reading this, I changed the directory of this project to somewhere else.)
- I am pretty sure this was a mistaken output directory which I fixed, and the correct data in this directory was ultimately used.
- Besides, it was the validation set for the length split experiment, which ultimately does not affect performance (besides deciding when to stop training, but we didn't do that).