Filippo project log (Feb 10, 2020)
Proposed folder structure:

```
/home/sXXXXXXX/
    mlp-project/
        config/
        scripts/
    dataset/
        train-easy-split/
        train-medium-split/
        train-hard-split/
        interpolate-split/
        extrapolate-split/
        -> make dataset.zip with the required files
    experiments/
        experiment_1/
            config file
            saved models (should we take only the latest one? how big are they?)
        experiment_2/
        ...

/disk/scratch/sXXXXXXX/
    datasets/
        experiment_1/
            onmt preprocessed files
            model checkpoints
        experiment_2/
        ...
```
- Where to store the config files?
    - All together in the mlp-project repo?
    - Configs should be kept under version control.
    - Storing things in a separate experiments folder should make things easier.
- Dataset: select files each time, or agree on a fixed subset of tasks/problems?
- Slurm scripts
    - They need some per-user tweaking of paths/configs/names, which might cause issues with version control (see the sketch below).
    - Adapt run_jobs_simple.py to run several jobs together and keep the existing checks.
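One way to avoid the version-control churn would be to keep user-specific values out of the committed scripts and derive them from the environment or from command-line arguments instead. A minimal sketch, assuming a hypothetical `run_experiment.sh` wrapper (the SBATCH resource values are placeholders, not the actual cluster settings):

```bash
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --mem=12000
#SBATCH --time=0-08:00:00

# Derive user-specific paths instead of hard-coding them,
# so the same script can be committed once and shared.
STUDENT_ID="$(whoami)"
HOME_DIR="/home/${STUDENT_ID}"
SCRATCH_DIR="/disk/scratch/${STUDENT_ID}"

# Experiment-specific values passed on the command line:
#   sbatch run_experiment.sh <experiment_name> <config_file>
EXPERIMENT_NAME="$1"
CONFIG_FILE="${HOME_DIR}/mlp-project/config/$2"
```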
- The script now runs from end to end:
    - Zip and move the data to the node
    - Merge the files into a single file
    - Preprocessing
    - Training
    - Zip and move the results back to the cluster filesystem
- Main limitation: it only works for a single task.
- Reading the training and validation data is slow. Why are there so many validation examples (200k vs. 96k training examples)?
```
[2020-02-10 15:28:23,608 INFO] Start training loop and validate every 100 steps...
[2020-02-10 15:28:23,609 INFO] Loading dataset from /disk/scratch/s1556895/datasets/test_folder/data/data.train.0.pt
[2020-02-10 15:28:24,624 INFO] number of examples: 96614
[2020-02-10 15:29:00,880 INFO] Step 10/ 1000; acc: 10.95; ppl: 17.72; xent: 2.87; lr: 0.00060; 12222/2658 tok/s; 37 sec
[2020-02-10 15:29:33,605 INFO] Step 20/ 1000; acc: 22.51; ppl: 10.80; xent: 2.38; lr: 0.00060; 13189/3153 tok/s; 70 sec
[2020-02-10 15:30:04,563 INFO] Step 30/ 1000; acc: 23.73; ppl: 8.89; xent: 2.19; lr: 0.00060; 13882/2802 tok/s; 101 sec
[2020-02-10 15:30:35,687 INFO] Step 40/ 1000; acc: 27.06; ppl: 7.73; xent: 2.04; lr: 0.00060; 14083/3069 tok/s; 132 sec
[2020-02-10 15:31:07,177 INFO] Step 50/ 1000; acc: 27.84; ppl: 6.96; xent: 1.94; lr: 0.00060; 14242/2840 tok/s; 164 sec
[2020-02-10 15:31:41,912 INFO] Step 60/ 1000; acc: 27.74; ppl: 6.82; xent: 1.92; lr: 0.00060; 13515/2779 tok/s; 198 sec
[2020-02-10 15:32:13,808 INFO] Step 70/ 1000; acc: 30.64; ppl: 6.43; xent: 1.86; lr: 0.00060; 14197/2851 tok/s; 230 sec
[2020-02-10 15:32:47,439 INFO] Step 80/ 1000; acc: 28.44; ppl: 6.52; xent: 1.87; lr: 0.00060; 13511/2909 tok/s; 264 sec
[2020-02-10 15:33:20,184 INFO] Step 90/ 1000; acc: 33.61; ppl: 5.87; xent: 1.77; lr: 0.00060; 13568/3169 tok/s; 297 sec
[2020-02-10 15:33:34,237 INFO] Loading dataset from /disk/scratch/s1556895/datasets/test_folder/data/data.train.1.pt
[2020-02-10 15:33:35,003 INFO] number of examples: 50620
[2020-02-10 15:33:52,965 INFO] Step 100/ 1000; acc: 27.36; ppl: 6.44; xent: 1.86; lr: 0.00060; 13141/2883 tok/s; 329 sec
[2020-02-10 15:33:52,966 INFO] Loading dataset from /disk/scratch/s1556895/datasets/test_folder/data/data.valid.0.pt
[2020-02-10 15:33:55,885 INFO] number of examples: 200001
[2020-02-10 15:42:25,565 INFO] Validation perplexity: 9.05345
[2020-02-10 15:42:25,566 INFO] Validation accuracy: 28.5543
[2020-02-10 15:43:02,170 INFO] Step 110/ 1000; acc: 27.88; ppl: 6.29; xent: 1.84; lr: 0.00060; 804/223 tok/s; 879 sec
```
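The log shows validation every 100 steps taking about eight and a half minutes per pass over the 200k validation examples. If that ends up dominating training time, one knob worth checking is OpenNMT-py's `-valid_steps` option, which controls how often validation runs (a sketch, assuming the standard OpenNMT-py CLI; the value is illustrative):

```bash
# Validate every 1000 steps instead of every 100; all other
# options still come from the shared yml config.
onmt_train -config "${CONFIG_FILE}" -valid_steps 1000
```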
Set `EXPERIMENT_NAME`, `TASK`, `PROJECT_FILE` and `CONFIG_FILE`:

- `EXPERIMENT_NAME` is the name of the experiment; it is used for the folder names on the central node and the compute node, and for the final zip file.
- `TASK` is the name of the task on which to train the model, like `calculus__differentiate`. In previous usage only one task was used; this might have to change if several tasks are needed. Right now, all difficulties are used and merged together.
- `PROJECT_FILE` is just the name of the archive, not particularly important.
- `CONFIG_FILE` is the yml config file to use; it must be stored in `mlp-project/config`.
The script assumes that the split dataset is already present in `/home/${STUDENT_ID}/dataset`.
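For reference, a sketch of how these variables could look at the top of the script (the values are examples only, not the actual experiment settings):

```bash
# Used for the folder names on the central node and the compute
# node, and for the final results archive.
EXPERIMENT_NAME="experiment_1"

# Single task to train on; all difficulties are merged together.
TASK="calculus__differentiate"

# Name of the intermediate data archive; not particularly important.
PROJECT_FILE="dataset.zip"

# Must be stored in mlp-project/config.
CONFIG_FILE="baseline.yml"
```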
The main steps in the script are:

1. Given a `TASK`, select the matching files, zip them together and rsync them to the `/disk/scratch` experiment folder.
2. Merge the files (for example, `file_easy.txt`, `file_medium.txt`, `file_hard.txt`) using a Python script, so as to end up with 4 files:
    - `merged_src_train.txt`
    - `merged_tgt_train.txt`
    - `merged_src_valid.txt`
    - `merged_tgt_valid.txt`
3. Run preprocessing on these 4 files.
4. Get the selected config, add some experiment information, and run the training using the config file.
5. Zip together the config, the logs and the model steps. This zip is transferred to the `EXPERIMENT_NAME` folder on the `/home/${STUDENT_ID}` central node. Data can then be transferred to local machines using the `transfer_data_mlp_to_local.sh` script.
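Putting the steps together, a condensed sketch of the script body. Directory names follow the structure above; the file globs and log redirection are illustrative, and the merge in step 2 is really done by a Python script in the repo, with `cat` shown here only to convey the idea:

```bash
# Experiment folders on the central node and on the compute node.
HOME_DIR="/home/${STUDENT_ID}"
EXP_DIR="/disk/scratch/${STUDENT_ID}/datasets/${EXPERIMENT_NAME}"
mkdir -p "${EXP_DIR}/txt_folder" "${EXP_DIR}/data"

# 1. Select the files for the task, zip them, rsync them to scratch.
cd "${HOME_DIR}/dataset"
zip "${PROJECT_FILE}" */*"${TASK}"*.txt   # file layout inside the splits is illustrative
rsync -avu "${PROJECT_FILE}" "${EXP_DIR}/txt_folder/"
cd "${EXP_DIR}/txt_folder" && unzip "${PROJECT_FILE}"

# 2. Merge the per-difficulty files into the four merged_* files.
cat *src_train*.txt > merged_src_train.txt
cat *tgt_train*.txt > merged_tgt_train.txt
cat *src_valid*.txt > merged_src_valid.txt
cat *tgt_valid*.txt > merged_tgt_valid.txt

# 3. Preprocess with OpenNMT; the "data" prefix matches the
#    data.train.0.pt / data.valid.0.pt files seen in the log above.
onmt_preprocess \
    -train_src merged_src_train.txt -train_tgt merged_tgt_train.txt \
    -valid_src merged_src_valid.txt -valid_tgt merged_tgt_valid.txt \
    -save_data "${EXP_DIR}/data/data"

# 4. Copy the selected config, append experiment information, then train.
cp "${HOME_DIR}/mlp-project/config/${CONFIG_FILE}" "${EXP_DIR}/config.yml"
onmt_train -config "${EXP_DIR}/config.yml" \
    -data "${EXP_DIR}/data/data" \
    -save_model "${EXP_DIR}/model" > "${EXP_DIR}/logs" 2>&1

# 5. Zip the config, logs and model steps, send them back home.
cd "${EXP_DIR}"
zip -r "${EXPERIMENT_NAME}.zip" config.yml logs model*.pt
rsync -avu "${EXPERIMENT_NAME}.zip" "${HOME_DIR}/experiments/${EXPERIMENT_NAME}/"
```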
The folder structure for a particular experiment on a node is:

```
txt_folder/
    all the different text files used for a specific task
    the merged txt files of step 2
data/
    results of the preprocessing step
config.yml
logs
model steps
```

Right now the script only saves the config, the logs and the model steps; this can easily be changed as needed, maybe by creating a dedicated models folder (?).
TODO: handle more tasks at the same time; this depends on which experiments we run.