
Filippo project log


2020.02.08

Paths

/home/sXXXXXXX/
    mlp-project/
        config/
        scripts/
    dataset/
        train-easy-split/
        train-medium-split/
        train-hard-split/
        interpolate-split/
        extrapolate-split/
        -> make dataset.zip with the required files
    experiments/
        experiment_1/
            config file
            saved models (should we take only the latest one? how big they are?)
        experiment_2/
            ...

/disk/scratch/sXXXXXXX
    datasets/
        experiment_1/
            onmt preprocessed files
            model checkpoints
        experiment_2/
            ...

Things to consider

  • Where to store the config files
    • All together in the mlp-project repo?
    • Config should be kept version controlled
  • Storing things in separate experiment folders should make things easier
    • Dataset: should we select files each time, or agree on a fixed subset of tasks/problems?
  • Slurm scripts
    • They need some tweaking of paths/configs/names -> this might cause issues with version control
    • Adapt run_jobs_simple.py to run several jobs together and keep the checks (see the sketch below)
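
A minimal sketch of the multi-job idea, assuming one YAML config per experiment in mlp-project/config and that slurm_template.sh reads EXPERIMENT_NAME and CONFIG_FILE from its environment; the loop and the folder check are hypothetical, not the current run_jobs_simple.py:

#!/usr/bin/env bash
# Hypothetical sketch: submit one Slurm job per config file.
for config in /home/${STUDENT_ID}/mlp-project/config/*.yml; do
    name=$(basename "${config}" .yml)
    # The "checks": skip experiments whose results folder already exists.
    if [ -d "/home/${STUDENT_ID}/experiments/${name}" ]; then
        echo "Skipping ${name}: results already present"
        continue
    fi
    sbatch --job-name="${name}" \
           --export=ALL,EXPERIMENT_NAME="${name}",CONFIG_FILE="${config}" \
           slurm_template.sh
done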

2020.02.10

  • Script now runs from end to end

    • Zip and move to node
    • Merge files into a single file
    • Preprocessing
    • Train
    • Zip and move results to cluster
  • Main limitation: it only works for a single task

  • Reading the training and validation data is slow. Why are there so many validation examples (200k vs. 96k training examples)? See the log below and the subsampling sketch after it.

[2020-02-10 15:28:23,608 INFO] Start training loop and validate every 100 steps...
[2020-02-10 15:28:23,609 INFO] Loading dataset from /disk/scratch/s1556895/datasets/test_folder/data/data.train.0.pt
[2020-02-10 15:28:24,624 INFO] number of examples: 96614
[2020-02-10 15:29:00,880 INFO] Step 10/ 1000; acc:  10.95; ppl: 17.72; xent: 2.87; lr: 0.00060; 12222/2658 tok/s;     37 sec
[2020-02-10 15:29:33,605 INFO] Step 20/ 1000; acc:  22.51; ppl: 10.80; xent: 2.38; lr: 0.00060; 13189/3153 tok/s;     70 sec
[2020-02-10 15:30:04,563 INFO] Step 30/ 1000; acc:  23.73; ppl:  8.89; xent: 2.19; lr: 0.00060; 13882/2802 tok/s;    101 sec
[2020-02-10 15:30:35,687 INFO] Step 40/ 1000; acc:  27.06; ppl:  7.73; xent: 2.04; lr: 0.00060; 14083/3069 tok/s;    132 sec
[2020-02-10 15:31:07,177 INFO] Step 50/ 1000; acc:  27.84; ppl:  6.96; xent: 1.94; lr: 0.00060; 14242/2840 tok/s;    164 sec
[2020-02-10 15:31:41,912 INFO] Step 60/ 1000; acc:  27.74; ppl:  6.82; xent: 1.92; lr: 0.00060; 13515/2779 tok/s;    198 sec
[2020-02-10 15:32:13,808 INFO] Step 70/ 1000; acc:  30.64; ppl:  6.43; xent: 1.86; lr: 0.00060; 14197/2851 tok/s;    230 sec
[2020-02-10 15:32:47,439 INFO] Step 80/ 1000; acc:  28.44; ppl:  6.52; xent: 1.87; lr: 0.00060; 13511/2909 tok/s;    264 sec
[2020-02-10 15:33:20,184 INFO] Step 90/ 1000; acc:  33.61; ppl:  5.87; xent: 1.77; lr: 0.00060; 13568/3169 tok/s;    297 sec
[2020-02-10 15:33:34,237 INFO] Loading dataset from /disk/scratch/s1556895/datasets/test_folder/data/data.train.1.pt
[2020-02-10 15:33:35,003 INFO] number of examples: 50620
[2020-02-10 15:33:52,965 INFO] Step 100/ 1000; acc:  27.36; ppl:  6.44; xent: 1.86; lr: 0.00060; 13141/2883 tok/s;    329 sec
[2020-02-10 15:33:52,966 INFO] Loading dataset from /disk/scratch/s1556895/datasets/test_folder/data/data.valid.0.pt
[2020-02-10 15:33:55,885 INFO] number of examples: 200001
[2020-02-10 15:42:25,565 INFO] Validation perplexity: 9.05345
[2020-02-10 15:42:25,566 INFO] Validation accuracy: 28.5543
[2020-02-10 15:43:02,170 INFO] Step 110/ 1000; acc:  27.88; ppl:  6.29; xent: 1.84; lr: 0.00060; 804/223 tok/s;    879 sec
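
If validation does not need all 200k examples, one fix would be to subsample the validation files before preprocessing. A minimal sketch, assuming the merged_* files described below and a hypothetical budget of 10k examples; head keeps the src/tgt pairs aligned because it takes the same leading lines from both files:

# Hypothetical: keep only the first 10k validation pairs before preprocessing.
head -n 10000 merged_src_valid.txt > small_src_valid.txt
head -n 10000 merged_tgt_valid.txt > small_tgt_valid.txt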

slurm_template.sh

Set EXPERIMENT_NAME, TASK, PROJECT_FILE and CONFIG_FILE:

  • EXPERIMENT_NAME is the name of the experiment; it is used for the folder names on both the central node and the compute node, and for the final zip file.
  • TASK is the name of the task on which to train the model, like calculus__differentiate. In previous usage only a single task was trained at a time; this might have to change if several tasks are used. Right now, all difficulty levels are merged together.
  • PROJECT_FILE is just the name of the archive; not particularly important.
  • CONFIG_FILE is the YAML config file to use; it must be stored in mlp-project/config.
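
For reference, the variable block at the top of slurm_template.sh would look roughly like this (the values are hypothetical examples, to be edited per experiment):

EXPERIMENT_NAME=exp_calculus_baseline
TASK=calculus__differentiate
PROJECT_FILE=${EXPERIMENT_NAME}.zip
CONFIG_FILE=config_baseline.yml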

The script assumes that the split dataset is already present in /home/${STUDENT_ID}/dataset.

The main steps in the script are (a condensed shell sketch follows the list):

  1. Given a TASK, the script selects the correct files, zips them together and rsyncs them to the /disk/scratch experiment folder.
  2. The files (for example, file_easy.txt, file_medium.txt, file_hard.txt) are merged using a Python script so as to produce 4 files:
    1. merged_src_train.txt
    2. merged_tgt_train.txt
    3. merged_src_valid.txt
    4. merged_tgt_valid.txt
  3. Run preprocessing on these 4 files.
  4. Get the selected config, add some experiment information, and run the training using the config file.
  5. Zip together the config, the logs and the model steps. This zip is transferred to the EXPERIMENT_NAME folder in /home/${STUDENT_ID} on the central node. Data can then be transferred to local machines using the transfer_data_mlp_to_local.sh script.
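
A condensed shell sketch of the five steps, assuming the OpenNMT-py 1.x command line (onmt_preprocess/onmt_train) and a hypothetical merge_dataset.py name for the Python helper of step 2; paths follow the layout described above:

SCRATCH=/disk/scratch/${STUDENT_ID}/datasets/${EXPERIMENT_NAME}

# 1. Select the task files, zip them and rsync them to the node.
cd /home/${STUDENT_ID}/dataset
zip ${PROJECT_FILE} */${TASK}.txt
rsync -ua ${PROJECT_FILE} ${SCRATCH}/txt_folder/
cd ${SCRATCH}/txt_folder && unzip ${PROJECT_FILE}

# 2. Merge the difficulty splits into the four merged_* files
#    (merge_dataset.py is a hypothetical name for the Python helper).
python merge_dataset.py --input-dir ${SCRATCH}/txt_folder --task ${TASK}

# 3. Preprocess the merged files with OpenNMT-py.
onmt_preprocess \
    -train_src ${SCRATCH}/txt_folder/merged_src_train.txt \
    -train_tgt ${SCRATCH}/txt_folder/merged_tgt_train.txt \
    -valid_src ${SCRATCH}/txt_folder/merged_src_valid.txt \
    -valid_tgt ${SCRATCH}/txt_folder/merged_tgt_valid.txt \
    -save_data ${SCRATCH}/data/data

# 4. Train using the selected config.
onmt_train -config /home/${STUDENT_ID}/mlp-project/config/${CONFIG_FILE}

# 5. Zip config, logs and model steps, then move them to the central node.
cd ${SCRATCH}
zip -r ${EXPERIMENT_NAME}.zip config.yml logs *.pt
rsync -ua ${EXPERIMENT_NAME}.zip /home/${STUDENT_ID}/experiments/${EXPERIMENT_NAME}/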

The folder structure for a particular experiment in a node is:

txt_folder/
    all the different text files used for a specific task
    the merged txt files of step 2
data/
    the results of the preprocessing step are stored here
config.yml
logs
model steps

Right now the script only saves the config, the logs and the model steps; this can easily be changed as needed, perhaps by creating a dedicated models folder.

TODO: handle more tasks at the same time; this depends on the experiments to run.
