This code is a re-implementation of the video classification experiments in the paper Non-local Neural Networks. The code is developed based on the Caffe2 framework.
The code and the models in this repo are released under the CC-BY-NC 4.0 LICENSE.
If you use our code in your research or wish to refer to the baseline results, please use the following BibTeX entry.
@article{NonLocal2018,
author = {Xiaolong Wang and Ross Girshick and Abhinav Gupta and Kaiming He},
title = {Non-local Neural Networks},
journal = {CVPR},
year = {2018}
}
Please find installation instructions for Caffe2 in INSTALL.md. We also suggest checking the Detectron installation instructions and its issues if you run into problems.
First go into the data folder:
cd data
mkdir pretrained_model
mkdir checkpoints
The pre-trained models can be downloaded from: pretrained_model.tar.gz. Extract the models to the current folder:
wget https://s3.amazonaws.com/video-nonlocal/pretrained_model.tar.gz
tar xzf pretrained_model.tar.gz
Please read DATASET.md for downloading and preparing the Kinetics dataset.
Note: In this repo, we release the models, which are trained with the same data as in our paper.
All the training scripts with ResNet-50 backbone are here:
cd scripts
We report the benchmarks with the ResNet-50 backbone below. All the numbers are obtained via fully-convolutional testing. All the models and training logs are available for download (some logs might not contain the fully-convolutional testing numbers):
script | input frames | freeze bn? | 3D conv? | non-local? | top1 | top1 in paper | top5 | model | logs |
---|---|---|---|---|---|---|---|---|---|
run_c2d_baseline_400k_32f.sh | 32 | - | - | - | 72.0 | 71.8 | 90.0 | link | link |
run_c2d_nlnet_400k_32f.sh | 32 | - | - | Yes | 73.9 | 73.8 | 91.0 | link | link |
run_i3d_baseline_400k_32f.sh | 32 | - | Yes | - | 73.6 | 73.3 | 90.8 | link | link |
run_i3d_nlnet_400k_32f.sh | 32 | - | Yes | Yes | 74.9 | 74.9 | 91.6 | link | link |
run_i3d_baseline_affine_400k_128f.sh | 128 | Yes | Yes | - | 75.2 | 74.9 | 92.0 | link | link |
run_i3d_nlnet_affine_400k_128f.sh | 128 | Yes | Yes | Yes | 76.5 | 76.5 | 92.7 | link | link |
Besides releasing the models following the exact parameter settings in the paper, we ablate a few different training settings which can significantly improve training/testing speed with almost the same performance.
- Sparser sampling of inputs. We sample N frames with a stride of M frames (thus covering N * M frames of the raw video). In the paper we used (N, M) = (32, 2) for short clips and (N, M) = (128, 1) for long clips. The following experiments use (N, M) = (8, 8) for short clips and (N, M) = (32, 4) for long clips, which cover the same raw spans of 64 and 128 frames. The temporal strides are adjusted accordingly so that the feature map sizes in res2 to res5 are unchanged. This modification reduces data I/O, which can significantly improve speed.
script | input frames | freeze bn? | 3D conv? | non-local? | top1 | top5 | model | logs |
---|---|---|---|---|---|---|---|---|
run_c2d_baseline_400k.sh | 8 | - | - | - | 71.9 | 90.0 | link | link |
run_c2d_nlnet_400k.sh | 8 | - | - | Yes | 74.4 | 91.4 | link | link |
run_i3d_baseline_400k.sh | 8 | - | Yes | - | 73.4 | 90.9 | link | link |
run_i3d_nlnet_400k.sh | 8 | - | Yes | Yes | 74.7 | 91.6 | link | link |
run_i3d_baseline_affine_400k.sh | 32 | Yes | Yes | - | 75.5 | 92.0 | link | link |
run_i3d_nlnet_affine_400k.sh | 32 | Yes | Yes | Yes | 76.5 | 92.6 | link | link |
- Fewer training iterations. With sparser sampling of inputs, we further reduce the training time by reducing the number of training iterations. Instead of training for 400K iterations as in the paper, we train our model for 300K iterations. This reduces training epochs by 25% without losing much performance.
script | input frames | freeze bn? | 3D conv? | non-local? | top1 | top5 | model | logs |
---|---|---|---|---|---|---|---|---|
run_i3d_baseline_300k.sh | 8 | - | Yes | - | 73.2 | 90.8 | link | link |
- The following two models were trained by Xiaolong Wang on 4-GPU (GTX 1080) machines outside of Facebook after the internship. The training data was downloaded on 12/20/2017 (see DATASET.md) and is missing some videos due to invalid URLs. The training schedule is shorter (4-GPU 600k vs. 8-GPU 400k above). These changes lead to a slight accuracy drop.
- We also provide training scripts/models with half the iterations (300K with 4 GPUs) and less regularization. This baseline is fast and intended as a sanity check: it takes less than 3 days to train on a machine with 4 GPUs (see "run_i3d_baseline_300k_4gpu.sh").
script | input frames | GPUs | freeze bn? | 3D conv? | non-local? | top1 | top5 | model | logs |
---|---|---|---|---|---|---|---|---|---|
run_i3d_baseline_600k_4gpu.sh | 8 | 4 | - | Yes | - | 73.0 | 90.4 | link | link |
run_i3d_baseline_300k_4gpu.sh | 8 | 4 | - | Yes | - | 72.0 | 90.1 | link | link |
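The sparser input sampling described above (N frames taken with a stride of M frames) can be illustrated with a small sketch. This is not the repo's data loader: the function name `sample_frame_indices`, the random start, and the choice to repeat the last frame for short videos are all assumptions made for illustration.

```python
import random


def sample_frame_indices(num_video_frames, num_frames, stride, start=None):
    """Pick `num_frames` indices spaced `stride` apart.

    The clip spans roughly num_frames * stride raw frames, matching the
    N * M coverage described in the text. Hypothetical helper, not repo code.
    """
    span = num_frames * stride
    max_start = max(num_video_frames - span, 0)
    if start is None:
        # Random temporal offset (an assumption about training-time sampling).
        start = random.randint(0, max_start)
    # Clamp to the last frame so short videos still yield num_frames indices
    # (an assumed padding choice).
    last = num_video_frames - 1
    return [min(start + i * stride, last) for i in range(num_frames)]


# (N, M) = (32, 2) and (N, M) = (8, 8) both span 64 raw frames,
# but the second reads only a quarter as many frames.
dense = sample_frame_indices(300, 32, 2, start=0)
sparse = sample_frame_indices(300, 8, 8, start=0)
print(len(dense), len(sparse))
```

Under these assumptions, the sparse setting decodes 8 frames instead of 32 for the same temporal extent, which is where the data I/O saving comes from.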
We now explain the scripts, taking as examples the ones trained with 3D convolutions, 400k iterations, 8 GPUs, and sparser inputs (listed in Modifications for improving speed).
- The following script trains the baseline i3d model from an ImageNet pre-trained network:
run_i3d_baseline_400k.sh
- The following script trains the i3d model with 5 Non-local layers from an ImageNet pre-trained network:
run_i3d_nlnet_400k.sh
- To train the i3d Non-local Networks with longer clips (32-frame input), we first need to obtain the model trained from "run_i3d_baseline_400k.sh" as a pre-trained model. Then we convert the Batch Normalization layers into Affine layers by running:
cd ../process_data/convert_models
python modify_caffe2_ftvideo.py ../../data/checkpoints/run_i3d_baseline_400k/checkpoints/c2_model_iter400000.pkl ../../data/pretrained_model/run_i3d_baseline_400k/affine_model_400k.pkl
Note that we have provided one example model (run_i3d_baseline_400k/affine_model_400k.pkl) in pretrained_model.tar.gz. Given this converted model, we run the script for training the i3d Non-local Networks with longer clips:
run_i3d_nlnet_affine_400k.sh
- The models with ResNet-101 backbone can be trained by setting:
TRAIN.PARAMS_FILE ../data/pretrained_model/r101_pretrain_c2_model_iter450450_clean.pkl
MODEL.DEPTH 101
MODEL.VIDEO_ARC_CHOICE 4 # 3 for c2d, and 4 for i3d
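For readers unfamiliar with the Batch Normalization to Affine conversion mentioned above for the longer-clip models, the sketch below shows the standard way a frozen BN layer is folded into a per-channel scale and bias. It is an illustration only: the internals of `modify_caffe2_ftvideo.py` are not shown in this README, so whether it uses exactly this formula, and the `eps` value used here, are assumptions.

```python
import math


def bn_to_affine(gamma, beta, running_mean, running_var, eps=1e-5):
    """Fold frozen BN statistics into per-channel (scale, bias).

    y = gamma * (x - mean) / sqrt(var + eps) + beta
      = scale * x + bias
    with scale = gamma / sqrt(var + eps) and bias = beta - mean * scale.
    Hypothetical helper; eps is an assumed value.
    """
    scales, biases = [], []
    for g, b, m, v in zip(gamma, beta, running_mean, running_var):
        s = g / math.sqrt(v + eps)
        scales.append(s)
        biases.append(b - m * s)
    return scales, biases


def bn_inference(x, gamma, beta, mean, var, eps=1e-5):
    """Reference BN output in inference mode, one value per channel."""
    return [g * (xi - m) / math.sqrt(v + eps) + b
            for xi, g, b, m, v in zip(x, gamma, beta, mean, var)]


# Two toy channels: the affine form reproduces the BN output.
gamma, beta = [1.5, 0.5], [0.1, -0.2]
mean, var = [0.3, -1.0], [4.0, 0.25]
x = [2.0, 1.0]
scale, bias = bn_to_affine(gamma, beta, mean, var)
affine_out = [s * xi + b for s, xi, b in zip(scale, x, bias)]
print(affine_out, bn_inference(x, gamma, beta, mean, var))
```

Once folded, the layer no longer depends on batch statistics, which is consistent with the "freeze bn?" column in the tables above.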
The models are tested immediately after training. For each video, we sample 10 clips along the temporal dimension as in the paper. For each video clip, we resize the shorter side to 256 pixels and use 3 crops to cover the entire spatial size. We run fully-convolutional testing on each of the 256x256 crops. This is a slower approximation of the fully-convolutional testing done in the paper (on the variable full size, e.g., 256x320), which requires a specific implementation not provided in this repo.
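The testing protocol above (10 temporal clips, shorter side resized to 256, 3 crops of 256x256) can be sketched as follows. The helper names, the evenly spaced clip starts, and the start/middle/end crop placement along the longer side are assumptions for illustration; the repo's test code may place clips and crops differently.

```python
def clip_starts(num_video_frames, clip_span, num_clips=10):
    """Evenly spaced start frames for `num_clips` clips (assumed placement)."""
    max_start = max(num_video_frames - clip_span, 0)
    if num_clips == 1:
        return [max_start // 2]
    return [(i * max_start) // (num_clips - 1) for i in range(num_clips)]


def resized_shape(height, width, short_side=256):
    """Scale so the shorter side equals `short_side`, keeping aspect ratio."""
    if height <= width:
        return short_side, round(width * short_side / height)
    return round(height * short_side / width), short_side


def three_crop_offsets(height, width, crop=256):
    """Top-left (y, x) offsets of 3 crops spread along the longer side."""
    if width >= height:
        extra = width - crop
        return [(0, 0), (0, extra // 2), (0, extra)]
    extra = height - crop
    return [(0, 0), (extra // 2, 0), (extra, 0)]


# A 300-frame video with 64-frame clips, resized to 256x320:
starts = clip_starts(300, 64)
crops = three_crop_offsets(256, 320)
views = [(s, c) for s in starts for c in crops]
print(len(views))  # 10 clips x 3 crops per video
```

Each video is thus scored from 30 views in this sketch, whose predictions would then be combined into a single video-level prediction.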
Taking the model trained with "run_i3d_nlnet_400k.sh" as an example, we can run testing by specifying:
TEST.TEST_FULLY_CONV True
as in the script:
run_test_multicrop.sh
The authors would like to thank Haoqi Fan for training the models and reproducing the results at FAIR with this code.