- The experiments are run with PyTorch 1.0, CUDA 10.0, and CUDNN 7.5.
- Training times are measured on our servers with TITAN V GPUs (12 GB Memory).
- Testing times are measured on our local machine with TITAN Xp GPU.
- The models can be downloaded directly from Google Drive; a minimal download sketch is shown below.
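A minimal sketch, assuming the third-party `gdown` package (`pip install gdown`) is used to fetch a checkpoint from Google Drive. The file id and output path below are placeholders, not real links; copy the actual id from the "model" links in the tables.

```python
import os
import gdown  # pip install gdown

os.makedirs("models", exist_ok=True)

# Placeholder id: take the real one from the model's Google Drive link in the table.
file_id = "YOUR_GOOGLE_DRIVE_FILE_ID"

gdown.download(
    f"https://drive.google.com/uc?id={file_id}",
    "models/mot17_half.pth",  # assumed destination; match the path your config expects
    quiet=False,
)
```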
Model | GPUs | Train time | Test time | Validation MOTA | Test MOTA | Download |
---|---|---|---|---|---|---|
mot17_fulltrain | 4 | 4h | 45ms | - | 67.3 (Private Detection) | model |
mot17_fulltrain_sc | 4 | 4h | 45ms | - | 61.4 (Public Detection) | model |
mot17_half | 4 | 2h | 45ms | 66.1 | - | model |
mot17_half_sc | 4 | 2h | 45ms | 60.7 | - | model |
crowdhuman | 4 | 21h | 45ms | 52.2 | - | model |
- `*_half` corresponds to the half-half video train/val split mentioned in the paper. `*_fulltrain` corresponds to training on the full training set and evaluating on the official test server. These models are provided for arXiv and demo purposes. We highly recommend NOT submitting our predictions to the test server, to avoid abusing the test set. Usually the validation results are all you need during development.
- `mot17_half`/`mot17_fulltrain` are finetuned on the `crowdhuman` model, and `mot17_half_sc`/`mot17_fulltrain_sc` are trained from ImageNet initialization.
- The validation results both use private detection.
- All the MOT models are trained for 70 epochs, with learning rate dropped at the 60th epoch.
- The crowdhuman model is trained on the CrowdHuman dataset with the "training on static image data" technique from our paper, and evaluated directly on the MOT17 validation set. The crowdhuman pretraining uses 140 epochs, with the learning rate dropped at the 90th and 140th epochs.
- The training schedules have not been studied thoroughly.
- We observe about 1 MOTA of random noise for the MOT models.
- If the MOTA of your self-trained model is lower than expected, playing around with `--track_thresh` and `--pre_thresh` sometimes gives a better number (see Appendix H of the paper and the sketch after these notes).
- The MOT models, even when trained on the full training set, still do not perform well on in-the-wild videos. The crowdhuman model is a better choice for real-world applications. However, be aware that both datasets are under non-commercial licenses.
Model | GPUs | Train time | Test time | Validation MOTA | Test MOTA | Download |
---|---|---|---|---|---|---|
kitti_fulltrain (flip) | 2 | 9h | 66ms | - | 89.44 | model |
kitti_half | 2 | 4.5h | 40ms | 88.7 | - | model |
kitti_half_sc | 2 | 4.5h | 40ms | 84.5 | - | model |
- We use flip-test for the model we submitted to the test server (kitti_fulltrain_flip).
- `kitti_fulltrain` is finetuned on the nuScenes_3Ddetection_e140 model (see below).
- All the models are trained for 70 epochs.
- We observe up to 1.5 MOTA of jitter due to randomness. The reported results are from the best model.
Model | GPUs | Train time | Test time | Val AMOTA@0.2 | Val AMOTA | Val mAP | Download |
---|---|---|---|---|---|---|---|
nuScenes_3Ddetection_e140 | 8 | 72h | 28ms | - | - | 30.27 | model |
nuScenes_3Dtracking | 8 | 40h | 28ms | 28.3 | 6.8 | - | model |
- Both models are trained on our DGX servers with 8x 32G V100 GPUs.
- The 3D detection model is trained on all 6 camera images of the keyframes for 140 epochs. It does not include attributes and velocity prediction and is different from the model we used in the 3D detection leaderboard. See the CenterNet repo for details about the full 3D detection model we used for test set evaluation.
- The 3D tracking model is finetuned on the 3D detection model for 70 epochs.
- Training on 4 GPUs, or on 8x 12G GPUs with a smaller batch size, is fine if the linear learning rate scaling rule is applied (see the sketch below).
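A minimal sketch of the linear learning rate scaling rule mentioned above; the reference batch size and base learning rate are assumptions for illustration, not the values used to train the released models.

```python
# Assumed reference configuration (replace with the values from your experiment).
ref_batch_size = 64   # assumed batch size on 8x 32G V100 GPUs
base_lr = 2.5e-4      # assumed learning rate for that batch size

def scaled_lr(batch_size: int) -> float:
    """Linear rule: scale the learning rate proportionally to the batch size."""
    return base_lr * batch_size / ref_batch_size

# e.g. halving the batch size to fit 4 GPUs (or 8x 12G GPUs) halves the learning rate.
print(scaled_lr(32))  # -> 0.000125
```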
Model | GPUs | Train time | Test time | Download |
---|---|---|---|---|
coco_tracking | 8 | 39h | 30ms | model |
coco_pose_tracking | 8 | 19h | 33ms | model |
- Both models are trained with the "training on static image data" technique from our paper (a rough sketch of the idea is shown after these notes).
- The models are not evaluated on any benchmark since there are no suitable benchmarks in this setting. We provide them for demo purposes only.
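A rough sketch, for intuition only, of the "training on static image data" idea referenced above: a fake "previous" frame is simulated from a single image by a small random scale and translation, so the tracker can be trained without video annotations. The function name and jitter ranges below are made-up placeholders; the exact augmentation is defined in the paper and in this repo's dataloader.

```python
import numpy as np
import cv2

def simulate_previous_frame(image, boxes, max_shift=0.05, max_scale=0.05):
    """Jitter a static image (and its x1y1x2y2 boxes) to fake a previous frame."""
    h, w = image.shape[:2]
    scale = 1.0 + np.random.uniform(-max_scale, max_scale)
    dx = np.random.uniform(-max_shift, max_shift) * w
    dy = np.random.uniform(-max_shift, max_shift) * h
    # Affine matrix: uniform scale about the image center plus a translation.
    M = np.array([[scale, 0, (1 - scale) * w / 2 + dx],
                  [0, scale, (1 - scale) * h / 2 + dy]], dtype=np.float32)
    prev_image = cv2.warpAffine(image, M, (w, h))
    # Apply the same transform to the box coordinates.
    prev_boxes = boxes.astype(np.float32).copy()
    prev_boxes[:, [0, 2]] = prev_boxes[:, [0, 2]] * scale + M[0, 2]
    prev_boxes[:, [1, 3]] = prev_boxes[:, [1, 3]] * scale + M[1, 2]
    return prev_image, prev_boxes
```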