Publication authored by Shan Wu, Amnir Hadachi, Chaoru Lu, and Damien Vivet. You can find our code on GitHub.
Multi-object tracking (MOT) is one of the most essential and challenging tasks in computer vision (CV). It is challenging due to the large number of objects, frequent occlusions, and the lack of prior knowledge about how many objects are present. Transformers, on the other hand, have proven effective in CV tasks, and several works have leveraged Transformer-based detectors for MOT, such as TransTrack[1] and TrackFormer[2]. However, current solutions inflate the overall model size with additional tracking modules, as illustrated in Fig. 1.
In our recently published paper, we propose a new model for multi-object tracking built purely from Transformer modules, namely MOTT. We thoroughly evaluated the efficiency of the components of existing MOT systems and removed the redundant modules. The result is a unified MOT model consisting of only an encoder and a decoder, which still achieves a performance gain over the state-of-the-art (SOTA) methods.
Fig. 1: A visualization of the architectures of the state-of-the-art Transformer-based MOT systems comprising a convolutional backbone, a Transformer encoder, and up to two Transformer decoders.
As shown in Fig. 2, we build a customized encoder–decoder architecture for MOT (formulated as a sequence prediction problem) based on the interaction between object queries and memory embeddings. The CSWin Transformer[3] is selected as the encoder module due to its efficiency and effectiveness in object detection: compared with a conventional ResNet-based backbone, CSWin achieves considerably higher detection accuracy at comparable complexity. Furthermore, thanks to its self-attention mechanism, CSWin can also serve directly as the encoder for MOT. In our implementation, we modified the CSWin-tiny model so that it accepts arbitrary input sizes and produces a multi-level feature pyramid based on the configuration.
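To make the interface concrete, the sketch below shows a toy hierarchical backbone that, like our modified CSWin-tiny, accepts arbitrary input sizes and returns a multi-level feature pyramid. The convolutional stages are only a stand-in for illustration; they do not implement CSWin's cross-shaped window attention, and the class name and channel dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class PyramidBackbone(nn.Module):
    """Toy stand-in for a hierarchical encoder: returns multi-level features
    (strides 4/8/16/32) for arbitrary input sizes. The conv stages only mimic
    the interface, not CSWin's cross-shaped window attention."""

    def __init__(self, dims=(64, 128, 256, 512)):
        super().__init__()
        in_chans = (3,) + dims[:-1]
        self.stages = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, stride=4 if i == 0 else 2, padding=1),
                nn.GELU(),
            )
            for i, (c_in, c_out) in enumerate(zip(in_chans, dims))
        )

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)  # one feature map per pyramid level
        return feats

# Any input resolution works; each level halves the previous one.
levels = PyramidBackbone()(torch.randn(1, 3, 800, 1440))
print([tuple(f.shape) for f in levels])
```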
For the decoder, we leverage the decoder of Deformable DETR[4], a Transformer-based object detector. Thanks to its deformable attention mechanism (we refer to the original paper for details), each query attends to only a small set of sampled points, which greatly reduces the computational cost of the decoder layers.
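The following is a minimal, single-scale sketch of the idea behind deformable attention: each query predicts a few sampling offsets around its reference point and aggregates only those sampled features, instead of attending to every position. It is a simplified stand-in, not the multi-scale module from Deformable DETR or our implementation; all names and default sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    """Single-scale sketch of deformable attention: each query attends to only
    n_points sampled locations instead of all H*W keys."""

    def __init__(self, dim=256, n_heads=8, n_points=4):
        super().__init__()
        self.n_heads, self.n_points = n_heads, n_points
        self.head_dim = dim // n_heads
        self.offset_proj = nn.Linear(dim, n_heads * n_points * 2)  # (dx, dy) per head/point
        self.weight_proj = nn.Linear(dim, n_heads * n_points)      # attention weight per sample
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, feat):
        # queries: (B, Nq, C); ref_points: (B, Nq, 2), (x, y) in [0, 1]; feat: (B, C, H, W)
        B, Nq, C = queries.shape
        H, W = feat.shape[-2:]

        value = self.value_proj(feat.flatten(2).transpose(1, 2))       # (B, HW, C)
        value = value.view(B, H * W, self.n_heads, self.head_dim)

        offsets = self.offset_proj(queries).view(B, Nq, self.n_heads, self.n_points, 2)
        weights = self.weight_proj(queries).view(B, Nq, self.n_heads, self.n_points).softmax(-1)

        # Offsets (in pixels) are normalized, then mapped to [-1, 1] for grid_sample.
        locs = ref_points[:, :, None, None, :] + offsets / feat.new_tensor([W, H])
        grid = 2.0 * locs - 1.0                                         # (B, Nq, heads, points, 2)

        # Sample per head, then take the weighted sum over the sampled points.
        v = value.permute(0, 2, 3, 1).reshape(B * self.n_heads, self.head_dim, H, W)
        g = grid.permute(0, 2, 1, 3, 4).reshape(B * self.n_heads, Nq, self.n_points, 2)
        sampled = F.grid_sample(v, g, align_corners=False)              # (B*heads, head_dim, Nq, points)
        sampled = sampled.view(B, self.n_heads, self.head_dim, Nq, self.n_points)
        w = weights.permute(0, 2, 1, 3)[:, :, None]                     # (B, heads, 1, Nq, points)
        out = (sampled * w).sum(-1).permute(0, 3, 1, 2).reshape(B, Nq, C)
        return self.out_proj(out)

# Usage with illustrative sizes: 300 queries against a 50x88 encoder feature map.
attn = SimpleDeformableAttention()
q = torch.randn(2, 300, 256)
refs = torch.rand(2, 300, 2)
memory = torch.randn(2, 256, 50, 88)
print(attn(q, refs, memory).shape)  # torch.Size([2, 300, 256])
```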
Fig. 2: Architecture of the proposed MOTT model, where $x_t$ is the input image at time step $t$, and $q_{det}$ and $q_{track}$ denote the detection queries and tracking queries, respectively.

At every time step, an image $x_t$ is encoded by the CSWin encoder into memory embeddings; the deformable decoder then attends to these embeddings with detection queries (for newly appearing objects) and tracking queries (carried over from the previous frame), producing the bounding boxes and identities for the current frame.
We evaluate our proposed MOTT model on the MOT17[5] dataset against two other SOTA MOT systems, TransTrack and TrackFormer. MOT17, provided by the MOTChallenge benchmark, is a widely used dataset for the MOT task. It contains 14 videos in total: seven for training and seven for testing. However, the ground truth of the test set is not publicly available, so, to keep the same training procedure and evaluate locally, we split the training set into training and validation halves. In addition to MOT17, the CrowdHuman[6] dataset is used for pre-training, as suggested by other SOTA methods. We first pre-train the model on CrowdHuman for 80 epochs and then fine-tune on the first half of the MOT17 training set for 40 epochs. We also keep the recommended training settings of the original papers for a fair comparison.
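For reference, the training procedure above can be summarized as a simple schedule. The snippet below is only a readable restatement of that setup; the key names are illustrative rather than the repository's actual configuration format.

```python
# Illustrative restatement of the training setup described above (key names are made up).
training_schedule = [
    {"stage": "pretrain", "dataset": "CrowdHuman", "epochs": 80},
    {"stage": "finetune", "dataset": "MOT17 train, first half", "epochs": 40},
]
validation_split = "MOT17 train, second half"  # test-set ground truth is not public
```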
All tracking performances are measured by the MOT metrics[7] widely adopted in related work. These include Multi-Object Tracking Accuracy (MOTA), ID F1 score (IDF1), False Positives (FP), False Negatives (FN), Mostly Tracked targets (MT), Mostly Lost targets (ML), and the number of Identity Switches (IDs).
Among them, MOTA (shown in Eq. 1) measures the overall accuracy of the tracking, taking FP, FN, and IDs into account. IDF1 ranks all methods on the same scale, showing the balance between identification precision and recall.
$$\mathrm{MOTA} = 1 - \frac{\sum_t \left(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t\right)}{\sum_t \mathrm{GT}_t}$$

Eq. 1: MOTA metric, where $\mathrm{GT}_t$ is the number of ground-truth objects at time step $t$.
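As a quick sanity check, MOTA can be computed directly from per-frame error counts. The snippet below mirrors Eq. 1 on made-up numbers.

```python
def mota(fn, fp, idsw, gt):
    """MOTA over a sequence, given per-frame false negatives (FN), false positives (FP),
    identity switches (IDSW), and ground-truth object counts (GT), as in Eq. 1."""
    return 1.0 - (sum(fn) + sum(fp) + sum(idsw)) / sum(gt)

# Toy example: three frames with 10 ground-truth objects each.
print(mota(fn=[1, 0, 2], fp=[0, 1, 0], idsw=[0, 1, 0], gt=[10, 10, 10]))  # ≈ 0.833
```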
As shown in Table 1, MOTT outperforms the other two methods by a noticeable margin. Thanks to the local and global strip-shaped attention of the CSWin encoder, the model tracks objects more precisely than models with other attention structures. The deformable decoder is likewise efficient and effective at decoding the queries against the selected key embeddings. As a result, the two Transformer modules combine into a more compact model with improved effectiveness.
Method | MOTA↑ | MOTP↑ | IDF1↑ | MT↑ | ML↓ |
---|---|---|---|---|---|
TransTrack | 66.5% | 83.4% | 66.8% | 134 | 61 |
TrackFormer | 67.0% | 84.1% | 69.5% | 152 | 57 |
MOTT | 71.6% | 84.5% | 71.7% | 166 | 41 |
Table 1: Comparison among Transformer-based methods. All models are trained using the same dataset and procedure.
Compared with ResNet50, which requires 3.8 GFLOPs for a single forward pass, CSWin-tiny requires 4.3 GFLOPs while delivering substantially higher object detection accuracy. Moreover, CSWin serves as both the backbone and the encoder in our system, whereas other SOTA methods must add the cost of a separate encoder on top of ResNet50.
Compared with DETR’s encoder, whose self-attention has complexity $O(H^2W^2C)$ in the spatial size of the feature map, both the cross-shaped window attention in CSWin and the deformable attention in the decoder attend to only a restricted set of positions, avoiding this quadratic cost.
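To give a sense of the gap, here is a back-of-the-envelope comparison on an illustrative stride-8 feature map; the sizes, channel count, and number of sampled points are assumed for illustration, not measured from the model.

```python
# Rough cost comparison for a single attention layer on a 57x100 feature map, C = 256.
H, W, C, K = 57, 100, 256, 4                               # K = sampled points per query (illustrative)
dense_attention = (H * W) ** 2 * C                          # DETR-style self-attention: O(H^2 W^2 C)
deformable_attention = (H * W) * C * C + (H * W) * K * C    # roughly O(H W C^2): linear in H*W
print(f"dense: {dense_attention:.2e}  deformable: {deformable_attention:.2e}")
# dense: 8.32e+09  deformable: 3.79e+08
```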
In addition, we measured the FLOPs of the different models on sequence 2 of MOT17 using PyTorch’s profiler[8] in testing mode, with all frames resized to 800 pixels in width. As shown in Table 2, MOTT has the fewest trainable parameters, making it the most lightweight among the Transformer-based models. It also requires only around 60% of the FLOPs of TransTrack and less than 38% of those of TrackFormer. Better efficiency translates into faster inference and lower energy consumption for green learning. (TrackFormer-CSWin is a customized model obtained by substituting the ResNet50 backbone with CSWin-tiny.)
Model | #Params (M)↓ | CUDA time total (s)↓ | Avg. FLOPs (G)↓ |
---|---|---|---|
TransTrack | 46.9 | 8.17 | 428.69 |
TrackFormer | 44.0 | 13.67 | 674.92 |
TrackFormer-CSWin | 38.3 | 16.26 | 714.83 |
MOTT | 32.5 | 6.76 | 255.74 |
Table 2: Comparison of computing efficiency among Transformer-based methods.
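For readers who want to reproduce this kind of measurement, the sketch below shows the standard torch.profiler usage behind Table 2. The model here is a torchvision stand-in because the MOTT code itself is not reproduced in this post, and the input size only mimics the 800-pixel-wide frames.

```python
import torch
import torchvision
from torch.profiler import profile, ProfilerActivity

# Stand-in model (ResNet-50); only the profiler calls are the point of this sketch.
model = torchvision.models.resnet50(weights=None).eval()
frame = torch.randn(1, 3, 450, 800)  # one frame resized to 800 px in width

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)
    model, frame = model.cuda(), frame.cuda()

with torch.no_grad(), profile(activities=activities, with_flops=True) as prof:
    model(frame)

# Aggregate per-op statistics; the FLOPs column is filled for supported ops (convs, matmuls).
sort_key = "cuda_time_total" if torch.cuda.is_available() else "cpu_time_total"
print(prof.key_averages().table(sort_by=sort_key, row_limit=10))
```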
Our customized TrackFormer shows that simply choosing a better backbone improves MOTA by 5.9% and IDF1 by 2.2%. However, the inference speed drops because CSWin is more complex than ResNet.
MOTT removes the ResNet backbone and keeps only the CSWin encoder and the deformable decoder. As a result, the model achieves 71.9% MOTA and 72.6% IDF1, which are 5.1% and 1.9% higher than TrackFormer, respectively, while the inference speed improves drastically, reaching 9.09 Hz.
Modules | MOTA↑ | IDF1↑ | Hz↑ |
---|---|---|---|
Res+DE+DD (TrackFormer) | 66.8% | 70.7% | 5.39 |
CSWin+DE+DD (TrackFormer-CSWin) | 72.7% | 72.9% | 4.73 |
CSWin+DD (MOTT) | 71.9% | 72.6% | 9.09 |
Table 3: Ablation study of different modules. Res=ResNet50, DE=Deformable DETR encoder, DD=Deformable DETR decoder.
We also evaluate our model on other datasets, namely MOT20 and DanceTrack. As shown in Table 4, MOTT shows a slight performance decrease on MOT20 due to its massive number of objects: the model manages to track most people but loses some pedestrians in the crowd because of non-maximum suppression (NMS) and heavy occlusion (see the NMS sketch after Table 4). Nevertheless, the model scores the highest MOTA on DanceTrack without being trained on that dataset at all. MOTT detects people in various poses, demonstrating invariance to translation, rotation, and scale.
Dataset | MOTA↑ | MOTP↑ | IDF1↑ | MT↑ | ML↓ |
---|---|---|---|---|---|
DanceTrack | 85.4% | 81.9% | 33.7% | 81.5% | 0.3% |
MOT20 | 66.5% | 81.1% | 57.9% | 52.1% | 13.8% |
MOT17 | 71.6% | 84.5% | 71.7% | 49.0% | 12.1% |
Table 4: MOTT performance on MOT17, MOT20, and DanceTrack dataset.
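The NMS effect in crowded scenes can be seen with a tiny example using torchvision's standard NMS op: two genuinely different but heavily overlapping people exceed the IoU threshold, and the lower-scoring one is suppressed as if it were a duplicate detection. The boxes, scores, and threshold below are made up for illustration.

```python
import torch
from torchvision.ops import nms

boxes = torch.tensor([[100., 100., 200., 300.],   # person A
                      [110., 105., 210., 305.],   # person B, heavily occluded behind A
                      [400., 120., 480., 320.]])  # person C, far from the others
scores = torch.tensor([0.95, 0.90, 0.88])

keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)  # tensor([0, 2]) -- person B is dropped because IoU(A, B) ≈ 0.78 > 0.5
```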
Finally, we visualize the tracking results of MOTT on MOT17, MOT20, and DanceTrack in Fig. 3, corroborating the statistics in Table 4.
Fig. 3: Qualitative results of MOTT on MOT17, MOT20, and DanceTrack dataset.
In this work, we proposed a new Transformer-based MOT architecture, namely MOTT, which substantially reduces hardware and energy costs while retaining state-of-the-art MOT performance. By keeping only the modules that are effective for the task, the new model consists of just an encoder and a decoder for the challenging MOT task. Our model achieves a competitive MOTA of 73.2% with up to 62% fewer FLOPs than a typical Transformer-based MOT model, showing the potential of moving towards a green learning paradigm in CV tasks.
Sun, Peize, et al. “Transtrack: Multiple object tracking with transformer.” arXiv preprint arXiv:2012.15460 (2020). ↩︎
Meinhardt, Tim, et al. “Trackformer: Multi-object tracking with transformers.” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022. ↩︎
Dong, Xiaoyi, et al. “Cswin transformer: A general vision transformer backbone with cross-shaped windows.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. ↩︎
Zhu, Xizhou, et al. “Deformable detr: Deformable transformers for end-to-end object detection.” arXiv preprint arXiv:2010.04159 (2020). ↩︎
Milan, Anton, et al. “MOT16: A benchmark for multi-object tracking.” arXiv preprint arXiv:1603.00831 (2016). ↩︎
Shao, Shuai, et al. “Crowdhuman: A benchmark for detecting human in a crowd.” arXiv preprint arXiv:1805.00123 (2018). ↩︎
Bernardin, Keni, and Rainer Stiefelhagen. “Evaluating multiple object tracking performance: the clear mot metrics.” EURASIP Journal on Image and Video Processing 2008 (2008): 1-10. ↩︎
Paszke, Adam, et al. “Automatic differentiation in pytorch.” (2017). ↩︎