Publication authored by Shan Wu, Amnir Hadachi, Chaoru Lu, and Damien Vivet. You can find our code on GitHub.
Multi-object tracking (MOT) is one of the most essential and challenging tasks in computer vision (CV). It is challenging due to the large number of objects, frequent occlusions, and the lack of prior knowledge about how many objects are present. Transformers, on the other hand, have proven effective in CV tasks, and several works have leveraged Transformer-based detectors for MOT, such as TransTrack[1] and TrackFormer[2]. However, current solutions inflate the overall model size with additional tracking modules, as illustrated in Fig. 1.
In our recently published paper, we propose a new model for multi-object tracking built purely from Transformer modules, namely MOTT. We thoroughly evaluated the efficiency of the components of existing MOT systems and removed the redundant modules. The result is a unified MOT model consisting of only an encoder and a decoder, which still achieves a performance gain over the state-of-the-art (SOTA) methods.
Fig. 1: A visualization of the architectures of the state-of-the-art Transformer-based MOT systems comprising a convolutional backbone, a Transformer encoder, and up to two Transformer decoders.
As shown in Fig. 2, we build a customized encoder–decoder architecture for MOT (formulated as a sequence prediction problem) based on the interaction between object queries and memory embeddings. The CSWin Transformer[3] is selected as the encoder module due to its efficiency and effectiveness in object detection: compared with a conventional ResNet-based backbone, CSWin achieves considerably higher detection accuracy at comparable complexity. Furthermore, thanks to its self-attention mechanism, CSWin can also serve directly as the encoder for MOT. In our implementation, we modified the CSWin-tiny model so that it accepts arbitrary input sizes and produces a multi-level feature pyramid based on the configuration.
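To make the interface concrete, the sketch below shows a toy hierarchical backbone that, like our modified CSWin-tiny, accepts arbitrary input sizes and returns a multi-level feature pyramid. The convolutional stages are only a stand-in for illustration; they do not implement CSWin's cross-shaped window attention, and the class name and channel dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class PyramidBackbone(nn.Module):
    """Toy stand-in for a hierarchical encoder: returns multi-level features
    (strides 4/8/16/32) for arbitrary input sizes. The conv stages only mimic
    the interface, not CSWin's cross-shaped window attention."""

    def __init__(self, dims=(64, 128, 256, 512)):
        super().__init__()
        in_chans = (3,) + dims[:-1]
        self.stages = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, stride=4 if i == 0 else 2, padding=1),
                nn.GELU(),
            )
            for i, (c_in, c_out) in enumerate(zip(in_chans, dims))
        )

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)  # one feature map per pyramid level
        return feats

# Any input resolution works; each level halves the previous one.
levels = PyramidBackbone()(torch.randn(1, 3, 800, 1440))
print([tuple(f.shape) for f in levels])
```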
For the decoder, we leverage the decoder of Deformable DETR[4], a Transformer-based object detector. Thanks to its deformable attention mechanism (we refer to the original paper for details), each query attends to only a small set of sampled points, which greatly reduces the computational cost of the decoder layers.
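The following is a minimal, single-scale sketch of the idea behind deformable attention: each query predicts a few sampling offsets around its reference point and aggregates only those sampled features, instead of attending to every position. It is a simplified stand-in, not the multi-scale module from Deformable DETR or our implementation; all names and default sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    """Single-scale sketch of deformable attention: each query attends to only
    n_points sampled locations instead of all H*W keys."""

    def __init__(self, dim=256, n_heads=8, n_points=4):
        super().__init__()
        self.n_heads, self.n_points = n_heads, n_points
        self.head_dim = dim // n_heads
        self.offset_proj = nn.Linear(dim, n_heads * n_points * 2)  # (dx, dy) per head/point
        self.weight_proj = nn.Linear(dim, n_heads * n_points)      # attention weight per sample
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, feat):
        # queries: (B, Nq, C); ref_points: (B, Nq, 2), (x, y) in [0, 1]; feat: (B, C, H, W)
        B, Nq, C = queries.shape
        H, W = feat.shape[-2:]

        value = self.value_proj(feat.flatten(2).transpose(1, 2))       # (B, HW, C)
        value = value.view(B, H * W, self.n_heads, self.head_dim)

        offsets = self.offset_proj(queries).view(B, Nq, self.n_heads, self.n_points, 2)
        weights = self.weight_proj(queries).view(B, Nq, self.n_heads, self.n_points).softmax(-1)

        # Offsets (in pixels) are normalized, then mapped to [-1, 1] for grid_sample.
        locs = ref_points[:, :, None, None, :] + offsets / feat.new_tensor([W, H])
        grid = 2.0 * locs - 1.0                                         # (B, Nq, heads, points, 2)

        # Sample per head, then take the weighted sum over the sampled points.
        v = value.permute(0, 2, 3, 1).reshape(B * self.n_heads, self.head_dim, H, W)
        g = grid.permute(0, 2, 1, 3, 4).reshape(B * self.n_heads, Nq, self.n_points, 2)
        sampled = F.grid_sample(v, g, align_corners=False)              # (B*heads, head_dim, Nq, points)
        sampled = sampled.view(B, self.n_heads, self.head_dim, Nq, self.n_points)
        w = weights.permute(0, 2, 1, 3)[:, :, None]                     # (B, heads, 1, Nq, points)
        out = (sampled * w).sum(-1).permute(0, 3, 1, 2).reshape(B, Nq, C)
        return self.out_proj(out)

# Usage with illustrative sizes: 300 queries against a 50x88 encoder feature map.
attn = SimpleDeformableAttention()
q = torch.randn(2, 300, 256)
refs = torch.rand(2, 300, 2)
memory = torch.randn(2, 256, 50, 88)
print(attn(q, refs, memory).shape)  # torch.Size([2, 300, 256])
```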
Fig. 2: Architecture of the proposed MOTT model, where $x_t$ is the input image at time step $t$, and $q_{det}$ and $q_{track}$ denote the detection queries and tracking queries, respectively.

At every time step, an image $x_t$ is encoded by the CSWin encoder into memory embeddings; the deformable decoder then attends to these embeddings with detection queries (for newly appearing objects) and tracking queries (carried over from the previous frame), producing the bounding boxes and identities for the current frame.
We evaluate our proposed MOTT model on the MOT17[5] dataset against two other SOTA MOT systems, TransTrack and TrackFormer. MOT17, provided by the MOTChallenge benchmark, is a widely used dataset for the MOT task. It contains 14 videos in total: seven for training and seven for testing. However, the ground truth of the test set is not publicly available, so, to keep the same training procedure and evaluate locally, we split the training set into training and validation halves. In addition to MOT17, the CrowdHuman[6] dataset is used for pre-training, as suggested by other SOTA methods. We first pre-train the model on CrowdHuman for 80 epochs and then fine-tune on the first half of the MOT17 training set for 40 epochs. We also keep the recommended training settings of the original papers for a fair comparison.
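For reference, the training procedure above can be summarized as a simple schedule. The snippet below is only a readable restatement of that setup; the key names are illustrative rather than the repository's actual configuration format.

```python
# Illustrative restatement of the training setup described above (key names are made up).
training_schedule = [
    {"stage": "pretrain", "dataset": "CrowdHuman", "epochs": 80},
    {"stage": "finetune", "dataset": "MOT17 train, first half", "epochs": 40},
]
validation_split = "MOT17 train, second half"  # test-set ground truth is not public
```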
All tracking performances are measured by the MOT metrics[7] widely adopted in related work. These include Multi-Object Tracking Accuracy (MOTA), ID F1 score (IDF1), False Positives (FP), False Negatives (FN), Mostly Tracked targets (MT), Mostly Lost targets (ML), and the number of Identity Switches (IDs).
Among them, MOTA (shown in Eq. 1) measures the overall accuracy of the tracking, taking FP, FN, and IDs into account. IDF1 ranks all methods on the same scale, showing the balance between identification precision and recall.
$$\mathrm{MOTA} = 1 - \frac{\sum_t \left(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t\right)}{\sum_t \mathrm{GT}_t}$$

Eq. 1: MOTA metric, where $\mathrm{GT}_t$ is the number of ground-truth objects at time step $t$.
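As a quick sanity check, MOTA can be computed directly from per-frame error counts. The snippet below mirrors Eq. 1 on made-up numbers.

```python
def mota(fn, fp, idsw, gt):
    """MOTA over a sequence, given per-frame false negatives (FN), false positives (FP),
    identity switches (IDSW), and ground-truth object counts (GT), as in Eq. 1."""
    return 1.0 - (sum(fn) + sum(fp) + sum(idsw)) / sum(gt)

# Toy example: three frames with 10 ground-truth objects each.
print(mota(fn=[1, 0, 2], fp=[0, 1, 0], idsw=[0, 1, 0], gt=[10, 10, 10]))  # ≈ 0.833
```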
As shown in Table 1, MOTT outperforms the other two methods by a noticeable margin. Thanks to the local and global strip-shaped attention of the CSWin encoder, the model tracks objects more precisely than models with other attention structures. The deformable decoder is likewise efficient and effective at decoding the queries against the selected key embeddings. As a result, the two Transformer modules combine into a more compact model with improved effectiveness.
Method | MOTA↑ | MOTP↑ | IDF1↑ | MT↑ | ML↓ |
---|---|---|---|---|---|
TransTrack | 66.5% | 83.4% | 66.8% | 134 | 61 |
TrackFormer | 67.0% | 84.1% | 69.5% | 152 | 57 |
MOTT | 71.6% | 84.5% | 71.7% | 166 | 41 |
Table 1: Comparison among Transformer-based methods. All models are trained using the same dataset and procedure.
Compared with ResNet50, which requires 3.8 GFLOPs for a single forward pass, CSWin-tiny requires 4.3 GFLOPs while delivering substantially higher object detection accuracy. Moreover, CSWin serves as both the backbone and the encoder in our system, whereas other SOTA methods must add the cost of a separate encoder on top of ResNet50.
Compared with DETR’s encoder, whose self-attention has complexity $O(H^2W^2C)$ in the spatial size of the feature map, both the cross-shaped window attention in CSWin and the deformable attention in the decoder attend to only a restricted set of positions, avoiding this quadratic cost.
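To give a sense of the gap, here is a back-of-the-envelope comparison on an illustrative stride-8 feature map; the sizes, channel count, and number of sampled points are assumed for illustration, not measured from the model.

```python
# Rough cost comparison for a single attention layer on a 57x100 feature map, C = 256.
H, W, C, K = 57, 100, 256, 4                               # K = sampled points per query (illustrative)
dense_attention = (H * W) ** 2 * C                          # DETR-style self-attention: O(H^2 W^2 C)
deformable_attention = (H * W) * C * C + (H * W) * K * C    # roughly O(H W C^2): linear in H*W
print(f"dense: {dense_attention:.2e}  deformable: {deformable_attention:.2e}")
# dense: 8.32e+09  deformable: 3.79e+08
```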
In addition, we measured the FLOPs of the different models on sequence 2 of MOT17 using PyTorch’s profiler[8] in testing mode, with all frames resized to 800 pixels in width. As shown in Table 2, MOTT has the fewest trainable parameters, making it the most lightweight among the Transformer-based models. It also requires only around 60% of the FLOPs of TransTrack and less than 38% of those of TrackFormer. Better efficiency translates into faster inference and lower energy consumption for green learning. (TrackFormer-CSWin is a customized model obtained by substituting the ResNet50 backbone with CSWin-tiny.)
Model | #Params (M)↓ | CUDA time total (s)↓ | Avg. FLOPs (G)↓ |
---|---|---|---|
TransTrack | 46.9 | 8.17 | 428.69 |
TrackFormer | 44.0 | 13.67 | 674.92 |
TrackFormer-CSWin | 38.3 | 16.26 | 714.83 |
MOTT | 32.5 | 6.76 | 255.74 |
Table 2: Comparison of computing efficiency among Transformer-based methods.
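For readers who want to reproduce this kind of measurement, the sketch below shows the standard torch.profiler usage behind Table 2. The model here is a torchvision stand-in because the MOTT code itself is not reproduced in this post, and the input size only mimics the 800-pixel-wide frames.

```python
import torch
import torchvision
from torch.profiler import profile, ProfilerActivity

# Stand-in model (ResNet-50); only the profiler calls are the point of this sketch.
model = torchvision.models.resnet50(weights=None).eval()
frame = torch.randn(1, 3, 450, 800)  # one frame resized to 800 px in width

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)
    model, frame = model.cuda(), frame.cuda()

with torch.no_grad(), profile(activities=activities, with_flops=True) as prof:
    model(frame)

# Aggregate per-op statistics; the FLOPs column is filled for supported ops (convs, matmuls).
sort_key = "cuda_time_total" if torch.cuda.is_available() else "cpu_time_total"
print(prof.key_averages().table(sort_by=sort_key, row_limit=10))
```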
Our customized TrackFormer shows that simply choosing a better backbone improves MOTA by 5.9% and IDF1 by 2.2%. However, the inference speed drops because CSWin is more complex than ResNet.
MOTT removes the ResNet backbone and keeps only the CSWin encoder and the deformable decoder. As a result, the model achieves 71.9% MOTA and 72.6% IDF1, which are 5.1% and 1.9% higher than TrackFormer, respectively, while the inference speed improves drastically, reaching 9.09 Hz.
Modules | MOTA↑ | IDF1↑ | Hz↑ |
---|---|---|---|
Res+DE+DD (TrackFormer) | 66.8% | 70.7% | 5.39 |
CSWin+DE+DD (TrackFormer-CSWin) | 72.7% | 72.9% | 4.73 |
CSWin+DD (MOTT) | 71.9% | 72.6% | 9.09 |
Table 3: Ablation study of different modules. Res=ResNet50, DE=Deformable DETR encoder, DD=Deformable DETR decoder.
We also evaluate our model on other datasets, namely MOT20 and DanceTrack. As shown in Table 4, MOTT shows a slight performance decrease on MOT20 due to its massive number of objects: the model manages to track most people but loses some pedestrians in the crowd because of non-maximum suppression (NMS) and heavy occlusion (see the NMS sketch after Table 4). Nevertheless, the model scores the highest MOTA on DanceTrack without being trained on that dataset at all. MOTT detects people in various poses, demonstrating invariance to translation, rotation, and scale.
Dataset | MOTA↑ | MOTP↑ | IDF1↑ | MT↑ | ML↓ |
---|---|---|---|---|---|
DanceTrack | 85.4% | 81.9% | 33.7% | 81.5% | 0.3% |
MOT20 | 66.5% | 81.1% | 57.9% | 52.1% | 13.8% |
MOT17 | 71.6% | 84.5% | 71.7% | 49.0% | 12.1% |
Table 4: MOTT performance on MOT17, MOT20, and DanceTrack dataset.
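The NMS effect in crowded scenes can be seen with a tiny example using torchvision's standard NMS op: two genuinely different but heavily overlapping people exceed the IoU threshold, and the lower-scoring one is suppressed as if it were a duplicate detection. The boxes, scores, and threshold below are made up for illustration.

```python
import torch
from torchvision.ops import nms

boxes = torch.tensor([[100., 100., 200., 300.],   # person A
                      [110., 105., 210., 305.],   # person B, heavily occluded behind A
                      [400., 120., 480., 320.]])  # person C, far from the others
scores = torch.tensor([0.95, 0.90, 0.88])

keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)  # tensor([0, 2]) -- person B is dropped because IoU(A, B) ≈ 0.78 > 0.5
```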
Finally, we visualize the tracking results of MOTT on MOT17, MOT20, and DanceTrack in Fig. 3, corroborating the statistics in Table 4.
Fig. 3: Qualitative results of MOTT on MOT17, MOT20, and DanceTrack dataset.
In this work, we proposed a new Transformer-based MOT architecture, namely MOTT, which substantially reduces hardware and energy costs while retaining state-of-the-art MOT performance. By keeping only the modules that are effective for the task, the new model consists of just an encoder and a decoder for the challenging MOT task. Our model achieves a competitive MOTA of 73.2% with up to 62% fewer FLOPs than a typical Transformer-based MOT model, showing the potential of moving towards a green learning paradigm in CV tasks.
Sun, Peize, et al. “Transtrack: Multiple object tracking with transformer.” arXiv preprint arXiv:2012.15460 (2020). ↩︎
Meinhardt, Tim, et al. “Trackformer: Multi-object tracking with transformers.” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022. ↩︎
Dong, Xiaoyi, et al. “Cswin transformer: A general vision transformer backbone with cross-shaped windows.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. ↩︎
Zhu, Xizhou, et al. “Deformable detr: Deformable transformers for end-to-end object detection.” arXiv preprint arXiv:2010.04159 (2020). ↩︎
Milan, Anton, et al. “MOT16: A benchmark for multi-object tracking.” arXiv preprint arXiv:1603.00831 (2016). ↩︎
Shao, Shuai, et al. “Crowdhuman: A benchmark for detecting human in a crowd.” arXiv preprint arXiv:1805.00123 (2018). ↩︎
Bernardin, Keni, and Rainer Stiefelhagen. “Evaluating multiple object tracking performance: the clear mot metrics.” EURASIP Journal on Image and Video Processing 2008 (2008): 1-10. ↩︎
Paszke, Adam, et al. “Automatic differentiation in pytorch.” (2017). ↩︎