Tags: MOT, Transformer, Locality, Machine Learning, Computer Vision
Enhancing Multi-Object Tracking with Locality in Transformers
Shan Wu
Published on: 26.4.2023
Abstract: This paper explores the possibilities of enhancing an MOT system by leveraging the prevailing convolutional neural network (CNN) and a novel vision transformer technique, Locality. Transformers adopted for computer vision tasks have several deficiencies: while they are good at modeling global information over long embeddings, they lack a locality mechanism for learning local features. This can lead to small objects being neglected, which may cause security issues. We combine the TransTrack MOT system with the locality mechanism inspired by LocalViT and find that the locality-enhanced system outperforms the baseline TransTrack by 5.3% MOTA on the MOT17 dataset.

Publication authored by Shan Wu, Amnir Hadachi, Chaoru Lu, and Damien Vivet.

Highlights

  • Enhancement of Multi-Object Tracking (MOT) using a novel vision transformer technique named ‘Locality.’
  • The quantitative and qualitative research shows the strength of locality in transformer-based MOT systems.
  • The performance of pedestrian tracking is boosted when training the model with a mixture of detection and tracking datasets.

1. Introduction

The realm of multi-object tracking (MOT) is essential in fields like traffic analysis, surveillance, and autonomous vehicles. This paper introduces an approach that enhances MOT systems by combining a transformer model with a novel 'Locality' technique, blending convolutional neural network (CNN) techniques with advanced vision transformers. This integration aims to overcome a key limitation of transformers in computer vision tasks: their tendency to neglect small objects due to a lack of local feature learning.

2. Methodology

The proposed model builds upon the TransTrack[1] architecture, a transformer-based MOT system. The key innovation lies in the integration of a locality mechanism within the transformer encoder. This mechanism (shown in Fig. 1), inspired by LocalViT[2] and MobileNet[3], uses a locality module (an inverted residual block) in place of the classic feed-forward network. A depth-wise convolutional layer inside the locality module allows the encoder to capture local features effectively. The model processes multi-scale feature maps, ensuring a comprehensive understanding of the scene.

Fig. 1: The modified encoder of our system.
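
To make the locality module concrete, below is a minimal PyTorch sketch of an inverted-residual feed-forward replacement in the spirit of LocalViT: the token sequence is reshaped back into a 2D feature map so a depth-wise convolution can mix neighbouring positions. The layer sizes, activation choice, and the omission of the squeeze-and-excitation block are simplifying assumptions for illustration, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class LocalityModule(nn.Module):
    """Inverted-residual block used in place of the encoder's feed-forward network.

    Token embeddings of shape (batch, H*W, dim) are reshaped back to a 2D map so a
    depth-wise 3x3 convolution can aggregate neighbouring tokens, i.e. learn local
    features. Sizes and activations are illustrative; the paper's module also
    contains a squeeze-and-excitation block, omitted here for brevity.
    """

    def __init__(self, dim: int = 256, expansion: int = 4):
        super().__init__()
        hidden = dim * expansion
        self.expand = nn.Sequential(nn.Conv2d(dim, hidden, 1), nn.GELU())
        # Depth-wise convolution: the locality mechanism itself.
        self.depthwise = nn.Sequential(
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden), nn.GELU()
        )
        self.project = nn.Conv2d(hidden, dim, 1)

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        b, n, c = tokens.shape          # n == h * w
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        x = self.project(self.depthwise(self.expand(x)))
        return x.reshape(b, c, n).transpose(1, 2)  # back to (batch, H*W, dim)
```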

As shown in Fig. 2, the MOT pipeline consists of a CNN backbone for feature extraction, a transformer encoder with the locality mechanism, and two parallel decoders for detection and tracking. Two consecutive video frames are used as input. The encoder enriches the global feature embeddings with local details, enhancing the model's sensitivity to smaller or partially occluded objects. Each decoder generates bounding boxes from trainable queries and the encoder features, and the two sets of boxes are merged by IoU matching to form the final object bounding boxes.

Fig. 2: The pipeline of our locality-enhanced MOT system.
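
For intuition, the merging step at the end of the pipeline can be sketched as IoU matching between the tracking decoder's boxes (which carry identities from the previous frame) and the detection decoder's boxes. The greedy assignment and threshold below are illustrative assumptions; TransTrack's actual implementation may differ in these details.

```python
import numpy as np

def iou_matrix(boxes_a: np.ndarray, boxes_b: np.ndarray) -> np.ndarray:
    """Pairwise IoU between two box sets in (x1, y1, x2, y2) format."""
    x1 = np.maximum(boxes_a[:, None, 0], boxes_b[None, :, 0])
    y1 = np.maximum(boxes_a[:, None, 1], boxes_b[None, :, 1])
    x2 = np.minimum(boxes_a[:, None, 2], boxes_b[None, :, 2])
    y2 = np.minimum(boxes_a[:, None, 3], boxes_b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-9)

def merge_boxes(track_boxes: np.ndarray, det_boxes: np.ndarray, iou_thresh: float = 0.5):
    """Greedily match tracked boxes (previous-frame identities) to current detections.

    Returns matched (track_idx, det_idx) pairs plus unmatched detection indices,
    which would start new tracks. A simplified stand-in for the IoU matching step.
    """
    if len(track_boxes) == 0 or len(det_boxes) == 0:
        return [], list(range(len(det_boxes)))
    ious = iou_matrix(track_boxes, det_boxes)
    matches, used = [], set()
    for t in range(len(track_boxes)):
        d = int(np.argmax(ious[t]))
        if ious[t, d] >= iou_thresh and d not in used:
            matches.append((t, d))
            used.add(d)
    unmatched = [d for d in range(len(det_boxes)) if d not in used]
    return matches, unmatched
```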

3. Experiments

Experiments were conducted on the MOT17 dataset[4], focusing on pedestrian tracking. Because MOT17 provides no official train-validation split and fine-tuning on the training set is the suggested practice, we divide the training set into two halves for training and validation, following TransTrack. The methodology was evaluated against standard MOT metrics[5], including Multiple Object Tracking Accuracy (MOTA) and Multiple Object Tracking Precision (MOTP), defined in Equations (1) and (2) below.

\begin{equation} MOTA = 1 - \frac{\sum_t (FP_t + FN_t + IDSW_t)}{\sum_t GT_t} \end{equation}

where TPs are detected instances that match a ground truth, FPs are invalid detections (false positives), FNs are missed instances that should have been detected (false negatives), IDSWs count tracking identity switches, and GTs are the ground-truth instances.

\begin{equation} MOTP = \frac{\sum_{t,i} d_{t,i}}{\sum_t c_t} \end{equation}

where $d_{t,i}$ is the bounding-box overlap between the $i$-th matched detection and its ground truth in time frame $t$, and $c_t$ is the number of such matches in time frame $t$.
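
As a worked illustration of Equations (1) and (2), the metrics can be accumulated from per-frame counts as follows. The numbers are invented purely for the example, and MOTP is computed in the overlap-based (higher-is-better) convention used in the tables below.

```python
# Invented per-frame accumulators, purely to illustrate Equations (1) and (2).
frames = [
    {"fp": 2, "fn": 5, "idsw": 1, "gt": 40, "iou_sum": 30.4, "matches": 35},
    {"fp": 1, "fn": 3, "idsw": 0, "gt": 42, "iou_sum": 33.1, "matches": 39},
]

fp = sum(f["fp"] for f in frames)        # false positives over all frames
fn = sum(f["fn"] for f in frames)        # false negatives over all frames
idsw = sum(f["idsw"] for f in frames)    # identity switches over all frames
gt = sum(f["gt"] for f in frames)        # ground-truth objects over all frames

# Equation (1): errors are pooled over all frames before normalising by GT.
mota = 1.0 - (fp + fn + idsw) / gt

# Equation (2): mean overlap of matched boxes, pooled over all frames.
motp = sum(f["iou_sum"] for f in frames) / sum(f["matches"] for f in frames)

print(f"MOTA = {mota:.3f}, MOTP = {motp:.3f}")  # MOTA = 0.854, MOTP = 0.858
```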

3.1. Overall Performance

The overall performance of our model is shown in Table 1, where TransTrack is the model we trained with the official configuration, TransTrack* is the fine-tuned model provided by the authors and used as our baseline, TransTrack-mix is trained on a mixture of the MOT17 and CrowdHuman[6] datasets, Locality shares the weights of the locality module across the multi-scale feature maps, Locality+ uses dedicated locality modules for the multi-scale feature maps, and Locality++ adds more layers to the locality module for the ablation study.

| Model | MOTA↑ | MOTP↑ | MT↑ | PT | ML↓ | FP↓ | FN↓ | IDs↓ |
|---|---|---|---|---|---|---|---|---|
| TransTrack | 66.5% | 83.4% | 39.5% | 42.5% | 18.0% | 2.9% | 30.1% | 0.6% |
| TransTrack* | 67.1% | 83.5% | 41.9% | 39.8% | 18.3% | 3.1% | 29.4% | 0.5% |
| TransTrack-mix | 72.0% | 85.2% | 49.3% | 37.8% | 13.0% | 2.0% | 25.5% | 0.4% |
| Locality | 68.5% | 85.2% | 45.1% | 37.8% | 17.1% | 1.5% | 29.6% | 0.5% |
| Locality+ | 72.4% | 85.4% | 54.0% | 36.6% | 9.4% | 3.8% | 23.1% | 0.7% |
| Locality++ | 72.1% | 85.5% | 54.3% | 36.6% | 9.1% | 4.4% | 23.0% | 0.6% |

Table 1: The performance of our locality-enabled MOT models compared with TransTrack variants.

The overall best performance is achieved by the Locality+ model, which outperforms the baseline by 5.3% in MOTA and underscores the effect of the locality mechanism. The Locality model, on the other hand, only achieves a 1.4% and 1.7% improvement in MOTA and MOTP, respectively, showing that the locality mechanism is far less effective without dedicated weights for the multi-scale feature maps. The results (together with Table 2) also show that training with a mixture of datasets yields better performance, indicating that more varied training data helps the model generalize, especially since the MOT17 dataset is relatively small. Furthermore, increasing the number of layers in the locality module (Locality++) does not improve performance, which is likely due to overfitting.

| Dataset | MOTA↑ | MOTP↑ | MT↑ | PT | ML↓ |
|---|---|---|---|---|---|
| Mixed data | 68.5% | 85.2% | 45.1% | 37.8% | 17.1% |
| MOT17 half | 60.8% | 83.5% | 33.0% | 44.5% | 22.4% |

Table 2: The performance of the model trained with a mixture of datasets and MOT17 half dataset.

3.2. Computational Cost of Locality

The computational cost of the locality mechanism is shown in Table 3. Multiply-accumulate operations (MACs) are used to measure the actual computational cost because hardware accelerators such as GPUs and TPUs merge the multiplication and accumulation into a single operation. The cost of the locality module mainly comes from the squeeze-and-excitation (SE) module and can be balanced by adjusting the expansion ratio of the SE module; with a ratio of 4, the locality module costs almost the same number of MACs as the conventional FFN.

| Module | #Params | MACs |
|---|---|---|
| Locality | 1.06M | 0.42G |
| FFN | 0.53M | 0.41G |

Table 3: The number of parameters and MACs of the locality module and the feed-forward network (FFN) in the encoder.
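
As a rough sanity check on the FFN row of Table 3, its parameter count can be reproduced in a few lines. The encoder sizes used here (a 256-dimensional model and a 1024-dimensional feed-forward layer, the Deformable DETR defaults that TransTrack builds on) are assumptions, since they are not stated in this post.

```python
import torch.nn as nn

def count_params(module: nn.Module) -> int:
    """Total number of trainable parameters."""
    return sum(p.numel() for p in module.parameters())

d_model, d_ffn = 256, 1024  # assumed encoder sizes, not stated in this post

# Conventional transformer feed-forward network: two linear layers.
ffn = nn.Sequential(nn.Linear(d_model, d_ffn), nn.ReLU(), nn.Linear(d_ffn, d_model))
print(f"FFN parameters: {count_params(ffn) / 1e6:.2f}M")  # ~0.53M, as in Table 3
```

Under these assumed sizes, the locality module roughly doubles this count (1.06M in Table 3), with the depth-wise convolution and the squeeze-and-excitation block accounting for the difference.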

3.3. Qualitative Results

Finally, we visualize qualitative results of our locality model in Fig. 3. The upper row shows the results of the locality model, and the lower row shows the results of the baseline model.

Fig. 3: Qualitative results of the locality model on the MOT17 dataset.

4. Conclusion

This research presents a significant advancement in MOT systems by effectively integrating locality mechanisms within a transformer-based framework. The results on the MOT17 dataset underline the potential of combining global and local feature learning, opening new possibilities for MOT performance and paving the way for future advances in object tracking.


  1. Sun, Peize, et al. “Transtrack: Multiple object tracking with transformer.” arXiv preprint arXiv:2012.15460 (2020). ↩︎

  2. Li, Yawei, et al. “Localvit: Bringing locality to vision transformers.” arXiv preprint arXiv:2104.05707 (2021). ↩︎

  3. Howard, Andrew G., et al. “Mobilenets: Efficient convolutional neural networks for mobile vision applications.” arXiv preprint arXiv:1704.04861 (2017). ↩︎

  4. Milan, Anton, et al. “MOT16: A benchmark for multi-object tracking.” arXiv preprint arXiv:1603.00831 (2016). ↩︎

  5. Bernardin, Keni, and Rainer Stiefelhagen. “Evaluating multiple object tracking performance: the clear mot metrics.” EURASIP Journal on Image and Video Processing 2008 (2008): 1-10. ↩︎

  6. Shao, Shuai, et al. “Crowdhuman: A benchmark for detecting human in a crowd.” arXiv preprint arXiv:1805.00123 (2018). ↩︎