Global Tracking Transformers, CVPR 2022

Overview

Global Tracking Transformers

Global Tracking Transformers,
Xingyi Zhou, Tianwei Yin, Vladlen Koltun, Philipp Krähenbühl,
CVPR 2022 (arXiv 2203.13250)

Features

  • Object association within a long temporal window (32 frames).

  • Classification after tracking for long-tail recognition.

  • "Detector" of global trajectories.

Installation

See installation instructions.

Demo

Run our demo using Colab (no GPU needed): Open In Colab

We use the default detectron2 demo interface. For example, to run the TAO model on an example video (video source: the TAO/YFCC100M dataset), download the model and run:

python demo.py --config-file configs/GTR_TAO_DR2101.yaml --video-input docs/yfcc_v_acef1cb6d38c2beab6e69e266e234f.mp4 --output output/demo_yfcc.mp4 --opts MODEL.WEIGHTS models/GTR_TAO_DR2101.pth

If set up correctly, the output video at output/demo_yfcc.mp4 should show the predicted tracks overlaid on the input video.
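
Under the hood, demo.py follows the standard detectron2 demo flow: open the video with OpenCV, stream frames through the predictor, and write the visualized frames back out. A rough sketch of that loop (how the predictor is constructed is omitted; of the names below, only run_on_video comes from gtr/predictor.py, the rest are standard OpenCV calls):

    import cv2

    def process_video(demo, in_path, out_path):
        # `demo` is the GTR visualization/prediction object built in demo.py;
        # its run_on_video generator yields frames with tracks drawn on them.
        video = cv2.VideoCapture(in_path)
        fps = video.get(cv2.CAP_PROP_FPS)
        width = int(video.get(cv2.CAP_PROP_FRAME_WIDTH))
        height = int(video.get(cv2.CAP_PROP_FRAME_HEIGHT))
        writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"),
                                 fps, (width, height))
        for vis_frame in demo.run_on_video(video):
            writer.write(vis_frame)
        video.release()
        writer.release()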

Benchmark evaluation and training

Please first prepare datasets, then check our MODEL ZOO to reproduce results in our paper. We highlight key results below:

  • MOT17 test set
MOTA IDF1 HOTA DetA AssA FPS
75.3 71.5 59.1 61.6 57.0 19.6
  • TAO test set
Track mAP FPS
20.1 11.2

License

The majority of GTR is licensed under the Apache 2.0 license; however, portions of the project are available under separate license terms: trackeval, in gtr/tracking/trackeval/, is licensed under the MIT license, and FairMOT, in gtr/tracking/local_tracker, is also under the MIT license. Please see NOTICE for license details. The demo video is from the TAO dataset, which originally comes from the YFCC100M dataset. Please be aware of the original dataset license.

Citation

If you find this project useful for your research, please use the following BibTeX entry.

@inproceedings{zhou2022global,
  title={Global Tracking Transformers},
  author={Zhou, Xingyi and Yin, Tianwei and Koltun, Vladlen and Kr{\"a}henb{\"u}hl, Philipp},
  booktitle={CVPR},
  year={2022}
}

Comments

  • Training memory issue & missing file

    Hello, thanks for sharing the source code of this nice work!

    I have tried the TAO training code (GTR_TAO_DR2101.yaml), but training fails partway through with an out-of-memory error. Memory usage seems to grow gradually during training until it hits the limit. I am currently using an A6000 with 48 GB of GPU memory, which should be enough given your training setup (4x 32 GB V100 GPUs). Could you give any ideas? My initial workaround is to reduce the video length from 8 to 2.

    Moreover, I cannot find the move_tao_keyframes.py file. Could you please provide it?

    Thanks,

    opened by tkdtks123 4
  • Error Running Demo

    Hello, I'm having trouble running the inference (the "Demo" section in the README). Below is a notebook link showing the setup and error.

    Here is the link to the notebook.

    Let me know if anything else needs to be provided.

    Much appreciated!

    opened by alckasoc 3
  • Not able to run in x86 in CPU

    Hi @xingyizhou @noahcao, thank you for sharing this work. When I try to run the script on my x86 machine on CPU with python demo.py --config-file configs/GTR_TAO_DR2101.yaml --video-input docs/yfcc_v_acef1cb6d38c2beab6e69e266e234f.mp4 --output output/demo_yfcc.mp4 --opts MODEL.WEIGHTS GTR_TAO_DR2101.pth, I get the following error:

    Traceback (most recent call last):
      File "/home/sravan/SAT/Tracker/GTR/demo.py", line 161, in <module>
        for vis_frame in demo.run_on_video(video):
      File "/home/sravan/SAT/Tracker/GTR/gtr/predictor.py", line 147, in run_on_video
        outputs = self.video_predictor(frames)
      File "/home/sravan/anaconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
        return func(*args, **kwargs)
      File "/home/sravan/SAT/Tracker/GTR/gtr/predictor.py", line 103, in __call__
        predictions = self.model(inputs)
      File "/home/sravan/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/sravan/SAT/Tracker/GTR/gtr/modeling/meta_arch/gtr_rcnn.py", line 61, in forward
        return self.sliding_inference(batched_inputs)
      File "/home/sravan/SAT/Tracker/GTR/gtr/modeling/meta_arch/gtr_rcnn.py", line 81, in sliding_inference
        instances_wo_id = self.inference(
      File "/home/sravan/SAT/Tracker/GTR/gtr/modeling/meta_arch/custom_rcnn.py", line 107, in inference
        features = self.backbone(images.tensor)
      File "/home/sravan/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/sravan/anaconda3/lib/python3.9/site-packages/detectron2/modeling/backbone/fpn.py", line 126, in forward
        bottom_up_features = self.bottom_up(x)
      File "/home/sravan/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/sravan/SAT/Tracker/GTR/third_party/CenterNet2/centernet/modeling/backbone/res2net.py", line 630, in forward
        x = stage(x)
      File "/home/sravan/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/sravan/anaconda3/lib/python3.9/site-packages/torch/nn/modules/container.py", line 141, in forward
        input = module(input)
      File "/home/sravan/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/sravan/SAT/Tracker/GTR/third_party/CenterNet2/centernet/modeling/backbone/res2net.py", line 457, in forward
        sp = self.convs[i](sp, offset, mask)
      File "/home/sravan/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/sravan/anaconda3/lib/python3.9/site-packages/detectron2/layers/deform_conv.py", line 474, in forward
        x = modulated_deform_conv(
      File "/home/sravan/anaconda3/lib/python3.9/site-packages/detectron2/layers/deform_conv.py", line 211, in forward
        raise NotImplementedError("Deformable Conv is not supported on CPUs!")
    NotImplementedError: Deformable Conv is not supported on CPUs!

    How can I solve this?

    opened by navaravan 2
  • A question about the speed

    Thanks for releasing this great work. May I ask for more details about the speed evaluation?

    For the TAO data, since you use the default detectron2 detector, does the 11.2 FPS include the detectron2 inference time, or only the GTR inference time? Also, since the TAO video sampling rate may not be 30 FPS, should this factor be taken into account when converting the inference speed?
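
    For context, this is the end-to-end timing I have in mind (wall-clock frames per second of the whole pipeline, independent of the video's own frame rate); run_on_video is the generator from gtr/predictor.py, and the rest is my assumption about how demo.py sets things up:

        import time

        def measure_throughput(demo, video):
            """Wall-clock FPS of the full pipeline (detection + tracking + visualization).

            `demo` is the predictor object built in demo.py and `video` is the
            cv2.VideoCapture it consumes, as in the repository's demo script.
            """
            start = time.time()
            num_frames = 0
            for _ in demo.run_on_video(video):
                num_frames += 1
            return num_frames / (time.time() - start)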

    Thanks.

    opened by fandulu 2
  • can't evaluate on MOT17

    Hi Xingyi,

    I believe the guidelines in the doc have some issues. To be precise, when evaluating directly on MOT17 with:

    python train_net.py --config-file configs/GTR_MOT_FPN.yaml --eval-only MODEL.WEIGHTS  output/GTR_MOT/GTR_MOT_FPN/model_0004999.pth
    

    we get the following error: gtr.tracking.trackeval.utils.TrackEvalException: GT file not found for sequence: MOT17-02-FRCNN

    Besides, to evaluate on the self-split half-val, I assume we need a "gt_val_half.txt" file under each sequence directory?

    Could you double-check that the guidelines work with the current version and meet the requirements of the TrackEval library you adopted? I suspect some instructions about data splitting and preparation may be missing.
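
    In the meantime, this is the quick sanity check I use to see which sequences are missing the half-val ground truth (the directory layout below is my assumption about where the files should live):

        from pathlib import Path

        def missing_half_val_gt(mot_root="datasets/mot/MOT17/train"):
            """List MOT17 sequences that lack a gt/gt_val_half.txt file."""
            missing = []
            for seq_dir in sorted(Path(mot_root).iterdir()):
                if seq_dir.is_dir() and not (seq_dir / "gt" / "gt_val_half.txt").exists():
                    missing.append(seq_dir.name)
            return missing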

    opened by noahcao 2
  • Typos in training guidelines?

    Hi Xingyi,

    Thanks for the wonderful work. I tried to run the training on MOT17 following the guidelines, but I found some potential typos that need fixing to make it work.

    1. Should we rename the MOT17 train directory to trainval? This is not explained in the prepare-datasets doc.
    2. Should the training datasets be ("mot17_halftrain", "crowdhuman_train") instead of ("mot17_halftrain", "crowdhuman_amodal_train") in the config file? The latter raises an unregistered-dataset error (see attached screenshot).
    opened by noahcao 2
  • Joint or separate training

    Nice work! Thank you for sharing the code.

    Is the training of the detector and the tracker joint or separate? From the paper (Section 5.2), it seems the detector is trained first, then frozen, and the tracker is fine-tuned afterwards. Is that the right reading?

    Thanks, Gurkirt

    opened by gurkirt 2
  • about lvis version

    Hi there! Thanks for your work.

    Here I have two questions about the version of the LVIS dataset:

    1. Why did you use v1.0 instead of v0.5?
    2. Could you please point me to the code that re-maps the labels of v1.0 back to v0.5?

    Looking forward to your reply!

    opened by HanGuangXin 1
  • Add Web Demo & Docker environment

    Hey @xingyizhou ! 👋

    Nice work on the global tracking transformer!

    I'm from Replicate, where we're trying to make machine learning reproducible. We noticed you have registered an account with us, and this pull request makes it possible to run your model inside a Docker environment, which makes it easier for other people to run it. We're using an open source tool called Cog to make this process easier.

    This also means we can make a web page where other people can try out your model! View it here: https://replicate.com/xingyizhou/gtr. The Dockerfile can be found under the 'run model with docker' tab. The demo makes it easy for anyone to upload a custom video and see the result effortlessly.

    We usually add some examples to the page for unregistered users (see the attached screenshot), but we'd like to invite you to claim the page so you can own it, customise the example gallery as you like, and push any future updates to the web demo; we'll also feature it on our website and tweet about it. You can find the 'Claim this model' button at the top of the page.

    Thank you!

    opened by chenxwh 1
  • OSError: [Errno 113] No route to host

    OSError: [Errno 113] No route to host

    Thank you for sharing your excellent work. But when I trained a model on a machine with 8 GPUs, I got OSError: [Errno 113] No route to host (see the attached screenshots). I did not enable the firewall. Now I don't know how to solve it.

    opened by dongfengxijian 0
  • Failed on the long videos.

    I tried to run the predictor on a long video (about 100k frames) using:

    python demo.py --config-file configs/GTR_TAO_DR2101.yaml --video-input "docs/Long video.mp4" --output "output/Long video.mp4" --opts MODEL.WEIGHTS models/GTR_TAO_DR2101.pth

    But the process always gets "Killed". Are there any suggestions on this?

    opened by HanXuMartin 0
  • Can this project train with an RTX 3090 (24 GB)? I get out of memory

    Hello, thank you for sharing this excellent project!

    My lab doesn't have RTX 6000 GPUs (24 GB, referenced in the paper), but we do have some RTX 3090 GPUs (24 GB). So I want to train GTR with an RTX 3090 on the MOT dataset. However, I get an out-of-memory error even with batch size = 1 (error screenshot attached). Could you give me any ideas?

    opened by dongfengxijian 2
  • The choice of backbone on TAO

    Hi, it seems that you are using Res2Net101 on TAO. I'm wondering whether it is necessary to use such a heavy backbone instead of ResNet50. Will the performance drop sharply when using a smaller backbone like ResNet50?

    opened by HanGuangXin 0
  • results_per_category only contains 296 classes during evaluation on TAO val set

    Hi, sorry to bother you, but I have a small question about the evaluation on the TAO val set.

    As we all know, there are 482 LVIS v0.5 categories included in TAO. So when using LVIS v1.0 as in GTR, is it correct that only 296 LVIS v1.0 classes are included in TAO? I'm not sure whether some categories are missing.

    The related code is here:

        precisions = lvis_eval.eval['precision']
        assert len(class_names) == precisions.shape[2]
        results_per_category = []
        id2apiid = sorted(lvis_gt.get_cat_ids())
        inst_aware_ap, inst_count = 0, 0
        for idx, name in enumerate(class_names):
            precision = precisions[:, :, idx, 0]
            precision = precision[precision > -1]
            ap = np.mean(precision) if precision.size else float("nan")
            inst_num = len(lvis_gt.get_ann_ids(cat_ids=[id2apiid[idx]]))
            if inst_num > 0:
                results_per_category.append(("{} {}".format(
                    name, 
                    inst_num if inst_num < 1000 else '{:.1f}k'.format(inst_num / 1000)), 
                    float(ap * 100)))
                inst_aware_ap += inst_num * ap
                inst_count += inst_num
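
    A quick way to check the overlap is to intersect the synset names of the two annotation files (assuming both store their categories in the LVIS format with a synset field; the paths below are placeholders):

        import json

        def overlapping_synsets(tao_ann_json, lvis_ann_json):
            """Return the synset names shared by a TAO and an LVIS annotation file."""
            with open(tao_ann_json) as f:
                tao_cats = json.load(f)["categories"]
            with open(lvis_ann_json) as f:
                lvis_cats = json.load(f)["categories"]
            return {c["synset"] for c in tao_cats} & {c["synset"] for c in lvis_cats}

        # e.g. len(overlapping_synsets("tao/annotations/validation.json", "lvis/lvis_v1_val.json"))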
    
    opened by HanGuangXin 0
  • Question about inference resolution

    During MOT training the input resolution is set to 1280x1280, while the test size is 1560 (longer edge). This means the training frames have a different aspect ratio (square) and a lower resolution than the test frames (rectangular and larger). I have tried testing with videos at the same resolution and aspect ratio as training (1280x1280), but the performance was the worst.

    My question is: how can performance be worse when the test aspect ratio and resolution match training? Shouldn't the network perform better in that situation? If not, what is the reason (maybe I am missing some property of the detector/transformer module)?

    opened by pietro-nardelli 0
  • Difference between GTR_MOT_FPN and GTR_MOTFull_FPN

    Hi, I cannot find details, either here or in the paper, on the differences between these two models. The configurations are identical except for the training dataset (half vs. full).

    In the paper you said: "We follow CenterTrack [68] and split each training sequence in half. We use the first half for training and the second half for validation". But the results in Table 3 seem to be obtained with GTR_MOTFull_FPN.

    Which of the two models should be considered the "best" one? May I have more information about this?

    Thank you so much in advance.

    opened by pietro-nardelli 1