Language-Driven Semantic Segmentation

Language-driven Semantic Segmentation (LSeg)

This repo contains the official PyTorch implementation of the paper Language-driven Semantic Segmentation.

Authors: Boyi Li, Kilian Q. Weinberger, Serge Belongie, Vladlen Koltun, René Ranftl

Overview

We present LSeg, a novel model for language-driven semantic image segmentation. LSeg uses a text encoder to compute embeddings of descriptive input labels (e.g., "grass" or "building") together with a transformer-based image encoder that computes dense per-pixel embeddings of the input image. The image encoder is trained with a contrastive objective to align pixel embeddings to the text embedding of the corresponding semantic class. The text embeddings provide a flexible label representation in which semantically similar labels map to similar regions in the embedding space (e.g., "cat" and "furry"). This allows LSeg to generalize to previously unseen categories at test time, without retraining or even requiring a single additional training sample. We demonstrate that our approach achieves highly competitive zero-shot performance compared to existing zero- and few-shot semantic segmentation methods, and even matches the accuracy of traditional segmentation algorithms when a fixed label set is provided.
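
To make the mechanism above concrete, here is a minimal, self-contained sketch of the pixel-text correlation step. Random tensors stand in for the actual image and text encoders, and the shapes, embedding dimension, and omitted temperature scaling are illustrative assumptions, not the repo's exact code:

import torch
import torch.nn.functional as F

# Stand-ins for encoder outputs; in LSeg these come from the DPT image encoder and the CLIP text encoder.
pixel_emb = torch.randn(1, 512, 60, 80)   # (batch, channels, H, W): dense per-pixel embeddings
text_emb = torch.randn(3, 512)            # (num_labels, channels): one embedding per input label

# L2-normalize both, then correlate every pixel embedding with every label embedding.
pixel_emb = F.normalize(pixel_emb, dim=1)
text_emb = F.normalize(text_emb, dim=1)
logits = torch.einsum("bchw,kc->bkhw", pixel_emb, text_emb)  # per-pixel score for each label

pred = logits.argmax(dim=1)               # (batch, H, W): predicted label index per pixel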

Please check our Video Demo (4k) to further showcase the capabilities of LSeg.

Usage

Installation

Option 1:

pip install -r requirements.txt

Option 2:

conda install ipython
pip install torch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2
pip install git+https://github.com/zhanghang1989/PyTorch-Encoding/
pip install pytorch-lightning==1.3.5
pip install opencv-python
pip install imageio
pip install ftfy regex tqdm
pip install git+https://github.com/openai/CLIP.git
pip install altair
pip install streamlit
pip install --upgrade protobuf
pip install timm
pip install tensorboardX
pip install matplotlib
pip install test-tube
pip install wandb
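
Installation of PyTorch-Encoding in particular can be fragile (see the issues below). As an optional sanity check, confirm that the key packages import before moving on:

python -c "import torch, clip, encoding, pytorch_lightning; print(torch.__version__, torch.cuda.is_available())"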

Running the interactive app

Download the demo model and put it in the checkpoints folder as checkpoints/demo_e200.ckpt.

Then run:

streamlit run lseg_app.py

Training

Backbone = ViT-L/16, Text Encoder from CLIP ViT-B/32

bash train.sh
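
train.sh wraps a call to train_lseg.py. For reference, a command along the lines of the one shared in the reproduction issue below, adapted to the ViT-L/16 backbone; the exact flags inside train.sh may differ, and the experiment name here is made up:

python -u train_lseg.py --dataset ade20k --data_path datasets --batch_size 4 --base_lr 0.004 \
--weight_decay 1e-4 --max_epochs 240 --widehead --no-scaleinv --accumulate_grad_batches 2 \
--backbone vit_l16_384 --exp_name lseg_ade20k_l16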

Testing

Backbone = ViT-L/16, Text Encoder from CLIP ViT-B/32

bash test.sh

Try demo model

Download the demo model and put it in the checkpoints folder as checkpoints/demo_e200.ckpt.

Then follow lseg_demo.ipynb to play around with LSeg. Enjoy!
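
If you prefer a script over the notebook, a rough sketch of loading the demo checkpoint is below. It assumes LSegModule (in modules/lseg_module.py, referenced in the issues section) is a PyTorch Lightning module whose hyperparameters are stored in the checkpoint, and that the underlying network accepts a labelset argument as in the issue code further down; treat the attribute and argument names as hypothetical rather than the repo's confirmed API:

import torch
from modules.lseg_module import LSegModule

# load_from_checkpoint is the standard PyTorch Lightning entry point.
module = LSegModule.load_from_checkpoint("checkpoints/demo_e200.ckpt")
module.eval()

image = torch.randn(1, 3, 480, 480)  # placeholder input; see lseg_demo.ipynb for the real preprocessing
with torch.no_grad():
    # module.net (attribute name assumed) is the underlying LSegNet; labelset handling is also an assumption.
    out = module.net(image, labelset=["other", "grass", "building"])
pred = out.argmax(dim=1)             # per-pixel index into the label set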

Model Zoo

name             backbone   text encoder    url
Model for demo   ViT-L/16   CLIP ViT-B/32   download

If you find this repo useful, please cite:

@article{li2022lan,
  title={Language-driven Semantic Segmentation},
  author={Li, Boyi and Weinberger, Kilian Q and Belongie, Serge and Koltun, Vladlen and Ranftl, Rene},
  journal={arXiv preprint},
  year={2022}
}

Acknowledgement

Thanks to the code bases of DPT, PyTorch Lightning, CLIP, PyTorch-Encoding, Streamlit, and Wandb.

Comments
  • Huggingface Spaces

    Hi, would you be interested in sharing a web demo on Huggingface Spaces for lang-seg?

    It would make this model more accessible as it would allow people to try out the model directly from the browser. Some other recent machine learning model repos have set up Spaces for easy access:

    github: https://github.com/salesforce/BLIP Spaces: https://huggingface.co/spaces/akhaliq/BLIP

    github: https://github.com/facebookresearch/omnivore Spaces: https://huggingface.co/spaces/akhaliq/omnivore

    Spaces is completely free, and I can help setup a Gradio Space. Here are some getting started instructions if you'd prefer to do it yourself: https://huggingface.co/blog/gradio-spaces

    opened by AK391 7
  • Questions about training and inference configuration

    Hi,

    Thanks for open-sourcing such great work. I have some questions when using this code:

    1. Does the test_lseg.py script support multi-GPU inference? When using a single GPU, it takes about 2~3 hours for inference on ade20k.
    2. I tried to evaluate the provided demo_e200.ckpt on ade20k and got (pixAcc: 0.8078, mIoU: 0.3207), is that correct? It seems lower than the values in the paper.
    3. I trained a model on ade20k (the same config as train.sh, backbone is vit_l16_384) with 8*V100 but found it needs ~90 hours for training 240 epochs. Is it reasonable (it seems much longer than you said in #7)?
    4. When I use this code for other datasets like cityscapes, what changes should I make? The only difference I found is get_labels() in lseg_module.py. Have you evaluated the mIoU on cityscapes?

    Thanks in advance.

    opened by chufengt 6
  • Difference between the settings for demo and those in your paper

    Hi, I have a question on the difference of settings between the demo in your README and the experiment in your paper.

    In the README, you published the pre-trained weight for the demo. It says that during training the backbones for both image and text are ViT-L/16. Section 5.1 in your paper says

    We used LSeg with DPT and a smaller ViT-B/32 backbone together with the CLIP ViT-B/32 text encoder ...

    When reproducing your results in 5.1, does that require training from scratch with a ViT-B/32 backbone for the images? Also, are there any other differences, such as batch size? More specifically, how do I change the arguments in train.sh?

    Finally, is it possible to share with us (or me) the weight used for your results?

    Thank you in advance.

    opened by whiteking64 6
  • Pretrained LSeg on Pascal-5i, COCO-20i

    Congrats on your paper accepted to ICLR 2022!

    Do you have your pretrained models on 4 folds of Pascal-5i and COCO-20i? Can you share them?

    I really appreciate your response.

    opened by ducminhkhoi 5
  • error with torch-encoding

    Hi! Thanks for the great work

    I can't seem to import encoding after following the installation steps. The error I got is "cannot import name 'gpu' from partially initialized module 'encoding' (most likely due to a circular import)". Can you please let me know whether you know the cause of this issue? Thanks!

    opened by XAVILLA 4
  • Error when building Pytorch-encoding

    Hi, I would like to try your code, but an error shows up when I try to install PyTorch-Encoding. Could you share your environment info (CUDA, GPU, Python, g++, and so on)? Do you have any advice on installing it? I see a lot of people hitting this error.

    My env: OS: Ubuntu 18.04, gcc: 7.5.0, GPU: 3090, driver: 515, CUDA: tried 11.7 and 10.2, PyTorch: tried 1.12 and 1.7.1

    opened by StarsTesla 3
  • Reason on bad results of CLIP-based initialization of image encoder

    This is a question on an interesting report in the paper. The paper reported

    We also evaluated on a model initialized with the CLIP image encoder with the same setup and hyperparameters, but observed worse performance than using the ViT initialization.

    It seems surprising that the CLIP image encoder, which is already well aligned to the text encoder, is not helpful for the task. Do the authors have any guesses about the reason? And was the performance much worse or only a little worse?

    opened by soskek 3
  • System requirements (GPU?)

    Hello,

    This is great work @Boyiliee ! I'm excited to try this out.

    I have a quick question: what kind of system requirements are necessary to train and run inference on this model? Specifically I am wondering about the type of GPU(s) needed to train LSeg.

    opened by DanielTakeshi 3
  • RuntimeError: Trying to backward through the graph a second time

    Hi, thanks for your great work! When I tried to add LSegNet into my own framework, there was a RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. My train function is (run on ADE20K):

    def train(self, cur_epoch, optim, train_loader, scheduler=None, print_int=10, logger=None):
            device = self.device
            model = self.model
            criterion = nn.CrossEntropyLoss(ignore_index=-1)
            model.train()
            for cur_step, (images, labels) in enumerate(train_loader):
                images = images.to(device, dtype=torch.float32)
                labels = labels.to(device, dtype=torch.long)
                optim.zero_grad()
                outputs = model(images, labelset='')
                loss = criterion(outputs, labels)
                self.scaler.scale(loss)
                loss.backward()
                optim.step()
                if scheduler is not None:
                    scheduler.step()
    

    The model is LSegNet and I didn't modify lseg_net.py. I think maybe some optimizations have been made by PyTorch Lightning. Could you give me some suggestions? Thank you!

    opened by zhengyuan-xie 2
  • How to train the zero-shot model?

    Hi! Thanks for your interesting work! I have been trying to reproduce the zero-shot experiments in the paper, but like https://github.com/isl-org/lang-seg/issues/19#issue-1213501618, I get an mIoU much lower than yours.

    Here are my scripts:

    train_lseg_zs.py:

    from modules.lseg_module_zs import LSegModuleZS
    from utils import do_training, get_default_argument_parser
    
    if __name__ == "__main__":
        parser = LSegModuleZS.add_model_specific_args(get_default_argument_parser())
        args = parser.parse_args()
        do_training(args, LSegModuleZS)
    

    command:

    python -u train_lseg_zs.py --backbone clip_resnet101 --exp_name lsegzs_pascal_f0 --dataset pascal \
    --widehead --no-scaleinv --arch_option 0 --ignore_index 255 --fold 0 --nshot 0 --batch_size 8
    

    Default arguments: base_lr=0.004, weight_decay=1e-4, momentum=0.9

    I wonder where the problem is. And could you please share your training scripts for the zero-shot experiment?

    opened by hwanyu112 2
  • Reproduction issue for Table 5

    Hi, I have a reproduction issue for Table 5. In the paper, the LSeg with ViT-B/32 backbone achieves 79.7 pixAcc and 37.8 mIoU. However, I only get 78.9 pixAcc and 33.7 mIoU by using the released code. The reproduced pixAcc/mIoU are not as expected.

    Our reproduction command is as follows on 8 GPU cards.

    python -u train_lseg.py --dataset ade20k --data_path datasets --batch_size 4 --exp_name lseg_ade20k_b32_240e --base_lr 0.004 --weight_decay 1e-4 --no-scaleinv --max_epochs 240 --widehead --accumulate_grad_batches 2 --backbone clip_vitb32_384

    So what is the reason for the performance gap? I may be missing some detailed settings.

    By the way, I encounter a warning when running the released code. [W reducer.cpp:283] Warning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance. grad.sizes() = [512, 768, 1, 1], strides() = [768, 1, 768, 768]

    Have you ever met this warning? Could the performance gap be caused by it?

    Look forward to your reply.

    opened by nowsyn 2
Owner

Intelligent Systems Lab Org