First off, congratulations on this amazing work. I think you have closed the remaining gap needed to make generative deep learning relevant for real-world applications, rather than just a nice toy as in previous work in this area.
However, to truly judge the performance of your approach, I was a bit disappointed that the paper contains not a single note on execution time, either for training or, more crucially, for sampling a single final image.
Would you be able to provide some numbers on how long it takes to generate a single 4kx1k image with a 256^2 patch size, and on which hardware setup?
Also, if possible, could you shed some light on the training times and the setup that was used?
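For reference, here is a minimal sketch of the kind of measurement I am asking about; `sample_fn` is a hypothetical stand-in for your model's sampling call, not your actual API:

```python
import time

def benchmark_sampling(sample_fn, n_runs=3):
    """Time a sampling function over several runs and return the average in seconds."""
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        sample_fn()
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)

# Hypothetical placeholder for the model's sampling routine.
# A 4096x1024 image tiled into 256x256 patches would require
# (4096 // 256) * (1024 // 256) = 64 patch generations.
def dummy_sample():
    n_patches = (4096 // 256) * (1024 // 256)
    return n_patches

avg = benchmark_sampling(dummy_sample)
print(f"average sampling time over 3 runs: {avg:.4f} s")
```

Even a rough number from such a wall-clock measurement, together with the GPU/CPU used, would make the practical cost of your approach much clearer.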
Thank you!