A Unified 3D Transformer Pipeline for Visual Synthesis

Overview

This is the official repo for the paper "NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion" (https://arxiv.org/abs/2111.12417).

NÜWA is a unified multimodal pre-trained model that can generate new or manipulate existing visual data (i.e., images and videos) across 8 visual synthesis tasks (listed in the Samples section below).
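
The unifying idea, as described in the paper (and quoted in the Comments section below), is that images and videos share one 3D token representation: each frame is encoded by a 2D VQ-GAN into a grid of discrete tokens, an image is simply the single-frame case, and the 3D transformer then attends over height, width, and time for all eight tasks. The following is a minimal sketch of that per-frame tokenization, not the released NÜWA code; `ToyFrameTokenizer`, its one-layer convolutional encoder, and the codebook size are illustrative stand-ins for a trained VQ-GAN.

```python
# Minimal sketch (not the official NÜWA code): images and videos are both turned
# into a 3D grid of discrete tokens by encoding each frame with a 2D VQ-GAN-style
# encoder. The encoder and codebook below are toy stand-ins for the real VQ-GAN.
import torch
import torch.nn as nn

class ToyFrameTokenizer(nn.Module):
    """Encodes one RGB frame into a grid of discrete codebook indices."""
    def __init__(self, codebook_size=1024, embed_dim=64, downsample=8):
        super().__init__()
        # One strided conv stands in for the VQ-GAN encoder's downsampling stack.
        self.encoder = nn.Conv2d(3, embed_dim, kernel_size=downsample, stride=downsample)
        self.codebook = nn.Embedding(codebook_size, embed_dim)

    def forward(self, frames):                      # frames: (N, 3, H, W)
        z = self.encoder(frames)                    # (N, D, H', W')
        z = z.permute(0, 2, 3, 1)                   # (N, H', W', D)
        # Nearest-codebook-entry quantization.
        dists = torch.cdist(z.reshape(-1, z.shape[-1]), self.codebook.weight)
        tokens = dists.argmin(dim=-1)               # (N * H' * W',)
        return tokens.reshape(z.shape[:3])          # (N, H', W')

def to_3d_tokens(tokenizer, clip):
    """clip: (T, 3, H, W); an image is just the T == 1 case."""
    return tokenizer(clip)                          # (T, H', W') token volume

tokenizer = ToyFrameTokenizer()
image = torch.rand(1, 3, 256, 256)                  # single image, T = 1
video = torch.rand(10, 3, 256, 256)                 # 10-frame clip
print(to_3d_tokens(tokenizer, image).shape)         # torch.Size([1, 32, 32])
print(to_3d_tokens(tokenizer, video).shape)         # torch.Size([10, 32, 32])
```

The point of the sketch is only the shape convention: both data types end up as a (time, height, width) token volume, so one transformer pipeline can condition on and generate either.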

Samples

Sample outputs are provided for each of the eight supported tasks:

  • Text-to-Image (T2I)
  • Sketch-to-Image (S2I)
  • Image Completion (I2I)
  • Text-Guided Image Manipulation (TI2I)
  • Text-to-Video (T2V)
  • Video Prediction (V2V)
  • Sketch-to-Video (S2V)
  • Text-Guided Video Manipulation (TV2V)

Comments
  • Paper - Possible (minor) error

    Quoting the paper: "In this paper, we show that simply using 2D VQ-GAN to encode each frame of a video can also generate temporal consistency videos and at the same time benefit from both image and video data."

    In the paper, I believe you mean "temporally consistent" here. Subtle change in wording.

    opened by afiaka87 2
  • [Documentation] Video Prediction Labeled as a V2V process, despite taking only 1 frame

    Judging by the results, the transformer takes in a single frame, so this would be considered an image-to-video process. Something like video inpainting or camera FOV extrapolation (like in FGVC) would be input video -> output video. Am I missing something in the documentation that shows it as some sort of sparse video interpolation that can take more than a (D1, D2, single frame) input, or was it called V2V to match the I2I label on the inpainting/image-completion counterparts?

    Additionally, there isn't a direct link to the paper, which documents that the V2V model only takes in a single image. https://arxiv.org/abs/2111.12417

    opened by Sazoji 2
  • This repo is missing important files

    There are important files that all Microsoft projects should have that are not present in this repository. A pull request has been opened to add the missing file(s). When the PR is merged, this issue will be closed automatically.

    Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.

    Merge this pull request

    opened by microsoft-github-policy-service[bot] 1
  • Adding Microsoft SECURITY.MD

    Please accept this contribution adding the standard Microsoft SECURITY.MD file to help the community understand the security policy and how to safely report security issues. GitHub uses the presence of this file to light up security reminders and a link to the file. This pull request commits the latest official SECURITY.MD file from https://github.com/microsoft/repo-templates/blob/main/shared/SECURITY.md.

    Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.

    opened by microsoft-github-policy-service[bot] 0
  • This repo is missing a license file

    This repository is currently missing a LICENSE.MD file outlining its license. A license helps users understand how to use your project in a compliant manner. You can find the standard MIT license text at the Microsoft repo templates LICENSE file: https://github.com/microsoft/repo-templates/blob/main/shared/LICENSE. If you would like to learn more about open source licenses, please visit the document at https://aka.ms/license.

    opened by microsoft-github-policy-service[bot] 1
  • For T2V, are the 10 frames evenly sampled from the video or the first 10 frames in the video?

    Thank you for your excellent work! From the paper, I know that you sample 10 frames from a 2.5 FPS video. How many frames does each video in the dataset contain? And are the 10 frames sampled evenly across the video, or are they the first 10 frames? (Both strategies are sketched right after this comments list.)

    opened by 962858249 0
  • Missing performance numbers in the paper

    First off, congratulations on this amazing work. I think you have managed to close the gap needed to make generative deep learning relevant for real-world applications, rather than just a nice toy like previous work in this area.

    However, to truly judge the performance of your approach, I have to say I was a bit disappointed that, after reading your paper, there was not a single note on execution time, either for training or, more crucially, for sampling a single final image.

    Would you be able to provide some numbers on how long it takes to generate a 4k x 1k image with a 256^2 patch size, and on which setup?

    Also, if possible, could you shed some light on training times and the setup that was used?

    Thank you!

    opened by Mut1nyJD 0
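
Regarding the frame-sampling question above: the paper says 10 frames are sampled from each video at 2.5 FPS, but it does not state how those frames are chosen. The snippet below is only an illustration of the two strategies the question contrasts (frames spaced evenly across the clip versus the first 10 frames); it is not taken from the NÜWA codebase, and which strategy the authors actually used is not confirmed here.

```python
# Illustration only: the two frame-selection strategies discussed in the question above.
# Neither is confirmed as the strategy used by NÜWA.
import numpy as np

def sample_evenly(num_frames_in_video: int, num_samples: int = 10) -> np.ndarray:
    """Indices of frames spaced evenly across the whole video."""
    return np.linspace(0, num_frames_in_video - 1, num_samples).round().astype(int)

def sample_first(num_frames_in_video: int, num_samples: int = 10) -> np.ndarray:
    """Indices of the first `num_samples` frames."""
    return np.arange(min(num_samples, num_frames_in_video))

print(sample_evenly(250))  # [  0  28  55  83 111 138 166 194 221 249]
print(sample_first(250))   # [0 1 2 3 4 5 6 7 8 9]
```
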
Owner

Microsoft: Open source projects and samples from Microsoft
Implementation of "Distribution Alignment: A Unified Framework for Long-tail Visual Recognition"(CVPR 2021)

Implementation of "Distribution Alignment: A Unified Framework for Long-tail Visual Recognition"(CVPR 2021)

null 105 Nov 7, 2022
Alex Pashevich 62 Dec 24, 2022
VD-BERT: A Unified Vision and Dialog Transformer with BERT

VD-BERT: A Unified Vision and Dialog Transformer with BERT PyTorch Code for the following paper at EMNLP2020: Title: VD-BERT: A Unified Vision and Dia

Salesforce 44 Nov 1, 2022
Code for the ICCV 2021 Workshop paper: A Unified Efficient Pyramid Transformer for Semantic Segmentation.

Unified-EPT Code for the ICCV 2021 Workshop paper: A Unified Efficient Pyramid Transformer for Semantic Segmentation. Installation Linux, CUDA>=10.0,

null 29 Aug 23, 2022
VSR-Transformer - This paper proposes a new Transformer for video super-resolution (called VSR-Transformer).

VSR-Transformer By Jiezhang Cao, Yawei Li, Kai Zhang, Luc Van Gool This paper proposes a new Transformer for video super-resolution (called VSR-Transf

Jiezhang Cao 225 Nov 13, 2022
PyTorch implementation of Lip to Speech Synthesis with Visual Context Attentional GAN (NeurIPS2021)

Lip to Speech Synthesis with Visual Context Attentional GAN This repository contains the PyTorch implementation of the following paper: Lip to Speech

null 6 Nov 2, 2022
improvement of CLIP features over the traditional resnet features on the visual question answering, image captioning, navigation and visual entailment tasks.

CLIP-ViL In our paper "How Much Can CLIP Benefit Vision-and-Language Tasks?", we show the improvement of CLIP features over the traditional resnet fea

null 310 Dec 28, 2022
Official Pytorch implementation of the paper "Action-Conditioned 3D Human Motion Synthesis with Transformer VAE", ICCV 2021

ACTOR Official Pytorch implementation of the paper "Action-Conditioned 3D Human Motion Synthesis with Transformer VAE", ICCV 2021. Please visit our we

Mathis Petrovich 248 Dec 23, 2022
git git《Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking》(CVPR 2021) GitHub:git2] 《Masksembles for Uncertainty Estimation》(CVPR 2021) GitHub:git3]

Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking Ning Wang, Wengang Zhou, Jie Wang, and Houqiang Li Accepted by CVPR

NingWang 236 Dec 22, 2022
Learning Spatio-Temporal Transformer for Visual Tracking

STARK The official implementation of the paper Learning Spatio-Temporal Transformer for Visual Tracking Hiring research interns for visual transformer

Multimedia Research 484 Dec 29, 2022
So-ViT: Mind Visual Tokens for Vision Transformer

So-ViT: Mind Visual Tokens for Vision Transformer        Introduction This repository contains the source code under PyTorch framework and models trai

Jiangtao Xie 44 Nov 24, 2022
TrTr: Visual Tracking with Transformer

TrTr: Visual Tracking with Transformer We propose a novel tracker network based on a powerful attention mechanism called Transformer encoder-decoder a

趙 漠居(Zhao, Moju) 66 Dec 27, 2022
This is an official implementation for "ResT: An Efficient Transformer for Visual Recognition".

ResT By Qing-Long Zhang and Yu-Bin Yang [State Key Laboratory for Novel Software Technology at Nanjing University] This repo is the official implement

zhql 222 Dec 13, 2022
source code of “Visual Saliency Transformer” (ICCV2021)

Visual Saliency Transformer (VST) source code for our ICCV 2021 paper “Visual Saliency Transformer” by Nian Liu, Ni Zhang, Kaiyuan Wan, Junwei Han, an

null 89 Dec 21, 2022
DPT: Deformable Patch-based Transformer for Visual Recognition (ACM MM2021)

DPT This repo is the official implementation of DPT: Deformable Patch-based Transformer for Visual Recognition (ACM MM2021). We provide code and model

CASIA-IVA-Lab 111 Dec 21, 2022
Code for EmBERT, a transformer model for embodied, language-guided visual task completion.

Code for EmBERT, a transformer model for embodied, language-guided visual task completion.

null 41 Jan 3, 2023
PaddleViT: State-of-the-art Visual Transformer and MLP Models for PaddlePaddle 2.0+

PaddlePaddle Vision Transformers State-of-the-art Visual Transformer and MLP Models for PaddlePaddle ?? PaddlePaddle Visual Transformers (PaddleViT or

null 1k Dec 28, 2022
SPT_LSA_ViT - Implementation for Visual Transformer for Small-size Datasets

Vision Transformer for Small-Size Datasets Seung Hoon Lee and Seunghyun Lee and Byung Cheol Song | Paper Inha University Abstract Recently, the Vision

Lee SeungHoon 87 Jan 1, 2023
[ICCV2021] 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds

3DVG-Transformer This repository is for the ICCV 2021 paper "3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds" Our method "3DV

null 22 Dec 11, 2022