Official implementation for: Blended Diffusion for Text-driven Editing of Natural Images.

Last update: Dec 30, 2022

Related tags

Deep Learning blended-diffusion

Overview

Blended Diffusion for Text-driven Editing of Natural Images

Omri Avrahami, Dani Lischinski, Ohad Fried

Abstract: Natural language offers a highly intuitive interface for image editing. In this paper, we introduce the first solution for performing local (region-based) edits in generic natural images, based on a natural language description along with an ROI mask. We achieve our goal by leveraging and combining a pretrained language-image model (CLIP), to steer the edit towards a user-provided text prompt, with a denoising diffusion probabilistic model (DDPM) to generate natural-looking results. To seamlessly fuse the edited region with the unchanged parts of the image, we spatially blend noised versions of the input image with the local text-guided diffusion latent at a progression of noise levels. In addition, we show that adding augmentations to the diffusion process mitigates adversarial results. We compare against several baselines and related methods, both qualitatively and quantitatively, and show that our method outperforms these solutions in terms of overall realism, ability to preserve the background and matching the text. Finally, we show several text-driven editing applications, including adding a new object to an image, removing/replacing/altering existing objects, background replacement, and image extrapolation.
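The blending step described in the abstract can be sketched roughly as follows. This is a toy illustration on flat lists of floats, not the authors' implementation: the function names, the list-based "images", and the `alpha_bar` parameter (the cumulative noise-schedule product from the standard DDPM formulation) are all assumptions made for the example.

```python
import math
import random


def noise_to_level(x0, alpha_bar, rng):
    """Noise a clean signal x0 to the level given by alpha_bar (DDPM forward process)."""
    keep = math.sqrt(alpha_bar)
    spread = math.sqrt(1.0 - alpha_bar)
    return [keep * v + spread * rng.gauss(0.0, 1.0) for v in x0]


def blend(foreground, background, mask):
    """Per-pixel blend: foreground where mask is 1, background where mask is 0."""
    return [m * f + (1.0 - m) * b for f, b, m in zip(foreground, background, mask)]


rng = random.Random(0)
input_image = [0.2, 0.4, 0.6, 0.8]    # hypothetical 4-pixel "image"
guided_latent = [1.0, 1.0, 1.0, 1.0]  # stand-in for the text-guided latent at this step
mask = [0.0, 1.0, 1.0, 0.0]           # ROI covers the two middle pixels

# Outside the mask, the latent is replaced by a noised copy of the input,
# so the background stays tied to the original image at every noise level.
background = noise_to_level(input_image, alpha_bar=0.5, rng=rng)
blended = blend(guided_latent, background, mask)
```

In a full sampler this blend would be repeated at every denoising step with the noise level of that step; the single call above shows only one such step.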

Applications

Multiple synthesis results for the same prompt

Synthesis results for different prompts

Altering part of an existing object

Background replacement

Scribble-guided editing

Text-guided extrapolation

Composing several applications

Code availability

Full code will be released soon.

Comments

Question about training
Hi, this is really impressive work! Two questions here.

Does the overall text-guided image editing process use only pre-trained models, without any extra training or fine-tuning?

If it does not require any further fine-tuning or training, what is the purpose of the diffusion guidance loss (which combines the CLIP loss and the background preservation loss)?

Thanks in advance for your clarification!
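One possible reading of this question, offered as an assumption rather than the authors' answer: a loss can be useful without any training when its gradient is taken with respect to the sample being generated instead of the network weights. The toy sketch below shows that idea only. The names `guided_update` and `l2_grad` are hypothetical, and a simple squared-error loss stands in for the CLIP and background preservation terms.

```python
def guided_update(x, loss_grad, scale):
    """Move the sample x against the gradient of a guidance loss.

    Only x changes; there are no model weights here to train.
    """
    return [xi - scale * gi for xi, gi in zip(x, loss_grad(x))]


def l2_grad(target):
    """Gradient of sum((x - target)^2) with respect to x (stand-in for a real guidance loss)."""
    return lambda x: [2.0 * (xi - ti) for xi, ti in zip(x, target)]


x = [0.0, 0.0]
grad_fn = l2_grad([1.0, -1.0])
for _ in range(50):
    x = guided_update(x, grad_fn, scale=0.1)
```

After the loop, `x` has moved close to the target even though nothing resembling a parameter update took place.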
opened by JacksonCakes 4
model_output_size

in image_editor.py:

    self.model.load_state_dict(
        torch.load(
            "checkpoints/256x256_diffusion_uncond.pt"
            if self.args.model_output_size == 256
            else "checkpoints/512x512_diffusion.pt",
            map_location="cpu",
        )
    )

This mentions '512x512_diffusion'. Is it a conditional or an unconditional model? (Could you share its download link?) It seems natural for your method to be built on a pretrained unconditional model. If '512x512_diffusion' is a conditional model, what is the condition for face data, for example? Can you help me figure this out?

opened by fido20160817 2
Scribble-guided editing

Hi! I wonder if a loss such as MSE or LPIPS is used between the user-provided scribbles and the scribbled regions of $\widehat{x}_0$, in addition to the CLIP loss. I am curious how the shapes and colors stay consistent when the only input is text with no specific description, e.g., "blanket" in Fig. 9.

opened by wileewang 2
AttributeError: 'PosixPath' object has no attribute 'with_stem'

Thanks for releasing your code. I really appreciate it.

I tried to run your code in Google Colab with a GPU runtime.

I got an error, and I couldn't find a solution despite googling.

The error message is as follows: AttributeError: 'PosixPath' object has no attribute 'with_stem'

I think it's related to the "pathlib" module. Maybe pathlib doesn't work well with the Python 3.x version I'm using.

I hope this error will be solved soon QQ

-----detail description-------

I ran the following command:

python main.py -p "rock" -i "input_example/img.png" --mask "input_example/mask.png" --output_path "output" --batch_size 1

And I got the following output:

Using device: cuda:0
tcmalloc: large alloc 2209964032 bytes == 0x89ebe000 @ 0x7fbdefa2cb6b 0x7fbdefa4c379 0x7fbd30b1026e 0x7fbd30b119e2 0x7fbd334aeee1 0x7fbdd598e236 0x7fbdd541ef98 0x593784 0x594731 0x548cc1 0x51566f 0x549e0e 0x593fce 0x5118f8 0x549e0e 0x4bcb19 0x59582d 0x595b69 0x62026d 0x55de15 0x59af67 0x515655 0x549e0e 0x4bca8a 0x5134a6 0x549576 0x593fce 0x548ae9 0x5127f1 0x4bc98a 0x532b86
Setting up [LPIPS] perceptual loss: trunk [vgg], v[0.1], spatial [off]
Loading model from: /usr/local/lib/python3.7/dist-packages/lpips/weights/v0.1/vgg.pth
Start iterations 0
0% 0/75 [00:00<?, ?it/s]clip_loss - 867.99 range_loss - 0.00

Traceback (most recent call last):
  File "main.py", line 8, in <module>
    image_editor.edit_image_by_prompt()
  File "/content/drive/MyDrive/ws/blended-diffusion/optimization/image_editor.py", line 266, in edit_image_by_prompt
    visualization_path = visualization_path.with_stem(
AttributeError: 'PosixPath' object has no attribute 'with_stem'
0% 0/75 [00:01<?, ?it/s]

The code below is where the error occurs. It is from the file blended-diffusion/optimization/image_editor.py.

line 1 - from pathlib import Path
...
line 261 - for b in range(self.args.batch_size):
line 262 -     pred_image = sample["pred_xstart"][b]
line 263 -     visualization_path = Path(
line 264 -         os.path.join(self.args.output_path, self.args.output_file)
line 265 -     )
line 266 -     visualization_path = visualization_path.with_stem(
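For context, `Path.with_stem` was added in Python 3.9, and the log above loads packages from a `python3.7` directory, so the method is likely missing because the interpreter is older, not newer. A minimal workaround sketch follows. The helper name `with_stem_compat` and the example file names are hypothetical, and the actual argument the repository passes to `with_stem` is not shown in the issue.

```python
from pathlib import Path


def with_stem_compat(path: Path, stem: str) -> Path:
    """Return `path` with its stem replaced, also on Python versions before 3.9."""
    if hasattr(path, "with_stem"):  # available from Python 3.9 onward
        return path.with_stem(stem)
    # Fallback: rebuild the file name from the new stem and the old suffix.
    return path.with_name(stem + path.suffix)


# Hypothetical file names, for illustration only.
new_path = with_stem_compat(Path("output/output.png"), "output_i_0_b_0")
```

Running the code on Python 3.9 or later would be the other straightforward option, since the original call then works unchanged.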

opened by ngys321 1
Purpose of skip_timesteps

Hi, I would like to ask what the purpose of skip_timesteps in the code is. I can't seem to find any related information on this in the paper.
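A possible reading, offered as an assumption rather than the authors' answer: in guided-diffusion-style samplers, a parameter with this name usually drops the noisiest steps, so that sampling starts from a partially noised copy of the input image instead of pure noise. The sketch below shows only the resulting step schedule. The function name `reverse_timesteps` is hypothetical, and the values 100 and 25 are assumed for illustration; they were chosen because they yield 75 steps, which matches the `0/75` progress bar in the log quoted in the issue above.

```python
def reverse_timesteps(num_timesteps, skip_timesteps):
    """Timesteps visited by the reverse process when the noisiest ones are skipped."""
    return list(range(num_timesteps - skip_timesteps))[::-1]


# Assumed values: 100 total steps, 25 skipped.
steps = reverse_timesteps(num_timesteps=100, skip_timesteps=25)
```

Under this reading, a larger skip value would keep the result closer to the input image, because less of the original signal is destroyed before denoising begins.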

opened by JacksonCakes 1



Minimal diffusion models - Minimal code and simple experiments to play with Denoising Diffusion Probabilistic Models (DDPMs)

An image base contains 490 images for learning (400 cars and 90 boats), and another 21 images for testing

Commonality in Natural Images Rescues GANs: Pretraining GANs with Generic and Privacy-free Synthetic Data - Official PyTorch Implementation (CVPR 2022)

A Repository of Community-Driven Natural Instructions

Flybirds - BDD-driven natural language automated testing framework, present by Trip Flight

Pytorch implementation of "Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech"

PyTorch Implementation of DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs

Official PyTorch implementation for FastDPM, a fast sampling algorithm for diffusion probabilistic models

Official implementation for "Style Transformer for Image Inversion and Editing" (CVPR 2022)

Official PyTorch implementation of "IntegralAction: Pose-driven Feature Integration for Robust Human Action Recognition in Videos", CVPRW 2021

A Python script that creates subtitles of a given length from text paragraphs that can be easily imported into any Video Editing software such as FinalCut Pro for further adjustments.

InDuDoNet+: A Model-Driven Interpretable Dual Domain Network for Metal Artifact Reduction in CT Images

(ICCV 2021) Official code of "Dressing in Order: Recurrent Person Image Generation for Pose Transfer, Virtual Try-on and Outfit Editing."

Official code release for: EditGAN: High-Precision Semantic Image Editing

[SIGGRAPH 2022 Journal Track] AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars

Pytorch Implementation of DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis (TTS Extension)

Implementation of Retrieval-Augmented Denoising Diffusion Probabilistic Models in Pytorch

Implementation of GeoDiff: a Geometric Diffusion Model for Molecular Conformation Generation (ICLR 2022).

Pytorch re-implementation of Paper: SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition (CVPR 2022)