Controllable Image Synthesis via SegVAE

Yen-Chi Cheng1,2
Hsin-Ying Lee1
Min Sun2
Ming-Hsuan Yang1,3
1University of California, Merced
2National Tsing Hua University
3Google Research

Label-set to semantic map generation. (Top) Given a label-set, our model can generate diverse and realistic semantic maps. Translated RGB images are shown to better visualize the quality of the generated semantic maps. (Bottom) The proposed model enables several flexible real-world image-editing applications.


Flexible user controls are desirable for content creation and image editing. A semantic map is a commonly used intermediate representation for conditional image generation. Compared to operating on raw RGB pixels, the semantic map enables simpler user modification. In this work, we specifically target generating semantic maps given a label-set consisting of desired categories. The proposed framework, SegVAE, synthesizes semantic maps in an iterative manner using a conditional variational autoencoder. Quantitative and qualitative experiments demonstrate that the proposed model can generate realistic and diverse semantic maps. We also apply an off-the-shelf image-to-image translation model to generate realistic RGB images to better assess the quality of the synthesized semantic maps. Furthermore, we showcase several real-world image-editing applications including object removal, object insertion, and object replacement.


Overview. Given a label-set as input, we adopt a VAE to model the multimodal shapes of the semantic maps, and leverage an LSTM to iteratively predict the semantic map of each category, starting from a blank canvas.
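The iterative generation loop can be illustrated with a toy sketch (not the authors' code): starting from a blank canvas, each requested category samples a latent code and decodes a soft mask, and the final semantic map assigns each pixel to its highest-scoring category. The `decode_shape` function below is a hypothetical stand-in for the learned CVAE decoder; a random projection is used only to show that different latent samples yield different shapes (multimodality).

```python
import numpy as np

rng = np.random.default_rng(0)

def decode_shape(category, z, size=8):
    # Toy stand-in for the CVAE decoder: maps (category, latent z) to a soft
    # mask. A real model would use a learned network conditioned on the
    # category and the current canvas; a random projection suffices here.
    w = rng.standard_normal((z.size, size * size))
    logits = z @ w + category
    return 1.0 / (1.0 + np.exp(-logits.reshape(size, size)))

def generate_semantic_map(label_set, size=8, latent_dim=4):
    # Blank canvas: channel 0 is "background"; one channel per requested
    # category, predicted iteratively in sequence.
    canvas = np.zeros((len(label_set) + 1, size, size))
    canvas[0] = 0.5  # weak background prior
    for i, cat in enumerate(label_set, start=1):
        z = rng.standard_normal(latent_dim)      # sample a shape latent
        canvas[i] = decode_shape(cat, z, size)   # predict this category's mask
    return canvas.argmax(axis=0)                 # per-pixel category index

semantic_map = generate_semantic_map([3, 7])
print(semantic_map.shape)  # (8, 8)
```

Sampling a fresh latent per category is what makes repeated runs with the same label-set produce diverse layouts.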



Controllable Image Synthesis via SegVAE

Yen-Chi Cheng, Hsin-Ying Lee, Min Sun, Ming-Hsuan Yang
In European Conference on Computer Vision, 2020.
@inproceedings{cheng2020segvae,
    title={Controllable Image Synthesis via {SegVAE}},
    author={Cheng, Yen-Chi and Lee, Hsin-Ying and Sun, Min and Yang, Ming-Hsuan},
    booktitle={European Conference on Computer Vision},
    year={2020}
}


Multi-modality. We demonstrate the ability of SegVAE to generate diverse results given a label-set on both datasets.

Qualitative comparison. We present the generated semantic maps given label-sets on the CelebAMask-HQ (left) and the HumanParsing (right) datasets. The proposed model generates images with better visual quality compared to other methods. We also present the translated realistic images via SPADE.

Editing. We present three real-world image-editing applications: add, remove, and new style. We show results of all three operations on both datasets.


This template was borrowed from these amazing webpages: Colorful Image Colorization and NeRS.