GroundingBooth: Grounding Text-to-Image Customization

Zhexiao Xiong¹, Wei Xiong², Jing Shi², He Zhang², Yizhi Song³, Nathan Jacobs¹

¹Washington University in St. Louis, ²Adobe, ³Purdue University

We propose GroundingBooth, a framework for grounded text-to-image customization. GroundingBooth supports (a) grounded single-subject customization and (b) joint grounded customization of multiple subjects and text entities. It simultaneously achieves prompt following, layout grounding for both subjects and background objects, and identity preservation of the subjects.

Abstract

Recent approaches in text-to-image customization have primarily focused on preserving the identity of the input subject, but often fail to control the spatial location and size of objects. We introduce GroundingBooth, which achieves zero-shot, instance-level spatial grounding on both foreground subjects and background objects in the text-to-image customization task. Our proposed grounding module and subject-grounded cross-attention layer enable the creation of personalized images with accurate layout alignment, identity preservation, and strong text-image coherence. In addition, our model seamlessly supports personalization with multiple subjects. Our model shows strong results in both layout-guided image synthesis and text-to-image customization tasks.

Multi-concept customization on DreamBench objects. Results show that our method achieves joint foreground-background control with text alignment and identity preservation of foreground objects. Our model seamlessly supports the customization of multiple subjects. Even when the bounding boxes of the foreground objects have a large overlap with the background text entities, the model can distinguish subject-driven foreground generation from text-driven background generation.

Visual comparison with existing methods on DreamBench objects for the single-subject customization task. Previous customization methods without grounding tend to generate very large subjects placed at the center of the image, which benefits their CLIP-I and DINO scores during evaluation. In real-world scenarios, however, users may want to control the size of the subject flexibly, for example to generate a larger background that carries more of the textual content. In such cases, non-grounding customization methods cannot generate the desired result. The visual results show that our method achieves better identity preservation together with accurate layout alignment.
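For context, CLIP-I and DINO scores are typically computed as the average pairwise cosine similarity between embeddings of the generated images and embeddings of the reference images. The sketch below assumes the embeddings have already been extracted by some image encoder. The toy vectors and the function names `cosine_similarity` and `mean_pairwise_similarity` are illustrative placeholders, not the authors' evaluation code.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(generated, references):
    """Average cosine similarity over all (generated, reference) pairs."""
    scores = [cosine_similarity(g, r) for g in generated for r in references]
    return sum(scores) / len(scores)


# Toy embeddings (hypothetical values standing in for encoder outputs).
generated_embeddings = [[1.0, 0.0]]
reference_embeddings = [[1.0, 0.0], [0.0, 1.0]]
score = mean_pairwise_similarity(generated_embeddings, reference_embeddings)
```

Because each embedding summarizes the whole image, the score depends on how much of the frame the subject occupies, which is consistent with the observation above that large, centered subjects are favored.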

Visual results of reference-guided image generation on the COCO validation set, conditioned on complex layouts and text entities. Even with complex layouts and text entities as input, our model generates high-quality scenes with precise layout alignment of all objects and regions and accurate identity preservation of the reference object, while maintaining text alignment. Compared with previous layout-to-image generation methods, our model achieves competitive accuracy in grounding visual concepts and a remarkable improvement in identity preservation.

Method

Grounding module of our proposed framework. The grounding module takes both prompt-layout pairs and reference object-layout pairs as input. For the foreground reference object, both the CLIP text token and the DINOv2 image class token are used.
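The caption does not spell out how a layout is fused with its feature tokens. A common choice in layout-conditioned diffusion models is to encode each bounding box with Fourier features and attach the result to the entity's feature vector. The following is a minimal sketch under that assumption. The function names, the number of frequencies, the use of plain concatenation, and the toy feature values are all hypothetical, and a real model would typically pass the result through a learned projection to a shared token width.

```python
import math


def fourier_embed_box(box, num_freqs=4):
    """Map a normalized box (x0, y0, x1, y1) in [0, 1] to sin/cos features."""
    features = []
    for coord in box:
        for k in range(num_freqs):
            angle = (2.0 ** k) * math.pi * coord
            features.extend([math.sin(angle), math.cos(angle)])
    return features


def grounding_token(box, *features):
    """Concatenate one or more feature vectors with the box embedding.

    Concatenation is an assumption made for illustration; the fusion used
    in the actual grounding module may differ.
    """
    token = []
    for feature in features:
        token.extend(feature)
    token.extend(fourier_embed_box(box))
    return token


# Background text entity: a (toy) text feature paired with its box.
text_entity_token = grounding_token((0.0, 0.5, 1.0, 1.0), [0.1, 0.2])

# Foreground reference object: a (toy) text feature and a (toy) image
# class-token feature, both paired with the subject's box.
subject_token = grounding_token(
    (0.2, 0.2, 0.6, 0.7), [0.3, 0.4], [0.5, 0.6, 0.7]
)
```

In this sketch the two tokens have different lengths (34 and 37 values), which is why a projection layer to a common dimension would be needed before they are fed to the network.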

Pipeline of our proposed masked cross-attention. Q, K, and V are the image query, key, and value, respectively, and A is the affinity matrix.
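To make the masking idea concrete, here is a minimal sketch of a cross-attention step in which each query position may only attend to the tokens of a subject whose bounding box covers that position. It assumes that the affinity is the scaled dot product of queries and keys, that the mask is derived from the subject's normalized box on the latent grid, and that positions outside every box receive a zero update. These are illustrative assumptions and the helper names are hypothetical, so this is not the authors' implementation.

```python
import math


def box_mask(height, width, box):
    """Flat list of booleans, True where a cell centre lies inside the box."""
    x0, y0, x1, y1 = box
    mask = []
    for row in range(height):
        for col in range(width):
            cx = (col + 0.5) / width
            cy = (row + 0.5) / height
            mask.append(x0 <= cx <= x1 and y0 <= cy <= y1)
    return mask


def masked_cross_attention(queries, keys, values, allowed):
    """queries: N x d, keys: M x d, values: M x dv, allowed: N x M booleans."""
    d = len(queries[0])
    dv = len(values[0])
    outputs = []
    for q, allow_row in zip(queries, allowed):
        # Affinity row: scaled dot products, None where attention is masked.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) if ok else None
            for k, ok in zip(keys, allow_row)
        ]
        live = [s for s in scores if s is not None]
        if not live:
            # Position lies outside every subject box: no update.
            outputs.append([0.0] * dv)
            continue
        peak = max(live)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) if s is not None else 0.0 for s in scores]
        total = sum(weights)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) / total
             for j in range(dv)]
        )
    return outputs


# Toy example: a 2x2 latent grid and one subject token whose box covers
# the left half of the image.
mask = box_mask(2, 2, (0.0, 0.0, 0.5, 1.0))
queries = [[1.0, 0.0]] * 4
keys = [[1.0, 0.0]]
values = [[2.0, 3.0]]
allowed = [[inside] for inside in mask]
result = masked_cross_attention(queries, keys, values, allowed)
```

In the toy example only the two left-hand grid cells receive the subject's value vector, and the right-hand cells are left at zero, which illustrates how a mask of this kind can confine a subject's features to its box.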

More Results

More results on complex scene generation on the COCO validation set.