GroundingBooth: Grounding Text-to-Image Customization

1Washington University in St. Louis, 2Adobe, 3Purdue University

We propose GroundingBooth, a framework for grounded text-to-image customization. GroundingBooth supports (a) grounded single-subject customization and (b) joint grounded customization of multiple subjects and text entities, achieving joint grounding of subject-driven foreground generation and text-driven background generation with identity preservation and text-image alignment.

Abstract

Recent studies in text-to-image customization show great success in generating personalized object variants given several images of a subject. While existing methods focus on preserving the subject's identity, they often fall short of controlling the spatial relationship between objects. In this work, we introduce GroundingBooth, a framework that achieves zero-shot instance-level spatial grounding of both foreground subjects and background objects in the text-to-image customization task. Our proposed text-image grounding module and masked cross-attention layer allow us to generate personalized images with accurate layout alignment and identity preservation while maintaining text-image coherence. With such layout control, our model inherently enables the customization of multiple subjects at once. We evaluate our model on both layout-guided image synthesis and reference-based customization tasks, showing strong results compared to existing methods. Ours is the first work to achieve joint grounding of both subject-driven foreground generation and text-driven background generation.

Method


An overview of our proposed framework. It consists of two steps: (1) Feature extraction. We use a CLIP text encoder and a DINOv2 image encoder to extract text and image embeddings, respectively, and our proposed grounding module to extract the grounding tokens. (2) Foreground-background cross-attention control in each transformer block of the U-Net. During training, we use a dataset with a single reference object per image. At inference time, the pipeline supports feature injection of multiple reference objects through copied masked cross-attention layers. Our work is the first to introduce precise grounding into customized image synthesis, jointly controlling the size and location of both the image-driven foreground objects and the text-driven background, adaptively harmonizing the poses of the reference objects while faithfully preserving their identity.
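For concreteness, here is a minimal sketch of step (1) using the Hugging Face transformers library. The specific checkpoints (openai/clip-vit-large-patch14, facebook/dinov2-base) and the helper name extract_features are assumptions for illustration; the paper page does not pin them.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel, AutoImageProcessor, Dinov2Model

# Assumed checkpoints; swap in whatever encoders the released code uses.
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_txt = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
dino_proc = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
dino = Dinov2Model.from_pretrained("facebook/dinov2-base")

@torch.no_grad()
def extract_features(prompt, reference_image):
    # Text embeddings from the CLIP text encoder (padded to CLIP's 77 tokens).
    tokens = clip_tok(prompt, padding="max_length", truncation=True, return_tensors="pt")
    text_emb = clip_txt(**tokens).last_hidden_state      # (1, 77, 768)

    # Image embeddings from DINOv2; index 0 of the sequence is the class token.
    pixels = dino_proc(images=reference_image, return_tensors="pt")
    img_tokens = dino(**pixels).last_hidden_state        # (1, 1 + num_patches, 768)
    img_cls = img_tokens[:, 0]                           # (1, 768) class token
    return text_emb, img_tokens, img_cls
```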


Grounding module of our proposed framework. The grounding module takes both prompt-layout pairs and reference object-layout pairs as input. For the foreground reference object, both the CLIP text token and the DINOv2 image class token are used.
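The figure does not specify how each box is fused with its semantic token. The sketch below assumes a GLIGEN-style Fourier embedding of the normalized box coordinates, concatenated with the token and projected by an MLP; the names fourier_embed and GroundingTokenizer, the frequency count, and the layer widths are all illustrative.

```python
import math
import torch
import torch.nn as nn

def fourier_embed(boxes, n_freq=8):
    """Sine/cosine encoding of normalized (x1, y1, x2, y2) boxes.
    boxes: (N, 4) in [0, 1]  ->  (N, 4 * 2 * n_freq)."""
    freqs = (2.0 ** torch.arange(n_freq, device=boxes.device)) * math.pi
    ang = boxes.unsqueeze(-1) * freqs                 # (N, 4, n_freq)
    return torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(1)

class GroundingTokenizer(nn.Module):
    """Fuses a semantic token with its box embedding into a grounding token.
    For background entities the semantic token is a CLIP text token; for the
    reference object it combines the CLIP text token and DINOv2 class token
    (how they are combined is an assumption here)."""
    def __init__(self, sem_dim=768, n_freq=8, out_dim=768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(sem_dim + 4 * 2 * n_freq, out_dim),
            nn.SiLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, sem_tokens, boxes):
        # sem_tokens: (N, sem_dim), boxes: (N, 4) -> grounding tokens (N, out_dim)
        return self.mlp(torch.cat([sem_tokens, fourier_embed(boxes)], dim=-1))
```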


Pipeline of our proposed masked cross-attention. Q, K, and V denote the image query, key, and value, respectively, and A is the affinity matrix.
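A minimal sketch of one plausible reading of this pipeline: the affinity matrix A = QK^T / sqrt(d) is computed as usual, and each reference object's features are then kept only inside that object's target box. How the mask is applied (here, zeroing the output outside the box) is an assumption, as is the function name.

```python
import math
import torch

def masked_cross_attention(q, k, v, inside_box):
    """
    q: (B, Nq, d) image queries, one per latent spatial location.
    k, v: (B, Nk, d) keys/values from one reference object's tokens.
    inside_box: (B, Nq) bool, True where a query's spatial location
        lies inside that object's target bounding box.
    """
    d = q.shape[-1]
    affinity = q @ k.transpose(-2, -1) / math.sqrt(d)  # A: (B, Nq, Nk)
    out = affinity.softmax(dim=-1) @ v                 # (B, Nq, d)
    # Restrict this object's features to its box region.
    return out * inside_box.unsqueeze(-1).to(out.dtype)
```

Zeroing after the softmax, rather than adding -inf to A, keeps rows whose query lies outside every box numerically well defined, which matters at inference when several copied masked cross-attention layers inject different reference objects.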