Open-vocabulary Object Segmentation with Diffusion Models


Ziyi Li* 1
Qinye Zhou* 1
Xiaoyun Zhang1
Ya Zhang1, 2
Yanfeng Wang1, 2
Weidi Xie1, 2

1CMIC, Shanghai Jiao Tong University
2Shanghai AI Lab

ICCV 2023



Code [GitHub]

Paper [arXiv]

Cite [BibTeX]




Predictions from our guided text-to-image diffusion model. The model is able to simultaneously generate images and segmentation masks for the corresponding visual objects described in the text prompt, for example, Pikachu, Unicorn, etc.


Abstract

The goal of this paper is to extract the visual-language correspondence from a pre-trained text-to-image diffusion model, in the form of segmentation maps, i.e., simultaneously generating images and segmentation masks for the corresponding visual entities described in the text prompt. We make the following contributions:
(i) we pair the existing Stable Diffusion model with a novel grounding module that can be trained to align the visual and textual embedding spaces of the diffusion model with only a small number of object categories;
(ii) we establish an automatic pipeline for constructing a dataset of {image, segmentation mask, text prompt} triplets to train the proposed grounding module;
(iii) we evaluate open-vocabulary grounding on images generated by the text-to-image diffusion model, and show that the module can segment objects of categories beyond those seen at training time;
(iv) we adopt the augmented diffusion model to build a synthetic semantic segmentation dataset, and show that training a standard segmentation model on such a dataset achieves competitive performance on the zero-shot segmentation (ZS3) benchmark, which opens up new opportunities for adopting powerful diffusion models for discriminative tasks.
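As a rough illustration of the automatic pipeline in (ii), the sketch below wires a Stable Diffusion generation step to an off-the-shelf detector to produce {image, segmentation mask, text prompt} triplets. The checkpoints, the Mask R-CNN detector, the prompt template, and the score threshold are our assumptions for illustration, not necessarily the exact components used in the paper.

```python
# Minimal sketch of the automatic dataset-construction pipeline (assumed checkpoints and thresholds).
import torch
from diffusers import StableDiffusionPipeline
from torchvision.models.detection import maskrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Text-to-image generator (checkpoint is an assumption).
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)

# Off-the-shelf detector standing in for the oracle mask generator.
detector = maskrcnn_resnet50_fpn(weights="DEFAULT").eval().to(device)

def make_triplet(category: str, score_thresh: float = 0.7):
    """Generate one {image, segmentation mask, text prompt} triplet for a category."""
    prompt = f"a photograph of a {category}"          # simple prompt template (assumption)
    image = pipe(prompt).images[0]                    # PIL image

    with torch.no_grad():
        pred = detector([to_tensor(image).to(device)])[0]

    # Keep the most confident instance as the oracle ground-truth mask.
    # (Matching the detected label against the prompted category is omitted for brevity.)
    keep = pred["scores"] > score_thresh
    if keep.sum() == 0:
        return None                                   # discard low-quality generations
    mask = (pred["masks"][keep][0, 0] > 0.5).cpu()    # binary H x W mask
    return {"image": image, "mask": mask, "prompt": prompt}

triplet = make_triplet("dog")
```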




Architecture

Overview of our method. The left figure shows the knowledge induction procedure: we first construct a dataset of synthetic images from the diffusion model, with the corresponding oracle ground-truth masks produced by an off-the-shelf object detector, and then use it to train the open-vocabulary grounding module. The right figure shows the architectural detail of the grounding module, which takes as input the text embeddings of the corresponding entities and the visual features extracted from the diffusion model, and outputs the corresponding segmentation masks. During training, both the diffusion model and the text encoder are kept frozen.
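The exact layer configuration of the grounding module is given in the paper; the following is only a minimal PyTorch sketch of the interface described above, in which projected text embeddings of the entities attend to the frozen diffusion visual features through a small transformer decoder and are then dotted with those features to produce per-entity mask logits. The feature dimensions, number of layers, and fusion design are our assumptions.

```python
import torch
import torch.nn as nn

class GroundingModule(nn.Module):
    """Minimal sketch: entity text embeddings attend to diffusion visual features to predict masks."""

    def __init__(self, visual_dim=1280, text_dim=768, hidden_dim=256, num_layers=3):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)   # project UNet features
        self.text_proj = nn.Linear(text_dim, hidden_dim)       # project text-encoder embeddings
        layer = nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.mask_embed = nn.Linear(hidden_dim, hidden_dim)    # per-entity mask embedding

    def forward(self, visual_feats, text_embeds):
        """
        visual_feats: (B, H*W, visual_dim) features from the frozen diffusion UNet
        text_embeds:  (B, N, text_dim) embeddings of N entities from the frozen text encoder
        returns:      (B, N, H*W) mask logits, one map per entity
        """
        v = self.visual_proj(visual_feats)                     # (B, HW, D)
        q = self.text_proj(text_embeds)                        # (B, N, D)
        q = self.decoder(tgt=q, memory=v)                      # entities attend to visual features
        return torch.einsum("bnd,bpd->bnp", self.mask_embed(q), v)

# Example shapes: batch of 2, a 16x16 feature map, 3 entities.
module = GroundingModule()
logits = module(torch.randn(2, 256, 1280), torch.randn(2, 3, 768))
print(logits.shape)  # torch.Size([2, 3, 256])
```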



Protocol-I: Grounded Generation

Quantitative results for Protocol-I evaluation on grounded generation

Our model is trained on our synthesized training set, which consists of images containing one or two objects from seen categories only, and is tested on our synthesized test set, which consists of images containing one or two objects from both seen and unseen categories. Our model outperforms DAAM by a large margin.
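For reference, segmentation quality against the oracle masks can be quantified with an IoU-based metric; the snippet below is our own minimal sketch of a per-class IoU/mIoU computation, not the paper's evaluation code, and the exact protocol for the table above is described in the paper.

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks of the same shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0

def miou(per_class_masks):
    """per_class_masks: dict mapping category -> list of (predicted, oracle) mask pairs."""
    per_class_iou = {
        c: float(np.mean([iou(p, g) for p, g in pairs]))
        for c, pairs in per_class_masks.items()
    }
    return float(np.mean(list(per_class_iou.values()))), per_class_iou
```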

Qualitative Results

Segmentation results on PASCAL-sim (left) and COCO-sim (right) for seen (motorbike, bottle, backpack, and apple) and unseen (sofa, car, hot dog, and bear) categories. Our grounded generation model achieves segmentation results comparable to the oracle ground truth generated by the off-the-shelf object detector.



Protocol-II: Open-vocabulary Segmentation

Comparison with previous ZS3 methods on PASCAL VOC.

“Seen”, “Unseen”, and “Harmonic” denote the mIoU on seen categories, the mIoU on unseen categories, and their harmonic mean, respectively. The compared ZS3 methods are trained on the PASCAL VOC training set.
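Here the harmonic mean is computed as Harmonic = 2 × Seen × Unseen / (Seen + Unseen), so a method must perform well on both seen and unseen categories to obtain a high score.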

Visualization of zero-shot segmentation results on PASCAL VOC

MaskFormer trained on our synthetic dataset achieves performance comparable to Zegformer (the state-of-the-art zero-shot semantic segmentation method) in segmenting unseen categories, i.e., pottedplant, sofa, and tvmonitor. Note that although MaskFormer has seen these categories during training, the image-segmentation pairs for these categories are generated with our grounding module.



Ablation Study

Effect of the Number of Seen Categories.

We ablate the number of seen categories to further explore the generalisation ability of the proposed grounding module. As shown in the table, the grounding module can generalise to unseen categories even with as few as five seen categories.

Timesteps for Extracting Visual Representation.

We compare the performance of extracting visual representations from Stable Diffusion at different timesteps. The results show that, as the number of remaining denoising steps decreases, i.e., as t goes from 0 to 50, the grounding performance tends to decrease in general; the best result is obtained at t = 5.
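As a sketch of what extracting the visual representation at a given timestep can look like in practice, the snippet below noises an image latent to timestep t and reads intermediate UNet features with forward hooks. This is our own approximation using the diffusers API; the checkpoint, the choice of up-blocks, and the mapping between the paper's step index and the scheduler's training timesteps are all assumptions.

```python
import torch
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)

features = {}
def hook(name):
    def _hook(module, inputs, output):
        features[name] = output                      # store the block's output feature map
    return _hook

# Register hooks on the UNet up-blocks (which blocks to tap is an assumption).
for i, block in enumerate(pipe.unet.up_blocks):
    block.register_forward_hook(hook(f"up_block_{i}"))

@torch.no_grad()
def extract_features(image, prompt, t=5):
    """image: (1, 3, H, W) tensor in [-1, 1] on `device`, H and W multiples of 8.
    Noise the image latent to timestep t and run one UNet pass to collect features."""
    tokens = pipe.tokenizer(prompt, padding="max_length",
                            max_length=pipe.tokenizer.model_max_length,
                            return_tensors="pt").input_ids.to(device)
    text_embeds = pipe.text_encoder(tokens)[0]

    latents = pipe.vae.encode(image).latent_dist.sample() * pipe.vae.config.scaling_factor
    noise = torch.randn_like(latents)
    timestep = torch.tensor([t], device=device)       # approximation of the paper's step index
    noisy = pipe.scheduler.add_noise(latents, noise, timestep)

    features.clear()
    pipe.unet(noisy, timestep, encoder_hidden_states=text_embeds)
    return dict(features)
```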



Acknowledgements

Based on a template by Phillip Isola and Richard Zhang.