Learning to Compose Visual Relations

NeurIPS 2021 (Spotlight)

Nan Liu1*   Shuang Li2*   Yilun Du2*   Joshua B. Tenenbaum2   Antonio Torralba2 (* indicates equal contribution)
University of Michigan    MIT CSAIL




Interactive Demo


Text prompt: A small gray rubber cube a small purple metal cube
Text prompt: A maple wood cabinet a blue fabric couch

Abstract

The visual world around us can be described as a structured set of objects and their associated relations. An image of a room may be conjured given only the description of the underlying objects and their associated relations. While there has been significant work on designing deep neural networks which may compose individual objects together, less work has been done on composing the individual relations between objects. A principal difficulty is that while the placement of objects is mutually independent, their relations are entangled and dependent on each other. To circumvent this issue, existing works primarily compose relations by utilizing a holistic encoder, in the form of text or graphs. In this work, we instead propose to represent each relation as an unnormalized density (an energy-based model), enabling us to compose separate relations in a factorized manner. We show that such a factorized decomposition allows the model to both generate and edit scenes that have multiple sets of relations more faithfully. We further show that decomposition enables our model to effectively understand the underlying relational scene structure.
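In equation form (our notation, intended only as a sketch of the factorization rather than the paper's exact training objective), assigning each relation r_i a learned energy E_\theta(x \mid r_i) lets the conditional scene distribution be composed as a product of per-relation unnormalized densities:

    p(x \mid r_1, \dots, r_n) \;\propto\; \prod_{i=1}^{n} p(x \mid r_i) \;\propto\; \exp\Big(-\sum_{i=1}^{n} E_\theta(x \mid r_i)\Big)

Generating or editing a scene then amounts to finding images with low summed energy, which is why new combinations of relations can be handled by simply adding more energy terms.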


Paper

Learning to Compose Visual Relations
Nan Liu1*   Shuang Li2*   Yilun Du2*   Joshua B. Tenenbaum2   Antonio Torralba2
(* indicates equal contribution)
NeurIPS 2021 (Spotlight) [Code] [Paper]


Model Overview



Image Generation Results


Image generation results on the CLEVR dataset. Images are generated from 1-4 relational descriptions. Note that the models are trained on a single relational description, so the composed scene relations (2, 3, and 4 relational descriptions) are outside the training distribution.
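As a rough illustration of how this test-time composition can be implemented (a minimal sketch, not the released code; langevin_sample, energy_fns, and the hyperparameters below are placeholder names and values of our choosing), the per-relation energies are summed and an image is sampled with Langevin-style gradient updates:

    import torch

    def langevin_sample(energy_fns, img_shape=(3, 64, 64), steps=60,
                        step_size=10.0, noise_scale=0.005):
        # energy_fns: one callable per relational description; each maps an image
        # batch of shape (B, C, H, W) to a per-image energy of shape (B,).
        x = torch.rand(1, *img_shape)                     # start from random noise
        for _ in range(steps):
            x = x.detach().requires_grad_(True)
            energy = sum(e(x).sum() for e in energy_fns)  # factorized composition
            grad, = torch.autograd.grad(energy, x)
            x = x - 0.5 * step_size * grad                # step toward lower energy
            x = (x + noise_scale * torch.randn_like(x)).clamp(0, 1)
        return x.detach()

Composing more relations only means passing a longer list of energy functions, e.g. four callables for a four-relation description.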


Image generation results on the iGibson dataset. Note that the models are trained on a single relational description, so the composed scene relations (2 relational descriptions) are outside the training distribution.


Image generation results on the real-world Blocks dataset. Note that the models are trained on a single relational description, so the composed scene relations (2 and 3 relational descriptions) are outside the training distribution.




Image Editing Results


Image editing results on the CLEVR dataset. Left: image editing results based on a single relational scene description. Right: image editing results based on two composed relational scene descriptions. Note that the composed scene relations on the right are outside the training distribution, yet our approach still edits the images accurately.
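One natural way to realize such editing with composed energies (a hedged sketch under our assumptions, not necessarily the exact procedure used in the paper) is to run the same gradient-based sampler but initialize it at the input image instead of at noise, so the sampler nudges the existing scene toward configurations satisfying the specified relations:

    import torch

    def edit_image(x_init, energy_fns, steps=30, step_size=10.0, noise_scale=0.005):
        # x_init: input image of shape (1, C, H, W) with values in [0, 1].
        # energy_fns: one energy function per relational description to enforce.
        x = x_init.clone()
        for _ in range(steps):
            x = x.detach().requires_grad_(True)
            energy = sum(e(x).sum() for e in energy_fns)  # composed relational energies
            grad, = torch.autograd.grad(energy, x)
            x = x - 0.5 * step_size * grad
            x = (x + noise_scale * torch.randn_like(x)).clamp(0, 1)
        return x.detach()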




Image-Text Retrieval Results


We evaluate whether our model understands different relational scene descriptions via image-to-text retrieval. We compare our approach with pretrained CLIP and fine-tuned CLIP, showing each model's top-1 retrieved relational description for a given image query.
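Because an energy model directly scores how well an image matches a relational description, image-to-text retrieval can be sketched as returning the candidate description with the lowest energy for the query image (illustrative only; energy_model and its calling convention are placeholders, not the released API):

    import torch

    def retrieve_description(image, candidate_descriptions, energy_model):
        # energy_model(image, description) -> scalar energy; lower means a better match.
        with torch.no_grad():
            energies = torch.stack([energy_model(image, d) for d in candidate_descriptions])
        return candidate_descriptions[int(energies.argmin())]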