CLIP-Mesh: Generating textured meshes from text using pretrained image-text models

A 3D scene composed of objects generated with our method using only text prompts


Abstract

We present a technique for zero-shot generation of a 3D model using only a target text prompt. Without any 3D supervision, our method deforms the control shape of a limit subdivision surface, along with its texture map and normal map, to obtain a 3D asset that corresponds to the input text prompt and can be easily deployed into games or modeling applications. We rely only on a pre-trained CLIP model that compares the input text prompt with differentiably rendered images of our 3D model. While previous works have focused on stylization or required training of generative models, we perform optimization on mesh parameters directly to generate shape, texture, or both. To constrain the optimization so that it produces plausible meshes and textures, we introduce a number of techniques based on image augmentations and a pretrained prior that generates a CLIP image embedding given a text embedding.

Method

Below, we show an overview of our method:

A GIF of our method generating a lamp shade from text prompt 'a lamp shade'
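
To make the overview concrete, here is a minimal sketch of a single optimization step, assuming PyTorch, torchvision, and OpenAI's `clip` package. The `render` function is only a stub standing in for a differentiable renderer such as nvdiffrast; the parameter shapes, augmentations, and hyperparameters are illustrative placeholders, not the exact settings used in the paper.

```python
import torch
import torchvision.transforms as T
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

# Encode the target prompt once; it stays fixed during optimization.
with torch.no_grad():
    text_emb = clip_model.encode_text(clip.tokenize(["a lamp shade"]).to(device))
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Parameters being optimized. In the full method these are the control-mesh
# vertices of a limit subdivision surface plus a texture map and a normal map;
# the shapes below are illustrative placeholders.
vertices = torch.zeros(642, 3, device=device, requires_grad=True)
texture = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)

def render(vertices, texture):
    # Stub for a differentiable renderer (e.g. nvdiffrast): subdivide the
    # control mesh, rasterize it from a randomly sampled camera, and shade it
    # with the texture and normal map. Returning the texture directly keeps
    # this sketch runnable, so here only the texture receives gradients.
    return texture.clamp(0, 1)

# Random augmentations of the rendered views regularize the optimization.
augment = T.Compose([
    T.RandomPerspective(distortion_scale=0.5, p=0.8),
    T.RandomResizedCrop(224, scale=(0.7, 1.0)),
    T.Normalize((0.48145466, 0.4578275, 0.40821073),   # CLIP image statistics
                (0.26862954, 0.26130258, 0.27577711)),
])

opt = torch.optim.Adam([vertices, texture], lr=1e-2)
for step in range(500):
    image = render(vertices, texture)
    batch = torch.cat([augment(image) for _ in range(8)])  # 8 augmented views
    img_emb = clip_model.encode_image(batch)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    loss = 1.0 - (img_emb @ text_emb.T).mean()  # minimize cosine distance
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In the full method, the renderer samples a different camera at every step, which, together with the image augmentations, keeps the optimization from overfitting a single view.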

Here we show animations of the generation process for the prompts "an armchair in the shape of an avocado", "a lamp shade", and "a wooden table":

A GIF of our method generating an armchair from text prompt 'an armchair in the shape of an avocado'
A GIF of our method generating a lamp shade from text prompt 'a lamp shade'
A GIF of our method generating a table from text prompt 'a wooden table'
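
The abstract also mentions a pretrained prior that produces a CLIP image embedding from a text embedding; comparing rendered views against this predicted image embedding, rather than against the raw text embedding, gives a target that lives in the same space as the rendered images. Below is a sketch of how such a prior would slot into the loop above; the `TextToImagePrior` class and its architecture are hypothetical placeholders, not the prior trained in the paper.

```python
import torch

class TextToImagePrior(torch.nn.Module):
    # Hypothetical stand-in for the pretrained prior: a network that maps a
    # CLIP text embedding to a plausible CLIP image embedding. This tiny MLP
    # is illustrative only; it is not the architecture used in the paper.
    def __init__(self, dim=512):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, 1024),
            torch.nn.GELU(),
            torch.nn.Linear(1024, dim),
        )

    def forward(self, text_emb):
        image_emb = self.net(text_emb)
        return image_emb / image_emb.norm(dim=-1, keepdim=True)

# In the loop above, the target becomes the prior's predicted image
# embedding instead of the raw text embedding:
#   target_emb = prior(text_emb.float())
#   loss = 1.0 - (img_emb.float() @ target_emb.T).mean()
```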

Results

a 🍔

A vase made of colorful crochet

Burj Al Arab

A semi-translucent gummy bear, digital art

Mount Everest cake

backpack

Christmas tree

RPG mana potion

a coffee

a milkshake

Brutalist style apple, rendering

Minecraft Axe


Citation

@inproceedings{khalid2022clipmesh,
  author    = {Khalid, Nasir Mohammad and Xie, Tianhao and Belilovsky, Eugene and Popa, Tiberiu},
  title     = {CLIP-Mesh: Generating textured meshes from text using pretrained image-text models},
  booktitle = {SIGGRAPH Asia 2022 Conference Papers},
  month     = {December},
  year      = {2022},
}