Virtual Pets: Animatable Animal Generation in 3D Scenes

Yen-Chi Cheng
UIUC & Snap Research
Chieh Hubert Lin
UC Merced & Snap Research
Chaoyang Wang
Snap Research
Yash Kant
UofT & Snap Research
Sergey Tulyakov
Snap Research
Hsin-Ying Lee
Snap Research

arXiv 2023

Virtual Pets.  Given a 3D scene, we can generate diverse 3D animal motion sequences that are environment-aware.

Abstract

Toward unlocking the potential of generative models in immersive 4D experiences, we introduce Virtual Pets, a novel pipeline for modeling realistic and diverse motions of a target animal species within a 3D environment. To circumvent the limited availability of 3D motion data aligned with environmental geometry, we leverage monocular internet videos and extract deformable NeRF representations for the foreground and static NeRF representations for the background. To this end, we develop a reconstruction strategy encompassing species-level shared template learning and per-video fine-tuning. Using the reconstructed data, we then train a conditional 3D motion model to learn the trajectory and articulation of foreground animals in the context of 3D backgrounds. We demonstrate the efficacy of our pipeline with comprehensive qualitative and quantitative evaluations, and further show its versatility across unseen cats and indoor environments, producing temporally coherent 4D outputs for enriched virtual experiences.


Diverse Motion Generation

Diverse Environment-aware Motion Generation. We show diverse motion generations in different environments. (Top) 4D generations given different starting poses G0. (Bottom) Diverse motion outputs given the same starting pose in the same scene. The proposed method generates diverse motions across different environments.


Diverse Motion Generation (Multi-view)

Diverse Environment-aware Motion Generation. We show diverse motion generations rendered from multiple viewpoints.


Diverse Textures for Foreground and Background Objects

Diverse textures. We adopt Text2Tex and SceneTex to apply diverse textures to both the foreground objects and the background scenes.


Overview of Virtual Pets

The proposed framework of Virtual Pets. (Left) To extract 3D shapes and motions from monocular videos, we first learn a Species Articulated Template Model with an articulated NeRF using a collection of cat videos. We then perform Per-Video Fine-tuning: for each video, we reconstruct the background with a static NeRF, and the articulated NeRF trained in the species-level stage is loaded and fine-tuned so that the resulting motions, namely the Trajectory and the Articulation, respect the reconstructed background geometry.
(Right) We then train an environment-aware 3D motion generator consisting of a Trajectory VAE and an Articulation VAE. It generates 3D motions conditioned on the vertices of the foreground limbs, the distance from the foreground to the background, and point clouds sampled from the background.
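
To make the conditioning concrete, below is a minimal PyTorch sketch of how such an environment-conditioned motion VAE could be wired together. This is an illustrative assumption rather than the authors' implementation: every module name, tensor shape, and hyperparameter (e.g., CondMotionVAE, the PointNet-style background encoder, latent_dim=64) is hypothetical, and the same structure would be instantiated twice, once for the trajectory and once for the articulation.

# Illustrative sketch only; names, shapes, and hyperparameters are assumptions,
# not the paper's code.
import torch
import torch.nn as nn

class PointCloudEncoder(nn.Module):
    """PointNet-style global encoder for background point clouds: (N, P, 3) -> (N, feat_dim)."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, feat_dim))

    def forward(self, pts):                       # pts: (N, P, 3)
        return self.mlp(pts).max(dim=1).values    # max-pool over points -> (N, feat_dim)

class CondMotionVAE(nn.Module):
    """Conditional VAE over a motion sequence (used once for trajectory, once for articulation).

    cond packs the per-frame conditioning: limb-vertex / starting-pose features,
    foreground-to-background distance, and the global background point-cloud feature.
    """
    def __init__(self, motion_dim, cond_dim, latent_dim=64, hidden=256):
        super().__init__()
        self.enc = nn.GRU(motion_dim + cond_dim, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)
        self.dec_init = nn.Linear(latent_dim + cond_dim, hidden)
        self.dec = nn.GRU(cond_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, motion_dim)

    def encode(self, motion, cond):               # motion: (N, T, motion_dim), cond: (N, T, cond_dim)
        _, h = self.enc(torch.cat([motion, cond], dim=-1))
        h = h.squeeze(0)
        return self.to_mu(h), self.to_logvar(h)

    def decode(self, z, cond):                    # z: (N, latent_dim)
        h0 = self.dec_init(torch.cat([z, cond[:, 0]], dim=-1)).unsqueeze(0)
        out, _ = self.dec(cond, h0)
        return self.out(out)                      # (N, T, motion_dim)

    def forward(self, motion, cond):
        mu, logvar = self.encode(motion, cond)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return self.decode(z, cond), kl

The GRU-based sequence encoder/decoder is only one plausible backbone choice; the point of the sketch is the conditioning on background geometry and starting pose, not the specific architecture.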


Inference

Inference: Texturing and Rendering. At inference time, given textureless foreground and background meshes, we first adopt Text2Tex and SceneTex to texture the meshes. Meanwhile, we generate the motion sequence using the trained Trajectory VAE and Articulation VAE. We then obtain the final predicted foreground mesh after deformation and transformation. Finally, the 3D motion sequences and the 3D scene are rendered into videos given camera poses.
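
As a companion to the figure, here is a minimal, hypothetical sketch of the sampling-and-posing step of this inference loop. It assumes the CondMotionVAE sketch above and a deliberately simplified trajectory parameterization (per-frame yaw plus translation); texturing with Text2Tex/SceneTex and the final rendering given camera poses are omitted.

# Illustrative inference sketch (builds on the CondMotionVAE sketch above);
# function names and the 4-DoF trajectory parameterization are assumptions.
import torch

@torch.no_grad()
def sample_motion(vae, cond, latent_dim=64):
    """Draw one motion sequence from the prior, conditioned on scene and starting pose."""
    z = torch.randn(cond.shape[0], latent_dim)
    return vae.decode(z, cond)                    # (N, T, motion_dim)

def apply_trajectory(verts, traj):
    """Place the (already articulated) foreground mesh along the decoded trajectory.

    verts: (V, 3) foreground mesh vertices in the canonical frame.
    traj:  (T, 4) per-frame [yaw, tx, ty, tz] -- a simplified parameterization.
    Returns (T, V, 3) vertices per frame, ready to be rendered given camera poses.
    """
    yaw, t = traj[:, 0], traj[:, 1:]
    cos, sin = torch.cos(yaw), torch.sin(yaw)
    zeros, ones = torch.zeros_like(yaw), torch.ones_like(yaw)
    # One 3x3 rotation about the up (y) axis per frame: (T, 3, 3)
    R = torch.stack([cos, zeros, sin,
                     zeros, ones, zeros,
                     -sin, zeros, cos], dim=-1).reshape(-1, 3, 3)
    return torch.einsum('tij,vj->tvi', R, verts) + t[:, None, :]

Sampling several latents z for the same conditioning is what yields the diverse motions shown above; articulation would be decoded by its own VAE and applied to the mesh before calling apply_trajectory.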


Citation

@article{cheng2023VirtualPets,
  title   = {{V}irtual {P}ets: Animatable Animal Generation in 3D Scenes},
  author  = {Cheng, Yen-Chi and Lin, Chieh Hubert and Wang, Chaoyang and Kant, Yash and Tulyakov, Sergey and Schwing, Alexander G. and Gui, Liangyan and Lee, Hsin-Ying},
  journal = {arXiv preprint arXiv:2312.14154},
  year    = {2023},
}

Acknowledgement

The template of this webpage is borrowed from DreamFusion. Thanks to the authors for their beautiful website!