One-shot Implicit Animatable Avatars with
Model-based Priors

1State Key Lab of CAD & CG, Zhejiang University 2Max Planck Institute for Intelligent Systems, Tübingen, Germany
3University of Cambridge 4Xiaohongshu Inc.
*denotes equal contribution

ELICIT creates free-viewpoint motion videos from a single image by constructing an animatable NeRF representation through one-shot learning.

Abstract

Existing neural rendering methods for creating human avatars typically either require dense input signals such as video or multi-view images, or leverage a learned prior from large-scale specific 3D human datasets such that reconstruction can be performed with sparse-view inputs. Most of these methods fail to achieve realistic reconstruction when only a single image is available. To enable the data-efficient creation of realistic animatable 3D humans, we propose ELICIT, a novel method for learning human-specific neural radiance fields from a single image. Inspired by the fact that humans can easily reconstruct body geometry and infer full-body clothing from a single image, we leverage two priors in ELICIT: a 3D geometry prior and a visual semantic prior. Specifically, ELICIT introduces the 3D body shape geometry prior from a skinned vertex-based template model (i.e., SMPL) and implements the visual clothing semantic prior with CLIP-based pre-trained models. Both priors jointly guide the optimization toward plausible content in the invisible areas. To further improve visual details, we propose a segmentation-based sampling strategy that locally refines different parts of the avatar. Comprehensive evaluations on multiple popular benchmarks, including ZJU-MoCap, Human3.6M, and DeepFashion, show that ELICIT outperforms current state-of-the-art avatar creation methods when only a single image is available. Code will be released for research purposes at https://github.com/huangyangyi/ELICIT.
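The visual semantic prior can be viewed as a CLIP image-embedding similarity term between renders of the avatar and the single input photo. Below is a minimal sketch of such a loss, assuming both images are already preprocessed to CLIP's expected 224x224 normalized format; the function clip_semantic_loss and its inputs are illustrative and not ELICIT's actual implementation.

import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP (https://github.com/openai/CLIP)

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float()      # keep fp32 to match rendered tensors
clip_model.requires_grad_(False)     # CLIP stays frozen; only the avatar is optimized

def clip_semantic_loss(rendered, reference):
    # rendered, reference: (B, 3, 224, 224) tensors normalized with CLIP statistics
    feat_render = F.normalize(clip_model.encode_image(rendered), dim=-1)
    feat_ref = F.normalize(clip_model.encode_image(reference), dim=-1)
    # 1 - cosine similarity: minimizing pulls the rendered view toward the
    # semantics of the input photo, which regularizes unseen regions.
    return (1.0 - (feat_render * feat_ref).sum(dim=-1)).mean()

In practice such a term is applied to renders from novel viewpoints, where no pixel-level supervision from the single input image exists.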

Video

Animatable Avatars from a Single Image

Evaluation on multi-view captured 3D human datasets

We evaluate ELICIT on the multi-view 3D datasets ZJU-MoCap and Human3.6M. For each example, given a single monocular RGB image, ELICIT generates a realistic animatable avatar. We then use the captured ground-truth 3D motion to animate the avatar and synthesize the motion video.
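Animating the avatar with a captured motion amounts to warping observation-space query points back into the canonical pose where the NeRF is defined, typically via inverse linear blend skinning against the SMPL skeleton. The sketch below illustrates that idea under stated assumptions: the per-point blend weights and the per-bone transforms from the ground-truth motion are assumed to be given, and this is not ELICIT's exact animation code.

import torch

def inverse_lbs(points_obs, blend_weights, bone_transforms):
    # points_obs:      (N, 3) query points in the posed (observation) space
    # blend_weights:   (N, J) skinning weights of each point over J bones
    # bone_transforms: (J, 4, 4) canonical-to-posed rigid transforms per bone
    blended = torch.einsum("nj,jab->nab", blend_weights, bone_transforms)  # (N, 4, 4)
    blended_inv = torch.inverse(blended)  # posed -> canonical
    points_h = torch.cat([points_obs, torch.ones_like(points_obs[:, :1])], dim=-1)
    points_canon = torch.einsum("nab,nb->na", blended_inv, points_h)[:, :3]
    return points_canon  # query the canonical NeRF at these points to render the new pose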

Results on the ZJU-MoCap dataset

Results on the Human3.6M dataset

Evaluation on humans with various clothing styles

ELICIT creates animatable avatars with realistic details from high-resolution 2D human photos with various clothing styles from the DeepFashion dataset.

Comparison with SOTA NeRF methods (single-image input)

Novel view synthesis

Novel pose synthesis

Comparison with non-NeRF methods on DeepFashion

BibTeX

@inproceedings{huang2022elicit,
      title={One-shot Implicit Animatable Avatars with Model-based Priors},
      author={Huang, Yangyi and Yi, Hongwei and Liu, Weiyang and Wang, Haofan and Wu, Boxi and Wang, Wenxiao and Lin, Binbin and Zhang, Debing and Cai, Deng},
      booktitle={IEEE/CVF International Conference on Computer Vision (ICCV)}, 
      year={2023}
  }