More Demos for MimicTalk

1. High-quality Personalized Talking Face Generation

The models are trained with the proposed SD-Hybrid adaptation pipeline, taking 1,000 iterations and less than 10 minutes of training.
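To give a rough sense of what such lightweight per-identity adaptation looks like, here is a minimal, self-contained toy sketch: a frozen backbone plus a small low-rank adapter optimized for 1,000 steps. This is not the released SD-Hybrid code; every module name, shape, and loss below is invented purely for illustration.

```python
# Toy illustration of lightweight per-identity adaptation (NOT the SD-Hybrid code):
# the pretrained backbone is frozen and only a small low-rank residual is trained.
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Low-rank residual update (LoRA-style), the kind used for cheap adaptation."""
    def __init__(self, dim: int, rank: int = 4):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(dim, rank))

    def forward(self, x):
        return x + x @ self.A.t() @ self.B.t()

backbone = nn.Linear(64, 64)           # stands in for the frozen one-shot generator
for p in backbone.parameters():
    p.requires_grad_(False)

adapter = LowRankAdapter(dim=64)
opt = torch.optim.Adam(adapter.parameters(), lr=1e-3)

x = torch.randn(256, 64)               # stands in for per-identity training frames
target = torch.randn(256, 64)          # stands in for reconstruction targets

for step in range(1000):               # ~1,000 iterations, as in the demo setup
    pred = adapter(backbone(x))
    loss = nn.functional.mse_loss(pred, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```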

2. Comparison with Person-Dependent NeRF-based Baselines

3. Stylized and Expressive Co-Speech Motion Generation with the ICS-A2M model

Post-Rebuttal Demo 1. Mimicking Target Identity's In-Domain Talking Style

This demo shows that, when the target identity's training video clip is used as the style prompt, our method can produce expressive results for the target identity with good identity similarity and style similarity, achieving the goal of personalized talking face generation (i.e., mimicking the target identity's visual attributes and talking style).

• Please focus on the lip motion; the talking face avatars are driven by the audio track in the video.

Rebuttal Demo 1. More Identities

This demo compares our method with the baseline on 9 additional identities (all training videos are 10 seconds long), and the results show that our method achieves better data efficiency and lip-sync quality.

• Please focus on the lip motion; the talking face avatars are driven by the audio track in the video.

Rebuttal Demo 2. Audio-to-Pose

To predict head pose from audio, we additionally train an audio-to-pose model that follows the main structure of the ICS-A2M model proposed in the original paper. The demo shows that our model can produce novel head poses that are coherent with the input audio; a toy sketch of such an audio-to-pose mapping is given after the note below.

• Please focus on the head pose and ignore the lip motion (we have muted the driving audio).
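The following is an illustrative sketch of an audio-to-pose mapping, not the actual ICS-A2M-style model: a small temporal network regresses a 6-DoF head pose (3 rotation + 3 translation) per frame from audio features. The feature dimensions and the architecture are assumptions made only for this example.

```python
# Toy audio-to-pose sketch (architecture and dimensions are illustrative assumptions).
import torch
import torch.nn as nn

class AudioToPose(nn.Module):
    def __init__(self, audio_dim: int = 80, hidden: int = 128, pose_dim: int = 6):
        super().__init__()
        self.encoder = nn.GRU(audio_dim, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, pose_dim)

    def forward(self, audio_feats):            # (batch, frames, audio_dim)
        h, _ = self.encoder(audio_feats)
        return self.head(h)                    # (batch, frames, pose_dim)

model = AudioToPose()
mel = torch.randn(1, 200, 80)                  # e.g. 200 frames of mel-spectrogram features
pose_seq = model(mel)                          # per-frame head pose trajectory
print(pose_seq.shape)                          # torch.Size([1, 200, 6])
```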

Rebuttal Demo 3. Driven by OOD Poses

We additionally compare our method with the baseline when driven by various OOD head poses, and the results show that our method handles the OOD poses well, whereas the baseline does not.

• Please focus on the head pose and ignore the lip motion (we have muted the driving audio).

Rebuttal Demo 4. Talking Style Mimicking

We tried different classifier-free guidance (CFG) scales in the sampling process of the ICS-A2M model (default = 1.0 in the original paper). We found that increasing the CFG scale further improves the style similarity of our flow-matching-based motion generation. The results show that our method handles various style references well (6 prompts in the video), while the baseline degrades in identity similarity.
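For clarity, here is a minimal sketch of how a CFG scale enters flow-matching sampling. The `velocity_net` and `dummy_velocity` names are hypothetical stand-ins, not our released model; only the guidance rule itself is standard: blend conditional and unconditional velocity predictions, where a scale of 1.0 recovers plain conditional sampling and larger scales push samples toward the style condition.

```python
# Classifier-free guidance in a flow-matching sampler (illustrative sketch).
import torch

@torch.no_grad()
def sample_with_cfg(velocity_net, x, style_cond, null_cond, num_steps=25, cfg_scale=1.0):
    dt = 1.0 / num_steps
    t = torch.zeros(x.shape[0], device=x.device)
    for _ in range(num_steps):                   # simple Euler integration of dx/dt = v
        v_cond = velocity_net(x, t, style_cond)  # conditioned on the style prompt
        v_uncond = velocity_net(x, t, null_cond) # conditioned on a "null" prompt
        v = v_uncond + cfg_scale * (v_cond - v_uncond)
        x = x + dt * v
        t = t + dt
    return x

# Dummy velocity field just to make the sketch executable end-to-end.
def dummy_velocity(x, t, cond):
    return cond - x

motion = sample_with_cfg(dummy_velocity, torch.randn(4, 64),
                         style_cond=torch.ones(4, 64), null_cond=torch.zeros(4, 64),
                         cfg_scale=2.0)
```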

BibTeX

@inproceedings{ye2024real3dportrait,
    author    = {Ye, Zhenhui and Zhong, Tianyun and Ren, Yi and Yang, Jiaqi and Li, Weichuang and Huang, Jiangwei and Jiang, Ziyue and He, Jinzheng and Huang, Rongjie and Liu, Jinglin and Zhang, Chen and Yin, Xiang and Ma, Zejun and Zhao, Zhou},
    title     = {Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis},
    booktitle = {ICLR},
    year      = {2024},
  }