Video-Realistic Talking Heads with Natural Facial Expressions for Interactive Services

TNT members involved in this project:
Stella Graßhof, M.Sc.
Felix Kuhnke, M.Sc.

Recent investigations indicate that image-based rendering can generate natural-looking talking heads for innovative multimodal human-machine interfaces. The design of such an image-based talking head system comprises two steps. The first is the analysis of audiovisual data of a speaking person to create a database containing a large number of eye images, mouth images and associated facial and speech features. The second is the synthesis of facial animations, which selects and concatenates appropriate eye and mouth images from the database and synchronizes them with speech generated by a TTS synthesizer.

Our award-winning image-based talking head system shall be extended with natural facial expressions, which are controlled by parameters obtained from analyzing the relationship between facial expressions and prosodic features of the speech. In addition, the relationship between the prosodic features of the speech and the global motion of the head shall be studied. To allow for real-time animation on consumer devices, methods for reducing the size of the database have to be developed. Computer-aided modeling of human faces usually requires a lot of manual control to achieve realistic animations and to prevent unrealistic or non-human-like results. Humans are very sensitive to any abnormal facial features, so facial animation remains a challenging task to this day.


Facial animation combined with text-to-speech synthesis (TTS), also known as a talking head, can be used as a modern human-machine interface. A typical application is an internet-based customer service that integrates a talking head into its web site. Subjective tests showed that electronic commerce web sites with talking heads are rated higher than those without.


Today, animation techniques range from animating 3D models to image-based rendering of models. A 3D model consists of a 3D mesh that defines the geometric shape of the head, and it is animated by moving the vertices of this mesh. The first approaches date back to the early 1970s. Since then, different animation techniques have been developed that continuously improved the animation. However, animating a 3D model still does not achieve photo-realism.


Photo-realism means generating animations that are indistinguishable from recorded video. Recently, image-based facial animation was introduced. Image-based rendering processes only 2D images, so new animations are generated by combining different facial parts of recorded image sequences. Hence, a 3D model is not necessary. One such system can produce videos of people uttering text they never said before. Short video clips, each showing three consecutive frames (called tri-phones), are stored as samples, which leads to a large database.


Ezzat et al. have demonstrated a sample-based talking head that uses morphing to generate intermediate appearances of mouth shapes from a very small set of mouth samples.


Cosatto et al. designed a system that achieves photo-realistic facial animations and can currently be regarded as the state-of-the-art facial animation engine. The face model mainly consists of a personalized mask and a large database of mouth images and related information. Our system is based on Cosatto's work.

Our goal is to achieve photo-realistic facial animations. We plan to extend our current 2D facial animations to 3D. A special challenge is the estimation of correlated parameters in deformable 3D facial models.

Our image-based facial animation system consists of two main parts: audiovisual analysis of a recorded human subject and synthesis of facial animation.


In the analysis part, a database with images of deformable facial parts of the human subject is created. The input of the visual analysis process is a video sequence and a face mask of the recorded human subject. To position the face mask on the recorded human subject in the initial frame, facial features such as eye corners and nostrils have to be localized. These facial features are selected because they are independent of local deformations such as a moving jaw or a blinking eye. Furthermore, the camera is calibrated so that the intrinsic camera parameters, such as the focal length, are known. Thus, only the position and orientation of the mask in the initial image must be reconstructed. This problem is known as the Perspective-n-Point problem in the computer vision community. We use a reliable and accurate method for solving this type of problem. To estimate the pose of the head in each subsequent frame, a gradient-based motion estimation algorithm estimates the three rotation and three translation parameters. Accurate pose estimation is required to avoid a jerky animation.
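
The role of the six pose parameters and the focal length can be illustrated with a small sketch. It is not the PnP solver or the gradient-based estimator used in the project; it only shows the pinhole projection model that such estimators typically minimize over. The rotation convention, the function names and the mask coordinates are assumptions chosen for illustration.

```python
import math

def rotation_matrix(rx, ry, rz):
    """Rotation R = Rz * Ry * Rx from three angles in radians (one possible convention)."""
    cx, sx = math.cos(rx), math.sin(rx)
    cy, sy = math.cos(ry), math.sin(ry)
    cz, sz = math.cos(rz), math.sin(rz)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]

def project(point, pose, focal_length):
    """Project a 3D mask point into the image for pose = (rx, ry, rz, tx, ty, tz)."""
    rx, ry, rz, tx, ty, tz = pose
    r = rotation_matrix(rx, ry, rz)
    x, y, z = point
    # Rigid motion into camera coordinates: rotate, then translate.
    xc = r[0][0] * x + r[0][1] * y + r[0][2] * z + tx
    yc = r[1][0] * x + r[1][1] * y + r[1][2] * z + ty
    zc = r[2][0] * x + r[2][1] * y + r[2][2] * z + tz
    if zc <= 0:
        raise ValueError("point lies behind the camera")
    # Pinhole projection with a known (calibrated) focal length.
    return (focal_length * xc / zc, focal_length * yc / zc)

def reprojection_error(mask_points, image_points, pose, focal_length):
    """Mean squared distance between projected mask points and measured features."""
    total = 0.0
    for p3d, (u, v) in zip(mask_points, image_points):
        pu, pv = project(p3d, pose, focal_length)
        total += (pu - u) ** 2 + (pv - v) ** 2
    return total / len(mask_points)

# Hypothetical mask points (e.g. eye corners and nostrils) in mask coordinates.
mask = [(-30.0, 20.0, 0.0), (30.0, 20.0, 0.0), (-8.0, -15.0, 10.0), (8.0, -15.0, 10.0)]
focal = 800.0
true_pose = (0.05, -0.10, 0.02, 5.0, -3.0, 500.0)
observed = [project(p, true_pose, focal) for p in mask]

error_at_truth = reprojection_error(mask, observed, true_pose, focal)
error_perturbed = reprojection_error(mask, observed, (0.05, -0.05, 0.02, 5.0, -3.0, 500.0), focal)
```

A pose estimator searches for the six parameters that drive this error towards zero; here the error vanishes at the pose used to generate the observations and grows when one rotation angle is perturbed.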


After the motion parameters are calculated for each frame, mouth samples are normalized and stored in a database. Normalizing means compensating for head pose variations. Each mouth sample is characterized by a number of parameters consisting of its phonetic context, original sequence and frame number. Furthermore, each sample is characterized by its visual information, which is required for the selection of samples to create animations. The visual appearance of a sample can be parameterized by PCA or LLE. The geometric features are extracted by feature detection based on Active Appearance Models (AAM). Our database consists of approximately 20000 images, which corresponds to 10 minutes of recording time.
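
As an illustration of how such a sample might be stored, the following sketch defines a record holding the parameters listed above and an index keyed by phoneme. The field names, the tuple layout of the visual parameters and the example values are assumptions and do not describe the actual database format.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class MouthSample:
    phoneme: str                    # phonetic context of the recorded frame
    sequence_id: int                # recorded sequence the frame was taken from
    frame_number: int               # frame position within that sequence
    appearance: Tuple[float, ...]   # hypothetical: a few PCA (or LLE) coefficients
    geometry: Tuple[float, ...]     # hypothetical: mouth width and height from AAM features

def build_index(samples: List[MouthSample]) -> Dict[str, List[MouthSample]]:
    """Group samples by phoneme so that candidates can be looked up quickly."""
    index = defaultdict(list)
    for sample in samples:
        index[sample.phoneme].append(sample)
    return dict(index)

# Invented example values, for illustration only.
samples = [
    MouthSample("a", 1, 10, (0.8, -0.1), (42.0, 18.0)),
    MouthSample("a", 3, 57, (0.7, 0.0), (41.0, 17.0)),
    MouthSample("m", 1, 9, (-0.5, 0.2), (38.0, 2.0)),
]
index = build_index(samples)
candidates_for_a = index["a"]
```

Keeping the sequence and frame number with every sample makes it possible to recognize frames that were consecutive in the recording, which is useful when samples are later concatenated.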


A face is synthesized by first generating the audio from the text using a TTS synthesizer. The TTS synthesizer sends phonemes and their durations to the unit selection engine, which chooses the best samples from the database. Then, the image rendering module overlays the facial parts corresponding to the generated speech onto a background video sequence. Background sequences are recorded video sequences of the human subject with typical short head movements. To conceal illumination differences between an image of the background video and the current mouth sample, the mouth samples are blended into the background sequence using alpha blending.
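
The two synthesis steps can be sketched as follows, under simplifying assumptions. The selection is written as a Viterbi search with a hypothetical concatenation cost (zero for frames that were consecutive in the recording, otherwise the difference of a single visual parameter); the cost functions of the actual unit selection engine are not described here and may differ. Samples are reduced to tuples, and images to small lists of gray values.

```python
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[int, int, float]  # hypothetical: (sequence_id, frame_number, mouth_opening)

def concatenation_cost(prev: Sample, cur: Sample) -> float:
    """Zero for frames that were consecutive in the recording, else a visual distance."""
    if prev[0] == cur[0] and cur[1] == prev[1] + 1:
        return 0.0
    return abs(prev[2] - cur[2])

def select_units(phonemes: Sequence[str], database: Dict[str, List[Sample]]) -> List[Sample]:
    """Viterbi search for the sample sequence with the lowest total concatenation cost."""
    if not phonemes:
        return []
    candidates = [database[p] for p in phonemes]
    cost = [0.0] * len(candidates[0])
    backpointers = []
    for i in range(1, len(candidates)):
        new_cost, pointers = [], []
        for cur in candidates[i]:
            options = [cost[j] + concatenation_cost(prev, cur)
                       for j, prev in enumerate(candidates[i - 1])]
            best = min(range(len(options)), key=options.__getitem__)
            new_cost.append(options[best])
            pointers.append(best)
        cost = new_cost
        backpointers.append(pointers)
    # Trace the cheapest path back from the last phoneme to the first.
    k = min(range(len(cost)), key=cost.__getitem__)
    path = [k]
    for pointers in reversed(backpointers):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return [candidates[i][k] for i, k in enumerate(path)]

def alpha_blend(mouth, background, alpha):
    """Per-pixel blend: alpha * mouth + (1 - alpha) * background."""
    return [[a * m + (1.0 - a) * b for m, b, a in zip(m_row, b_row, a_row)]
            for m_row, b_row, a_row in zip(mouth, background, alpha)]

# Invented example data, for illustration only.
database = {"m": [(1, 10, 0.0), (2, 5, 0.1)], "a": [(1, 11, 0.9), (3, 7, 0.8)]}
chosen = select_units(["m", "a"], database)
blended = alpha_blend([[100.0, 100.0]], [[50.0, 50.0]], [[1.0, 0.5]])
```

In this toy database the search prefers the two frames that follow each other in sequence 1, because joining them costs nothing. In the blend, an alpha of 1 keeps the mouth sample and lower values near the border let the background show through, which hides the seam.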

Our research has already led to free demo software for facial animation, which we provide for download on this website.


Additionally, we offer the data that was collected during our research on an expressive talking head with a male speaker for download here.


  • We provide no support and no guarantee.
  • If you use the provided data for your research, a proper reference to our institute must be given.


If you are interested in writing your thesis (Bachelor, Master, Studienarbeit or Diplom) on this topic and thereby contributing to this project, please contact Stella Graßhof or Felix Kuhnke.

  • Journals
    • Kang Liu, Joern Ostermann
      Evaluation of an Image-based Talking Head with Realistic Facial Expression and Head Motion
      Journal on Multimodal User Interfaces, Special issue: Emotion-based Interaction, October 2011
    • Axel Weissenfeld, Kang Liu, Jörn Ostermann
      Video-realistic image-based eye animation via statistically driven state machines
      The Visual Computer, Springer Berlin / Heidelberg, November 2009
    • Kang Liu, Joern Ostermann
      Optimization of An Image-based Talking Head System
      Special issue on animating virtual speakers or singers from audio: Lip-synching facial animation, EURASIP Journal on Audio, Speech, and Music Processing, Hindawi Publishing Corporation, Vol. 2009, September 2009
    • Axel Weissenfeld, Kang Liu, Joern Ostermann
      Gesichtsanimation mit Image-based Rendering für Dialogsysteme
      Telekommunikation Aktuell, Berichte aus Forschung und Entwicklung in Informationstechnik und Telekommunikation, 60. Jahrgang, Heft 07-12, Verlag für Wissenschaft und Leben, Erlangen, 2006
    • Jörn Ostermann, Lawrence S. Chen, Thomas S. Huang
      Animated Talking Head with Personalized 3D Head Model
      VLSI Signal Processing, Kluwer Academic Publishers, The Netherlands, pp. 97-105, 1998
    • Joern Ostermann
      Animated Talking Head with Personalized 3D Head Model
      Journal of VLSI Signal Processing, Kluwer Academic Publishers, p. 9, 1998, edited by Chen, Lawrence S.; Huang, Thomas S.
  • Conference Contributions
    • Felix Kuhnke, Jörn Ostermann
      Visual Speech Synthesis From 3D Mesh Sequences Driven By Combined Speech Features
      Proc. of the IEEE International Conference on Multimedia and Expo (ICME), IEEE, Hong Kong, July 2017
    • Stella Graßhof, Hanno Ackermann, Felix Kuhnke, Jörn Ostermann, Sami Brandt
      Projective Structure from Facial Motion
      15th IAPR International Conference on Machine Vision Applications (MVA) (accepted), Nagoya (Japan), May 2017
    • Stella Graßhof, Hanno Ackermann, Sami Brandt, Jörn Ostermann
      Apathy is the Root of all Expressions
      12th IEEE Conference on Automatic Face and Gesture Recognition (FG2017) (accepted), Washington D.C., USA, 2017
    • Stella Graßhof, Hanno Ackermann, Jörn Ostermann
      Estimation of Face Parameters using Correlation Analysis and a Topology Preserving Prior
      14th IAPR International Conference on Machine Vision Applications (MVA), Tokyo, May 2015
    • Karsten Vogt, Oliver Müller, Jörn Ostermann
      Facial Landmark Localization using Robust Relationship Priors and Approximative Gibbs Sampling
      Advances in Visual Computing, Springer, Vol. 9475, pp. 365-376, Las Vegas, 2015, edited by George Bebis et al.
    • Stella Graßhof, Jörn Ostermann
      Performance of Image Registration and Its Extensions for Interpolation of Facial Motion
      PSIVT 2013 Workshops, Springer Lecture Notes on Computer Sciences (LNCS), pp. 216--227, October 2013
    • Kang Liu, Joern Ostermann
      Realistic Facial Expression Synthesis for an Image-based Talking Head
      IEEE Conference on Multimedia and Expo, ICME2011, p. 6, Barcelona, Spain, July 2011
    • Kang Liu, Joern Ostermann
      Evaluation of an Image-based Talking Head with Realistic Facial Expression and Head Motion
      Proceedings of CASA (Computer Animation and Social Agents) workshop on Emotion-based Interaction, Chengdu, China, May 2011
    • Kang Liu, Joern Ostermann
      Realistic Head Motion Synthesis for an Image-based Talking Head
      FG 2011, The 9th IEEE Conference on Automatic Face and Gesture Recognition, p. 6, Santa Barbara, CA, March 2011
    • Kang Liu, Joern Ostermann
      Image-based Talking Head: Analysis and Synthesis
      DAGA 2010, 36. International Conference on Acoustics, Deutschen Gesellschaft für Akustik, pp. 87-88, Berlin, March 2010
    • Kang Liu, Joern Ostermann
      Minimized Database of Unit Selection in Visual Speech Synthesis Without Loss of Naturalness
      The 13th International Conference on Computer Analysis of Images and Patterns CAIP2009, Springer-Verlag Berlin Heidelberg, pp. 1212-1219, Münster, Germany, September 2009, edited by X. Jiang and N. Petkov
    • Kang Liu, Joern Ostermann
      An Image-based Talking Head System
      LIPS 2009 Special Session in AVSP 2009, Norwich, UK, September 2009
    • Axel Weissenfeld, Kang Liu, Joern Ostermann
      Video-Realistic Image-based Eye Animation System
      EUROGRAPHICS 2009 (Short Paper), Munich, April 2009
    • Kang Liu, Joern Ostermann
      Realistic Facial Animation System for Interactive Services
      Interspeech 2008, LIPS 2008: Visual Speech Synthesis Challenge, Brisbane, Australia, September 2008
    • Kang Liu, Joern Ostermann
      Realistic Talking Head for Human-Car-Entertainment Services
      IMA 2008 Informationssysteme für mobile Anwendungen, GZVB e.V. (Hrsg.), pp. 108-118, Braunschweig, Germany, September 2008
    • Kang Liu, Axel Weissenfeld, Joern Ostermann, Xinghan Luo
      Robust AAM Building for Morphing in an Image-based Facial Animation System
      IEEE International Conference on Multimedia and Expo 2008, Hannover, Germany, June 2008
    • Axel Weissenfeld, Kang Liu, Wei Liu, Joern Ostermann
      Image-based Head Animation System
      1. Kongress Multimediatechnik 2006, Institut für Multimediatechnik GmbH -IFM, pp. 67-72, Wismar, November 2006
    • Axel Weissenfeld, Onay Urfalioglu, Kang Liu, Joern Ostermann
      Robust Rigid Head Motion Estimation based on Differential Evolution
      IEEE International Conference on Multimedia & Expo 2006, pp. 225-228, Toronto, Canada, July 2006
    • Kang Liu, Axel Weissenfeld, Joern Ostermann
      Parameterization of Mouth Images by LLE and PCA for Image-based Facial Animation
      ICASSP06, Toulouse, France, IEEE Proceedings, IEEE, Vol. 5, pp. 461-464, May 2006
    • Axel Weissenfeld, Kang Liu, Sven Klomp, Joern Ostermann
      Personalized Unit Selection for an Image-based Facial Animation System
      IEEE MMSP 2005, Shanghai/China, IEEE, November 2005
    • Jörn Ostermann, Axel Weissenfeld, Kang Liu
      Talking Faces - Technologies and Applications (Keynote)
      Vision, Video, and Graphics 2005, Eurographics Association, pp. 157-158, University of Edinburgh, July 2005, edited by Emanuele Trucco
    • A.C. Andres del Valle, Joern Ostermann
      3D talking head customization by adapting a generic model to one uncalibrated picture
      ISCAS 2001, Sydney, Australia, Vol. 2, pp. 325-328, May 2001
    • Joern Ostermann, D. Millen
      Talking heads and synthetic speech: An architecture for supporting electronic commerce
      ICME 2000, International Conference on Multimedia and Expo, New York, USA, IEEE CNF, Vol. 1, pp. 71-74, July 2000
    • Joern Ostermann, Y. Wang, M. Beutnagel, A. Fischer
      Integration of talking heads and text-to-speech synthesizers for visual TTS
      International Conference on Spoken Language Processing, Sydney, Australia, pp. 297-300, 1998
  • Technical Report
    • Felix Kuhnke, Stella Graßhof, Jörn Ostermann
      Das Gesicht als Interface zwischen Mensch und Maschine - Wie wir zukünftig mit Robotern kommunizieren
      Unimagazin - Forschungsmagazin der Leibniz Universität Hannover, pp. 14-16, Hannover, 2016
    • Joern Ostermann, Erich Haratsch
      Parameter-Based Model-Independent Animation of Personalized Talking Heads
      IEEE Transactions on Circuits and Systems for Video Technology, p. 24, 1996