acted video driven approach
The most direct and relatable approach uses an acted video that drives both the visual and the audio output. With the ElevenLabs Voice Changer tool, we can take the originally recorded voice and create a synthesized voice that keeps the exact timing and tone of the original but is translated in gender, age, or sound style. The same goes for the video: with suitable footage, we can use the driving video to manipulate that footage to a pretty convincing result.
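As a rough sketch of the voice step: ElevenLabs exposes its Voice Changer as a speech-to-speech REST endpoint. The endpoint path, header name, and model id below are assumptions based on the public API docs, so check them against the current ElevenLabs reference before relying on this.

```python
# Sketch: pushing an acted voice recording through the ElevenLabs
# speech-to-speech ("Voice Changer") endpoint. Voice id, API key, and
# file names are placeholders.

def sts_url(voice_id: str) -> str:
    # Assumed endpoint layout of the ElevenLabs speech-to-speech API.
    return f"https://api.elevenlabs.io/v1/speech-to-speech/{voice_id}"

def change_voice(input_wav: str, voice_id: str, api_key: str,
                 output_path: str = "changed_voice.mp3") -> str:
    # Imported lazily so the sketch loads even without requests installed.
    import requests
    with open(input_wav, "rb") as f:
        resp = requests.post(
            sts_url(voice_id),
            headers={"xi-api-key": api_key},
            files={"audio": f},
            # Assumed multilingual speech-to-speech model id; it preserves
            # the timing and tone of the acted take.
            data={"model_id": "eleven_multilingual_sts_v2"},
        )
    resp.raise_for_status()
    with open(output_path, "wb") as out:
        out.write(resp.content)
    return output_path
```

The acted take goes in, and the same performance comes back in the target voice, which is exactly the gender/age/style translation described above.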
audio driven approach
We can use audio generative tools as the guiding element for the acting and animation altogether. This comes in handy, as we can script and design the spoken word very precisely and try different variations without getting on the actors' nerves 🙂 – The crucial part here is the reenactment of the audio by a real person. Compared with the exclusively audio driven lipsync approach, we see a lot more depth and relatability in the performance.
#1 – generating authentic speech sample
With generators like BARK, F5-TTS, or ElevenLabs, we can generate spoken word in a pretty relatable manner across a wide range of languages and voice variations. For this example, we take an interview snippet in German with a quick laugh at the beginning. The sample is generated with BARK/SUNO, which can also output non-speech sounds like laughter. https://huggingface.co/spaces/suno/bark
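The BARK step can be sketched in a few lines with the open-source `bark` package. BARK accepts non-speech tokens such as `[laughter]` directly in the prompt, which is how the quick laugh at the start can be produced. The German line below is a hypothetical example, not the original interview text, and the model weights download on first use.

```python
# Sketch: generating a German speech sample with a leading laugh via BARK
# (suno-ai/bark). Assumes the `bark` and `scipy` packages are installed.

def build_prompt(text: str, laugh_first: bool = True) -> str:
    # BARK understands non-speech tokens like [laughter] inside the prompt.
    return ("[laughter] " if laugh_first else "") + text

def synthesize(prompt: str, out_path: str = "sample_de.wav") -> str:
    # Heavy imports kept inside the function so the sketch loads without bark.
    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav
    preload_models()                  # downloads/caches the BARK models
    audio = generate_audio(prompt)    # numpy float array at SAMPLE_RATE
    write_wav(out_path, SAMPLE_RATE, audio)
    return out_path

# usage (not run here, needs the models):
# synthesize(build_prompt("Ja, das war wirklich ein verrückter Moment."))
```

Because the prompt is plain text, it is easy to script several variations of the line and pick the most relatable take.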
#2 – creating a suitable scene for the audio
Depending on the content or context of the audio, we can generate a suitable keyframe/scene that fits the expressive needs of the audio. This example is generated with FLUX.1 [dev] with a custom trained LoRA.
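For reference, the keyframe generation can be reproduced with Hugging Face `diffusers`, which ships a pipeline for FLUX.1 [dev] and supports loading LoRA weights. The LoRA path and prompt below are hypothetical placeholders; the pipeline needs a GPU and access to the gated model, so nothing runs at import time.

```python
# Sketch: rendering a keyframe with FLUX.1 [dev] plus a custom-trained LoRA
# using diffusers. Assumes `diffusers` and `torch` are installed and the
# gated model "black-forest-labs/FLUX.1-dev" is accessible.

def render_keyframe(prompt: str, lora_path: str,
                    out_path: str = "keyframe.png") -> str:
    import torch
    from diffusers import FluxPipeline
    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
    ).to("cuda")
    pipe.load_lora_weights(lora_path)  # custom LoRA, e.g. a trained character
    image = pipe(prompt, num_inference_steps=28,
                 guidance_scale=3.5).images[0]
    image.save(out_path)
    return out_path

# usage (not run here, needs a GPU):
# render_keyframe("portrait of the interviewee, warm window light",
#                 "my_character_lora.safetensors")
```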
#3 – exclusive audio driven lip sync approach
Generative tools like Runway's Lip Sync use audio information to drive facial animation in a video. This comes in quite handy when a quick result is needed. However, this approach lacks relatable expressions, as the AI model has little understanding of which facial expression suits the audio context.
#4 – acted video driven approach
The acted video approach gives a lot more expressive control over the animation. We can use the audio to guide the acting, but leave the emotional interpretation to the actor.