acted video driven approach
The most direct and relatable approach uses an acted video that drives both the visual and the audio output. With the ElevenLabs Voice Changer tool, we can take the originally recorded voice and create a synthesized voice that keeps the exact timing and tone of the original but is translated in gender, age, or sound style. The same goes for the video: with suitable footage, we can use the driving video to manipulate that footage to a pretty convincing result.
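As a rough sketch of the voice step: ElevenLabs exposes its Voice Changer as a speech-to-speech REST endpoint. The endpoint path, header name, and model id below are assumptions based on the public API docs, so check them against the current ElevenLabs reference before relying on this.

```python
# Sketch: pushing an acted voice recording through the ElevenLabs
# speech-to-speech ("Voice Changer") endpoint. Voice id, API key, and
# file names are placeholders.

def sts_url(voice_id: str) -> str:
    # Assumed endpoint layout of the ElevenLabs speech-to-speech API.
    return f"https://api.elevenlabs.io/v1/speech-to-speech/{voice_id}"

def change_voice(input_wav: str, voice_id: str, api_key: str,
                 output_path: str = "changed_voice.mp3") -> str:
    # Imported lazily so the sketch loads even without requests installed.
    import requests
    with open(input_wav, "rb") as f:
        resp = requests.post(
            sts_url(voice_id),
            headers={"xi-api-key": api_key},
            files={"audio": f},
            # Assumed multilingual speech-to-speech model id; it preserves
            # the timing and tone of the acted take.
            data={"model_id": "eleven_multilingual_sts_v2"},
        )
    resp.raise_for_status()
    with open(output_path, "wb") as out:
        out.write(resp.content)
    return output_path
```

The acted take goes in, and the same performance comes back in the target voice, which is exactly the gender/age/style translation described above.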
audio driven approach
We can use audio generative tools as the guiding element for the acting and animation altogether. This comes in handy, as we can script and design the spoken word very precisely and try different variations without getting on the actors' nerves 🙂 – The crucial part here is the reenactment of the audio by a real person. Compared with the exclusively audio driven lipsync approach, we see a lot more depth and relatability in the performance.
#1 – generating authentic speech sample
With generators like BARK, F5-TTS, or ElevenLabs, we can generate spoken word in a pretty relatable manner across a wide range of languages and voice variations. For this example, we take an interview snippet in German with a quick laugh at the beginning. The sample is generated with BARK/SUNO, which can also output non-speech sounds like laughter. https://huggingface.co/spaces/suno/bark
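The BARK step can be sketched in a few lines with the open-source `bark` package. BARK accepts non-speech tokens such as `[laughter]` directly in the prompt, which is how the quick laugh at the start can be produced. The German line below is a hypothetical example, not the original interview text, and the model weights download on first use.

```python
# Sketch: generating a German speech sample with a leading laugh via BARK
# (suno-ai/bark). Assumes the `bark` and `scipy` packages are installed.

def build_prompt(text: str, laugh_first: bool = True) -> str:
    # BARK understands non-speech tokens like [laughter] inside the prompt.
    return ("[laughter] " if laugh_first else "") + text

def synthesize(prompt: str, out_path: str = "sample_de.wav") -> str:
    # Heavy imports kept inside the function so the sketch loads without bark.
    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav
    preload_models()                  # downloads/caches the BARK models
    audio = generate_audio(prompt)    # numpy float array at SAMPLE_RATE
    write_wav(out_path, SAMPLE_RATE, audio)
    return out_path

# usage (not run here, needs the models):
# synthesize(build_prompt("Ja, das war wirklich ein verrückter Moment."))
```

Because the prompt is plain text, it is easy to script several variations of the line and pick the most relatable take.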
#2 – creating a suitable scene for the audio
Depending on the content or context of the audio, we can generate a suitable keyframe/scene that fits the expressive needs of the audio. This example is generated with FLUX.1 [dev] with a custom trained LoRA.
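For reference, the keyframe generation can be reproduced with Hugging Face `diffusers`, which ships a pipeline for FLUX.1 [dev] and supports loading LoRA weights. The LoRA path and prompt below are hypothetical placeholders; the pipeline needs a GPU and access to the gated model, so nothing runs at import time.

```python
# Sketch: rendering a keyframe with FLUX.1 [dev] plus a custom-trained LoRA
# using diffusers. Assumes `diffusers` and `torch` are installed and the
# gated model "black-forest-labs/FLUX.1-dev" is accessible.

def render_keyframe(prompt: str, lora_path: str,
                    out_path: str = "keyframe.png") -> str:
    import torch
    from diffusers import FluxPipeline
    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
    ).to("cuda")
    pipe.load_lora_weights(lora_path)  # custom LoRA, e.g. a trained character
    image = pipe(prompt, num_inference_steps=28,
                 guidance_scale=3.5).images[0]
    image.save(out_path)
    return out_path

# usage (not run here, needs a GPU):
# render_keyframe("portrait of the interviewee, warm window light",
#                 "my_character_lora.safetensors")
```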
#3 – exclusive audio driven lip sync approach
Generative tools like Runway's Lip Sync use audio information to drive facial animation in a video. This comes in quite handy when a quick result is needed. However, this approach lacks relatable expressions, as the AI model has little understanding of which facial expression suits the audio context.
#4 – acted video driven approach
The acted video approach gives a lot more expressive control over the animation. We can use the audio to guide the acting, but leave the emotional interpretation to the actor.