Alibaba's AI video generator just dunked on Sora by making the Sora lady sing

An AI-generated woman sings a song

Alibaba wants you to compare its new AI video generator to OpenAI's Sora. Otherwise, why use it to make Sora's most famous creation belt out a Dua Lipa song?

On Tuesday, an organization called the "Institute for Intelligent Computing" within the Chinese e-commerce juggernaut Alibaba released a paper about an intriguing new AI video generator it has developed that's shockingly good at turning still images of faces into passable actors and charismatic singers. The system is called EMO, a fun backronym supposedly drawn from the words "Emotive Portrait Alive" (though, in that case, why is it not called "EPO"?).

EMO is a peek into a future where a system like Sora makes video worlds, and rather than being populated by attractive mute people just kinda looking at each other, the "actors" in these AI creations say stuff — or even sing.

Alibaba put demo videos on GitHub to show off its new video-generating framework. These include a video of the Sora lady — famous for walking around AI-generated Tokyo just after a rainstorm — singing "Don't Start Now" by Dua Lipa and getting pretty funky with it.

The demos also reveal how EMO can, to cite one example, make Audrey Hepburn speak the audio from a viral clip of Riverdale's Lili Reinhart talking about how much she loves crying. In that clip, Hepburn's head maintains a rather soldier-like upright position, but her whole face — not just her mouth — really does seem to emote the words in the audio. 

In contrast to this uncanny version of Hepburn, Reinhart in the original clip moves her head a whole lot, and she also emotes quite differently, so EMO doesn't seem to be a riff on the sort of AI face-swapping that went viral back in the mid-2010s and led to the rise of deepfakes in 2017.  

Over the past few years, applications designed to generate facial animation from audio have cropped up, but they haven't been all that inspiring. For instance, the NVIDIA Omniverse software package touts an app with an audio-to-facial-animation framework called "Audio2Face" — which relies on 3D animation for its outputs rather than simply generating photorealistic video like EMO.

Although Audio2Face is only two years old, the EMO demo makes it look like an antique. In a video that purports to show off Audio2Face's ability to mimic emotions while talking, the 3D face it depicts looks more like a puppet in a facial expression mask, while EMO's characters seem to express the shades of complex emotion that come across in each audio clip.

It's worth noting at this point that, as with Sora, we're assessing this AI framework based on a demo provided by its creators, and we don't actually have our hands on a usable version that we can test. So it's tough to imagine that, right out of the gate, this piece of software can churn out such convincingly human facial performances based on audio without significant trial and error or task-specific fine-tuning.

The characters in the demos mostly aren't expressing speech that calls for extreme emotions (faces screwed up in rage, or melting down in tears, for instance), so it remains to be seen how EMO would handle heavy emotion with audio alone as its guide. What's more, despite being made in China, it's depicted as a total polyglot, capable of picking up on the phonetics of English and Korean, and making the faces form the appropriate phonemes with decent, though far from perfect, fidelity. In other words, it would be nice to see how well EMO performed if you fed it audio of a very angry person speaking a lesser-known language.

Also fascinating are the little embellishments between phrases — pursed lips or a downward glance — that insert emotion into the pauses rather than just the times when the lips are moving. These are examples of how a real human face emotes, and it's tantalizing to see EMO get them so right, even in such a limited demo.  

According to the paper, EMO's model relies on a large dataset of audio and video (once again: from where?) to give it the reference points necessary to emote so realistically. Its diffusion-based approach apparently doesn't involve an intermediate step in which 3D models do part of the work. Instead, the model pairs a reference-attention mechanism with a separate audio-attention mechanism, producing characters whose facial animations match what comes across in the audio while remaining true to the facial characteristics of the provided base image.
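
For the curious, here is a toy sketch of what pairing two attention mechanisms can look like. It is not EMO's code, and nothing in it comes from the paper: the function names (`attend`, `fuse`), the tiny hand-written feature vectors, and the choice to simply add the two attention outputs together are all assumptions made for illustration. The idea is only that each video frame "looks at" the reference image (to keep the face consistent) and, separately, at the audio (to decide how the face should move).

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, context):
    """Scaled dot-product attention.

    Simplification (an assumption of this sketch): the context vectors
    act as both keys and values, with no learned projections.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Similarity between this query and every context vector.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in context
        ]
        weights = softmax(scores)
        # Weighted average of the context vectors.
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, context)) for i in range(dim)]
        )
    return outputs


def fuse(frame_feats, reference_feats, audio_feats):
    """Hypothetical fusion step: frame features attend to the reference
    image and to the audio separately, and both results are added back."""
    ref_out = attend(frame_feats, reference_feats)  # identity cues
    aud_out = attend(frame_feats, audio_feats)      # motion cues
    return [
        [f + r + a for f, r, a in zip(fv, rv, av)]
        for fv, rv, av in zip(frame_feats, ref_out, aud_out)
    ]


# One frame, one reference vector, one audio vector (all made up).
print(fuse([[0.0, 0.0]], [[1.0, 2.0]], [[3.0, 4.0]]))  # [[4.0, 6.0]]
```

With only one reference vector and one audio vector, each attention step has nothing to choose between, so the frame simply picks up both vectors in full. A real system would have many vectors on each side and learned weights deciding which ones matter for each frame.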

It's an impressive collection of demos, and after watching them it's impossible not to imagine what's coming next. But if you make your money as an actor, try not to imagine too hard, because things get pretty disturbing pretty quick.  



from Mashable https://ift.tt/hFftL18