
Meta says its Movie Gen represents a “real” advance in AI video generation

Meta Platforms

How fake, or how real, is the growing stream of videos produced by artificial intelligence (AI)?

It turns out there is a quantitative measure for this, or at least almost one: humans still have to decide, based on their own perception, whether a video is good or not.

Also: New Meta Ray-Ban AI features are being introduced, making the smart glasses even more tempting

Meta Platforms CEO Mark Zuckerberg announced on Friday a new AI model called Movie Gen that can generate high-definition videos (1080p resolution) from a text prompt. The company says these videos are, on average, rated "more realistic" than videos created with competing technology, such as OpenAI's Sora text-to-video model.

It can also generate synchronized audio, personalize a video so that a particular person's face appears in it, and automatically edit a video with nothing more than a text prompt, such as "Dress the penguins in Victorian outfits," to change how the penguins on screen look.

Also: OpenAI introduces its text-to-video model, and the results are amazing. See for yourself

In the accompanying paper, "Movie Gen: A Cast of Media Foundation Models," Meta AI researchers describe how they had people rate the realism of the AI-generated videos:

Realness: This measures which of the compared videos is more similar to a real video. For fantastical prompts that are not part of the training-set distribution (e.g., depicting fantasy creatures or surreal scenes), we define realness as mimicking a clip from a movie that follows a realistic art style. We also ask the evaluators to provide a reason for their choice, e.g., "the appearance of the subject is more realistic" or "the motion is more realistic."

There is also an accompanying blog post.

The human tests produce a win/loss rate for Movie Gen compared with Sora and three other well-known text-to-video AI models: Runway Gen3, LumaLabs, and Kling 1.5.

Also: The best AI image generators of 2024

The authors note that it is not yet possible to make good comparisons automatically. Moreover, "judging realness and aesthetics depends heavily on human perception and preference," they write.

Movie Gen sample output.

Meta Platforms

It's not just realism: how good the motion in a video is, whether parts of an action are skipped or distorted, and how closely the video matches the text prompt entered are all things that simply can't be automated, they point out.

“We find that existing automated metrics struggle to provide reliable results, reinforcing the need for human assessment.”

The benchmark measures the extent to which "people prefer the results of our model over competing industry models," the paper says, expressed as a "net win rate" percentage.

Also: These Meta Ray-Ban smart glasses are my favorite Prime Day deal so far

On average, the net win rate against Sora is 11.62%. The net win rates against the other models are much higher.

“These significant net gains demonstrate Movie Gen Video’s ability to simulate the real world with generated videos that respect physics, with motion that is both appropriate in scale and consistent and without distortion.”
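A net win rate of this kind is conventionally the percentage of head-to-head comparisons a model wins minus the percentage it loses. Here is a rough illustration with made-up numbers; this is not Meta's evaluation code, and the counts are hypothetical:

```python
def net_win_rate(wins: int, losses: int, ties: int) -> float:
    """Net win rate: percentage of pairwise comparisons won minus percentage lost.

    Zero means raters prefer the two models equally often; positive values
    favor the first model, negative values the second.
    """
    total = wins + losses + ties
    return 100.0 * (wins - losses) / total

# Hypothetical numbers for illustration only (not Meta's data):
# out of 1,000 A/B judgments, raters prefer Movie Gen 450 times,
# prefer the competing model 334 times, and call 216 a tie.
print(net_win_rate(wins=450, losses=334, ties=216))  # -> 11.6
```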

They offer some example frames of videos compared directly with Sora. According to the authors, "OpenAI Sora can tend to generate less realistic videos (e.g., the cartoon-like kangaroo in the second row), which may lack the motion details described in the text prompt (e.g., the robot that does not walk in the bottom row)."

Movie Gen output compared with OpenAI's Sora.

Meta Platforms

The authors built the AI model for Movie Gen from what they call a "cast of foundation models."

Also: Surprisingly, Meta is suddenly crushing Apple in the innovation battle

In the training phase, images and videos from a mix of public and licensed datasets are compressed until the model learns to reproduce the data's pixels efficiently, the authors report. As they put it: "We encode the RGB pixel space videos and images into a learned spatiotemporal compressed latent space using a Temporal Autoencoder (TAE) and learn to generate videos in this latent space."
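To make that concrete, here is a deliberately tiny, hypothetical sketch in PyTorch of a temporal autoencoder that compresses an RGB video along time and space into a latent tensor and reconstructs the pixels from it. It is only a schematic stand-in for the TAE described in the paper, not Meta's model:

```python
import torch
import torch.nn as nn

class TinyTemporalAutoencoder(nn.Module):
    """Toy stand-in for a temporal autoencoder (TAE): it maps RGB video
    of shape (B, 3, T, H, W) into a smaller spatiotemporal latent and back.
    The real model is far larger and trained on licensed/public data."""
    def __init__(self, latent_channels: int = 8):
        super().__init__()
        # Encoder: downsample time, height, and width by 2x each (8x fewer voxels).
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv3d(32, latent_channels, kernel_size=3, stride=1, padding=1),
        )
        # Decoder: mirror the encoder to reconstruct pixels from the latent.
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_channels, 32, kernel_size=4, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv3d(32, 3, kernel_size=3, stride=1, padding=1),
        )

    def forward(self, video: torch.Tensor):
        latent = self.encoder(video)   # compressed spatiotemporal representation
        recon = self.decoder(latent)   # reconstructed RGB video
        return latent, recon

# 16 frames of 64x64 RGB video; the latent comes out as 8 channels x 8 x 32 x 32.
video = torch.randn(1, 3, 16, 64, 64)
tae = TinyTemporalAutoencoder()
latent, recon = tae(video)
loss = nn.functional.mse_loss(recon, video)  # reconstruction objective
print(latent.shape, recon.shape, loss.item())
```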

Meta used several steps not only to generate videos, but also to sync audio, personalize, and edit them.

Meta Platforms

This video generation is then "conditioned" on text inputs, so that the model produces videos that match the text prompts.

Together, the parts create a model with 30 billion parameters – not huge by today's training standards.

Also: Meta's new $299 Quest 3S is the VR headset most people should buy this holiday season

A second neural network, called "Movie Gen Audio," produces high-fidelity audio, although for sound effects and music rather than for speech. It is based on an existing approach called a "diffusion transformer" and has 13 billion parameters.

This all requires a lot of processing power: “6,144 H100 GPUs, each running at 700W TDP and 80GB HBM3, using Meta's Grand Teton AI server platform.”
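For scale, 6,144 GPUs at 700 watts apiece works out to roughly 4.3 megawatts of peak GPU power draw, and 6,144 x 80GB comes to about 490 terabytes of combined HBM3 memory.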

Generating videos isn't all Movie Gen does. As a next step, the authors also subject the model to additional training to create "personalized" videos, in which a particular person's face can be made to appear in the clip.

Also: ChatGPT is by far the most searched AI tool, but number two is surprising

They also add one final component: the ability to edit videos with just a text prompt. The problem the authors faced is that "video editing models are hampered by the lack of supervised video editing data," meaning there are not enough examples to train the AI model on.

To get around this, the team took the Movie Gen AI model and modified it in several steps. First, they used data from image editing to simulate what happens when editing video frames. They incorporated this into the model's training at the same time as the original text-to-video training, so that the AI model develops the ability to coordinate single-frame editing across multiple video frames.

In the next stage, the authors feed the model a video, a text caption (e.g., "A person is walking down the street"), and an edited video, and train the model to produce the instruction that would turn the original video into the edited one. In other words, they force the AI model to associate instructions with edited videos.
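One way to picture that stage, as a hypothetical sketch rather than Meta's actual pipeline, is as building supervised examples in which the before/after clips and the caption are the input and the edit instruction is the training target:

```python
from dataclasses import dataclass

@dataclass
class EditBacktranslationExample:
    """One training example for the 'produce the instruction' stage.
    Names here are hypothetical; the paper describes the idea, not this API."""
    original_video: str      # path or ID of the unedited clip
    caption: str             # e.g. "A person is walking down the street"
    edited_video: str        # path or ID of the edited clip
    target_instruction: str  # what the model must learn to generate

def build_examples(records):
    """Turn raw (video, caption, edited video, instruction) records into
    supervised examples: the clip pair plus caption are the input, and the
    instruction explaining the difference is the target."""
    return [
        EditBacktranslationExample(
            original_video=rec["video"],
            caption=rec["caption"],
            edited_video=rec["edited_video"],
            target_instruction=rec["instruction"],
        )
        for rec in records
    ]

# Illustrative record only; not real Movie Gen data.
demo = build_examples([{
    "video": "clip_001.mp4",
    "caption": "A person is walking down the street",
    "edited_video": "clip_001_edited.mp4",
    "instruction": "Make it snow while the person walks",
}])
print(demo[0].target_instruction)
```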

Also: The 4 biggest challenges of AI-generated code that Gartner didn't include in its latest report

To test video editing capability, the authors created a new benchmark based on 51,000 videos collected by Meta researchers. They also hired crowd workers to devise editing instructions.

To evaluate the edited videos, the Meta team asked human reviewers to rate which video was better: one created with their AI model or one created with the current state of the art. They also used automated measures to compare the before and after videos in each task.

Also: These AI avatars now have human-like expressions

“Human raters prefer Movie Gen Edit to all baselines by a significant margin,” the authors write.

Across all of these steps, the authors emphasize coordinating the scale of the AI models, the data, and the compute used. "We find that scaling the training data, computation and model parameters of a simple Transformer-based model trained with flow matching produces high-quality generative models for video or audio."
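Flow matching itself is a general technique: the model learns to predict the "velocity" that carries a noisy latent back toward a real data latent along a straight path, conditioned here on a text embedding. The sketch below is a generic toy version of one such training step, my own simplification and not Meta's 30-billion-parameter model:

```python
import torch
import torch.nn as nn

class LatentVelocityModel(nn.Module):
    """Toy stand-in for the Transformer backbone: predicts the flow 'velocity'
    for a noisy latent, given a timestep and a text embedding. Shapes are tiny."""
    def __init__(self, latent_dim: int = 64, text_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + text_dim + 1, 256),
            nn.SiLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, x_t, t, text_emb):
        # Condition on the prompt by concatenating its embedding (real models
        # typically use cross-attention rather than concatenation).
        return self.net(torch.cat([x_t, text_emb, t], dim=-1))

def flow_matching_step(model, optimizer, latents, text_emb):
    """One rectified-flow / flow-matching training step (schematic)."""
    noise = torch.randn_like(latents)        # x_1 ~ N(0, I)
    t = torch.rand(latents.shape[0], 1)      # random time in [0, 1]
    x_t = (1 - t) * latents + t * noise      # point on the straight path
    target_velocity = noise - latents        # d x_t / d t along that path
    pred = model(x_t, t, text_emb)
    loss = nn.functional.mse_loss(pred, target_velocity)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = LatentVelocityModel()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
# Fake batch: 4 flattened video latents and 4 prompt embeddings, for illustration.
print(flow_matching_step(model, opt, torch.randn(4, 64), torch.randn(4, 32)))
```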

However, the authors admit that human assessments have their pitfalls. “Defining objective criteria for evaluating model generations using human evaluations remains a challenge, and therefore human evaluations can be influenced by a number of other factors such as personal biases, backgrounds, etc.”

Also: Pearson introduces new AI certification – with a focus on practical use in the workplace

The paper offers no suggestions for how to deal with these human biases. However, Meta says it will release a benchmark test for others to use, without committing to a time frame:

To thoroughly evaluate video generations, we propose and hope to publish a benchmark, Movie Gen Video Bench, consisting of 1000 prompts covering all the various testing aspects summarized above. Our benchmark is more than 3× larger than the prompt sets used in previous work.

The company also promised to eventually make its videos available for public viewing: "To enable a fair and easy comparison with Movie Gen Video for future work, we hope to publicly release our non-cherry-picked generated videos for the Movie Gen Video Bench prompt set."

Also: Can synthetic data solve AI's privacy concerns? This company is betting on it

According to Meta, the Movie Gen model has not yet been deployed. Indeed, the authors write that all the AI models "require several improvements before they can be deployed." For example, videos generated by the model "still suffer from issues such as artifacts in generated or edited videos around complex geometry, manipulation of objects, object physics, state transformations, etc." The audio "is sometimes out of sync when the movements are dense," as in a video of tap dancing, for example.

Despite these limitations, Movie Gen points the way toward what could one day be a complete video creation and editing suite, and perhaps even a personalized video podcast featuring your own likeness.