Meta joins the AI video war with its powerful Movie Gen models


Meta founder and CEO Mark Zuckerberg, who built the company on the back of his successful social network Facebook, capped off this week by posting a video to his personal Instagram (a social network Facebook acquired back in 2012) of himself performing a leg press exercise on a machine at the gym.

In the video, however, the leg press machine transforms into a neon cyberpunk version, an ancient Roman version, and a version engulfed in gold flames.

As it turns out, Zuck was doing more than just working out: he used the video to announce Movie Gen, Meta's new family of generative multimodal AI models that can create both video and audio from text prompts, and that let users customize their own videos by adding special effects, props, and costumes, or changing selected elements simply through text guidance, as Zuck did in his video.

The models appear to be extremely powerful, allowing users to modify only selected elements of a video clip rather than “reshooting” or regenerating the whole thing, similar to Pika's spot editing on older models, but with longer clip generation and integrated sound.

Meta's testing, outlined in a technical paper on the model family released today, shows it outperforming leading competitors in the space, including Runway Gen 3, Luma Dream Machine, OpenAI's Sora and Kling 1.5, on many human evaluations of attributes such as consistency and “naturalness” of motion.

Meta has positioned Movie Gen as a tool for both everyday users looking to improve their digital storytelling and professional video artists and editors, even Hollywood filmmakers.

Movie Gen represents Meta's latest advancement in generative AI technology, combining video and audio capabilities into a single system.

Specifically, Movie Gen consists of four models:

1. Movie Gen Video – a 30B-parameter text-to-video generation model

2. Movie Gen Audio – a 13B-parameter video-to-audio generation model

3. Personalized Movie Gen Video – a version of Movie Gen Video post-trained to generate personalized videos based on a person's face

4. Movie Gen Edit – a model with a novel post-training method for precise video editing

These models enable the creation of realistic, personalized HD videos of up to 16 seconds at 16 FPS as well as 48 kHz audio and provide video editing capabilities.
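
To make the division of labor concrete, here is a hypothetical sketch of how the four components could be chained to produce a single clip. None of the function names below come from a public Meta API; they are placeholders standing in for the models listed above.

```python
from typing import Any, Callable, Optional

def generate_clip(
    prompt: str,
    text_to_video: Callable[[str], Any],           # stand-in for Movie Gen Video (30B)
    video_to_audio: Callable[[Any], Any],          # stand-in for Movie Gen Audio (13B)
    personalize: Optional[Callable[[str, Any], Any]] = None,  # Personalized Movie Gen Video
    edit: Optional[Callable[[Any, str], Any]] = None,         # Movie Gen Edit
    reference_face: Any = None,
    edit_instruction: Optional[str] = None,
):
    """Hypothetical pipeline: generate (or personalize) a video from text,
    optionally apply a text-guided edit, then add synchronized audio."""
    if personalize is not None and reference_face is not None:
        video = personalize(prompt, reference_face)   # preserve the person's identity across frames
    else:
        video = text_to_video(prompt)                 # plain text-to-video generation
    if edit is not None and edit_instruction is not None:
        video = edit(video, edit_instruction)         # localized or global text-guided changes
    audio = video_to_audio(video)                     # sound effects and music synced to the visuals
    return video, audio
```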

Designed to handle tasks ranging from personalized video creation to sophisticated video editing and high-quality audio generation, Movie Gen leverages powerful AI models to expand users' creative options.

Key features of the Movie Gen suite include:

Video generation: Movie Gen allows users to produce high resolution (HD) videos by simply entering text prompts. These videos can be rendered at 1080p resolution, are up to 16 seconds long and are powered by a 30 billion parameter transformer model. The AI's ability to manage detailed prompts allows it to handle various aspects of video creation, including camera movement, object interactions, and environmental physics.

Personalized Videos: Movie Gen offers an exciting personalized video feature that allows users to upload an image of themselves or others to be featured in AI-generated videos. The model can adapt to different prompts while preserving the individual's identity, making it useful for creating customized content.

Precise video editing: The Movie Gen suite also includes advanced video editing features that allow users to change specific elements within a video. The model can make localized changes, such as to objects or colors, as well as global changes, such as swapping out a background, all based on simple text instructions.

Audio generation: In addition to its video capabilities, Movie Gen includes an audio generation model with 13 billion parameters. It enables the creation of sound effects, ambient music, and synchronized audio that works seamlessly with visual content. Users can create Foley sounds (sound effects that recreate everyday real-world noises, such as the rustling of fabric or the echo of footsteps), instrumental music, and other audio elements up to 45 seconds in length. Meta posted a sample video of Foley sounds below (turn up the volume to hear it):

Trained on 100 million videos and 1 billion images

Movie Gen is the latest advancement in Meta's ongoing AI research efforts. To train the models, Meta says it relied on “internet-scale image, video and audio data,” specifically 100 million videos and 1 billion images, from which the models “learn about the visual world by ‘watching’ videos,” according to the technical paper.

However, Meta did not specify in the paper whether the data was licensed, in the public domain, or simply scraped, as many other AI model makers have done. That practice has drawn criticism from artists and creators such as YouTuber Marques Brownlee (MKBHD) and, in the case of AI video model provider Runway, a class-action copyright infringement lawsuit from creators (which is still making its way through the courts). It is therefore safe to assume that Meta will quickly face criticism over its data sources.

Leaving aside the legal and ethical questions surrounding training, Meta clearly positions the Movie Gen creation process as novel, combining typical diffusion model training (commonly used in video and audio AI) with large language model (LLM) training and a newer technique called “flow matching,” which relies on modeling changes in a dataset's distribution over time.

At each step, the model learns to predict the velocity at which samples should “move” toward the target distribution. Flow matching differs from standard diffusion-based models in a few key ways (a rough code sketch of the training objective follows the list below):

Zero terminal signal-to-noise ratio (SNR): Unlike traditional diffusion models, which require special noise schedules to achieve a zero terminal SNR, flow matching inherently ensures zero terminal SNR without additional adjustments. This provides robustness to the choice of noise schedule and contributes to more consistent, higher-quality video outputs.

Efficiency in training and inference: Flow matching proves more efficient than diffusion models in terms of both training and inference. It offers flexibility in the type of noise schedule used and demonstrates improved performance across a range of model sizes. This approach has also shown better agreement with human evaluation results.
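
As a rough illustration of that training objective, the sketch below shows a minimal flow-matching update step in PyTorch. This is not Meta's code: the model, its call signature and the latent video tensors are stand-ins for the components described in the paper.

```python
import torch
import torch.nn.functional as F

def flow_matching_step(model, x1, optimizer):
    """One minimal flow-matching update. `x1` is a batch of clean (latent)
    video samples; `model(xt, t)` is assumed to return a predicted velocity."""
    x0 = torch.randn_like(x1)                   # pure-noise starting points
    # one random timestep per sample, broadcastable over the remaining dims
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)), device=x1.device)
    xt = (1 - t) * x0 + t * x1                  # straight-line path from noise to data
    target_velocity = x1 - x0                   # constant velocity along that path
    pred_velocity = model(xt, t.flatten())      # model predicts how samples should "move"
    loss = F.mse_loss(pred_velocity, target_velocity)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At generation time, the process runs in the other direction: starting from pure noise and integrating the predicted velocity field step by step (for example with a simple ODE solver) until the sample reaches the data distribution.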

The Movie Gen system's training process focuses on maximizing flexibility and quality in both video and audio production. It is based on two main models, each with extensive training and fine-tuning procedures:

Movie Gen Video model: This 30-billion-parameter model starts with basic text-to-image generation and then progresses to text-to-video, creating videos up to 16 seconds long in HD quality. The training process draws on a large dataset of videos and images, allowing the model to understand complex visual concepts such as motion, interactions and camera dynamics. To improve the model's capabilities, Meta fine-tuned it on a curated set of high-quality videos with text captions, which improved the realism and precision of its results. The team further expanded the model's flexibility by training it to handle personalized content and editing commands.

Movie Gen Audio model: With 13 billion parameters, this model generates high-quality audio that is synchronized with the visual elements of a video. The training set included over a million hours of audio, allowing the model to learn both the physical and psychological associations between sound and image. Meta improved the model through supervised fine-tuning on selected high-quality audio-text pairs. This process helped it produce realistic ambient sounds, synchronized sound effects, and mood-matched background music for a variety of video scenes.

Movie Gen follows previous projects such as Make-A-Scene and the Llama Image models, which focused on producing high-quality images and animations.

This release marks the third major milestone in Meta's journey to generative AI and underscores the company's commitment to pushing the boundaries of media creation tools.

Launching on Insta in 2025

Set to debut on Instagram in 2025, Movie Gen is poised to make advanced video creation more accessible to the platform's broad user base.

While the models are currently in the research phase, Meta is optimistic that Movie Gen will enable users to produce compelling content with ease.

As the product evolves, Meta intends to work with developers and filmmakers to refine Movie Gen's features and ensure it meets users' needs.

Meta's long-term vision for Movie Gen reflects a broader goal of democratizing access to sophisticated video editing tools. While the suite offers significant potential, Meta recognizes that generative AI tools like Movie Gen are intended to enhance, not replace, the work of professional artists and animators.

As Meta prepares to launch Movie Gen, the company remains focused on refining the technology and addressing its current limitations. Further optimizations aimed at improving inference time and expanding the models' capabilities are planned. Meta has also hinted at possible future applications, such as creating custom animated greetings or short films based entirely on user input.

The release of Movie Gen could usher in a new era for content creation on Meta's platforms, with Instagram users among the first to experience this innovative tool. As the technology advances, Movie Gen could become an important part of both the Meta ecosystem and the broader creator ecosystem, for professionals and indie producers alike.