
Meta's Groundbreaking AI Film Maker: Make-A-Video

October 01, 2022

Meta AI’s new model, Make-A-Video, is out, and in a single sentence: it generates videos from text. It’s not only able to generate videos, but it’s also the new state-of-the-art method, producing higher-quality and more coherent videos than ever before. This is all information you must’ve seen already on a news website or just by reading the title of the article, but what you don’t know yet is what it is exactly and how it works.


Louis Bouchard

I explain Artificial Intelligence terms and news to non-experts.


You can see this model as a Stable Diffusion model for videos: surely the next step after being able to generate images.

Here's how...

References

► Read the full article: https://www.louisbouchard.ai/make-a-video/
► Meta's blog post: https://ai.facebook.com/blog/generative-ai-text-to-video/
► Singer et al. (Meta AI), 2022, "Make-A-Video: Text-to-Video Generation without Text-Video Data", https://makeavideo.studio/Make-A-Video.pdf
► Make-A-Video (official page): https://makeavideo.studio/?fbclid=IwAR0tuL9Uc6kjZaMoJHCngAMUNp9bZbyhLmdOUveJ9leyyfL9awRy4seQGW4
► PyTorch implementation: https://github.com/lucidrains/make-a-video-pytorch
► My newsletter (a new AI application explained weekly in your inbox!): https://www.louisbouchard.ai/newsletter/

Video Transcript

Meta AI's new model, Make-A-Video, is out, and in a single sentence: it generates videos from text. It's not only able to generate videos, but it's also the new state-of-the-art method, producing higher-quality and more coherent videos than ever. You can see this model as a Stable Diffusion model for videos, surely the next step after being able to generate images. This is all information you must have seen already on a news website or just by reading the title of the video, but what you don't know yet is what it is exactly and how it works.

Make-A-Video is the most recent publication by Meta AI, and it allows you to generate a short video out of textual inputs, just like this. So you are adding complexity to the image generation task by not only having to generate multiple frames of the same subject and scene, but those frames also have to be coherent in time. You cannot simply generate 60 images using DALL·E and stitch them into a video: it will just look bad and nothing close to realistic. You need a model that understands the world in a better way and leverages this level of understanding to generate a coherent series of images that blend well together. You basically want to simulate a world and then simulate recordings of it.

But how can you do that? Typically, you would need tons of text-video pairs to train your model to generate such videos from textual input, but not in this case. Since this kind of data is really difficult to get and the training costs are super expensive, they approached the problem differently. Another way is to take the best text-to-image model and adapt it to videos, and that is what Meta AI did in a research paper they just released. In their case, the text-to-image model is another model by Meta called Make-A-Scene, which I covered in a previous video if you'd like to learn more about it.

But how do you adapt such a model to take time into consideration? You add a spatiotemporal pipeline for your model to be able to process videos. This means that the model will not only generate one image but, in this case, 16 of them at low resolution, to create a short, coherent video. It works in a similar manner to a text-to-image model, but adds a one-dimensional convolution along with the regular two-dimensional one. This simple addition allows them to keep the pre-trained two-dimensional convolutions unchanged and add a temporal dimension that they train from scratch, reusing most of the code and model parameters from the image model they started from.
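The factorized convolution described above can be sketched in a few lines. This is a toy NumPy illustration of the pattern (a frozen, pre-trained 2D spatial convolution applied to each frame independently, followed by a new 1D temporal convolution at each spatial location), not Meta's actual implementation; the kernel values and sizes are made up for the example.

```python
import numpy as np

def spatial_conv2d(frames, k2d):
    """Apply a 2D convolution (valid padding) to each frame on its own,
    exactly as the pre-trained image model would. Time is untouched."""
    T, H, W = frames.shape
    kh, kw = k2d.shape
    out = np.zeros((T, H - kh + 1, W - kw + 1))
    for t in range(T):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[t, i, j] = np.sum(frames[t, i:i + kh, j:j + kw] * k2d)
    return out

def temporal_conv1d(frames, k1d):
    """New 1D convolution along the time axis at every spatial location,
    trained from scratch; the only part the 2D model did not already have."""
    kt = len(k1d)
    out = np.zeros((frames.shape[0] - kt + 1,) + frames.shape[1:])
    for t in range(out.shape[0]):
        # Weighted sum of kt consecutive frames at every pixel.
        out[t] = np.tensordot(k1d, frames[t:t + kt], axes=(0, 0))
    return out

# 16 low-resolution frames, as in the model's first stage (size made up).
video = np.random.rand(16, 8, 8)
k2d = np.ones((3, 3)) / 9.0         # stand-in for a pre-trained spatial kernel
k1d = np.array([0.25, 0.5, 0.25])   # stand-in for a learned temporal kernel

h = spatial_conv2d(video, k2d)   # (16, 6, 6): space mixed, time untouched
y = temporal_conv1d(h, k1d)      # (14, 6, 6): now frames talk to each other
print(h.shape, y.shape)
```

With a temporal kernel of [0, 1, 0], the temporal step is the identity on the inner frames, which is one way to see why the pre-trained spatial weights can be reused unchanged.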

We also want to guide our generations with text input, which works very much like image models using CLIP embeddings, a process I go into in detail in my Stable Diffusion video if you are not familiar with it. But they also add the temporal dimension when blending the text features with the image features, doing the same thing: keeping the attention module I described in my Make-A-Scene video and adding a one-dimensional attention module for temporal considerations. They are basically copy-pasting the image generator model and duplicating the generation modules for one more dimension to produce all 16 initial frames.
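The one-dimensional attention idea can be sketched in the same spirit. Below is a toy single-head NumPy version with made-up dimensions, not the actual model: every spatial position attends only over its own 16 frames along the time axis, so the pre-trained spatial attention of the image model is left untouched.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(x, wq, wk, wv):
    """Self-attention along the time axis only.
    x: (T, P, D) — T frames, P spatial positions, D channels.
    Each spatial position attends over its own T frames independently."""
    q, k, v = x @ wq, x @ wk, x @ wv             # (T, P, D) each
    # scores[t, s, p]: how much frame t attends to frame s at position p.
    scores = np.einsum('tpd,spd->tsp', q, k) / np.sqrt(x.shape[-1])
    attn = softmax(scores, axis=1)               # normalize over source frames
    return np.einsum('tsp,spd->tpd', attn, v)    # (T, P, D)

T, P, D = 16, 64, 32       # 16 frames, 8x8 positions flattened, 32 channels
rng = np.random.default_rng(0)
x = rng.normal(size=(T, P, D))
wq, wk, wv = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
y = temporal_attention(x, wq, wk, wv)
print(y.shape)  # (16, 64, 32)
```

The key design point it mirrors is that the new module never mixes spatial positions, so it can be bolted onto a frozen image model and trained from scratch on its own.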

But what can you do with 16 frames? Well, nothing really interesting. We need to make a high-definition video out of those frames. The model does that by having access to previous and future frames and iteratively interpolating from them, in terms of both the temporal and spatial dimensions at the same time. Basically, it generates new and larger frames in between the initial 16 frames, based on the frames before and after them, which makes the movement coherent and the overall video fluid. This is done using a frame interpolation network, which I also described in other videos, that will take the images we have and fill in the gaps, generating the in-between information. It does the same thing for the spatial component, enlarging the images and filling in the pixel gaps to make them higher definition.
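In the simplest possible terms, those two steps look like the sketch below, which just blends neighboring frames linearly in time and enlarges them with nearest-neighbor upsampling. The real system uses a learned frame interpolation network and a learned super-resolution model, so this only illustrates the idea and the shapes involved (16 frames in, more and larger frames out); all sizes are made up.

```python
import numpy as np

def interpolate_time(frames):
    """Insert one new frame between every consecutive pair by averaging
    them — a crude stand-in for the learned frame interpolation network."""
    out = []
    for t in range(frames.shape[0] - 1):
        out.append(frames[t])
        out.append((frames[t] + frames[t + 1]) / 2.0)  # in-between frame
    out.append(frames[-1])
    return np.stack(out)                 # (2T - 1, H, W)

def upsample_space(frames, factor=2):
    """Enlarge every frame with nearest-neighbor upsampling — a crude
    stand-in for the learned spatial super-resolution model."""
    return frames.repeat(factor, axis=1).repeat(factor, axis=2)

video = np.random.rand(16, 64, 64)     # 16 low-resolution frames
smooth = interpolate_time(video)       # (31, 64, 64): motion filled in
hd = upsample_space(smooth, factor=4)  # (31, 256, 256): higher definition
print(smooth.shape, hd.shape)
```

The learned network does far better than averaging because it conditions on both neighbors to predict actual motion, but the shapes it consumes and produces are the same as here.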

So, to summarize: they fine-tune a text-to-image model for video generation. This means they take a powerful model that is already trained and adapt and train it a little bit more to get used to videos. This retraining is done with unlabeled videos, just to teach the model to understand videos and video-frame consistency, which makes the dataset-building process much simpler. Then they once again use an image-optimized model to improve the spatial resolution, and a last frame interpolation component to add more frames and make the video fluid.

Of course, the results aren't perfect yet, just like with text-to-image models, but we know how fast progress goes. This was just an overview of how Meta AI successfully tackled the text-to-video task in this great paper. All the links are in the description below if you'd like to learn more about their approach. A PyTorch implementation is also already being developed by the community, so stay tuned for that if you'd like to implement it yourself. Thank you for watching the whole video, and I will see you next time with another amazing paper!
