Artificial Intelligence in Video – Turn Imagination to Reality
It took me less than 3 minutes to generate the short video below using a simple text prompt in Runway's Gen-2 model, for free. Yes, the video is really short and the quality has plenty of room to improve, but more on that later.
If you revisit recent history, the democratization of video tools for content creators has opened up opportunities and built multiple business empires once thought unimaginable. Twitch, for example, enabled gaming creators to livestream their gameplay and turned that into a massive global business – one that Amazon acquired for ~$1B in 2014. TikTok, a household name now, gave creators simple yet sophisticated short-form video tools, which led to an explosion of user-generated content (UGC) in video, partly enabled by macro tailwinds such as cheaper smartphones and cheaper data.
Are we on the cusp of another such revolution in video? Yes, and it's driven by none other than Artificial Intelligence. Let's understand how AI could help in video creation and editing:
• Modality transfers (converting text / audio to video), which come in two kinds:
  ◦ Generating a brand-new video from a short prompt, like the one you saw earlier (Generative AI)
  ◦ Converting existing content (text / audio) into a video representation
• Adding visual effects (CGI, motion capture / VFX) to videos
• Content editing (converting long form video to multiple short form videos, adding captions from audio, etc.)
What is here and now? (everything above except prompt-based generation)
Converting existing content (text / audio) to a video representation
We are sitting on a repository of 600 million blogs in the world today, and at least 5,000 English news articles get published online every day. However, almost all of this content is text. With short-form video platforms like TikTok, Reels and Shorts creating an enduring behavioural shift of users towards video content, bloggers and media publications are looking for cheaper ways to tap into this distribution (some spend thousands of dollars per minute of video). AI has the potential to lower these costs by 70%+.
Enterprises also need to convert text scripts to video, both to create more engaging content for internal training and development and to hyper-personalize their marketing to new and existing customers.
This is exactly where AI video companies like Synthesia, HeyGen, InVideo, Rephrase, Rizzle, etc. operate, and they have raised $400M+ of capital.
Adding visual effects to videos
VFX is a super expensive affair (~$10B market globally)! 'Avengers: Endgame', with a budget of ~$400M, is the most expensive special-effects movie Hollywood has made.
A significant portion of a VFX budget is spent on long stretches of manual work by artists on motion capture (MoCap), CGI and other labour-intensive, time-consuming production tasks. For example, on average it takes 1 day to shoot one hour of motion capture and 5 days to clean it up! Several early AI companies in the space – Wonder Studio, Kapnetix, NVIDIA's Omniverse, Magic 3D, etc. – are already automating this, bringing 5 days of post-shoot work down to 1 hour. Such tooling innovations will not only bring down costs, they also have the potential to localize content in novel ways – imagine a movie released with a Tom Cruise deepfake in the US and an SRK deepfake in India.
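To give a flavour of how that automation works under the hood, here is a minimal, hypothetical sketch of markerless body tracking – the kind of building block tools like Wonder Studio rely on – using Google's open-source MediaPipe library. The file name and the downstream retargeting step are placeholders, not any vendor's actual pipeline:

```python
# Hypothetical sketch: markerless motion capture from ordinary footage.
# Assumes the opencv-python and mediapipe packages are installed;
# "actor_take.mp4" is a placeholder file, not any studio's real pipeline.
import cv2
import mediapipe as mp


def extract_landmarks(video_path: str):
    """Yield the 33 MediaPipe body landmarks (x, y, z, visibility) per frame."""
    cap = cv2.VideoCapture(video_path)
    with mp.solutions.pose.Pose(static_image_mode=False) as pose:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB input; OpenCV decodes frames as BGR.
            result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if result.pose_landmarks:
                yield [(p.x, p.y, p.z, p.visibility)
                       for p in result.pose_landmarks.landmark]
    cap.release()


for i, skeleton in enumerate(extract_landmarks("actor_take.mp4")):
    # In a real pipeline, these landmarks would be retargeted onto a CG rig.
    print(f"frame {i}: tracked {len(skeleton)} joints")
```

The point is that commodity footage plus an off-the-shelf model can replace a dedicated mocap stage, so the expensive manual cleanup shrinks to retargeting the extracted landmarks onto a character rig.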
Content editing
Long-form creation comes with several frictions of its own (horizontal vs. vertical formats, for one), and once the video is generated, the struggle is far from over. Hear it first-hand from Nas Daily below (11 million subscribers on YouTube):
It is this friction that several companies like GlossAI, Metaphysic, Vidyo, etc. are solving for (having raised $30M+ over the last 3 years). These tools let you convert your long-form content into short-form clips, with AI predicting which parts of the video are likely to be most 'clickbaity', and publish them to multiple platforms from there. Such tools not only increase the productivity of the creator, they also help broaden the audience (through features like instant dubbing into local languages, instant transcription of video, etc.). Some key use cases being captured by content-editing companies in AI video are summarized below.
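As a deliberately simplified illustration of the highlight-picking step described above, here is a hypothetical sketch that transcribes a long video with OpenAI's open-source Whisper model, scores each spoken segment with a toy 'hook' heuristic, and cuts the winning segment out with ffmpeg. The real products almost certainly use trained engagement models rather than a keyword list:

```python
# Hypothetical sketch: auto-pick the most "clickable" segment of a long video.
# Assumes the openai-whisper package and the ffmpeg CLI are installed; the
# hook-word heuristic is a toy stand-in for the proprietary engagement models
# that products in this space actually train.
import subprocess
import whisper

HOOK_WORDS = {"secret", "never", "mistake", "free", "best", "worst", "why"}


def score(text: str) -> float:
    """Toy engagement score: hook-word density plus a bonus for questions."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    hooks = sum(w in HOOK_WORDS for w in words)
    return hooks / max(len(words), 1) + (0.5 if "?" in text else 0.0)


def cut_best_clip(video_path: str, out_path: str = "short.mp4") -> None:
    # Whisper returns timestamped segments: {"start", "end", "text", ...}
    segments = whisper.load_model("base").transcribe(video_path)["segments"]
    top = max(segments, key=lambda s: score(s["text"]))
    # Copy the winning span out of the source file without re-encoding.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,
         "-ss", str(top["start"]), "-to", str(top["end"]),
         "-c", "copy", out_path],
        check=True,
    )


cut_best_clip("long_form_talk.mp4")
```

Swapping the keyword heuristic for a model trained on real watch-time data is presumably where these companies differentiate.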
What is getting there?
Generative AI for video (text prompt to video)
Early text-to-video models, trained on short video clips, did not yield high-quality results, primarily because large video datasets paired with alt text (text that describes the video) are hard to come by. Consequently, prompts often did not generate relevant output.
What is changing now?
As text-to-image models moved from GAN-based architectures 2-3 years ago to diffusion-based models like Stable Diffusion, DALL-E and Midjourney, the cost of generation has fallen dramatically (the models can work with much smaller datasets than before) without compromising on quality (e.g., Midjourney is self-funded till date and would cost ~$600k to develop from scratch – link to tweet).
Leveraging the advancements made in text-to-image models, newer text-to-video models rely on diffusion techniques to generate a series of images (example below) that are temporally and spatially consistent, and then stitch them together into a video. These models (like Meta's Make-A-Video, Runway's Gen-2, NVIDIA's Video LDM) are much more likely to scale faster and be available to the public at large (check out this short film made by stitching together outputs generated from Runway's Gen-2). Companies building such models have already raised $200M+ of capital over the last 3 years.
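To make this concrete, here is a minimal sketch of prompt-to-video generation using Hugging Face's diffusers library with the openly available ModelScope checkpoint. Gen-2 and Make-A-Video themselves are not publicly downloadable, so the model ID, prompt and parameters here are illustrative:

```python
# Minimal sketch: prompt-to-video with a latent video diffusion model.
# Assumes diffusers, transformers, accelerate and a CUDA GPU; the ModelScope
# checkpoint below is openly available, standing in for closed models like
# Gen-2 or Make-A-Video. API details may shift between diffusers versions.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16
).to("cuda")

# All latent frames are denoised jointly, which is what keeps the output
# temporally consistent before the frames are stitched into a clip.
frames = pipe(
    "an astronaut riding a horse on a beach, cinematic lighting",
    num_frames=24,
    num_inference_steps=25,
).frames[0]

print(export_to_video(frames, fps=8))  # writes an .mp4 and prints its path
```

Even this small open model illustrates the key design choice: generate a coherent stack of frames in one diffusion process, rather than generating images independently and hoping they line up.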
Would viewers watch AI generated content?
My view here is that it will play out much like Auto-Tune in music – most artists use it in their music today. Music concerts themselves have gone through a similar shift, with many artists lip-syncing parts of their shows without much protest from audiences.
Who will be the key beneficiaries of Generative AI for video?
As GenAI models for video get more advanced (and output becomes as realistic as Midjourney's images), it will become possible to create long-form content with a fraction of the effort, eliminating the days and months spent casting the right actors, scouting locations and actually shooting scenes. Movie production houses (a ~$70+ billion market globally) will be key beneficiaries. It won't just reduce costs for larger studios; it will let indie studios ship more content and thus have higher chances of a breakout success.
This will also help enterprises with their advertising / marketing efforts (video ads drive higher engagement and, eventually, conversion) – not just by lowering the cost of producing ads but also by personalizing ads to whatever stage a user is at in their journey – check out this ad by Private Island, created solely using AI with no real actors or sets.
Democratization of long-form video creation, VFX and editing could allow many more creators to emerge, producing ever more video content, some of which will break out and incentivize newer creators in turn – kicking off a flywheel that keeps content generation going. This could help YouTube improve its creator-to-user ratio (120M YouTube channels vs. 2B+ monthly active users) and bring it closer to TikTok's (where ~50% of users have created content on the platform).
OTTs powered by professionally generated content (PGC), which are currently struggling worldwide under very high content-production and licensing costs (Netflix spends $17 billion yearly on content), could make their business models more profitable using AI-led innovations.
We are super excited to witness this dynamic environment ahead of us and would love to partner with visionary entrepreneurs building in AI. If you are building AI-first solutions in the content space, feel free to reach out to us at rahul.chugh@matrixpartners.in. We also crowdsource and maintain a repo of GenAI startups from India.