Artificial intelligence is poised to play an increasingly significant role in our daily lives. This is evident from the launch of OpenAI’s immensely popular ChatGPT last year and this year’s unveiling of its Sora generative AI model. This article discusses the capabilities and limitations of OpenAI’s Sora, particularly in its role as a simulator that generates videos from text instructions. It covers the model’s strengths, such as creating fantastical scenes, as well as its weaknesses, including inaccuracies in modelling physics and interactions.
What is OpenAI’s Sora?
OpenAI, which is backed by Microsoft and was propelled into the limelight last year by the widespread use of ChatGPT, is now extending its artificial intelligence capabilities to video. Its new model, Sora, is a generative AI model that creates high-definition videos from descriptive text instructions. You provide a written description, and the model generates a video that matches the details in that description.
Generative AI is a class of artificial intelligence systems that create new content, such as text, images, or music, from descriptive text instructions. Chatbots, image generators, and music generators are common examples.
“Sora serves as a foundation for models that can understand and simulate the real world,” OpenAI wrote in its announcement.
Sora is being trained to understand and mimic how things move in the real world, with the goal of helping people solve problems that involve real-world interaction. Judging by the sample videos on OpenAI’s website, one of Sora’s notable capabilities is crafting imaginative scenes that defy reality.
For example, here is the prompt for one of the sample videos on OpenAI’s website.
Prompt: A gorgeously rendered papercraft world of a coral reef, rife with colorful fish and sea creatures.
Can you try it now?
The text-to-video model can currently generate videos up to 60 seconds long. So far it is accessible only to a select group of testers, or ‘red teamers’, who assess the model for vulnerabilities and risks in critical areas such as misinformation, hateful content, and bias. It is also available to a group of visual artists, designers, and filmmakers, whose feedback will help refine the model to better serve creative professionals.
Red teaming involves a team of experts, referred to as the red team, conducting simulations resembling real-world scenarios to uncover vulnerabilities and weaknesses within the system. It helps organizations to enhance their readiness, identify potential threats, and make informed decisions.
What is OpenAI?
OpenAI is a U.S.-based artificial intelligence research organization founded in December 2015. It has played a pivotal role in advancing the field of artificial intelligence and promoting the responsible and beneficial use of AI technology. Notably, its AI models, such as the GPT (Generative Pre-trained Transformer) series and DALL·E, have garnered widespread attention for their capabilities in natural language understanding and image generation, respectively.
Here’s what Sora can do
The text-to-video AI model can:
- Create both realistic and imaginative videos while maintaining visual quality.
- Generate scenes with multiple characters that express vibrant emotions.
- Visualize complex scenes with specific types of motion and accurate details of the subject and background.
- Comprehend the semantics, context, and instructions conveyed in the prompt.
- Understand not only what the user asks for in the prompt, but also how those things exist in the physical world.
- Create multiple shots within a single generated video that accurately persist characters and visual style.
- Gradually blend two input videos, resulting in seamless transitions from one video to another.
Apart from creating videos solely from text instructions, the model can be prompted with other inputs, such as existing images or videos. Sora can bring still images to life by animating their contents with precision. It can also extend existing videos or fill in missing parts by generating new frames.
Research technology behind Sora
Sora is a diffusion model: it starts with a video that looks like static noise and gradually transforms it by removing the noise over many steps. It builds on prior research in DALL·E and GPT models.
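The iterative refinement described above can be illustrated with a toy sketch in Python. This is not Sora's code or architecture. It denoises a short list of numbers rather than video, the function names (`make_alpha_bars`, `oracle_noise`, `denoise`) and the linear noise schedule are assumptions chosen for illustration, and the noise predictor is an oracle that already knows the clean signal, standing in for the trained neural network a real diffusion model would use. The update rule is the deterministic (DDIM-style) sampling step.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear noise schedule (assumed)."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def oracle_noise(x_t, alpha_bar, clean):
    """Hypothetical stand-in for a trained network: returns the exact noise in x_t.

    This is only possible because the toy already knows the clean signal.
    """
    return [
        (x - math.sqrt(alpha_bar) * c) / math.sqrt(1.0 - alpha_bar)
        for x, c in zip(x_t, clean)
    ]


def denoise(clean, num_steps=50, seed=0):
    """Start from Gaussian noise and refine it step by step."""
    rng = random.Random(seed)
    alpha_bars = make_alpha_bars(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in clean]  # pure noise
    for t in range(num_steps - 1, -1, -1):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[t - 1] if t > 0 else 1.0
        eps = oracle_noise(x, a_t, clean)
        # Estimate the clean signal implied by the predicted noise.
        x0_hat = [
            (xi - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t)
            for xi, e in zip(x, eps)
        ]
        # Move to the slightly less noisy sample for the previous step.
        x = [
            math.sqrt(a_prev) * x0 + math.sqrt(1.0 - a_prev) * e
            for x0, e in zip(x0_hat, eps)
        ]
    return x


if __name__ == "__main__":
    clean_signal = [0.2, 0.8, 0.5]
    print(denoise(clean_signal))
```

Because the oracle predicts the noise exactly, the loop recovers the clean signal up to floating-point error. With a learned predictor, the output would instead be a new sample that resembles the training data.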
Training text-to-video generation models effectively requires a vast dataset of videos paired with corresponding text captions. This training enables the models to learn the relationship between textual descriptions and visual content. Sora uses the recaptioning technique introduced in DALL·E 3, which involves generating highly detailed captions for the visual training data.
Similar to DALL·E 3, Sora leverages GPT to turn short user prompts into longer, detailed captions that are sent to the video model. This enables the model to generate videos that accurately follow user instructions. Like GPT models, Sora uses a transformer architecture, which allows it to scale well.
Competition
With Sora, OpenAI aims to compete with video-generation AI tools from tech giants like Meta and Google and from startups such as Stability AI and Runway. Notably, Google introduced Lumiere, its text-to-video diffusion model, in January, with capabilities such as video stylization, cinemagraphs, and video inpainting.
Lumiere examples: video stylization, cinemagraphs, and video inpainting. Image source: https://lumiere-video.github.io/
Limitations
OpenAI acknowledges that the current version of Sora has several limitations as a simulator. For example, it does not accurately model the physics of many basic interactions, such as glass shattering.
Other interactions, such as eating food, may not consistently result in accurate changes in object state. For example, a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark. In complex scenes involving multiple characters, animals or people can spontaneously appear.
Conclusion
The introduction of Sora represents a big step forward for the industry. However, there are concerns about the ethical and societal ramifications of the technology. OpenAI is taking several important safety steps before making Sora available to a wider audience. Although it is normal to feel apprehensive about this innovation, one should not discount the opportunities that will arise with it.