Video editing, the process of manipulating and rearranging video clips to meet desired objectives, has been transformed by the integration of artificial intelligence (AI). AI-powered video editing tools enable faster and more efficient post-production. With the advancement of deep learning algorithms, AI can now automatically perform tasks such as color correction, object tracking, and even content creation. By analyzing patterns in the video data, AI can suggest edits and transitions that enhance the overall look and feel of the final product. Additionally, AI-based tools can assist in organizing and categorizing large video libraries, making it easier for editors to find the footage they need. The use of AI in video editing has the potential to significantly reduce the time and effort required to produce high-quality video content while also enabling new creative possibilities.
Text-guided image synthesis and manipulation have seen significant advancements in recent years, from GAN-based methods to text-to-image generation models such as DALL-E and approaches built on pre-trained CLIP embeddings. Diffusion models, such as Stable Diffusion, have likewise proven successful at text-guided image generation and editing, leading to various creative applications. Video editing, however, demands more than spatial fidelity: it also requires temporal consistency across frames.
The work presented in this article extends the semantic image editing capabilities of the state-of-the-art text-to-image model Stable Diffusion to consistent video editing.
The pipeline for the proposed architecture is depicted below.
Given an input video and a text prompt, the proposed shape-aware video editing method produces a consistent video with appearance and shape changes while preserving the motion of the input video. To obtain temporal consistency, the approach uses pre-trained Neural Layered Atlases (NLA) to decompose the input video into unified background (BG) and foreground (FG) atlases with associated per-frame UV mappings. After the video has been decomposed, a single keyframe is edited using a text-to-image diffusion model (Stable Diffusion). The method then exploits this edited keyframe to estimate a dense semantic correspondence between the input and edited keyframes, which makes shape deformation possible. This step is delicate, as it produces the shape deformation vector that must be applied consistently to maintain temporal coherence. The deformation serves as the basis for per-frame deformation, since the UV mappings and atlases are used to propagate the edit to each frame. Finally, a pre-trained diffusion model is exploited to ensure the output video is seamless and free of unseen pixels.
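The key idea behind the atlas-based propagation can be illustrated with a toy sketch: once a single atlas is edited, every frame is re-rendered by sampling that atlas through its own UV map, so one edit appears consistently across the whole video. The minimal NumPy example below is an illustration of this concept only, not the paper's implementation; the function names, the nearest-neighbour sampling, and the tiny synthetic data are all assumptions for demonstration.

```python
import numpy as np

def sample_atlas(atlas, uv):
    """Look up each pixel's colour in the atlas via its UV coordinate
    (nearest-neighbour sampling; real systems use bilinear sampling)."""
    h, w = atlas.shape[:2]
    u = np.clip((uv[..., 0] * (w - 1)).round().astype(int), 0, w - 1)
    v = np.clip((uv[..., 1] * (h - 1)).round().astype(int), 0, h - 1)
    return atlas[v, u]

def reconstruct_video(edited_atlas, uv_maps):
    """Render every frame by sampling the (edited) atlas with that frame's
    UV map, so a single atlas edit shows up in all frames consistently."""
    return np.stack([sample_atlas(edited_atlas, uv) for uv in uv_maps])

# Tiny synthetic example: a 4x4 single-channel atlas and 3 identity UV maps.
atlas = np.arange(16, dtype=float).reshape(4, 4)
uv = np.stack(np.meshgrid(np.linspace(0, 1, 4), np.linspace(0, 1, 4)), axis=-1)
uv_maps = [uv, uv, uv]

edited = atlas.copy()
edited[0, 0] = 99.0  # edit one atlas texel once...
video = reconstruct_video(edited, uv_maps)
print(video.shape)   # ...and it is propagated to every frame
```

In the actual framework the UV mappings come from the pre-trained NLA decomposition and differ per frame to follow the motion, but the propagation principle is the same: edit once in atlas space, render everywhere.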
According to the authors, the proposed approach results in a reliable video editing tool that provides the desired appearance and consistent shape editing. The figure below offers a comparison between the proposed framework and state-of-the-art approaches.
This was a summary of a novel AI framework for accurate and consistent shape-aware, text-driven video editing.
If you are interested or want to learn more about this framework, you can find a link to the paper and the project page.
Check out the Paper and Project. All Credit For This Research Goes To the Researchers on This Project.
Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He is currently working in the Christian Doppler Laboratory ATHENA and his research interests include adaptive video streaming, immersive media, machine learning, and QoS/QoE evaluation.