Training-Free Neural Matte Extraction for Visual Effects: Abstract and Introduction

6 Jul 2024


(1) Sharif Elcott, equal contribution of Google Japan (Email:;

(2) J.P. Lewis, equal contribution of Google Research USA (Email:;

(3) Nori Kanazawa, Google Research USA (Email:;

(4) Christoph Bregler, Google Research USA (Email:


Alpha matting is widely used in video conferencing as well as in movies, television, and social media sites. Deep learning approaches to the matte extraction problem are well suited to video conferencing due to the consistent subject matter (front-facing humans), however training-based approaches are somewhat pointless for entertainment videos where varied subjects (spaceships, monsters, etc.) may appear only a few times in a single movie – if a method of creating ground truth for training exists, just use that method to produce the desired mattes. We introduce a training-free high quality neural matte extraction approach that specifically targets the assumptions of visual effects production. Our approach is based on the deep image prior, which optimizes a deep neural network to fit a single image, thereby providing a deep encoding of the particular image. We make use of the representations in the penultimate layer to interpolate coarse and incomplete "trimap" constraints. Videos processed with this approach are temporally consistent. The algorithm is both very simple and surprisingly effective.


• Computing methodologies → Image processing; • Applied computing → Media arts.


Alpha matting, deep learning, visual effects.

ACM Reference Format:

Sharif Elcott, J.P. Lewis, Nori Kanazawa, and Christoph Bregler. 2022. TrainingFree Neural Matte Extraction for Visual Effects. In SIGGRAPH Asia 2022 Technical Communications (SA ’22 Technical Communications), December 6–9, 2022, Daegu, Republic of Korea. ACM, New York, NY, USA, 5 pages.


Alpha matte extraction refers to the underconstrained inverse problem of finding the unknown translucency or coverage 𝛼 of a foreground object [Wang and Cohen 2007]. Matte extraction is widely used to provide alternate backgrounds for video meetings, as well as to produce visual effects (VFX) for movies, television, and social media. However, it is not always recognized in the research literature that these two applications have significantly different requirements. This paper introduces a neural matte extraction method specifically addressed to the assumptions and requirements of VFX.

pecifically addressed to the assumptions and requirements of VFX. Matting for video calls requires real-time performance and assumes a single class of subject matter, front-facing humans. This is a "train once and use many times" situation for which it is possible and advantageous to obtain training data. Video call matting often assumes a fixed camera, and may require a "clean plate" image of the room without the participant. On the contrary, VFX production has the following assumptions and requirements:

• Diverse and often rare ("one-off") subject matter. For example the youtube video [Godzilla vs. Cat 2021] (30M views) involves mattes of a cat, ships, and wreckage. A physical prop such as a crashed alien spaceship might only be used in a few seconds in a single movie, never to be seen again. Thus, gathering a ground-truth training dataset for deep learning is often pointless: if there is a method to generate the ground-truth mattes for training, just use this method to produce the desired mattes – no need to train a model!

• Visual effects frequently involves moving cameras as well as extreme diversity of moving backgrounds. For example, an actor might be filmed as they run or travel in a vehicle any place on earth, or on set with extraterrestrial props in the near background. Thus, even for the major case of repeated subject matter (humans), gathering a representative training dataset is more challenging than in the case of video calls.

• Real-time performance is not required. Instead, the on-set filming time is to be minimized, due to the combined cost of the actors (sometimes with 7-figure salaries) and movie crew. It is often cheaper to employ an artist to work days to "fix it in post" rather than spend a few minutes on-set (with actors and crew waiting) to address an issue.

• Clean-plates are not a desirable approach for matting, and are often not possible. With moving cameras, a motion control rig is required to obtain a clean plate. The indoor use of motion control for clean plates is expensive due to the previous principle (minimizing on-set time). Moreover, this approach is generally not feasible outdoors, for reasons including background movement (e.g. plants moved by wind) and changing lighting (moving clouds blocking the sun).

Our solution, a matte extraction approach employing the deep image prior [Ulyanov et al. 2018], addresses these requirements. It has the following characteristics and contributions:

• It is a deep neural network matte extraction method targeted towards the requirements of VFX rather than video meetings. To our knowledge, this is the first deep neural matte extraction method that requires neither training data nor expensive on-set image capture to support the matting process (i.e. clean plates or greenscreens).

• It relies exclusively on the given image or video along with coarse "trimaps" (Fig. 1) that are easily created during post production using existing semi-automatic tools. Professional software such as Nuke or After Effects is typically used, however a familiar example is the "Select Subject" tool in Photoshop [Adobe 2018], followed by dilation/erosion and automated with "Actions".

• It does not require a clean plate, thus allowing extractions from outdoor footage.

• It does not require greenscreens, yet can produce detailed high quality mattes even when the foreground and background subject have similar colors (green hair in Fig. 1).

This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.