AI lets you edit people talking in videos by adding or deleting words from a transcript


Actors who flub their dialogue may become less of a bane to producers thanks to an AI that allows their lines to be edited just by retyping a transcript.

The software works by combining existing clips with digital face models to create new footage that is lip-synced to match the desired edits.

The technology could be abused to create more creepy deepfake videos that make people appear to say things they never did.

However, the researchers say that these risks are more than outweighed by the benefits of the program, which could also cheaply localise or translate content.

At present, the technique only works on forward-facing interview footage that runs to 40 minutes or more.

The video-editing algorithm was designed by an international team of researchers led from Stanford University in California.

Developed to work on talking-head videos — those that show the speaker face-on, from the shoulders up — the software could allow producers to correct misspoken lines without the need for expensive re-shoots or abandoning existing footage. 

The strength of the system is its ease of use, for it allows producers to edit a video simply by changing the words in a transcript of the actor’s speech.

As in a conventional word processor, words can be added, removed or rearranged as desired.

To show the potential of the system, researchers made complex edits to sample footage, including adding, changing and removing words, as well as transitioning into different languages and creating whole new sentences from scratch. 

The edited videos were rated as appearing authentic almost 60 per cent of the time in a crowd-sourced survey of 138 participants. 

‘Visually, it’s seamless,’ said lead author and computer scientist Ohad Fried, of Stanford University.

‘There’s no need to rerecord anything,’ he added.

‘The implications for movie post-production are big,’ said paper author Ayush Tewari of the Max Planck Institute for Informatics in Saarbrücken, Germany.

In this era of fake videos, technology that can help generate deepfakes inherently raises important ethical concerns, Dr Fried acknowledged.

The technology could potentially be re-purposed to place words into the mouths of people who did not consent to be edited and may not want to appear to be speaking the revised dialogue that has been forced upon their footage.

‘Unfortunately, technologies like this will always attract bad actors,’ he said. 

For the developers, however, helping the film industry deal with the kind of bad actors who flub their lines outweighs the risk posed by those who might abuse the tech.

Re-shooting footage to correct mistakes is an expensive and time-consuming process, as is trying to customise existing video by audience.

The algorithm might be used, for example, to shift existing videos into different languages or to better suit different cultural backgrounds.

‘This technology is really about better storytelling,’ Dr Fried said.

‘The struggle is worth it given the many creative video editing and content creation applications this enables,’ he added.

Dr Fried suggested that, to mitigate concerns, the tech could be given the capacity to add a form of watermarking that logs any edits made.

In addition, understanding how videos can be manipulated like this helps researchers to develop better forensic techniques to spot fakes in the future.

Nevertheless, he cautioned, viewers should remain sceptical and question the veracity of the video content that they consume.

The first step in the video editing process is to create a transcript of all the actor’s spoken lines, which is then synchronised with the audio in the video.

As users then make revisions to the video transcript, the algorithm identifies segments from elsewhere in the film that contain the right facial movements to stitch together new footage to match the revised text. 
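As a rough illustration of the transcript-editing step only, the Python sketch below compares a time-aligned transcript with an edited one and works out which stretches of the original footage can be kept and which words would need new footage. It is not the researchers' code: the function name `plan_edits`, the `(word, start, end)` tuple layout and the "keep"/"synthesize" labels are all assumptions made for this example, and the search for matching facial movements is left out entirely.

```python
from difflib import SequenceMatcher


def plan_edits(aligned, edited_words):
    """Compare an aligned transcript with an edited one.

    aligned:      list of (word, start_sec, end_sec) tuples (hypothetical layout)
    edited_words: list of words in the revised transcript

    Returns a list of steps: ("keep", start_sec, end_sec) for footage that
    can be reused as-is, and ("synthesize", [words]) for words that need
    new footage. Deleted words simply produce no step.
    """
    original_words = [word for word, _, _ in aligned]
    matcher = SequenceMatcher(a=original_words, b=edited_words, autojunk=False)

    plan = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged run of words: reuse the original time span.
            plan.append(("keep", aligned[i1][1], aligned[i2 - 1][2]))
        elif tag in ("replace", "insert"):
            # New or changed words: these would need stitched footage.
            plan.append(("synthesize", edited_words[j1:j2]))
    return plan


if __name__ == "__main__":
    transcript = [("I", 0.0, 0.2), ("love", 0.2, 0.6), ("cats", 0.6, 1.0)]
    for step in plan_edits(transcript, ["I", "love", "dogs"]):
        print(step)
```

In a real system, each "synthesize" step would trigger the search for clips with the right mouth shapes, as described above; here it only marks where that work would be needed.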

When new audio is needed to complete the edits, the user can introduce new recordings or even use an artificially-generated voice. 

At this stage, the patchwork-like video would have jarring jump cuts and other notable visual flaws.

To get rid of these, the algorithm endeavours to smooth the result, combining the assembled footage with a three-dimensional digital model of the actor’s face.

This rendering, however, is not yet realistic-looking.

To polish the output, the algorithm uses a machine learning technique called Neural Rendering to convert the rough model into a photo-realistic video with a perfect lip sync to the edited audio.

The algorithm does have its limitations, however, as it requires at least 40 minutes of original footage as an input to work from.

This means that, as it stands, the program will not work with just any video sequence.

With their present study complete, Dr Fried and colleagues are now working to refine their algorithm to further enhance the visual quality.

The full findings of the research have been submitted for publication in the journal ACM Transactions on Graphics. 

A pre-print version of the article can be read on the arXiv repository.
