Voice Cloning: What It Is and Why It’s Scary

In recent months, several voice-over actors have reported being shocked to find artificial intelligence copying their speech. One of them was Remie Michelle Clarke, the voice of Microsoft Bing in Ireland, who discovered a text-to-speech website offering her voice as that of a generic Irish woman.

What Is Voice Cloning?

Voice cloning builds a digital copy of a person’s unique voice, including speech patterns, accents, voice inflection and even breathing, by training an algorithm with a sample of a person’s speech that can be as short as a three-second audio clip.

Customers could pay to use Clarke’s “voice” to say anything — advertisements, trainings, YouTube videos or even voicemail messages.

Frightening, yes, but this is possible because of a new form of artificial intelligence that mimics human speech so accurately that it can be virtually impossible to tell what is real from what is AI-generated. The practice is called voice cloning because it builds a digital copy of a person’s unique voice, including speech patterns, accents, voice inflection and even breathing, by training an algorithm on a sample of that person’s speech.

Once a voice model is created or captured (some newer systems can clone a human voice from as little as a three-second audio clip), plain text is all that’s needed to synthesize a person’s speech, mimicking the exact sound of that individual. Cloned voices can also convey a whole range of emotions, from anger and fear to love and boredom. This is a far cry from the synthetic speech of the past, which was robotic, stiff and unmistakably machine-made.

Voice Cloning Is a Formidable Force

However, AI-based voice cloning is a potent new technology that promises to transform lives, in many cases for the better. There’s a huge upside to its use in entertainment, where voice-over artists will be able to take on much more work. For example, an overbooked artist can simply send a voice sample to one of their clients so the voice can be cloned for that job, and the artist will still be paid.

Voice cloning can also be used to translate an actor’s words into different languages, meaning film production companies will no longer need to hire foreign-language actors to make versions of their movies suitable for other countries.

Perhaps the biggest potential for good belongs to the medical realm, helping individuals with speech disabilities. Imagine being able to create artificial voices for people who are unable to talk without assistance. Or imagine a patient with throat cancer who may need to have their larynx removed, but can record their voice prior to surgery in order to create a cloned voice that sounds more like their old self.

How Voice Cloning Enables Scams

Perhaps unsurprisingly, there’s also a tremendous opportunity for this technology to be abused by cybercriminals. In March 2023, when Silicon Valley Bank collapsed, a fake audio recording of U.S. President Joe Biden emerged, directing his administration to “use the full force of the media to calm the public.” Fact-checkers were ultimately able to expose the fraudulent nature of the clip, but by that point the audio had already been heard by millions and was well on its way to stirring panic.

AI voice generators can be used to impersonate not just celebrities and people in authority, but regular people as well. So-called vishing (voice phishing) attacks occur when cybercriminals impersonate regular people. Elderly people are often targeted in these types of attacks, and in some cases rush to the bank to withdraw money for a loved one who supposedly just called in desperation, only to find out it was just an AI-generated scam that replicated the loved one’s voice without their consent.

Today, many different types of voice cloning companies are launching, and as this technology becomes more mainstream and available, abuses and misuses are sure to emerge.

As these technologies expand, certain safeguards should be implemented around voice cloning:

Opt In/Opt Out Procedures

At the entrance to a security line at an airport, people are required to present both their license and their boarding pass. In many airports, facial recognition is applied at this checkpoint to ensure a match between the person presenting and the person in the license photo whose name is also on the boarding pass.

At these checkpoints, very clear signage indicates that biometric data is being collected, what it’s being used for, where and how it will be saved and alternative procedures if someone does not wish to consent. The same opt in/opt out consent procedures that have become commonplace for facial recognition must also be available anytime there’s an effort or intention to record a person’s voice. This is the only way to enable people to maintain control over their unique, natural biological identifiers.

Multi-factor Authentication

Multi-factor authentication typically works by sending a code to a person’s device (most commonly a cell phone) after that person enters a primary password or some other identifier, like a biometric. It is not a perfect technique: it adds friction to the authentication process, and messages sent to cell phones can still be intercepted. However, for organizations using voice recognition as a form of biometric authentication, multi-factor authentication can provide an additional layer of validation.
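As a rough illustration of how a one-time code can sit on top of a voice check, here is a minimal Python sketch. It is not taken from any particular product: the names issue_code, verify_login and CODE_TTL_SECONDS, the two-minute validity window and the voice_matched flag (standing in for the result of a separate voice recognition step) are all assumptions made for the example.

```python
import hmac
import secrets
import time

CODE_TTL_SECONDS = 120  # assumed validity window, not a recommendation from the article


def issue_code(now=None):
    """Return a random 6-digit code and the timestamp at which it expires."""
    now = time.time() if now is None else now
    code = f"{secrets.randbelow(10**6):06d}"  # cryptographically strong randomness
    return code, now + CODE_TTL_SECONDS


def verify_login(voice_matched, submitted_code, issued_code, expires_at, now=None):
    """Grant access only if the voice check AND the one-time code both pass."""
    now = time.time() if now is None else now
    if not voice_matched:
        return False  # first factor failed (hypothetical voice recognition result)
    if now > expires_at:
        return False  # code is too old to accept
    # Constant-time comparison avoids leaking how much of the code was correct.
    return hmac.compare_digest(submitted_code.encode(), issued_code.encode())
```

In this arrangement a cloned voice alone is not enough: an attacker would also need the code delivered to the victim’s device within the validity window. How the code is actually delivered (text message, authenticator app or something else) is left out of the sketch.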

Liveness Detection

In a similar vein to multi-factor authentication, organizations using voice recognition as a form of authentication can implement liveness detection, a process that is already widely used in facial recognition.

Liveness detection thwarts attempts to dupe a system by deciding whether the input comes from a live person or from a spoof, such as a photo in the case of facial recognition or a replayed recording in the case of voice. Detecting playback spoofing attacks in speaker verification systems can be a big challenge, but liveness detection can identify them with a high level of accuracy through multiple methods, including intrasession voice variation.

In this situation, a user speaks a phrase to get verified and the system captures an audio sample from that phrase. The speaker is then prompted to repeat a random part of the phrase and the system compares the received samples in order to provide a liveness detection score.
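The challenge-and-compare flow can be sketched in a few lines of Python. This is only one possible reading of the approach, under loud assumptions: the feature vectors stand in for the output of some hypothetical audio analysis step, the function names are invented for the example, and the two thresholds are illustrative values rather than tuned ones. The idea assumed here is that a live speaker repeating a phrase sounds like the same person but never exactly the same, while a replayed recording is suspiciously identical.

```python
import math
import secrets


def pick_challenge(phrase, length=2):
    """Choose a random contiguous run of words for the speaker to repeat."""
    words = phrase.split()
    length = min(length, len(words))
    start = secrets.randbelow(len(words) - length + 1)
    return " ".join(words[start:start + length])


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def liveness_check(first_features, repeat_features,
                   same_speaker_min=0.80, replay_max=0.999):
    """Return (is_live, score) for two samples of the same challenged words.

    Assumed rule: the score must be high enough to be the same speaker,
    but below replay_max, since a near-perfect match suggests a recording.
    """
    score = cosine_similarity(first_features, repeat_features)
    return same_speaker_min <= score < replay_max, score
```

A caller would pick a challenge with pick_challenge("my voice is my password today"), prompt the speaker to repeat those words, extract features from both samples and pass them to liveness_check. A production system would rely on far richer signals than a single similarity score.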

In summary, voice cloning is an exciting new frontier that has the potential to deliver many benefits to our society, especially in the medical field. However, as with all disruptive technologies, we must exercise caution, as the potential for ethical concerns, legal liabilities and scamming can be significant.

Organizations that have invested in voice recognition as a form of biometric authentication would be well-advised to take extra measures to guard against these threats. In addition, individuals should be vigilant and assume responsibility when they record and post a video to social media, since they are potentially making their sensitive biometric data available for compromise.