Is There Something Special About the Human Voice?
Advancements in artificial intelligence (AI) have made it possible for speech synthesis tools to generate eerily realistic voices. These tools can mimic accents, whisper, and even clone the voices of real people. But these advances raise a question: how can we tell the difference between a human voice and an AI-generated one?
Today, it’s easier than ever to have a conversation with AI. From chatbots that answer questions to AI systems that speak multiple languages and use different accents, technology is making it possible for machines to communicate like never before. In fact, some AI-powered tools can now clone the voices of real people. For instance, one AI tool was recently used to replicate the voice of the late British broadcaster Sir Michael Parkinson for a podcast series. In another case, natural history broadcaster Sir David Attenborough was disturbed to hear his voice cloned by AI to say things he never said.
While some of these voice-cloning tools are used for harmless entertainment, others are being exploited in scams to deceive people. For example, criminals use AI-generated voices to trick people into transferring money or revealing personal information.
However, not all AI-generated voices are used maliciously. They are also integrated into chatbots powered by large language models, making conversations with machines sound more natural and convincing. Take ChatGPT’s voice function, for example. It can respond with variations in tone and emphasis, much as a human would to convey empathy or emotion. It can also interpret non-verbal cues such as sighs or sobs, speak in over 50 languages, and even make phone calls to assist with tasks. In one demonstration, ChatGPT ordered strawberries from a vendor over the phone.
These AI advancements lead to a compelling question: is there anything truly unique about the human voice that helps us distinguish it from machine-generated speech?
The Challenge of Telling AI from Human Voices
Jonathan Harrington, a phonetics expert at the University of Munich, Germany, has spent years studying how humans speak, produce sounds, and create accents. He is impressed by how realistic AI-generated voices have become in recent years. However, he believes there are still subtle cues that can help us tell the difference.
To explore this, we set up a challenge. We asked Conor Grennan, Chief AI Architect at New York University Stern School of Business, to create audio clips in which he reads a passage from Alice in Wonderland—once with his own voice and once with an AI-generated voice from ElevenLabs, a speech-cloning software company. When we played the clips to listeners, around half of them struggled to tell which voice was human and which was AI.
One of the cybersecurity experts we spoke to, Steve Grobman from McAfee, also could not easily distinguish the two voices. He noted that AI voices may lack certain nuances, such as the natural cadence and breathing patterns we associate with human speech. Humans, for instance, breathe irregularly, while AI-generated voices may sound too perfect.
In fact, many experts acknowledge that detecting deepfakes—AI-generated speech or video that imitates real people—can be difficult for the human ear. For example, a deepfake of Bill Gates once fooled listeners, making it sound as though he were endorsing a quantum AI stock-trading tool. Despite sounding like him, the clip was flagged as fake by deepfake detection software.
How Can We Tell AI from Human Voices?
While AI-generated voices have become impressively realistic, there are still some clues that can help us tell them apart from human speech.
One key feature to listen for is intonation, or the rise and fall in pitch during a sentence. Humans typically adjust their pitch to reflect the meaning or emotion behind their words. For example, the phrase “Marianne made the marmalade” may sound different depending on whether it’s a statement or a question. AI voices can struggle with this level of nuance.
Another clue lies in prosody, the rhythm and pattern of speech. Humans naturally emphasize certain words for meaning, and AI voices often fail to replicate this consistently. For example, if asked, “Did Marianne make the marmalade?”, a human would likely emphasize the word “made”, while an AI might stress a different word.
Additionally, breathing patterns can be a giveaway. Humans breathe irregularly, and their breath intakes vary in length. AI-generated voices, by contrast, may sound unnaturally even and polished, betraying their artificial origin.
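For readers who want to experiment, these cues can be roughly measured in code. The sketch below is a minimal illustration rather than a real detector: it assumes the Python library librosa is installed, and the file names human.wav and clone.wav are hypothetical stand-ins for two recordings. It compares pitch (F0) variability, a rough proxy for intonation, and the spread of silent gaps, a rough proxy for breathing pauses.

```python
# Minimal sketch, assuming librosa is installed and that "human.wav" and
# "clone.wav" are hypothetical local recordings. Real deepfake detectors
# combine far richer features with trained models.
import numpy as np
import librosa

def voice_cues(path):
    y, sr = librosa.load(path)

    # Intonation: estimate the fundamental frequency (F0) frame by frame.
    # Unvoiced frames come back as NaN and are dropped before the stats.
    f0, _, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    f0 = f0[~np.isnan(f0)]

    # Breathing/pauses: find non-silent stretches, then measure the gaps
    # between them. Highly uniform gaps can hint at synthetic speech.
    intervals = librosa.effects.split(y, top_db=30)
    gaps = [
        (start - prev_end) / sr
        for (_, prev_end), (start, _) in zip(intervals[:-1], intervals[1:])
    ]

    return {
        "f0_std_hz": float(np.std(f0)),
        "pause_std_s": float(np.std(gaps)) if gaps else 0.0,
    }

for name in ("human.wav", "clone.wav"):
    print(name, voice_cues(name))
```

Neither number is conclusive on its own; the commercial detection tools mentioned later in this piece combine many such features with trained models.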
The Growing Threat of Voice Cloning
As AI voice technology improves, concerns about voice cloning are rising. Experts worry that cloned voices could be used in scams, in identity theft, or to manipulate individuals. One example comes from Assaf Rappaport, CEO of cybersecurity firm Wiz: criminals created a voice clone of him from a recent talk he gave and attempted to use it to trick his employees into revealing credentials, though the attempt was unsuccessful.
Cybersecurity expert Pete Nicoletti from Check Point Software recommends being cautious if you suspect someone is using a voice clone. He advises asking personal questions or suggesting you’ll call back to verify their identity. In a work setting, you should avoid making wire transfers based solely on a phone call from someone claiming to be a high-level executive.
The Future of AI Voices
AI voice technology is improving rapidly, and experts like Dane Sherrets, innovation architect at HackerOne, believe it will only get more convincing. AI can now mimic human-like inflection, breathing, and even hesitation, but it’s still not perfect. While AI can replicate much of human speech, it struggles to capture the full range of human emotions and the complexities of context.
As AI continues to advance, experts are working to develop better detection tools. McAfee, for example, is partnering with major PC manufacturers to install deepfake detection software on devices, and ElevenLabs offers a free tool to detect AI-generated voices. However, as AI technology and detection tools evolve, we may find ourselves in a race where distinguishing AI from humans becomes increasingly difficult.
Conclusion: The Importance of Face-to-Face Interaction
Given the growing capabilities of AI-generated voices, it’s becoming harder to tell whether you’re speaking to a human or a machine. Experts recommend being cautious and verifying someone’s identity through other channels, such as asking personal questions or calling back on a trusted number. In some cases, the best solution might be to spend more time interacting in person.
In the battle between AI-generated voices and detection technology, we may find that the key to distinguishing a real person from a machine lies not in the voice itself, but in the authenticity of human interaction.
Were you able to tell which voice was AI and which was human in our “Alice in Wonderland” challenge? The first clip was AI, and the second was human.