Voice AI (artificial intelligence), also known as AI voice, is an evolving landscape defined by how we communicate with technology and, more importantly, how it talks back to us. These two-way human-and-machine interactions occur everywhere, from call centers to smart speakers. Each medium offers a unique user flow and sonic experience, and each has its own brand identity and expression. For example, ask an Amazon Echo Dot (Alexa) and a Google Home Mini the same question: the outcomes will likely be similar in that you’ll be directed to a popular answer, but how you get there is another story. Each assistant delivers a distinct response, not just in its tone of voice but in its overall tonality.
This is because each platform understands and responds to us differently due to proprietary NLP (natural language processing) and NLU (natural language understanding), which is how their software detects speech and subsequently determines our verbal intents. While big tech may hold the text-to-speech victory title, newcomers are paving the way for more voice assistants and voiceover tools. Therein lies the sonic branding opportunity for brands both new and established. AI voice branding is as much about customizing what the voice sounds like as how it speaks, meaning there is room to align brand attribute associations and aural aesthetics within an audio identity.
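To make the NLU step above concrete, here is a drastically simplified sketch of intent detection. Real assistants use trained language models rather than keyword rules, and the intent names and keyword table below are invented for illustration, but the end goal is the same: map a transcribed utterance to a structured intent the software can act on.

```python
# Toy NLU intent detection: map a transcribed utterance to an intent.
# Intent names and keywords are fabricated for illustration; production
# systems use trained models, not keyword matching.
INTENTS = {
    "weather.today": ["weather", "forecast", "rain"],
    "music.play": ["play", "song", "music"],
    "timer.set": ["timer", "remind", "alarm"],
}

def detect_intent(utterance: str) -> str:
    """Return the intent whose keywords best overlap the utterance."""
    words = set(utterance.lower().split())
    scores = {
        intent: len(words & set(keywords))
        for intent, keywords in INTENTS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "fallback"

print(detect_intent("Alexa, what is the weather today"))  # prints "weather.today"
```

Each platform's proprietary version of this mapping is one reason the same question yields a different journey on each device.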
How is an AI voice made?
Synthetic voices can be purely computer-generated (remember Stephen Hawking’s voice?) but are now more commonly created by cloning human voices. The cloning process involves an individual recording some combination of words, syllables, or paragraphs to train an AI algorithm, which can later concatenate sound samples (systematically stitching audio snippets together), with varying degrees of complexity, to replicate the original voice. Once cloning is complete, putting the voice to use requires a process known as text-to-speech.
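The "stitching" described above can be sketched in a few lines. This is a toy model of concatenative synthesis, not any vendor's pipeline: the unit inventory and waveforms are fabricated, and real systems select units from hours of recordings and smooth the joins far more carefully.

```python
# Toy concatenative synthesis: a database maps speech units to
# prerecorded waveforms, and an utterance is assembled by joining
# units with a short linear crossfade. All data here is fabricated.

def crossfade_join(a: list, b: list, overlap: int = 3) -> list:
    """Join two sample lists, linearly blending `overlap` samples."""
    head, tail = a[:-overlap], a[-overlap:]
    blended = [
        t * (1 - i / overlap) + s * (i / overlap)
        for i, (t, s) in enumerate(zip(tail, b[:overlap]))
    ]
    return head + blended + b[overlap:]

def synthesize(units: list, database: dict) -> list:
    """Look up each unit's waveform and stitch them into one signal."""
    out = database[units[0]]
    for unit in units[1:]:
        out = crossfade_join(out, database[unit])
    return out

# Tiny fake unit database: each "unit" is just a handful of samples.
db = {"HH": [0.1] * 6, "AY": [0.5] * 6}
signal = synthesize(["HH", "AY"], db)  # a crude "hi"
```

The crossfade is the simplest version of the smoothing that keeps stitched snippets from clicking at the seams; the "varying degrees of complexity" in real cloning largely live in how units are chosen and blended.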
Text-to-speech (TTS) is relatively self-explanatory. In most cases, an AI voice will read what is written (text) aloud to you (speech). While this technology has roots in accessibility, the use cases are wide-ranging, from screen readers to full-on conversations and more. Traditionally, generating new TTS took coders familiar with SSML (speech synthesis markup language; the HTML of voice) and conversation designers to formulate phrases. Of course, there have always been workarounds, such as prefacing a phrase with “Alexa, Simon Says…” to have it read aloud to you. These workarounds remain clunky, though: they are voice-driven and require recording the playback. Today, thanks to user-friendly features, text-to-speech is as easy as type-to-talk.
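To see why SSML earns its "HTML of voice" nickname, here is a minimal snippet built with Python's standard library: plain text gets wrapped in tags that control how, not just what, the engine speaks. The tag names (`speak`, `prosody`, `break`) come from the W3C SSML specification, but which tags a given TTS engine honors varies by vendor.

```python
# Build a minimal SSML document: markup tags tell a TTS engine how to
# pace and pitch the speech, much as HTML tags tell a browser how to
# render text. Tag names follow the W3C SSML spec; engine support varies.
import xml.etree.ElementTree as ET

speak = ET.Element("speak")
greeting = ET.SubElement(speak, "prosody", rate="slow", pitch="+2st")
greeting.text = "Welcome back."
ET.SubElement(speak, "break", time="500ms")  # a half-second pause
question = ET.SubElement(speak, "prosody", rate="medium")
question.text = "What can I do for you today?"

ssml = ET.tostring(speak, encoding="unicode")
print(ssml)
```

Type-to-talk tools generate markup like this behind the scenes, which is exactly why non-coders can now do what once required an SSML-literate developer.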
Speaking of type-to-talk, have you heard the new TikTok voice in North America? It’s a youthful, optimistic voice known to users as “Jessie” since last year. Now ubiquitous on social media (via account linking and sharing), you can hear Jessie’s voice on Meta apps, including Instagram and Facebook. Although we’ve grown accustomed to hearing this voice (amongst a suite of new options), it was not actually the first AI voice of TikTok.
When TikTok first rolled out the TTS feature, it had a different default that was familiarly female-sounding, yet slightly more robotic. Soon after launch, that voice became the de facto narrator of posts about everything from pets to people across the United States. Then voice actor Bev Standing stepped forward and filed a lawsuit against TikTok, claiming the company had used her voice without permission. While the suit has since been settled, the scenario raises questions about the ethics and legality of voice cloning.
With someone’s voice (quite literally) at your fingertips, one-way dictation can be as powerful as it is problematic. On one hand, you have the power to make a bot say whatever you want. On the other, that means anyone can make the bot say something the original voice actor would never have said, including profanity. From a personal brand safety standpoint, voice actors have to strike their own balance on the continuum of comfort versus compensation. When it comes to brand safety for corporations, the script is flipped. In the case of Bev Standing, this meant her voice was potentially being used in ads by McDonald’s and KFC. If the trouble hadn’t started with fast food and foul play, another debate would have come up soon enough: the trend of text-to-speech sonic branding was only beginning.
As brands increasingly advertise on social media platforms, they begin to use the same creative tools as the users. As with many forms of advertising, the commercials start to blend in with the content. For example, on the radio, jingles sound like songs, and on TV, commercials look like sitcoms. Instagram ads now function as infomercials for a new generation, blending into the background as they become indiscernible from everything else in your feed. Until recently, celebrity spokespeople have been the go-to influencers for these ad campaigns. However, now, artificial intelligence is replacing influencers.
AI voices now act as “audio influencers.” Creators have handed these AI voice personas the role once held by human influencers, using them to pitch products and services; their disembodied voices now seamlessly sell us the latest fads. Watch and listen to this supercut of synthesized spokespeople taking sponsored posts by storm.
Without watching carefully, the first half may sound like one long ad. It is actually six different sponsored posts dictated by “Jessie.” As with shareable sonic branding, if every brand uses the same music and the same voice, it misses the opportunity to stand out as a leader. Were it not for the changes in music and scenery between sections, the continuity of the voice could trick listeners into thinking it was all one long commercial from a single company. Adding to the confusion, customers may also assume they are hearing a real-life testimonial rather than a scripted speech-synthesis soundbite.
In this video segment, there are a total of eleven campaigns strung together. The brands behind them include meal delivery services such as Factor and Freshly, as well as fitness apps like Joggo and Hydrow. Some of these direct-to-consumer newcomers or health and wellness startups may not have the big budgets of big tech to create their own characters, but it’s not just small companies participating in the trend. In the final example, even the Sonic the Hedgehog movie gets in on the audio influencer action. This is just a more recent version of the case of Bev Standing and McDonald’s, and it’s a hint at what may lie ahead.
The future of AI voice branding
We have only scratched the surface of speech synthesis on social media. As brand voice becomes virtual, text-to-speech will help scale up production for voice-first platforms and the Metaverse. We can already hear the groundbreaking work: Sonantic recreated Val Kilmer’s voice for Top Gun: Maverick after the actor’s battle with throat cancer, and deepfake technology has de-aged Mark Hamill’s voice. The reach spans from Hollywood to our homes as we preserve the memories of loved ones through voice; the VP of Alexa AI recently demonstrated how the sound of someone’s grandmother could be recreated from as little as a minute of audio. Given enough recordings and resources, brands can potentially time travel by cloning the voices of their founders, sports teams can revive legends, and companies will be able to use sound to enhance storytelling in ways we are only starting to imagine. We will traverse the uncanny valley as we learn to navigate what is real and what is surreal, but the benefits of AI voice branding will be boundless.
Cover image source: Jason Rosewell