Stanford University prides itself on its international diversity, touting that today's undergraduates hail from 70 countries. So a friend group that included a computer science major from China, an AI-focused management science and engineering (MSE) major from Russia, and a business-oriented MSE major from Venezuela isn't an anomaly. The friends did the normal things Stanford students do with their free time, like fountain hopping, cheering at football games, and hiking the trail around the Stanford Dish radio telescope.
And then came the pandemic.
"Stanford went virtual," Andres Perez Soderi recalls. (He's the member of the trio from Venezuela.) "And we scattered around the Bay Area, to San Francisco, Pleasanton, as well as Palo Alto, and we were keeping in touch online. School just isn't fulfilling when you aren't physically there, and we had a lot of time on our hands."
They also had an idea, sparked by a conversation with another friend, a computer science major who had gone back to his home in Guatemala, where he had gotten a job at a call center doing tech support in order to support his family.
"When he got the job," Soderi said, "we told him that he'd be the best tech support person they'd ever had. He's the smartest guy we've met and always had a smile on his face."
But the job didn't last—his customer satisfaction numbers were too low, because callers struggled to understand his accent and would lash out in frustration.
Because the three friends themselves spoke English with vastly different accents, the problem hit home.
"We decided to help the world understand and be understood," Soderi said.
They dedicated their empty pandemic hours to building a solution.
"We did a lot of research around what people have done in the past. People have done voice conversion for deep fakes, and that technology is pretty advanced. But there's been little done in accent translation. So, say, if I used an existing system to make me sound like Batman, I would sound like a Chinese-accented Batman," says Shawn Zhang, the trio's member from China.
"We knew about accent-reduction therapy and being taught to emulate the way someone else speaks in order to connect with them. And we knew from our own experience that forcing a different accent on yourself is uncomfortable. I went to a British high school and tried to force a British accent; it was an experience that was hard to digest. We thought if we could allow software to translate the accent [instead], we could let people speak naturally," says Soderi.
"Our first approach was naïve," Zhang says. "We built a system that converted speech to text and then text to speech." That wasn't going to be particularly useful for real-time conversation, their ultimate goal. So they began thinking about how to structure data to use in training a neural network to convert accents directly, speech to speech. They reached out to professors at Stanford and experts in industry to advise them.
And they filed the paperwork to incorporate as a company—Sanas. (Incorporation is something else that is not an unusual step when Stanford undergrads start tinkering with anything.)
The name came from a hunt through random syllables, looking for something that sounded good and was available to use. Sanas jumped out because it is a palindrome—and it turned out to refer to whispers or sounds in some forms of ancient Latin. They assigned the CTO title to Zhang, CFO to Soderi, and CEO to Maxim Serebryakov.
That all happened in the first half of 2020, and things have continued to move quickly. Sanas now has a full-time engineering staff of 14, including the founders, and three more part-time developers, plus two employees working on the business side. All now work remotely, spread out internationally. The company completed a seed funding round of US $5.5 million in late May, a few months shy of Zhang's twenty-first birthday, bringing total investment to about $6 million.
Baris Akis, the president and co-founder of Human Capital, who led the seed round, stated at the time: "As an immigrant from Turkey, I've always felt that getting rid of the accent barrier was a critical next step for a more fair and prosperous world."
Today, Sanas has an algorithm that can shift English to and from American, Australian, British, Filipino, and Spanish accents. They developed it using a neural network, trained with recordings made, for the most part, by professional voice actors.
Says Zhang, "You aren't just doing audio signal processing, changing the pitch and tone. You have to change the phonetics. So we really needed parallel data sets, created by readers using the same source material, so the neural network could learn to map from one to the other, examining both to learn how to transform the pronunciation."
The algorithm runs locally on a CPU (not in the cloud), with 150 milliseconds of delay, at the speech quality of telephone audio, working alongside communications apps like Zoom, Skype, and WhatsApp. A typical Zoom delay is about 50 milliseconds, bringing the total delay to about 200 milliseconds. Soderi says that anything below 300 to 350 milliseconds is generally imperceptible in audio communications, so users don't notice a lag. And the algorithm is efficient in terms of CPU usage.
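For readers who want to check the arithmetic, the delay budget described above can be written out in a few lines of Python. This is only an illustrative sketch using the figures quoted in this article; the constant and function names are hypothetical and are not taken from Sanas's software.

```python
# Delay figures quoted in the article (milliseconds).
# All names here are illustrative assumptions, not Sanas code.
ACCENT_CONVERSION_DELAY_MS = 150  # Sanas algorithm running locally on a CPU
TYPICAL_ZOOM_DELAY_MS = 50        # typical Zoom delay cited in the article
PERCEPTIBLE_LAG_MS = 300          # low end of the 300 to 350 ms range Soderi cites


def total_delay_ms(*stage_delays_ms: int) -> int:
    """Add up the delay contributed by each stage of the call path."""
    return sum(stage_delays_ms)


def headroom_ms(total_ms: int, threshold_ms: int = PERCEPTIBLE_LAG_MS) -> int:
    """Return how far the total delay sits below the perceptibility threshold."""
    return threshold_ms - total_ms


total = total_delay_ms(ACCENT_CONVERSION_DELAY_MS, TYPICAL_ZOOM_DELAY_MS)
print(f"total delay: {total} ms, headroom: {headroom_ms(total)} ms")
```

With the article's numbers, the total comes to 200 milliseconds, leaving 100 milliseconds of margin below even the stricter end of the range Soderi describes.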
But, Zhang admits, there's plenty of room for improvement. "We are trying to make [it] more clear, natural, and pleasant to hear; it's an ongoing process."
The team plans to add more accents within English, but also work with accents of other languages, including Spanish and French.
Their first customers will be among outsourcing companies, the kinds hired to provide customer service and other telephone support functions. Seven such firms are currently piloting the system.
"But that's just our first use case," says Zhang, "because it is a measurable and controlled environment. We don't see ourselves as a call center company, we want to go into healthcare, entertainment, education, and other spaces. We want to develop this as a tool that helps people with human-to-human interaction, without hurting their cultural identities."
Tekla S. Perry is a senior editor at IEEE Spectrum. Based in Palo Alto, Calif., she's been covering the people, companies, and technology that make Silicon Valley a special place for more than 40 years. An IEEE member, she holds a bachelor's degree in journalism from Michigan State University.