Big Brother is listening. Companies use “bossware” to listen to their employees when they’re near their computers. Multiple “spyware” apps can record phone calls. And home devices such as Amazon’s Echo can record everyday conversations. A new technology, called Neural Voice Camouflage, now offers a defense. It generates custom audio noise in the background as you talk, confusing the artificial intelligence (AI) that transcribes our recorded voices.
The new system uses an “adversarial attack.” The strategy employs machine learning (in which algorithms find patterns in data) to tweak sounds in a way that causes an AI, but not people, to mistake them for something else. Essentially, you use one AI to fool another.
The process isn’t as easy as it sounds, however. In a standard adversarial attack, the machine-learning AI needs to process the whole sound clip before knowing how to tweak it, which doesn’t work when you want to camouflage speech in real time.
So in the new study, researchers taught a neural network, a machine-learning system inspired by the brain, to effectively predict the future. They trained it on many hours of recorded speech so it can constantly process 2-second clips of audio and disguise what’s likely to be said next.
For instance, if someone has just said “enjoy the great feast,” it can’t predict exactly what will be said next. But by taking into account what was just said, as well as characteristics of the speaker’s voice, it produces sounds that will disrupt a range of possible phrases that could follow. That includes what actually happened next; here, the same speaker saying, “that’s being cooked.” To human listeners, the audio camouflage sounds like background noise, and they have no trouble understanding the spoken words. But machines stumble.
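The timing constraint described above can be illustrated with a short sketch. This is not the researchers' code: `stream_camouflage` and `placeholder_predictor` are hypothetical names, the half-second chunk size is an assumption chosen for illustration, and the predictor is a stand-in that emits quiet random noise where the real system would run a trained neural network. The only point is the ordering: the noise played alongside each chunk of speech is computed from earlier audio alone, so it is ready before the words are spoken.

```python
from collections import deque
import random

CONTEXT_SECONDS = 2.0   # context window reported in the article
CHUNK_SECONDS = 0.5     # assumption: chunk size chosen for illustration
CONTEXT_CHUNKS = int(CONTEXT_SECONDS / CHUNK_SECONDS)


def placeholder_predictor(context, n_samples):
    """Hypothetical stand-in for the trained network.

    A real predictor would shape the noise using the past audio in
    `context`; this one just returns low-amplitude random samples.
    """
    rng = random.Random(len(context))
    return [rng.uniform(-0.01, 0.01) for _ in range(n_samples)]


def stream_camouflage(chunks, predictor, context_chunks=CONTEXT_CHUNKS):
    """Yield (speech_chunk, noise_chunk) pairs in playback order.

    The noise for chunk i is predicted from the chunks before i only,
    so it never depends on audio that has not been spoken yet.
    """
    context = deque(maxlen=context_chunks)  # sliding window of past audio
    for chunk in chunks:
        noise = predictor(list(context), len(chunk))
        yield chunk, noise
        context.append(chunk)  # this chunk becomes context for later ones


if __name__ == "__main__":
    # Six fake half-second chunks of eight samples each.
    speech = [[0.1 * i] * 8 for i in range(6)]
    for i, (chunk, noise) in enumerate(stream_camouflage(speech, placeholder_predictor)):
        mixed = [s + n for s, n in zip(chunk, noise)]
        print(i, len(mixed))
```

A non-predictive attack would instead need the current chunk before it could compute the noise, so its output would always arrive late.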
The scientists overlaid the output of their system onto recorded speech as it was being fed directly into one of the automatic speech recognition (ASR) systems that might be used by eavesdroppers to transcribe. The system increased the ASR software’s word error rate from 11.3% to 80.2%. “I’m nearly starved myself, for this conquering kingdoms is hard work,” for example, was transcribed as “im mearly starme my scell for threa for this conqernd kindoms as harenar ov the reson.”
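For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of words in the reference. The sketch below shows that standard calculation. The function names and the text normalization (lowercasing and dropping punctuation) are assumptions made for illustration, and the study's own scoring may differ in detail.

```python
def normalize(text):
    """Lowercase and drop punctuation (an assumed normalization)."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return kept.split()


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from transcript
            insertion = cur[j - 1] + 1   # extra word in transcript
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "I’m nearly starved myself, for this conquering kingdoms is hard work"
    transcript = ("im mearly starme my scell for threa for this "
                  "conqernd kindoms as harenar ov the reson")
    print(f"{word_error_rate(reference, transcript):.1%}")
```

Because inserted words count as errors, the rate for a single sentence can exceed 100% when the transcript is longer than what was actually said.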
The error rates for speech disguised by white noise and a competing adversarial attack (which, lacking predictive capabilities, masked only what it had just heard with noise played half a second too late) were only 12.8% and 20.5%, respectively. The work was presented in a paper last month at the International Conference on Learning Representations, which peer reviews manuscript submissions.
Even when the ASR system was trained to transcribe speech perturbed by Neural Voice Camouflage (a technique eavesdroppers could conceivably employ), its error rate remained 52.5%. In general, the hardest words to disrupt were short ones, such as “the,” but these are the least revealing parts of a conversation.
The researchers also tested the method in the real world, playing a voice recording combined with the camouflage through a set of speakers in the same room as a microphone. It still worked. For example, “I also just got a new monitor” was transcribed as “with reasons with they also toscat and neumanitor.”
This is just the first step in safeguarding privacy in the face of AI, says Mia Chiquier, a computer scientist at Columbia University who led the research. “Artificial intelligence collects data about our voice, our faces, and our actions. We need a new generation of technology that respects our privacy.”
Chiquier adds that the predictive part of the system has great potential for other applications that need real-time processing, such as autonomous vehicles. “You have to anticipate where the car will be next, where the pedestrian might be,” she says. Brains also operate through anticipation; you feel surprise when your brain incorrectly predicts something. In that regard, Chiquier says, “We’re emulating the way humans do things.”
“There’s something nice about the way it combines predicting the future, a classic problem in machine learning, with this other problem of adversarial machine learning,” says Andrew Owens, a computer scientist at the University of Michigan, Ann Arbor, who studies audio processing and visual camouflage and was not involved in the work. Bo Li, a computer scientist at the University of Illinois, Urbana-Champaign, who has worked on audio adversarial attacks, was impressed that the new approach worked even against the fortified ASR system.
Audio camouflage is much needed, says Jay Stanley, a senior policy analyst at the American Civil Liberties Union. “All of us are susceptible to having our innocent speech misinterpreted by security algorithms.” Maintaining privacy is hard work, he says. Or rather it’s harenar ov the reson.