Coming soon – offline speech recognition on your phone
More than one in four people currently integrate speech recognition into their daily lives. A new algorithm developed by a University of Copenhagen researcher and his international colleagues makes it possible to interact with digital assistants like “Siri” without any internet connection, even on low-memory devices. The innovation allows speech recognition to be used anywhere, even in situations where security is paramount.
Talking to a computer was once the stuff of science fiction. Nowadays, saying “Hey Siri,” or addressing Alexa, Google or another digital assistant on a smartphone or other interactive gizmo, has become commonplace. Yet in the future, the role of speech recognition may become even more important.
While studies suggest that these technologies are already used by one in four people on a regular basis, should predictions hold true, by 2025 the number of devices equipped with speech recognition will exceed the planet’s population. And the technology is still evolving.
Until now, speech recognition has typically relied upon a device being connected to the internet. This is because the algorithms typically used for this process require significant amounts of temporary random access memory (RAM), which is usually provided by powerful data center servers. Indeed, try switching your smartphone to airplane mode and see how far your voice commands get you. But change is in the air.
A new algorithm developed by Professor Panagiotis Karras from the University of Copenhagen’s Department of Computer Science, together with speech technology researcher Nassos Katsamanis of the Athena Research Center in Greece and researchers from Aalto University in Finland and KTH in Sweden, allows smartphones or even smaller devices to decode speech without needing substantial memory or internet access.
The code, recently presented in a scientific article, employs a clever strategy: it "forgets" what it doesn’t need in real time.
“Speech recognition fundamentally works by matching the small speech sounds we use to form words and sentences, known as phonemes, with a library of corresponding sounds,” explains Panagiotis Karras. “Probabilities are calculated for matches and the subsequent combinations that go on to form our words and sentences. The most likely sequences are calculated and the software translates these sounds into text.”
Current algorithms require more memory the longer one speaks, because all alternative combinations must remain open until the final sound is analyzed. The new algorithm does away with this problem.
“The algorithm, conceived by Panos and developed further by our team, does something entirely new,” says co-developer and co-author Nassos Katsamanis. “Unlike the existing gold standard algorithm used since speech recognition’s early days, our algorithm only stores a fraction of the processing data, serving as a set of ‘coordinates.’ With these, an entire sequence can be reconstructed, which makes speech recognition possible with significantly less RAM.”
From Keywords to Entire Sentences
This maneuver may sound simple, but it involves an entirely new and unique code for which the researchers have sought a patent. The algorithm reduces the need for critical memory without sacrificing recognition quality. And though it requires slightly more time and computational power, the researchers say that the difference is negligible given the muscular capabilities of modern devices.
Moreover, it works without an internet connection, enabling speech recognition (and, the researchers hope, real-time language translation in the future) anywhere, even in the depths of the Amazon jungle.
More info: A Linguistic Pathfinder
To understand how computers manage speech recognition, imagine solving a maze with a pencil.
Traditional algorithms approach speech recognition in much the same way, by exploring all possible paths and remembering every dead end until the maze is essentially memorized and the goal is reached. This process places a heavy load on temporary memory as it tracks thousands of probabilities.
Panagiotis Karras’s new algorithm uses a principle that halves the problem at every step. Instead of remembering the entire maze, it keeps track of key points, recalculating paths as needed. In speech recognition, these key points are phonemes, which are stored as "coordinates" to reconstruct the optimal sequence later. This dramatically reduces memory requirements while maintaining accuracy.
The gold standard for the traditional, exhaustive approach is an older algorithm called Viterbi. It places heavy demands on a computer's temporary RAM storage, as it must calculate and remember the probability for every possible position in the maze at every step along the way. This can result in the algorithm having to keep track of millions of probabilities should the maze be long enough.
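For readers who want to see the idea in code, here is a minimal Python sketch of textbook Viterbi decoding. It is an illustration only, not the researchers' code: the two phoneme-like states ("ah" and "ee"), the three acoustic symbols and every probability are invented toy values. The thing to notice is the backptr list, which gains one entry per state for every step of the input. That is the memory that grows the longer one speaks.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Textbook Viterbi: return the most likely state sequence for obs."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    backptr = []  # one dict per time step: this list grows with the input
    for o in obs[1:]:
        new_score, ptr = {}, {}
        for s in states:
            # best predecessor of s at the previous time step
            p = max(states, key=lambda q: score[q] + log_trans[q][s])
            new_score[s] = score[p] + log_trans[p][s] + log_emit[s][o]
            ptr[s] = p
        score = new_score
        backptr.append(ptr)
    # Walk the stored back-pointers from the best final state to the start.
    path = [max(states, key=lambda s: score[s])]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


def logs(table):
    """Convert a dict of probabilities to log-probabilities."""
    return {k: math.log(v) for k, v in table.items()}


# Hypothetical toy model (all names and numbers are assumptions).
STATES = ["ah", "ee"]
LOG_START = logs({"ah": 0.6, "ee": 0.4})
LOG_TRANS = {"ah": logs({"ah": 0.7, "ee": 0.3}),
             "ee": logs({"ah": 0.4, "ee": 0.6})}
LOG_EMIT = {"ah": logs({"low": 0.5, "mid": 0.4, "high": 0.1}),
            "ee": logs({"low": 0.1, "mid": 0.3, "high": 0.6})}

print(viterbi(["low", "mid", "high"], STATES, LOG_START, LOG_TRANS, LOG_EMIT))
# prints ['ah', 'ah', 'ee']
```

With two states the stored table is tiny, but a real recognizer may track many thousands of states per step, so the table scales with both the vocabulary and the length of the utterance.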
Panagiotis's new algorithm employs a principle that continuously halves the problem. At every stretch along its path through the maze, it only remembers the midpoint. The result is a significantly reduced need for temporary memory, as these "midpoints" are recalculated before the final route is presented.
In speech recognition, these points are represented by phonemes – the smallest units of sound in speech, calculated as the best match for what is spoken at any given point in the sentence being analyzed. These phonemes and their probabilities are stored as something like coordinates along a path that the algorithm identifies as optimal, as it works to navigate between the first and last sounds in a sentence.
Ultimately, they can be used to reconstruct the entire "path" and provide the best possible interpretation of the spoken input as text.
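The midpoint idea can be sketched in Python as well. What follows is a generic divide-and-conquer illustration written for this explanation, under the assumption that remembering only the state at the midpoint of each stretch captures the spirit of the approach; it should not be read as the researchers' patented algorithm. The function names, the toy states "ah" and "ee", the acoustic symbols and all probabilities are invented. Each pass keeps just one score and one midpoint state per candidate, then the same procedure is repeated on the left and right halves.

```python
import math

NEG_INF = float("-inf")


def forward(obs, states, log_trans, log_emit, init, lo, hi, mid):
    """Score paths from time lo to hi, remembering only each path's state at mid."""
    score = dict(init)
    via = {s: s for s in states}  # state occupied at time mid on the best path
    for t in range(lo + 1, hi + 1):
        new_score, new_via = {}, {}
        for s in states:
            p = max(states, key=lambda q: score[q] + log_trans[q][s])
            new_score[s] = score[p] + log_trans[p][s] + log_emit[s][obs[t]]
            new_via[s] = s if t <= mid else via[p]
        score, via = new_score, new_via
    return score, via


def decode_low_memory(obs, states, log_start, log_trans, log_emit):
    """Most likely state sequence, recomputing halves instead of storing back-pointers."""
    path = [None] * len(obs)

    def solve(lo, hi, init, end_state):
        path[hi] = end_state
        if hi == lo:
            return
        mid = (lo + hi) // 2
        _, via = forward(obs, states, log_trans, log_emit, init, lo, hi, mid)
        m = via[end_state]
        if mid == lo:  # only two time steps left: nothing more to split
            path[lo] = m
            return
        solve(lo, mid, init, m)  # left half must end in the midpoint state
        pinned = {s: (0.0 if s == m else NEG_INF) for s in states}
        solve(mid, hi, pinned, end_state)  # right half starts from it

    init = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    last = len(obs) - 1
    score, _ = forward(obs, states, log_trans, log_emit, init, 0, last, 0)
    solve(0, last, init, max(states, key=lambda s: score[s]))
    return path


def logs(table):
    """Convert a dict of probabilities to log-probabilities."""
    return {k: math.log(v) for k, v in table.items()}


# Hypothetical toy model (all names and numbers are assumptions).
STATES = ["ah", "ee"]
LOG_START = logs({"ah": 0.6, "ee": 0.4})
LOG_TRANS = {"ah": logs({"ah": 0.7, "ee": 0.3}),
             "ee": logs({"ah": 0.4, "ee": 0.6})}
LOG_EMIT = {"ah": logs({"low": 0.5, "mid": 0.4, "high": 0.1}),
            "ee": logs({"low": 0.1, "mid": 0.3, "high": 0.6})}

print(decode_low_memory(["low", "mid", "high"],
                        STATES, LOG_START, LOG_TRANS, LOG_EMIT))
# prints ['ah', 'ah', 'ee']
```

The trade-off in this sketch matches the one described in the article: no table of back-pointers is kept, but stretches of the input are scored more than once, so it spends extra computation to save memory.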
Single words or very short sentences are generally manageable, even though current software must store alternative sequences and libraries of potential sound interpretations. However, as sentences become longer and potential word combinations more complex, the demand for RAM increases.
“Certain small devices can already recognize and act based upon a few words without internet connectivity. For example, a smart home system can recognize keywords such as ‘turn on’ or ‘turn off’. This is known as small-vocabulary speech recognition. With our algorithm, it will be possible to recognize more extensive instructions or, in principle, entire languages – without an internet connection. This is referred to as large-vocabulary speech recognition,” says Professor Karras.
Enhanced Inclusion, Security, and Energy Savings
According to the researchers, the invention opens up a range of possibilities, from practical, security-related, and societal benefits to significant energy-saving potential.
For instance, many people could benefit from the ability to translate foreign languages while traveling, regardless of internet access. This is one possibility that the researchers hope to achieve. But the societal impact of linguistic accessibility, both now and in the future, could be far more significant.
Nassos Katsamanis sees great promise in the technology: “This algorithm can help democratize language technology by making information more accessible. Making translation tools and speech assistants available regardless of internet access will allow more people to engage in society. In particular, it will help people without written language skills or those with physical disabilities, by enabling them to understand and influence societal decisions.”
Another key advantage of this speech recognition invention is its security implications. When security is paramount, the new algorithm addresses a significant problem: internet connections can be hacked. By eliminating the need for internet access, the algorithm enhances security.
Furthermore, while the energy used by data centers to support current speech recognition technology may be invisible to consumers, it is highly relevant in a world facing climate change. The growing demand for this technology, when met by this invention, could lead to significant energy savings by reducing the enormous need for temporary memory.
“It is vital to reduce energy consumption to minimize reliance on fossil fuels, as many data centers still use these energy sources,” concludes Professor Karras.
About the study
The following researchers have contributed to the project:
Martino Ciaperoni, Aalto University, Finland.
Athanasios (Nassos) Katsamanis, Athena Research Center, Greece.
Aristides Gionis, KTH Royal Institute of Technology, Sweden and Aalto University, Finland.
Panagiotis Karras, Department of Computer Science, University of Copenhagen.
Contact
Panagiotis Karras
Professor
Department of Computer Science
University of Copenhagen
paka@di.ku.dk
piekarras@gmail.com
+45 9141 6469
Athanasios (Nassos) Katsamanis
Principal Researcher
Institute for Language and Speech Processing
Athena Research Center, Greece
nkatsam@athenarc.gr
+30 210 6875405
Kristian Bjørn-Hansen
Journalist and Press Contact
Faculty of Science
University of Copenhagen
kbh@science.ku.dk
+45 93 51 60 02