Organizers of a European Union–supported software sharing platform for language technologies are planting seeds for applications that could debut on it with some eye-catching results: We might see the sprouting of a Basque-speaking, Alexa-style home language assistant, for instance.
A first-release version of the platform, called the European Language Grid, is already being used to distribute and gain visibility for language usage and translation tools from some of the hundreds of European firms trading in language technology. Many of the tools make it possible to communicate among speakers of complex languages that are spoken by relatively few people: Irish Gaelic, Maltese, and Latvian, to name a few.
If it seems global technology giants such as Google or Amazon could deliver these kinds of tools, maybe that’s right. But they may not dedicate the time and ensure the polish that a dedicated niche developer might. Besides, supporters of the initiative say, Europe should take care of its own digital infrastructure. Getting linguistic architectures to work easily and freely is a key interest on a continent that is trying to hold together a strained economic and social union straddling dozens of mother tongues.
The Language Grid is meant to create a broad marketplace for language technology in Europe, says Georg Rehm, a principal researcher at the German Research Center for Artificial Intelligence (DFKI).
The Grid is a scalable web platform that allows access to data sets and tools that are docked behind the platform’s interface. The base infrastructure is operated on a Kubernetes cluster, a set of node machines that run containerized applications built by service providers. It’s all hosted by the cloud provider SysEleven in Berlin. Users can access data and tools in the Docker containers without needing to install anything locally. Grid organizers recently picked 10 early-stage projects that can be supported by the platform, boosting them with small research grants. Another open call for projects is running through October and November. Results are likely in early January 2021.
“Our technologies and services will be more visible to a broader market we would otherwise not be able to reach,” says Igor Leturia Azkarate, speech technologies manager at Elhuyar Fundazioa, a non-governmental organization promoting the everyday use of Basque, especially in science and technology. “We hope it will help other speakers of minority languages be aware of the possibilities, and that they will take advantage of our work.”
Azkarate and his colleagues are adapting Basque language text-to-speech and speech recognition tools to work within Mycroft AI, a Python-based open-source software voice assistant. The goal is to make a home assistant speaker, an Alexa-like device, that operates natively in Basque. Right now, the big home assistants operate in the world’s dozen or so most widely spoken languages. Azkarate’s after something better than obliging users to go into Spanish or English, or to wait for an as-yet-undeveloped Basque front-end facsimile or halfway solution that might still leave a user with a Julio Iglesias playlist on their hands rather than some Iñigo Muguruza. Once the Elhuyar team adapts its Basque tools, they’ll be accessible on the Language Grid for others to use or experiment with.
Another early-stage project is coming from Jörg Tiedemann at the University of Helsinki, who is working with colleagues to develop open translation models for the Grid. These models use deep neural networks, layered software architectures that implement complex mathematical functions, to map text into numeric representations. Training the models on data sets takes a lot of computing power and is expensive. Making the trained models available for re-use spares developers that cost and will help them build tools for low-density languages. “Minority languages get too little attention because they are not commercially interesting,” Tiedemann says. “This gap needs to be closed.”
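For readers who want a feel for what “mapping text into numeric representations” means, here is a minimal toy sketch in Python. It is not Tiedemann’s code or any part of the Grid: the function names, the tiny Basque word list, and the vector size are all made up for illustration, and the vectors are random numbers, whereas a real translation model learns them during training.

```python
import random


def build_vocab(sentences):
    """Give each distinct lowercase word an integer id, in order of first appearance."""
    vocab = {"<unk>": 0}  # id 0 is reserved for words never seen before
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(sentence, vocab):
    """Turn a sentence into a list of word ids, using 0 for unknown words."""
    return [vocab.get(word, 0) for word in sentence.lower().split()]


def make_embeddings(vocab_size, dim, seed=0):
    """Create one vector per word id. Random here; a trained model would learn these."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


def sentence_vector(token_ids, embeddings):
    """Average the word vectors into one fixed-length vector for the sentence."""
    dim = len(embeddings[0])
    if not token_ids:
        return [0.0] * dim
    return [
        sum(embeddings[t][d] for t in token_ids) / len(token_ids)
        for d in range(dim)
    ]


# Hypothetical two-sentence "corpus" of Basque greetings, for illustration only.
corpus = ["kaixo mundua", "egun on mundua"]
vocab = build_vocab(corpus)
embeddings = make_embeddings(len(vocab), dim=4)

ids = encode("kaixo mundua", vocab)
print(ids)                               # word ids for the sentence
print(sentence_vector(ids, embeddings))  # one 4-number vector for the sentence
```

The expensive part the researchers describe is everything this sketch skips: adjusting millions of such numbers, across many layers, until the vectors capture enough meaning to translate from one language to another.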
Andrejs Vasiļjevs, chief executive of the language technology company Tilde, got his start because of a scarcity of digital tools in his native Latvia. In the late 1980s, he was studying computer science in Riga; in those days Latvia was part of the Soviet Union, with personal computing a limited realm. As the Union collapsed, PCs came in and people wanted to use them to start independent newspapers and magazines. But because there were neither Latvian keyboards nor Latvian fonts, it was not possible to write in Latvian. Vasiļjevs got to work on the problem and started Tilde in 1991 with a business partner, Uldis Dzenis.
Three decades later, Tilde is still making tools to spur communication, but now in machine translation, speech synthesis, and speech recognition. A Tilde translation engine is currently running underneath Germany’s EU presidency website, working alongside machine translation from DFKI, German firm DeepL, and the European Commission’s own eTranslation service. The site provides on-the-fly translations from German, French, and English originals into all the other 21 official EU languages. The Riga-based developer already has several data sets and models on the Language Grid for potential clients to test out, including a machine translation engine for English to Bulgarian and back, and a Latvian text-to-speech model with a child’s voice. “We want to integrate our key services into the European Language Grid,” Vasiļjevs says. “It makes for more exposure to the market.”
This article appears in the December 2020 print issue as “I’m Speaking Your Language.”
Michael Dumiak is a Berlin-based writer and reporter covering science and culture and a longtime contributor to IEEE Spectrum. For Spectrum, he has covered digital models of ailing hearts in Belgrade, reported on technology from Minsk and shale energy from the Estonian-Russian border, explored cryonics in Saarland, and followed the controversial phaseout of incandescent lightbulbs in Berlin. He is author and editor of Woods and the Sea: Estonian Design and the Virtual Frontier.