How Duolingo’s AI Learns What You Need to Learn
The AI that powers the language-learning app today could disrupt education tomorrow
It’s lunchtime when your phone pings you with a green owl who cheerily reminds you to “Keep Duo Happy!” It’s a nudge from Duolingo, the popular language-learning app, whose algorithms know you’re most likely to do your 5 minutes of Spanish practice at this time of day. The app chooses its notification words based on what has worked for you in the past and the specifics of your recent achievements, adding a dash of attention-catching novelty. When you open the app, the lesson that’s queued up is calibrated for your skill level, and it includes a review of some words and concepts you flubbed during your last session.
Duolingo, with its gamelike approach and cast of bright cartoon characters, presents a simple user interface to guide learners through a curriculum that leads to language proficiency, or even fluency. But behind the scenes, sophisticated artificial-intelligence (AI) systems are at work. One system in particular, called Birdbrain, is continuously improving the learner’s experience with algorithms based on decades of research in educational psychology, combined with recent advances in machine learning. But from the learner’s perspective, it simply feels as though the green owl is getting better and better at personalizing lessons.
The three of us have been intimately involved in creating and improving Birdbrain, of which Duolingo recently launched its second version. We see our work at Duolingo as furthering the company’s overall mission to “develop the best education in the world and make it universally available.” The AI systems we continue to refine are necessary to scale the learning experience beyond the more than 50 million active learners who currently complete about 1 billion exercises per day on the platform.
Although Duolingo is known as a language-learning app, the company’s ambitions go further. We recently launched apps covering childhood literacy and third-grade mathematics, and these expansions are just the beginning. We hope that anyone who wants help with academic learning will one day be able to turn to the friendly green owl in their pocket who hoots at them, “Ready for your daily lesson?”
The origins of Duolingo
Back in 1984, educational psychologist Benjamin Bloom identified what has come to be called Bloom’s 2-sigma problem. Bloom found that average students who were individually tutored performed two standard deviations better than they would have in a classroom. That’s enough to raise a person’s test scores from the 50th percentile to the 98th.
When Duolingo was launched in 2012 by Luis von Ahn and Severin Hacker out of a Carnegie Mellon University research project, the goal was to make an easy-to-use online language tutor that could approximate that supercharging effect. The founders weren’t trying to replace great teachers. But as immigrants themselves (from Guatemala and Switzerland, respectively), they recognized that not everyone has access to great teachers. Over the ensuing years, the growing Duolingo team continued to think about how to automate three key attributes of good tutors: They know the material well, they keep students engaged, and they track what each student currently knows, so they can present material that’s neither too easy nor too hard.
Duolingo uses machine learning and other cutting-edge technologies to mimic these three qualities of a good tutor. First, to ensure expertise, we employ natural-language-processing tools to assist our content developers in auditing and improving our 100-odd courses in more than 40 different languages. These tools analyze the vocabulary and grammar content of lessons and help create a range of possible translations (so the app will accept learners’ responses when there are multiple correct ways to say something). Second, to keep learners engaged, we’ve gamified the experience with points and levels, used text-to-speech tech to create custom voices for each of the characters that populate the Duolingo world, and fine-tuned our notification systems. As for getting inside learners’ heads and giving them just the right lesson—that’s where Birdbrain comes in.
Birdbrain is crucial because learner engagement and lesson difficulty are related. When students are given material that’s too difficult, they often get frustrated and quit. Material that feels easy might keep them engaged, but it doesn’t challenge them as much. Duolingo uses AI to keep its learners squarely in the zone where they remain engaged but are still learning at the edge of their abilities.
One of us (Settles) joined the company just six months after it was founded, helped establish various research functions, and then led Duolingo’s AI and machine-learning efforts until last year. Early on, there weren’t many organizations doing large-scale online interactive learning. The closest analogues to what Duolingo was trying to do were programs that took a “mastery learning” approach, notably for math tutoring. Those programs offered up problems around a similar concept (often called a “knowledge component”) until the learner demonstrated sufficient mastery before moving on to the next unit, section, or concept. But that approach wasn’t necessarily the best fit for language, where a single exercise can involve many different concepts that interact in complex ways (such as vocabulary, tenses, and grammatical gender), and where there are different ways in which a learner can respond (such as translating a sentence, transcribing an audio snippet, and filling in missing words).
The early machine-learning work at Duolingo tackled fairly simple problems, like how often to return to a particular vocabulary word or concept (which drew on educational research on spaced repetition). We also analyzed learners’ errors to identify pain points in the curriculum and then reorganized the order in which we presented the material.
Duolingo then doubled down on building personalized systems. Around 2017, the company started to make a more focused investment in machine learning, and that’s when coauthors Brust and Bicknell joined the team. In 2020, we launched the first version of Birdbrain.
How we built Birdbrain
Before Birdbrain, Duolingo had made some non-AI attempts to keep learners engaged at the right level, including estimating the difficulty of exercises based on heuristics such as the number of words or characters in a sentence. But the company often found that it was dealing with trade-offs between how much people were actually learning and how engaged they were. The goal with Birdbrain was to strike the right balance.
The question we started with was this: For any learner and any given exercise, can we predict how likely the learner is to get that exercise correct? Making that prediction requires Birdbrain to estimate both the difficulty of the exercise and the current proficiency of the learner. Every time a learner completes an exercise, the system updates both estimates. And Duolingo uses the resulting predictions in its session-generator algorithm to dynamically select new exercises for the next lesson.
When we were building the first version of Birdbrain, we knew it needed to be simple and scalable, because we’d be applying it to hundreds of millions of exercises. It needed to be fast and require little computation. We decided to use a flavor of logistic regression inspired by item response theory from the psychometrics literature. This approach models the probability of a person giving a correct response as a function of two variables, which can be interpreted as the difficulty of the exercise and the ability of the learner. We estimate the difficulty of each exercise by summing up the difficulty of its component features like the type of exercise, its vocabulary words, and so on.
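As a rough illustration of that idea, here is a minimal sketch of an item-response-style logistic model. It assumes the simplest form, in which the probability depends only on the gap between ability and difficulty; the function names, feature names, and weights are hypothetical and are not taken from Duolingo's code.

```python
import math

def exercise_difficulty(features, feature_difficulty):
    """Sum the difficulty weights of an exercise's component features."""
    return sum(feature_difficulty.get(f, 0.0) for f in features)

def p_correct(ability, difficulty):
    """Logistic probability that a learner answers an exercise correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical feature weights, for illustration only.
feature_difficulty = {
    "type:translate": 0.4,
    "word:gato": -0.3,
    "word:durmiendo": 0.9,
}
difficulty = exercise_difficulty(
    ["type:translate", "word:gato", "word:durmiendo"], feature_difficulty
)

# A learner whose ability matches the difficulty has about even odds;
# a stronger learner is predicted to succeed more often.
print(p_correct(1.0, difficulty))
print(p_correct(2.5, difficulty))
```

When ability equals difficulty the predicted probability is one half, and it rises toward 1 as the learner's ability pulls ahead of the exercise.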
The second ingredient in the original version of Birdbrain was the ability to perform computationally simple updates on these difficulty and ability parameters. We implement this by performing one step of stochastic gradient descent on the relevant parameters every time a learner completes an exercise. This turns out to be a generalization of the Elo rating system, which is used to rank players in chess and other games. In chess, when a player wins a game, their ability estimate goes up and their opponent’s goes down. In Duolingo, when a learner gets an exercise wrong, this system lowers the estimate of their ability and raises the estimate of the exercise’s difficulty. Just like in chess, the size of these changes depends on the pairing: If a novice chess player wins against an expert player, the expert’s Elo score will be substantially lowered, and their opponent’s score will be substantially raised. Similarly, here, if a beginner learner gets a hard exercise correct, the ability and difficulty parameters can shift dramatically, but if the model already expects the learner to be correct, neither parameter changes much.
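The update described above can be sketched in a few lines. This is an assumption-laden toy, not the production code: it treats exercise difficulty as a single number rather than a sum of feature weights, and the learning rate `lr` is a made-up value.

```python
import math

def p_correct(ability, difficulty):
    """Logistic probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def sgd_update(ability, difficulty, correct, lr=0.1):
    """One stochastic-gradient step on the log loss for a single response."""
    # The error is the gap between what happened and what the model expected.
    error = (1.0 if correct else 0.0) - p_correct(ability, difficulty)
    # Elo-style: the two estimates move in opposite directions by the same amount.
    return ability + lr * error, difficulty - lr * error

# Surprising result: a beginner gets a hard exercise right, so both estimates shift a lot.
a1, d1 = sgd_update(ability=-1.0, difficulty=2.0, correct=True)

# Expected result: a strong learner gets an easy exercise right, so little changes.
a2, d2 = sgd_update(ability=2.0, difficulty=-1.0, correct=True)

print(a1 - (-1.0), a2 - 2.0)  # size of each ability change
```

Because the step is proportional to the prediction error, the size of the change depends on how surprising the outcome was, which is the same behavior Elo ratings show in chess.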
To test Birdbrain’s performance, we first ran it in “shadow mode,” meaning that it made predictions that were merely logged for analysis and not yet used by the session generator to personalize lessons. Over time, as learners completed exercises and got answers right or wrong, we saw whether Birdbrain’s predictions of their success matched reality, and if they didn’t, we made improvements.
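One plausible way to compare logged predictions with outcomes is to compute a log loss and a calibration table. The sketch below assumes each logged record is a (predicted probability, was correct) pair. The metrics, the bin count, and the sample log are illustrative choices, not a description of the analysis Duolingo actually ran.

```python
import math

def log_loss(records):
    """Average negative log-likelihood of logged (prediction, outcome) pairs."""
    eps = 1e-12
    total = 0.0
    for p, correct in records:
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total -= math.log(p) if correct else math.log(1.0 - p)
    return total / len(records)

def calibration_table(records, n_bins=5):
    """Per probability bin: (mean predicted, observed success rate, count)."""
    bins = {}
    for p, correct in records:
        b = min(int(p * n_bins), n_bins - 1)
        bins.setdefault(b, []).append((p, 1.0 if correct else 0.0))
    table = {}
    for b, rows in sorted(bins.items()):
        n = len(rows)
        table[b] = (sum(p for p, _ in rows) / n, sum(y for _, y in rows) / n, n)
    return table

# Made-up shadow-mode log: predictions were recorded but never shown to learners.
shadow_log = [(0.9, True), (0.85, True), (0.82, False),
              (0.3, False), (0.35, True), (0.25, False)]

print(log_loss(shadow_log))
for b, (predicted, observed, n) in calibration_table(shadow_log).items():
    print(b, round(predicted, 2), round(observed, 2), n)
```

In a well-calibrated model, the mean predicted probability in each bin sits close to the observed success rate for that bin.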
Once we were satisfied with Birdbrain’s performance, we started running controlled tests: We enabled Birdbrain-based personalization for a fraction of learners (the experimental group) and compared their learning outcomes with those who still used the older heuristic system (the control group). We wanted to see how Birdbrain would affect learner engagement—measured by time spent on tasks in the app—as well as learning, measured by how quickly learners advanced to more difficult material. We wondered whether we’d see trade-offs, as we had so often before when we tried to make improvements using more conventional product-development or software-engineering techniques. To our delight, Birdbrain consistently caused both engagement and learning measures to increase.
Scaling up Duolingo’s AI systems
From the beginning, we were challenged by the sheer scale of the data we needed to process. Dealing with around a billion exercises every day required a lot of inventive engineering.
One early problem with the first version of Birdbrain was fitting the model into memory. During nightly training, we needed access to several variables per learner, including their current ability estimate. Because new learners were signing up every day, and because we didn’t want to throw out estimates for inactive learners in case they came back, the amount of memory grew every night. After a few months, this situation became unsustainable: We couldn’t fit all the variables into memory. We needed to update parameters every night without fitting everything into memory at once.
Our solution was to change the way we stored both each day’s lesson data and the model. Originally, we stored all the parameters for a given course’s model in a single file, loaded that file into memory, and sequentially processed the day’s data to update the course parameters. Our new strategy was to break up the model: One piece represented all exercise-difficulty parameters (which didn’t grow very large), while several chunks represented the learner-ability estimates. We also chunked the day’s learning data into separate files according to which learners were involved and—critically—used the same chunking function across learners for both the course model and learner data. This allowed us to load only the course parameters relevant to a given chunk of learners while we processed the corresponding data about those learners.
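The key trick, using one chunking function for both the learner parameters and the day's data, might look something like the following sketch. The hash-based `chunk_of` function, the chunk count, and the in-memory dictionaries standing in for files are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict

NUM_CHUNKS = 4  # hypothetical; a real system would use far more chunks

def chunk_of(learner_id, num_chunks=NUM_CHUNKS):
    """Deterministically map a learner ID to a chunk index."""
    digest = hashlib.sha256(learner_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_chunks

def partition(items, key):
    """Group items into chunks by the learner ID that key(item) returns."""
    chunks = defaultdict(list)
    for item in items:
        chunks[chunk_of(key(item))].append(item)
    return chunks

# Made-up learner-ability estimates and one day's exercise events.
abilities = {"ana": 0.3, "ben": -0.1, "chi": 1.2}
events = [("ana", "ex1", True), ("chi", "ex2", False), ("ana", "ex3", True)]

# The same chunking function is applied to the model and to the data.
model_chunks = partition(abilities.items(), key=lambda kv: kv[0])
data_chunks = partition(events, key=lambda e: e[0])

for c in sorted(data_chunks):
    params = dict(model_chunks[c])  # only this chunk's learners are loaded
    for learner, exercise, correct in data_chunks[c]:
        print(c, learner, exercise, correct, params[learner])
```

Because both sides are partitioned by the same function, every event in a data chunk refers to a learner whose parameters live in the matching model chunk, so each chunk can be processed on its own.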
One weakness of this first version of Birdbrain was that the app waited until a learner finished a lesson before it reported to our servers which exercises the user got right and what mistakes they made. The problem with that approach is that roughly 20 percent of lessons started on Duolingo aren’t completed, perhaps because the person put down their phone or switched to another app. Each time that happened, Birdbrain lost the relevant data, which was potentially very interesting data! We were pretty sure that people weren’t quitting at random—in many cases, they likely quit once they hit material that was especially challenging or daunting for them. So when we upgraded to Birdbrain version 2, we also began streaming data throughout the lesson in chunks. This gave us critical information about which concepts or exercise types were problematic.
Another issue with the first Birdbrain was that it updated its models only once every 24 hours (during a low point in global app usage, which was nighttime at Duolingo’s headquarters, in Pittsburgh). With Birdbrain V2, we wanted to process all the exercises in real time. The change was desirable because learning operates at both short- and long-term scales; if you study a certain concept now, you’ll likely remember it 5 minutes from now, and with any luck, you’ll also retain some of it next week. To personalize the experience, we needed to update our model for each learner very quickly. Thus, within minutes of a learner completing an exercise, Birdbrain V2 will update its “mental model” of their knowledge state.
In addition to occurring in near real time, these updates also worked differently because Birdbrain V2 has a different architecture and represents a learner’s knowledge state differently. Previously, that property was simply represented as a scalar number, as we needed to keep the first version of Birdbrain as simple as possible. With Birdbrain V2, we had company buy-in to use more computing resources, which meant we could build a much richer model of what each learner knows. In particular, Birdbrain V2 is backed by a recurrent neural-network model (specifically, a long short-term memory, or LSTM, model), which learns to compress a learner’s history of interactions with Duolingo exercises into a set of 40 numbers—or in the lingo of mathematicians, a 40-dimensional vector. Every time a learner completes another exercise, Birdbrain will update this vector based on its prior state, the exercise that the learner has completed, and whether they got it right. It is this vector, rather than a single value, that now represents a learner’s ability, which the model uses to make predictions about how they will perform on future exercises.
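To make the idea of a 40-number learner state concrete, here is a toy LSTM cell written in plain Python. It uses the standard LSTM equations, but everything else is assumed: the weights are random rather than trained, the three-number input encoding is invented, and treating the 40 numbers as the LSTM's hidden state is our guess at the simplest reading. It is not Birdbrain's actual architecture.

```python
import math
import random

STATE_DIM = 40  # size of the learner vector mentioned in the article
INPUT_DIM = 3   # hypothetical encoding: [exercise difficulty, was_correct, bias]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def make_gate(rng):
    """Random (untrained) weights for one gate over [input, hidden] columns."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(INPUT_DIM + STATE_DIM)]
            for _ in range(STATE_DIM)]

def apply_gate(weights, xh, activation):
    """Apply one gate: a weighted sum per row, passed through an activation."""
    return [activation(sum(w * v for w, v in zip(row, xh))) for row in weights]

def lstm_step(params, x, h, c):
    """One standard LSTM cell update; returns the new hidden and cell states."""
    xh = x + h  # concatenate the current input with the previous hidden state
    i = apply_gate(params["input"], xh, sigmoid)
    f = apply_gate(params["forget"], xh, sigmoid)
    o = apply_gate(params["output"], xh, sigmoid)
    g = apply_gate(params["candidate"], xh, math.tanh)
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new

rng = random.Random(0)
params = {name: make_gate(rng)
          for name in ("input", "forget", "output", "candidate")}

h = [0.0] * STATE_DIM  # learner state before any exercises
c = [0.0] * STATE_DIM
history = [(0.5, True), (1.2, False), (0.8, True)]  # made-up (difficulty, correct) pairs
for difficulty, correct in history:
    x = [difficulty, 1.0 if correct else 0.0, 1.0]
    h, c = lstm_step(params, x, h, c)

print(len(h))
```

Each call to `lstm_step` folds one more exercise into the running state, so the vector depends on the whole history without the history itself being stored.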
The richness of this representation allows the system to capture, for example, that a given learner is great with past-tense exercises but is struggling with the future tense. V2 can begin to discern each person’s learning trajectory, which may vary considerably from the typical trajectory, allowing for much more personalization in the lessons that Duolingo prepares for that individual.
Once we felt assured that Birdbrain V2 was accurate and stable, we conducted controlled tests comparing its personalized learning experience with that of the original Birdbrain. We wanted to be sure not only that we had a better machine-learning model but also that our software provided a better user experience. Happily, these tests showed that Birdbrain V2 consistently caused both engagement and learning measures to increase even further. In May 2022, we turned off the first version of Birdbrain and switched over entirely to the new and improved system.
What’s next for Duolingo’s AI
Much of what we’re doing with Birdbrain and related technologies applies outside of language learning. In principle, the core of the model is very general and can also be applied to our company’s new math and literacy apps—or to whatever Duolingo comes up with next.
Birdbrain has given us a great start in optimizing learning and making the curriculum more adaptive and efficient. How far we can go with personalization is an open question. We’d like to create adaptive systems that respond to learners based not only on what they know but also on the teaching approaches that work best for them. What types of exercises does a learner really pay attention to? What exercises seem to make concepts click for them?
Those are the kinds of questions that great teachers might wrestle with as they consider various struggling students in their classes. We don’t believe that you can replace a great teacher with an app, but we do hope to get better at emulating some of their qualities—and reaching more potential learners around the world through technology.