No matter the differences in language and culture, both Chinese- and English-language Internet users apparently find common ground in using easily guessable password variants of “123456.” But a recent study comparing password patterns between the two languages also found notable and unique features in Chinese passwords that have big implications for Internet security beyond China.
The password habits of Chinese-language users have been surprisingly understudied given that they make up more than 20 percent of all Internet users worldwide. More than 854 million people use the Internet in China alone—more than double the entire population of the United States. That’s why a group of Chinese and U.S. researchers set out to test how password security among both Chinese- and English-language users stands up against the best cracking algorithms.
“Our work may be among the first studies to examine the passwords of different languages,” says Ding Wang, an information security researcher at Peking University, in Beijing.
Wang and his colleagues analyzed 106 million real passwords from nine Web services (73 million passwords from six Chinese-language services and 33 million passwords from three English-language services) that were exposed by hackers and leaked online between 2009 and 2012. They were careful to directly compare the security of passwords only from similar Web service counterparts among the mix of social forums, gaming services, e-commerce websites, and programmer forums, plus the Yahoo Internet portal on the English-language side of the data set. Their results appear in a paper presented at the 28th USENIX Security Symposium held in Santa Clara, Calif., from 14 to 16 August.
What may seem like a strong password based on English-language assumptions could actually be quite weak and easy to guess from a Chinese-language perspective. Yet many of the world’s popular Web services, including some homegrown Chinese services, approach password security from an English-language perspective.
The researchers pointed to the example of the popular Chinese password “woaini1314” that is currently rated “strong” by password strength meters used by AOL, Google, and even the popular Chinese social network Sina Weibo (and by IEEE Spectrum’s parent organization, IEEE). But speakers of Mandarin Chinese, the most widely spoken variety of Chinese, can easily guess the “woaini1314” password because “woaini” in Chinese pinyin (the romanization system for Chinese characters) means “I love you,” and “1314” sounds like “forever” in Chinese.
One main difference between Chinese-language and English-language passwords is that many Chinese-language users favor passwords consisting purely of digits. Beyond the infamous “123456” password, other popular passwords among Chinese-language users include “111111,” “123123,” and “123321.” Playing on the love theme, “5201314” is used because it sounds similar to the phrase “I love you forever and ever” in Chinese. Some popular password segments will add a letter to the string of digits, such as “a12345” and “12345a.”
Chinese-language users also often use their mobile phone numbers or certain dates (perhaps their birthdays) in passwords, something that English-language users don’t do as often. Instead, English-language users frequently compose passwords made purely of letters and lean toward certain words or phrases such as the easily guessable “password,” “letmein,” “sunshine,” and “princess.” Among English-language users, some of the most popular passwords include “abcdef” and “abc123” alongside “123456.”
Passwords that use only digits are easier to crack than passwords of the same length made only of letters because each character is drawn from just 10 possible digits as opposed to the 26 letters in the modern English alphabet. But Chinese-language speakers sometimes demonstrated incredibly complex and creative passwords: Some members of the Chinese Software Developer Network (CSDN) service combined programming language commands with traditional Chinese poems.
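The arithmetic behind the digits-versus-letters comparison can be sketched in a few lines of Python. This is only an illustration: it assumes every character is chosen uniformly at random and that letters are lowercase only, which real users rarely do, and the function names are hypothetical.

```python
import math

def keyspace(alphabet_size: int, length: int) -> int:
    """Number of distinct strings of `length` characters over the alphabet."""
    return alphabet_size ** length

def entropy_bits(alphabet_size: int, length: int) -> float:
    """Bits of entropy, assuming each character is picked uniformly at random."""
    return length * math.log2(alphabet_size)

if __name__ == "__main__":
    for length in (6, 8, 10):
        digits = keyspace(10, length)   # passwords made only of digits
        letters = keyspace(26, length)  # passwords made only of lowercase letters
        print(
            f"length {length}: {digits:,} digit strings vs "
            f"{letters:,} letter strings ({letters / digits:,.0f}x more)"
        )
```

For a six-character password, the digit-only space holds one million candidates, while the letter-only space holds roughly 309 million. Popular choices such as “123456” are far weaker than even this estimate suggests, because attackers try common strings first.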
“Chinese users can be really creative with combinations of letters and digits,” says Yuan Tian, a computer scientist at the University of Virginia in Charlottesville, Va., and coauthor on the study.
The password files used by researchers contained hashes of leaked or stolen passwords, not plain-text versions of the passwords themselves. The researchers tried to crack both Chinese-language and English-language passwords using two state-of-the-art password-guessing algorithms. They tested the Markov-chain model, which assigns a probability to each password character based on the characters that precede it, and the probabilistic context-free grammars (PCFG) model, which parses passwords into letter segments, digit segments, and symbol segments before generating guesses in order of the most likely combinations.
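To make the Markov-chain idea concrete, here is a minimal sketch of a first-order (character-pair) model. It is not the researchers' implementation: the tiny training list reuses passwords quoted in this article, and the boundary markers, the add-one smoothing, and the alphabet size of 96 are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict

# Hypothetical markers for the start and end of a password.
START, END = "\x02", "\x03"

def train_bigram_model(passwords):
    """Count how often each character follows each other character."""
    counts = defaultdict(lambda: defaultdict(int))
    for pw in passwords:
        chars = [START, *pw, END]
        for prev, nxt in zip(chars, chars[1:]):
            counts[prev][nxt] += 1
    return counts

def log_probability(counts, password, alphabet_size=96):
    """Log-probability of a password, with add-one smoothing.

    alphabet_size=96 assumes 95 printable ASCII characters plus END.
    """
    chars = [START, *password, END]
    total = 0.0
    for prev, nxt in zip(chars, chars[1:]):
        following = counts.get(prev, {})
        numerator = following.get(nxt, 0) + 1
        denominator = sum(following.values()) + alphabet_size
        total += math.log(numerator / denominator)
    return total

# Toy training set built from passwords mentioned in the article.
TRAINING = ["123456", "111111", "123123", "123321", "a12345", "12345a"]
model = train_bigram_model(TRAINING)

if __name__ == "__main__":
    candidates = ["123456", "112233", "zqxwvk"]
    # A guessing attack would try the highest-scoring candidates first.
    for pw in sorted(candidates, key=lambda p: log_probability(model, p), reverse=True):
        print(pw, round(log_probability(model, pw), 2))
```

A cracking tool built on such a model would enumerate candidate strings in decreasing order of probability rather than scoring a fixed list, and it would typically condition on more than one preceding character.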
The team also improved the PCFG approach by modifying it to account for certain password patterns more common to Chinese-language users. For example, they added number segments in the popular date format and Chinese names as written in the romanized pinyin system. They also gave their PCFG-based algorithm the capability to process the interleaving patterns (strings of alternating digits and letters) found in many Chinese passwords.
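The segment parsing that a PCFG-style model starts from, along with a date-aware digit tag, might look roughly like the sketch below. The tag names, the `DATE` label, and the single YYYYMMDD format are assumptions made for this example; the article does not specify which date formats or tags the researchers used, and pinyin-name detection is left out entirely.

```python
import re
import string
from datetime import datetime

# Runs of ASCII letters, runs of digits, or runs of anything else.
SEGMENT_RE = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9]+")

def is_date(digits):
    """True if an 8-digit run parses as YYYYMMDD (an assumed format)."""
    if len(digits) != 8:
        return False
    try:
        datetime.strptime(digits, "%Y%m%d")
    except ValueError:
        return False
    return True

def segment(password):
    """Split a password into (tag, text) runs: L letters, D digits, S symbols."""
    segments = []
    for match in SEGMENT_RE.finditer(password):
        text = match.group()
        if text[0] in string.ascii_letters:
            tag = "L"
        elif text[0] in string.digits:
            # Hypothetical extension: give date-like digit runs their own tag.
            tag = "DATE" if is_date(text) else "D"
        else:
            tag = "S"
        segments.append((tag, text))
    return segments

def template(password):
    """Structure string such as 'L6D4', built from tags and segment lengths."""
    return "".join(f"{tag}{len(text)}" for tag, text in segment(password))

if __name__ == "__main__":
    for pw in ("woaini1314", "wang19900101", "abc!123"):
        print(pw, segment(pw), template(pw))
```

A full PCFG model would go on to learn how often each template and each segment value occurs in training data, then combine those frequencies to rank guesses.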
Together, those efforts boosted the modified PCFG-based algorithm’s performance against the Chinese password data sets—it cracked between 98 percent and 188 percent more passwords than the standard version of the algorithm.
The results also highlighted key strengths and weaknesses of Chinese-language passwords in comparison with English-language passwords. When limited to 10,000 or fewer guess attempts, both types of algorithms cracked a larger share of Chinese passwords than English passwords. But the remaining Chinese passwords proved stronger than their English counterparts as the number of guesses increased beyond 10,000 attempts.
The number of guesses matters because many Web services limit the number of online guesses before temporarily locking a user’s account. Leaked or stolen password storage files could allow hackers to make a theoretically unlimited number of offline guessing attacks because they don’t have to deal with possibly being locked out of a Web service. But even offline guess attacks are still limited by the cost-effectiveness of spending computing time and resources on so many guess attempts.
Between the two cracking algorithms, the Markov-based algorithm performed better when given the opportunity to make 10,000 or more guesses. By comparison, the PCFG-based algorithm performed as well as or better than the Markov-based algorithm with a smaller number of guesses. The PCFG-based approach also proved more efficient, given that it required 31 percent less computation and 70 percent less memory than the Markov-based approach.
From a security standpoint, the study also has big implications for companies operating Web services that have significant numbers of Chinese-language users—or even companies that hope to someday attract significant numbers of Chinese speakers as customers. Security administrators may want to consider adjusting password creation policies and strength meters to account for the most popular Chinese-language passwords and future variants, Tian says.
It’s also clear that individual Chinese-language speakers can do themselves a favor by avoiding predictable digit patterns such as “123456” and “111111” for their passwords, not to mention the predictable letter and letter/digit hybrid patterns based on romantic themes of eternal love. (The same goes for English-language speakers still using “123456” and “abcdef”: just stop!)
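One way a strength meter could account for the patterns described above is to check candidate passwords against known popular strings and predictable fragments before applying any length or character-class rules. The sketch below is hypothetical: the lists contain only examples quoted in this article, and the function name and rule set are assumptions, not a description of any real service's meter.

```python
import string

# Popular passwords quoted in the article; a real meter would need a far
# larger list that is updated regularly.
COMMON_PASSWORDS = {
    "123456", "111111", "123123", "123321", "5201314", "a12345", "12345a",
    "woaini1314", "password", "letmein", "sunshine", "princess",
    "abcdef", "abc123",
}

# Pinyin and digit-homophone fragments mentioned in the article.
PREDICTABLE_FRAGMENTS = ("woaini", "1314")

def weakness_reasons(password):
    """Return a list of reasons the password looks predictable (empty if none)."""
    reasons = []
    lowered = password.lower()
    if lowered in COMMON_PASSWORDS:
        reasons.append("appears on the common-password list")
    if password != "" and all(ch in string.digits for ch in password):
        reasons.append("consists only of digits")
    for fragment in PREDICTABLE_FRAGMENTS:
        if fragment in lowered:
            reasons.append(f"contains the predictable fragment '{fragment}'")
    return reasons

if __name__ == "__main__":
    for pw in ("woaini1314", "5201314", "Tr0ub4dor&3x"):
        print(pw, weakness_reasons(pw) or "no known predictable pattern")
```

An empty result here means only that none of these few patterns matched, not that the password is strong.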
The complexities of language’s influence on passwords may go even deeper within just the Chinese-language community. Chinese-language users generally rely upon the same set of Chinese characters for reading and writing, but spoken Chinese has multiple regional differences based on local dialects that can sound different in terms of pronunciation. As just one example, the pronunciation of “I love you” in Mandarin Chinese—considered mainland China’s official national language—sounds different from the pronunciation of the same phrase in the Cantonese branch of Chinese spoken by many people living in or originating from places such as Hong Kong, Macau, and Guangdong.
Those regional differences in spoken Chinese were beyond the scope of this particular study. But Tian observed that there could be differences in password patterns if speakers of Cantonese, Hokkien, Shanghainese, or other regional variants of Chinese tried creating passwords based on pronunciation.
As part of a deeper dive, researchers hope to continue evaluating Chinese-language password patterns by using surveys to better understand what Chinese Internet users are thinking when creating their passwords. And they raised the possibility of continuing their comparative studies of passwords in different languages beyond just Chinese and English.
“For our future work, we want to cover passwords around the world beyond China,” Wang says.
Jeremy Hsu has been working as a science and technology journalist in New York City since 2008. He has written on subjects as diverse as supercomputing and wearable electronics for IEEE Spectrum. When he’s not trying to wrap his head around the latest quantum computing news for Spectrum, he also contributes to a variety of publications such as Scientific American, Discover, Popular Science, and others. He is a graduate of New York University’s Science, Health & Environmental Reporting Program.