4 December 2008—At a rally in rural North Carolina during the 2008 U.S. presidential election campaign, Alaska governor Sarah Palin infamously said that there was a “real America” and presumably a fake one. Though she was the butt of jokes for the remainder of the campaign, in a way Palin was right. One state over, a team of computer scientists and a physicist from Virginia Polytechnic Institute and State University (Virginia Tech), in Blacksburg, Va., was creating a fake America of its own.
The group has designed what it claims is the largest, most detailed, and realistic computer model of the lives of about 100 million Americans, using enormous amounts of publicly available demographic data. The model’s makers hope the simulation will shed light on the effects of human comings and goings, such as how a contagion spreads, a fad grows, or traffic flows. In the next six months, the researchers expect to be able to simulate the movement of all 300 million residents of the United States.
As many as 163 variables, mostly drawn from the U.S. Census, come into play for each synthetic American. Called EpiSimdemics, the model almost perfectly matches the demographic attributes of groups with at least 1500 people, according to Keith Bisset, a senior research associate who works on the simulation’s software. The software generates fake people to populate real communities and assigns each person characteristics such as age, education level, and occupation to mirror local statistics derived from the most recent national census. In accordance with the data, some individuals are clustered into families, while others live alone.
Every synthetic household is assigned a real street address, based on land-use information from Navteq, a digital-mapping company. Using data from a business directory, each employed individual is matched to a specific job within a reasonable commute from the person’s home. Similarly, actual schools, supermarkets, and shopping centers identified through Navteq’s database are also linked to households based on their proximity to the home. When an artificial American goes grocery shopping, the simulation algorithm assigns probabilities that he or she will visit one store over another, adding an element of unpredictability to a person’s daily schedule.
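The store-selection step can be pictured with a short sketch. The article does not say how EpiSimdemics computes its probabilities, so the weighting below (each store's chance is proportional to the inverse of its distance from home) is purely an assumption for illustration, as are the store names, the distances, and the `choose_store` function.

```python
import random


def choose_store(stores, rng):
    """Pick one store at random; nearer stores are more likely, not certain.

    `stores` maps a store name to its distance from home in kilometers.
    Assumption for illustration: weight = 1 / distance.
    """
    names = list(stores)
    weights = [1.0 / stores[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


# Hypothetical household with three nearby stores.
stores_km = {"Main St Grocery": 1.0, "Mall Supermarket": 4.0, "Corner Market": 2.0}

rng = random.Random(42)  # seeded so the run is repeatable
visits = [choose_store(stores_km, rng) for _ in range(1000)]

for name in stores_km:
    print(name, visits.count(name))
```

Over many simulated shopping trips the nearest store gets the most visits, but any single trip can still go to the farthest one, which is the unpredictability the paragraph above describes.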
Though the simulation is not restricted to the study of contagious diseases, a major application so far has been modeling how a flu epidemic might propagate through different regions. To accomplish this, each person has an embedded model of how he or she might respond to the flu, with probabilities derived from epidemiological data and the person’s age and general health.
Now imagine that a few of those model citizens become infected with the flu. Discerning the impact of millions of unique behaviors on infectious disease patterns involves performing many millions of tiny but intertwined calculations. As the sample population grows, those calculations quickly become a hefty computing task. “The lack of symmetry and regularity makes these types of problems very different from traditional physics problems that require large computing power,” says Stephen Eubank, the physicist on the project. “We have to address all kinds of scaling issues with the very irregular communication patterns in the model.”
To break up the problem into computable chunks, the software treats each person and each location as a separate set of calculations. In a flu experiment, the algorithm starts with a person in a given health state. If the person is recently infected, his or her health will steadily deteriorate over the course of a day. The victim may begin to show symptoms, and at a certain time the person will become contagious. The algorithm stores a record of each person’s health state as it was at each of the locations he or she visited throughout the day.
Once they have been compiled, those health records are dispatched to the modules of code representing the locations visited by each person. The algorithm checks all the interactions among people who were at a location at the same time and determines the number of new infections that arose from the day’s encounters. After those calculations are finished, the location module sends the updated infection data back to the modules representing each person. Each person and each location is calculated on a unique processing element so that many parts of the algorithm can be computed in parallel on a supercomputer. “This brute-force computation changes qualitatively how we think about the problem,” says Christopher Barrett, who works on the project and is the director of Virginia Tech’s Network Dynamics and Simulation Science Laboratory.
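The person-to-location-and-back exchange described above can be sketched as a single-process toy. This is not the EpiSimdemics code: the `Person` class, the `simulate_day` function, the three health states, and the per-hour transmission probability are all assumptions made for illustration, and the gradual day-long progression of a person's health is collapsed into one state per day. The real system distributes each person and location across processing elements; here the phases simply run in sequence.

```python
import random
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Person:
    pid: int
    state: str  # "susceptible", "infectious", or "recovered" (simplified)
    # Daily schedule as (location_id, start_hour, end_hour) visits.
    schedule: list = field(default_factory=list)


def simulate_day(people, p_transmit_per_hour, rng):
    """Run one simulated day and return the ids of newly infected people."""
    # Phase 1: each person sends a visit record to every location visited.
    inbox = defaultdict(list)
    for person in people.values():
        for loc, start, end in person.schedule:
            inbox[loc].append((person.pid, person.state, start, end))

    # Phase 2: each location independently checks who overlapped in time.
    # The locations share no data, so this loop is the part that could
    # run in parallel.
    newly_infected = set()
    for visits in inbox.values():
        for _, state_a, start_a, end_a in visits:
            if state_a != "infectious":
                continue
            for pid_b, state_b, start_b, end_b in visits:
                if state_b != "susceptible":
                    continue
                overlap = min(end_a, end_b) - max(start_a, start_b)
                if overlap <= 0:
                    continue  # never at the location at the same time
                # Assumed model: independent chance of infection per hour.
                p_infect = 1 - (1 - p_transmit_per_hour) ** overlap
                if rng.random() < p_infect:
                    newly_infected.add(pid_b)

    # Phase 3: locations report back and each person updates its state.
    for pid in newly_infected:
        people[pid].state = "infectious"
    return newly_infected


# Hypothetical three-person town: 1 and 2 share an office shift, 3 works nights.
people = {
    1: Person(1, "infectious", [("office", 9, 17)]),
    2: Person(2, "susceptible", [("office", 9, 17)]),
    3: Person(3, "susceptible", [("office", 18, 22)]),
}
print(simulate_day(people, p_transmit_per_hour=0.05, rng=random.Random(0)))
```

Because the work is keyed by person in the first and third phases and by location in the second, each phase splits into independent pieces, which is what lets the problem be spread across many processors.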
For a recent experiment on flu transmission over three months in the Chicago area, for example, the researchers ran 10 iterations of the simulation in 30 minutes each on a cluster of three dozen machines. By virtue of organizing the problem into people and location entities, they were able to speed up the software substantially; using the algorithms available five years ago, a single simulation on comparable machines would have taken up to six hours.
Each run of the simulation reveals the path that the virus took through the population, which could help identify particularly vulnerable subpopulations and the most effective public health interventions. The simulation can also indicate the number of infections each day over the course of the study period, which is important because the peak in infections indicates the biggest burden on the city’s health system. To simulate flu transmission across the entire country, the computer scientists plan to incorporate human air travel next, using flight data from the International Air Transport Association on the number of flights connecting any two hubs. “The vision is for a Google-like interface, where you approach the system and ask it a question,” says Barrett. “The framework is there, and now we’re pushing the system to larger and larger scales.”