World's Largest Dataset on Human Genetic Variation Goes Public

National Institutes of Health finds one way to manage big data for biomedical science

The entire contents of the National Institutes of Health's 1000 Genomes Project, all 200 terabytes of it, will be made freely available to the public, the agency announced today. The project is touted as the world's largest set of data on human genetic variation. Amazon's cloud computing unit, Amazon Web Services, will store the database.
The project aims to provide a foundation for investigating how human genetic variation contributes to health and disease. Making the whole thing available for free means more scientists can use the data and, hopefully, draw conclusions more quickly about the relationship between genotype and diseases such as cancer and diabetes.
The project was initiated in 2008 and is based on the genomes of more than 2600 people from 26 populations around the world. Results from sequencing the DNA of 1700 of those people are being released to the cloud now. The remaining 900 samples will be sequenced this year.
The NIH's initiative is part of a larger movement to manage the deluge of "big data" in science, a task that has become a scientific discipline in itself. Such data sets have become so massive that few researchers have the computing power to use them. The NIH has calculated that the 1000 Genomes Project is the equivalent of 16 million file cabinets filled with text, or more than 30 000 standard DVDs.
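The DVD comparison can be sanity-checked with a line of arithmetic. The sketch below is not the NIH's calculation; it assumes decimal units (1 terabyte = 1000 gigabytes) and a single-layer DVD capacity of 4.7 gigabytes, neither of which is stated in the announcement.

```python
# Rough check of the "more than 30 000 standard DVDs" comparison.
# Assumptions (not from the article): decimal units and a 4.7 GB single-layer DVD.
DATASET_TB = 200          # dataset size quoted in the article
DVD_CAPACITY_GB = 4.7     # assumed capacity of one "standard" DVD
GB_PER_TB = 1000          # assumed decimal conversion


def dvds_needed(dataset_tb: float, dvd_gb: float = DVD_CAPACITY_GB) -> float:
    """Return how many DVDs of the given capacity would hold the dataset."""
    return dataset_tb * GB_PER_TB / dvd_gb


if __name__ == "__main__":
    print(f"{dvds_needed(DATASET_TB):,.0f} DVDs")
```

Under these assumptions the result is a little over 42 000 discs, which is consistent with the "more than 30 000" figure.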
Making it available in the cloud is a good deal for scientists and their institutions, who won't have to take on the costs of acquiring more bandwidth, data storage and analytical computing capacity just to access the data. "This means researchers and labs of all sizes and budgets have access to the complete 1,000 Genomes Project data and can immediately start analyzing and crunching the data without the investment it would normally require in hardware, facilities and personnel," says Deepak Singh, a principal product manager at Amazon Web Services. "Researchers can focus on advancing science, not obtaining the resources required for their research."
It may also end up being a good deal for Amazon Web Services (A.W.S.). Manipulating this much information requires a lot of computing power, and A.W.S. will be charging for additional resources that can be used to further process or analyze the data, reports the New York Times.
The White House, for its part, sees the 1000 Genomes Project in the cloud as one example of the kind of solutions it is proposing through its Big Data Research and Development Initiative. The Office of Science and Technology Policy announced today that more than $200 million will be doled out to six federal agencies in an effort to make the most of the mountains of data being created for scientific discovery, environmental and biomedical research, education, and national security.




