A new discipline has bloomed at the intersection of biology and computer science. Called bioinformatics, it is already so far advanced that many life scientists spend more time at their computers than they do at the laboratory bench. They gobble processing power like peanuts, burying themselves in the massive comparison of genes and the chemical instructions the genes issue to the body’s cells.
“Massive” took on new meaning in September, when Paul G. Allen, cofounder of Microsoft Corp., in Redmond, Wash., unveiled the greatest bioinformatics initiative yet: a map called the Allen Brain Atlas, indicating which of 20 000 genes is doing what in the brain and where it is doing it. The project is being undertaken by the Allen Institute for Brain Science, in Seattle, Wash.
Allen put up the US $100 million that the three-year mission is expected to consume, and he has promised to put the entire thing on the Web, in quarterly installments, for individual researchers to access free of charge. Then the real magic will begin: neuroscientists will sift the data for insight into the workings of the mind and clues to the causes, and possibly the cures, of such devastating ailments as Parkinson’s disease, epilepsy, depression, obsessive-compulsive disorder, alcoholism, and schizophrenia.
These brain disorders mostly resist today’s drug treatments, and because they generally torture people for years without killing them, they have created a huge pool of patients whose care costs tens of billions of dollars—not to mention indirect economic costs, notably from lost workdays. New ideas for drug therapy are desperately required; Allen’s Atlas promises to provide them.
Up to now the biggest numbers game in biology had been run by the publicly financed Human Genome Project, which sequenced each of the three billion letters in the DNA code for a human being. “I know a neuroscientist who downloaded the Human Genome Project onto an Apple iPod,” says Mark Boguski, an M.D. and Ph.D. who is a veteran of that project and who now directs the Atlas [see photo, “Cranial Cartographer”]. “But that was 3 gigabytes, and we will be producing petabytes.”
The orders-of-magnitude calculation is simple: multiply 20 000 genes by a trillion neurons. Nobody will be downloading this mass of data—that’s for sure. Drug companies and other power users that want to get their arms around the entire data set to apply their own algorithms will have to pay for special access to the Atlas computers or to other computers carrying a copy of all its data.
The project is negotiating with a supplier that it won’t name for 24 servers (or nodes) it wants for its computer farm, says Brian Crook, a software engineer who’s been with the project for the past six months. “That’s just a start. In my last job, I worked on a system that had just 15 terabytes in compressed form, and it required 60 nodes.” The Atlas will certainly need hundreds.
It may seem strange that so much more information can come from a single organ than from a genome that specifies the entire body, but DNA is really more like a recipe than a blueprint. It tells you not where every last component will go but merely what instructions to follow to get them there. Just as a very few ingredients can give rise to the delicately detailed swirls of foam in a soufflé, so can a handful of genes specify the stupendous complexity of the cerebral cortex—in this case, the mouse cortex.
Big as the Brain Atlas will be, its managers intend to start small, by scanning the gray matter of a little black-furred thoroughbred called the C57BL/6 mouse. It is a genetically uniform, fast-growing critter that has the added advantage of having all its genes decoded. With the transcript of that code—which includes about as many genes as humans have—the project’s directors hope to get a lot of work done fast and show some medically valuable results.
Only later will they begin associating structures in the mouse brain with those in the human one, a painstaking slog that must await the development of technology that can safely zero in on the physiology of individual neurons without first having to kill the subject.
“Paul Allen did not come to us and say, ‘Make an atlas of the mouse brain,’” Boguski says. “But he very quickly came to realize that the mouse brain was a very powerful tool.” By “us,” he means the board of advisers, an august bunch that includes the linguist Steven Pinker, professor of psychology at Harvard University in Cambridge, Mass., and author of The Language Instinct, and the molecular biologist James D. Watson, a codiscoverer of the structure of DNA, who is now president of Cold Spring Harbor Laboratory on Long Island, N.Y.
Allen was fascinated by the Human Genome Project, and like many computer people, he had also been interested in modeling the mind. The brain project fit both interests and also promised to give a lot of bang for his philanthropic buck. For several years, such a project had been on the wish list of the government-funded National Institutes of Health, in Bethesda, Md. Only now, however, had the technology become equal to the task. Methods of speeding and automating biological research had ripened, the mouse genome had been sequenced, and the ability to manage large data sets had matured.
Winner: Allen Brain Atlas
Goal: In three years, create a comprehensive map of the mouse brain showing in which cells each of 20 000 genes is active, and make it available to researchers free of charge on the Web. Later, make an equivalent map of the human brain.
Why It’s a Winner: Generating more biological data than any project that has come before, the Atlas will provide the most detailed map of the most complex organ. It will act as a key resource for understanding and then combating many intractable brain disorders.
Organization: Allen Institute for Brain Science
Center of Activity: Seattle, Wash.
Number of People on the Project: 45, going to 100 in two years
Budget: US $100 million provided by Paul Allen
How can a mere mouse yield medical wonders? There are plenty of physical diseases for which the mouse has proved to be a useful model, and there is no reason it cannot uncover the roots of mental disorders as well. True, a mouse does not have all that much upstairs. Yet, though the rodents do not suffer from the same sort of depression that afflicts people—they do not despise themselves or pray for death—a few seem to experience something rather like it. A normal mouse, set afloat in a tank with a hidden underwater perch, will tread water until its feet find purchase. “Quasidepressive” mice give up quickly and sink. However, treated with antidepressants, those same mice will persevere.
The mouse leg of the Allen Brain Atlas is just getting started in a long, low building in the Seattle area. A walk through the place gives me a distinct feeling of déjà vu: multiple office kitchens filled with appliances, sunny conference rooms strewn with $1000 Aeron executive chairs. Yep, this place was previously occupied by a dot-com company that apparently closed up shop before its employees had time to leave a single coffee stain on the carpet.
As visitors enter the laboratory proper, they pass a few big pieces of equipment from Germany, still in their crates; a lot more are on order. People are on order, too, as Boguski quietly makes clear to a delegation of British scientists. Advertisements in the science journal Nature have elicited a flood of résumés from experts in animal care, neuroscience, and computer science—the current hiring rate runs at about three per week.
Right now, just three people are sitting in the lab, all huddled along a pair of benches. Another 20 or so have yet to relocate from temporary quarters in the downtown Seattle offices of Vulcan Ventures Inc., the investment firm that manages Allen’s many biotech enterprises. By spring, the project should be staffed with up to about 45 people; in another couple of years, it will reach 100, including a number of top scientists holding joint academic appointments.
The basic goal is to show what the brain genes do and where they do it. Each gene directs the manufacture of a particular protein. Famous examples of such proteins include hemoglobin (which carries oxygen in the blood) and the enzymes that produce the hormones estrogen (which feminizes women) and testosterone (which turns men into fools).
In the brain, some of the most interesting proteins are receptors, so called because they sit on the cell membrane and receive chemical messengers. For instance, the dopamine D4 receptor detects the messenger dopamine as it comes in from neighboring neurons; this receptor is thought to play a role in schizophrenia, depression, attention-deficit disorder, and even the penchant for novel experiences.
Ideally, the Atlas would study the proteins directly, but because proteins are hard to detect in fine detail and at high production speeds, the Atlas will instead follow an easier target: nucleic acids that ferry data from the nucleus to the structures that translate them into protein. The technique involves slicing the brain into many thin sections, putting each one on a slide, and exposing it to chemicals that attach to the nucleic acids you’re interested in.
Chemical reactions turn the attached acids a color so that their location in the brain can be scanned and digitized. The resulting deluge of data must be cataloged so that various search algorithms can fit it all into physiologically meaningful patterns.
To get a first pass through 20 000 genes within three years, the Atlas project will rear mice to a precise age, then sacrifice them minutes, even seconds, before cutting their brains into 25-µm slices, about three cells thick. Speed is of the essence: other organs can be put on ice, but brains need glucose and oxygen from second to second or they begin to die. To help freeze important molecules in place, the workers will inject a preservative into the mouse while it’s still alive and use the heart to pump the liquid through the brain.
The entire process will be standardized and, to the extent possible, roboticized to increase productivity. A slice will be wafted to a slide with a puff of air, lightly glued to it—rather as a Post-It slip is glued to a desk—then examined under a microscope and its image digitized [see photo, “One Down...”].
No brains are being dissected right now, and the various ways of automating the slicing, dicing, dissecting, and staining are still being worked out. It’s clear, though, that the factory will have to work fast. Given a 1.5-cm-long brain and a 25-µm-thick slice, you get 600 slices per brain. Because each slice can reveal just three genes, a whole brain’s worth of slices covers only three genes, so the goal of looking at 20 000 genes altogether requires about 7000 brains (perhaps 8000 for good measure). That comes to roughly four million slices, most of which will have to be processed in the last two years of the three-year run, when the system should be generating some 30 000 slides per week.
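The slice arithmetic above can be checked in a few lines. What follows is a minimal sketch, not the project's own planning model: the constants are the figures quoted in the text, while the function names, the 52-week year, and the reading that one brain covers three genes on one slice per slide are assumptions made for illustration.

```python
import math

# Figures quoted in the article.
BRAIN_LENGTH_UM = 15_000     # a 1.5-cm-long brain, in micrometers
SLICE_THICKNESS_UM = 25      # thickness of one slice
GENES_PER_BRAIN = 3          # assumption: three genes per slice means three per brain
TOTAL_GENES = 20_000
PEAK_SLIDES_PER_WEEK = 30_000
WEEKS_PER_YEAR = 52          # assumption: no downtime

def slices_per_brain(length_um: int, thickness_um: int) -> int:
    """Number of whole slices that fit along the brain's length."""
    return length_um // thickness_um

def brains_needed(total_genes: int, genes_per_brain: int) -> int:
    """Brains required to cover every gene, rounded up."""
    return math.ceil(total_genes / genes_per_brain)

slices = slices_per_brain(BRAIN_LENGTH_UM, SLICE_THICKNESS_UM)   # 600
brains = brains_needed(TOTAL_GENES, GENES_PER_BRAIN)             # 6667
total_slices = slices * brains                                    # 4_000_200

# Share of all slices that two years at the peak rate would handle.
two_year_output = PEAK_SLIDES_PER_WEEK * WEEKS_PER_YEAR * 2
share_in_last_two_years = two_year_output / total_slices

print(f"{slices} slices per brain, {brains} brains, {total_slices:,} slices")
print(f"two years at peak rate cover {share_in_last_two_years:.0%} of the total")
```

The exact count of 6667 brains rounds to the article's 7000, and two years at the peak rate account for a little over three-quarters of the slices, which squares with "most."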
Already, Baylor College of Medicine, in Houston, Texas, has produced some sample slides for the programmers to play with while they optimize the software [see screen shot, “Brainscape Navigator”]. By thus tackling the information technology (IT) challenge first, they will have the project ready when the flood of homegrown data starts pouring in. A slide’s worth of data may not seem much, coming as it does from just three genes, but because each gene’s protein-making activity is mapped over two dimensions, the yield comes to some 50 MB.
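For a rough sense of what 50 MB per slide adds up to, here is a back-of-the-envelope sketch. It assumes decimal units (1 MB = 10^6 bytes), roughly four million slides in all, and a peak rate of 30 000 slides per week. The function name is hypothetical, and the estimate covers raw slide images only, not any processed or derived data.

```python
MB = 10**6    # decimal megabyte (assumption; binary units would differ slightly)
TB = 10**12   # decimal terabyte

BYTES_PER_SLIDE = 50 * MB        # figure quoted in the article
TOTAL_SLIDES = 4_000_000         # "four million slices," one slice per slide (assumption)
PEAK_SLIDES_PER_WEEK = 30_000    # figure quoted in the article

def raw_image_bytes(n_slides: int, bytes_per_slide: int) -> int:
    """Storage for raw slide images only, before compression or analysis."""
    return n_slides * bytes_per_slide

total_tb = raw_image_bytes(TOTAL_SLIDES, BYTES_PER_SLIDE) / TB           # 200.0
weekly_tb = raw_image_bytes(PEAK_SLIDES_PER_WEEK, BYTES_PER_SLIDE) / TB  # 1.5

print(f"raw images, whole project: {total_tb:.0f} TB")
print(f"raw images, per week at peak: {weekly_tb:.1f} TB")
```

On these assumptions the raw scans alone come to a couple of hundred terabytes, arriving at about a terabyte and a half per week at full speed. Whatever the reconstruction and search software generates would come on top of that.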
That’s the kind of specificity that medical researchers need. “It’s like real estate—what matters is location, location, location,” says Boguski. “It matters not just what area the molecule’s in but what cell it’s in.”
Rather than carefully dissect a single slice for hours, the project will shove many identical slices through its mill fast, taking care to slice at the same angle every time. This strategy of favoring quantity of data over quality is as foreign to the painstaking world of neuroanatomy as it is familiar to that of computer science. Chess software, for example, incorporates little chess knowledge but applies it to so many millions of possible lines of play that it can give headaches even to Garry Kasparov, the world’s top player.
Neuroanatomists, like chess masters, don’t like the idea of an automated factory beating them at their own game. “One I talked to said that with 25-micron sections, we often wouldn’t even get the nucleus [the cell’s central DNA archive],” says Boguski. “I asked him, if it were cheap enough to do the same experiment 1000 times, wouldn’t that be better than doing it once, thoroughly?” A thousand slices should get the nucleus most of the time.
Data from each two-dimensional slice will be fed to programs that reconstruct the brain’s structures in three dimensions. Say you have a neuron whose cell body—containing the nucleus—sits in the middle of the brain and whose axon—the long, communicating stalk analogous to an interconnect in an IC—reaches to the frontal part of the brain. The software will have to tease out the long, skinny, possibly oblique path frame by frame.
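To make the frame-by-frame idea concrete, here is a toy sketch of following one stained structure through a stack of slices. It is not the Atlas software: it uses a simple greedy nearest-neighbor rule chosen for illustration, assumes the slices are already aligned to a common grid, and represents each slice as a set of stained pixel coordinates. The function `trace_path` and the `max_jump` parameter are hypothetical.

```python
from math import hypot

def trace_path(slices, start, max_jump=2.0):
    """Follow a stained structure from slice to slice.

    slices   -- list of sets of (x, y) stained pixels, one set per slice, in order
    start    -- (x, y) of a stained pixel in the first slice
    max_jump -- largest in-plane shift allowed between neighboring slices

    Returns a list of (x, y, z) points, where z is the slice index. Tracing
    stops at the first slice with no stained pixel within max_jump.
    """
    if start not in slices[0]:
        raise ValueError("start pixel is not stained in the first slice")

    path = [(start[0], start[1], 0)]
    current = start
    for z, stained in enumerate(slices[1:], start=1):
        # Keep only pixels close enough to be the same structure.
        nearby = [p for p in stained
                  if hypot(p[0] - current[0], p[1] - current[1]) <= max_jump]
        if not nearby:
            break
        # Pick the closest one; ties are broken by coordinate for determinism.
        current = min(nearby,
                      key=lambda p: (hypot(p[0] - current[0], p[1] - current[1]), p))
        path.append((current[0], current[1], z))
    return path

# A thin structure drifting obliquely across four slices, plus unrelated staining.
example_slices = [
    {(0, 0), (5, 5)},
    {(1, 0), (5, 5)},
    {(2, 1), (9, 9)},
    {(3, 1)},
    {(8, 8)},
]
example_path = trace_path(example_slices, (0, 0))
print(example_path)
```

In the example the trace follows the drifting structure through the first four slices and stops at the fifth, where the only stained pixel is too far away to be the same structure. A real reconstruction would also have to align the slices and cope with branching, neither of which this sketch attempts.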
Unlike most other big biology projects, the Atlas will study genes just as they come up, in no particular order. “That was a mistake at the Human Genome Project—the scientists stayed with their favorite areas and the work never got done,” says Boguski. “The real surprises will come when we look at all genes, agnostically. Scientists are trained to be hypothesis-driven, but the Atlas will be data-driven.”
Boguski, a pathologist by training, worked in bioinformatics on the Human Genome Project before either the project or the field bore those names. And he continued at a bioinformatics company, Rosetta Inpharmatics LLC, now in Kirkland, Wash. His background makes him particularly sensitive to another mistake the Atlas means to avoid: the scanting of bioinformatics. “The original funders of the Human Genome Project underestimated the IT element, the National Institutes of Health have not come to grips with it, and GenSat [a government mouse-brain anatomy program] paid only for data production, not for bioinformatics,” Boguski says.
That oversight in the Human Genome Project led to a last-minute scramble in the spring of 2000 to develop a program that assembled all the various fragments of genetic code into a mostly coherent whole. “If you can’t use the data, what good is it?” Boguski asks. “We have twice as many computer science people as biologists right now, and even when the project reaches full employment, the ratio will probably still be 1:1.”
Lin Chen, a bioinformaticist at the Atlas, is working in a number of software environments. Strewn around his workstation are manuals from Red Hat, a software brand from Red Hat Inc., in Raleigh, N.C., that is based on Linux, the open-source operating system that is itself modeled on Unix. “Unix is big in bioinformatics because it’s good for big-batch processing,” Chen says. “Also, a lot of the software is written in Perl, which is easy, fast, and loaded with functions to deal with pattern search and string search. Most things here, I wrote in Perl.” Perl is the programming language developed by linguist Larry Wall to take in large amounts of data and manipulate it flexibly: a “Swiss Army chainsaw,” as its devotees call it.
Chen used to work for Celera Genomics, part of Applera Corp., based in Norwalk, Conn. Celera made itself a big name (though no money) by sequencing the human genome faster than the Human Genome Project could manage. Unlike the Human Genome Project, Celera did not underestimate the IT challenge: it spent $50 million on what was one of the largest computing centers outside government weapons laboratories.
The Atlas governing body, the Allen Institute for Brain Science, won’t make a dime from this, either—it can’t, as it is a not-for-profit organization. However, that isn’t stopping it from acting entrepreneurially. Boguski is looking for corporate and government money to extend and enhance the project. “We are not just building an atlas, we’re building a platform that can be used for other experiments,” he says.
One next-generation project would be to study mice in which certain genes have been disabled, so that their role in the brain can be deduced. Another would be to construct maps not of the immediate chemical messengers of the genes, as the scientists are doing now, but of the final chemical products—the entire set of proteins in the brain, its so-called proteome.
Next might be the extension of this static image of the brain to a more dynamic one, part of the ever-increasing dimensionality: first a line (the string of DNA code), then a plane (gene activity in a cross section), next a rendered solid, and finally, perhaps, a representation of how the solid structure changes over time. Of course, to keep Paul Allen happy, the researchers will endeavor to link this picture of mice to men, in a point-to-point correspondence between the brains of the two species.
It is not yet clear how this will be done. One way might be to tag nucleic acids performing protein synthesis with magnetic molecules, then to scan the living brain electromagnetically, as in functional magnetic resonance imaging. The result, assembled by a computer, would then be a 3-D depiction of the synthesis of the protein in question.
“We’d start by scanning small animals noninvasively, then move to humans,” Boguski says. Even if only a few proteins could be outlined in this fashion, the resulting information could serve as signposts for the proper alignment of other data that can be physiologically linked to it.
Like a detective’s magnifying glass, the Atlas will aid sight, not confer it. A lot of inspired pattern-sifting will be required to unravel the common psychiatric disorders, which mostly stem not from the actions of a single misbegotten gene but from those of many genes, all reacting to environmental cues, one another, and each one’s reactions.
“If the cause of a disease is like a needle in a haystack, then we’re making the haystack smaller,” Boguski says. “There are pathways to disease, and once you find them, you’re well on your way to finding the ultimate cause.”