Few fields of endeavor have advanced as swiftly as bioinformatics over the past couple of decades. Just 25 years ago, the human genome was still largely a mystery. Then, in 2003, the first sequence was announced, covering about 92 percent of a human genome. That sequence cost some US $300 million. Over the years, as the technology became more advanced and pervasive, the cost of sequencing declined. Nowadays, it’s possible to get a sequence for well under $1,000. This price drop has triggered a revolution in the ability of doctors to identify a patient’s susceptibility to disease and also to prescribe effective treatments.
Once the genome was sequenced, the enormous task of identifying the function of the many genes began. Most estimates of the number of protein-coding genes in the human genome are now in the range of 19,000 to 21,000, although some are considerably higher. And as many as a quarter of these genes remain of largely uncertain function. The most powerful software-based tool for researchers trying to understand the function of these many genes is a system called BLAST, which stands for Basic Local Alignment Search Tool.
Here’s how it works. Let’s say a team of research biologists has come across a rhesus monkey gene that they can’t identify. They can enter into BLAST the nucleotides of the DNA or the amino-acid sequences of the protein associated with the gene. BLAST then searches enormous databases to find similar genes within the genomes of countless creatures, including humans. A match to a known gene often enables the researchers to infer the function of the unknown gene. It also lets them infer functional and evolutionary relationships that might exist between the sequences, and locate the unknown gene within one or more families of related genes.
First released in 1990, BLAST was created by a group at the U.S. National Institutes of Health that included Eugene Myers, Webb Miller, Stephen Altschul, Warren Gish, and David Lipman. Their 1990 paper describing BLAST has more than 75,000 citations, making it one of the most highly cited research papers of all time.
Earlier this year, Myers and Miller received the IEEE Francis E. Allen Medal, which honors “innovative work in computing leading to lasting impact on other aspects of engineering, science, technology, or society.” Shortly before the ceremony, IEEE Spectrum spoke with Myers, who had just retired as a director of the Max Planck Institute of Molecular Cell Biology and Genetics.
Eugene Myers on…
- The origins of the BLAST system
- Will it ever be possible for humans to live hundreds of years?
- Why we should resurrect extinct creatures
- The next big challenge for bioinformatics
It’s the mid-to-late 1980s at the U.S. National Institutes of Health. What was in the air? What were some of the motivating factors that led you and your colleagues there to work on and, ultimately, complete, BLAST?
Eugene Myers: Well, there was already a tool like BLAST for searching the database, but it wasn’t very efficient and it took much too long. And David Lipman, who was running the National Center for Biotechnology Information (NCBI), that growing database, was looking for something faster. And I happened to be on sabbatical. And I was a smoker at the time, and I was downstairs and he brought me this article about this new hot chip that was being promoted by TRW. And I’m sitting there smoking my cigarettes saying, “Oh, David, I don’t believe in ASICs. I think if we just write the right code, we can do something.” And I had actually been working on a technique, a theoretical technique, for sublinear search. And I mean, basically, David and I and Webb got together and we had a very quick series of exchanges where we basically took the theoretical idea and distilled it down to its essence. And it was really fun, actually. I mean, Webb and I were passing back and forth versions of code, trying different implementations. And that was it. And I need to say, we got something that was fast as greased lightning at the time.
Do you remember what the chip was?
Myers: I think it was called the FDF, and it was a systolic-array chip. It was designed for pattern matching primarily for the intelligence agencies. [Editor’s note: the Fast Data Finder (FDF) was an ASIC for recognizing strings of text. It was created at TRW in the mid-1980s.]
Ah, intrigue. So that leads us to the next question, which is, for those who aren’t biologists, what exactly does BLAST do? It’s been called a sort of a search engine for genes. So a biologist who is doing a sequence, say, of a genome has a piece of genetic material that’s presumably a gene and doesn’t know what this gene does.
Myers: Well, I mean, basically, BLAST takes a DNA sequence or protein sequence, which is just a code over some alphabet, and it goes off and it searches the database looking for something that looks like that sequence. In biology, sequences aren’t preserved exactly. So you’re not looking for exactly the same sequence. You’re looking for something that’s like it. A few of the symbols can be different, maybe one can be missing, there could be an extra one. So it’s called approximate match.
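To make "approximate match" concrete, here is a minimal Python sketch. It is not BLAST's algorithm, which uses faster heuristics. It is the classic Smith-Waterman dynamic-programming method for scoring the best local alignment between two sequences, tolerating substituted, missing, or extra symbols. The function name and the scoring values (match, mismatch, gap) are illustrative assumptions, not values from the interview or from any real scoring matrix.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b.

    Smith-Waterman with a linear gap penalty. The scoring values are
    illustrative assumptions, not the parameters BLAST uses.
    """
    prev = [0] * (len(b) + 1)   # scores for the previous row of the table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Align a[i-1] with b[j-1]: reward a match, penalize a substitution.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A gap models a symbol that is missing from one sequence
            # (equivalently, extra in the other).
            up = prev[j] + gap
            left = curr[j - 1] + gap
            # Flooring at 0 lets a local alignment start anywhere.
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best
```

With these assumed scores, two identical four-letter sequences score 8, while "ACGTACGT" against "ACGAACGT" (one symbol different) scores 13 and against "ACGACGT" (one symbol missing) scores 12, so near-matches still rank far above unrelated sequences.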
And when you say it goes off and finds them, it finds them from a catalog of the genomes and genetic material of all living creatures that have been recorded.
Myers: Yes. The database is oftentimes preprocessed to accelerate the search, although the initial BLAST, basically, just streamed the entire database.
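One common way to preprocess a sequence database for faster search is to index every short substring (a "k-mer") so that candidate locations for a query can be looked up directly rather than found by scanning everything. The Python below is a hedged sketch of that general idea only. The function names, the toy database, and the choice of k are hypothetical, and this is not a description of how NCBI's software is implemented.

```python
from collections import defaultdict

def build_kmer_index(database: dict, k: int = 4) -> dict:
    """Map each k-mer to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index

def find_seeds(query: str, index: dict, k: int = 4) -> list:
    """Return (sequence name, query position, database position) seed hits.

    Each hit is an exact k-mer match: a candidate location that a fuller
    approximate comparison could then try to extend in both directions.
    """
    hits = []
    for qpos in range(len(query) - k + 1):
        for name, dpos in index.get(query[qpos:qpos + k], []):
            hits.append((name, qpos, dpos))
    return hits

# Toy database with made-up sequences, for illustration only.
toy_db = {"seq1": "TTACGTAA", "seq2": "GGGGGGGG"}
toy_index = build_kmer_index(toy_db, k=4)
seeds = find_seeds("ACGT", toy_index, k=4)   # one hit: ("seq1", 0, 2)
```

The trade-off is memory for time: the index is built once, and each query then touches only the places where it shares an exact short word with the database.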
So it will find a close-as-possible match for whatever the sequences you have, which may be a gene, and it will find it and it might be a totally different creature…
Myers: It could potentially find many of them. And one of the important things about BLAST, actually, which Altschul contributed, was it actually gave you the probability that you would see that match by chance. Because one of the big problems prior to that is that people were taking things that they thought kind of looked the same and saying, “Well, here’s an interesting match,” when in fact, according to probability theory, that was not an interesting match at all. So one of the very nice things about BLAST is it gave you a P-value that told you whether or not your match was actually interesting or not. But it would actually give you a whole list of matches and rank them according to their probability.
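The statistics usually associated with BLAST scores are the Karlin-Altschul formulas: the expected number of chance matches with score at least S is E = K·m·n·e^(−λS), where m and n are the query and database lengths, and the probability of seeing at least one such match by chance is P = 1 − e^(−E). The Python sketch below simply evaluates those two formulas. The function names are hypothetical, and K and λ depend on the scoring system in use, so the caller must supply them; no real parameter values are assumed here.

```python
import math

def expected_chance_hits(score: float, query_len: int, db_len: int,
                         lam: float, k: float) -> float:
    """Karlin-Altschul E-value: E = K * m * n * exp(-lambda * S).

    lam and k depend on the scoring system and must be supplied by the
    caller; this sketch does not assume values for any real matrix.
    """
    return k * query_len * db_len * math.exp(-lam * score)

def p_value(e_value: float) -> float:
    """Probability of at least one chance match: P = 1 - exp(-E)."""
    # expm1 keeps precision when E is tiny, the case of most interest.
    return -math.expm1(-e_value)
```

Because the score sits in the exponent, a modest increase in alignment score makes a chance explanation dramatically less likely, while a larger database (bigger n) makes any fixed score less surprising.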
So one of the things that this illustrates is that all of us creatures on Earth, all of us, we’re made up of genes, and not only are we made up of genes, but you see throughout all of the living creatures very similar genes. So the blueprint, if you will, the elements of the blueprint that make up a human are different, but remarkably similar to the ones that, say, make up a parakeet or a lizard.
Myers: Now, there was a huge diaspora of life about 500 million years ago from bacteria into multicellular creatures where we basically ended up with fish and insects and all of the more complex orders of life. And they, basically, all used the same genes or proteins, but they used them in different ways. And mostly what was going on was the way that those genes were being turned on and the way those cassettes were being run. I mean, for example, a fruit fly has 14,000 genes and a human being has, I don’t know, maybe 28,000. And basically, every gene that’s in the fruit fly, there’s an analog that’s in a human being. Human beings have more copies of particular genes. They have one or two of something instead of just one of them. And human beings have a lot more genes that turn things on and off selectively. In other words, that regulate how the genes are being used. But the actual repertoire of genes is very similar. When we sequenced the human genome back at the turn of the century, 2000, we looked at the fruit fly and we looked at a human, and we said, “Hey, the fruit fly is like a little human.” I mean, potentially it gets cancer, metabolic disorders. It’s really quite fascinating.
There are some very large-scale projects around the world now aimed at sequencing the genomes of enormous classes of creatures, such as vertebrates, plants, or all living things native to the British Isles. These initiatives are sometimes collectively referred to as “sequencing the world.” Why are these efforts important?
Myers: Well, that’s a complex question. The basic answer is that we’re starting to do it now because we can finally do it at a quality where we feel like these libraries of sequences that we produce are going to, basically, stand the test of time—that they’re sufficiently correct and accurate. And the fascinating thing is, we’re going to learn more about how the various genes function. See, there’s still a lot of questions about what these genes are doing. And we’re going to learn more about how they function by looking at how they’re working across all of life than by looking at a particular species. I mean, right now, most medicine is just focused on human beings.
For example, we’re interested in how long a human being lives. We’d like to live longer. But absent disease, the variation in the longevity of human beings is about 10 percent. I mean, some of us expire at 85, some of us at 95, and some of us at 75. It’s not a very big range. But for example, there are bats that as a function of their body weight live 50 times longer than they’re supposed to. Fifty times. That’s like living to 5,000 for a human being. And there are other bats that are very closely related to that bat, with only 5 million years of evolution between them, where the bat lives a normal life. So if you go out into nature, you’re going to see these extremes in physical characteristics, or what we call phenotype. So what we are interested in is what’s the relationship between the genotype, which is the gene sequence, and all the genes that are in it, and the phenotype, which is the physical characterization or manifestation of the creature.
So in other words, one of the things you want to do is you want to know what the cluster of genes is that enables certain bats to live 50 times longer than other bats?
Myers: Yes. So we think that by sequencing lots of pairs of bats that are short- and long-lived and comparing their genomes, we’re going to get real clues about what it takes to have a creature live a long time. And presumably, because the genes in a human being are so similar to those in the bat, it will translate to human medicine.
There is a study of so-called supercentenarians among human beings, if I’m not mistaken. So this would presumably provide additional depth and information beyond just studying supercentenarians. Supercentenarians are people who live to be about 100 without substantial decline, either mentally or physically.
Myers: A lot of that is about lifestyle. I mean, they’ve done studies, the Blue Zones. And it’s about having good friends, it’s about eating a healthy diet, not eating too much, getting a little exercise, not too much stress. A lot of these things, I mean, turn out to be very significant factors. But again, there’s basically a kind of an expire-by date for every species of creature, and they have a longevity. Because the original purpose, really, of a creature is to create children. And once you’ve created the children, your job’s done. I mean, once you’ve created offspring, you’ve propagated the genome and you’re superfluous.
So we’ve got this natural kind of built-in expiry date. And the question is, how can we fundamentally change that? Because I don’t want to live to be 100. I want to live to be 1,000, okay? I mean, it’s too late for me. But think about it. If I could live to be 1,000, I could have 10 careers. I mean, I’d love to do 100 years as an architect, 100 years as a physician. Right?
So the idea is if you could identify the genes and the sequences that these long-lived creatures have in common, not only humans, but other creatures, you could, in theory, use a gene-editing technique, something that follows from CRISPR in the far future, to actually edit genes? This is probably decades from now.
Myers: Well, it could be just as simple as stopping certain reactions from happening. So it may not even be as much as a [genome] edit. I mean, it may just be like a drug where basically we just inhibit certain pathways. We build a small molecule that inhibits something to stop it from doing its thing, and that turns off the expiry clock. But we don’t know exactly how to do that yet. I mean, we know that reducing inflammation certainly leads to longer life. We know that not eating as much. So maybe there’s a drug that we can take that helps us metabolize better so that we don’t—so there are a lot of options like that. It doesn’t necessarily have to be gene editing. This is a kind of a futuristic thing. I can’t tell you when, but I can tell you that as long as we don’t blow ourselves up to kingdom come or ruin our planet and we have enough time, we will do it. We will do it.
One of the main motivations, perhaps the greatest motivation for all of this work, is to better understand how specific genetic variations lead to disease. It’s a lot of what keeps the money flowing and the whole enterprise going. And a very powerful tool for this purpose is the genome-wide association study. And this predates a lot of this technology. It’s an older tool, but it is one that is as dependent as ever on bioinformatics. And I would think because of the growing complexity, only getting more dependent.
Myers: A lot of what we’ve been trying to do for the last couple of decades is basically correlative. In other words, we’re not looking actually for causation. We’re just simply looking for correlation. This gene seems to have something to do with this disease and vice versa. And it doesn’t give us a mechanism, but it does tell us that this is associated with that. So we want to understand. A lot of what we’ve been doing is sequencing lots and lots of people. In other words, getting their genotype, their genome, and correlating that with their phenotype, with their physical characteristics. Do they get heart disease early? Do they get diabetes?
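As a hedged illustration of what "correlative" means here, the Python below computes a chi-square statistic for a single genetic variant from a 2×2 table of carriers and noncarriers among people with and without a condition. This is a textbook association test, not a description of any particular study's pipeline. The function name is hypothetical, and any counts passed to it in an example would be made up.

```python
def chi_square_2x2(case_carriers: int, case_noncarriers: int,
                   control_carriers: int, control_noncarriers: int) -> float:
    """Pearson chi-square statistic for a 2x2 genotype-by-phenotype table.

    A large value means carrying the variant and having the condition are
    statistically associated. It says nothing about mechanism or causation.
    """
    a, b = case_carriers, case_noncarriers
    c, d = control_carriers, control_noncarriers
    n = a + b + c + d
    # Product of the four row and column totals.
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("each row and each column needs at least one observation")
    return n * (a * d - b * c) ** 2 / denom
```

A table with identical carrier rates in both groups gives a statistic of 0. In a genome-wide study a test of this kind is repeated across a very large number of variants, which is one reason the significance thresholds used in practice are so strict.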
A classic one is breast cancer with the BRCA.
Myers: Right. And that was an example where we found basically the genes that are absolutely correlated with breast cancer. I mean, we know there’s a fairly small repertoire. But on the other hand, something like coronary health, heart health, is very, very complicated because really it’s a function of hundreds of genes. And so which combination and which battery? So basically, it’s not a single locus. I mean, early on, in the very early days, there were a lot of diseases that were caused by single mutation, but those are kind of the exception rather than the rule. I mean, those single mutations, they were incredibly serious diseases. And it’s nice that—well, I think we’re in a position to affect some of those.
It’s very interesting to have these single-locus diseases in hand, but to really improve the health of humanity as a whole, we’re going to need to have a kind of more refined understanding of the relationship between the genotype and the phenotype. And so these studies have been going on and people have been collecting data. In fact, the biggest problem, actually, isn’t getting everybody’s genome. The biggest problem is getting accurate phenotypic data. In other words, actually getting accurate measurements of people’s blood sugar. Like, when do you take the test, etc. I won’t go into all the complexities. But it’s actually building a database of all of the characteristics of people and basically digitizing all of the information we have about people. But this is going forward, and I think it will be very useful.
One of the more sensational applications of bioinformatics is the challenge of reviving extinct species. So we read about the woolly mammoth, and there’s recent talk about the dodo and others. There’s the quagga, I think. There’s just a whole host of creatures that have, sadly, departed from the earth, but that in theory, we could revive in some form with the techniques and tools now available.
Myers: I think probably what’s more interesting is not actually bringing them back, but understanding what they were. For example, Svante Pääbo’s work reconstructing the Neanderthal sequence. Okay. I mean, it turns out that we’re all about, I think, 4 percent Neanderthal DNA. And it turns out, for example with COVID, it turns out that your propensity for outcomes in COVID actually is correlated with whether or not you had some of this Neanderthal DNA.
I think it’s quite fascinating that we’re kind of an admixture of these things. So knowing this ancient genome is quite interesting. I mean, also, the woolly mammoth versus the modern-day elephant basically gives us another clue. And I think what’s fascinating is the fact that we can do it at all. If we can get sufficient DNA material, then we can extract these things. Understanding the evolutionary history of mankind is certainly of interest because we’re interested in ourselves, yes? For other creatures, well, it is the case that if we have a sequence, I do believe that we will eventually be able to kind of realize Jurassic Park and actually literally create the genomic sequence, transplant it into a nonfertilized embryo of a nearby species, and create the creature, an instance of the creature. And I think that will be pretty cool if we really want to understand dodo birds. But I think in general, we don’t want to lose all of that diversity. That connects back to what we were talking about before, which are these projects to go out and sequence the world. For example, I’ve sequenced some nearly extinct turtles. Now that I have the sequence of those turtles, even if they go extinct, we can still do a Jurassic Park sometime in the future, but at least the genetic inheritance of those species is still present and we will still have it. So it’s, basically, a matter of conservation and a matter of understanding evolution and it’s pretty damn cool.
The last thylacine died at the Beaumaris Zoo in Hobart, Australia, on 7 September 1936. Recently, biologists at the University of Melbourne launched a project aimed at bringing the creature back from extinction. (Image: Hum Historical/Alamy)
And for full disclosure, we should point out that nobody could actually do Jurassic Park because dinosaur DNA is tens of millions of years old, so it no longer exists.
Myers: Yeah. I don’t mean Jurassic Park in the sense of bringing back dinosaurs. Jurassic Park in the sense of creating creatures that are no longer extant. Okay? I mean, that’s always the case with the best science fiction, is that it’s plausible. Jurassic Park is plausible, so is Gattaca. You know that one with Ethan Hawke where they, basically, sequenced everybody and they took the best? I mean, that is completely plausible.
What do you think are some of the most exciting challenges for young people, that they’ll be working on, say, in two years or four years? The big, difficult problems in bioinformatics.
Myers: Well, there are a lot of problems that still haven’t been solved. For example, how do you get a given shape and form from a genome? The genome actually encodes everything. It gives you five fingers. It gives you a nose, eyes. It encodes for everything. But we don’t understand the biophysical process for that. I mean, we have some idea that this gene controls that and this gene controls that, but that doesn’t tell us mechanistically what’s happening, and it doesn’t tell us how to intervene or what would happen if we intervene. So I still think that the fundamental question is to try to understand kind of what’s encoded in a genome and how, mechanistically, it unfolds. And I mean, computational biology is going to be at the core of it because, I mean, you’re talking about, okay, for a human being, 30,000 genes. Those 30,000 genes probably get transcribed into 150,000 different protein variants.
There are probably 10 billion of those proteins floating around an individual cell. And then your body—I mean, your brain alone has 10 billion neurons. So think about the scale of that thing. Okay? I mean, we’re not even close. So I think that high-performance computing. I think that advanced simulations.
A lot of what moves biology is technology, the ways to manipulate things. We’ve been able to manipulate creatures for a long time genetically. But now that we have this new mechanism, CRISPR-Cas, for which the Nobel was awarded a couple of years ago, I mean, we can now do that with precision and fidelity, which is a huge advance.