Computers have played a part in the pursuit of science for most of their history. Humanity would not have made it to the moon without a myriad of computers controlling every aspect of the space flight, and computers are crucial in modern statistical analysis of data. But bioinformatics, and specifically genomics (the study of the human genome, and how it influences traits and diseases), are even more critically dependent on the use of computers, for a number of reasons.
To begin with, the human genome itself is huge: 700 MB per person, and that’s just a file recording the differences from the “base” standard genome. Making sense of genomes requires statistical analysis across large populations, looking for correlations between gene variants and traits (known as genome-wide association studies, or GWAS).
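To give a feel for what "looking for correlations" means in practice, here is a minimal sketch of a single-variant association test: a Pearson chi-square test on a 2x2 table of carriers and non-carriers against people with and without a trait. The function names and the counts are made up for illustration; this is not the PGP's method, and a real study tests a very large number of variants and has to correct for multiple comparisons.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]].

    a: carriers with the trait       b: carriers without the trait
    c: non-carriers with the trait   d: non-carriers without the trait
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def p_value_1dof(stat):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))


# Hypothetical counts, invented for this example only.
stat = chi_square_2x2(30, 70, 10, 90)
p = p_value_1dof(stat)
print(f"chi-square = {stat:.2f}, p = {p:.5f}")
```

A small p-value suggests the variant and the trait are associated in this sample; on its own it says nothing about cause.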
To provide a rich set of individual genomes complete with trait information, Harvard Medical School launched the Personal Genome Project (PGP), an effort to sequence the genomes of 100,000 individuals and to collect in-depth information about their health and phenotypes (the physical expression of genes, such as eye color).
If one genome is big, 100,000 genomes is overwhelmingly huge, and it’s Dr. Madeleine Ball’s job to keep all the data happy. As the PGP’s Director of Biology, Ball oversees data collection and the project’s public data portals. This can be as awesomely geeky as tweaking Python scripts to analyze data, or as mundane as packaging blood samples so they can be sent off to be biobanked.
Ball’s role at the PGP evolved out of her desire to better understand the genome.
“I got involved in the genome interpretation side. You get the genome back, and you want to decide if there’s something important and unusual you want to know about,” she says. “That got me into manipulating the genome files, and then into the project in general, which means we need to track people’s accounts, and then I got involved in sample collection, which means that we need to track sample IDs and what happens to samples.”
Perhaps Ball’s most significant contribution to the project has been her shepherding of GET-Evidence, the public genome interpretation site that the project offers. Using the site, researchers and project participants can view and annotate information about specific gene variants, read research papers that are automatically spidered from the literature, and decide how relevant variants are to specific traits. One unique site feature is that anyone with an OAuth account can join in and help, bringing a crowdsourced approach to filtering through the endless new research that is published.
To say that Ball’s typical day is varied is to put it mildly. “Today I’m writing emails to Coriell [a biobank] and NIST [the National Institute of Standards and Technology], trying to see if they are comfortable with certain consent forms. I’m trying to learn Django right now,” she explains. “Over the last year or so, most of my technical expertise has been in the genome interpretation stuff. I end up sitting down with our genome interpretation software and reading the original research papers, and doing statistics on those. At one point, I created most of the sample tracking stuff, to create automatically generated labels. I’m always doing one of everything. Sometimes I’ll be pulling up a genome variant file, and grepping it for a variant, because someone told me they had found that a genome with this variant is important.”
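The kind of search Ball describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it treats the variant file as tab-separated text with comment lines starting with "#", and the function name `find_variant`, the sample rows, and the identifiers in them are all made up rather than taken from the PGP's files or tools.

```python
def find_variant(lines, variant_id):
    """Return the lines that contain variant_id as one whole tab-separated field."""
    matches = []
    for line in lines:
        if line.startswith("#"):  # skip comment and header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if variant_id in fields:  # whole-field match, not a substring match
            matches.append(line.rstrip("\n"))
    return matches


# Made-up sample rows, not real genome records.
sample = [
    "#chrom\tpos\tid\tref\talt\n",
    "chr1\t1000\trs0000001\tA\tG\n",
    "chr1\t2000\trs00000012\tC\tT\n",
]

hits = find_variant(sample, "rs0000001")
for hit in hits:
    print(hit)
```

Matching on whole fields avoids a pitfall of a plain substring search, which would also report the second row here because its identifier begins with the same characters. Since an open text file iterates over its lines, a file handle could be passed in place of the sample list.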
One of Ball’s largest challenges is the lack of uniformity in personal health records (PHRs). The PGP program participants (who currently number in the low thousands) are very active, uploading all sorts of personal data such as PHRs, X-rays, and MRI scans. Unfortunately, getting all that information into a consistent format is daunting. “Everyone has their own way of doing a health record,” says Ball. “And they all say, ‘Oh, we have electronic health records,’ as if it solves everything. That’s kind of like saying, ‘We all have Word documents’; it doesn’t mean they’re all using the same coding systems.”
Ball is also concerned about the amount of information that leaks out of supposedly anonymous data files. PGP participants are made aware from the start that there is a high risk that their data and their identity might become linked, but Ball says that many user-uploaded files contain information that makes that connection easy. Too easy.
“One of the things we wish we could do is to help people scrub their data of private stuff, like their email address and their name, which are embedded into some data,” she says. “But our lawyer told us, ‘Don’t go down that road,’ because this is very heterogeneous data, and the moment we do it for one person, we become responsible for doing it for everyone, and we become liable for failing to do it correctly.”
The cost of genome sequencing is plummeting, falling faster than Moore’s Law would predict. The price is already down into the mid-thousands of dollars, and researchers expect it to drop below $1,000 in the next one to two years. This will put pressure on the technology, because the software tools to manage and interpret genomes are still in their infancy. With the pioneering work of researchers such as Dr. Ball, those tools may start to catch up.