
Here's something from the news last year. Without the reference genome this wouldn't be possible.

http://www.nytimes.com/2012/07/08/health/in-gene-sequencing-...

This was just one person, sure, but there are a lot of smart people working very hard on making this scale.



So I guess my question is (and I've seen lots of articles and breakthroughs in genome sequencing): what use has the actual data from the HGP been? It seems to me that things like this are about as based on the HGP data as Velcro is on NASA. It's an enormously beneficial spin-off technology that happened to develop as a side benefit of the main work. I don't know if I'd go so far as to say the sequencing tech (or Velcro) would never have been developed without the main research foci, but it didn't hurt.


The HGP reference genome is pretty much essential to the "whole genome" analysis done on humans and that's the big direction in research right now. I work in cancer and disease genomics doing data analysis software and all of the analysis methodology goes back to this reference in some way.

Sequencing technology has gotten to a point where it's just blown Moore's Law absolutely out of the water, so we can't keep throwing more compute at the analysis problem; we have to make the analysis smarter. The reference genome is central to how it's been made smarter.

It helps to discuss a little bit how the HGP reference was produced, and why producing it took 10 years and three billion dollars.

The HGP process started with a map: the genome was broken into lots of smaller segments. The idea was that this reduced your problem space; any piece of DNA sequenced from one of those segments was known to come from that region of the genome. Each segment was then broken into lots and lots of smaller chunks, which were read on the sequencing machines in 600-800 base segments. By the time that sequencing technology reached "max level," the state-of-the-art machine could generate 96 of those reads in an hour.

Then you'd calculate overlaps and assemble those smaller "reads" back into a sequence of that chunk you chose from the map. Then someone would audit the computer-generated assembly by hand, possibly ordering up more lab work to fill any gaps or resolve areas of crummy data. Repeat for the next chunk from the map.
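The overlap-and-assemble step above can be sketched in miniature. This is a toy greedy version of the overlap-layout-consensus idea; real assemblers are vastly more sophisticated, and all names here are illustrative:

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a matching a prefix of b."""
    start = 0
    while True:
        start = a.find(b[:min_len], start)
        if start == -1:
            return 0
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1

def greedy_assemble(reads):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    olen = overlap(a, b)
                    if olen > best[0]:
                        best = (olen, i, j)
        olen, i, j = best
        if olen == 0:
            break  # no overlaps left; remaining reads stay separate
        merged = reads[i] + reads[j][olen:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)]
        reads.append(merged)
    return reads

# Three overlapping reads reassemble into one contiguous sequence.
print(greedy_assemble(["ATTAGACCTG", "CCTGCCGGAA", "GCCGGAATAC"]))
# → ['ATTAGACCTGCCGGAATAC']
```

The human-audit step existed precisely because greedy merging like this goes wrong on repeats and low-quality data.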

Now here's how things work, when we need to do any sort of genomic analysis on an individual:

New technology has the ability to sequence human genomes at deep coverage in 11 days[1], and cranks out 6 billion reads 100bp long from places all over the genome. Computationally, this is an absolutely different animal. You can't feasibly try to re-assemble these reads into a human. So, what we do is use string matching algorithms to "map" a 100bp read back to where it most likely came from, using the HGP genome as a reference.[2] Since obviously your DNA does not match the HGP reference base-for-base, and mismatches/insertions+deletions are really where the interesting data is anyway, there's some leeway for mismatches in the mapping.
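To make the "mapping with leeway" idea concrete, here's a deliberately naive sketch: slide each read along the reference and keep the placement with the fewest mismatches, up to a cap. Real aligners like BWA never scan linearly like this; they use an FM-index built on the Burrows-Wheeler transform of the reference. Names and values here are illustrative only:

```python
def hamming(a, b):
    """Count of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def map_read(reference, read, max_mismatches=2):
    """Return (position, mismatches) of the best placement, or None."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        d = hamming(reference[pos:pos + len(read)], read)
        if best is None or d < best[1]:
            best = (pos, d)
    if best and best[1] <= max_mismatches:
        return best
    return None

reference = "ACGTACGTTTGCAGGACCTTAACGT"
print(map_read(reference, "GCAGGACC"))  # exact hit → (10, 0)
print(map_read(reference, "GCAGTACC"))  # one mismatch tolerated → (10, 1)
```

The mismatch cap is the "leeway": a read from Joe's genome that differs from the reference by a SNP or two still maps back to where it came from.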

At that point, by mapping reads back to where they came from, we end up with a data file that represents an individual's genome. You're able to walk across the genome base for base and ask "So, base 347 of Chromosome 7 is a T in the reference, what is the most likely base on Joe's genome at this point given the reads we have that span this base?"
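That per-base question is essentially a "pileup": collect the base each mapped read reports at a position and take the majority call. Real variant callers use base quality scores and statistical genotype models; this sketch (with made-up reads and positions) just shows the shape of the question:

```python
from collections import Counter

def call_base(position, mapped_reads):
    """mapped_reads: (start, sequence) pairs already mapped to the
    reference. Returns (base, count, depth) at the given position."""
    bases = [seq[position - start]
             for start, seq in mapped_reads
             if start <= position < start + len(seq)]
    if not bases:
        return None  # no read coverage at this position
    (base, count), = Counter(bases).most_common(1)
    return base, count, len(bases)

# Three reads span position 347; two of them report a C where the
# third (and, say, the reference) has a T.
reads = [(340, "ACGTACGCAT"), (342, "GTACGCATTG"), (345, "CGTATTGGCA")]
print(call_base(347, reads))  # → ('C', 2, 3)
```

Deep coverage matters because the depth is what lets you distinguish a real variant from a sequencing error at that base.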

Mapping things to the reference also allows us to attempt to find really interesting stuff that can cause disease, such as structural variations in the genome. These are instances where large segments are removed, duplicated, inverted, or picked up and moved somewhere else relative to where they "should be."

[1] http://www.illumina.com/systems/hiseq_comparison.ilmn

[2] http://bio-bwa.sourceforge.net/ is the tool that's most popular these days.




