Genetic Sequencing Learnings
tldr: I spent the last 2 days learning about genetic sequencing. This is a quick post to share some of what I learned about it.
Genetic sequencing used to be slow and costly. Today it is several orders of magnitude cheaper.
Sanger’s Chain Termination Sequencing.
One of the first major advancements was Sanger’s Chain Termination Sequencing.
Here’s a quick rundown of how Sanger’s chain termination sequencing works.
- Isolate DNA
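- Synthesize many new copies of the DNA strand.
- The reaction contains a large concentration of regular DNA letters (A, T, C, G) and a small concentration of modified, chain-terminating letters (Ax, Tx, Cx, Gx).
- Whenever synthesis adds a modified Ax, Tx, Cx, or Gx instead of a regular A, T, C, or G, that strand stops growing.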
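The chain-termination step can be sketched as a small simulation. This is a toy model, not a real protocol: `TEMPLATE`, `terminator_prob`, and the 10% termination chance are made-up values for illustration, and the code copies the sequence directly instead of building the complementary strand the way a real polymerase does.

```python
import random

TEMPLATE = "ATGCGTAC"  # hypothetical sequence of the strand being synthesized


def synthesize_fragment(template, terminator_prob, rng):
    """Copy the template letter by letter; stop once a terminator is added.

    Returns (fragment, terminated). `terminated` is True when the copy
    ended on a modified (chain-terminating) letter.
    """
    fragment = []
    for base in template:
        fragment.append(base)
        # Assumption: each added letter has a small chance of being the
        # modified version, which halts synthesis of this strand.
        if rng.random() < terminator_prob:
            return "".join(fragment), True
    return "".join(fragment), False


def run_reaction(template, n_copies=10_000, terminator_prob=0.1, seed=0):
    """Run many syntheses and collect the distinct terminated fragments."""
    rng = random.Random(seed)
    terminated = set()
    for _ in range(n_copies):
        fragment, stopped = synthesize_fragment(template, terminator_prob, rng)
        if stopped:
            terminated.add(fragment)
    return terminated


fragments = run_reaction(TEMPLATE)
for fragment in sorted(fragments, key=len):
    print(len(fragment), fragment)
```

With enough copies, the reaction ends up containing a terminated fragment for every position in the sequence, which is what makes the sorting step possible.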
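This leaves you with DNA fragments of varying lengths, each ending in Ax, Tx, Cx, or Gx. You sort and read them by 2 factors.
- Length of each fragment.
- The terminal letter of each strand: Ax, Tx, Cx, or Gx. (This used to be done with radioactive labels; today it is done with fluorescent dyes, one color per letter.)
Reading the terminal letters from the shortest fragment to the longest gives you the sequence.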
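Sorting the fragments by length and reading off the last letter of each can be sketched like this. It is a minimal illustration: the `(length, terminal_letter)` pairs in `observations` are made-up inputs standing in for what the instrument would measure.

```python
def read_sequence(observations):
    """Rebuild a sequence from (fragment_length, terminal_letter) pairs.

    Many fragments share the same length, so we keep one letter per length
    and then read the letters from the shortest fragment to the longest.
    """
    letter_at_length = {}
    for length, letter in observations:
        letter_at_length[length] = letter
    return "".join(letter_at_length[n] for n in sorted(letter_at_length))


# Hypothetical measurements, deliberately out of order and with a duplicate.
observations = [(3, "G"), (1, "A"), (4, "C"), (2, "T"), (3, "G")]
sequence = read_sequence(observations)
print(sequence)  # ATGC
```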
One way to measure the efficiency of the Sanger process is by how much DNA you can read per reaction.
These days, Sanger chain-termination sequencing is mostly used for verification at a small scale.
Next Gen Sequencing
There are a ton of different next generation sequencing techniques. The top company in the space is Illumina.
Illumina sequencing involves a number of complex steps.
- Bind DNA molecules onto a microchip with tiny pockets, one for each DNA molecule.
- You put the chip and some enzymes into a sequencing machine.
- The chip is washed, cycle after cycle, with colored (fluorescently labeled) DNA letters.
- Each letter lights up as it goes into place on each DNA molecule.
- A computer reads the huge matrix/array of colors and tracks the DNA sequences of all the thousands of pockets on the chip.
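The last step, turning a big array of colors into sequences, can be sketched as follows. This is a simplified illustration, not Illumina’s actual software: the `intensities[cycle][pocket]` layout, the `CHANNELS` order, and the brightness numbers are all assumptions made up for the example.

```python
CHANNELS = "ACGT"  # assumed order of the four color channels


def call_bases(intensities):
    """Turn per-cycle color readings into one sequence per pocket.

    intensities[cycle][pocket] is a list of 4 brightness values, one per
    color channel. For each pocket and cycle we pick the brightest channel.
    """
    n_pockets = len(intensities[0])
    reads = [[] for _ in range(n_pockets)]
    for cycle in intensities:
        for pocket, channels in enumerate(cycle):
            brightest = max(range(len(CHANNELS)), key=lambda i: channels[i])
            reads[pocket].append(CHANNELS[brightest])
    return ["".join(read) for read in reads]


# Made-up readings: 3 cycles, 2 pockets, 4 channels each.
example = [
    [[0.9, 0.1, 0.0, 0.0], [0.0, 0.1, 0.1, 0.8]],  # cycle 1
    [[0.1, 0.7, 0.1, 0.1], [0.0, 0.0, 0.9, 0.1]],  # cycle 2
    [[0.0, 0.0, 0.2, 0.8], [0.8, 0.1, 0.0, 0.1]],  # cycle 3
]
reads = call_bases(example)
print(reads)  # ['ACT', 'TGA']
```

Every pocket is read independently in every cycle, which is why the same loop scales to a very large number of pockets at once.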
This whole process is pretty wild and lets you sequence thousands of DNA molecules simultaneously.
DNA sequencing is getting cheaper faster than Moore’s Law.
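To make the comparison concrete, here is a rough back-of-the-envelope sketch. The inputs are ballpark figures I am assuming for illustration, not numbers from this post: about $100 million per human genome around 2001, about $1,000 roughly 20 years later, and Moore’s Law treated as a cost halving every 2 years.

```python
import math


def halving_time_years(start_cost, end_cost, years):
    """Years per cost halving implied by a drop from start_cost to end_cost."""
    halvings = math.log2(start_cost / end_cost)
    return years / halvings


# Assumed ballpark inputs (illustrative only).
sequencing_halving = halving_time_years(
    start_cost=100_000_000, end_cost=1_000, years=20
)
MOORE_HALVING_YEARS = 2.0  # assumed reading of Moore's Law as a cost curve

print(f"Sequencing cost halves about every {sequencing_halving:.1f} years")
print(f"Moore's Law pace: every {MOORE_HALVING_YEARS:.1f} years")
```

Under these assumptions the cost of sequencing halves a bit faster than every year and a half, compared to every two years for the Moore’s Law pace.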
The Interpretation Problem.
Billions of gigabytes of sequence data are hard to interpret.
The crazy thing is that even if you can read the genome of an organism, it still gets tough, because most of the genome doesn’t directly code for proteins. You have a bunch of so-called junk DNA.
The junk DNA might do something mission critical, but figuring out exactly what it does can be a combinatorially complex problem.
Think about junk DNA in the context of the thousands to hundreds of thousands of genes (across all the different organisms) and you’ll see that the issue gets complex really fast.
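To get a feel for how fast this blows up, here is a small sketch that counts how many pairs and triples you could form from a set of genes. The figure of 20,000 genes is an illustrative assumption (roughly the scale of the human protein-coding gene count), and `interaction_counts` is a name made up for this example.

```python
import math


def interaction_counts(n_elements, max_group_size=3):
    """Count the possible groups of size 2..max_group_size among n elements."""
    return {k: math.comb(n_elements, k) for k in range(2, max_group_size + 1)}


counts = interaction_counts(20_000)  # assumed gene count, for illustration
for size, count in counts.items():
    print(f"groups of {size}: {count:,}")
```

Even before junk DNA enters the picture, checking every pair means about 200 million combinations, and every triple means over a trillion.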