Computing Infrastructure for Large-Scale Genomics

The amount of data stored within a human being is large and complex, and once the floodgates are opened it’s difficult not to be overwhelmed. Three approaches have traditionally been taken to deal with this data: local implementation, outsourcing to informatics “factories”, and distributed/cloud-based solutions. Knome, “The Human Genome Interpretation Company”, recently released a local solution that it claims can complete all the informatics work for a human genome, from assembly to variant calling, within one day. That works out to roughly 30 genomes per month, per knoSYS.

knoSYS™100

One could imagine wheeling this into an academic laboratory, hospital, or industrial research park. But at $125,000 each, and given the rate of change in computing hardware, I would be surprised if purchasing such a system is anything more than a gross portrayal of conspicuous consumption on the part of the buyer. A proven, value-driven way to access large-scale bioinformatics has been to work with “sequencing factories”, e.g. JCVI or the Broad Institute. The J. Craig Venter Institute, for example, maintains a 1000-node Sun Grid Engine (SGE) cluster utilizing Hadoop/MapReduce and employs approximately 57 bioinformaticians and software developers.

JCVI

Practically speaking, however, both of the aforementioned models pale in comparison to consumer cloud-based solutions such as AWS, where the user can have as many equally or more powerful nodes as needed, with the same or better software, at scalable costs.

Extra Large Instance
15 GB memory
8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
1,690 GB instance storage
64-bit platform
I/O Performance: High
EBS-Optimized Available: 1000 Mbps
API name: m1.xlarge

High-Memory Quadruple Extra Large Instance
68.4 GB memory
26 EC2 Compute Units (8 virtual cores with 3.25 EC2 Compute Units each)
1,690 GB instance storage
64-bit platform
I/O Performance: High
EBS-Optimized Available: 1000 Mbps
API name: m2.4xlarge

Individual nodes on Amazon EC2 are nearly hardware equivalents of the knoSYS, and they can be scaled instantly. There are now even more robust cloud solutions from efforts such as Nebula from NASA Ames, which are tailored to research-specific applications, especially genomics. Nevertheless, one issue remains: transferring such large datasets from the sequencing machines to the analysis clusters, which has now arguably become the bottleneck.

Next-Generation Sequencing Statistics

This has been an area of much debate: how do you get the data from its point of origin, whether an academic lab, a hospital, or even a large “sequencing factory”, to where it needs to be processed, accessed, and perhaps even stored? It turns out nature can efficiently store and move around petabytes of data with ease, at the drop of a hair. Humanity’s workaround at the moment, however, involves filling shipping crates with hard disks and transporting them as conventional cargo. Thus, there still remains some need for local computing.
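To get a feel for why crates of hard disks can beat the network, here is a rough back-of-the-envelope sketch in Python. The dataset sizes and link speeds are illustrative assumptions, not measurements from any particular lab or provider, and the function name `transfer_days` is made up for this example.

```python
# Rough estimate of how long it takes to move a dataset over a network link.
# All sizes and link speeds below are hypothetical, chosen only to illustrate scale.

TB = 10**12    # bytes in a terabyte (decimal)
PB = 10**15    # bytes in a petabyte (decimal)
MBPS = 10**6   # bits per second in one megabit per second
GBPS = 10**9   # bits per second in one gigabit per second


def transfer_days(dataset_bytes, link_bits_per_second, efficiency=1.0):
    """Days needed to send dataset_bytes over a link at a sustained rate.

    efficiency is the assumed fraction of the nominal link speed that is
    actually achieved (1.0 means the full nominal rate).
    """
    if link_bits_per_second <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    seconds = dataset_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 86_400  # seconds in a day


for size_label, size in [("1 TB", TB), ("1 PB", PB)]:
    for link_label, rate in [("100 Mbps", 100 * MBPS), ("1 Gbps", GBPS), ("10 Gbps", 10 * GBPS)]:
        print(f"{size_label} over {link_label}: {transfer_days(size, rate):.2f} days")
```

Even under the optimistic assumption that the link runs at its full nominal rate, a petabyte over a 100 Mbps connection comes out at years rather than days, which is the gap a courier can close.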

The moSYS™600 – It runs on bourbon.

This gap between large-scale data generation and efficient analysis must be closed for adequate adoption of genomics in conventional medicine. While some players in the field, such as Knome, will try to sell a pre-packaged solution in a box, others, such as Nebula, are betting on the “private cloud”. For academic labs, small businesses, and hackers, I’d still recommend a mix of your own local solution backed by a powerful cloud pipeline. Of course, the real solution would be a modern internet infrastructure, but that quickly delves into the realm of civic policy, unless you live in Kansas City.

DNA & Iterated Function Systems

H. Sapiens

The genomic code that makes us is made up of four letters: A, T, G, and C. Billions of these letters together create a lifeform. Iterated function systems (IFS) are anything that can be made by repeating the same simple rules over and over. The easiest example is tree branches: add a simple structure repeatedly, ad infinitum, and before you know it we have complex and beautiful systems. The popular example is the Sierpinski Triangle, or “triforce” for the Zelda fans. As the cost of DNA sequencing falls day by day, we are confronted with a tsunami of data, and it has become exceedingly difficult to derive meaningful answers from all the information contained within us.

Triforce Power
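For the curious, the Sierpinski Triangle can be grown with a few lines of Python using the "chaos game": pick a corner at random, jump halfway toward it, and repeat. This is only a minimal sketch. The corner coordinates, starting point, and ASCII rendering are arbitrary choices made for illustration, and the names `chaos_game` and `render_ascii` are invented here.

```python
import random

# Corners of the triangle; the coordinates are an arbitrary illustrative choice.
TRIANGLE = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)]


def chaos_game(n_points, seed=0):
    """Apply the same simple rule n_points times: jump halfway to a random corner."""
    rng = random.Random(seed)
    x, y = 0.25, 0.25  # any starting point inside the triangle will do
    points = []
    for _ in range(n_points):
        cx, cy = rng.choice(TRIANGLE)
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def render_ascii(points, width=64, height=32):
    """Draw the points on a character grid so no plotting library is needed."""
    grid = [[" "] * width for _ in range(height)]
    for x, y in points:
        col = min(int(x * width), width - 1)
        row = height - 1 - min(int(y * height), height - 1)
        grid[row][col] = "*"
    return "\n".join("".join(line) for line in grid)


print(render_ascii(chaos_game(20000)))
```

The empty middle of the picture is the giveaway: a random process with one simple rule never lands in the central hole, and the same hole repeats at every smaller scale.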

Any advantage in how we organize and view the data helps us discover minute differences between individuals or, say, between a normal cell and a cancer cell. This is where Chaos Game Representation (CGR) becomes helpful. CGR is a form of IFS that is useful for mapping seemingly random information that we suspect or know to have some sort of underlying structure.

In our case this would be the human genome. Although the letters coming from our DNA look like billions of random babbles, they are of course organized in a manner that gives the blueprint for our bodies. So let’s roll the dice: do we get any sort of meaningful structure when applying CGR to DNA? If you are so inclined, something fun to try is the following:

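(* Read the FASTA file; backslashes in the Windows path must be escaped *)
genome = StringJoin[Import["c:\\data\\sequence.fasta", "Sequence"]];
(* Keep only the four bases, dropping any other characters *)
chars = StringCases[genome, "G" | "C" | "T" | "A"];
(* Each base moves the current point halfway toward its own corner of the unit square *)
f[x_, "A"] := x/2;
f[x_, "T"] := x/2 + {1/2, 0};
f[x_, "G"] := x/2 + {1/2, 1/2};
f[x_, "C"] := x/2 + {0, 1/2};
(* Start at the center and collect every intermediate point *)
pts = FoldList[f, {0.5, 0.5}, chars];
Graphics[{PointSize[Tiny], Point[pts]}]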

g1346a094 on Chromosome 7

For example, reading the sequence in order, apply T1 whenever C is encountered, T2 whenever A is encountered, T3 whenever T is encountered, and T4 whenever G is encountered, where each transformation moves the current point halfway toward the corner of the square assigned to that letter (in the code above, the four definitions of f play this role). Really, though, any assignment of transformations to C, A, T, and G can be used, and multiple methods can be compared. Self-similarity is immediately noticeable in these maps, which isn’t all that surprising since fractals are abundant in nature and DNA, after all, is a natural syntax. Being aware that these patterns exist within our data opens us up to new questions about whether IFS, CGR, and fractals in general are helpful tools in the interpretation of genomic data.
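If you don't have Mathematica handy, the same idea can be sketched in plain Python. This is a rough equivalent rather than a port: it uses the same corner assignment as the snippet above (A bottom-left, T bottom-right, G top-right, C top-left) and the same starting point, while the function name `cgr_points`, the `corners` parameter, and the decision to upper-case the input and skip unknown characters are choices made here for illustration.

```python
# Corner of the unit square assigned to each base (same layout as the
# Mathematica snippet: A -> (0,0), T -> (1,0), G -> (1,1), C -> (0,1)).
DEFAULT_CORNERS = {
    "A": (0.0, 0.0),
    "T": (1.0, 0.0),
    "G": (1.0, 1.0),
    "C": (0.0, 1.0),
}


def cgr_points(sequence, corners=DEFAULT_CORNERS, start=(0.5, 0.5)):
    """Return the chaos game path for a DNA string, including the start point.

    Each base moves the current point halfway toward its corner. Characters
    without a corner (N, gaps, line breaks) are skipped; upper-casing the
    input is an assumption made in this sketch.
    """
    x, y = start
    points = [(x, y)]
    for base in sequence.upper():
        if base not in corners:
            continue
        cx, cy = corners[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


for point in cgr_points("ATGC"):
    print(point)
```

Passing a different `corners` dictionary is one way to try the alternative letter-to-transformation assignments mentioned above and compare the resulting maps. The returned list of (x, y) pairs can be handed to whatever scatter plotter you prefer.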

Signal transducer 5B (STAT5B), on chromosome 17

Since the mapping is one-to-one and we see patterns emerge, there is a hint of biological relevance, especially because different genes yield different patterns. But what exactly are the correlations between the patterns and the biological functions? It would also be very interesting to see mappings with introns and exons colored differently, or with amino acids and various codons colored. One thing is for sure: genomes aren’t just endless columns and rows of letters, they are pictures. It is much easier to compare pictures and discover variations, which can ultimately allow us to find meaningful interpretations of this invaluable data.
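The one-to-one claim can be checked directly: with exact arithmetic, the final point of a CGR walk is enough to recover the whole sequence, because the quadrant a point sits in reveals the last letter applied. The sketch below is a small illustration under stated assumptions. It uses Python's `fractions.Fraction` for exact values, the same corner layout as the Mathematica snippet, and made-up function names `cgr_encode` and `cgr_decode`. It also assumes the sequence length is known. With ordinary floating point the early letters would be lost after a few dozen bases.

```python
from fractions import Fraction

# Same corner layout as the Mathematica snippet, as exact integers.
BASE_TO_CORNER = {"A": (0, 0), "T": (1, 0), "G": (1, 1), "C": (0, 1)}
CORNER_TO_BASE = {corner: base for base, corner in BASE_TO_CORNER.items()}

HALF = Fraction(1, 2)


def cgr_encode(sequence, start=(HALF, HALF)):
    """Return the exact final CGR point for a sequence of A/T/G/C."""
    x, y = start
    for base in sequence:
        cx, cy = BASE_TO_CORNER[base]
        x, y = (x + cx) / 2, (y + cy) / 2
    return x, y


def cgr_decode(point, length):
    """Recover a sequence of known length from its exact final CGR point."""
    x, y = point
    bases = []
    for _ in range(length):
        # The quadrant of the current point identifies the last base applied.
        cx = 1 if x > HALF else 0
        cy = 1 if y > HALF else 0
        bases.append(CORNER_TO_BASE[(cx, cy)])
        # Undo the halving step to get the previous point.
        x, y = 2 * x - cx, 2 * y - cy
    return "".join(reversed(bases))


sequence = "GATTACA"
point = cgr_encode(sequence)
print(point, cgr_decode(point, len(sequence)))
```

Each picture, then, really is the sequence in another form: no two different sequences of the same length end on the same point.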

A cross-post by Mo from petridishtalk.com 

Citations:

Jeffrey, H. J., “Chaos game visualization of sequences,” Computers & Graphics 16 (1992), 25-33.

Ashlock, D., Golden, J. B., III, “Iterated function system fractals for the detection and display of DNA reading frame” (2000), ISBN 0-7803-6375-2.

Nair, V. V., Vijayan, K., Gopinath, D. P., “ANN based Genome Classifier using Frequency Chaos Game Representation” (2010).