Computing Infrastructure for Large-Scale Genomics

The amount of data stored within a human being is large and complex, and once the floodgates are opened it is difficult not to be overwhelmed. Three approaches have traditionally been taken to deal with this data: local implementation, outsourcing to informatics “factories”, and distributed/cloud-based solutions. Knome, “The Human Genome Interpretation Company”, recently released a local solution that claims to complete all the informatics work for a human genome, from assembly to variant calling, within one day. That is roughly 30 genomes per month per knoSYS.

knoSYS™100

One could imagine wheeling this into an academic laboratory, hospital, or industrial research park. But at $125,000 each, and given the rate of change in computing hardware, I would be surprised if purchasing such a system were anything more than a gross portrayal of conspicuous consumption on the part of the buyer. A proven, value-driven method for accessing large-scale bioinformatics has been to work with “sequencing factories” such as JCVI or the Broad Institute. The J. Craig Venter Institute, for example, maintains a 1000-node Sun Grid Engine (SGE) cluster utilizing Hadoop/MapReduce and employs approximately 57 bioinformaticians and software developers.

JCVI
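To put the knoSYS figures quoted above in perspective, here is a rough sketch of the hardware cost per genome. The $125,000 price and the 30 genomes per month come from this post; the service lifetimes (1, 3, and 5 years) and the assumption that the machine runs at full capacity the whole time are mine, and the function name is purely illustrative. Power, staff, storage, and software licensing are left out.

```python
def hardware_cost_per_genome(price_usd, genomes_per_month, lifetime_months):
    """Spread the purchase price evenly over every genome processed.

    Assumes the box runs at full throughput for its whole lifetime,
    which is an optimistic simplification.
    """
    total_genomes = genomes_per_month * lifetime_months
    return price_usd / total_genomes


PRICE_USD = 125_000      # price quoted in the post
GENOMES_PER_MONTH = 30   # one genome per day, as claimed in the post

# Lifetimes are assumptions, not vendor figures.
for years in (1, 3, 5):
    cost = hardware_cost_per_genome(PRICE_USD, GENOMES_PER_MONTH, years * 12)
    print(f"{years}-year lifetime: ${cost:,.2f} of hardware per genome")
```

The number only looks good if the machine is kept busy every single day; at lower utilization the cost per genome rises proportionally.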

Practically speaking, however, both of the aforementioned models pale in comparison to consumer cloud-based solutions such as AWS, where the user can have as many equally or more powerful nodes as needed, with the same or better software, at scalable costs.

Extra Large Instance
15 GB of memory
8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
1,690 GB of instance storage
64-bit platform
I/O Performance: High
EBS-Optimized Available: 1000 Mbps
API name: m1.xlarge

High-Memory Quadruple Extra Large Instance
68.4 GB of memory
26 EC2 Compute Units (8 virtual cores with 3.25 EC2 Compute Units each)
1,690 GB of instance storage
64-bit platform
I/O Performance: High
EBS-Optimized Available: 1000 Mbps
API name: m2.4xlarge

Individual nodes on Amazon EC2 are nearly hardware equivalents of the knoSYS, and they can be scaled instantly. There are now even more robust cloud solutions from efforts such as Nebula from NASA Ames, which are tailored to research-specific applications, especially genomics. Nevertheless, one issue remains: transferring such large datasets from the sequencing machines to the analysis clusters has arguably become the bottleneck.

Next-Generation Sequencing Statistics

This has been an area of much debate: how do you get the data from its point of origin, whether an academic lab, a hospital, or even a large “sequencing factory”, to where it needs to be processed, accessed, and perhaps even stored? It turns out nature can efficiently store and move around petabytes of data with ease, at the drop of a hair. Humanity’s workaround at the moment, however, involves filling shipping crates with hard disks and transporting them as conventional cargo. Thus, there still remains some need for local computing.
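A back-of-the-envelope calculation shows why crates of hard disks are still competitive. The sketch below estimates how long a dataset takes to move over a network link. The dataset sizes, link speeds, and the `efficiency` factor are illustrative assumptions rather than measurements, and the function name is my own.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(dataset_terabytes, link_megabits_per_second, efficiency=1.0):
    """Days needed to push a dataset through a network link.

    Uses decimal units (1 TB = 1e12 bytes, 1 Mbps = 1e6 bits/s).
    `efficiency` is an assumed fraction of the nominal link speed
    that is actually sustained (1.0 means a perfect link).
    """
    bits = dataset_terabytes * 1e12 * 8
    seconds = bits / (link_megabits_per_second * 1e6 * efficiency)
    return seconds / SECONDS_PER_DAY


# Hypothetical scenarios: (label, dataset size in TB, link speed in Mbps).
scenarios = [
    ("1 TB over 100 Mbps", 1, 100),
    ("100 TB over 1000 Mbps", 100, 1000),
    ("1 PB over 1000 Mbps", 1000, 1000),
]

for label, terabytes, mbps in scenarios:
    print(f"{label}: {transfer_days(terabytes, mbps):.1f} days")
```

Under these assumptions a petabyte needs about three months on a perfectly sustained gigabit link, so a truck that arrives within a week wins easily.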

The moSYS™600 – It runs on bourbon.

This gap between large-scale data generation and efficient analysis must be closed for adequate adoption of genomics in conventional medicine. While some players in the field, such as Knome, will try to sell a pre-packaged solution in a box, others, such as Nebula, are betting on the “private cloud”. For academic labs, small businesses, and hackers, I’d still recommend a mix of your own local solution backed by a powerful cloud pipeline. Of course, the real solution would be a modern internet infrastructure, but that quickly delves into the realm of civic policy, unless you live in Kansas City.
