Computationally Intensive Research Project
Scaling Up for Large Metagenomic Computations with ScalaBLAST
Philip Hugenholtz,1 T. P. Straatsma,2 Jarek Nieplocha,2 Heidi J. Sofia,2 Christopher S. Oehmen,2 Ernest Szeto,3 Victor M. Markowitz,3 Nikos C. Kyrpides,1 Peter Zuber4
1Joint Genome Institute, 2Pacific Northwest National Laboratory, 3Lawrence Berkeley National Laboratory, 4Oregon Health Sciences University/Oregon Graduate Institute
FY07 Allocation - 800,000
Abstract
The U.S. Department of Energy, the National Institutes of Health, and other major agencies have recently emphasized understanding genomes at the level of whole organisms. This emphasis has driven a worldwide effort to increase the rate at which new organisms are sequenced and to improve the quality of new sequences coming online. Understanding the molecular composition and interactions of proteins in microbes is a key advance that holds great promise for producing cleaner transportation fuels, processing legacy waste products associated with weapons production, developing counter-bioterrorism strategies such as early detection of biological agents or field-deployable intervention technologies, and providing a vehicle for processing the spent fuel expected to result from the renewed nuclear reactor program. But extracting meaningful information from the vast collection of sequence data requires comparing many entire genomes against one another using approaches such as comparative genomics.
Comparative genomics is a powerful approach to understanding how single organisms evolve and function. The burgeoning number of sequenced genomes presents computational challenges that are only just being met with existing methods and infrastructure. However, the recent application of high-throughput sequencing to environmental samples (metagenomics) represents a grand challenge in biology because it has the potential to reveal how entire ecosystems function and interact. The approach was initially demonstrated on a simple acid mine drainage biofilm community comprising fewer than 10 species, resulting in significant recovery of the genomes of the dominant populations (Tyson, 2004). Key insights into the biofilm community's function and interactions could be inferred from a metabolic reconstruction of the genomic data, for example the partitioning of community-essential functions such as nitrogen fixation (Tyson, 2004).
Other ecosystems currently under investigation include termite hindgut communities for bioenergy production (a key DOE mission), activated sludges, soils, and terephthalate-degrading communities. The computational task of analyzing this massive volume of complex data is quickly outpacing the capacity of existing software and hardware. Current standalone sequence analysis implementations generally take days or weeks on a single dedicated machine to analyze a single microbial genome against the nonredundant protein database (a growing curated collection of proteins from all sequenced organisms). The grand challenge in this "genomes-to-life" sequence analysis effort is much larger, putting it beyond the capacity of most biologists to complete in a reasonable time.
ScalaBLAST is a high-performance extension to the National Center for Biotechnology Information (NCBI) Basic Local Alignment Search Tool (BLAST). ScalaBLAST has been shown to accelerate the throughput of BLAST sequence analysis in proportion to the number of processors available on machines like the EMSL Molecular Science Computing Facility supercomputer, MPP2. In a previous pilot project, we demonstrated nearly perfect scaling to machine capacity for sequence analysis tasks of the size we propose in this computational grand challenge. ScalaBLAST will be the main sequence analysis driver for the proposed project, enabling a body of sequence analysis that will provide a critical information resource to the general science community for addressing the driving science problems identified above.
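
To make "throughput in proportion to the number of processors" concrete, the following is a minimal sketch of ideal scaling under two simplifying assumptions: the query sequences are divided evenly among workers, and every query costs roughly the same time. The function names are hypothetical, and this is not ScalaBLAST's actual implementation or interface; it only illustrates the arithmetic behind near-perfect scaling.

```python
from typing import List, Sequence


def partition_queries(queries: Sequence[str], num_workers: int) -> List[List[str]]:
    """Split queries into num_workers contiguous chunks whose sizes differ by at most one.

    Hypothetical helper for illustration; not part of ScalaBLAST.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    base, extra = divmod(len(queries), num_workers)
    chunks: List[List[str]] = []
    start = 0
    for rank in range(num_workers):
        # The first `extra` workers each take one additional query.
        size = base + (1 if rank < extra else 0)
        chunks.append(list(queries[start:start + size]))
        start += size
    return chunks


def ideal_wall_time(num_queries: int, seconds_per_query: float, num_workers: int) -> float:
    """Wall time set by the busiest worker, assuming uniform cost per query."""
    busiest = -(-num_queries // num_workers)  # ceiling division
    return busiest * seconds_per_query


def ideal_speedup(num_queries: int, seconds_per_query: float, num_workers: int) -> float:
    """Ratio of single-worker wall time to num_workers wall time."""
    serial = ideal_wall_time(num_queries, seconds_per_query, 1)
    return serial / ideal_wall_time(num_queries, seconds_per_query, num_workers)
```

Under these assumptions, the speedup equals the worker count whenever the number of queries divides evenly among the workers. Real runs fall short of this ideal because query costs vary and because database access and communication add overhead.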

