Cannon WR, BJM Webb-Robertson, AR Willse, M Singhal, LA McCue, JE McDermott, RC Taylor, KM Waters, and CS Oehmen. 2009. "An Integrative Computational Framework for Hypotheses-Driven Systems Biology Research in Proteomics and Genomics." Chapter 4 in Computational and Systems Biology: Methods and Applications, pp. 63-85. Research Signpost, Trivandrum, India. Abstract Systems biology research is sometimes categorized as either discovery science or hypothesis-driven science. However, we believe that hypotheses are always used regardless, and that explicit recognition that hypothesis testing underlies all high-throughput data analysis leads to better experimental designs, data analysis and interpretation of the data. We outline the current use of hypothesis testing for proteomics data analysis in systems biology research for several projects at the Pacific Northwest National Laboratory, and provide examples of where scientific principles can be used to formulate the hypotheses used to analyze the data. We additionally discuss the data infrastructure that is required to (1) track the data from different projects and diverse assays, (2) pull the data together in a congruent manner, (3) analyze the data with respect to cellular networks, and (4) visualize the resulting networks and contrast those with information from bioinformatics databases.

Webb-Robertson BJM, WR Cannon, CS Oehmen, AR Shah, V Gurumoorthi, MS Lipton, and KM Waters. 2008. "A Support Vector Machine model for the prediction of proteotypic peptides for accurate mass and time proteomics." Bioinformatics 24(13):1503-1509. doi:10.1093/bioinformatics/btn218 Abstract Motivation: The standard approach to identifying peptides based on accurate mass and elution time (AMT) compares these profiles obtained from a high resolution mass spectrometer to a database of peptides previously identified from tandem mass spectrometry (MS/MS) studies. It would be advantageous, with respect to both accuracy and cost, to only search for those peptides that are detectable by MS (proteotypic). Results: We present a Support Vector Machine (SVM) model that uses a simple descriptor space based on 35 properties of amino acid content, charge, hydrophilicity, and polarity for the quantitative prediction of proteotypic peptides. Using three independently derived AMT databases (Shewanella oneidensis, Salmonella typhimurium, Yersinia pestis) for training and validation within and across species, the SVM resulted in an average accuracy measure of ~0.8 with a standard deviation of less than 0.025. Furthermore, we demonstrate that these results are achievable with a small set of 12 variables and can achieve high proteome coverage. Availability: http://omics.pnl.gov/software/STEPP.php

Webb-Robertson BJM, CS Oehmen, and AR Shah. 2008. "A Feature Vector Integration Approach for a Generalized Support Vector Machine Pairwise Homology Algorithm." Computational Biology and Chemistry 32(6):458-461. doi:10.1016/j.compbiolchem.2008.07.017 Abstract Due to the exponential growth of sequenced genomes, the need to quickly provide accurate annotation for existing and new sequences is paramount to facilitate biological research. Current sequence comparison approaches fail to detect homologous relationships when sequence similarity is low. Support vector machine (SVM) algorithms approach this problem by transforming all proteins into a feature space of equal dimension based on protein properties, such as sequence similarity scores against a basis set of proteins or motifs. This multivariate representation of the protein space is then used to build a classifier specific to a pre-defined protein family. However, this approach is not well suited to large-scale annotation. We have developed an SVM HOmology Tool (SHOT) that formulates remote homology as a single classifier that answers the pairwise comparison problem. SHOT integrates the two feature vectors for a pair of sequences into a single vector representation that can be used to build a classifier that separates sequence pairs into homologs and non-homologs, in lieu of pre-defined families. SHOT attains homology scores in run-times competitive with the state-of-the-art PSI-BLAST algorithm. In addition, SHOT yields a dramatic increase in the number of accurate identifications on the benchmark dataset, quantified as the area under the Receiver Operating Characteristic curve: 0.97 for SHOT versus 0.73 and 0.70 for PSI-BLAST and BLAST, respectively.
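
The abstract above describes integrating the two feature vectors of a sequence pair into a single vector, but it does not say which integration operator is used. The following is a minimal sketch of the general idea only, assuming a hypothetical order-invariant operator (elementwise sum concatenated with elementwise absolute difference); the function name and the operator are illustrative assumptions, not the published SHOT method.

```python
from typing import List, Sequence


def integrate_pair(u: Sequence[float], v: Sequence[float]) -> List[float]:
    """Combine two equal-length feature vectors into one pair vector.

    Hypothetical operator (an assumption, not the published one):
    elementwise sums followed by elementwise absolute differences.
    Both parts are symmetric, so (u, v) and (v, u) map to the same vector.
    """
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same dimension")
    sums = [a + b for a, b in zip(u, v)]
    diffs = [abs(a - b) for a, b in zip(u, v)]
    return sums + diffs


# Example: two 2-dimensional feature vectors become one 4-dimensional pair vector.
pair_vector = integrate_pair([1.0, 2.0], [3.0, 0.5])
```

A symmetric operator is chosen here so that a pair gets the same representation regardless of the order in which its two sequences are presented; the resulting pair vectors could then be labeled homolog or non-homolog and passed to any binary classifier.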

Shah AR, CS Oehmen, and BJM Webb-Robertson. 2008. "SVM-Hustle - An iterative semi-supervised machine learning approach for pairwise protein remote homology detection." Bioinformatics 24(6):783-790. doi:10.1093/bioinformatics/btn028 Abstract Motivation: As the amount of biological sequence data continues to grow exponentially we face the increasing challenge of assigning function to this enormous molecular ‘parts list’. The most popular approaches to this challenge make use of the simplifying assumption that similar functional molecules, or proteins, sometimes have similar composition, or sequence. However, these algorithms often fail to identify remote homologs (proteins with similar function but dissimilar sequence), which often are a significant fraction of the total homolog collection for a given sequence. We introduce a Support Vector Machine (SVM)-based tool to detect Homology Using Semisupervised iTerative LEarning (SVM-HUSTLE) that detects significantly more remote homologs than current state-of-the-art sequence or cluster-based methods. As opposed to building profiles or position specific scoring matrices, SVM-HUSTLE builds an SVM classifier for a query sequence by training on a collection of representative high-confidence training sets. SVM-HUSTLE combines principles of semi-supervised learning theory with statistical sampling to create many concurrent classifiers to iteratively detect and refine on-the-fly patterns indicating homology. Results: When compared against existing methods for identifying protein homologs (BLASTp, PSI-BLAST, RANKPROP, MOTIFPROP and their variants) on the SCOP 1.59 benchmark dataset consisting of 7329 protein sequences, SVM-HUSTLE significantly outperforms each of the above methods using the most stringent ROC1 statistic with p-values less than 1e-20.

Gorton I, CS Oehmen, and JE McDermott. 2008. "It Takes Glue to Tango: MeDICi integration framework creates data-intensive computing pipeline." Scientific Computing 25(7):16-24. Abstract Biologists increasingly rely on high-performance computing (HPC) platforms to rapidly process the tsunami of data generated by high-throughput genome and metagenome sequencing technology and high-throughput proteomics. Unfortunately, the platforms that produce the massive data sets rarely work smoothly with the interactive analysis and visualization programs used in bioinformatics. This makes it difficult for researchers to exploit the computational power of HPC platforms to speed scientific discovery. At the Department of Energy’s Pacific Northwest National Laboratory in Richland, Wash., researchers are creating computing environments for biologists that seamlessly integrate collections of data and computational resources. These advantages enable users to rapidly analyze high-throughput data. A major goal is to shield the biologist from the complexity of interacting with multiple dissimilar databases and running tasks on HPC platforms and computational clusters. One of those environments, the MeDICi Integration Framework, is now available for free download. Short for Middleware for Data-Intensive Computing, MeDICi makes it easy to integrate separate codes into complex applications that operate as a data analysis pipeline.

Webb-Robertson BJM, ES Peterson, M Singhal, KR Klicker, CS Oehmen, JN Adkins, and SL Havre. 2007. "PQuad - A visual analysis platform for proteomic data exploration of microbial organisms." Bioinformatics 23(13):1705-1707. doi:10.1093/bioinformatics/btm132 Abstract The visual Platform for Proteomics Peptide and Protein data exploration (PQuad) is a multi-resolution environment that visually integrates genomic and proteomic data at the prokaryotic scale, displays biological categorical information, and supports viewing differential expression experiments. PQuad is built on Java 1.5 and has been tested and runs across different operating systems.

Shah AR, CS Oehmen, JK Harper, and BJM Webb-Robertson. 2007. "Integrating Subcellular Location for Improving Machine Learning Models of Remote Homology Detection in Eukaryotic Organisms." Computational Biology and Chemistry 31(2):138-142. doi:10.1016/j.compbiolchem.2007.02.012 Abstract Motivation: At the center of bioinformatics, genomics, and proteomics is the need for highly accurate genome annotations. Producing high-quality reliable annotations depends on identifying sequences which are related evolutionarily (homologs) on which to infer function. Homology detection is one of the oldest tasks in bioinformatics; however, most approaches still fail when presented with sequences that have low residue similarity despite a distant evolutionary relationship (remote homology). Recently, discriminative approaches, such as support vector machines (SVMs), have demonstrated a vast improvement in sensitivity for remote homology detection. These methods, however, have only focused on one aspect of the sequence at a time, e.g., sequence similarity or motif based scores. However, supplementary information, such as the sub-cellular location of a protein within the cell, would give further clues as to possible homologous pairs, additionally eliminating false relationships due to simple functional roles that cannot exist due to location. We have developed a method, SVM-SimLoc, that integrates sub-cellular location with sequence similarity information into a protein family classifier and compared it to one of the most accurate sequence based SVM approaches, SVM-Pairwise. Results: The SCOP 1.53 benchmark data set was utilized to assess the performance of SVM-SimLoc. As cellular location prediction is dependent upon the type of sequence, eukaryotic or prokaryotic, the analysis is restricted to the 2630 eukaryotic sequences in the benchmark dataset, evaluating a total of 27 protein families. We demonstrate that the integration of sequence similarity and sub-cellular location yields notably more accurate results than using sequence similarity independently at a significance level of 0.006.

Samatova NF, A Gorin, E Uberbacher, TV Karpinets, BH Park, C Pan, TP Straatsma, WR Cannon, H Resat, RD Lins, and CS Oehmen. 2007. "Data driven computing for Biological Systems." SciDAC Review 5(Fall 2007):10-25. Abstract Biological breakthroughs that can lead to improved diagnosis and treatment of diseases, generation of clean energy, and solutions to other critical societal problems require high performance, data-intensive computational tools that have the ability to process, analyze and cohesively integrate massive amounts of data and information in real time. Biological computing problems are typically data-intensive and must share very large sets of data effectively across many processors. However, the various components of biological systems, composed of complex networks and pathways, must be integrated to gain a coherent understanding of the system. The more different types of data that can be integrated, the deeper the insights into the biology of the system being studied. Conventional analysis software, however, hasn’t been able to efficiently deal with such massive data sets. The goal of the Data-Intensive Computing for Complex Biological Systems (BioPilot) project, a multiyear project funded by the U.S. Department of Energy’s Office of Advanced Scientific Computing Research (ASCR), is to create an integrated suite of highly flexible, highly adaptable pipelines of computational tools for analyzing large-scale data sets that will be used to address specific challenges facing the U.S. Department of Energy (DOE) and our society.

Oehmen CS, and J Nieplocha. 2006. "ScalaBLAST: A Scalable Implementation of BLAST for High Performance Data-Intensive Bioinformatics Analysis." IEEE Transactions on Parallel and Distributed Systems 17(8):740-749. Abstract Genes in an organism’s DNA (genome) have embedded in them information about proteins, which are the molecules that do most of a cell’s work. A typical bacterial genome contains on the order of 5000 genes. Mammalian genomes can contain hundreds of thousands of genes. For each genome sequenced, the challenge is to identify protein components (proteome) being actively used for a given set of conditions. Fundamentally, sequence alignment is a sequence matching problem focused on unlocking protein information embedded in the genetic code, making it possible to assemble a “tree of life” by comparing new sequences against all sequences from known organisms. But the memory footprint of sequence data is growing more rapidly than per-node core memory. Despite years of research and development, high performance sequence alignment applications either do not scale well, cannot accommodate very large databases in core, or require special hardware. We have developed a high performance sequence alignment application, ScalaBLAST, which accommodates very large databases, and which scales linearly to hundreds of processors on both distributed memory and shared memory architectures, representing a substantial improvement over the current state-of-the-art in high performance sequence alignment with scaling and portability. ScalaBLAST relies on a collection of innovative techniques -- distributing the target database over available memory, multi-level parallelism to exploit concurrency, parallel I/O, and latency hiding through data prefetching -- to achieve high performance and scalability. This demonstrated approach of database sharing combined with effective task scheduling should have broad ranging applications to other informatics-driven sciences.

Oehmen CS, TP Straatsma, GA Anderson, G Orr, BJM Webb-Robertson, RC Taylor, RW Mooney, DJ Baxter, DR Jones, and DA Dixon. 2006. "New Challenges Facing Integrative Biological Science in the Post-Genomic Era." Journal of Biological Systems 14(2):275-293. Abstract The future of biology will be increasingly driven by the fundamental paradigm shift from hypothesis-driven research to data-driven discovery research employing the massive amounts of available biological data. We identify key technological developments needed to enable this paradigm shift involving (1) the ability to store and manage extremely large datasets which are dispersed over a wide geographical area, (2) development of novel analysis and visualization tools which are capable of operating on enormous data resources without overwhelming researchers with unusable information, and (3) formalisms for integrating mathematical models of biosystems from the molecular level to the organism population level. This will require the development of tools which efficiently utilize high-performance compute power, large storage infrastructures and large aggregate memory architectures. The end result will be the ability of a researcher to integrate complex data from many different sources with simulations to analyze a given system at a wide range of temporal and spatial scales in a single conceptual model.

Webb-Robertson BJM, CS Oehmen, and MM Matzke. 2005. "SVM-BALSA: Remote Homology Detection based on Bayesian Sequence Alignment." Computational Biology and Chemistry 29(6):440-443. Abstract Using biopolymer sequence comparison methods to identify evolutionarily related proteins is one of the most common tasks in bioinformatics. Recently, support vector machines (SVMs) utilizing statistical learning theory have been employed in the problem of remote homology detection and shown to outperform iterative profile methods such as PSI-BLAST. In this study we demonstrate that the utilization of a Bayesian alignment score, which accounts for the uncertainty of all possible alignments, in the SVM construction improves sensitivity compared to the traditional dynamic programming implementation.

Cannon WR, KH Jarman, BJM Webb-Robertson, DJ Baxter, CS Oehmen, KD Jarman, A Heredia-Langner, KJ Auberry, and GA Anderson. 2005. "A Comparison of Probability and Likelihood Models for Peptide Identification from Tandem Mass Spectrometry Data." Journal of Proteome Research 4(5):1687-1698. Abstract We evaluate statistical models used in two-hypothesis tests for identifying peptides from tandem mass spectrometry data. The null hypothesis H0 that a peptide matches a spectrum by chance requires information on the probability of by-chance matches between peptide fragments and peaks in the spectrum. Likewise, the alternate hypothesis HA that the spectrum is due to a particular peptide requires probabilities that the peptide fragments would indeed be observed if it was the causative agent. We compare models for these probabilities by determining the identification rates produced by the models using an independent data set. The initial models use different probabilities depending on fragment ion type, but uniform probabilities for each ion type across all of the labile bonds along the backbone. More sophisticated models for probabilities under both HA and H0 are introduced that do not assume uniform probabilities for each ion type. In addition, the performance of these models using a standard likelihood model is compared to an information theory approach derived from the likelihood model. Also, a simple but effective model for incorporating peak intensities is described. Finally, a support-vector machine is used to discriminate between incorrect hits and correct identifications based on multiple characteristics of the scoring functions. The results are shown to reduce the misidentification rate by 10-fold when compared to a standard cross-correlation based approach.
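
The two-hypothesis test described in the abstract above can be illustrated with a log-likelihood ratio over predicted fragment ions. The sketch below is a simplified illustration, assuming each predicted fragment is an independent matched/unmatched event and that the probability depends only on ion type (as in the "initial models" the abstract mentions); the function name and the probability values are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, Iterable, Tuple


def log_likelihood_ratio(
    observations: Iterable[Tuple[str, bool]],
    p_alt: Dict[str, float],
    p_null: Dict[str, float],
) -> float:
    """Sum of log[P(observation | HA) / P(observation | H0)] over fragments.

    observations: (ion_type, matched) for each predicted fragment ion.
    p_alt[ion_type]:  assumed probability that a fragment of this type is
                      observed if the peptide produced the spectrum (HA).
    p_null[ion_type]: assumed probability that it matches a peak by chance (H0).
    """
    score = 0.0
    for ion_type, matched in observations:
        pa, p0 = p_alt[ion_type], p_null[ion_type]
        if matched:
            score += math.log(pa / p0)
        else:
            score += math.log((1.0 - pa) / (1.0 - p0))
    return score


# Hypothetical per-ion-type probabilities, for illustration only.
P_ALT = {"b": 0.8, "y": 0.9}
P_NULL = {"b": 0.2, "y": 0.2}

# Three of four predicted fragments match peaks in the spectrum.
example_score = log_likelihood_ratio(
    [("b", True), ("b", False), ("y", True), ("y", True)], P_ALT, P_NULL
)
```

A positive score favors HA (the peptide explains the spectrum) and a negative score favors H0 (the matches are by chance); the non-uniform models described in the abstract would replace the per-ion-type dictionaries with probabilities that also vary along the peptide backbone.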