Bioinformatics Part II: Application, Manipulation, and Uses

Or, How I Learned to Stop Worrying and Love Computational Biology

In Part I of this series on Bioinformatics, we explored how researchers acquire expansive data sets for large-scale research projects. We discussed: DNA microarrays, used to determine the relative levels of expressed genes under a given condition (e.g., disease); Chromatin Immunoprecipitation (ChIP), used to determine the locations at which proteins bind to DNA; and DNA sequencing, used to determine the linear arrangement of nucleotide bases.


Part II will discuss some of the applications of this data collection.

Why collect bioinformatics data? How is this data used?

Through clever computational approaches, researchers can apply this data to draw meaningful conclusions and to develop new approaches to treatment.

While there are many possibilities, we will focus on four major applications: genome annotation and prediction; evolutionary biology; systems biology; and cancer biology.

Genome Annotation and Prediction

Once an organism’s genome is sequenced, the next challenge is to organize the newly acquired wealth of raw information. Genomic annotation is the process of identifying genes within the DNA sequence and labeling them with relevant biological information. This generally includes the gene name, function, any known biochemical interactions, and DNA sequence. The general idea is to compare unknown sequence data against a curated database of annotated genes. Indeed, there are many computer programs designed to this end; some of the earliest bioinformatics software was developed for this purpose.

BLAST, or Basic Local Alignment Search Tool, is a public domain bioinformatics tool used ubiquitously in the biological sciences. BLAST searches databases of annotated sequence information. Through the web interface hosted by the NIH, researchers can enter experimental sequence information, and BLAST will return a list of genes with similar DNA sequences from other, previously sequenced organisms.
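To make the idea of "local alignment" concrete, here is a minimal sketch in Python. It is not BLAST itself: BLAST uses fast heuristics to search large databases, whereas this sketch uses the simpler, exhaustive Smith-Waterman method on a single pair of sequences. The function name, the scoring values (match, mismatch, gap), and the example sequences are all assumptions chosen for illustration.

```python
def smith_waterman(query, subject, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_query, aligned_subject) for the best local alignment.

    Scoring values are illustrative assumptions, not BLAST defaults.
    """
    rows, cols = len(query) + 1, len(subject) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def pair_score(i, j):
        # Score for aligning query[i-1] with subject[j-1].
        return match if query[i - 1] == subject[j - 1] else mismatch

    # Fill the scoring matrix; a cell never drops below zero,
    # which is what makes the alignment "local".
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + pair_score(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            top.append(query[i - 1])
            bottom.append(subject[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(query[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(subject[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# Toy sequences (made up for illustration) that share the stretch "CGT".
print(smith_waterman("AAACGTAAA", "CCCCGTCCC"))
```

A database search tool repeats this kind of comparison against many stored sequences and ranks the hits by score, which is why heuristics that avoid filling the full matrix matter in practice.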

Genome prediction software utilizes BLAST search information to analyze the genome of a newly sequenced species. The software will identify regions of the genome most likely to be genes, compare them to known species, and deduce the location and function of genes in a newly sequenced organism (with varying confidence levels). If this organism is closely related to an organism with a known genomic sequence, and oftentimes this is the case, then this type of software is instrumental for comparing the two genomes.
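One simple way to "identify regions most likely to be genes" is to scan for open reading frames (ORFs): stretches that begin with a start codon and run, codon by codon, to a stop codon. The sketch below is a deliberately naive illustration under stated assumptions. It reads only the forward strand, ignores introns, and uses a hypothetical `min_codons` cutoff; real gene predictors rely on statistical models and similarity searches rather than this rule alone.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return a sorted list of (start, end, sequence) for forward-strand ORFs.

    start is 0-based and end is exclusive. min_codons counts the start
    and stop codons too, and is an illustrative threshold only.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):  # the three forward reading frames
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # open a candidate ORF at the first start codon
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, dna[start:pos + 3]))
                start = None  # close the ORF and keep scanning this frame
    return sorted(orfs)


# Made-up sequence containing one short ORF.
print(find_orfs("CCATGAAATGGTAACC"))
```

Each candidate found this way could then be compared against annotated genes from related organisms to assign a likely function.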

Dozens of genome annotation and prediction software programs are available, each specializing in different species. Some notable examples include geneid, GeneMapper, and GeneWise.

Computational Evolutionary Biology

Ever wonder how closely related humans are to apes? To mice? To bacteria? Similar to genome prediction programs that compare genomes for biological information, computational evolutionary biology programs compare genomes for evolutionary information.

Evolutionary biology is a field that pre-dates even Darwin, as scientists in the 18th and 19th centuries studied fossils and postulated relationships between past and living animals. Now, instead of relying on physical similarities, researchers have genomic data to analyze the evolutionary relationships between species.

There are many programs that generate phylogenetic trees from genomic sequence information. From these trees, phylogenetic networks are created. This information can be used to trace the evolutionary history of organisms, and to predict their future evolutionary patterns. As one real-world application, researchers can use these techniques to elucidate the genetic mechanisms by which certain bacteria acquire drug resistance.
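As a rough sketch of how a tree can be built from sequences, the Python below computes the fraction of differing positions between each pair of aligned sequences and then repeatedly joins the two closest clusters (the UPGMA method, one of the simplest tree-building approaches). The sequences are made up and the species labels are placeholders; production phylogenetics software uses more sophisticated distance corrections and statistical methods.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of positions that differ between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(seqs):
    """Cluster sequences by average pairwise distance; return a nested tuple."""
    # Each cluster (a name or a nested tuple) maps to its member names.
    clusters = {name: [name] for name in seqs}
    dist = {
        frozenset(pair): p_distance(seqs[pair[0]], seqs[pair[1]])
        for pair in combinations(seqs, 2)
    }

    def average_distance(c1, c2):
        # Mean distance over every member pair drawn from the two clusters.
        total = sum(dist[frozenset((m, n))]
                    for m in clusters[c1] for n in clusters[c2])
        return total / (len(clusters[c1]) * len(clusters[c2]))

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        # Replace the two closest clusters with their union.
        clusters[(c1, c2)] = clusters.pop(c1) + clusters.pop(c2)
    return next(iter(clusters))


# Toy aligned sequences, invented for illustration (not real genomic data).
toy = {"human": "ACGTACGT", "chimp": "ACGTACGA", "mouse": "ACGAACTA"}
print(upgma(toy))
```

In the nested tuple that comes back, the most similar sequences sit together in the innermost pair, mirroring the branching order of the tree.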

Systems Biology

In systems biology, researchers use bioinformatics data to model cellular function. While some programmers aim to create complete cell and organ simulations, others seek more modest goals, such as explaining the mechanisms of a metabolic pathway within the cell. All of the goals in systems biology, however, are made possible by bioinformatics.

To model a metabolic pathway, a researcher will create an algorithm that compares expression data under various metabolic conditions (acquired from microarrays) to other information, such as sequence data or protein-DNA interactions (acquired from ChIP experiments).

Different programs take different approaches to understanding biological phenomena. For example, a researcher studying a cell’s response to low oxygen (hypoxia) would acquire expression data for cells cultured at various concentrations of oxygen. The researcher could find which genes’ expression patterns correspond to the concentration gradient, compare the sequences of these genes, look for proteins that bind to any of these genes, and so on. Ultimately, the cellular response to hypoxia can be modeled algorithmically.
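The first step in the hypoxia example, finding genes whose expression tracks the oxygen gradient, can be sketched as a simple correlation screen. The gene names, expression values, and the 0.9 cutoff below are all hypothetical, and Pearson correlation is just one reasonable choice of measure; a real analysis would also handle replicates, normalization, and multiple-testing correction.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no trend to correlate
    return cov / sqrt(var_x * var_y)


def oxygen_responsive_genes(oxygen, expression, threshold=0.9):
    """Return {gene: r} for genes whose |r| with oxygen meets the threshold."""
    hits = {}
    for gene, levels in expression.items():
        r = pearson(oxygen, levels)
        if abs(r) >= threshold:
            hits[gene] = r
    return hits


# Hypothetical data: percent oxygen in four cultures, and expression levels
# for three placeholder genes measured in each culture.
oxygen = [1, 5, 10, 21]
expression = {
    "geneA": [9.0, 7.0, 4.5, 1.0],   # rises as oxygen falls
    "geneB": [5.0, 5.1, 4.9, 5.0],   # roughly flat
    "geneC": [1.0, 2.0, 4.0, 8.0],   # rises with oxygen
}
print(oxygen_responsive_genes(oxygen, expression))
```

Genes with a strongly negative coefficient are the ones switched on as oxygen drops, and would be the natural candidates for the follow-up sequence and protein-binding comparisons.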

To model a complete cell, one must assemble the entire scope of bioinformatics data into a single meaningful network. In 2006, the President of the National Science Foundation (NSF) declared that the “Grand Challenge of systems biology is to model the interconnected changes in the whole cell.”  Groups at MIT and Harvard have since developed models.

Merrimack Pharmaceuticals (MACK) used its own model to develop the drug MM-111, a first-in-class bispecific antibody indicated for gastric cancer. Currently in Phase II, MM-111 has been shown in preclinical studies to bind with both specificity and avidity to HER2- and HER3-expressing tumor cells. The HER3 arm, which targets a receptor identified as a key tumorigenic node in many types of cancer by the company’s Network Biology approach, is designed to block heregulin-induced cell signaling. According to the company, ligand-induced signaling of HER3 activates cellular pathways that promote the development, growth and progression of cancer. The company states HER3 also serves as a compensatory mechanism in cancer cells developing resistance to targeted therapies, including chemotherapy. The HER2 arm is responsible for initial tumor cell targeting and docking.

Another major application of systems biology is protein folding prediction software. Every protein has a unique folding pattern, and only a small number have been completely solved by experimental methods. In theory, a program could read the amino acid sequence of a protein and predict its three-dimensional shape. In practice, however, no modeling software has supplanted the traditional methods for solving protein structure.

Currently, protein structure solving involves labor-intensive crystallization and x-ray diffraction techniques. Structures that are solved are stored on the Protein Data Bank (PDB), a database maintained by the Research Collaboratory for Structural Bioinformatics. Several programs use data available on the PDB to predict the structure of unsolved proteins, including FoldX, Biskit, and Phyre.

Cancer Biology

After the Human Genome Project was completed, researchers had access to remarkably powerful information: imagine, the full DNA sequence of a healthy human being. A standard!

Before long, researchers began comparing the sequences of cancerous cells with the standard. They also began comparing microarray data from cancerous cells with data from healthy cells, looking for genes overexpressed in cancer cells. By this method, the BRCA1 and BRCA2 genes were isolated.

Comparative genomic hybridization is the process by which a researcher can identify abnormal additions or deletions of DNA. As in other microarray techniques, probes are used to count the number of copies of particular regions of a chromosome. Aberrations in these numbers are often indicative of diseases, including cancer, Down syndrome (caused by an extra copy of chromosome 21), and Duchenne muscular dystrophy (caused by a gene deletion on the X chromosome). Comparative genomic hybridization is a major source of data in bioinformatics studies.
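The copy-counting idea can be sketched as a ratio test: for each probe, compare the signal from the test sample to the signal from a reference sample and flag large departures from equal. The sketch below works on the log2 ratio, a common way to express such comparisons. The probe names, signal values, and the 0.3 threshold are assumptions for illustration, and real pipelines add normalization and smoothing across neighboring probes.

```python
from math import log2


def call_copy_number(test_signal, reference_signal, threshold=0.3):
    """Label each probe 'gain', 'loss', or 'normal' from its log2 ratio.

    Both arguments map probe name -> positive signal intensity.
    The threshold is an illustrative assumption, not a standard value.
    """
    calls = {}
    for probe, test in test_signal.items():
        ratio = log2(test / reference_signal[probe])
        if ratio >= threshold:
            label = "gain"
        elif ratio <= -threshold:
            label = "loss"
        else:
            label = "normal"
        calls[probe] = (round(ratio, 2), label)
    return calls


# Hypothetical intensities: three copies versus two gives a ratio of 1.5,
# and one copy versus two gives a ratio of 0.5.
test_sample = {"probe_chr21": 150.0, "probe_chrX": 50.0, "probe_chr1": 101.0}
reference = {"probe_chr21": 100.0, "probe_chrX": 100.0, "probe_chr1": 100.0}
for probe, call in call_copy_number(test_sample, reference).items():
    print(probe, call)
```

Working in log2 space makes gains and losses symmetric around zero, so a doubling and a halving sit the same distance from "normal".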

Single Nucleotide Polymorphisms (SNPs) are genetic loci that are susceptible to variation within the human genome. Though each SNP is only a single nucleotide in length, the NIH maintains a database (dbSNP) of all such recorded instances in various genomes. SNPs are typically discovered by sequencing experiments, and are used in the comparison of genes in healthy and cancerous cells. In theory, a set of SNPs could increase the likelihood of cancer. While more research is required before this method can be used conclusively, interesting connections have been made between certain SNPs and gene function.
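At its simplest, comparing a gene between two samples means walking along two aligned sequences and recording every position where a single base differs. The sketch below does only that, on made-up sequences with a hypothetical function name. A difference found this way is just a candidate: whether it counts as a SNP depends on how common the variant is across a population, which this code does not assess.

```python
def find_single_base_differences(reference, sample):
    """Return (position, reference_base, sample_base) for each mismatch.

    Positions are 1-based. Both sequences must already be aligned
    to the same length, with no insertions or deletions.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos + 1, ref_base, sample_base)
        for pos, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


# Made-up sequences that differ at a single position.
print(find_single_base_differences("ACGTACGT", "ACGAACGT"))
```

The resulting positions could then be looked up in a variant database such as dbSNP to see whether each difference has been recorded before.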

Where are we now?

At the crossroads of biology and computer science, bioinformatics continues to bring biological research into the 21st century. With the full sequence of the human genome and advanced experimental techniques, scientists can use computer algorithms to predict disease genotypes, protein structures, gene functions, and evolutionary connections. The development of these programs is the focus of bioinformatics and biotechnology conventions worldwide.

In the third and final part of this series on bioinformatics, we will discuss the real-life applications of bioinformatics research. While Part I discussed the acquisition of bioinformatics data and Part II discusses the manipulation of that data, Part III will discuss the programs developed and conclusions drawn from bioinformatics research. As the field of bioinformatics continues to evolve, the outcomes of these studies are highly anticipated.
