Bioinformatics
Bioinformatics or computational biology is the use of techniques from applied mathematics, informatics, statistics, and computer science to solve biological problems. Research in computational biology often overlaps with systems biology. Major research efforts in the field include sequence alignment, gene finding, genome assembly, protein structure alignment, protein structure prediction, prediction of gene expression and protein-protein interactions, and the modeling of evolution. The terms bioinformatics and computational biology are often used interchangeably, although the latter typically focuses on algorithm development and specific computational methods. A common thread in projects in bioinformatics and computational biology is the use of mathematical tools to extract useful information from noisy data produced by high-throughput biological techniques. (The field of data mining overlaps with computational biology in this regard.) Representative problems in computational biology include the assembly of high-quality DNA sequences from fragmentary "shotgun" DNA sequencing, and the prediction of gene regulation with data from mRNA microarrays or mass spectrometry.
Major research areas
Sequence analysis
Main articles: Sequence alignment, Sequence database
Since the phage Φ-X174 was sequenced in 1977, the DNA sequences of more and more organisms have been decoded and stored in electronic databases. This data is analyzed to determine genes that code for proteins, as well as regulatory sequences. A comparison of genes within a species or between different species can show similarities between protein functions, or relations between species (the use of molecular systematics to construct phylogenetic trees). With the growing amount of data, it long ago became impractical to analyze DNA sequences manually. Today, computer programs are used to search the genomes of thousands of organisms, containing billions of nucleotides. These programs can compensate for mutations (exchanged, deleted or inserted bases) in the DNA sequence, in order to identify sequences that are related, but not identical. A variant of this sequence alignment is used in the sequencing process itself. The so-called shotgun sequencing technique (which was used, for example, by The Institute for Genomic Research to sequence the first bacterial genome, Haemophilus influenzae) does not give a sequential list of nucleotides, but instead the sequences of thousands of small DNA fragments (each about 600-800 nucleotides long). The ends of these fragments overlap and, when aligned in the right way, make up the complete genome. Shotgun sequencing yields sequence data quickly, but the task of assembling the fragments can be quite complicated for larger genomes. In the case of the Human Genome Project, it took several months of CPU time (on a circa-2000 vintage DEC Alpha computer) to assemble the fragments. Shotgun sequencing is the method of choice for virtually all genomes sequenced today, and genome assembly algorithms are a critical area of bioinformatics research.
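The idea that overlapping fragment ends can be joined into a longer sequence can be illustrated with a toy example. The following is a minimal sketch of a greedy overlap merge, assuming error-free reads from a single strand; the function names and the three short reads are hypothetical, and real assemblers must also cope with sequencing errors, repeats, and reverse-complement reads.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Overlaps shorter than `min_len` are ignored and reported as 0.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments with the largest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical reads whose ends overlap by four bases.
reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
contigs = greedy_assemble(reads)
print(contigs)
```

The greedy strategy is easy to follow but can choose a wrong merge when a genome contains repeated regions, which is one reason assembly becomes harder for larger genomes.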
Another aspect of bioinformatics in sequence analysis is the automatic search for genes and regulatory sequences within a genome. Not all of the nucleotides within a genome are genes. Within the genome of higher organisms, large parts of the DNA do not serve any obvious purpose. This so-called junk DNA may, however, contain unrecognized functional elements. Bioinformatics helps to bridge the gap between genome and proteome projects, for example in the use of DNA sequence for protein identification.
See also: sequence analysis, sequence profiling tool, sequence motif.
Genome annotation
Main article: Gene finding
In the context of genomics, annotation is the process of marking the genes and other biological features in a DNA sequence. The first genome annotation software system was designed in 1995 by Owen White, who was part of the team that sequenced and analyzed the first genome of a free-living organism to be decoded, the bacterium Haemophilus influenzae. Dr. White built a software system to find the genes (places in the DNA sequence that encode a protein), the transfer RNAs, and other features, and to make initial assignments of function to those genes. Most current genome annotation systems work similarly, but the programs available for analysis of genomic DNA are constantly changing and improving. The Ensembl system (http://www.ensembl.org) is a genome annotation pipeline for the human genome developed by Ewan Birney at The Sanger Institute (http://www.sanger.ac.uk) near Cambridge, England.
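A very simplified form of gene finding is a scan for open reading frames (ORFs): stretches that begin with a start codon and run, in the same reading frame, to a stop codon. The sketch below is illustrative only and is not the method used by any of the systems named above. It assumes a bacterial-style sequence without introns, looks only at the forward strand, and uses a hypothetical function name and test sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) positions of ORFs on the forward strand.

    `start` is the index of the ATG, `end` is exclusive and includes the
    stop codon. `min_codons` counts codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


# Hypothetical sequence with one short ORF starting at index 2.
sequence = "CCATGGCTAAATGATT"
orfs = find_orfs(sequence)
print(orfs, [sequence[s:e] for s, e in orfs])
```

Practical gene finders go much further, for example by scanning both strands and by using statistical models of coding sequence, but the reading-frame logic is the same.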
Computational evolutionary biology
Evolutionary biology is the study of the origin and descent of species, as well as their change over time. Recent developments in genome sequencing and the ubiquity of fast computers enable researchers to trace evolution of species by tracing changes in their DNA. CEB research from the pre-genome era involved building computational models of populations and watching their behavior over time.
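As an illustration of what a computational model of a population can look like, the following sketch simulates random genetic drift of one allele. It assumes the Wright-Fisher model (a choice made for this example, not one named in the text above), with a fixed population size, no selection, and no mutation; the function name and parameters are hypothetical.

```python
import random


def wright_fisher(pop_size: int, generations: int,
                  initial_freq: float = 0.5, seed: int = 0) -> list[float]:
    """Simulate the frequency of one allele under pure genetic drift.

    Each generation, every one of the `pop_size` gene copies is drawn at
    random from the previous generation's allele frequency.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    freq = initial_freq
    history = [freq]
    for _ in range(generations):
        carriers = sum(rng.random() < freq for _ in range(pop_size))
        freq = carriers / pop_size
        history.append(freq)
    return history


trajectory = wright_fisher(pop_size=100, generations=50)
print(trajectory[-1])
```

Running the simulation with different seeds or population sizes shows how strongly chance alone can change allele frequencies, especially in small populations.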
The field of genetic algorithms might be described as the rough inverse of CEB: rather than investigating evolution through computer programs, it aims to improve computer programs through evolutionary principles.
Gene expression analysis
The expression of many genes can be determined by measuring mRNA levels with multiple techniques including microarrays, expressed cDNA sequence tag (EST) sequencing, serial analysis of gene expression (SAGE) tag sequencing, massively parallel signature sequencing (MPSS), or by measuring protein concentrations with high-throughput mass spectrometry. All of these techniques are extremely noise-prone and/or subject to bias in the biological measurement, and a major research area in computational biology involves developing statistical tools to separate signal from noise in high-throughput (HT) gene expression studies. HT studies are often used to determine the genes implicated in a disorder: one might compare microarray data from cancerous epithelial cells to data from non-cancerous cells to determine which genes are up-regulated and down-regulated in the cancerous cells.
Expression data is also used to infer gene regulation: one might compare microarray data from a wide variety of states of an organism to form hypotheses about the genes involved in each state. In a single-cell organism, one might compare stages of the cell cycle, along with various stress conditions (heat shock, starvation, etc.). One can then apply clustering algorithms to that expression data to determine which genes are co-expressed. Further analysis could take a variety of directions: one 2004 study analyzed the promoter sequences of co-expressed (clustered together) genes to find common regulatory elements and used machine learning techniques to identify the promoter elements involved in regulating each cluster (see this study (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=15084257)).
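The clustering step can be sketched in a few lines. The example below assumes Pearson correlation as the similarity measure and a simple single-linkage grouping with a fixed threshold; both are illustrative choices and not the method of the study cited above. The gene names and expression values are invented for the example.

```python
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equally long expression profiles.

    Assumes neither profile is constant (a constant profile would give a
    division by zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coexpression_clusters(profiles: dict[str, list[float]],
                          threshold: float = 0.9) -> list[list[str]]:
    """Group genes whose profiles correlate at or above `threshold`.

    A gene joins (and merges) every existing cluster that contains at
    least one gene it correlates with, i.e. single-linkage grouping.
    """
    clusters: list[set[str]] = []
    for gene, profile in profiles.items():
        linked = [c for c in clusters
                  if any(pearson(profile, profiles[m]) >= threshold for m in c)]
        merged = {gene}.union(*linked)
        clusters = [c for c in clusters if c not in linked] + [merged]
    return sorted(sorted(c) for c in clusters)


# Invented expression levels measured under four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.1, 3.9, 2.0],
}
clusters = coexpression_clusters(profiles)
print(clusters)
```

Here geneA and geneB rise together while geneC and geneD fall together, so the sketch reports two clusters. Real studies typically use established methods such as hierarchical or k-means clustering on many more genes and conditions.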
Protein expression analysis
Protein microarrays and high-throughput (HT) mass spectrometry (MS) can provide a snapshot of the proteins present in a biological sample. Bioinformatics is very much involved in making sense of protein microarray and HT MS data. The former approach faces many of the same problems involved in examining microarrays targeted at mRNA, while the latter involves the bioinformatics problem of matching MS data against protein sequence databases.
Analysis of mutations in cancer
Massive sequencing efforts are currently underway to identify point mutations in a variety of genes in cancer. The sheer volume of data produced requires automated systems to read sequence data, and to compare the sequencing results to the known sequence of the human genome, including known germline polymorphisms.
Oligonucleotide microarrays, including comparative genomic hybridization and single nucleotide polymorphism arrays, able to probe simultaneously up to several hundred thousand sites throughout the genome are being used to identify chromosomal gains and losses in cancer. Hidden Markov Model and change-point analysis methods are being developed to infer real copy number changes from often noisy data. Further informatics approaches are being developed to understand the implications of lesions found to be recurrent across many tumors.
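The simplest form of change-point analysis looks for a single position at which the average signal shifts. The sketch below assumes exactly one change point and a least-squares criterion; it is an illustration rather than any of the published methods, and the function name and log-ratio values are invented. Methods used in practice must handle several change points and an unknown number of segments.

```python
def best_change_point(values: list[float]) -> int:
    """Return the index k that best splits `values` into two segments.

    The split values[:k] / values[k:] is chosen to minimise the total
    squared deviation of each segment from its own mean.
    """
    def sse(segment: list[float]) -> float:
        mean = sum(segment) / len(segment)
        return sum((v - mean) ** 2 for v in segment)

    return min(range(1, len(values)),
               key=lambda k: sse(values[:k]) + sse(values[k:]))


# Invented copy-number log-ratios along a chromosome: a noisy baseline
# followed by a region of gain.
log_ratios = [0.05, -0.02, 0.01, 0.03, -0.04, 0.61, 0.55, 0.58, 0.63, 0.57]
k = best_change_point(log_ratios)
print(k)
```

For this data the best split falls between the fifth and sixth probe, where the signal jumps from about 0 to about 0.6.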
Structure prediction
Main article: Protein structure prediction
Protein structure prediction is another important application of bioinformatics. The amino acid sequence of a protein, the so-called primary structure, can be easily determined from the sequence of the gene that codes for it. However, the protein can only function correctly if it is folded in a very special and individual way (if it has the correct secondary, tertiary and quaternary structure). The prediction of this folding just by looking at the amino acid sequence is quite difficult. Several methods for computer predictions of protein folding are currently (as of 2004) under development.
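The first step, reading the primary structure off a coding sequence, is straightforward to express in code. The sketch below assumes the standard genetic code and an intron-free coding sequence that starts in frame; the function name and example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cds: str) -> str:
    """Translate a coding DNA sequence into one-letter amino acid codes.

    Translation stops at the first stop codon; trailing bases that do
    not form a complete codon are ignored.
    """
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


protein = translate("ATGGCTAAATGA")
print(protein)
```

The example sequence encodes methionine, alanine, and lysine followed by a stop codon. Nothing comparable exists for the folding step, which is why structure prediction remains the hard part of the problem.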
One of the key ideas in bioinformatics research is the notion of homology. In the genomic branch of bioinformatics, homology is used to predict the function of a gene: if the sequence of gene A, whose function is known, is homologous to the sequence of gene B, whose function is unknown, one could infer that B may share A's function. In the structural branch of bioinformatics homology is used to determine which parts of the protein are important in structure formation and interaction with other proteins. In a technique called homology modelling, this information is used to predict the structure of a protein once the structure of a homologous protein is known. This currently remains the only way to predict protein structures reliably.
One example is the relationship between hemoglobin in humans and leghemoglobin in legumes. Both proteins transport oxygen in their respective organisms. Although their amino acid sequences differ greatly, their protein structures are virtually identical, which reflects their nearly identical functions.
Other techniques for predicting protein structure include protein threading and de novo (from scratch) physics-based modeling.
See also structural motif.
Preserving biodiversity
Bioinformatics is often used for preserving biodiversity. The most important information collected is the species names, descriptions, distributions, status and size of populations, habitat needs, and how each organism interacts with other species. This information is compiled with computer databases, accessed with software programs to find, visualize, and analyze the information automatically, and most importantly, communicated to other people, especially over the internet. DNA sequences of endangered species can be preserved, and names and descriptions of specimens living in captivity are stored in order to allow as much access to the information needed to preserve biodiversity as possible.
An example of this application is the Species 2000 (http://www.sp2000.org/) project. It is an internet-based global research project which intends to provide information about every known species of plant, animal, fungus, and microbe in existence to be the foundation for studies of global biodiversity. Anyone in the world will be able to find vast information about any known species from an array of participating databases.
Modeling biological systems
Main article: Systems biology
Systems biology involves the use of computer simulations of cellular subsystems (such as the networks of metabolites and enzymes which comprise metabolism, signal transduction pathways and gene regulatory networks) to both analyze and visualize the complex connections of these cellular processes. Artificial life or virtual evolution attempts to understand evolutionary processes via the computer simulation of simple (artificial) life forms.
Other applications
Morphometrics is used to analyze pictures of embryos to track and to predict the fate of cell clusters during morphogenesis.
Software tools
The computational biology tool best-known among biologists is probably BLAST, an algorithm for searching large sequence (protein, DNA) databases. NCBI provides a popular implementation that searches their massive sequence databases.
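BLAST relies on heuristics to search large databases quickly, and its internals are not reproduced here. The underlying task, finding the best-matching local region between two sequences, can however be illustrated with the exact Smith-Waterman dynamic programming algorithm. The sketch below computes only the best local alignment score; the scoring values (match 2, mismatch -1, gap -2) and the example sequences are assumptions chosen for illustration.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between `a` and `b` (linear gap cost)."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0,                   # start a new local alignment
                        diagonal,            # align a[i-1] with b[j-1]
                        prev[j] + gap,       # gap in b
                        curr[j - 1] + gap)   # gap in a
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


# Two invented sequences that share the region "ACGT".
score = smith_waterman_score("TTACGTTT", "GGACGTGG")
print(score)
```

The exact algorithm takes time proportional to the product of the two sequence lengths, which is too slow for searching whole databases and motivates heuristic tools such as BLAST.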
Computer scripting languages such as Perl and Python are often used to interface with biological databases and parse output from bioinformatics programs. Communities of bioinformatics programmers have set up free/open source projects such as EMBOSS, Bioconductor, BioPerl, BioPython, BioRuby, and BioJava which develop and distribute shared programming tools and objects (as program modules) that make bioinformatics easier.
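A typical small scripting task is reading sequences in the FASTA text format. The sketch below uses only the Python standard library rather than any of the projects listed above; the function name and the two records are hypothetical, and libraries such as BioPython offer more complete parsers.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA-formatted text into a mapping of identifier to sequence.

    The identifier is the first word after '>' on each header line;
    sequence lines belonging to a record are concatenated.
    """
    records: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            words = line[1:].split()
            current = words[0] if words else ""
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical two-record FASTA input.
example = """>seq1 hypothetical record
ATGGCT
AAATGA
>seq2
GGCC
"""
sequences = parse_fasta(example)
print(sequences)
```

The resulting dictionary can then be passed to other analysis code, for example to compute sequence lengths or base composition.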
See also
- biologically-inspired computing
- morphometrics
- metabolic network
- Important publications in bioinformatics
Related fields
- applied mathematics — biology — computer science — informatics — mathematical biology — theoretical biology
Bibliography
- R. Durbin, S. Eddy, A. Krogh and G. Mitchison, Biological sequence analysis. Cambridge University Press, 1998. ISBN 0521629713
- Kohane, et al. Microarrays for an Integrative Genomics. The MIT Press, 2002. ISBN 026211271X
- Mount, David W. Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press, May 2002. ISBN 0879696087
- JM. Claverie, C. Notredame, Bioinformatics for Dummies. Wiley, 2003. ISBN 0764516965
External links
- Software projects
- CIPRES Project: The Cyber-Infrastructure for Phylogenetic Research (http://www.phylo.org/)
- BIOMAP Project: Creating a Unified Global Map of various Macromolecular Biological Structures (http://biomap.org/)
- Proteome Ontology Project: An effort to build a Protein Ontology Specification, a part of BIOMAP Project (http://proteomeontology.org/)
- AMOS: a modular, open-source genome assembler (http://amos.sourceforge.net/)
- Bioinformatics.org: a portal and repository for open source bioinformatics software (http://bioinformatics.org/)
- Bioinformatics.ca: a portal to bioinformatics activities in Canada (http://www.bioinformatics.ca/)
- Bioconductor (http://www.bioconductor.org/)
- BioJava (http://www.biojava.org/)
- BioPerl (http://www.bioperl.org/)
- BioPython (http://www.biopython.org/)
- BioRuby (http://www.bioruby.org/)
- Biomolecular Interaction Network Database (http://www.bind.ca/)
- Seqhound (http://seqhound.blueprint.org)
- EMBOSS (http://emboss.sourceforge.net/)
- EnsEMBL (http://www.ensembl.org/)
- GMOD: The Generic Model Organism Database Project (http://www.gmod.org/)
- MANATEE: a web-based system for genome annotation and curation (http://manatee.sourceforge.net/)
- Organizations
- European Bioinformatics Institute (http://www.ebi.ac.uk/)
- National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/)
- European Molecular Biology Laboratory (http://www.embl.org/)
- Open Bioinformatics Foundation: umbrella non-profit organization focused on supporting open source programming in bioinformatics (http://www.open-bio.org/)
- The International Society for Computational Biology (http://www.iscb.org/)
- Canadian Bioinformatics Resource (http://www.cbr.nrc.ca/)
- The Blueprint Initiative (http://www.blueprint.org/)
- Directories
- GeneFinding.Org: a directory of gene finding systems and related tools (http://www.genefinding.org/)
- Bioinformatics.net — Software Tools Directory (http://www.bioinformatics.vg/)
- Other
- HARVESTER, bioinformatic meta search engine for proteins in human and mouse (http://harvester.embl.de)
- Human Genome Project and Bioinformatics (http://www.ornl.gov/TechResources/Human_Genome/research/informatics.html)
- Bioinformatics journal (http://bioinformatics.oupjournals.org/)
- BMC Bioinformatics journal (http://www.biomedcentral.com/bmcbioinformatics)
- Genome Canada: Canadian Bioinformatics Help Desk (http://gchelpdesk.ualberta.ca/servers/servers.php)
- The OpenScience Project (http://openscience.org/index.php?section=214)
- Books and articles on Bioinformatics from O'Reilly (http://bio.oreilly.com/)
- Bioinformatics News (http://www.bioinfo-online.net/)