Introduction

An introduction to bioinformatics

Before long, the DNA sequence of the complete human genome will have been determined. This achievement might seem an end in itself, but it is really only the beginning. A large number of bacterial genomes have already been fully sequenced, an outstanding scientific achievement that started in 1995 with the completion of the genome of the bacterium Haemophilus influenzae. The bacterial genomes were followed by that of the first eukaryotic organism, the unicellular genetic model system Saccharomyces cerevisiae, more commonly known as baker's yeast. In December 1998 the first multicellular organism was added to the list, the nematode Caenorhabditis elegans, which now provides us with information about functions unique to organisms of greater complexity. The sum of all this information is enormous, and its potential for our understanding of life processes, if properly explored, is far-reaching.

However, with modern genome-wide analytical technologies now available, such as microarray transcript analysis or global protein analysis utilising electrophoresis and mass spectrometry, it is biochemical and molecular biology data in general, rather than DNA sequence data alone, that is accumulating at a phenomenal rate. To exploit this wealth of information, a new field of science has arisen that fuses biology and medicine on one side with mathematics, statistics and computer science on the other. This new field is known as bioinformatics.

All sequence data is compiled in large international databases, soon to be followed by collections of, for example, expression data, protein-protein interaction data and phenotypic data for mutants. Straightforward access over the Internet means that a wealth of information is available, literally at our fingertips. The topics covered within bioinformatics range from retrieving and aligning DNA and protein sequences to predicting the structure and function of gene products. Here we will only briefly touch on the many facets of bioinformatics.

Genome and sequence analysis

Historically, the concept of bioinformatics was introduced to describe the task of handling, presenting and analysing large amounts of sequence data. Today, thanks to intense efforts at a number of large research centres throughout the world, this data can be accessed rather easily by anyone through World Wide Web servers on the Internet. As a consequence, screening these sequence databases for sequence homologues of a particular gene is now almost an everyday activity in most molecular biology labs. This is done not only to find homologues within a species but also to look for similar genes in other organisms, so-called orthologues. The discovery of numerous such orthologous groups of genes provides excellent support for the power of using model organisms. Sequence similarities are also used to cluster organisms according to their evolutionary relatedness and thus to create phylogenetic trees, an important tool in taxonomy. In parallel with the DNA sequencing effort, the location of genes on chromosomes is today determined in large-scale projects for a number of organisms, which provides information that needs to be efficiently handled and presented.
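
As a rough illustration of how sequence similarity can be turned into a measure of relatedness, the following minimal Python sketch computes the percentage of identical positions between sequences and reports the most similar pair. The sequences and species names are invented toy data, and the sequences are assumed to be already aligned to equal length; real database searches and phylogenetic methods use far more elaborate alignment and scoring schemes.

```python
from itertools import combinations

# Hypothetical, pre-aligned toy sequences of equal length (not real gene data).
ALIGNED = {
    "species_A": "ATGGCGTACGTTAGC",
    "species_B": "ATGGCGTACGTAAGC",
    "species_C": "ATGACGTTCGTAAGT",
}


def percent_identity(seq1, seq2):
    """Return the percentage of aligned positions at which the sequences agree."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(a == b for a, b in zip(seq1, seq2))
    return 100.0 * matches / len(seq1)


def pairwise_identities(aligned):
    """Map every unordered pair of sequence names to its percent identity."""
    return {
        (name1, name2): percent_identity(aligned[name1], aligned[name2])
        for name1, name2 in combinations(sorted(aligned), 2)
    }


def closest_pair(aligned):
    """Return the pair of names whose sequences are most similar."""
    scores = pairwise_identities(aligned)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    for pair, identity in pairwise_identities(ALIGNED).items():
        print(pair, round(identity, 1))
    print("most similar:", closest_pair(ALIGNED))
```

A table of pairwise identities like this one is the kind of input from which distance-based tree-building methods start; the sketch stops at the table and does not build a tree.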

From sequence to 3D structural prediction

For most macromolecules, function is closely linked to three-dimensional structure, perhaps most apparently for proteins and some RNA molecules. Recent technical developments can now provide us with a more detailed view of how molecules are folded. The experimental determination of these 3D structures is, however, a costly and slow process. Novel procedures for predicting the molecular fold from the primary sequence data are thus urgently needed. Since the protein structure ultimately carries the information about the enzymatic active site or the surface site for protein-protein interaction, knowledge of protein tertiary structure will be of fundamental importance for the pharmaceutical industry.

Analysis of genome-wide biomedical data and functional genomics

In the last couple of years, the advent of large-scale biomedical analysis tools has forever changed the way scientists in biology and medicine do research. These technologies make possible the simultaneous study of the expression of thousands of genes, at either the transcript or the protein level, of the thousands of possible protein-protein interactions in a cell, or of the phenotypes of thousands of mutants. All this data, regardless of type and format, has to be handled, presented and efficiently analysed. This challenge is already being explored by statisticians, for example by clustering similarly regulated genes. Such clustering information is currently being evaluated as a potentially useful way of predicting the function of uncharacterised genes as a follow-up to the genome projects, a research area called functional genomics. Ultimately, prediction of gene function will involve a more complex procedure, i.e. the integrated analysis of many types of large-scale molecular data into one tentative function for the gene under study. This latter task will of course also utilise information gained by applying the sequence analysis described above. In addition, genes with similar expression profiles may share consensus sequence elements in their regulatory regions. Identifying these sequences by automated computer methods, which is more difficult than finding clear similarities between the encoded proteins, will be a great challenge that can provide extremely useful information.
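
To make the idea of clustering similarly regulated genes more concrete, here is a minimal Python sketch that groups genes whose expression profiles are strongly correlated. The gene names, the expression values, the correlation threshold of 0.9 and the simple greedy grouping rule are all assumptions chosen for illustration; they are not a method described in this text, and statistical practice uses more robust clustering procedures.

```python
from math import sqrt

# Hypothetical expression profiles: one value per experimental condition.
PROFILES = {
    "gene1": [1.0, 2.0, 3.0, 4.0],
    "gene2": [2.0, 4.1, 5.9, 8.0],
    "gene3": [4.0, 3.0, 2.0, 1.0],
    "gene4": [8.0, 6.2, 3.9, 2.0],
}


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles.

    Assumes neither profile is constant (otherwise the variance is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def cluster_by_correlation(profiles, threshold=0.9):
    """Greedy grouping: a gene joins the first cluster in which it correlates
    with every member at or above the threshold, else it starts a new cluster."""
    clusters = []
    for name in sorted(profiles):
        for cluster in clusters:
            if all(pearson(profiles[name], profiles[m]) >= threshold
                   for m in cluster):
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


if __name__ == "__main__":
    print(cluster_by_correlation(PROFILES))
```

With this toy data the genes whose expression rises across the conditions end up in one group and those whose expression falls in another; an uncharacterised gene that lands in a group of well-studied genes would then be a candidate for sharing their function.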

Mathematical modelling of life processes

The vast amounts of data generated by the genome-wide analytical technologies will not only have to be clustered but also, more importantly, interpreted in a physiological context. To do so in a more sophisticated manner than is currently possible when handling thousands of information units, automated strategies have to be developed. This is a formidable task that involves modelling all processes in a cell at the molecular level. Initially, this task will be approached by modelling discrete parts of the cell's physiology, such as metabolic fluxes or regulatory networks. However, the integration of all these parts will in many ways be the ultimate challenge for bioinformatics and an important part of the final goal of biomedical science in general: the complete molecular understanding of a living organism.
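
As a very small example of what modelling a discrete part of metabolism can look like, the sketch below simulates a hypothetical two-step pathway A -> B -> C in Python. The pathway, the first-order (mass-action) rate laws, the rate constants and the simple Euler time-stepping are all assumptions made for illustration; realistic models involve many more components and more careful numerical methods.

```python
def simulate_pathway(a0=1.0, k1=0.5, k2=0.2, dt=0.01, steps=1000):
    """Simulate A -> B -> C with first-order kinetics using Euler steps.

    a0     : initial amount of A (B and C start at zero)
    k1, k2 : hypothetical rate constants for A -> B and B -> C
    dt     : time step; steps * dt is the total simulated time
    Returns the final amounts (a, b, c).
    """
    a, b, c = a0, 0.0, 0.0
    for _ in range(steps):
        flux1 = k1 * a  # flux through the reaction A -> B
        flux2 = k2 * b  # flux through the reaction B -> C
        a -= flux1 * dt
        b += (flux1 - flux2) * dt
        c += flux2 * dt
    return a, b, c


if __name__ == "__main__":
    a, b, c = simulate_pathway()
    print(f"A={a:.4f}  B={b:.4f}  C={c:.4f}  total={a + b + c:.4f}")
```

The total amount of material stays constant because every unit leaving one pool enters the next. Integrating many such sub-models, each with measured rather than assumed parameters, is where the real difficulty lies.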

Database building and management

Whatever type of information is being generated, analysed and finally interpreted, the data has to be presented to the scientific community through Internet-based World Wide Web servers. Presenting this data can be rather challenging, and the problems that arise range from formalising data submission to finding intelligent and clear ways of presentation. Database management is thus not only an engineering problem but also a clear scientific challenge.