Abstracts for the talks



 
 

Inbreeding and outbreeding, from animal mating behaviour to human disease

Bill Amos

Department of Zoology, University of Cambridge, UK

All organisms face the problem of getting rid of the (mainly) deleterious mutations which enter their genomes every generation. Ultimately, there must be preferential loss of individuals with more than average numbers of mutations. However, the way in which mutation number affects fitness remains unclear. Some theories suggest that there is fault tolerance, with only those individuals carrying more than some minimum number of mutations suffering reduced fitness. I present data which suggest that animals have diverse means by which to shed these unwanted problems. My results pose a number of interesting biological questions which may be addressed mathematically. An extension of these ideas relates to gene mapping of complex diseases. Many factors conferring disease susceptibility are recessive, and hence disease incidence will be expected to correlate with patterns of increased relatedness. I show data that indicate that this prediction holds, and suggest a number of avenues which would benefit from statistical input.

Experiences of sequencing microbial genomes

Siv G.E. Andersson

Department of Molecular Evolution, Uppsala University, Sweden

Co-authors: Cecilia Alsmark, Jan Andersson, Bjorn Canback, Olof Karlberg, Thomas Sicheritz, Asa Sjogren, Ivica Tamas, Charles Kurland


We sequence the genomes of microbial parasites and symbionts. Here, I discuss some of the challenges in bioinformatics associated with microbial genomics as well as some of the tools that we have developed to resolve these problems. A specific focus of this talk is on the biological insights gained from our genome sequence data concerning genome degradation and the origin of mitochondria. The 1.1 Mb genome of the obligate intracellular parasite Rickettsia prowazekii is exceptional in that it contains a dozen pseudogenes and has the highest proportion of non-coding DNA detected so far in a microbial genome. Here, I describe a systematic sequence analysis which confirms that the R. prowazekii genome represents degraded remnants of ancestral, inactivated genes that await final elimination from the genome. Based on phylogenetic reconstructions of Rickettsia and Bartonella proteins with more than 400 yeast mitochondrial proteins, we propose a new model for the origin of mitochondria, the so-called ox-tox hypothesis.

References: Andersson, SGE, Zomorodipour A, Andersson JO, Sicheritz-Ponten T, Alsmark ACM, Podowski RM, Näslund AK, Eriksson A-S, Winkler HH and Kurland CG. 1998. The genome sequence of Rickettsia prowazekii and the origin of mitochondria. Nature 396:133-140.


The hunt for BRCA3, BRCA4, ...

Pär-Ola Bendahl

Department of Oncology, University Hospital, Lund, Sweden

Breast cancer is one of the most common female malignancies in Western countries, affecting roughly 1 woman in 10 before the age of 80. Many environmental risk factors have been suggested, but none as strong as a family history of the disease. About 5-10% of all cases might be explained by hereditary predisposition whereas the remaining vast majority constitutes so-called sporadic cases. In 1994, the first breast cancer gene (BRCA1) was found, and a year later the second (BRCA2). Since then - mostly silence. A few additional breast cancer susceptibility genes, such as p53, CHK2, PTEN, and AR, have been described as involved in rare cancer syndromes, but these could explain no more than a few percent of all hereditary breast cancer. Mutation in BRCA1 or BRCA2 was initially thought to account for the majority of all hereditary breast cancer, but recent reports indicate a much lower proportion (50%). Consequently, additional genes of major importance are still to be revealed, and the hunt for BRCA3 is intense in both academic and private laboratories. A presumed genetic heterogeneity in the remaining uncharacterized families complicates the search, especially when families of different ethnic and geographic origins are analysed simultaneously. Research within the Nordic countries provides exceptional opportunities for localizing and cloning new breast cancer genes because of our homogeneous populations, local isolates and founder effects, and cancer registries. This has encouraged initiatives also in our own group.

References:

[1] Hall, JM., Lee, MK., Newman, B., Morrow, JE., Anderson, LA., Huey, B., King, MC. (1990). Linkage of early onset familial breast cancer on chromosome 17q. In: Science, 250, 1684-1689.

[2] Miki, Y., Swensen, J., Shattuck-Eidens, D., Futreal, PA., Harshman, K., Tavtigian, S., Liu, QU. et al (1994) Isolation of BRCA1, the 17q-linked breast and ovarian cancer susceptibility gene. In: Science, 266, 66-71.

[3] Wooster, R., Bignell, G., Lancaster, J., Swift, S., Seal, S., Mangion, J., Collins, N. et al (1995). Identification of the breast susceptibility gene BRCA2. In: Nature, 378, 789-792.


Linguistic complexity profiles for genomic sequences

Alexander Bolshoy

Institute of Evolution, University of Haifa, Israel

Co-authors: O. Troyanskaya, T. Mourier

Genomic sequences can be analyzed as linear texts. One fundamental characteristic of linear texts is complexity, which could be defined by methods based on either Kolmogorov complexity or Shannon entropy. Although many different definitions of complexity exist, a low-complexity, or 'simple', sequence is generally characterized by repetitiveness, which can represent biologically significant regions. One simple way to define the complexity of a sequence is the richness of its vocabulary: how many different subwords of length k (k-grams) appear in the sequence. Trifonov (1) first introduced this notion, henceforth known as linguistic complexity. We have already used it in (2,3). Here we used a modified version, wherein linguistic complexity is defined as the ratio of the actual number of all subwords to the maximum possible number of subwords. In our computer program, implicit suffix trees constructed by Ukkonen's algorithm were utilized to count the number of subwords in the string. The major goal of the project was to study patterns of sequence complexity around flanks of coding sequences. For almost all genomes, translated sequences were found to be more complex than non-translated ones. The novel result is that in the M. tuberculosis genome coding regions are on average simpler than non-coding regions. When looking at regions around flanks of coding sequences, a prokaryotic consensus profile of linguistic complexity is observed. This profile is followed by 17 of 21 prokaryotic genomes. Two eukaryotic genomes were processed as well: the genomes of S. cerevisiae and C. elegans. Within both genomes, a very similar profile of complexity around the coding regions seemed to be shared by all chromosomes. The profiles, however, differ between the two genomes, and show an even larger degree of difference when compared to the prokaryotic profiles.
We speculate that the different complexity profiles between prokaryotes and eukaryotes should be related to differences in the translational machinery.
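The ratio described above can be illustrated with a small sketch. It is not the authors' program: instead of suffix trees built by Ukkonen's algorithm it counts distinct subwords naively with sets, which is only practical for short windows. The function name and the default alphabet size of 4 are assumptions made for illustration.

```python
def linguistic_complexity(seq, alphabet_size=4):
    """Ratio of distinct subwords observed in seq to the maximum possible number.

    Naive set-based counting (roughly cubic in len(seq)); a suffix tree would be
    needed for genome-scale input.
    """
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must be non-empty")
    observed = 0
    maximum = 0
    for k in range(1, n + 1):
        # Distinct k-grams actually present in the sequence.
        observed += len({seq[i:i + k] for i in range(n - k + 1)})
        # At most alphabet_size**k words of length k exist,
        # and at most n - k + 1 of them fit into the sequence.
        maximum += min(alphabet_size ** k, n - k + 1)
    return observed / maximum
```

Under this definition a window in which every subword is distinct, such as "ACGT", scores 1, while a homopolymer run such as "AAAA" scores 4/10.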

References:

1. Trifonov, E.N. Making Sense of the Human Genome. in Structure & Methods, Vol. 1 (eds. Sarma, R.H. & Sarma, M.H.) 69-77 (Adenine Press, Albany, 1990).

2. Bolshoy, A., Shapiro, K., Trifonov, E.N. & Ioshikhes, I. Enhancement of the nucleosomal pattern in sequences of lower complexity. Nucl. Acids Res 25, 3248-3254 (1997).

3. Gabrielian, A.E. & Bolshoy, A. Sequence complexity and DNA curvature. Comput Chem. 23, 263-274 (1999). 


Data reduction in the analysis of gene expression data

Per Broberg

AstraZeneca R&D Lund, Sweden

Co-authors: Robert Virtala, Jack Gauldie and Stefan Pierrou

Array technologies are becoming a workbench for securing a supply of drug targets in the pharmaceutical industry. This article shows how data reduction techniques from other fields may be useful in this context, and may address the problems of high noise levels and high dimensionality beyond the reach of the human mind. The techniques are Haar wavelet filtering, calculation of tmax, and Principal Components Analysis (PCA) on residuals. The ideas are implemented in simple computer programs, and are tested against manual sorting, which is both cumbersome and dependent on the individual.
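As a rough illustration of Haar wavelet filtering, the sketch below transforms an expression profile, zeroes small detail coefficients, and transforms back. The abstract does not say how coefficients are selected, so the hard threshold, the function names, and the power-of-two length restriction are all assumptions.

```python
import math

SQRT2 = math.sqrt(2)

def haar_forward(x):
    """Orthonormal Haar transform of a list whose length is a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(x)
    length = n
    while length > 1:
        half = length // 2
        averages = [(out[2 * i] + out[2 * i + 1]) / SQRT2 for i in range(half)]
        details = [(out[2 * i] - out[2 * i + 1]) / SQRT2 for i in range(half)]
        out[:length] = averages + details
        length = half
    return out

def haar_inverse(coeffs):
    """Invert haar_forward."""
    out = list(coeffs)
    length = 1
    while length < len(out):
        merged = []
        for a, d in zip(out[:length], out[length:2 * length]):
            merged.append((a + d) / SQRT2)
            merged.append((a - d) / SQRT2)
        out[:2 * length] = merged
        length *= 2
    return out

def haar_filter(x, threshold):
    """Hard-threshold the detail coefficients (assumed rule) and reconstruct."""
    coeffs = haar_forward(x)
    kept = [coeffs[0]] + [c if abs(c) >= threshold else 0.0 for c in coeffs[1:]]
    return haar_inverse(kept)
```

With a threshold of 0 the profile is returned unchanged; with a very large threshold every point collapses to the profile mean.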

References:

Eisen, M.B., Spellman, P.T., Brown, P.O. and Botstein, D. (1998) Cluster analysis and display of genome-wide expression patterns, Proc. Natl. Acad. Sci. USA, Vol. 95, pp 14863-14868

Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E.S. and Golub, T.R. (1999) Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation, Proc. Natl. Acad. Sci. USA, Vol. 96, pp 2907-2912


Determining the proportions of inversions and transpositions in genome rearrangements in Chlamydia

Kimmo Eriksson

Department of Mathematics and Physics, Mälardalens högskola, Sweden

Co-authors: Niklas Eriksen, Daniel Dalevi, Siv Andersson

The theoretical genome rearrangement problem (sorting a permutation by transpositions and inversions) has attracted a good deal of attention from mathematicians and computer scientists. With the advent of completely sequenced data from closely related genomes, we expect many previously unknown properties of genome rearrangements to be uncovered, opening new questions for mathematicians and biologists alike.

The whole genome sequences of two chlamydian parasites, Chlamydia trachomatis and Chlamydia pneumoniae, have recently been published, cf. Kalman et al. A comparative analysis of rearrangement events in the two genomes reveals a large proportion of transpositions and inversions of segments of very small length, consisting of just one or two genes. This phenomenon calls for modified methods for estimating the number of different rearrangement events. In this project we have modified the Derange II computer program of Blanchette et al. We then use a simulation technique for determining appropriate parameters.
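The two rearrangement events can be made concrete on a signed permutation, where each integer is a gene and its sign is the strand. The sketch below is unrelated to the Derange II code; the function names are hypothetical, and the breakpoint count is the usual signed-adjacency definition, included only as a simple measure of disorder.

```python
def invert(perm, i, j):
    """Inversion: reverse perm[i:j] and flip the orientation of each gene."""
    return perm[:i] + [-g for g in reversed(perm[i:j])] + perm[j:]

def transpose(perm, i, j, k):
    """Transposition: move the block perm[i:j] so that it follows perm[j:k]."""
    if not 0 <= i < j < k <= len(perm):
        raise ValueError("need 0 <= i < j < k <= len(perm)")
    return perm[:i] + perm[j:k] + perm[i:j] + perm[k:]

def breakpoints(perm):
    """Adjacencies that differ from the identity 1..n, framed by 0 and n+1."""
    framed = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for a, b in zip(framed, framed[1:]) if b - a != 1)
```

An inversion of a segment one or two genes long, as reported for the Chlamydia comparison, corresponds to `j - i` equal to 1 or 2.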

References: Blanchette, M., Kunisawa, T., Sankoff, D., Parametric genome rearrangements, Gene, 172:GC 11-17, 1996.
Kalman, S. et al., Comparative genomes of Chlamydia pneumoniae and Chlamydia trachomatis, Nature Genetics, 21:385-389, 1999.


Evolutionary inference from gene trees

Bob Griffiths

Department of Statistics, University of Oxford, UK.

Co-authors: S. Tavare, M. Bahlo.

A unique gene tree can be constructed from a sample of DNA sequences if mutations are point mutations occurring only once at sites. A coalescent process model for the evolution of sequences implies a stochastic distribution for gene trees. Evolutionary questions of interest are estimating parameters in the model, such as the mutation rate; computing the distribution of the time to the most recent common ancestor and ages of mutations; and demographic questions. Approaches to answering these questions have been computationally intensive, based on importance sampling on genealogies and MCMC. Recent research by Stephens and Donnelly has produced a much improved importance sampling scheme.
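For orientation, the standard coalescent for a constant-size population can be simulated directly: while k lineages remain, the waiting time to the next coalescence is exponential with rate k(k-1)/2 in coalescent time units. The sketch below is plain Monte Carlo, not the importance sampling or MCMC schemes mentioned above, and the function name is an assumption.

```python
import random

def simulate_coalescent_times(n, rng):
    """Return (time to most recent common ancestor, total branch length)
    for a sample of n sequences under the standard coalescent."""
    if n < 2:
        raise ValueError("need at least two sequences")
    tmrca = 0.0
    total_length = 0.0
    for k in range(n, 1, -1):
        # With k lineages, pairs coalesce at total rate k choose 2.
        wait = rng.expovariate(k * (k - 1) / 2)
        tmrca += wait
        total_length += k * wait
    return tmrca, total_length

def mean_times(n, replicates, seed=1):
    """Average both quantities over independent replicate genealogies."""
    rng = random.Random(seed)
    draws = [simulate_coalescent_times(n, rng) for _ in range(replicates)]
    mean_tmrca = sum(d[0] for d in draws) / replicates
    mean_length = sum(d[1] for d in draws) / replicates
    return mean_tmrca, mean_length
```

The averages can be checked against the known expectations 2(1 - 1/n) for the time to the most recent common ancestor and 2(1 + 1/2 + ... + 1/(n-1)) for the total branch length.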

References:

Griffiths, R. C. and Tavare, S. (1994) Simulating probability distributions in the coalescent. Theor. Popul. Biol., 46, 131-159.

Griffiths, R. C. and Tavare, S. (1999) The ages of mutations in gene trees. Ann. Appl. Prob. 9, 567-590.

Bahlo, M and Griffiths, R. C. (2000) Inference from gene trees in a subdivided population. Theor. Popln. Biol. (To appear.)

Griffiths, R. C. (2000) Ancestral inference from gene trees. in 'Genes, Fossils, and Behaviour: an Integrated Approach to Human Evolution', Donnelly, P and Foley, R (eds) IOS Press,(To appear.)

Stephens, M and Donnelly, P. (2000) Inference in molecular population genetics. J. R. Statist. Soc. B (To appear.)

The last three papers can be downloaded from www.stats.ox.ac.uk/mathgen/publications.html 


Simulating dense SNP-maps in LD-mapping

Arndt von Haeseler

Max Planck Institute for Evolutionary Anthropology, Germany

Co-author: Sebastian Zoellner

Recent developments of methods for locating disease genes in the human genome have focused on mapping frequent diseases with complex inheritance patterns. As family studies face significant problems for these complex diseases, association studies are considered to be a superior approach. In particular, single nucleotide polymorphisms (SNPs) are considered useful tools in association analysis, as these markers are easy to type and are plentiful throughout the genome.

As of now, it is contestable how dense a map of these diallelic markers has to be to guarantee successful mapping attempts. This question is central to the setup of a mapping attempt as on the one hand the costs depend mostly on the number of markers that have to be typed. On the other hand, the probability of successful mapping is improved if more markers are used. As any mapping endeavor is quite costly, preliminary theoretical studies can help to choose a powerful marker map.

We simulate the evolution of linkage disequilibrium (LD) with a whole map of SNPs in a population by using a powerful coalescent approach. Contrary to other studies, we include the possibility that a SNP further away from the disease locus displays a higher LD than a SNP close to the locus. This effect can often be observed in linkage data. By taking this observation into account, the power to locate a disease gene is substantially increased, as several SNPs in the vicinity of the disease gene are evaluated and it is sufficient if one of them shows significant LD. We show that the biggest gain in power is achieved if the states of the markers are in mutual linkage equilibrium.

Moreover, the power depends on the demographic history of the population. In constant populations, LD will be observed over longer distances than in expanding populations. Therefore, a small constant population is appropriate for roughly mapping the general area of a disease mutation, as a sparser map is sufficient. For the same reason it will be impossible to pinpoint a disease mutation in such a population. On the other hand, a large expanding population frequently exhibits LD only between very close loci, so a very dense map is necessary to find LD. But once a marker showing LD is found, it is very likely close to the disease gene, so this type of population is optimal for fine mapping. In summary, we show that mapping setups that use dense SNP-maps will have a higher power than originally predicted, especially if they use SNPs in mutual linkage equilibrium. We also show that rough-mapping the region of a gene requires a different population history than fine-mapping the exact position of the gene.
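For readers less familiar with LD, the two most common pairwise measures for diallelic loci can be computed from haplotype counts as sketched below. The abstract does not state which LD statistic the simulations use, so the choice of D and r-squared, like the function name, is an assumption.

```python
def ld_statistics(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, r_squared) for two diallelic loci from haplotype counts.

    D = p_AB - p_A * p_B; r_squared = D**2 / (p_A (1 - p_A) p_B (1 - p_B)).
    """
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("no haplotypes supplied")
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total
    p_B = (n_AB + n_aB) / total
    d = p_AB - p_A * p_B
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    # r_squared is undefined when either locus is monomorphic in the sample.
    r_squared = d * d / denom if denom > 0 else float("nan")
    return d, r_squared
```

Two markers in linkage equilibrium give D = 0, whereas two markers whose alleles always occur together give r-squared = 1.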


Domain decomposition of protein structures

Liisa Holm

EMBL-EBI, Cambridge CB10 1SD, UK


The rapid growth in the number of experimentally determined three-dimensional protein structures has sharpened the need for comprehensive and up-to-date surveys of known structures. Classic work on protein structure classification has made it clear that a structural survey is best carried out at the level of domains, i.e., substructures that recur in evolution as functional units in different protein contexts. We present a method for automated domain identification from protein structure atomic coordinates based on quantitative measures of compactness and, as the new element, recurrence. Compactness criteria are used to recursively divide a protein into a series of successively smaller and smaller substructures. Recurrence criteria are used to select an optimal size level of these substructures, so that many of the chosen substructures are common to different proteins at a high level of statistical significance. The joint application of these criteria automatically yields consistent domain definitions between remote homologs, a result difficult to achieve using compactness criteria alone. The method is applied to a representative set of 1,137 sequence-unique protein families covering 6,500 known structures. Clustering of the resulting set of domains (substructures) yields 594 distinct fold classes (types of substructures). The Dali Domain Dictionary (http://www.embl-ebi.ac.uk/dali/) not only provides a global structural classification, but also a comprehensive description of families of protein sequences grouped around representative proteins of known structure. The classification will be continuously updated and can serve as a basis for improving our understanding of protein evolution and function and for evolving optimal strategies to complete the map of all natural protein structures.

References: Holm L, Sander C (1998) Dictionary of recurrent domains in protein structures. Proteins 33, 88-96.


Probabilistic models of DNA sequence evolution with context dependent rates of substitution

Jens Ledet Jensen and Anne-Mette Krabbe Pedersen

Dept. of Theoretical Statistics, University of Aarhus, Denmark

We consider Markov processes of DNA sequence evolution in which the instantaneous rates of substitution at a site are allowed to depend upon the states at the sites in a neighbourhood of the site at the instant of the substitution. We characterize the class of Markov process models of DNA sequence evolution for which the stationary distribution is a Gibbs measure, and give a procedure for calculating the normalizing constant of the measure. We develop an MCMC method for estimating the transition probability between sequences under models of this type. Finally, we analyze an alignment of two HIV-1 gene sequences using the developed theory and methodology.


From structural comparisons to drug discovery.

Mark Johnson

Department of Biochemistry and Pharmacy, University of Turku, Finland

The genomic sequences containing the nearly 100,000 genes coding for proteins in humans are close to being completed. For full exploitation of these data, the structure of the proteins themselves will be determined either directly or by modeling. Structures and their comparisons provide key information needed to understand a protein's function and to exploit proteins, for example, in the design of novel pharmaceuticals. I will focus on structural comparisons of identical ligands bound to different protein folds, which provide us with key knowledge on the critical interactions responsible for molecular recognition.


Mathematical models for evolution of SNPs: Anonymous loci versus disease-related haplotypes.

Marek Kimmel

Department of Statistics, Rice University, USA

Co-authors: Penelope Bonnen, Ranajit Chakraborty, Ranjan Deka, Li Jin, David Nelson, Alexander Renwick, Dimitra Trikka, Ning Wang.

Single-nucleotide polymorphisms (SNPs) are considered an important new tool in the study of molecular evolution and gene mapping. We developed two mathematical models, one based on a modification of the infinite sites model, and the other based on a two-state Markov process, which allow numerical predictions of various characteristics of SNPs under different demographic scenarios. These two models lead to different predictions of frequency distributions of SNPs and different predictions of the ascertainment bias that arises when SNPs derived from one population are typed in another population.

Theoretical predictions are illustrated by two recent data sets. The first includes distributions of nearly 400 SNP loci, mostly anonymous, i.e. not associated with known genes. Each of these loci was originally screened in Caucasians, but has now been studied in six diverse populations. It seems that the two-state Markov model predicts the observed frequency distribution and ascertainment bias in the data better than the modified infinite sites model.

The second data set includes several SNP haplotypes, typed in about 300 Americans of diverse ethnic origins, located in the vicinity of genes implicated in familial cancers. The haplotypes were identified from genotypes using the EM algorithm as well as an original method, which in addition allows estimation of the intensity of recombination. Patterns of haplotype distributions and linkage disequilibrium show a distinctive variability from one locus to another. Frequencies of main haplogroups differ from one population to another. Neither the infinite sites model nor the two-state Markov model seems to account for the patterns observed.

In view of these findings, we also discuss possible concepts of population-based association studies, using SNP haplotypes.
(Research supported by NIH grants CA75432, GM 41399, GM 53545 and GM 45861, and by the Keck's Center for Computational Biology at Rice University). 


A model for predictive mixtures and for classification of sequences

Timo Koski

Department of Mathematics, Royal Institute of Technology (KTH), Sweden
and
Department of Mathematics, University of Turku, Finland

Co-author: Mats Gyllenberg

Detection of protein sequence homologies can be done by using mixtures of Dirichlet distributions. These are statistical models for motifs in multiple alignments of protein sequences. We derive this mixture using an assumption of infinite exchangeability and predictive sufficiency. By this argument it is immediate that we are dealing with predictive classification of protein sequences in the sense of predicting a portion of a sequence based on a motif. Finally a result about the distribution of the score based on an exchangeable representation is outlined.

References:

K. Sjölander, K. Karplus et al.: Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. COMPUTER APPLICATIONS IN BIOLOGICAL SCIENCES, 12, 1996, pp. 327 - 345.

J.S. Liu and C.E. Lawrence: Bayesian inference on biopolymer models. BIOINFORMATICS, 15, 1999, pp. 38 - 52.

J.T.L. Wang, Th.G. Marr et al.: Complementary classification approaches for protein sequences. PROTEIN ENGINEERING, 9, 1996, pp. 381 - 386.


New algorithms for the duplication-loss model

Jens Lagergren

Stockholm Bioinformatics Center, KTH, Sweden

Co-author: Mike T. Hallett

We consider the problem of constructing a species tree given a number of gene trees. In the frameworks introduced by Goodman et al. [1], Page [3], and Guigó, Muchnik, and Smith [2] this is formulated as an optimization problem; namely, that of finding the species tree requiring the minimum number of duplications and/or losses in order to explain the gene trees.

We introduce the Width k Duplication-Loss and Width k Duplication problems. A gene tree has width k w.r.t. a species tree if the species tree can be reconciled with the gene tree using at most k simultaneously active copies of the gene along its branches. We explain, w.r.t. the underlying biological model, why this width is typically very small in comparison to the total number of duplications and losses. We show polynomial time algorithms for finding optimal species trees having bounded width w.r.t. at least one of the input gene trees. Furthermore, we present the first algorithm for input gene trees that are unrooted. Lastly, we apply our algorithms to a dataset from [2] and show a species tree requiring significantly fewer duplications and fewer duplications/losses than the trees given in the original paper.
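The cost being minimised can be illustrated for a single fixed species tree. The sketch below maps each gene-tree node to the least common ancestor of its children's images in the species tree and counts a duplication wherever a node maps to the same species-tree node as one of its children. This is only the classical scoring step, not the bounded-width algorithms of the talk; the nested-tuple tree encoding and all function names are assumptions.

```python
def species_parents(tree, parent=None, table=None):
    """Map every node of a nested-tuple species tree to its parent.

    Leaves are species names (strings); internal nodes are 2-tuples.
    """
    if table is None:
        table = {}
    table[tree] = parent
    if isinstance(tree, tuple):
        for child in tree:
            species_parents(child, tree, table)
    return table

def _ancestors(node, parents):
    path = []
    while node is not None:
        path.append(node)
        node = parents[node]
    return path

def lca(a, b, parents):
    """Least common ancestor of two species-tree nodes."""
    seen = set(_ancestors(a, parents))
    for node in _ancestors(b, parents):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")

def count_duplications(gene_tree, parents):
    """Return (image of the gene-tree root, number of duplications)."""
    if not isinstance(gene_tree, tuple):
        if gene_tree not in parents:
            raise ValueError("unknown species: %r" % (gene_tree,))
        return gene_tree, 0
    left, right = gene_tree
    m_left, d_left = count_duplications(left, parents)
    m_right, d_right = count_duplications(right, parents)
    here = lca(m_left, m_right, parents)
    # A node is a duplication if it maps to the same place as a child.
    is_duplication = here == m_left or here == m_right
    return here, d_left + d_right + (1 if is_duplication else 0)
```

For the species tree ((A,B),C), a gene tree with the same shape needs no duplication, while the gene tree ((A,C),B) needs one.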

References:

[1] Goodman, M. et al. (1979) Fitting the Gene Lineage into its Species Lineage: A parsimony strategy illustrated by cladograms constructed from globin sequences, Syst.Zool. 28.

[2] Guigó, R. et al. (1996) Reconstruction of Ancient Molecular Phylogeny. Molec. Phylogenet. and Evol., 6(2), pp. 189--213.

[3] Page, R. (1994) Maps between trees and cladistic analysis of historical associations among genes, organisms, and areas. Syst. Biol., 43, p. 58--77.


In search of an evolutionary coding style

Torbjörn Lundh

Department of Mathematics, Chalmers University of Technology, Sweden
 
In the very near future, all the human genes will be identified. But understanding the functions coded in the genes is a much harder problem. For example, by using block entropy, one finds that the DNA code is closer to a random code than written text is, which in turn is less ordered than ordinary computer code.
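Block entropy here can be read as the Shannon entropy of the empirical distribution of length-k blocks. The sketch below assumes overlapping blocks and base-2 logarithms; the preprint may use a different convention, and the function name is hypothetical.

```python
import math
from collections import Counter

def block_entropy(text, k):
    """Shannon entropy (bits) of the overlapping length-k blocks of text."""
    if k < 1 or k > len(text):
        raise ValueError("k must be between 1 and len(text)")
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())
```

A text that repeats one symbol has block entropy 0 for every k, while a text in which all four nucleotides are equally frequent has block entropy 2 bits at k = 1.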

Instead of saying that the DNA is badly written, using our programming standards, we might say that it is written in a different style --- an evolutionary style.

We will suggest a way to search for such a style in a quantified manner by using an artificial life program, and by giving a definition of general codes and a definition of style for such codes. For more details, see www.math.sunysb.edu/cgi-bin/preprint.pl?ims00-03.


Disease gene identification and the use of the new bioinformatics data: A practical approach

Tommy Martinsson

Department of Clinical Genetics, Sahlgrenska University Hospital/East, Göteborg, Sweden

Co-author: Jan Wahlström


Our group at the Department of Clinical Genetics has in recent years taken part in several collaborative projects aimed at localization, identification and characterization of genes causing diseases in man. For these purposes we have developed and integrated tools necessary for gene localization and for gene characterization, e.g. a vast array of mutation detection techniques. Among the projects studied by our group are the genetics of the Carbohydrate-deficient glycoprotein syndrome type IA (CDG1a, PMM2), an Autosomal Dominant Inclusion body myopathy (IBM3), Psoriasis (PSORS5) and the childhood cancer Neuroblastoma (NBS). The first steps in gene hunting generally involve linkage studies or other studies where the gene(s) are mapped to a position as detailed as the patient/family materials allow. Later steps involve the use of the large amount of genetic information now available on the internet to researchers worldwide, e.g. that of the large human sequencing projects. This presentation will give some practical examples of how we integrate and make use of the new information in our gene hunting projects.

References: Martinsson, T., Darin, N., Kyllerman, M., Oldfors, A., Hallberg, B., and Wahlström, J., 1999, Dominant hereditary inclusion body myopathy gene (IBM3) maps to chromosome region 17p13.1. Am. J. Hum. Genet. 64:1420-1426.

Bjursell, C., Stibler, H., Wahlström, J., Kristiansson, B., Skovby, F., Strömme, P., Blennow, G., Martinsson, T., 1997, Fine Mapping of the Gene for Carbohydrate Deficient Glycoprotein Syndrome, Type 1 (CDG1); Linkage Disequilibrium and Founder Effect in Scandinavian Families. Genomics 39: 247-253.

Enlund, F., Samuelsson, L., Enerbäck, C., Inerot, A., Wahlström, J., Yhr, M., Torinsson, Å., Riley, J., Swanbeck, G. and Martinsson, T., 1999, Psoriasis susceptibility locus in chromosome region 3q21; identified in patients from southwest Sweden. Eur. J Hum Genet 7:783-790.

Ejeskär, K., Abel, F., Sjöberg, R.M., Bäckström, J., Kogner, P. and Martinsson, T., 2000, Fine mapping of the human preprocortistatin gene (CORT) to neuroblastoma consensus deletion region 1p36.2-3, but absence of mutations in primary tumors. Cytogenetics and cell genetics, (in press).


Bayesian inference of pedigrees

Petter Mostad

Norwegian Computing Center, Norway

Co-author: Thore Egeland


In a number of diverse situations, there is a need to establish the correct familial relationship between a group of individuals. The situations can range from paternity cases, family reunification in connection with immigration, and identification of disaster victims, to animal breeding. The data available may be DNA typing of some selected genes of some or all of the individuals involved, and in general additional data in the form of assumed or probable relationships between some individuals. When finding the most probable pedigree or pedigrees relating the individuals, it is important, especially when legal issues are involved, to take into account all conceivable factors that may affect the probabilities.

We will describe a general method for solving this problem in a Bayesian framework. A prior probability distribution on pedigrees is established, possibly using non-DNA case data, or even general information about likely breeding patterns in the population. The data from the typed genes are then introduced using a model taking into account not only the combinatorial complexities of the pedigrees and population frequencies of the relevant alleles, but also the probability of mutations and the amount of kinship. One may then compute posterior probabilities and identify the most probable pedigree or pedigrees, and one may also investigate how sensitive those results are to the assumptions made. The method is implemented in a computer program called 'Familias', and simpler versions of this program have already been used in a number of practical applications in several countries.
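The Bayesian step can be illustrated in the simplest case: two candidate pedigrees (an alleged father is the true father, or he is unrelated), one locus, and no mother typed. The sketch below is not the Familias implementation: it ignores mutation and kinship, which the method above includes, and the allele frequencies, genotypes, and function names are hypothetical.

```python
def likelihood_unrelated(child, freq):
    """P(child genotype) under Hardy-Weinberg proportions."""
    a, b = child
    return freq[a] ** 2 if a == b else 2 * freq[a] * freq[b]

def likelihood_paternity(child, father, freq):
    """P(child genotype | father genotype), the other allele drawn from the population."""
    a, b = child
    total = 0.0
    for transmitted in father:  # each paternal allele is passed on with probability 1/2
        if a == b:
            total += 0.5 * (freq[a] if transmitted == a else 0.0)
        else:
            total += 0.5 * ((freq[b] if transmitted == a else 0.0)
                            + (freq[a] if transmitted == b else 0.0))
    return total

def posterior(priors, likelihoods):
    """Normalise prior times likelihood over the candidate pedigrees."""
    weights = {h: priors[h] * likelihoods[h] for h in priors}
    norm = sum(weights.values())
    if norm == 0:
        raise ValueError("data impossible under every candidate pedigree")
    return {h: w / norm for h, w in weights.items()}

# Hypothetical single-locus case with made-up allele frequencies.
freq = {"A": 0.1, "B": 0.2, "C": 0.7}
child, man = ("A", "B"), ("A", "B")
result = posterior(
    {"paternity": 0.5, "unrelated": 0.5},
    {"paternity": likelihood_paternity(child, man, freq),
     "unrelated": likelihood_unrelated(child, freq)},
)
```

Changing the prior in the last call is a quick way to see how sensitive the posterior is to that assumption.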

References:

T. Egeland, P. Mostad, and B. Olaisen. Computerized probability assessments of family relations. Science and Justice, pages 269-275, 1997.

B. Olaisen, M. Stenersen, and B. Mevåg. Identification by DNA analysis of the victims of the August 1996 Spitsbergen civil aircraft disaster. Nature Genetics, 15, 1997.


 

A new method for modelling protein evolution

Tobias Müller

Theoretical Bioinformatics (TBI), German Cancer Research Center (DKFZ), Germany

Co-author: Martin Vingron

The estimation of amino acid replacement frequencies during molecular evolution is crucial for many applications in sequence analysis. Score matrices for database search programs or phylogenetic analysis rely on such models of protein evolution. Pioneering work was done by M. Dayhoff, who formulated a Markov model of evolution and derived the PAM score matrices. Her estimation procedure for amino acid exchange frequencies is restricted to pairs of proteins that have a constant and small degree of divergence. Here we present an improved estimator, called the resolvent method, that is not subject to these limitations. This extension of Dayhoff's approach enables us to estimate an amino acid substitution model from alignments of varying degree of divergence. Extensive simulations show the capability of the new estimator to recover accurately the exchange frequencies among amino acids. Based on the SYSTERS database of aligned protein families we recompute a series of score matrices.
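As background to the Markov model mentioned above: in Dayhoff's scheme the matrix for n PAM units is the n-th power of the one-unit transition matrix. The sketch below shows only this extrapolation step on a toy two-letter alphabet with made-up probabilities; it does not implement the resolvent estimator, and the function names are assumptions.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, n):
    """n-th power of a square matrix by repeated multiplication (n >= 0)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result

# Toy one-step transition matrix; each row sums to 1 (hypothetical values).
step = [[0.99, 0.01],
        [0.02, 0.98]]
after_250_steps = mat_pow(step, 250)
```

Each row of a power of a stochastic matrix still sums to 1, so every extrapolated matrix remains a valid set of replacement probabilities.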

References:

[1] Dayhoff, M., Schwartz, R., & Orcutt, B. (1978). A model of evolutionary change in protein. In: Atlas of Protein Sequences and Structures, 5, 345 -352.

[2] Jones, D.T., Taylor, W.R., & Thornton, J.M. (1992). The rapid generation of mutation data matrices from protein sequences. In: CABIOS, 8, 275-282.

[3] Benner, S., Cohen, M., & Gonnet, G. (1994). Amino acid substitution during functionally constrained divergent evolution of protein sequences. 


Statistician at a department of clinical genetics

Staffan Nilsson

Mathematical Statistics, Chalmers University of Technology, Sweden

As a statistician at the department of Clinical Genetics, SU/Östra, I support the activities in ongoing large scale linkage and association analysis for coeliac disease and psoriasis, but also a variety of other issues, e.g. ethical aspects of genetic information & insurance, paternity testing and planning of new studies and teaching. I will describe some of these tasks, with the ambition to illustrate the pleasant diversity of subjects. 


Automated construction of patterns for protein sequence classification

Björn Olsson

Department of Computer Science, University of Skövde, Sweden

Co-author: Kim Laurio

Manual and semi-manual construction of libraries of patterns, such as PROSITE, is a tedious process involving careful analysis of protein function and manual selection of suitable motifs to be modelled by patterns. In this talk, we show how this process can be fully automated, and yet result in patterns with higher classification accuracy than those in the PROSITE library.

The automated system relies on building accurate multiple alignments, as well as on suitable heuristics for selection of alignment columns from which to build the pattern. These heuristics involve the use of mixtures of Dirichlet distributions to estimate the true distribution corresponding to the alignment column, followed by the calculation of entropy of each estimated distribution as a guide for selection of the most "informative" columns.
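
The column-scoring heuristic described above can be sketched roughly as follows: counts from one alignment column are combined with a Dirichlet mixture prior to give a posterior mean estimate of the residue distribution, and columns are then ranked by the entropy of that estimate. The 4-letter alphabet, the two mixture components and the function names are hypothetical; real mixtures such as those of Sjölander et al. have 20-dimensional components with fitted parameters.

```python
import math

def log_beta(alpha):
    """Logarithm of the multivariate Beta function."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))

def posterior_mean(counts, mixture):
    """Posterior mean residue distribution for one column.

    mixture is a list of (weight, alpha) pairs; alpha has one entry per letter.
    """
    # Log of weight * marginal likelihood for each component (the multinomial
    # coefficient is common to all components and cancels).
    log_w = []
    for weight, alpha in mixture:
        post = [c + a for c, a in zip(counts, alpha)]
        log_w.append(math.log(weight) + log_beta(post) - log_beta(alpha))
    top = max(log_w)
    resp = [math.exp(v - top) for v in log_w]
    total = sum(resp)
    resp = [r / total for r in resp]

    n = sum(counts)
    estimate = [0.0] * len(counts)
    for r, (_, alpha) in zip(resp, mixture):
        denom = n + sum(alpha)
        for i, (c, a) in enumerate(zip(counts, alpha)):
            estimate[i] += r * (c + a) / denom
    return estimate

def entropy(p):
    """Shannon entropy in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def select_columns(columns, mixture, k):
    """Indices of the k columns with the lowest estimated entropy."""
    scored = [(entropy(posterior_mean(c, mixture)), i) for i, c in enumerate(columns)]
    return sorted(i for _, i in sorted(scored)[:k])

# Hypothetical two-component prior: one "conserved" and one "variable" component.
TOY_MIXTURE = [(0.5, [0.1, 0.1, 0.1, 0.1]), (0.5, [5.0, 5.0, 5.0, 5.0])]
```

Low entropy is used here as the stand-in for "informative"; the actual selection rule of the automated system may combine entropy with other criteria.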

We present results for more than 900 families represented in PROSITE, showing that fully automated pattern construction can result in higher average sensitivity and specificity in classification of all SWISSPROT sequences.

References:

Sjölander, K., Karplus, K., Brown, M., Hughey, R., Krogh, A., Mian, I.S., Haussler, D., Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology, CABIOS, 12, 1996, 327-345.

Hofmann, K., Bucher, P., Falquet, L., Bairoch, A., The PROSITE database, its status in 1999, Nucleic Acids Res., 27, 1999, 215-219.

Olsson, B., Laurio, K., Discovery of Diagnostic Patterns from Protein Sequence Databases, In: Quafafou, M., Zytkow, J. (eds.), Proceedings of PKDD98 - The 2nd European Symposium on Principles of Data Mining and Knowledge Discovery, Springer-Verlag, 1998, 167-175. 


Studies on short-chain and medium-chain dehydrogenases/reductases using the nonredundant database KIND and hitherto completed genomes

Bengt Persson

Department of Medical Biochemistry and Biophysics, Karolinska Institutet, and
Stockholm Bioinformatic Centre, Sweden

Co-authors: Yvonne Kallberg, Jan-Olov Höög & Hans Jörnvall

The protein families SDR and MDR (short-chain and medium-chain dehydrogenases/reductases, respectively) constitute large enzyme families [1]. Thanks to the fast progress in sequencing world-wide, the number of members has increased considerably over the last few years. Presently, the MDR family has over 500 members, and the SDR family over 1000 members. The SDR family also contains members with dehydratase, epimerase and isomerase activity. Thus, the SDR family represents three of the six Enzyme Commission main classes.

In E. coli alone, there are no fewer than 17 MDR forms, identified as open reading frames, considerably extending previously known MDR relationships in prokaryotes and including ethanol-active alcohol dehydrogenase. Complexity is also large, with several enzyme activity types, subgroups and evolutionary patterns. Repeated duplications can be traced for the alcohol dehydrogenases, with independent enzymogenesis of ethanol activity, showing a general importance of this enzyme activity.

Protein sequence databases are increasing in size at a rapid pace and offer a valuable source of information. In studies of biological variation, protein family characterisation and evolutionary relationships, these databases are of the utmost importance. However, as the databases expand, the time it takes to perform searches and extract the desired information increases as well.

For studies on sequence variation, we have created KIND (Karolinska Institutet Nonredundant Database), where Swissprot, Swissnew, PIR, TrEMBL, GenPept and gpcu are merged into one, and identical sequences and subsequences are removed. A slightly modified version of the naive algorithm is used for the matching process, where the sequences are matched in an all versus all manner. The database consists of around 370 000 entries; half of them originate from the protein databases and the other half from the translation of open reading frames (ORFs). KIND will be updated every 4-6 weeks, and is available via anonymous ftp from ftp://ftp.mbb.ki.se/pub/KIND.
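
The redundancy removal described above can be illustrated with a small sketch: identical sequences are collapsed, and any sequence that occurs as a contiguous subsequence of a longer retained sequence is dropped. The function name and the example sequences are hypothetical, and the plain all-versus-all substring test stands in for whatever modifications the KIND implementation applies.

```python
def make_nonredundant(seqs):
    """Remove duplicates and sequences contained in a longer sequence.

    Naive all-versus-all matching: quadratic in the number of sequences.
    """
    kept = []
    # Longest first, so that any container is already in `kept` when a
    # shorter sequence is tested against it.
    for seq in sorted(set(seqs), key=lambda s: (-len(s), s)):
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return kept

if __name__ == "__main__":
    example = ["MKTAYIAK", "KTAY", "MKTAYIAK", "GGSL"]
    print(make_nonredundant(example))
```

For a database of several hundred thousand entries the quadratic cost becomes noticeable, which is consistent with the remark above about search times growing with database size.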

References:

1. Jörnvall, H., Höög, J.-O. & Persson, B. (1999) SDR and MDR: completed genome sequences show these protein families to be large, of old origin, and of complex nature. FEBS Lett. 445, 261-264.

2. Kallberg, Y. & Persson, B. (1999) KIND - a nonredundant protein database. Bioinformatics 15, 260-261. 


Global Oligonucleotide statistics in random DNA models. 
Results, open problems, and applications to computational molecular biology.

Sven Rahmann

Theoretical Bioinformatics (TBI), German Cancer Research Center (DKFZ), Germany

Co-author: Eric Rivals, LIRMM, Montpellier, France

We developed exact and approximate techniques to determine the statistical distribution of the number of missing oligonucleotides of a given length in DNA sequences under various random models of different complexity. Even in the simplest model, the exact computations can become quite demanding. We propose an efficient method based on the enumeration of all string autocorrelations of length q, i.e., of the ways a word of length q can overlap itself. For this, we present an efficient algorithm.
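
To make the notion of a string autocorrelation concrete, the sketch below computes the autocorrelation of a word as a 0/1 vector (entry i is 1 when the word shifted by i positions matches itself on the overlap) and enumerates the distinct autocorrelations of a given length by brute force. The brute-force enumeration is only an illustration and is not the efficient algorithm referred to in the abstract.

```python
from itertools import product

def autocorrelation(word):
    """Entry i is 1 if the suffix starting at i equals the prefix of the same length."""
    q = len(word)
    return tuple(1 if word[shift:] == word[:q - shift] else 0 for shift in range(q))

def distinct_autocorrelations(q, alphabet="ACGT"):
    """All autocorrelation vectors realised by words of length q (exponential time)."""
    return {autocorrelation("".join(letters)) for letters in product(alphabet, repeat=q)}

if __name__ == "__main__":
    print(autocorrelation("ACACA"))
    print(sorted(distinct_autocorrelations(4)))
```

The number of distinct autocorrelations grows much more slowly than the number of words, which is what makes grouping words by autocorrelation attractive.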

Furthermore, by assuming the words are independent, we obtain very simple approximation formulas, which are shown to be surprisingly good when compared to the exact values. From this, we derive an interesting and so far unproven conjecture.
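
One plausible reading of the independence assumption, for uniform i.i.d. letters, treats each of the n-q+1 window positions as an independent trial, which gives an expected number of missing words of roughly s^q (1 - s^-q)^(n-q+1) for alphabet size s. The sketch below implements this formula next to a direct count on a concrete sequence; the exact form of the approximation used in the talk is an assumption here.

```python
def expected_missing_approx(n, q, alphabet_size=4):
    """Approximate expected number of q-words absent from a random text of length n.

    Assumes uniform i.i.d. letters and independence between window positions.
    """
    words = alphabet_size ** q
    windows = max(n - q + 1, 0)
    return words * (1.0 - 1.0 / words) ** windows

def count_missing(seq, q, alphabet_size=4):
    """Number of q-words over the alphabet that do not occur in seq."""
    seen = {seq[i:i + q] for i in range(len(seq) - q + 1)}
    return alphabet_size ** q - len(seen)

if __name__ == "__main__":
    print(expected_missing_approx(1000, 5))
    print(count_missing("ACGTACGTTTGACA", 2))
```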

Similar techniques are used to compute the distribution of the number of common oligonucleotides of two sequences. This knowledge allows us to analyze fast database search algorithms which use a technique called "q-gram filtration". This means that in a first fast step, one discards regions of the database that do not share sufficiently many subsequences of length q with the query. Then, a thorough but slower search for good approximate matches can focus on the remaining fraction of the database. The recently proposed QUASAR-algorithm [2] operates in this way. It has been shown to outperform standard tools like BLAST in some settings. Using our results, we can understand how the performance of QUASAR depends on its parameters, and hence propose optimal parameter sets for various applications.
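
The filtration step can be sketched as follows: the database is cut into blocks, the q-grams shared between each block and the query are counted as a multiset intersection, and only blocks reaching a threshold are passed on to the slower search. The block layout, parameter names and threshold value are assumptions for illustration; QUASAR itself works with a suffix array rather than recounting q-grams per block.

```python
from collections import Counter

def qgrams(s, q):
    """Multiset of all substrings of length q."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def filter_blocks(query, database, q, block_size, threshold):
    """Start positions of database blocks sharing at least `threshold` q-grams with the query."""
    query_grams = qgrams(query, q)
    hits = []
    for start in range(0, len(database), block_size):
        # Extend each block by q-1 letters so q-grams spanning a block border are not lost.
        block = database[start:start + block_size + q - 1]
        shared = sum((qgrams(block, q) & query_grams).values())
        if shared >= threshold:
            hits.append(start)
    return hits

if __name__ == "__main__":
    db = "TTTTTTTTTT" + "ACGTACGGAC" + "TTTTTTTTTT"
    print(filter_blocks("ACGTACGGAC", db, q=3, block_size=10, threshold=6))
```

A common way to choose the threshold is the q-gram lemma: a match of length w with at most k errors shares at least w + 1 - (k + 1)q q-grams with the query.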

Time permitting, we mention some more applications of oligonucleotide statistics. These include monkey tests for random number generators, which help to ensure that artificially generated test data for any algorithm is not biased by fault of the random number generator. We also look at the probability of a gene's unique identifiability by any short oligonucleotide within a genome.

References:
[1] Sven Rahmann and Eric Rivals. Exact and Efficient Computation of the Expected Number of Missing and Common Words in Random Texts. In: David Sankoff and Raffaele Giancarlo. Proceedings of the 11th Symposium on Combinatorial Pattern Matching. Springer-Verlag (2000). To Appear.

[2] Stefan Burkhardt, et al. Q-gram based Database Searching Using a Suffix Array (QUASAR). In: Sorin Istrail, et al. Proceedings of The Third International Conference on Computational Molecular Biology. Pages 77-83. ACM-Press (1999).


Probabilistic and statistical properties of words

Gesine Reinert

King's College and Statslab, Cambridge University, UK

Co-author: Sophie Schbath, INRA, Jouy-en-Josas, France

We discuss statistical and probabilistic properties of words, as occurring in the analysis of biological sequences. Here, a sequence is modelled as a stationary ergodic Markov chain. Special emphasis will be on the joint occurrences of multiple motifs. Here, limit results are obtained, as well as bounds on the distance to the limiting distribution. A complication lies in disentangling the complicated dependence structure between word occurrences, due to self-overlap as well as due to overlap between words.
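
As a small illustration of the setting, the expected number of occurrences of a single word in a stationary first-order Markov chain is the number of window positions times the stationary probability of the word. The sketch below computes this quantity, which is also the natural parameter of a Poisson approximation for rare words; the uniform toy chain and the function names are assumptions.

```python
def word_probability(word, stationary, transition):
    """Probability that the word starts at a given position of a stationary chain."""
    prob = stationary[word[0]]
    for a, b in zip(word, word[1:]):
        prob *= transition[a][b]
    return prob

def expected_count(word, n, stationary, transition):
    """Expected number of (possibly overlapping) occurrences in a sequence of length n."""
    positions = max(n - len(word) + 1, 0)
    return positions * word_probability(word, stationary, transition)

# Hypothetical toy chain: uniform letters, no dependence between neighbours.
ALPHABET = "ACGT"
UNIFORM_START = {a: 0.25 for a in ALPHABET}
UNIFORM_STEP = {a: {b: 0.25 for b in ALPHABET} for a in ALPHABET}

if __name__ == "__main__":
    print(expected_count("ACG", 102, UNIFORM_START, UNIFORM_STEP))
```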

References:

G. Reinert, S. Schbath and M.S. Waterman, Probabilistic and statistical properties of words, to appear in J. Comp. Bio., 2000

G. Reinert and S. Schbath, Compound Poisson and Poisson process approximation for occurrences of multiple words in Markov chains. J. Comp. Bio. 5 (1998), 223-253

M.S. Waterman (1995), Introduction to computational biology, Chapman&Hall. 


Exact distribution of motif occurrences in DNA sequences

Stéphane Robin

Unité Mathématique, Informatique et Génome, INRA Biométrie, France

Statistics based on motif occurrences - counts or positions - are frequently used in DNA analysis. The probability distribution of these statistics is needed to assess the significance of the observed results. For local analysis, asymptotic results are not relevant, since they assume that the observed sequence is infinite.

We present here some results on the exact distribution of the occurrences of one or several words in a random sequence generated by a first order Markov process. These results are obtained through probability generating functions. The effective calculation of these distributions sometimes requires very long computation times.

Three applications are presented: i) counting statistics, ii) homogeneity checking based on r-scans, iii) significance of the presence of certain motifs in promoter regions.
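
For short sequences, the exact count distribution of a single word can also be obtained by dynamic programming over a pattern-matching automaton, which may help to check results numerically. The sketch below uses this swapped-in technique rather than the generating-function approach of the talk; all function and parameter names are hypothetical.

```python
def build_automaton(word, alphabet):
    """delta[k][c]: length of the longest prefix of word that is a suffix of word[:k] + c."""
    m = len(word)
    delta = []
    for k in range(m + 1):
        row = {}
        for c in alphabet:
            text = word[:k] + c
            length = min(m, len(text))
            while length > 0 and text[len(text) - length:] != word[:length]:
                length -= 1
            row[c] = length
        delta.append(row)
    return delta

def count_distribution(word, n, alphabet, initial, transition):
    """dist[j] = P(word occurs exactly j times, overlaps counted) in a chain of length n >= 1."""
    m = len(word)
    delta = build_automaton(word, alphabet)
    # DP state: (automaton state, last letter, occurrences so far) -> probability.
    states = {}
    for c in alphabet:
        k = delta[0][c]
        key = (k, c, 1 if k == m else 0)
        states[key] = states.get(key, 0.0) + initial[c]
    for _ in range(n - 1):
        nxt = {}
        for (k, last, count), prob in states.items():
            for c in alphabet:
                p = prob * transition[last][c]
                if p == 0.0:
                    continue
                k2 = delta[k][c]
                key = (k2, c, count + (1 if k2 == m else 0))
                nxt[key] = nxt.get(key, 0.0) + p
        states = nxt
    dist = [0.0] * (max(n - m + 1, 0) + 1)
    for (_, _, count), prob in states.items():
        dist[count] += prob
    return dist
```

The number of DP states grows with the sequence length times the maximal count, so this approach also becomes slow for long sequences, in line with the remark above about computation times.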


Coalescent patterns in diploid exchangeable population models

Serik Sagitov

School of Mathematical and Computing Sciences, Chalmers University of Technology, Sweden

Co-author: Martin Möhle

We consider a class of two-sex population models with N females and an equal number N of males constituting each generation. Reproduction is assumed to undergo three stages: 1) random mating, 2) exchangeable reproduction, 3) random sex assignment. Treating individuals as pairs of genes at a certain locus, we introduce the diploid ancestral process (the past genealogical tree) for n such genes sampled in the current generation. Neither mutation nor selection is assumed. A convergence criterion for the diploid ancestral process is proved as N goes to infinity while n remains unchanged. We specify conditions under which the limit process (coalescent) is the so-called Kingman coalescent and discuss situations in which the coalescent allows for multiple mergers of ancestral lines.
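
For orientation, the Kingman coalescent mentioned above can be simulated very simply: while k ancestral lines remain, the waiting time to the next pairwise merger is exponential with rate k(k-1)/2 in coalescent time units. The sketch below computes and simulates the time to the most recent common ancestor under this standard limit process only; the diploid pre-limit model and multiple-merger coalescents are not covered, and the function names are hypothetical.

```python
import random

def expected_tmrca(n):
    """Expected time to the most recent common ancestor of n lines: sum of 2 / (k (k - 1))."""
    return sum(2.0 / (k * (k - 1)) for k in range(2, n + 1))

def simulate_tmrca(n, rng):
    """One realisation of the TMRCA under the Kingman coalescent."""
    t = 0.0
    for k in range(n, 1, -1):
        # With k lines, each of the k (k - 1) / 2 pairs merges at rate 1.
        t += rng.expovariate(k * (k - 1) / 2.0)
    return t

if __name__ == "__main__":
    rng = random.Random(1)
    draws = [simulate_tmrca(10, rng) for _ in range(20000)]
    print(sum(draws) / len(draws), expected_tmrca(10))
```

The sum telescopes to 2(1 - 1/n), so the expected TMRCA stays below 2 time units however large the sample is.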

References:

1. S.Sagitov, The general coalescent with asynchronous mergers of ancestral lines. J. Appl. Prob. 36 (1999), 1116-1125.

2. M.Möhle and S.Sagitov, A classification of coalescent processes for haploid exchangeable population models. Chalmers University of Technology, Math. Dept., Preprint no. 10 (1999) (submitted to Ann. Prob.).

3. M.Möhle and S.Sagitov, Coalescent patterns in exchangeable diploid population models. Berichte zur Stochastik und verwandten Gebieten, Johannes Gutenberg-Universität Mainz, July 1999 (submitted to J. Math. Biol.). 


Bayesian QTL mapping in inbred and outbred experimental designs

Mikko J. Sillanpää

Rolf Nevanlinna Institute, University of Helsinki, Finland

Co-author: Elja Arjas

A Bayesian method for mapping quantitative trait loci (QTLs) using inbred and outbred experimental designs is presented. Our model belongs to the variable dimensional model framework where the number of QTLs is treated as a random variable to be estimated jointly with the other unobservables. The estimation is carried out separately for each chromosome. The influence of QTLs and polygenes in the other chromosomes is controlled by treating suitably chosen nearby markers as covariates. The numerical estimation is performed using Markov Chain Monte Carlo (MCMC) methods, especially the Metropolis-Hastings and reversible jump algorithms.

The method has the potential of dealing with incomplete marker data in the offspring. In the outbred case, parental genotypes may be partly missing and their linkage phases may be completely unknown. The ultimate idea is to use segregation indicators without forgetting the original genotype and allelic origin information. Sampling schemes include implementation of a family block-update and chromosome-update for ordered genotypes (in the outbred case). An emphasis is given to the hierarchical model structure and special properties of the presented models. Inference summaries, such as posterior QTL-intensity, are demonstrated with analyses of simulated and real data sets, and with comparisons to other mapping methods.

References:

Sillanpää, M. J. and E. Arjas (1998) Bayesian mapping of multiple quantitative trait loci from incomplete inbred line cross data. Genetics 148: 1373-1388.

Sillanpää, M. J. and E. Arjas (1999) Bayesian mapping of multiple quantitative trait loci from incomplete outbred offspring data. Genetics 151: 1605-1619.

Sillanpää, M. J. (1999) Bayesian QTL mapping in inbred and outbred experimental designs. Ph.D. thesis. University of Helsinki. Rolf Nevanlinna Institute Research Reports A30. Yliopistopaino, Helsinki. (electronically available at: http://www.rni.helsinki.fi/~mjs/)


Jumping alignments

Rainer Spang

ISDS, Duke University, USA

Co-authors: Jens Stoye, Marc Rehmsmeier (Theoretical Bioinformatics, German Cancer Research Center, Heidelberg, Germany)

We describe a new algorithm for amino acid sequence classification and the detection of remote homologues. The general idea is to exploit both horizontal and vertical information of a multiple alignment in a well balanced manner. The algorithm is based on the dynamic programming principle and evaluates the fit of a candidate sequence to a given family of sequences by means of a new score called the "jumping alignment score". In a jumping alignment, a candidate sequence is locally aligned to one reference sequence in the family, and in addition the reference sequence may change within the alignment. We show that the algorithm performs well in recovering subfamilies of the SCOP database.
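
The idea of letting the reference sequence change within the alignment can be illustrated with a heavily simplified dynamic program. In this sketch the candidate and all reference sequences have the same length and are compared column by column without gaps, and switching to another reference row costs a fixed jump penalty. The scoring values and the global, gapless setting are assumptions for illustration; the actual algorithm uses local alignment with gaps and a substitution matrix.

```python
def jumping_score(candidate, family, match=1, mismatch=-1, jump=-2):
    """Best column-wise score of candidate against family rows, with penalised row switches."""
    length = len(candidate)
    if any(len(ref) != length for ref in family):
        raise ValueError("this sketch requires equal-length, ungapped sequences")

    def column_score(row, pos):
        return match if family[row][pos] == candidate[pos] else mismatch

    # best[r]: best score of a path that is on reference row r at the current column.
    best = [column_score(r, 0) for r in range(len(family))]
    for pos in range(1, length):
        best_prev = max(best)
        best = [max(best[r], best_prev + jump) + column_score(r, pos)
                for r in range(len(family))]
    return max(best)

if __name__ == "__main__":
    family = ["AAAACCCC", "GGGGTTTT"]
    print(jumping_score("AAAATTTT", family))
```

A candidate that resembles different family members in different regions scores well here even though it matches no single reference sequence over its whole length.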


Some statistical issues in microarray data analysis

Terry Speed

Department of Statistics, University of California at Berkeley, USA and Genetics & Bioinformatics Group, Walter & Eliza Hall Institute of Medical Research, Australia

Co-authors: Sandrine Dudoit, Jean Yee-Hwa Yang

The wealth of microarray-based gene expression data now available poses many statistical questions ranging from very basic ones such as finding the DNA spots on the image and measuring their intensities, through to classifying genes and seeking to elucidate biochemical pathways. Most attention so far has been given to clustering methods for grouping genes and/or samples. This talk will touch on many other issues, including more basic ones such as telling which genes' expression levels are up, down or essentially unchanged in treatment/control comparisons, both within and across experiments; assigning a precision to measured intensity changes (on the log scale); and whether different ways of preprocessing the data (spot identification, background adjustment, color normalization) make any discernible difference to the foregoing.

Reference: http://www.stat.Berkeley.EDU/users/terry/zarray/Html/index.html 


Combinatorial and statistical analysis of gene expression matrices

Zoltan Szallasi

Department of Pharmacology, Uniformed Services University of the Health Sciences, Bethesda, MD USA

Co-author: Mattias Wahde, Chalmers

An important goal in today's biological research is to understand how gene expression changes cause a given phenotypic state (e.g. cancer). In many cases, it is likely that this involves not only the identification of single genes but also the interaction between a number of genes, whose concerted action generates the phenotypic state.

In order to study this problem, we have introduced a generative process which gives rise to gene expression matrices resembling those obtained from actual measurements. Using combinatorial methods coupled with computer simulations, we obtain theoretical limitations on the number of measured samples needed in order to identify a combination of gene expression changes responsible for a phenotypic state. We also describe an algorithm for the detection of a subgroup of samples with above average gene expression similarity, which may present an alternative to traditional clustering techniques.


Reading DNA: SBH and Shotgun

Michael Waterman

Department of Mathematics, Department of Biological Sciences
University of Southern California, USA

Co-authors: Haixu Tang and Pavel Pevzner

Roger Staden implemented a computational method to assemble DNA sequences in the Sanger laboratory where DNA sequencing was being developed in the late 1970s. Since then shotgun sequence assembly has been the computational workhorse for creating DNA sequence from reads of shorter fragments. Even now in 2000 as the human genome is being sequenced, this basic and computationally intensive method is central. In contrast sequencing by hybridization (SBH) has a data structure that makes assembly easy, but technology has limited its application to sequencing. In this talk I will describe both of these methods and show how to exploit the computational algorithm of SBH while using the data of conventional sequencing projects.

References: Michael Waterman, Introduction to Computational Biology, Chapman & Hall, 1995
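
SBH reconstruction is commonly formulated as finding an Eulerian path in a graph whose nodes are (k-1)-mers and whose edges are the observed k-mers. The sketch below implements that textbook formulation with Hierholzer's algorithm on an error-free spectrum; it is an assumption that this is the computational algorithm meant in the abstract, and the function names are hypothetical.

```python
from collections import defaultdict

def spectrum(seq, k):
    """All k-mers of seq, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def assemble(kmers):
    """Reconstruct a sequence from its k-mers via an Eulerian path (error-free data assumed)."""
    graph = defaultdict(list)
    indeg = defaultdict(int)
    outdeg = defaultdict(int)
    for kmer in kmers:
        left, right = kmer[:-1], kmer[1:]
        graph[left].append(right)
        outdeg[left] += 1
        indeg[right] += 1
    nodes = sorted(set(indeg) | set(outdeg))
    # Start at the node with one more outgoing than incoming edge, if there is one.
    start = next((v for v in nodes if outdeg[v] - indeg[v] == 1), nodes[0])

    # Hierholzer's algorithm: walk until stuck, then back up and splice in detours.
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if graph[v]:
            stack.append(graph[v].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])

if __name__ == "__main__":
    reads = sorted(spectrum("ACGTTGCA", 4))
    print(assemble(reads))
```

When some (k-1)-mer occurs more than once in the target, several Eulerian paths may exist and the reconstruction is no longer unique; the example sequence is chosen so that this does not happen.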


Classification methods applied to DNA microarray data

Christopher Workman

Center for Biological Sequence Analysis, The Technical University of Denmark, Denmark

Co-authors: Soren Brunak, Ulrik Kjems, Thomas Thykaer, Torben Oerntoft, Steen Knudsen

Much effort has been put into the characterization and classification of tissue samples based on expression measurements from DNA microarrays. Unsupervised approaches have varied from cluster analysis to principal component analysis (PCA). Though not strictly classification methods, these approaches have provided valuable characterizations of tissues and genes without the need for prior knowledge. Supervised classification methods such as linear discriminants, neural networks and support vector machines have been applied with some success. Important questions remain to be answered. What is a good significance criterion for the data partition generated (set of clusters)? How much data is required to develop a valid predictor? These issues will be addressed in this talk.

References:

Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ, Broad patterns of gene expression revealed by cluster analysis of tumor and normal colon tissues probed by oligonucleotide arrays, PNAS, vol.96, 6745-6750, June 1999.

Alizadeh et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature, vol.403, 503-511, February 2000.

Brown MPS, Grundy WN, Lin D, Cristianini N, Sugnet CW, Furey TS, Ares M, Haussler D, Knowledge-based analysis of microarray gene expression data using support vector machines, PNAS, vol.97, no.1, 262-267, January 2000.

Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES, Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science, vol.286, 531-537, October 1999.

Perou CM, et al. Distinctive gene expression patterns in human mammary epithelial cells and breast cancers, PNAS, vol.96, 9212-9217, August 1999. 


Structural analysis of DNA sequence: Evidence for lateral gene transfer in Thermotoga maritima

Peder Worning

Center for Biological Sequence Analysis, The Technical University of Denmark

Co-authors: Lars J. Jensen, Karen E. Nelson, Søren Brunak, and David W. Ussery

The recently published complete DNA sequence of the bacterium Thermotoga maritima provides evidence, based on protein sequence conservation, for lateral gene transfer between Archaea and Bacteria. We introduce a new method of periodicity analysis of DNA sequences, based on structural parameters, which brings independent evidence for the lateral gene transfer in the genome of T. maritima. The structural analysis relates the Archaea-like DNA sequences to the genome of Pyrococcus horikoshii. Analysis of 24 complete genomic DNA sequences shows different periodicity patterns for organisms of different origin. The typical genomic periodicity is 11 bp for Bacteria and 10 bp for Archaea. Eukaryotes have more complex spectra, but the dominant period in the yeast Saccharomyces cerevisiae is 10.2 bp. These periodicities most likely reflect differences in chromatin structure.


Prediction of RNA-binding Sites in Ribosomal Proteins

Jian Zhang

EURANDOM, Eindhoven, The Netherlands

Protein synthesis is mediated by the ribosome in every cell. This large ribonucleoprotein complex acts as a mediator for promoting accurate decoding of mRNA and rapid formation of peptide bonds through binding mRNA, aminoacyl- and peptidyl-tRNAs, and translation-associated factors, and orienting them appropriately. Each ribosome is composed of RNAs and ribosomal proteins. It is commonly assumed that the function of ribosomal proteins is to stabilize specific RNA structures and to promote a compact folding of the large ribosomal RNAs. It is fundamental to investigate how the ribosomal proteins and rRNA interact. The RNA-binding sites of ribosomal proteins are essential for this study. Although one can find motifs in the ribosomal protein families using existing tools like MACAW and ClustalW, one cannot be sure whether these motifs are the binding sites. In this note, we will present a method to show that the most conserved motifs are the putative binding sites in some sense of information content. We also identify the possible motifs for various ribosomal protein families. This talk is based on work with Wing Hung Wong at UCLA, USA.


Last modified: Mon Apr 17 14:09:38 MET DST 2000