Computational analysis of the evolutionary dynamics of proteins on a genomic scale

Hughes, Timothy

Doctoral thesis

Open

Dr.Avh_Timothy_Hughes.pdf (2.750Mb)

Permanent link

https://hdl.handle.net/1956/2741

Publication date

2007-01-16

Collections

Department of Informatics

Abstract

Biology is primarily concerned with the study of all phenotypic aspects of living organisms, and evolutionary biology is more specifically interested in elucidating how different phenotypes evolved. Proteins (and RNA molecules) are the most fundamental level of phenotype and are encoded by the genes in the organism’s genome. Thus, at the most basic level, evolutionary biology seeks to understand how changes in the DNA sequence of genes affect protein functionality and how this modified functionality feeds back to shape the genome (and thus phenotype) of future generations. Every nucleotide of the genome is constantly at risk of mutation and, if a mutation occurs in a gamete, it has a non-zero probability of being passed on to the next generation. If the mutation has a negligible effect on phenotype (neutral mutation), it may rise to fixation through genetic drift. If, however, the effect is non-negligible and affects the organism’s fitness, it may either stand a higher chance of reaching fixation than a neutral mutation (positive selection) or it may stand a lower chance (negative or purifying selection). It is positive selection that drives the modified or new function which results in adaptation of the organism to its environment. Because life has existed on earth for at least 3.5 billion years and because the state of the physical environment is relatively stable across time, the products of genes are usually well-adapted to a particular function. Most protein coding sequence is either evolving neutrally, if the nucleotides encode amino acids that are functionally unimportant, or is under negative selective pressure, if a change in the encoded amino acid would affect fitness. However, observation of the organic world both at the macro level (e.g. anatomy and physiology of organisms) and at the micro level (e.g. proteins) reveals what appear to be many cases of recent adaptation involving novel function.
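The contrast drawn above between neutral, positively selected and negatively selected mutations can be made concrete with a small numerical sketch. It is not taken from the thesis. It uses Kimura's diffusion approximation for the fixation probability of a single new mutation, and it assumes a diploid population whose effective size equals its census size `n`, with an additive selection coefficient `s`. The function name and the example values are illustrative only.

```python
import math


def fixation_probability(s: float, n: int) -> float:
    """Approximate fixation probability of one new mutant copy.

    Assumptions (not from the thesis): diploid population of n individuals,
    effective size equal to n, initial frequency 1/(2n), additive selection
    coefficient s, Kimura's diffusion approximation.
    """
    if abs(s) < 1e-12:
        # Neutral case: the mutation drifts and fixes with probability 1/(2n).
        return 1.0 / (2 * n)
    try:
        # (1 - exp(-2s)) / (1 - exp(-4ns)), written with expm1 for accuracy.
        return math.expm1(-2.0 * s) / math.expm1(-4.0 * n * s)
    except OverflowError:
        # Strongly deleterious mutation in a large population: effectively zero.
        return 0.0


if __name__ == "__main__":
    n = 1000  # hypothetical population size
    for s in (-0.01, 0.0, 0.01):
        print(f"s = {s:+.2f}: P(fixation) = {fixation_probability(s, n):.3e}")
```

Under these assumptions an advantageous mutation fixes with a probability of roughly 2s, a neutral one with probability 1/(2n), and a deleterious one with a far smaller probability, which matches the ordering described in the text.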
Of course, changes in an organism’s physical and biotic environment may occur and would have the potential to drive adaptive changes in a gene’s function. However, most genes, because they encode functions that are essential regardless of the organism’s environment, are not free to evolve in this way. The key process enabling a gene to escape the eye of selection is gene duplication. Through duplication of a gene, redundancy is introduced to the genome as it then contains two copies of the same gene, both of which encode the same functionality. Such a duplication will generally be neutral and can reach fixation by drift. There are many fates for the gene duplicate pair, the most common of which is pseudogenisation (or gene death/loss), which involves one of the genes in the pair losing its protein-encoding properties (fixation of a null mutation). The reason for this is that, in most cases, a null mutation to one of the genes in the pair does not have any fitness effect on the mutant individual, as the other gene in the pair continues to fulfill the required function. However, some gene duplicates are retained. The process through which retention occurs is an intensively studied subject, as differences in the gene content of genomes are one of the main drivers of phenotypic diversity among species. Several models of gene duplicate evolution have been formulated, the first and probably most intuitive model being the “neofunctionalisation” model [Ohno 1970]. The key idea of “neofunctionalisation” is that there is a small chance that one of the genes in the duplicate pair is subject to a mutation conferring a new fitness-enhancing function on the protein, thus ensuring the retention of both genes in the genome: one gene having the ancestral function and the other the new function. This is one of the most obvious ways in which adaptive evolution can occur at the protein coding level.
Thus, gene duplication and the subsequent retention or loss are key processes shaping the evolution of genomes. They drive the actual number of genes in the genome and the functions of these genes. Moreover, they potentially produce neofunctionalisation. In this thesis, using genomic data from mammalian species, I begin by estimating the rate at which genes duplicate, and the rate at which the sequence of the duplicates diverges and potentially pseudogenises (Paper I). These estimates are of interest in their own right as they represent a quantitative characterisation of an important evolutionary process, but they can also be used to investigate the predominant mode of gene duplicate evolution (Paper I). Further, these estimates can be used to investigate the evolution of the gene content of a genome and, more specifically, the distribution of gene family size (Paper II). Finally, although these estimates are for gene duplicates that are the result of small-scale duplication events (tandem and segmental duplication), the estimates can be applied to investigating some of the particularities of whole genome duplication (Papers III and IV). The background knowledge required to understand the papers is presented in chapter 2. Hopefully, this background knowledge is sufficiently complete for the uninitiated reader to understand the essence of the findings of the papers. Readers familiar with the subject will probably find that they can skip large sections of this chapter. Each of the four papers is then introduced in chapter 3. Each introduction consists of more detailed background information that is relevant for the specific paper, a motivation of the work, a short summary of the results and some ideas for further work. Finally, the core of this thesis, the actual papers together with their bibliographies and supplementary materials, is located in the appendix.
This layout may seem somewhat unconventional, but it is made necessary by the guidelines for doctoral degrees at the University of Bergen which require the PhD candidate to produce papers which are later incorporated into the thesis.
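As a rough illustration of how a duplication rate and a pseudogenisation rate jointly shape gene family size, the abstract's description can be sketched as a linear birth-death process. This is a toy model under stated assumptions, not the method used in the papers: every gene in a family duplicates at a constant per-gene rate and is lost at a constant per-gene rate, and the function name, parameter names and numerical rates below are hypothetical.

```python
import random


def simulate_family_size(dup_rate: float, loss_rate: float, t_end: float,
                         rng: random.Random, start: int = 1) -> int:
    """Simulate one gene family with a Gillespie-style birth-death process.

    dup_rate and loss_rate are per-gene rates (hypothetical values, in
    arbitrary time units). Returns the family size at time t_end.
    """
    size, t = start, 0.0
    while size > 0:
        total_rate = size * (dup_rate + loss_rate)
        if total_rate <= 0.0:
            break  # nothing can happen when both rates are zero
        t += rng.expovariate(total_rate)  # waiting time to the next event
        if t > t_end:
            break
        # Choose the event type in proportion to its rate.
        if rng.random() < dup_rate / (dup_rate + loss_rate):
            size += 1  # duplication adds a gene copy
        else:
            size -= 1  # pseudogenisation removes a gene copy
    return size


if __name__ == "__main__":
    rng = random.Random(42)
    sizes = [simulate_family_size(0.5, 1.0, 5.0, rng) for _ in range(1000)]
    surviving = [s for s in sizes if s > 0]
    print(f"{len(surviving)} of {len(sizes)} families survive")
    print(f"largest surviving family: {max(surviving, default=0)} genes")
```

A single pair of rates is used here for every family. Paper II attributes the power-law distribution of family sizes to heterogeneity of the pseudogenisation rate between families, which this sketch would only capture if `loss_rate` were drawn separately for each family.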

Consists of

Paper I: Journal of Molecular Evolution 65, Hughes, Timothy and David A. Liberles, The Pattern of Evolution of Smaller-Scale Gene Duplicates in Mammalian Genomes is More Consistent with Neo- than Subfunctionalisation, pp. 574-588. Copyright 2007 Springer Science + Business Media. Abstract only. Full-text not available due to publisher restrictions. The published version is available here: http://dx.doi.org/10.1007/s00239-007-9041-9

Paper II: Gene 414 (1-2), Hughes, Timothy and David A. Liberles, The power-law distribution of gene family size is driven by the pseudogenisation rate's heterogeneity between gene families, pp. 85-94. Copyright © 2008 Elsevier B.V. All rights reserved. Preprint version. The published version is available here: http://dx.doi.org/10.1016/j.gene.2008.02.014

Paper III: Hughes, Timothy and David A. Liberles, The whole genome duplications in the ancestral vertebrate are detectable in the distribution of gene family sizes of tetrapod species. Preprint version. Submitted to the Journal of Molecular Evolution. Published by Springer Science + Business Media.

Paper IV: Genome Biology 8, Hughes, Timothy; Ekman, Diana; Ardawatia, Himanshu; Elofsson, Arne and David A. Liberles, Evaluating dosage compensation as a cause of duplicate gene retention in Paramecium tetraurelia, p. 213. © 2007 BioMed Central Ltd. The published version is available here: http://dx.doi.org/10.1186/gb-2007-8-5-213

Publisher

The University of Bergen