Molina, Nacho. Genome evolution and regulatory network structure in bacteria. 2010, Doctoral Thesis, University of Basel, Faculty of Science.
Official URL: http://edoc.unibas.ch/diss/DissB_9310
Abstract
Funes, in spite of his infallible memory, was not capable of thought since, as J.L. Borges writes, "to think is to forget differences, generalize, make abstractions." Owing to the latest technological advances, biology seems to be entering a Funes-like state: biologists can amass more experimental data about the organisms they study than ever before, and store these "memories" in huge databases. A fundamental question arises: can the scientific community synthesize this information and turn it into powerful abstract theories? Is abstraction possible or even desirable in a discipline as complex as biology? From the point of view of a physicist, I believe that a theoretical biology is both possible and desirable.
Several quantitative laws have recently come to light in biology, particularly in the evolution and regulatory architecture of genomes. This thesis explores the implications for genome evolution and regulatory network structure of one such law: the scaling of the functional content of genomes with their size. This was the starting point of this thesis, which hopefully represents a small step towards a general theory of genome evolution and regulatory network structure in bacteria.
Genome evolution: Darwin's original work established the basis of the theory of evolution by postulating that traits spread in populations by natural selection. This fundamental understanding was partially changed by the discovery that DNA carries heritable genetic information, which led to the beginning of a new era of molecular evolution. Comparing orthologous mammalian DNA sequences to the fossil record indicated that the rate of amino acid substitutions was roughly constant in time. However, these substitutions were fixed in populations too often to have been the result of selection. This high rate of fixation led Kimura to formulate his neutral theory of molecular evolution. Since then, neutral evolution has become the null model of sequence evolution, which has permitted the rigorous reconstruction of phylogenies and the detection of selection on gene sequences. Today the available sequences have grown from a few genetic loci to hundreds of whole annotated genomes. This wealth of data permits us to look beyond amino acid substitutions and study the variation in gene content and structure of genomes as a whole. In fact, several studies have shown that even closely related genomes with few substitutions often have enormous differences in gene content. These results highlight that changes at higher levels of organization play an essential role in the evolutionary process and therefore in the diversity of life. The main forces causing these changes, i.e. shaping the gene content of genomes, are gene duplication, gene deletion and horizontal gene transfer, which lead to the acquisition of genes with new functions, the subfunctionalization of existing functions, or the deletion of genes whose functions are no longer required.
Studies of gene content have uncovered several striking quantitative laws that are directly related to genome evolution. First of all, it was noticed that a number of key genomic quantities show power-law distributions. In particular, the distribution of gene family sizes is a power law in each genome, whose exponent appears to depend mostly on the size of the genome. Several theoretical models have been put forward to explain these power-law distributions, all of which include gene duplication, gene deletion and gene innovation as key ingredients. Another striking observation is that the numbers of genes in different functional categories scale as power laws in the total number of genes in the genome. For example, whereas the numbers of genes involved in different types of metabolism scale approximately linearly with genome size, the number of genes involved in regulatory processes such as transcription regulation and signal transduction scales almost quadratically with genome size, and the number of genes involved in basic processes such as DNA replication or cell division scales with an exponent less than 1. Such scaling laws are observed for the large majority of high-level functional categories. As has been argued before, these scaling laws have important implications for the evolutionary dynamics of gene duplications and deletions.
This thesis focuses on how the functional content of genomes scales with genome size. We show that these scaling laws hold across bacterial clades, and formulate the simplest null model that accounts for them. The scaling exponents emerge as universal constants of genome evolution. We test the model's predictions against the protein domain content of closely related genomes by estimating the number of domain additions and deletions in each pair of genomes since they diverged from their last common ancestor. The available data support nearly all of the model's predictions. Finally, we discuss the implications of our work for the role of horizontal gene transfer in genome evolution.
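As an illustration of what a scaling exponent means in practice, the following minimal sketch estimates the exponent of a relation of the form n_c = a * g^alpha by ordinary least squares in log-log space. This is an assumption made for illustration only: the abstract does not state which fitting procedure the thesis uses, the function name `fit_power_law` is hypothetical, and the gene counts are synthetic rather than taken from any genome.

```python
import numpy as np

def fit_power_law(total_genes, category_genes):
    """Fit category_genes = a * total_genes**alpha by least squares on logs.

    Returns (alpha, a). Plain log-log regression is an illustrative choice,
    not necessarily the procedure used in the thesis.
    """
    log_g = np.log(np.asarray(total_genes, dtype=float))
    log_n = np.log(np.asarray(category_genes, dtype=float))
    # A straight line in log-log space: log n = alpha * log g + log a
    exponent, log_prefactor = np.polyfit(log_g, log_n, deg=1)
    return float(exponent), float(np.exp(log_prefactor))

# Synthetic, noise-free counts (NOT data from the thesis), built so that one
# category scales quadratically and the other linearly with genome size.
genome_sizes = np.array([500, 1000, 2000, 4000, 8000])
regulatory_genes = 1e-5 * genome_sizes ** 2.0
metabolic_genes = 0.1 * genome_sizes ** 1.0

reg_exponent, _ = fit_power_law(genome_sizes, regulatory_genes)
met_exponent, _ = fit_power_law(genome_sizes, metabolic_genes)
print(f"regulatory exponent ~ {reg_exponent:.2f}, metabolic exponent ~ {met_exponent:.2f}")
```

With real genome annotations the counts are noisy, so the uncertainty of the fitted exponent would also need to be estimated; that step is omitted here.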
Regulatory Networks: We can view a bacterial cell as an entity made up of many molecular components that is capable of sensing many internal and external physico-chemical signals and of executing specific cellular programs in response. The realization of each program produces certain concentrations of specific proteins that act in some fashion beneficial to the cell. Thus, to understand the cell's dynamics, we must know how protein concentrations change in response to the environment.
Transcription of genes into mRNA molecules is one of the most important stages of protein biosynthesis. Transcription is regulated by specific proteins that are collectively called transcription factors. In response to stimuli, transcription factors bind specifically to DNA by recognizing short DNA sequences upstream of genes. Upon binding, they activate or repress transcription of genes into mRNA, i.e. transcription factors activate or repress gene expression. The set of all interactions between transcription factors and their regulated target genes forms the so-called transcriptional regulatory network. Understanding this network is therefore essential for understanding the cell's response to its environment. The topological features of the transcriptional regulatory networks of E. coli and S. cerevisiae have been intensely studied, and some of their global and local properties have been uncovered in recent years. For instance, some studies have shown that the distribution of the number of genes regulated by a particular transcription factor (the out-degree) follows a power law, while the number of transcription factors regulating a particular gene (the in-degree) follows an exponential distribution.
Globally, these networks are organized into subnetworks that show a hierarchical internal structure with very few feedback interactions except for self-regulation. Interestingly, it has been demonstrated experimentally that these subnetworks process specific environmental signals. Locally, certain motifs formed by a few nodes appear more often than in random networks with the same degree distributions. The information-processing properties of these motifs have been studied individually, as has the way they aggregate to form higher-order structures. However, it is not clear whether these motifs have been positively selected by evolution because of their particular functions, or whether they are a side effect of the evolution of the regulatory network. Some of these results are still controversial, and it is important to recall that they were obtained on incomplete networks; they may not hold once the full networks are known.
All the results above come from a small number of model organisms. Therefore, little is known about how the global structure of transcription regulatory networks varies across bacteria. Strikingly, the number of transcription factors grows roughly quadratically with the size of the genome. For example, according to the DBD database, the number of transcription factors per genome in bacteria varies from only 3 (of a total of 504 genes) in Buchnera aphidicola to 801 (of a total of 7717 genes) in Burkholderia sp. 383. To put the latter number in perspective, the vastly bigger genomes of C. elegans and D. melanogaster have a lower estimated total number of transcription factors according to the same database. The enormous range in the number of transcription factors across bacteria reflects a corresponding range in the complexity of gene regulation. For example, Buchnera lives in a very stable environment as an endosymbiont of aphids and shows little transcriptional regulation. In contrast, Burkholderia can live under extremely diverse ecological conditions, including in soil, in water, as a plant pathogen and as a human pathogen, which most likely requires complex regulatory mechanisms.
This scaling property of the number of transcription factors has important implications for the structure of transcription regulatory networks. The total number of interactions between transcription factors and regulated genes is given by the number of transcription factors r times the average number of interactions per transcription factor, but also by the total number of genes times the average number of transcription factors that regulate a gene. Since the number of transcription factors per gene grows linearly with the total number of genes, the average number of interactions per transcription factor and the average number of transcription factors that regulate a gene cannot both be the same in bacteria of different genome sizes. That is, either genes are regulated by more transcription factors in larger genomes, or the regulon size decreases with genome size. Which of these scenarios occurs in nature? This thesis addresses this question.
Answering this question directly would require knowing a large number of transcriptional regulatory networks, but very few such networks are available. Instead, we use an indirect procedure based on the assumption that regulatory sites on the genome evolve under purifying selection. We develop a novel method to measure purifying selection in intergenic regions. Our procedure starts from a set of related bacterial genomes (a clade) as provided by the NCBI microbial genome database, one of which is designated the reference species. For each gene and each intergenic region of the reference species, we extract orthologous genes and intergenic regions from the other species and produce multiple alignments. We determine cliques of orthologous proteins (sets of genes that are mutual orthologs between all species in the clade) and infer the topology of the phylogenetic tree from the concatenated alignment of all cliques.
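The counting argument made earlier in this section can be written compactly. In the sketch below, r is the number of transcription factors as in the text, while g (total number of genes), E (total number of interactions) and the angle-bracket averages are notation introduced here for illustration. The final line is only a rough two-point check using the DBD counts quoted above, not a fit across genomes.

```latex
% r: number of transcription factors (symbol from the text)
% g, E, <k_out>, <k_in>: notation introduced here for illustration
%   <k_out> = average number of target genes per transcription factor
%   <k_in>  = average number of transcription factors regulating a gene
\begin{align}
  E &= r \,\langle k_{\mathrm{out}} \rangle = g \,\langle k_{\mathrm{in}} \rangle
  \quad\Longrightarrow\quad
  \frac{\langle k_{\mathrm{in}} \rangle}{\langle k_{\mathrm{out}} \rangle} = \frac{r}{g}, \\
  r &\propto g^{2}
  \quad\Longrightarrow\quad
  \frac{\langle k_{\mathrm{in}} \rangle}{\langle k_{\mathrm{out}} \rangle} \propto g .
\end{align}
% Two-point exponent estimate from the quoted counts
% (3 factors among 504 genes; 801 factors among 7717 genes):
\begin{equation}
  \frac{\ln(801/3)}{\ln(7717/504)} \approx \frac{5.59}{2.73} \approx 2.05
\end{equation}
```

Because the ratio of the two averages must grow with g, at least one of them has to change with genome size, which is the dichotomy stated in the text.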
We then evaluate the amount of selection at each alignment column by the likelihood ratio of two evolutionary models. The background model assumes a simple F81 substitution rate model, which is parameterized by an overall mutation rate and a vector of equilibrium base frequencies. The foreground model assumes the same substitution rate model, but with an unknown, site-specific set of base frequencies that accounts for the selection acting on that site and that is integrated out of the likelihood. Some of these techniques were integrated into MotEvo, a novel tool for detecting binding sites in intergenic alignments given known weight matrices.
We applied our method to 22 different bacterial clades that are spread widely across the phylogenetic tree. We identified segments in the intergenic regions of the analyzed bacteria that show evidence of purifying selection. To evaluate how well our method detects real binding sites, we studied the overlap between the identified segments and experimentally verified binding sites of E. coli. The results show that we are able to detect real binding sites based on conservation. We obtained profiles of purifying selection with respect to gene start and stop sites, which reveal universal patterns across species. One of the most remarkable patterns is the selection that takes place around the start codon, which is shown to be connected to translational efficiency. We observed, in almost all clades, a relatively higher frequency of adenine around the start codon, which we showed is related to the avoidance of RNA secondary structure in that region.
Coming back to our starting question: how does the number of binding sites scale with genome size? To answer this, we studied the amount of purifying selection in intergenic regions across the 22 bacterial clades. Strikingly, the amount of purifying selection in intergenic regions does not vary with genome size. Moreover, the most conserved DNA words in intergenic regions showed higher diversity in large genomes than in small ones. These results strongly indicate that the structure of transcription regulatory networks changes dramatically with genome size: small genomes have few transcription factors, each binding to many sites, while large genomes have many transcription factors, each binding to a few sites. In other words, gene regulatory complexity is limited across bacteria, while transcription factors become specialized in large genomes.
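The likelihood-ratio test described above (an F81 background model against a foreground model whose base frequencies are integrated out) can be illustrated with a deliberately simplified sketch. Several things here are assumptions rather than the thesis's method: a star phylogeny replaces the inferred tree, the unknown foreground frequencies get a uniform Dirichlet prior, the integral is approximated by Monte Carlo sampling, and all function names and parameter values are hypothetical.

```python
import numpy as np

BASES = "ACGT"

def f81_transition(pi, mu_t):
    """F81 transition matrix: P[a, b] = exp(-mu_t) * [a == b] + (1 - exp(-mu_t)) * pi[b]."""
    keep = np.exp(-mu_t)
    return keep * np.eye(4) + (1.0 - keep) * np.tile(pi, (4, 1))

def column_likelihood(column, pi, branch_lengths):
    """Likelihood of one alignment column on a star tree (assumed topology).

    The ancestral base is drawn from pi and each leaf evolves independently
    along its own branch; the ancestor is summed out.
    """
    lik_given_root = np.ones(4)
    for base, mu_t in zip(column, branch_lengths):
        lik_given_root *= f81_transition(pi, mu_t)[:, BASES.index(base)]
    return float(np.dot(pi, lik_given_root))

def selection_log_ratio(column, pi_background, branch_lengths,
                        n_samples=4000, seed=0):
    """log( foreground likelihood / background likelihood ) for one column.

    Foreground: base frequencies unknown, averaged over a uniform Dirichlet
    prior by Monte Carlo (an illustrative choice of prior and integrator).
    """
    rng = np.random.default_rng(seed)
    pi_samples = rng.dirichlet(np.ones(4), size=n_samples)
    foreground = np.mean([column_likelihood(column, pi, branch_lengths)
                          for pi in pi_samples])
    background = column_likelihood(column, pi_background, branch_lengths)
    return float(np.log(foreground / background))

# Toy usage: four species, equal branch lengths, uniform background frequencies.
branches = [0.5, 0.5, 0.5, 0.5]
uniform = np.full(4, 0.25)
conserved_score = selection_log_ratio("AAAA", uniform, branches)
variable_score = selection_log_ratio("ACGT", uniform, branches)
print(conserved_score, variable_score)
```

In this toy setting a fully conserved column receives a positive score and a maximally variable column a negative one, which is the behaviour one would want from a conservation-based signal of purifying selection. A faithful implementation would instead propagate likelihoods over the inferred phylogenetic tree.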
Advisors: | Nimwegen, Eric van |
---|---|
Committee Members: | Koonin, Eugene V. and Zavolan, Mihaela |
Faculties and Departments: | 05 Faculty of Science > Departement Biozentrum > Computational & Systems Biology > Bioinformatics (van Nimwegen) |
UniBasel Contributors: | Zavolan, Mihaela |
Item Type: | Thesis |
Thesis Subtype: | Doctoral Thesis |
Thesis no: | 9310 |
Thesis status: | Complete |
Number of Pages: | 136 S. |
Language: | English |
Last Modified: | 02 Aug 2021 15:07 |
Deposited On: | 26 Jan 2011 13:38 |