4 Introduction to Microbial Genomics

Author

Meriam Guellil

4.1 Introduction

Microbial genomics is the identification and study of microbial genomes, their structure and composition. Microbial species come from diverse groups of organisms but are generally defined as organisms which are not visible to the naked eye. Most prominent amongst them are bacteria, which are single-celled prokaryotes with either circular and linear dsDNA genomes (up to ~14 Mbp). Archaea are mostly mutualistic or commensal single-celled prokaryotes with circular dsDNA genomes (~0.5 to ~5.8 Mbp). The often references group of protozoa, is a not phylogenetically defined grouping (often used in databases). They are a wide variety of free-feeding single-celled eukaryotes from the protists group with larger genomes (~2.9 to ~160 Mbp).

Finally, viruses are understood as microbial organisms, even though they are not technically considered as “organisms”. Viruses lack the ability to live or replicate independently of host cells. They are mostly defined as infectious agents and have smaller linear or circular ssDNA, dsDNA & RNA genomes (2 kb to over 1 Mb). Finally, another category of viruses are retroviruses. These will usually be RNA viruses which can integrate into the host genome by converting to DNA. But there are also DNA viruses, which integrate into the human genome. There the virus can either be latent and be later triggered to activate, continuously produce virions or lose the ability to produce virions and become part of the host genome. An integrated genome is called a provirus and vertically inherited proviral sequences are called endogenous retroviruses (ERVs).

There are several types of microbial organisms found within ancient DNA samples of animal hosts.

Pathogens: infectious microorganisms or agents which cause disease in the infected host. Usually specialized organisms with delimited ecological niches.
Commensals: non-infectious microorganisms or agents which live within a host/environment without causing harm and can be beneficial.
Environmental microorganisms: Environmental microbial organisms not endogenous to the host ante-mortem, which e.g. stem from the depositional environment, the storage environment or the lab.

However, it should be noted that not all cases are quite as clear-cut as there are wide varieties of microbial lifestyles. Some organisms can be considered pathogenic in one sampling location and commensal in the next (pathobionts). Some organisms can also be considered harmful, when they are represented in very large amounts or in combination with other organisms. The sampling location can therefore be important information when studying the health of hosts.

4.2 Larger Genomic Elements

When using reference sequences from public depositories such as NCBI you will often be confronted with multiple intervals which represent the chromosome(s) and additional extrachromosomal sequences associated with a species or strain. A chromosome is understood as the main genetic element. It contains the core genome with all essential genetic elements, which usually remain the same across a species.

Plasmids, on the other hand, are extrachromosomal DNA replicons. They are replicating and acquiring/losing genetic material independently of the chromosome and can be present in high copy numbers. Plasmids can be vertically or horizontally transferred. The number of plasmids of a bacterium varies a lot (0 to 30+), and can vary within a species. They can carry virulence genes and genes which can give selective advantages. Plasmids generally have much smaller circular genomes (1 to over 400 Kbp). It should be noted that they are under-represented in databases, meaning that they are often ignored during the assembly/sequencing process and thus not present in all reference sequences, even if they could otherwise have been there.

Besides plasmids, bacteriophages (or phages) are also important extrachromosomal (and sometimes chromosomally integrated) genetic material associated with bacterial genomes. Although they are not considered as part of the genome, and therefore part of reference sequences, if they are not integrated in the host cell genome. Bacteriophages are dsDNA viruses of small size that infect bacteria. As phages have mostly dsDNA genomes, they are also well suited for the recovery using standard aDNA techniques. They can be found everywhere and carry gene sets of diverse size and composition. They enter the cytoplasm of bacteria, where they can then replicate. There are a huge number of phages, and some are specific to certain bacterial genera or species, which can be useful for identifying taxa or phylogenetic clades. When bacteriophage genomes integrate into the host cell chromosomal genome (e.g. through horizontal gene transfer) or exist as a plasmid within the bacterial cell, they are called prophages. They can make up a large fraction of the pan-genome of plastic species.

4.3 Sampling & Sequencing

For ancient DNA samples, screenings for organisms of interest is usually performed on “shotgun” sequencing data, which can then be either further gnomically analysed following species identification or the library will be enriched for organisms identified during screening. Prior to sequencing, laboratory workflows, and particularly library building setups, can have a major impact on your final output (e.g. if you are interested in RNA/ssDNA viruses but have dsDNA libraries).

Each pathogen is subject to tissue tropism, meaning that each species will have a range of cells or tissue types it will preferentially or exclusively proliferate and replicate in. This phenomenon is called tissue tropism. In some instances, the pathogen can also become latent within said tissues, meaning it will remain within the host tissue without causing disease. Organisms which cause bacteremia or viremia, meaning they are present in the bloodstream, will not be as restricted and are generally considered easier to detect. Although this can be limited by the disease phenotype or the stage of infection (e.g. Haemophilus influenzae).

Accordingly, what organism you can find is highly dependent on the sample and where it was taken from. For example, if the organism of interest enters the bloodstream, teeth, which are vascularized and are excellent DNA archives, will probably be a good choice for sampling. Some pathogens (e.g. Mycobacterium leprae) are present in higher quantities within lesions caused by their infections, making those lesions better suited for sampling. In the case of infections, which are unlikely to be retrievable from hard tissue, calcified nodules can be an interesting type of sample. Bone will generally be less well suited, with some exceptions. Integrated viruses and retroviruses will be likelier to be found in samples types with higher host DNA, such as the petrous bone or the ossicles. And early childhood infection can also be found within low remodelling cortical bone.

Some Myths to Dispel…

Microbial content and pathogen content is NOT directly correlated to human DNA content. Human DNA content is not a measure for overall sample preservation. You can have really bad samples for human DNA but get a full microbial genome from the same sample!
You can also find microbial signatures in petrous bones etc. it will just be a different range of organisms. Everything is worth getting screened!

The typical number of sequences recovered for an organism will depend on a range of factors which can only rarely be predicted even in samples for which osteological/historical record show a clear association with a pathogen. Major factors are: overall sample preservation, depth of sequencing, abundance of the organism peri-mortem, disease phenotype, genome size & composition, “noise” from other organism, etc.

Overall, it is considered a good result if your target organism makes out around 0.1% of shotgun datasets, but in many cases it will be way less. However, depending on the organism, the amount of data you need to confidently classify it will be very different, mostly due to sequence conservation and genome size and complexity. While a bacterium is generally easier to detect, their large genome size and sequence conservation, makes them harder to verify. On the other hand, viruses, having much smaller genomes, are harder to detect, especially at low sequencing depth, but generally easier to verify within a margin of uncertainty.

For some applications in microbial genomics, unselective/biased sequencing can be key, and for this shotgun sequencing is a powerful tool at our disposal since it allows for the indiscriminate sequencing of all DNA found within the sample. Particularly in cases of organisms with large pangenomes, where you might be interested in looking for a large variety of intervals, which would be very costly to do using target enrichment. Or in cases where novel insights and genomes become relevant following the design of a target enrichment kit. Additionally, it allows for the simultaneous analysis of both microbial and host DNA. Understanding the composition of the microbial community of your sample can also be highly relevant with regard to identifying co-infectants, reconstructing disease histories and excluding gene intervals which could also stem from commensals or contaminants.

Shotgun sequencing for microbial genomics:

Pro(s):

Non-targeted taxa can be found.
Allows the analysis of DNA from both microbial organisms and host simultaneously.
Allows you to detect untargeted co-infections.
Allows you to understand the composition of the microbial community (can be relevant for genomic analysis).
No knowledge of the taxon genomic diversity is needed beforehand.
Enables the long time use of the data with ever growing databases.

Con(s):

Can be much more expensive (depending on the sample).
Overall less effective at generating adequate/high coverage data.

However, microbial species make up only very small fractions of genomic libraries. So small that it can either be impossible or very expensive to try to assemble a genome using only shotgun data in most circumstances. This is usually where target enrichment, or capture, comes into play. Often, the baits required for capture are custom designed for each study. The design of such kits necessities appropriate knowledge of the genetic diversity to be expected during analysis. In most cases, a mixture of shotgun and target enrichment can be most effective, especially with regards to authentication if enrichment is performed on UDG/Half-UDG genomic libraries.

Target enrichment for microbial genomics:

Pro(s):

Much cheaper per genome, especially in samples with bad preservation.
Effective at generating high coverage genomes.
Allows for the generation of large number of genomes effectively.
Great for core-genome reconstructions.

Con(s):

Necessitates knowledge of the genetic diversity relevant to your research questions during the design phase.
Any sequence not included, within a margin of variation, will not be captured.
Will not generate any data with regards to the host or the rest of the microbial community (in fact that is the point).
For some species, capturing the pangenome would be extremely expensive provided it doesn’t fully exceed kit sizes.

4.4 Validating the Presence of Organisms of Interest

Following analysis of your data using appropriate databases and taxonomic classifiers, it is time to validate your hit. The results of a classifier should not be your end result. Properly validating the presence of a species is key to any genomic analysis! An identification by classifiers is not a sufficient validation on its own, as it can be very tricky to work around false-positive hits and missing database entries. Databases are biased towards pathogenic species, as they represent the bulk of research. Closely non-pathogenic species will either be represented less or not at all, which leads to high read assignments to the LCA (Lowest Common Ancestor) and pathogenic taxa. You should also consider whether it makes biological sense for the species you found to be detected in the sample’s tissue and based on your laboratory workflow.

For ancient DNA, the validation will often consist of three steps:

Authentication of the data as aDNA using deamination signatures.
Comparative or competitive mappings to relevant reference sequences (target organism and closely related environmental/commensal species).
Identification of intervals specific to the target species or sub-species.

Tip

It can also be a good idea in some cases to double check the modern data you are using during your analysis. There are cases in which the metadata doesn’t match reality.

Increased coverage in selected intervals in an alignment is often cause by sequence conservation. Meaning that you could have the same number of reads mapping to a genome but in the first case all reads are mapping to 5% of the genome with high depth of coverage, whether in the other one you will see low coverage across the whole sequence. The first one is likely to be a false positive, and your detection is based on intervals from either a commensal or an environmental organism which have very high sequence similarity to the ones in your reference genome. This can be further worsened by low complexity intervals. These peaks will usually also reflect in the edit distance, as such tiling will also result in more sequence mismatches. While this can be reduced with higher mapping stringency, this in turn will probably decrease your coverage significantly and in some cases this can cause a reference bias. However, in most cases, these intervals will not pass the mapping quality filter.

Definition

Conserved Intervals: Regions of genomes common to organisms from the same taxonomic units (can affect all taxonomic ranks and child taxa, and very distantly related organisms). They will show high sequence similarity and cause noise/contamination within the analysis, particularly for ancient DNA.

Tip

In many cases, running a range comparative and/or competitive mappings to closely related species is advised in order to exclude a misidentification and the presence of multiple species with similar sets of genes in your dataset (e.g. Neisseria meningitidis). It can also be useful to use multiple reference sequences per species if it has multiple phylogenetically strongly delimited clades (e.g. H. influenzae)

Viruses are less affected, as they do not carry house-keeping genes, and only have a very small usually highly specialized set of genes, which makes them easier to validate. However, viral sequencing of non-pathogenic species has only recently started to become more popular due to metagenomic studies, i.e. non-pathogenic taxa are even more under-represented, and many viral genomes have low complexity repeats over large portions of their sequence.

Finally, intervals or mutations specific to the target organism should be identified and investigated. This can range from plasmids, genes, phages to specific mutations. Often a combination of these factors is required to confidently validate the presence of the organism. This step should not be overlook as this will not always be clearly visible in a phylogeny, depending on the composition of the alignment used.

4.5 Genomes

Strains VS Genomes

A strain is a genetic variant or subtype of a species or sub-species. They usually exhibit significant genetic differences to other strains of the same taxonomic unit. The level of change required for significance can vary.
A genome can be the exact copy of an already sequenced strain.

The case of aDNA: Since our samples are so old, most of our genome constitute new strains. However, caution is advised. Particularly with clonal species (e.g. Black Death genomes).

There are many different microbial species, and they all have very different evolutionary dynamics and genomes. This means that depending on which species you are working on, you might have to approach genomic analysis differently. Additionally, the research questions which could be answered with your data will/can also be rather different. This sections will summarise some of the core principles underlying many possible research questions.

Mapping is an important tool in ancient DNA to investigate the presence and absence of genomic intervals. However, mapping can be impeded by the reference sequence itself. Gene duplication, low complexity sequences and GC Skew (over- or under-abundance of GC in intervals) can impact the mappability (and mapping evenness) of sequencing reads to the reference sequence and their mapping quality score. Which in turn might be problematic during analysis. This can be expressed in mappability estimates, which estimate how likely it is for short reads to map to a sequence interval based on its composition and uniqueness.

Applying a mapping quality filter by default after a bwa aln mapping, will not only cause all bad quality reads to be removed from the alignment, but also any read which could align to multiple sections of your reference genome. This means that most likely every duplicated interval/gene, specific or conserved, and low complexity regions will also be removed! Mapping quality filters are important tools, but you should make sure you understand how much of the genome is actually covered (e.g. terminal repeats in viral genomes) and when to apply them. The same applies for competitive mappings using bwa aln!

Definition

Sequence Complexity: Defined as the observed vocabulary usage for word size. Meaning, how complex is the sequence of nucleotides within an interval. Often given as values calculated using either entropy or DUST algorithms. E.g.:

AAAAAAAAAAAAAAAAAAAA >> LOWEST COMPLEXITY

ATGATGATGATGATGATGATG >> LOW COMPLEXITY

ATGTTTCGAGGCATGATAACCGTATG >> COMPLEX SEQUENCE

Definition

GC Content: Is usually given as a percentage of guanine and cytosine bases within the sequence. Can impact DNA stability, amplification, sequencing, and capture if the GC content is too high or too low. Too high or too low GC-content will also lead to a loss in sequence complexity. E.g.:

GCCCCCCCGGAATGGACCCGCGCCT >> 80% GC

ATTTTGAACCTAATTTATATAGCAA >> 20% GC

4.5.1 Recombination

Bacteria and viruses are haploid and have small genomes, making some things much easier, but… They like to mix and match! Microbial organisms will exchange genetic material using a range of biological mechanisms (conjugation, transduction, and/or transformation) this leads to increased genetic diversity via horizontal gene transfer. Some parts of the genome will be more heavily affected by this exchange than others. This constant exchange of genetic material can happen while maintaining genome size, meaning that while genes are gained, others are lost. How much recombination a genome will undergo, is highly dependent on the species or sometimes the phylogenetic clade. These dynamics cause recombination breakpoints in reference based alignments where SNP counts will increase, which can impede phylogenetic analysis (e.g.: by causing elongated branches).

However, some species do not recombine at all or do so only very rarely. These species have clonal genomes, meaning they transfer their entire genome vertically, most the of time, without significant changes to their pangenome or the sequence itself. Actually, many of the species which have been extensively studied using aDNA, fall within this group (e.g. Y. pestis, M. leprae).

Most microbial organisms change their genomes via recombination and gene loss/gain, while maintaining genome size and retaining their core genome. Virulence is then often defined by gaining or losing virulence factors. However, some increase their virulence by reducing their genome and becoming highly specialized (e.g. Borrelia recurrentis) in what is called reductive evolution. In these cases, it is of interest to include closely related phylogenetically “basal” species (pathogenic or not), which might inform you on genes/plasmids the modern genomes have lost, but the ancestral genomes might have still retained.

4.5.2 Pangenomes

As mentioned, microbial organisms can be very plastic and exchange genomic material. This can lead to a wide variety of genes being represented within a single species. Pangenomes represent the sum of all genomic intervals represented within the genomes of one species. And thus, characterising the genome of a new strain can be a lot more complex than sticking to a single reference sequence.

Pangenomes are made up of three groups:

Core Genes: a set of essential genes common to all strains of a species.
Accessory/Shell Genes: a set of genes common amongst some strains of a species which encode for supplementary or modified biochemical functions. In some cases these will also be virulence genes.
Singleton/Unique/Cloud Genes: genes, which are specific to single strains of a species. Unlikely to be recovered and identified using ancient DNA.

We also differentiate between open and closed pangenomes. In open pangenomes, the gene set available to the bacterium constantly expands by acquisition and loss of genes via horizontal gene transfer from its environment, for the purpose of adaptation (e.g. environmental adaptation, metabolism, virulence and antibiotic resistance). All while maintaining genome size and vertical transmission of the core genome. Closed pangenome have limited exchanges of genetic material. This is often the case in highly specialized organisms. So while they are much more plastic than strictly clonal species, the available genetic pool for exchange will be smaller.

4.5.3 SNP Effects

Single nucleotide polymorphisms (SNPs) can heavily impact the function of genes and the phenotype of an organism when they happen to affect and change the amino acid sequence of coding elements of a genome. This effect is frequently used in ancient DNA to either predict the presence/absence of significant mutation, which could have changed the phenotype of a disease.

Definitions

Non-synonymous (missense) mutation: SNPs that alters the amino acid sequence of a protein by changing one base of a triplet group. E.g.: GAT>GGT == Asp(D)>Gly(G)

Frameshift: Changes in the amino acid sequence caused by insertions or deletions of nucleotides in coding sequences which are not multiples of three.

Start/Stop Codon: Nucleotide triplet which signals the beginning and end of translation.

Stop/Start-gain mutation: Mutation which leads to a change in the amino acid sequence and leads to a premature stop/start of translation.

Start/Stop-loss mutation: Mutation which leads to a change in the amino acid sequence and leads to the loss of a start/stop codon. Leads to the reduction or elimination of protein production.

The impact of such changes is predicted by tracking changes in the translation of coding sequences using a reference sequence and SNP/MNP/Indel calls. The problem when using ancient DNA, is that genome coverage is often uneven, and it isn’t rare to lack full coverage of intervals of interest or lack resolution to correctly call indels across the reference. Larger genomic insertions and recombination are also not accounted for when a genome is reconstructed solely using a reference based approach. This means that while SNP effects can indeed be very interesting and potentially highly significant, they should be understood as prediction based on the available coverage. It is therefore important to report coverage over the entire coding and promoter intervals, when discussing SNP effect.

Important exception are known and heavily conserved mutations, which for example are involved in pseudogenization. Pseudogenization is the process through which genes lose their function due to mutations but remain in the genome without being expressed or without having a function. It is a mechanism underlying gene loss. They are significant in aDNA research because we can capture “active” versions of pseudogenes and potentially date their loss of function. It is an critical part of the genome reduction process of highly specialised bacteria.

4.5.4 Virulence Associated Intervals

One of the question ancient DNA research is interested in is the evolution of virulence in pathogenic species. The location and nature of virulence factors can vary. An increase in virulence can be caused by gene gain (on chromosomes or plasmids), plasmids, mutations, or changes to complex gene mechanisms (e.g. immune evasion). To understand the underlying processes of virulence adaptation has been one of the recurring questions in the field. Virulence intervals vary but can be found within the literature and in some cases in curated databases.

Tip

Are there loci associated with increased virulence known to the literature?
Are there phylogenetic groups associated with increased virulence?

Not all species have evolutionary dynamics which allow us to detect a temporal signal or recognise geographic structure within their phylogenies. However, phylogenetic clades can also inform us on other aspects of their evolution (virulence, ecological niche etc.). When investigating virulence, you should be checking for the presence of chromosomal Virulence factors (genes, mutations), currently this is mostly done using either as a Presence/Absence matrix or a cluster analysis. As well as investigating the presence/absence of plasmids and plasmid mediated virulence factors and functional mechanisms, and relevant SNPs.

Beyond virulence, some genes, and gene combinations can also inform us on changes in disease phenotype and microbial adaptation to host or ecological niche.

4.6 Coinfections, Multi-strain Infections etc.

The detection and study of co-infection should also be considered. While it may not always be possible to reconstruct full genomes for each pathogen, the knowledge of the presence of multiple additional infectious agents alone can be very informative. Additionally, the detection of multi-strain infections (in high coverage genomes), should also be considered, especially in the case of multi-focal infections for which multiple samples are available.

Reconstructing disease phenotype & etiopathology:

Where was the genome isolated from?
- How can this relate to the disease phenotype?
Do we have knowledge of potential coinfection? Strain differences?
Do we have knowledge of the extent of affected tissues within the host body?
Can we conclude whether it is likely to be a chronic or an acute infection?
In what demographic cohort would the individual(s) belong to?
- What could this tell you about the course of the disease or the likelihood of acute infections?
Does the host genomes show any clinically relevant variants which could impact how the pathogen would affect them?

FINALLY: Integrate the data in a synthesis of all available data types. E.g.: osteology, archaeology, isotopes etc.

4.7 Resources

Some useful websites:

GenBank NCBI Database: https://www.ncbi.nlm.nih.gov/genbank/
The European Nucleotide Archive (ENA): https://www.ebi.ac.uk/ena/browser/home
The virulence factor database (VFDB): http://www.mgc.ac.cn/VFs/main.htm
Viral Neighbour Genomes In The Assembly Resource (NCBI): https://www.ncbi.nlm.nih.gov/genome/viruses/about/assemblies/
NCBI Taxonomic Browser: https://www.ncbi.nlm.nih.gov/taxonomy
BacDiv: https://bacdive.dsmz.de/
Bacterial And Viral Bioinformatics Resource Center: https://www.bv-brc.org/

4.8 Questions to think about

Get to know your genome!

Is the genome circular or linear?
Does the species carry any plasmids?
How genomically diverse are the genomes of the species?
Does the species share large portions of its genome with closely related environmental or commensal microbial organisms? If yes could they also be present in the data?

Get to know your species!

Is the species clonal or heavily recombinant? Overall? Within clades?
Is the pangenome open or closed? How large is it?
Has the genome undergone a reductive evolution?
Is the genome plastic but maintains genome size?
Are there significant genomic/phenotypic differences across clades?
Is your samples grouping within modern diversity or basal too it?

Depending on the organism, available data and databases will be very different…

What data is available?
What type of data is available?
What metadata is available?

Reconstructing disease phenotype & etiopathology:

Where was the genome isolated from?
- How can this relate to the disease phenotype?
Do we have knowledge of potential coinfection? Strain differences?
Do we have knowledge of the extent of affected tissues within the host body?
Can we conclude whether it is likely to be a chronic or an acute infection?
In what demographic cohort would the individual(s) belong to?
- What could this tell you about the course of the disease or the likelihood of acute infections?
Does the host genomes show any clinically relevant variants which could impact how the pathogen would affect them?