15 Phylogenomics

Author

Arthur Kocher and Aida Andrades Valtueña

Self guided: chapter environment setup

For this chapter’s exercises, if not already performed, you will need to download the chapter’s dataset, decompress the archive, and create and activate the conda environment.

Do this, use wget or right click and save to download this Zenodo archive: 10.5281/zenodo.13759782, and unpack.

tar xvf phylogenomics.tar.gz 
cd phylogenomics/

You can then create the subsequently activate environment with

conda env create -f phylogenomics.yml
conda activate phylogenomics

Warning

There are additional software requirements for this chapter

Please see the relevant chapter section in Before you start before continuing with this chapter.

15.1 Introduction

15.1.1 Definitions and concepts

Phylogenetics is the study of evolutionary relationships between entities such as species, genes or individuals (but also non-biological entities such as languages), through the reconstruction of phylogenetic trees. Phylogenetic trees are composed of branches (or edges) representing continuous lines of descent (or lineages), nodes representing the split of an ancestral lineage into descending lineages, and tips (or leaves) representing the sampled entities (Baum et al. 2008) (Figure 15.1).

Figure 15.1: Main elements of a phylogenetic tree.

In a phylogenetic tree, a group containing all descendants of a given ancestor (or node), is called a monophyletic group, or a clade (Figure 15.2). Any group that is not monophyletic is called paraphyletic.

Figure 15.2: Monophyletic *versus* paraphyletic group.

The oldest node in a phylogenetic tree is called the root (Figure 15.3). It corresponds to the most recent common ancestor (MRCA) of the whole group under consideration. The root is not defined in all phylogenetic trees. If so, it means that the tree is oriented in time, such that we can define which node is the oldest. We call such tree a rooted tree. Note that time-trees are rooted by definition (but the opposite is not true). Substitution trees are in principle unrooted, as these contain information about the genetic distance between nodes, but not about the relative age of nodes. Unrooted trees are best represented in a star-shaped manner (although visualization softwares such as figtree present all trees as if they were rooted, by default).

Figure 15.3: Rooted *versus* unrooted tree.

An unrooted tree can be turned into a rooted tree if one can define an outgroup among the analysed entities (Figure 15.4). An outgroup is a tip or a clade that can be confidently considered to not belong to the group formed by the rest of the tips (the ingroup). Typically, this can be an individual from a different species than the one under consideration. If one can define such an outgroup a priori, then the root of the tree can be placed on the branch between the outgroup and the ingroup. Outgroups can be included intentionally in phylogenetic analyses for this purpose.

Figure 15.4: Outgroups can be used to root unrooted phylogenies.

15.1.2 Phylogenetics: what for?

Phylogenetic analyses are useful for a variety of applications, but its primary goal is to characterise and classify the observed biological diversity with an evolutionary perspective. Thus, phylogenies are essential to organise and interpret all knowledge about biological organisms in the light of evolution (Figure 15.5).

Figure 15.5: Phylogenetics is essential for classifying biological diversity.

Furthermore, phylogenetic trees can be used in combination with various information about the sampled entities to reconstruct the characteristics of past lineages and changes that have occurred during the course of evolution (Figure 15.6). Typically, phylogenies can be combined with information about the geographic location of sampled entities to reconstruct the geographic spread of the studied organism in the past. Similarly, information about the host-species of pathogen strains can be used to identify past cross-species transmission events. Similar approaches can be applied to genetic features to map genomic changes such as gene loss or gain on the phylogeny. Note that if the phylogeny is calibrated in time, the date of these events can be estimated.

Figure 15.6: Various features can mapped onto phylogenetic trees.

Phylogenies can also be used to obtain information about past epidemiological or population dynamics (Figure 15.7). Typically the time to the most recent common ancestor (tMRCA) of pathogen strains can be considered as an upper date limit for the emergence of the pathogen in the considered population. Using phylodynamic approaches, population or epidemiological parameters can be formally estimated through the phylogeny.

Figure 15.7: Timetrees can be used to reconstruct past epidemiological dynamics.

15.1.3 Phylogenetic analysis of ancient genomes

Ancient genomes are powerful data for phylogenetic inference for several reason. The most obvious one is that they might allow uncovering lineages that have gone extinct in the past, thereby including in the phylogeny parts of the evolutionary history that are invisible from modern data alone. Furthermore, by providing direct observations from the past, ancient genomes may permit accurate reconstruction of past evolutionary events and phylogeographic scenarios which can be otherwise impossible to unveil (Figure 15.8).

Figure 15.8: Ancient genomes can unveil past evolutionnary and phylogeographic events.

Another fundamental advantage of ancient genomes for phylogenetic inference is that they can be used to estimate the rate of genetic changes of the studied group and calibrate phylogenies in time (Figure 15.9) using molecular clock models (as performed in the last part of the practical). This is often difficult or impossible to do with modern sequences data alone, except for fast-evolving entities like RNA viruses.

Figure 15.9: Ancient genomes allow the calibration of phylogenetic trees in time.

15.1.4 How to estimate phylogenetic trees?

Before reconstructing a phylogenetic tree from a set of DNA sequences, these sequences need to be aligned. Sequence alignment consist in rearranging sequences to ensure that homologous nucleotide positions (i.e. descending from a common ancestor) are aligned and compared (Figure 15.10). To do so, gaps are inserted in sequences to accommodate genetic insertions or deletions that might have occurred in these sequences during the course evolution. Although it will not be covered in this practical, sequence alignment is an essential first step for phylogenetic analyses (or any comparative genetic analysis), and can be performed in different ways depending on the starting material. If one starts from already assembled gene or genome sequences, multiple sequence alignment can be performed using softwares such as clustal or MAFFT. When analysing NGS data, one can map all data to the same reference genome and produce combined genotyping data and sequence alignments using softwares such as MultiVCFAnalyzer (as performed in the Genome Mapping practical).

Figure 15.10: Sequence alignment is an essential first step of phylogenetic analyses.

Various methods can be employed to infer a phylogenetic tree from a DNA sequence alignment (Figure 15.11). Distance-based methods are based on pairwise genetic distance matrices derived from the sequences rather than from the sequences themselves. This includes techniques like Neighbour-Joining and UPGMA. Maximum parsimony is another method in which it is assumed that the most likely tree is the one minimizing the number of changes necessary to explain the observed sequences. Finally probabilistic methods are methods based on an explicit probabilistic model of evolution which allows calculating the probability of the observed sequence data given a certain phylogenetic tree. This measure is referred to as the phylogenetic likelihood. Probabilistic approaches are, by far, the most popular phylogenetic approaches nowadays, and fall into two categories: Maximum Likelihood (ML) and Bayesian approaches.

Figure 15.11: Phylogenetic inference methods.

An important component of phylogenetic models is the genetic substitution model (Yang 2014). Substitution models are used in distance-based and probabilistic phylogenetic approaches (Figure 15.12). In essence, these models allow estimating genetic distances between sequences accounting for cases in which multiple substitutions might have occurred at the same sites. Different substitution models exist allowing for more or less complexity, from the Jukes-Cantor (JC69) model in which all types of substitutions are assumed to occur at the same rate to General Time-Reversible (GTR) model (which is the GTR model) in which each type of substitution occurs at a different rate.

Figure 15.12: Nucleotide substitution models.

15.2 Practical

15.2.1 Visualise the sequence alignment

In this practical session, we will work with an alignment produced as you learned in the practical Genome mapping.

What is in the data?

The alignment contains 33 Yersinia pestis sequences and 1 Yersinia pseudotuberculosis sequence which can be used as an outgroup
It is a SNP alignment: it contains only the variable genomic positions, not the full genomes
In this practical, we will investigate the phylogenetic position of four prehistorical Y. pestis strains that we have recently discovered: KZL002, GZL002, CHC004 and VLI092

We start by exploring the alignment in MEGA (Tamura, Stecher, and Kumar 2021). Open the MEGA desktop application in the terminal.

mega &

Tip

Adding & at the end of a command line allows to run a program in the background while letting the terminal accessible. This is particularly useful when starting a graphical interface from the terminal.

Then, load the alignment by clicking on File -> Open A File/Session -> Select the snpAlignment_session5.fasta (in the working directory of the session).

Figure 15.13: Loading the SNP alignment in MEGA

It will ask you what you want to do with the alignment. In MEGA you can also align sequences. However, since our sequences are already aligned, press onAnalyze.

Then, selectNucleotide Sequences since we are working with a DNA alignment. Note thatMEGA can also work with Protein Sequences as well as Pairwise Distance Matrix (which we will cover shortly). In the same window, change the character forMissing Data to N and clickOK.

Figure 15.14: Defining sequence type in MEGA.

A window would open up asking if our alignment contains protein encoding sequences, and we will selectNo.

To explore the alignment, click on the box with TA.

Figure 15.15: Opening the alignment in MEGA.

You should see an alignment containing sequences from the bacterial pathogen Yersinia pestis. Within the alignment, we have four sequences of interest (KZL002, GZL002, CHC004 and VLI092) that date between 5000-2000 years Before Present (BP), and we want to know how they relate to the rest of the Yersinia pestis genomes in the alignment.

Figure 15.16: Visualizing the alignment in MEGA.

Question

How many sequences are we analysing?

Answer

34 sequences

Question

What are the Ns in the sequences?

Answer

They represent positions where we have missing data. We toldMEGA to encode missing positions as N.

Question

What do you think the dots represent?

Tip

The first line is a consensus sequence: it indicates the nucleotide supported by the majority of the sequences in the alignment (90% of the sequences should agree, otherwise an N is displayed)

Answer

Dots represent positions in which sequences are identical to the reference the consensus

Question

Once you know this, can you already tell by looking at the alignment which sequence is the most divergent (scroll down)

Answer

We can easily see that the last sequence in the alignment (Y. pseudotuberculosis) contains more disagreements to the consensus. This is normal since this is the only genome not belonging to the Y. pestis species: we will use it as an outgroup

15.2.2 Distance-based phylogeny: Neighbour-Joining

The Neighbour-Joining (NJ) method is an agglomerative algorithm which can be used to derive a phylogenetic tree from a pairwise distance matrix. In essence, this method groups taxa that have the shortest distance together first, and repeats this operation iteratively until all the taxa/sequences included in your alignment have been placed in a tree.

Here are the details of the calculations for a small NJ tree example with 6 taxa:

Figure 15.17: Neighbour-Joining algorithm.

Luckily, you won’t have to do this by hand since MEGA allows you to build a NJ tree. For that go back to MEGA and click on the Phylogeny symbol (toolbar of the main menu window) and then select Construct Neighbour Joining Tree. Click Yes when you are asked if you want to use the currently active data. In the window that pop up, you will then change theModel/Method to p-distance. Then press OK and a window with the calculated phylogenetic tree will pop up.

Figure 15.18: Constructing a NJ tree in MEGA.

p-distances?

A NJ tree can be built from any type of distances. This includes: - p-distances (also called raw distances): these are simply the proportion of differences between sequences - Corrected distances: these are based on an underlying substitution model (JC69, K80, GTR…) and account for multiple substitutions at the same sites (which would result in only one visible difference) - p-distances and corrected distances should be similar when the number of substitutions is low compared to the genome length note: a “substitution” is a type of mutation in which a nucleotide is replaced by another.

Since the tree is not easily visualised in MEGA, we will export it in newick format (a standard text format for trees) and open it in figtree. This tool has a better graphical interface for visualizing and manipulating trees.

To do that, click on File, then Export current tree (Newick) and click on Branch Lengths to include those in the newick annotation. When you press OK, a new window with the tree in newick format will pop up. Then, press File -> Save and saved it as NJ_tree.nwk. You can then close the text editor and tree explorer windows (no need to save the session).

Figure 15.19: Exporting the tree in newick format.

As said above, we will explore oun NJ tree in figtree. Open the software (if you use the same terminal window as the one in which you ran MEGA, you might have to press enter first).

figtree &

Then, open the NJ tree by clicking on File -> Open and selecting the file with the NJ tree NJ_tree.nwk.

Note that even though a root is displayed by default in figtree, NJ trees are actually unrooted. We know that Yersinia pseudotuberculosis (labelled here as Y. pseudotuberculosis) is an outgroup to Yersinia pestis. You can reroot the tree by selecting Y.pseudotuberculosis and pressing Reroot.

Figure 15.21: Rerooting a tree in FigTree

Now we have a rooted tree.

Question

How much time did the NJ-tree calculation take?

Answer

~1 second

Question

How many leaves/tips has our tree?

Answer

34, i.e. the number of sequences in our SNP alignment.

Question

Where are our taxa of interest? (KZL002, GZL002, CHC004 and VLI092)

Answer

They all fall ancestral to the rest of Yersinia pestis in this tree.

Question

Do they form a monophyletic group (a clade)?

Answer

Yes, they form a monophyletic group. We can also say that this group of prehistoric strains form their own lineage.

15.2.3 Probabilistic methods: Maximum Likelihood and Bayesian inference

These are the most commonly used approach today. In general, probabilistic methods are statistical techniques assuming that the observed data is generated through a stochastic process with a probablility depending on a set of parameters which we want to estimate. The probability of the data given the model parameters is called the likelihood.

Figure 15.22: Graphical representation of a probabilistic model.

Question

In a phylogenetic probabilistic model, what are the data and what are the parameters?

Answer

In a phylogenetic probabilistic model, the data is the sequence alignment and the parameters, are:

The parameters of the chosen substitution model (substitution rates and base frequencies)
The phylogenetic tree

Figure 15.23: Phylogenetic probabilistic model (see Cornuault and Sanmartín (2022) for more details).

15.2.4 Maximum likelihood estimation and bootstrapping

One way to make inferences from a probabilistic model is by finding the combination of parameters which maximises the likelihood. These parameter values are called Maximum Likelihood (ML) estimates. We are usually not able to compute the likelihood value for all possible combinations of parameters and have to rely on heuristic algorithms to find the ML estimates.

Figure 15.24: Maximum Likelihood inference.

The ML estimates are point estimates, i.e. single parameter values (for example, a tree), which does not allow to measure the associated uncertainty. A classic method to measure the uncertainty in ML trees is bootstrapping. This consists in repeatedly “disturbing” the alignment by masking sites randomly and estimating a tree from each of these bootstrap alignments.

For each clade in the ML tree, a bootstrap support value is computed which corresponds to the proportion of bootstrap trees containing the clade. This gives an indication of how confidently the clade is supported by the data (i.e. whether it is robust to small changes in the data). A bootstrap value of 70% or more is generally considered as good support.

Note

Bootstrapping can be used to measure uncertainty with any type of inference method, including distance-based methods

Let’s make our own ML tree!

Here is a command to estimate an ML phylogenetic tree together with bootstraps using RAxML((Stamatakis 2014); you may find the list of parameters in the RAxMLmanual).

raxmlHPC-PTHREADS -m GTRGAMMA -T 3 -f a -x 12345 -p 12345 -N autoMRE -s snpAlignment_session5.fasta -n full_dataset.tre

Here is the meaning of the chosen parameters:

Figure 15.26: Details of the RAxML command.

Once the analysis has been completed, you can open the tree using FigTree(RAxML_bipartitions.full_dataset.tre file, change “label” to “bootstrap support” at the prompt).

Figure 15.27: Changing the label of boostrap support values in FigTree.

The tree estimated using this model is a substitution tree (branch lengths represent genetic distances in substitutions/site). As for the NJ tree, it is not oriented in time: this is an unrooted tree (displayed with a random root in Figtree). You can reroot the tree in figtree using Y. pseudotuberculosis as an outgroup, as previously.

Question

Can you confirm the position of our genomes of interest (KZL002, GZL002, CHC004 and VLI092)?

Answer

Yes. Just as in the NJ tree, they form a clade which is basal to the rest of the Y. pestis diversity.

Question

Is that placement well-supported (look at the bootstrap support value: click on the “Node Labels” box and open the drop-down menu, change “Node ages” to “bootstrap support”)?

Answer

The placement is strongly supported as indicated by a bootstrap support of 100% for this clade (it is not very easy to see, you probably need to zoom in a bit)

Figure 15.28: Boostrap support value for the placement of ancient *Y. pestis* genomes.

You can notice that the phylogeny is difficult to visualise due to the long branch leading to Y. pseudotuberculosis Having a very distant outgroup can also have deleterious effects on the estimated phylogeny (due to the so-called “long branch attraction” effect). We can construct a new phylogeny after removing the outgroup: - Go back to the original alignment in mega (that we used for the Neighbour Joining tree), untick Y.pseudotuberculosis, and export in fasta format (Data -> Export Data and change Format to Fasta and click Ok; you can save it as: snpAlignment_without_outgroup.fas) - Run raxml on this new alignment (change input to snpAlignment_without_outgroup.fas and output prefix (-n) to without_outgroup in the command line) - Open the bipartition... file in figtree and reroot the tree based on the knowledge we have gained previously: place the root on the branch leading to the prehistoric Y. pestis strains (KZL002, GZL002, CHC004 and VLI092).

Figure 15.29: Substitution tree estimated with RAxML without the *Y. pseudotuberculosis* outgroup.

Lastly, we will export the rooted tree from figtree: File -> Export trees -> select the save as currently displayed box and save as ML_tree_rooted.tre (this will be useful for the section Temporal signal assessment at the end of this tutorial)

Figure 15.30: Exporting a tree after re-rooting in FigTree.

15.2.5 Estimating a time-tree using Bayesian phylogenetics (BEAST2)

Now, we will try to use reconstruct a phylogeny in which the branch lengths do not represent a number of substitutions but instead represent the time of evolution. To do so, we will use the dates of ancient genomes (C14 dates) to calibrate the tree in time. This assumes a molecular clock hypothesis in which substitutions occur at a rate that is relatively constant in time so that the time of evolution can be estimated based on the number of substitutions.

Note

A great advantage of ancient pathogen genomes is that they provide calibration points to estimate molecular clocks and dated phylogenies. This might not be possible to do with modern data alone.

We will estimate a time-tree from our alignment using Bayesian inference as implemented in the BEAST2 software (Bouckaert et al. 2014; Drummond and Bouckaert 2015). Bayesian inference is based on a probability distribution that is different from the likelihood: the posterior probability. The posterior probability is the probability of the parameters given the data. It is easier to interpret than the likelihood because it directly contains all the information about the parameters: point estimates such as the median or the mean can be directly estimated from it, but also percentile intervals which can be used to measure uncertainty.

Figure 15.31: Posterior probability of model parameters.

The Bayes theorem tells us that the posterior probability is proportional to the product of the likelihood and the prior probability of the parameters:

Therefore, for Bayesian inference, we need to complement our probabilistic model with prior distributions for all the parameters. Because we want to estimate a time tree, we also add another parameter: the molecular clock (average substitution rate in time units).

Figure 15.33: Bayesian phylogenetic model.

To characterise the full posterior distribution of each parameter, we would need in theory to compute the posterior probability for each possible combination of parameters. This is impossible, and we will instead use an algorithm called Markov chain Monte Carlo (MCMC) to approximate the posterior distribution. The MCMC is an algorithm which iteratively samples values of the parameters from the posterior distribution. Therefore, if the MCMC has run long enough, the posterior distribution of the parameters can be approximated by a histogram of the sampled values.

Figure 15.34: MCMC algorithm used to sample the posterior distribution of parameters (credit: Paul O. Lewis)

Tip

The “Taming the beast” website (https://taming-the-beast.org/tutorials/) has great tutorials for various analyses that can be done with BEAST2. In particular, the “Introduction to BEAST2”, “Prior selection” and “Time-stamped data” are good starts.

The different components of the BEAST2 analysis can be set up in the program BEAUti:

Figure 15.35: Different component of a Bayesian phylogenetic analysis with BEAST.

Open BEAUTi (if asked to update, press ‘not now’).

beauti &

Set up an analysis as follows:

Load the alignment without outgroup in the Partitions tab (File -> Import alignment; select nucleotide)

Figure 15.36: Partitions panel in Beauti.

Set the sampling dates in the Tip dates tab:
- Select Use tip dates
- Click on Auto-configure -> Read from file and select the sample_ages.txt file
- Change Since some time in the past to Before present

Figure 15.37: Tip dates panel in Beauti.

Select the substitution model in the Site model tab:
- Chose a GTR model
- Use 4 Gamma categories for the Gamma site model: this is to account for variations of the substitution rate across sites (site=nucleotide position)

Figure 15.38: Site model panel in Beauti.

choose the molecular clock model in the Clock model tab:
- Use a relaxed clock lognormal model (this is to allow for some variation of the clock rate across branches)
- Change the initial value of the clock rate to 10^-4 substitution/site/year (10^-4 can be written 1E-4)

Figure 15.39: Clock model panel in Beauti.

Choose the prior distribution of parameters in the Priors tab:
- Use a Coalescent Bayesian Skyline tree prior
- Change the mean clock prior to a uniform distribution between 1E-6 and 1E-3 subst/site/year
- Leave everything else to default

Set up the MCMC in the MCMC tab:
- Use a chain length of 300M
- Sample the mono-dimensional parameters and trees every 10,000 iterations (unfold tracelog and treelog menus and change “log every” to 10,000)

Save the analysis setup as an xml file: File -> Save as; you can name the file beast_analysis_Y_pestis.xml

Now that the analysis is setup, we can run it using BEAST.

beast -java beast_analysis_Y_pestis.xml

Once the analysis is running two files should have been created and are continuously updated:

The snpAlignment_without_outgroup.log file which contains the values sampled by the MCMC for various mono-dimensional parameters such as the clock rate, as well as other values that are a logged along the MCMC such as the posterior probability and the likelihood
The snpAlignment_without_outgroup.trees file which contains the MCMC trees sampled by the MCMC

While the analysis is running, you can start reading the next section!

15.2.6 Assessing BEAST2 results

Reminder

We are using an MCMC algorithm to sample the posterior distribution of parameters. If the MCMC has run long enough, we can use the sampled parameters to approximate the posterior distribution itself. Therefore, we have to check first that the MCMC chain has run long enough.

We can assess the MCMC sampling using the program Tracer (Rambaut et al. 2018). Tracer can read BEAST log files an generate statistics and plots for each of the sampled parameters. Most importantly, Tracer provides:

Trace plots: show the sampled parameter values along the MCMC run

Trace plots are a very useful MCMC diagnostic tool.

Figure 15.42: Trace plot visualised in Tracer.

The first thing that one needs to assess is whether the MCMC has passed the so called “burn-in” phase. The MCMC starts with a random set of parameters and will take some time to reach a zone of high posterior probability density. The parameter values that are sampled during this initial phase are usually considered as noise and discarded (by default, tracer discards the first 10% of samples). The burn-in phase can be visualised on trace plots as an initial phase during which the posterior probability of sampled parameters is constantly increasing before reaching a plateau:

Figure 15.43: Burnin phase of the MCMC.

Once the burn-in phase is passed, one can look at the trace plots to assess if the parameters have been sampled correctly and long enough. Usually, when this is the case, the trace should be quite dense and oscillating around a central value (nice trace plots should look like “hairy caterpillars”). In the figure below, the trace on the left doesn’t look good, the one on the right does:

Figure 15.44: Assessing correct MCMC sampling using trace plots.
ESS values: tracer also calculates effective sample sizes (ESS) for each of the sampled parameters.

ESSs are estimates of the number of sampled parameter values after correcting for auto-correlation along the MCMC. As a rule of thumb, one usually considers that an MCMC has run long enough if all parameter’s ESS are > 200. Note that if the trace looks like a hairy caterpillar, the corresponding ESS value should be high.

Figure 15.45: Assessing effective sample sizes (ESS) in Tracer.
Parameter estimates: Tracer also provides statistics and plots to explore the posterior distribution of the parameters.

These should be considered only if the trace plot and ESS values look fine. In the Estimates tab, after selecting the chosen parameter in the left panel, the upper-right panel shows point estimates (mean, median) and measures of uncertainty (95% HPD interval), and the bottom-right panel shows a histogram of the sampled value:

Figure 15.46: Marginal estimates of parameters in Tracer.

Let’s now load the snpAlignment_without_outgroup.log file into Tracer.

While the run is still going, open a new separate terminal and activate the conda environment.

conda activate phylogenomics

Then, open Tracer.

tracer &

Open the log file: File -> Import trace file and select snpAlignment_without_outgroup.log. Note that one can load a BEAST2 log file into Tracer even if the analysis is still running. This allows to assess if the MCMC is running correctly or has run long enough before it’s completed.

Question

Has the MCMC run long enough?

Answer

You have probably run your analysis for 10-20 mins before looking at the log file, and this is definitely not sufficient: the burnin phase has recently been passed, the trace plots do not look very dense and ESS values are low. It would probably take a few hours for the analysis to complete. Luckily we have run the analysis in advance and saved the log files for you in the intermediateFiles folder: snpAlignment_without_outgroup.log and snpAlignment_without_outgroup.trees.

Figure 15.47: Trace plot indicating unsufficient MCMC sampling.

You can now load the intermediateFiles/snpAlignment_without_outgroup.log file into Tracer.

Question

Has the MCMC run long enough?

Answer

Yes! The trace plots look good and all ESSs are > 200.

Figure 15.48: Trace plot indicating sufficient MCMC sampling.

Question

What is your mean estimate of the clock rate (ucld mean)?

Answer

~7.10^-6 substitution/site/year. Note, however, that this estimate is largely biased since we used a SNP alignment containing only variable positions. In order to get an unbiased estimate of the substitution rate, we should have used the full alignment or account for the number of constant sites by using a “filtered” alignment (see here: https://www.beast2.org/2019/07/18/ascertainment-correction.html). In general, this is good practice since not accounting for conserved positions in the alignment can sometimes affect the tree topology as well (although this should usually be minor).

Figure 15.49: Mean estimate of the clock rate.

15.2.7 MCC tree

Since we are working in a Bayesian framework, we do not obtain a single phylogenetic tree as with Maximum likelihood, but a large set of trees which should be representative of the posterior distribution. In contrast with mono-dimensional parameters, a tree distribution cannot be easily summarised with mean or median estimates. Instead, we need to use specific tree-summarizing techniques. One of the most popular is the maximum clade credibility (MCC) tree (Heled and Bouckaert 2013), which works as follows:

For any node in any of the sampled trees, compute a posterior support: the proportion of trees in the sample which contain the corresponding clade
Select the MCC tree: this is the tree in which the product of node posterior supports is the highest
Calculate node/branch statistics on the MCC tree: typically, the mean/median estimates and HPD interval of node ages are calculated from the full tree sample and annotated on the MCC tree

Figure 15.50: Maximum clade credibility tree as a summary of the tree posterior distribution.

Let’s generate an MCC tree from our tree sample. We can do this using the TreeAnnotator software, which has both a command line and graphical interface. Let’s use the command line here and run the following (using a burn-in of 10%).

 treeannotator -burnin 10 intermediateFiles/snpAlignment_without_outgroup.trees snpAlignment_without_outgroup.MCC.tree

Once this is completed, we can open the MCC tree (snpAlignment_without_outgroup.MCC.tree) with figtree. Let’s then add a few elements to the plot:

Tick the Scale Axis box, unfold the corresponding menu, and select Reverse axis (now the timescale is in years BP)
Tick the Node Labels box, unfold the corresponding menu, and select Display: posterior The posterior support of each node is now displayed Note that the support value is a proportion (1=100%)
Tick the Node Bars box, unfold the corresponding menu, and select Display: height_95%_HPD The 95% HPD intervals of node ages are now displayed

Figure 15.51: Maximum clade credibility tree of *Y. pestis*.

Question

Is the root of the tree consistent with what we found previously?

Answer

Yes! The root is placed between our prehistoric strains and the rest of Y. pestis strains. Note that this time we didn’t have to use an outgroup because we estimated a time-tree: the root is identified as the oldest node in the tree.

Question

What is your estimate for the age of the most recent common ancestor of all Y. pestis strains?

Answer

~5800 years BP (HPD 95%: ~8000-4500 years BP)

15.2.8 Bonus: Temporal signal assessment

It is a good practice to assess if the genetic sequences that we analyse do indeed behave like molecular clocks before trying to estimate a time tree (i.e. we should have done this before the actual BEAST2 analysis). A classic way to assess the temporal signal of a dataset is the root-to-tip regression (Rambaut et al. 2016). The rationale of the root-to-tip regression is to verify that the older a sequence is, the closer it should be to the root in a (rooted) substitution tree because there was less time for substitution to accumulate. In other words, there should be a correlation between sample age and distance to the root, which we can assess using a linear regression (root-to-tip regression). This can be done using the program TempEst: 1. Open TempEst

```bash
tempest &
```

Load the rooted ML tree that we produced previously (you should have saved it as ML_tree_rooted.tre)
Click on Import Dates in the Sample Dates tab, select the sample_age.txt file and click on OK
Still in the Sample Dates tab, change Since some time in the past to Before present (one might need to extend the TempEst window to see the pull down menu)
Look at the Root-to-tip tab: is there a positive correlation between time and root-to-tip divergence as expected under the molecular clock hypothesis?

Figure 15.52: Root-to-tip regression using TempEst.

15.3 (Optional) clean-up

Let’s clean up your working directory by removing all the data and output from this chapter.

You can close all windows and any terminals. If you have a terminal with a command still running, just press ctrl + c a couple of times until it drops you to an empty prompt.

Once completed, the command below will remove the /<PATH>/<TO>/phylogenomics directory as well as all of its contents.

Pro Tip

Always be VERY careful when using rm -r. Check 3x that the path you are specifying is exactly what you want to delete and nothing more before pressing ENTER!

rm -r /<PATH>/<TO>/phylogenomics*

Once deleted you can move elsewhere (e.g. cd ~).

We can also get out of the conda environment with.

conda deactivate

To delete the conda environment.

conda remove --name phylogenomics --all -y

15.4 Summary

Phylogenetics is the study of the evolutionnary history and relationships between biological entities
Phylogenetic analyses are essential to classify and organise all knowledge about biological organisms in the light of evolution
Phylogenetic trees can be used to trace back various features along past lineages, such as the geographic location, host species or genomic content
Ancient genomes permit more accurate phylogenetic reconstructions and the study of past extinct lineages
Ancient genomes allow estimating molecular clock rates and the scaling of phylogenetic trees to time
Phylogenetic inference relies on the comparison of homologous characters, such as aligned DNA sequences
Probabilistic methods, including Maximum Likelihood and Bayesian inference, are the most popular phylogenetic approaches today

15.5 References

Baum, David et al. 2008. “Reading a Phylogenetic Tree: The Meaning of Monophyletic Groups.” Nature Education 1 (1): 190. https://www.nature.com/scitable/topicpage/reading-a-phylogenetic-tree-the-meaning-of-41956/.

Bouckaert, Remco, Joseph Heled, Denise Kühnert, Tim Vaughan, Chieh-Hsi Wu, Dong Xie, Marc A. Suchard, Andrew Rambaut, and Alexei J. Drummond. 2014. “BEAST 2: A Software Platform for Bayesian Evolutionary Analysis.” PLoS Computational Biology 10 (4): e1003537. https://doi.org/10.1371/journal.pcbi.1003537.

Cornuault, Josselin, and Isabel Sanmartín. 2022. “A Road Map for Phylogenetic Models of Species Trees.” Molecular Phylogenetics and Evolution 173 (April): 107483. https://doi.org/10.1016/j.ympev.2022.107483.

Drummond, Alexei J., and Remco R. Bouckaert. 2015. Bayesian Evolutionary Analysis with BEAST. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9781139095112.

Heled, Joseph, and Remco R. Bouckaert. 2013. “Looking for Trees in the Forest: Summary Tree from Posterior Samples.” BMC Evolutionary Biology 13 (1): 221. https://doi.org/10.1186/1471-2148-13-221.

Rambaut, Andrew, Alexei J Drummond, Dong Xie, Guy Baele, and Marc A Suchard. 2018. “Posterior Summarization in Bayesian Phylogenetics Using Tracer 1.7.” Systematic Biology 67 (5): 901–4. https://doi.org/10.1093/sysbio/syy032.

Rambaut, Andrew, Tommy T. Lam, Luiz Max Carvalho, and Oliver G. Pybus. 2016. “Exploring the Temporal Structure of Heterochronous Sequences Using TempEst (Formerly Path-O-Gen).” Virus Evolution 2 (1): vew007. https://doi.org/10.1093/ve/vew007.

Stamatakis, Alexandros. 2014. “RAxML Version 8: A Tool for Phylogenetic Analysis and Post-Analysis of Large Phylogenies.” Bioinformatics 30 (9): 1312–13. http://bioinformatics.oxfordjournals.org/content/early/2014/01/21/bioinformatics.btu033.short.

Tamura, Koichiro, Glen Stecher, and Sudhir Kumar. 2021. “MEGA11: Molecular Evolutionary Genetics Analysis Version 11.” Molecular Biology and Evolution 38 (7): 3022–27. https://doi.org/10.1093/molbev/msab120.

Yang, Ziheng. 2014. Molecular Evolution: A Statistical Approach. Oxford: Oxford University Press.