18 Ancient Metagenomic Pipelines
For this chapter’s exercises, if not already performed, you will need to download the chapter’s dataset, decompress the archive, and create and activate the conda environment.
Do this, use wget
or right click and save to download this Zenodo archive: 10.5281/zenodo.8413239, and unpack
tar xvf ancient-metagenomic-pipelines.tar.gz
cd ancient-metagenomic-pipelines/
You can then create the subsequently activate environment with
conda env create -f ancient-metagenomic-pipelines.yml
conda activate ancient-metagenomic-pipelines
There are additional software requirements for this chapter
Please see the relevant chapter section in Before you start before continuing with this chapter.
18.1 Introduction
A pipeline is a series of linked computational steps, where the output of one process becomes the input of the next. Pipelines are critical for managing the huge quantities of data that are now being generated regularly as part of ancient DNA analyses. In this chapter we will go through three dedicated ancient DNA pipelines - all with some (or all!) functionality geared to ancient metagenomics - to show you how you can speed up the more routine aspects of the basic analyses we’ve learnt about earlier in this text book through workflow automation.
We will introduce:
- nf-core/eager - a generalised aDNA ‘workhorse’ pipeline that can do both ancient genomics and (basic) metagenomics (Fellows Yates et al. 2021)
- aMeta - a pipeline for resource efficient and accurate ancient microbial detection and authentication (Pochon et al. 2022)
- nf-core/mag - a de novo metagenomics assembly pipeline (Krakau et al. 2022) that includes a dedicated ancient DNA mode for damage correction and validation.
Keep in mind that that there are many other pipelines that exist, and picking which one often they come down to personal preference, such as which functionality they support, which language they are written in, and whether their computational requirements can fit in your available resources.
Other examples of other ancient DNA genomic pipelines include Paleomix (Schubert et al. 2014), and Mapache (Neuenschwander et al. 2023), and for ancient metagenomics: metaBit (Louvel et al. 2016) and HAYSTAC (Dimopoulos et al. 2022).
18.2 Workflow managers
All the pipelines introduced in this chapter utilise workflow managers. These are software that allows users to ‘chain’ together the inputs and outputs distinct ‘atomic’ steps of a bioinformatics analysis - e.g. separate terminal commands of different bioinformatic tools, so that ‘you don’t have to’. You have already seen very basic workflows or ‘pipelines’ when using the bash ‘pipe’ (|
) in the Introduction to the Command Line chapter, where the each row of text the output of one command was ‘streamed’ into the next command.
However in the case of bioinformatics, we are often dealing with non-row based text files, meaning that ‘classic’ command line pipelines don’t really work. Instead this is where bioinformatic workflow managers come in: they handle the passing of files from one tool to the next but in a reproducible manager.
Modern computational bioinformatic workflow managers focus on a few main concepts. To summarise Wratten, Wilm, and Göke (2021), these areas are: data provenance, portability, scalability, re-entrancy - all together which contribute to ensuring reproducibility of bioinformatic analyses. Data provenance refers to the ability to track and visualise where each file goes and gets processed, as well as metadata about each file and process (e.g., What version of a tool was used? What parameters were used in that step? How much computing resources were used). Portability follows from data provenance where it’s not just can the entire execution of the pipeline be reconstructed - but can it also be run with the same results on a different machine? This is important to ensure that you can install and test the pipeline on your laptop, but when you then need to do heavy computation using real data, that it will still be able to execute on a high-performance computing cluster (HPC) or on the cloud - both that have very different configurations. This is normally achieved through the use of reusable software environments such as as conda or container engines such as docker, and tight integration with HPC schedulers such as SLURM. As mentioned earlier, not having to run each command manually can be a great speed up to your analysis. However this needs to be able to be Scalable that it the workflow is still efficient regardless whether you’re running with one or ten thousands samples - modern workflow managers perform resource requirement optimisation and scheduling to ensure that all steps of the pipeline will be executed in the most resource efficient manner so it completes as fast as possible - but regardless of of the number of input data. Finally, as workflows get bigger and longer, re-entrancy has become more important, i.e., the ability to re-start a pipeline run that got stuck halfway through due to an error.
All workflow managers have different ways of implementing the concepts above, and these can be very simple (e.g. Makefiles) to very powerful and abstract (e.g. Workflow Description Language). In this chapter we will use pipelines that use two popular workflow managers in bioinformatics, Nextflow and Snakemake.
This chapter will not cover how to write your own workflow, as this would require a whole other textbook. However it is recommended to learn and use workflow managers when carrying out repetitive or routine bioinformatic analysis (Nextflow and snakemake being two popular ones in bioinformatics). Use of workflow managers can help make your work more efficiently (as you only run one command, rather than each step separately), but also more reproducible by reducing the risk of user error when executing each step: the computer will do exactly what you tell it, and if you don’t change anything, will do the exact same thing every time. If you’re interested in writing your own workflows using workflow managers, many training and tutorials exist on the internet (e.g., for Nextflow there is the official training or from software carpentries, or the official training for snakemake.
18.3 What is nf-core/eager?
nf-core/eager is a computational pipeline specifically designed for preprocessing and analysis of ancient DNA data (Figure 18.1). It is a reimplementation of the previously published EAGER (Efficient Ancient Genome Reconstruction) pipeline (Peltzer et al. 2016) written in the workflow manager Nextflow. In addition to reimplementing the original genome mapping and variant calling pipeline, in a more reproducible and portable manner the pipeline also included additional new functionality particularly for researchers interested in microbial sciences, namely a dedicated genotyper and consensus caller designed for low coverage genomes, the ability to get breadth and depth coverage statistics for particular genomic features (e.g. virulence genes), but also automated metagenomic screening and authentication of the off-target reads from mapping (e.g. against the host reference genome).
A detailed description of steps in the pipeline is available as part of nf-core/eager’s extensive documentation. For more information, check out the usage documentation here.
Briefly, nf-core/eager takes at a minimum standard input file types that are shared across the genomics field, i.e., raw FASTQ files or aligned reads in bam format, and a reference fasta. nf-core/eager performs preprocessing of this raw data, including adapter clipping, read merging, and quality control of adapter-trimmed data. nf-core/eager then carries mapping using a variety of field-standard shot-read alignment tools with default parameters adapted for short and damaged aDNA sequences. The off-target reads from host DNA mapping can then go into metagenomic classification and authentication (in the case of MALT). After genomic mapping, BAM files go through deduplication of PCR duplicates, damage profiling and removal, and finally variant calling. A myriad of additional statistics can be generated depending on the users preference. Finally, nf-core eager uses MultiQC to create an integrated html report that summarizes the output/results from each of the pipeline steps.
18.3.1 Running nf-core/eager
For the practical portion of this chapter, we will utilize sequencing data from four aDNA libraries, which you should have already downloaded from NCBI. If not, please see the Preparation section above . We will use nf-core/eager to perform a typical microbial genomic analysis, i.e., reconstruction of an ancient genome to generate variant calls that can be used for generating phylogenomic trees and other evolutionary analysis, and gene feature coverage statistics to allow insight into the functional aspects of the genome.
These four libraries come from from two ancient individuals, GLZ002 and KZL002. GLZ002 comes from the Neolithic Siberian site of Glazkovskoe predmestie (Yu et al. 2020) and was radiocarbon dated to 3081-2913 calBCE. KZL002 is an Iron Age individual from Kazakhstan, radiocarbon dated to 2736-2457 calBCE (Andrades Valtueña et al. 2022). Both individuals were originally analysed for human population genetic analysis, but when undergoing metagenomic screening of the off-target reads, both set of authors identified reads from Yersinia pestis from these individuals - the bacterium that causes plague. Subsequently the libraries from these individuals were processed using hybridization capture to increase the number of Y. pestis sequences available for analysis.
Our aims in the following tutorial are to:
- Preprocess the FASTQ files by trimming adapters and merging paired-end reads
- Align reads to the Y. pestis reference and compute the endogenous DNA percentage
- Filter the aligned reads to remove host DNA
- Remove duplicate reads for accurate coverage estimation and genotyping
- Generate statistics on gene features (e.g. virulence factors)
- Merge data by sample and perform genotyping on the combined dataset
- Review quality control data to evaluate the success of the previous steps
Let’s get started!
nf-core/eager v2 requires an older version of Nextflow - if installing manually, ensure you do not have Nextflow version any greater than v22.10.6!
First, download the latest version of the nf-core/eager repo (or check for updates if you have a previously-installed version):
nextflow pull nf-core/eager
Next we can re-visit AMDirT (see Accessing Ancient Metagenomic Data) to download a pre-prepared configuration file for nf-core/eager
- Load AMDirt (
AMDirT viewer
), and select the latest release and the ‘ancientsinglegenome-hostassociated’ table - Filter the
sample_name
column to just show KZL002 and GLZ002, and select these two rows- Press the burger icon on the column
- Press the filter tab and deselect everything
- Search GLZ002 and select in filter menu
- Search KZL002 and select in filter menu
- Close filter menu and select the two rows
- Press the Validate Selection button
- Press the ‘Download Curl sample download script’, ‘Download nf-core/eager input TSV’, and ‘Download Citations as BibTex’ buttons
- Move the downloaded files into
eager/
of this tutorial’s directory- i.e., if the files were downloaded to you
Downloads/
folder rather than the chapters directory, - e.g., assuming you’ve got no other previous downloads you can run
mv ~/Downloads/AncientMetagenomeDir_* eager/
- i.e., if the files were downloaded to you
We won’t actually use the BibTex citations file for anything in this tutorial, but it is good habit to always to record and save all citations of any software or data you use!
To download the FASTQ files
- Move into the
eager/
directory - Run
bash AncientMetagenomeDir_curl_download_script.sh
to download the files (this may take ~3 minutes)
Next we now inspect the AMDirT generated input TSV file for nf-core/eager!
cat AncientMetagenomeDir_nf_core_eager_input_table.tsv
Sample_Name Library_ID Lane Colour_Chemistry SeqType Organism Strandedness UDG_Treatment R1 R2 BAM
GLZ002 ERR4093961 0 4 PE Homo sapiens double half ERR4093961_1.fastq.gz ERR4093961_2.fastq.gz NA
GLZ002 ERR4093962 0 4 PE Homo sapiens double full ERR4093962_1.fastq.gz ERR4093962_2.fastq.gz NA
KZL002 ERR8958768 0 4 PE Homo sapiens double half ERR8958768_1.fastq.gz ERR8958768_2.fastq.gz NA
KZL002 ERR8958769 0 4 PE Homo sapiens double half ERR8958769_1.fastq.gz ERR8958769_2.fastq.gz NA
Here we see 10 columns, all pre-filled. The first two columns correspond to sample/library IDs that will be used for data provenance and grouping. When you have sequencing multiple lanes you can speed up preprocessing these independently before merging, so lane can specify this (although not used in this case as we have independent libraries per sample. You can indicate the colour chemistry to indicate whether your data requires additional pre-processing to remove poly-G tails, and then also strandedness and UDG damage treatment status of the libraries if you require further damage manipulation. Finally you provide paths to the FASTQ files or BAM files
Other than the raw FASTQ files, we will need a reference genome and annotation coordinates of genes present on the genome. In this case you will use a Yersinia pestis (accession: GCF_001293415.1) reference genome (.fna
) and gene feature file (.gff
) from NCBI Genome.
These have already been placed in the reference/
directory for you.
To download the required reference genome and annotation file run the following command to download from the NCBI Genome database.
## Download from NCBI
curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_001293415.1/download?include_annotation_type=GENOME_FASTA,GENOME_GFF,RNA_FASTA,CDS_FASTA,PROT_FASTA,SEQUENCE_REPORT&filename=GCF_001293415.1.zip" -H "Accept: application/zip"
unzip *.zip
mv ncbi_dataset/data/GCF_001293415.1/* .
## We have to sort the gff file to make it eager compatible
gffread genomic.gff GCF_001293415.1_ASM129341v1_genomic.gff
With all of these, we can run the pipeline!
First lets enter a screen session to make sure we can leave the pipeline running in the background and continue using our terminal.
screen -R eager
conda activate ancient-metagenomic-pipelines
Now we can construct an eager command from within the data/
directory so that it looks like this:
nextflow run nf-core/eager -r 2.4.7 \
\
-profile docker \
--fasta ../reference/GCF_001293415.1_ASM129341v1_genomic.fna \
--input AncientMetagenomeDir_nf_core_eager_input_table.tsv \
--anno_file ../reference/GCF_001293415.1_ASM129341v1_genomic.gff --bam_unmapped_type discard \
--run_bam_filtering \
--skip_preseq --genotyping_tool ug --gatk_ug_out_mode EMIT_ALL_SITES \
--run_genotyping --run_bedtools_coverage --run_bcftools_stats
When the run starts, you see a long list of process
lines with progress bars slowly appearing over time. If you press ctrl + a and then [ to access a ‘navigation’ (called ‘copy’) mode, then use your arrow keys on your keyboard to scroll up and down. To quit this mode just press q
.
We don’t normally recommend running analyses in the directory your data is in! It is better to keep data and analysis results in separate directories. Here we are just running eager alongside the data for convenience (i.e., you don’t have to modify the downloaded input TSV)
This run will take between 20m-40m to run on a standard laptop! Time for a break, or you can continue reading this chapter as the critical output data has already been provided for you in the data directory.
So what is this command doing? The different parameters do the following:
- Tell nextflow to run the nf-core/eager pipeline with version 2.4.7
- Specify which computing and software environment to use with
-profile
- In this case we are running locally so we don’t specify a computing environment (such as a configuration for an institutional HPC)
- We use
docker
as our container engine, which downloads all the software and specific versions needed for nf-core/eager in immutable ‘containers’, to ensure nothing get broken and is as nf-core/eager expects
- Provide the various paths to the input files (TSV with paths to FASTQ files, a reference fasta, and the reference fasta’s annotations)
- Activate the various of the steps of the pipeline you’re interested in
- We turn off preseq (e.g. when you know you can’t sequence more)
- We want to turn on BAM filtering, and specify to generate unmapped reads in FASTQ file (so you could check off target reads e.g. for other pathogens)
- You turn on genotyping using GATK UnifiedGenotyper (preferred over HaplotypeCaller due to in compatibiity with that method to low-coverage data)
- We turn on variant statistics (from GATK) using bcftools, and coverage statistics of gene features using bedtools
For full parameter documentation, click here.
And now we wait…
As a reminder, to detach from the screen session type ctrl + a then d. To reattach screen -r eager
Specifically for ancient (meta)genomic data, the following parameters and options may be useful to consider when running your own data:
--mapper circularmapper
:- This aligner is a variant of
bwa aln
that allows even mapping across the end of linear references of circular genomes - Allows reads spanning the start/end of the sequence to be correctly placed, and this provides better coverage estimates across the entire genome
- This aligner is a variant of
--hostremoval_input_fastq
:- Allows re-generation of input FASTQ files but with any reads aligned to the reference genome removed
- Can be useful when dealing with modern or ancient data where there are ethical restrictions on the publication of host DNA
- The output can be used as ‘raw data’ upload to public data repositories
--run_bam_filtering --bam_unmapped_type
:- A pre-requisite for performing the metagenomic analysis steps of nf-core/eager
- Generates FASTQ files off the unmapped reads present in the reference mapping BAM file
--run_bedtools_coverage --anno_file /<path>/<to>/<genefeature>.bed
- Turns on calculating depth/breadth of annotations in the provided bed or GFF file, useful for generating e.g. virulence gene presence/absence plots
--run_genotyping --genotyping_tool ug --gatk_ug_out_mode EMIT_ALL_SITES
- Turns on genotyping with GATK UnifiedGenotyper
- A pre-requisite for running MultiVCFAnalyzer for consensus sequencing creation
- It’s recommend to use either GATK UnifiedGenotyper or freeBayes (non-MultiVCFAnalyzer compatible!) for low-coverage data
- GATK HaplotypeCaller is not recommended for low coverage data as it performs local de novo assembly around possible variant sites, and this will fail with low-coverage short-read data
--run_multivcfanalyzer --write_allele_frequencies
- Turns on SNP table and FASTA consensus sequence generation with MultiVCFAnalyzer (see pre-requisites above)
- By providing
--write_allele_frequencies
, the SNP table will also provide the percentage of reads at that position supporting the call. This can help you evaluate the level of cross-mapping from related (e.g. contaminating environmental) species and may question the reliablity of any resulting downstream analyses.
--metagenomic_complexity_filtering
- An additional preprocessing step of raw reads before going into metagenomic screening to remove low-complexity reads (e.g. mono- or di-nucleotide repeats)
- Removing these will speed up and lower computational requirements during classification, and will not bias profiles as these sequences provide no taxon-specific information (i.e., can be aligned against thousands to millions of genomes)
--run_metagenomic_screening
- Turns on metagenomic screening with either MALT or Kraken2
- If running with MALT, can supply
--run_maltextract
to get authentication statistics and plots (damage, read lengths etc.) for evaluation.
18.3.2 Top Tips for nf-core/eager success
Screen sessions
Depending on the size of your input data, infrastructure, and analyses, running nf-core/eager can take hours or even days (e.g. the example command above will likely take around 20 minutes!). To avoid crashes due to loss of power or network connectivity, try running nf-core/eager in a
screen
ortmux
session:screen -R eager
Multiple ways to supply input data
In this tutorial, a tsv file to specify our input data files and formats. This is a powerful approach that allows nf-core eager to intelligently apply analyses to certain files only (e.g. merging for paired-end but not single-end libraries). However inputs can also be specified using wildcards, which can be useful for fast analyses with simple input data types (e.g. same sequencing configuration, file location, etc.).
See the online nf-core/eager documentation for more details.
nextflow run nf-core/eager -r 2.4.7 -profile docker --fasta ../reference/GCF_001293415.1_ASM129341v1_genomic.fna \
"data/*_{1,2}.fastq.gz" <...> \
--input --udg_type half
Get your MultiQC report via email
If you have GNU mail or sendmail set up on your system, you can add the following flag to send the MultiQC html to your email upon run completion:
--email "your_address@something.com"
Check out the EAGER GUI
For researchers who might be less comfortable with the command line, check out the nf-core/eager launch GUI! The GUI also provides a full list of all pipeline options with short explanations for those interested in learning more about what the pipeline can do.
When something fails, all is not lost!
When individual jobs fail, nf-core/eager will try to automatically resubmit that job with increased memory and CPUs (up to two times per job).
When the whole pipeline crashes, you can save time and computational resources by resubmitting with the
-resume
flag. nf-core/eager will retrieve cached results from previous steps as long as the input is the same.Monitor your pipeline in real time with the Nextflow Tower
Regular users may be interested in checking out the Nextflow Tower, a tool for monitoring the progress of Nextflow pipelines in real time. Check here for more information.
18.3.3 nf-core/eager output
18.3.3.1 Results files
Once completed, the results/
directory of your nf-core/eager run will contain a range of directories that will have output files and tool logs of all the steps of the pipeline. nf-core/eager tries to only save the ‘useful’ files.
Everyday useful files for ancient (microbial) (meta)genomics typically are in folders such as:
deduplication/
ormerged_bams
- These contain the most basic BAM files you would want to use for downstream analyses (and used in the rest of the genomic workflow of the pipeline)
- They contain deduplicated BAM files (i.e., with PCR artefacts removed)
merged_bams/
- This directory will contain BAMs where multiple libraries of one sample have been merged into one final BAM file, when these have been supplied
damage_rescaling/
ortrimmed bam/
- These contain the output BAM files from in silico damage removal, if you have turned on e.g. BAM trimming or damage rescaling
- If you have multiple libraries of one sample, the final BAMs you want will be in
merged_bams/
genotyping/
- This directory will contain your final VCF files from the genotyping step, e.g. from the GATK or freeBayes tools
MultiVCFAnalyzer/
- This will contain consensus sequences and SNP tables from
MultiVCFAnalyzer
, which also allows generation of ‘multi-allelic’ SNP position statistics (useful for the evaluation of cross-mapping from contaminants or relatives)
- This will contain consensus sequences and SNP tables from
bedtools/
- This directory will contain the depth and breadth statistics of genomic features if a
gff
orbed
file has been provided to the pipeline - The files can be used to generate gene heatmaps, that can be used to visualise a comparative presence/absence of virulence factor across genomes (e.g. for microbial pathogens)
- This directory will contain the depth and breadth statistics of genomic features if a
metagenomic_classification
ormalt_extract
- This directory contains the output RMA6 files from MALT, the profiles and taxon count tables from Kraken2, or the aDNA authentication output statistics from maltExtract.
Most other folders contain either intermediate files that are only useful for technical evaluation in the case of problems, or statistics files that are otherwise summarised in the run report.
So, before you delve into these folders, it’s normally a good idea to do a ‘quality check’ of the pipeline run. You can do this using the interactive MultiQC pipeline run report.
18.3.3.2 MultiQC General Statistics
cd multiqc/
If you’re impatient, and your nf-core/eager run hasn’t finished yet, you can cancel the run with ctrl + c (possibly a couple of times), and you can open a premade file in the data directory under ancient-metagenomic-pipelines/multiqc_report.html
In here you should see a bunch of files, but you should open the multiqc_report.html
file in your browser. You can either do this via the commandline (e.g. for firefox firefox multiqc_report.html
) or navigate to the file using your file browser and double clicking on the HTML file.
Once opened you will see a table, and below it many figures and other tables (Figure 18.2). All of these statistics can help you evaluate the quality of your data, pipeline run, and also possibly some initial results!
Typically you will look at the General Statistics table to get a rough overview of the pipeline run. If you hover your cursor over the column headers, you can see which tool the column’s numbers came from, however generally the columns go in order of left to right, where the left most columns are from earlier in the pipeline run (e.g. removing adapters), to variant calling statistics (e.g. number of variants called). You can also configure which columns to display using the Configure columns
button. It’s important to note that in MultiQC tables, you may have duplicate rows from the same library or sample. This is due to MultiQC trying to squish in as many statistics from as many steps of the pipeline as possible (for example, statistics on each of a pair of FASTQ files, and then statistics on the single merged and mapped BAM file), so you should play around with the column configuration to help you visualise the best way to initially evaluate your data.
The bulk of the MultiQC report is made up of per-tool summary plots (e.g., barcharts, linecharts etc). Most of these will be interactive, i.e. you can hover over lines and bars to get more specific numbers of each plot. However the visualisations are aimed at helping you identify possible outliers that may represent failed samples or libraries.
Evaluating how good the data is and how well the run went will vary depending on the dataset and the options selected. However the nf-core/eager tutorials have a good overview of questions you can ask from your MultiQC report to see whether your data is good or not. We shamelessly copy these questions here (as the overlap authors of both the the nf-core/eager documentation and this text book is rather high).
Once completed, you can try going through the MultiQC report the command you executed above, and compare against the questions below. Keep in mind you have a sample N of two, so many of the questions in regards to identifying ‘outliers’ may be difficult.
The following questions are by no means comprehensive, but rather represent the ‘typical’ questions the nf-core/eager developers asked of their own data and reports. However they can act as a good framework for thinking critically about your own results.
18.3.3.2.1 General Stats Table
- Do I see the expected number of raw sequencing reads (summed across each set of FASTQ files per library) that was requested for sequencing?
- Does the percentage of trimmed reads look normal for aDNA, and do lengths after trimming look short as expected of aDNA?
- Does the Endogenous DNA (%) columns look reasonable (high enough to indicate you have received enough coverage for downstream, and/or do you lose an unusually high reads after filtering )
- Does ClusterFactor or ‘% Dups’ look high (e.g. >2 or >10% respectively - high values suggesting over-amplified or badly preserved samples i.e. low complexity; note that genome-enrichment libraries may by their nature look higher).
- Do you see an increased frequency of C>Ts on the 5’ end of molecules in the mapped reads?
- Do median read lengths look relatively low (normally <= 100 bp) indicating typically fragmented aDNA?
- Does the % coverage decrease relatively gradually at each depth coverage, and does not drop extremely drastically
- Does the Median coverage and percent >3x (or whatever you set) show sufficient coverage for reliable SNP calls and that a good proportion of the genome is covered indicating you have the right reference genome?
- Do you see a high proportion of % Hets, indicating many multi-allelic sites (and possibly presence of cross-mapping from other species, that may lead to false positive or less confident SNP calls)?
18.3.3.2.2 FastQC (pre-AdapterRemoval)
- Do I see any very early drop off of sequence quality scores suggesting problematic sequencing run?
- Do I see outlier GC content distributions?
- Do I see high sequence duplication levels?
18.3.3.2.3 AdapterRemoval
- Do I see high numbers of singletons or discarded read pairs?
18.3.3.2.4 FastQC (post-AdapterRemoval)
- Do I see improved sequence quality scores along the length of reads?
- Do I see reduced adapter content levels?
18.3.3.2.5 Samtools Flagstat (pre/post Filter)
- Do I see outliers, e.g. with unusually low levels of mapped reads, (indicative of badly preserved samples) that require downstream closer assessment?
18.3.3.2.6 DeDup/Picard MarkDuplicates
- Do I see large numbers of duplicates being removed, possibly indicating over-amplified or badly preserved samples?
18.3.3.2.7 MALT (metagenomics only)
- Do I have a reasonable level of mappability?
- Somewhere between 10-30% can be pretty normal for aDNA, whereas e.g. <1% requires careful manual assessment
- Do I have a reasonable taxonomic assignment success?
- You hope to have a large number of the mapped reads (from the mappability plot) that also have taxonomic assignment.
18.3.3.2.8 PreSeq (genomics only)
- Do I see a large drop off of a sample’s curve away from the theoretical complexity? If so, this may indicate it’s not worth performing deeper sequencing as you will get few unique reads (vs. duplicates that are not any more informative than the reads you’ve already sequenced)
18.3.3.2.9 DamageProfiler (genomics only)
- Do I see evidence of damage on the microbial DNA (i.e. a % C>T of more than ~5% in the first few nucleotide positions?) ? If not, possibly your mapped reads are deriving from modern contamination.
18.3.3.2.10 QualiMap (genomics only)
- Do you see a peak of coverage (X) at a good level, e.g. >= 3x, indicating sufficient coverage for reliable SNP calls?
18.3.3.2.11 MultiVCFAnalyzer (genomics only)
- Do I have a good number of called SNPs that suggest the samples have genomes with sufficient nucleotide diversity to inform phylogenetic analysis?
- Do you have a large number of discarded SNP calls?
- Are the % Hets very high indicating possible cross-mapping from off-target organisms that may confounding variant calling?
As above evaluating these outputs will vary depending on the data and or pipeline settings, and very much. However the extensive output documentation of nf-core/eager can guide you through every single table and plot to assist you in continuing any type of ancient DNA project, assisted by fun little cartoony schematic diagrams (Figure 18.3)!
18.3.4 Clean up
Before continuing onto the next section of this chapter, you will need to deactivate from the conda environment
conda deactivate
If the nf-core/eager run is still not finished, you should enter the screen session with screen -r eager
, press ctrl + c until you drop to the prompt, and type exit
.
Then may also need to delete the contents of the eager directory if you are low on harddrive space.
cd /<path>/<to>/ancient-metagenomic-pipelines/eager
rm -r *
18.4 What is aMeta?
For this chapter’s exercises, if not already performed, you will need to create the conda environment from the following yml
file, and activate the environment:
conda activate aMeta
While nf-core/eager is a solid pipeline for microbial genomics, and can also perform metagenomic screening via the integrated HOPS pipeline or Kraken2, in some cases you may wish to have a more accurate and resource efficient pipeline In this section, we will demonstrate an example of using aMeta, a snakemake workflow proposed by Pochon et al. (2022) that aims to minimise resource usage by combining both low-resource requiring k-mer based taxonomic profiling as well as accurate read-alignment (Figure 18.5).
Rather than the very computationally heavy HOPS pipeline (Hübler et al. 2019), that requires extremely large computational nodes with large RAM (>1 TB) to load MALT databases into memory, aMeta does this via a two step approach. Firstly it uses KrakenUniq (a k-mer based and thus memory efficient method) to do a screening of sequencing reads against a broad generalised microbial database. Once all the possible taxa have been detected, aMeta will then make a new database of just the genomes of the taxa that were reported from KrakenUniq (i.e. a specific database) but using MALT. MALT on thus much reduced database is then used to perform computationally much heavier alignment against the reference genomes and LCA taxonomic reassignment. The output from MALT is then sent to the MaltExtract program of the HOPS pipeline for ancient DNA authentication statistics.
18.4.1 Running aMeta
In this tutorial we will try running the small test data that comes with aMeta.
aMeta has been written in Snakemake, which means running the pipeline has to be installed in a slightly different manner to the nextflow pull
command that can be used for nf-core/eager.
cd /<path>/<to>/ancient-metagenomic-pipelines/aMeta
As aMeta also includes tools that normally require very large computational resources that cannot fit on a standard laptop, we will instead try to re-use the internal very small ‘fake’ data the aMeta developers use to test the pipeline.
You don’t have to worry about trying to understand what the following commands are doing, they will not be important for the rest of the chapter. However generally the commands try to pull all the relevant software (via conda), make a fake database and download other required files, and then reconstruct the basic directory and file structure required for running the pipeline.
The next steps, particularly the set up conda envs
will take a very long time!
If you are impatient, you can speed this process up by using mamba
rather than conda.
## Download installation script and run
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh
## When asked: Accept the license
## When asked: Press 'y' to run conda init
## Turn off base
conda config --set auto_activate_base false
## Re-add original conda environments to be recognised by mamba
conda config --append envs_dirs /home/ubuntu/bin/miniconda3/envs/
## Change into ~/.test to set up all the required test resources (Databases etc.)
cd ~/.test
## Set up conda envs
## If you can an error about a 'non-default solver backend' the run `conda config --set solver classic` and re-start the command
## If you have installed the standalone mamba, change --conda-frontend to mamba
snakemake --use-conda --show-failed-logs -j 2 --conda-cleanup-pkgs cache --conda-create-envs-only -s ../workflow/Snakefile --conda-frontend conda
source $(dirname $(dirname $CONDA_EXE))/etc/profile.d/conda.sh
## If you had to change the solver back to classic, you can switch to libmamba with `conda config --set solver libmamba`
## Build dummy KrakenUniq database
env=$(grep krakenuniq .snakemake/conda/*yaml | awk '{print $1}' | sed -e "s/.yaml://g")
conda activate $env
krakenuniq-build --db resources/KrakenUniq_DB --kmer-len 21 --minimizer-len 11 --jellyfish-bin $(pwd)/$env/bin/jellyfish
conda deactivate
## Get Krona taxonomy tax dump
env=$(grep krona .snakemake/conda/*yaml | awk '{print $1}' | sed -e "s/.yaml://g" | head -1)
conda activate $env
cd $env/opt/krona
./updateTaxonomy.sh taxonomy
cd -
conda deactivate
## Adjust malt max memory usage
env=$(grep hops .snakemake/conda/*yaml | awk '{print $1}' | sed -e "s/.yaml://g" | head -1)
conda activate $env
version=$(conda list malt --json | grep version | sed -e "s/\"//g" | awk '{print $2}')
cd $env/opt/malt-$version
sed -i -e "s/-Xmx64G/-Xmx3G/" malt-build.vmoptions
sed -i -e "s/-Xmx64G/-Xmx3G/" malt-run.vmoptions
cd -
conda deactivate
touch .initdb
## Run a quick test
snakemake --use-conda --show-failed-logs --conda-cleanup-pkgs cache -s ../workflow/Snakefile $@ --conda-frontend conda
snakemake -s ../workflow/Snakefile --report --report-stylesheet ../workflow/report/custom.css --conda-frontend conda
## Now we move back into the main repository where we can symlink all the database files back to try running our 'own' test
cd ../
cd resources/
ln -s ../.test/resources/* .
cd ../config
mv config.yaml config.yaml.bkp
mv samples.tsv samplest.tsv.bkp
cd ../
ln -s .test/data/ .
## Again get hte taxonomy tax dump for Krona, but this time for a real run
env=$(grep krona .snakemake/conda/*yaml | awk '{print $1}' | sed -e "s/.yaml://g" | head -1)
conda activate $env
cd $env/opt/krona
./updateTaxonomy.sh taxonomy
cd -
conda deactivate
OK now aMeta is all set up, we can now simulate running a ‘real’ pipeline job!
18.4.2 aMeta configuration
First, we will need to configure the workflow. First, we need to create a tab-delimited samples.tsv
file inside aMeta/config/
and provide the names of the input fastq-files. In a text editor (e.g. nano
), write the following names paths in TSV format.
sample fastq
foo data/bar.fq.gz
bar data/foo.fq.gz
Make sure when copy pasting into your test editor, tabs are not replaced with spaces, otherwise the file might not be read!
Then we need to write a config file. This tells aMeta where to find things such as database files and other settings.
These paths and settings go inside a config.yaml
file inside aMeta/config/
. A minimal example config.yaml
files can look like this.
samplesheet: "config/samples.tsv"
krakenuniq_db: resources/KrakenUniq_DB
bowtie2_db: resources/ref.fa
bowtie2_seqid2taxid_db: resources/seqid2taxid.pathogen.map
pathogenomesFound: resources/pathogenomesFound.tab
malt_nt_fasta: resources/ref.fa
malt_seqid2taxid_db: resources/KrakenUniq_DB/seqid2taxid.map
malt_accession2taxid: resources/accession2taxid.map
ncbi_db: resources/ncbi
n_unique_kmers: 1000
n_tax_reads: 200
As this is only a dummy run (due to the large-ish computational resources required for KrakenUniq), we re-use some of the resource files here. While this will produce nonsense output, it is used here to demonstrate how you would execute the pipeline.
18.4.3 Prepare and run aMeta
Make sure you’re still in the aMeta
conda environment, and in the main aMeta directory with
cd /<path/<to>/ancient-metagenomic-pipelines/ameta/aMeta/
And, finally, we are ready to run aMeta!
## change conda-frontend to mamba if you set this up earlier!
snakemake --snakefile workflow/Snakefile --use-conda -j 10 --conda-frontend conda
You can modify -j
to represent the number of available CPUs you have on your machine.
If you are using conda
is the frontend this could be quite slow!
Firstly you’ll have displayed on the console a bunch of messages about JSON schemas, building DAG of jobs and downloading installing of conda environments. You will know the pipeline has started when you get green messages saying things like ‘Checkpoint’ and ‘Finished job XX’.
You’ll know the job is completed once you get the following messages
Finished job 0.
31 of 31 steps (100%) done Complete log: .snakemake/log/2023-10-05T155051.524987.snakemake.log
18.4.4 aMeta output
All output files of the workflow are located in aMeta/results
directory. To get a quick overview of ancient microbes present in your samples you should check a heatmap in results/overview_heatmap_scores.pdf
.
The heatmap demonstrates microbial species (in rows) authenticated for each sample (in columns) (Figure 18.5). The colors and the numbers in the heatmap represent authentications scores, i.e. numeric quantification of seven quality metrics that provide information about microbial presence and ancient status. The authentication scores can vary from 0 to 10, the higher is the score the more likely that a microbe is present in a sample and is ancient.
Typically, scores from 8 to 10 (red color in the heatmap) provide good confidence of ancient microbial presence in a sample. Scores from 5 to 7 (yellow and orange colors in the heatmap) can imply that either: a) a microbe is present but not ancient, i.e. modern contaminant, or b) a microbe is ancient (the reads are damaged) but was perhaps aligned to a wrong reference, i.e. it is not the microbe you think about. The former is a more common case scenario. The latter often happens when an ancient microbe is correctly detected on a genus level but we are not confident about the exact species, and might be aligning the damaged reads to a non-optimal reference which leads to a lot of mismatches or poor evennes of coverage. Scores from 0 to 4 (blue color in the heatmap) typically mean that we have very little statistical evidence (very few reads) to claim presence of a microbe in a sample.
To visually examine the seven quality metrics
- deamination profile,
- evenness of coverage,
- edit distance (amount of mismatches) for all reads,
- edit distance (amount of mismatches) for damaged reads,
- read length distribution,
- PMD scores distribution,
- number of assigned reads (depth of coverage),
Corresponding to the numbers and colors of the heatmap, one can find them in results/AUTHENTICATION/sampleID/taxID/authentic_<Sample>_<sampleID>.trimmed.rma6_<TaxID>_<taxID>.pdf
for each sample sampleID and each authenticated microbe taxID. An example of such quality metrics is shown below in Figure 18.6.
18.4.5 Clean up
Before continuing onto the next section of this chapter, you will need to deactivate from the conda environment
conda deactivate
18.5 What is nf-core/mag?
nf-core/mag (Krakau et al. 2022) is another Nextflow best-practise pipeline for the de novo assembly, binning, and annotation of metagenomes (Figure 18.7). While it was originally designed for modern metagenomes (with a focus on optional support co-assembly with Nanopore long reads), it has since been updated to include specific functionality and parameters for ancient metagenomes too.
nf-core/mag covers the same major steps as you will have run if you followed the chapter on De Novo assembly. Firstly, as with nf-core/eager, nf-core/mag do some initial sequencing QC and cleanup with fastp
or AdapterRemoval
, running FastQC before and after this step to make sure that adapters and overly short reads have been removed. It can then optionally do basic taxonomic classification of raw reads with Kraken2
or Centrifuge
. The processed reads then ungo assembly with MEGAHIT, SPAdes, or SPAdeshybrid (the latter being for combining short and long reads), with contig annotation with Prodigal
to get gene prediction and assembly statistics with QUAST
. A unique feature of nf-core/mag over other de novo assembly pipelines is an aDNA validation and correction step: contigs under go ‘reference free’ damage profiling using pyDamage
, and then remaining damage bases that are mistakenly incorporated by assemblies are corrected using freebayes
and BCFTools
. Contigs then optionally grouped into genomic ‘bins’ with a range of metagenomic binners (MetaBAT2
, MaxBin2
, and CONCOCT
), as well as also optional binning refinement with DAS Tool
. Bins are then re-annotated with Prokka, taxonomically classified with either CAT
or GTDB-Tk
Evaluation of the resulting bins is carried out with BUSCO, CheckM, GUNC and/or QUAST to assess the bin completeness. And, as with all nf-core pipelines, all the results are wrapped up into a MultiQC
report. Detailed documnetation can be see on the nf-core/mag parameters and output pages.
18.5.1 Running nf-core/mag
First re-activate the chapter’s conda environment.
conda activate ancient-metagenomic-pipelines
In this tutorial, we will use the same data and aim to replicate the steps taken in the De Novo assembly chapter. It is important to note that generally, de novo assembly, is very computational intensive. In many cases it will not be able to run a de novo assembly on a standard laptop, therefore some of the parameters used here have been tweaked to get the pipeline runable on powerful laptop level machines.
This tutorial will re-use the data from the de novo assembly chapter. If you have not-run that practical, or deleted the data, pleas follow the instructions at the top of the de novo assembly page to download the denovo-assembly.tar.gz
file and unpack it into the denovo-assembly directory
.
Make sure once you’ve finished, change back into this chapter’s directory with
cd /<path>/<to>/ancient-metagenomic-pipelines
The manual steps of the previous chapter included:
- Read clean up with
fastp
, - Assembly with
MEGAHIT
, - Ancient DNA assessment and correction with
freebayes
andpyDamage
- Binning using
MetaBAT2
andMaxBin2
(originally using theMetaWRAP
pipeline), - Bin assessment with
CheckM
- Contig taxonomic classification with
MMSeqs2
via theGTDB
database - Genome annotation with
prokka
So, what command would we use to recreate these?
## 🛑 Don't run this yet! 🛑
nextflow run nf-core/mag -r 2.3.2 \
\
-profile test,docker '../../denovo-assembly/seqdata_files/*{R1,R2}.fastq.gz' --outdir ./results \
--input --binning_map_mode own \
--ancient_dna --checkm_db data/checkm_data_2015_01_16/ \
--binqc_tool checkm --kraken2_db false --skip_krona --skip_spades --skip_spadeshybrid --skip_maxbin2 --skip_concoct --skip_prodigal --gtdb false --busco_reference false --centrifuge_db false
In fact, nf-core/mag will do most of the steps for you! In this case, we mostly just need to deactivate most of the additional steps (the last line). Otherwise, as with eager we have done the following:
Tell nextflow to run the nf-core/mag pipeline with version 2.3.2
Specify which computing and software environment to use with
-profile
- In this case we are running locally so we don’t specify a computing environment (such as a configuration for an institutional HPC)
- We use
docker
as our container engine, which downloads all the software and specific versions needed for nf-core/mag in immutable ‘containers’, to ensure nothing get broken and is as nf-core/mag expects
Provide the various paths to the input files - the raw FASTQ files and the output directory
Specify we want to run the aDNA mode with mapping of reads back to the original contigs rather than for co-assembly
Binning mapping mode with ancient DNAThe aDNA mode of nf-core/mag only supports mapping reads back to the original contigs. The other mapping modes are when doing co-assembly, where you assume there are similar organisms across all your metagenomes, and wish to map reads from all samples in a group to a single sample’s contig (for example). Doing this on aDNA reads would risk making false positive damage patterns, as the damaged reads may derive from reads from a different sample with damage, that are otherwise not present in the sample used for generating the given contig.
Specify we want to generate completeness statistics with
CheckM
(rather thanBUSCO
), with the associated pre-downloaded databasenf-core pipelines and reference filesMost nf-core pipelines will actually download reference databases, build reference indices for alignment, and in some cases reference genomes for you if you do not specify them as pipeline input. These download and reference building steps are often very time consuming, so it’s recommended that once you’ve let the database download it once, you should move the files somewhere safe and you can re-use in subsequent pipeline runs.
Skip a few steps that are either ‘nice to have’ (e.g. read taxonomic classification), or require large computational resources (e.g.
metaSPAdes
,CONCOCT
orBUSCO
)Information on skipped testsThe steps we are skipping are: host/arefact removal (
Bowtie2
removal of phiX), taxonomic classification of reads (centrifuge
,kraken2
,krona
), the additional assemblers (metaSPAdes
andSPAdeshybrid
, as these require very large computational resources), additional binners (MaxBin2
andCONCOCT
, while they are used in the de novo assembly chapter, they take a long time to run), ship raw contig annotation (withProdigal
, as we do this at the bin level), and contig taxonomic classification (withGTDB
andBUSCO
as they require very large databases).
With this we are almost ready for running the pipeline!
18.5.2 Configuring Nextflow pipelines
Before we execute the command however, we need to again consider about the resource requirements. As described above, particularly de novo assembly (but also many other bioinformatic analyses) require a lot of computational resources, depending on the type of data and analyses you wish to run.
For most pipelines you must tweak the resource requirements to ensure a) they will fit on your cluster, and b) will run optimally so you don’t under- or over- use your computational resources, where latter will either make yourself (takes too long), or your cluser/server system administrators (blocks other users) unhappy!
While nf-core pipelines all have ‘reasonable default’ settings for memory, CPU, and time limits, they will not work in all contexts. Here we will give a brief example of how to tweak the parameters of the pipeline steps so that they can run on our compute node or laptop.
For Nextflow, we must make a special ‘config’ that defines the names of each step of the pipeline you wish to tweak and the corresponding resources.
For this our nf-core/mag run you will need to open your text editor an empty file called custom.config
, and copy and paste the contents of the next code block.
{
process
: FASTP {
withName= 8
cpus }
: BOWTIE2_PHIX_REMOVAL_ALIGN {
withName= 8
cpus }
: BOWTIE2_ASSEMBLY_ALIGN {
withName= 8
cpus }
: PYDAMAGE_ANALYZE {
withName= 8
cpus }
: PROKKA {
withName= 4
cpus }
: MEGAHIT {
withName= 8
cpus }
: CHECKM_LINEAGEWF{
withName= 24.GB
memory = 8
cpus .args = "--reduced_tree"
ext}
}
Here we set the number of CPUs for most tools to a maximum of 8
, and then we limit the amount of memory for CheckM to 24 GB (down from the default of 36 GB - which exceeds most laptop memory). We also give a specifc CheckM
flag command to use a smaller database.
We can save this custom.config
file, and supply this to the Nextflow command with the nextflow parameter (-c
). The good thing about the config file, is you can re-use across runs on the same machine! So set once, and never think about it again.
And with that, you can run your nf-core/mag pipeline!
As recommended in the nf-core/eager section above, start a screen session.
screen -R mag
Then execute the command.
nextflow run nf-core/mag -r 2.3.2 \
\
-profile test,docker '../../denovo-assembly/seqdata_files/*{R1,R2}.fastq.gz' --outdir ./results \
--input --binning_map_mode own \
--ancient_dna --checkm_db data/checkm_data_2015_01_16/ \
--binqc_tool checkm --kraken2_db false --skip_krona --skip_spades --skip_spadeshybrid --skip_maxbin2 --skip_concoct --skip_prodigal --gtdb false --busco_reference false \
--centrifuge_db false -c custom.config
Again as with nf-coire/eager when the run starts, you see a long list of process
lines with progress bars slowly appearing over time. If in a screen session, you can press ctrl + a and then [ to access a ‘navigation’ (called ‘copy’) mode, then use your arrow keys on your keyboard to scroll up and down. To quit this mode just press q
.
The pipeline run above should take around 20 minutes or so to complete.
Configuration files don’t need be personal nor custom!
You can also generate a HPC / server specific profile which can be used on all nf-core pipelines (and even your own!).
See nf-co.re/configs for all existing institutional configs
18.5.3 nf-core/mag output
As with nf-core/eager, you will have a results directory containing many folders representing different stages of the pipeline, and a MultiQC report.
18.5.3.1 Results Files
Relevant directories for evaluting your assemblies are as follows:
QC_shortreads/
- Contains various folders from the sequencing QC and raw read processing (e.g., FastQC, fastp/AdapterRemoval, host-removal results etc)
Taxonomy/
- If you have not already performed taxonomic classification of your reads, the output of the Kraken2 or Centrifuge integrated steps of nf-core/mag will be deposited here, including an optional Krona piechart
- If activated, the contig-level taxonomic assignments will be deposited here (e.g. from CAT or GTDB)
Assembly/
- This directory will contain one directory per chosen assembler which will contain per-sample output files for each assembly.
- Within each of the assembler sub directories, a QC folder will contain the QUAST output
Ancient_DNA
- This folder will contain the output of pyDamge contig authentication to allow you to select or filter for just putatively ancient contigs
variant_calling
- This directory will contain the ‘damage-corrected’ contigs that will be used in downstream steps of the pipeline, if the
aDNA
mode is turned on
- This directory will contain the ‘damage-corrected’ contigs that will be used in downstream steps of the pipeline, if the
GenomeBinning/
- This directory will contain all the output from each of any of the selected contig binning tools. As well as the binned FASTAs, you also have read-depth statistics from when input reads are re-mapped back to the contigs
- The directory will also contain the output from the
DASTool
binning refinement if turned on - In the
QC
sub-directory of this folder, you will find the QUAST results on the genome bins (rather than raw contigs as above), as well as BUSCO/CheckM/GUNC MAG completeness estimates
Prodigal
/Prokka
- These directories will contain the genome annotation output of the raw contigs (
Prodigal
) or from bins (Prokka
)
- These directories will contain the genome annotation output of the raw contigs (
18.5.3.2 Report Table
As of v2.3.* the MultiQC report is unfortunately broken, however this has been fixed for 2.4, and many of the questions asked in the preprocessing section of the nf-core/eager results interpretation will also apply here.
Other than the QUAST results in the Assembly/*/QC/
or GenomeBinning/*/QC/
directories, the main table used for evaluating the output files is the bin_summary.tsv
table in the GenomeBinning/
directory.
In this file you woud typically want to assess the following columns:
% Complete *
columns - with the higher the number of expected domain-specific genes, normally representing a better quality of the resulting bin.# contigs (>=* bp)
columns - which represnts the number of contigs at different lengths, the fewer the shorter reads and the greater the longer reads you have again suggests a better assembly.- This can also be evaluated by the N50 or L75 columns (as described in the De novo assembly chapter).
fastani
andclosest_placement_taxonomy
- these can tell you if your particular bin has a genome very similar to existing species in taxonomic databaseswarnings
- for specific comments on each assignment
18.6 (Optional) clean-up
Let’s clean up your working directory by removing all the data and output from this chapter.
The command below will remove the /<PATH>/<TO>/ancient-metagenomic-pipelines
directory as well as all of its contents.
Always be VERY careful when using rm -r
. Check 3x that the path you are specifying is exactly what you want to delete and nothing more before pressing ENTER!
rm -rf /<PATH>/<TO>/ancient-metagenomic-pipelines*
Once deleted you can move elsewhere (e.g. cd ~
).
We can also get out of the conda
environment with
conda deactivate
To delete the conda environment
conda remove --name ancient-metagenomic-data --all -y
18.7 Questions to think about
- Why is it important to use a pipeline for genomic analysis of ancient data?
- How can the design of pipelines such as nf-core/eager pipeline help researchers comply with the FAIR principles for management of scientific data?
- What metrics do you use to evaluate the success/failure of ancient DNA sequencing experiments? How can these measures be evaluated when using nf-core/eager for data preprocessing and analysis?