18  Ancient Metagenomic Pipelines

Authors

James A. Fellows Yates, Megan Michel, Nikolay Oskolkov

Tip

For this chapter’s exercises, if not already performed, you will need to download the chapter’s dataset, decompress the archive, and create and activate the conda environment.

To do this, use wget or right-click-and-save to download this Zenodo archive: 10.5281/zenodo.8413239, and unpack it as shown below.
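If downloading on the command line, a minimal wget sketch would be the following (the URL assumes Zenodo's standard records file pattern for this DOI):

wget https://zenodo.org/records/8413239/files/ancient-metagenomic-pipelines.tar.gz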

tar xvf ancient-metagenomic-pipelines.tar.gz 
cd ancient-metagenomic-pipelines/

You can then create and subsequently activate the environment with

conda env create -f ancient-metagenomic-pipelines.yml
conda activate ancient-metagenomic-pipelines
Warning

There are additional software requirements for this chapter

Please see the relevant chapter section in Before you start before continuing with this chapter.

18.1 Introduction

A pipeline is a series of linked computational steps, where the output of one process becomes the input of the next. Pipelines are critical for managing the huge quantities of data that are now being generated regularly as part of ancient DNA analyses. In this chapter we will go through three dedicated ancient DNA pipelines - all with some (or all!) functionality geared towards ancient metagenomics - to show you how you can speed up the more routine aspects of the basic analyses we’ve learnt about earlier in this textbook through workflow automation.

We will introduce:

  • nf-core/eager
  • aMeta
  • nf-core/mag

Keep in mind that there are many other pipelines that exist, and picking one often comes down to personal preference and practical considerations, such as which functionality they support, which language they are written in, and whether their computational requirements fit within your available resources.

Other examples of ancient DNA genomic pipelines include Paleomix (Schubert et al. 2014) and Mapache (Neuenschwander et al. 2023), and, for ancient metagenomics, metaBIT (Louvel et al. 2016) and HAYSTAC (Dimopoulos et al. 2022).

18.2 Workflow managers

All the pipelines introduced in this chapter utilise workflow managers. These are software tools that allow users to ‘chain’ together the inputs and outputs of distinct ‘atomic’ steps of a bioinformatics analysis - e.g. the separate terminal commands of different bioinformatic tools - so that ‘you don’t have to’. You have already seen very basic workflows or ‘pipelines’ when using the bash ‘pipe’ (|) in the Introduction to the Command Line chapter, where each row of text output by one command was ‘streamed’ into the next command.

However, in the case of bioinformatics, we are often dealing with non-row-based text files, meaning that ‘classic’ command-line pipelines don’t really work. This is where bioinformatic workflow managers come in: they handle the passing of files from one tool to the next, but in a reproducible manner.
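As a reminder, a classic row-based pipe looks something like this (file name hypothetical), with each line of text streaming from one command into the next:

## Count the unique read headers in a (hypothetical) FASTQ file, line by line
zcat reads.fastq.gz | awk 'NR % 4 == 1' | sort -u | wc -l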

Modern computational bioinformatic workflow managers focus on a few main concepts. To summarise Wratten, Wilm, and Göke (2021), these areas are: data provenance, portability, scalability, and re-entrancy - all of which together contribute to ensuring the reproducibility of bioinformatic analyses. Data provenance refers to the ability to track and visualise where each file goes and gets processed, as well as metadata about each file and process (e.g., what version of a tool was used? what parameters were used in that step? how much computing resource was used?). Portability follows from data provenance: not only should the entire execution of the pipeline be reconstructable, but it should also be possible to run it with the same results on a different machine. This is important to ensure that you can install and test the pipeline on your laptop, but that when you then need to do heavy computation on real data, it will still execute on a high-performance computing cluster (HPC) or in the cloud - both of which have very different configurations. This is normally achieved through the use of reusable software environments such as conda or container engines such as Docker, and tight integration with HPC schedulers such as SLURM. As mentioned earlier, not having to run each command manually can be a great speed-up to your analysis. However, the workflow also needs to be scalable, i.e., it should remain efficient regardless of whether you’re running one or ten thousand samples - modern workflow managers perform resource-requirement optimisation and scheduling to ensure that all steps of the pipeline are executed in the most resource-efficient manner, so the pipeline completes as fast as possible regardless of the amount of input data. Finally, as workflows get bigger and longer, re-entrancy has become more important, i.e., the ability to re-start a pipeline run that got stuck halfway through due to an error.

All workflow managers have different ways of implementing the concepts above, and these range from very simple (e.g. Makefiles) to very powerful and abstract (e.g. the Workflow Description Language). In this chapter we will use pipelines that use two popular workflow managers in bioinformatics, Nextflow and Snakemake.

This chapter will not cover how to write your own workflow, as this would require a whole other textbook. However, it is recommended to learn and use workflow managers when carrying out repetitive or routine bioinformatic analyses (Nextflow and Snakemake being two popular ones in bioinformatics). Use of workflow managers can help make your work more efficient (as you only run one command, rather than each step separately), but also more reproducible, by reducing the risk of user error when executing each step: the computer will do exactly what you tell it, and if you don’t change anything, it will do the exact same thing every time. If you’re interested in writing your own workflows using workflow managers, many training courses and tutorials exist on the internet (e.g., for Nextflow there is the official training or the Software Carpentries lesson, and for Snakemake the official tutorial).

18.3 What is nf-core/eager?

nf-core/eager is a computational pipeline specifically designed for the preprocessing and analysis of ancient DNA data (Figure 18.1). It is a reimplementation of the previously published EAGER (Efficient Ancient Genome Reconstruction) pipeline (Peltzer et al. 2016), written in the workflow manager Nextflow. In addition to reimplementing the original genome mapping and variant calling pipeline in a more reproducible and portable manner, the pipeline also includes additional new functionality, particularly for researchers interested in the microbial sciences: namely a dedicated genotyper and consensus caller designed for low-coverage genomes, the ability to get breadth and depth of coverage statistics for particular genomic features (e.g. virulence genes), and automated metagenomic screening and authentication of the off-target reads from mapping (e.g. against the host reference genome).

Figure 18.1: nf-core/eager workflow summary in the form of a ‘metro map’ style diagram. FASTQ or BAM input for red and yellow lines (representing eukaryotic and prokaryotic workflows) goes through FastQC, Adapter Removal, and alignment to a reference FASTA input with a range of aligners, before going through BAM filtering, deduplication, in silico damage removal, and variant calling. Multiple statistics steps come out of the deduplication ‘station’. The blue line represents the metagenomic workflow, where off-target reads come out of the BAM filtering ‘station’ and go through complexity filtering with BBDuk, MALT or Kraken2, and optionally into MaltExtract for MALT output.

A detailed description of steps in the pipeline is available as part of nf-core/eager’s extensive documentation. For more information, check out the usage documentation here.

Briefly, nf-core/eager takes at a minimum standard input file types that are shared across the genomics field, i.e., raw FASTQ files or aligned reads in BAM format, and a reference FASTA. nf-core/eager performs preprocessing of this raw data, including adapter clipping, read merging, and quality control of the adapter-trimmed data. nf-core/eager then carries out mapping using a variety of field-standard short-read alignment tools, with default parameters adapted for short and damaged aDNA sequences. The off-target reads from host DNA mapping can then go into metagenomic classification and authentication (in the case of MALT). After genomic mapping, BAM files go through deduplication of PCR duplicates, damage profiling and removal, and finally variant calling. A myriad of additional statistics can be generated depending on the user’s preference. Finally, nf-core/eager uses MultiQC to create an integrated HTML report that summarizes the output/results from each of the pipeline steps.

18.3.1 Running nf-core/eager

For the practical portion of this chapter, we will utilize sequencing data from four aDNA libraries, which you should have already downloaded from NCBI. If not, please see the Preparation section above. We will use nf-core/eager to perform a typical microbial genomic analysis, i.e., reconstruction of an ancient genome to generate variant calls that can be used for building phylogenomic trees and other evolutionary analyses, and gene-feature coverage statistics to allow insight into the functional aspects of the genome.

These four libraries come from two ancient individuals, GLZ002 and KZL002. GLZ002 comes from the Neolithic Siberian site of Glazkovskoe predmestie (Yu et al. 2020) and was radiocarbon dated to 3081-2913 calBCE. KZL002 is an Iron Age individual from Kazakhstan, radiocarbon dated to 2736-2457 calBCE (Andrades Valtueña et al. 2022). Both individuals were originally analysed for human population genetic analysis, but when the off-target reads underwent metagenomic screening, both sets of authors identified reads from Yersinia pestis - the bacterium that causes plague - in these individuals. Subsequently, the libraries from these individuals were processed using hybridization capture to increase the number of Y. pestis sequences available for analysis.

Our aims in the following tutorial are to:

  1. Preprocess the FASTQ files by trimming adapters and merging paired-end reads
  2. Align reads to the Y. pestis reference and compute the endogenous DNA percentage
  3. Filter the aligned reads to remove host DNA
  4. Remove duplicate reads for accurate coverage estimation and genotyping
  5. Generate statistics on gene features (e.g. virulence factors)
  6. Merge data by sample and perform genotyping on the combined dataset
  7. Review quality control data to evaluate the success of the previous steps

Let’s get started!

Warning

nf-core/eager v2 requires an older version of Nextflow - if installing manually, ensure your Nextflow version is no greater than v22.10.6!
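To check which Nextflow version you currently have, and to pin a compatible release for a single invocation via Nextflow's built-in NXF_VER environment variable, something like the following should work:

nextflow -version

## Ask the Nextflow launcher to use a specific (older) release for this invocation only
NXF_VER=22.10.6 nextflow run nf-core/eager -r 2.4.7 --help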

First, download the latest version of the nf-core/eager repo (or check for updates if you have a previously-installed version):

nextflow pull nf-core/eager

Next, we can re-visit AMDirT (see Accessing Ancient Metagenomic Data) to download a pre-prepared configuration file for nf-core/eager.

  1. Load AMDirT (the AMDirT viewer), and select the latest release and the ‘ancientsinglegenome-hostassociated’ table
  2. Filter the sample_name column to just show KZL002 and GLZ002, and select these two rows
    • Press the burger icon on the column
    • Press the filter tab and deselect everything
    • Search GLZ002 and select in filter menu
    • Search KZL002 and select in filter menu
    • Close filter menu and select the two rows
  3. Press the Validate Selection button
  4. Press the ‘Download Curl sample download script’, ‘Download nf-core/eager input TSV’, and ‘Download Citations as BibTex’ buttons
  5. Move the downloaded files into the eager/ directory of this tutorial
    • i.e., if the files were downloaded to your Downloads/ folder rather than the chapter’s directory
    • e.g., assuming you’ve got no other previous downloads, you can run mv ~/Downloads/AncientMetagenomeDir_* eager/
Tip

We won’t actually use the BibTex citations file for anything in this tutorial, but it is a good habit to always record and save the citations of any software or data you use!

To download the FASTQ files:

  1. Move into the eager/ directory
  2. Run bash AncientMetagenomeDir_curl_download_script.sh to download the files (this may take ~3 minutes)
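Concretely, from the chapter’s directory, that is:

cd eager/
bash AncientMetagenomeDir_curl_download_script.sh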

Next, we can inspect the input TSV file that AMDirT generated for nf-core/eager!

cat AncientMetagenomeDir_nf_core_eager_input_table.tsv
Sample_Name Library_ID  Lane    Colour_Chemistry    SeqType Organism    Strandedness    UDG_Treatment   R1  R2  BAM
GLZ002  ERR4093961  0   4   PE  Homo sapiens    double  half    ERR4093961_1.fastq.gz   ERR4093961_2.fastq.gz   NA
GLZ002  ERR4093962  0   4   PE  Homo sapiens    double  full    ERR4093962_1.fastq.gz   ERR4093962_2.fastq.gz   NA
KZL002  ERR8958768  0   4   PE  Homo sapiens    double  half    ERR8958768_1.fastq.gz   ERR8958768_2.fastq.gz   NA
KZL002  ERR8958769  0   4   PE  Homo sapiens    double  half    ERR8958769_1.fastq.gz   ERR8958769_2.fastq.gz   NA

Here we see 11 columns, all pre-filled. The first two columns correspond to sample/library IDs that will be used for data provenance and grouping. When you have sequenced a library across multiple lanes, you can speed things up by preprocessing the lanes independently before merging, and the Lane column lets you specify this (although it is not used in this case, as we have independent libraries per sample). The Colour_Chemistry column indicates whether your data requires additional pre-processing to remove poly-G tails, while the strandedness and UDG-treatment columns are used if you require further damage manipulation. Finally, you provide paths to the FASTQ files or BAM files.
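If you want to double-check the column layout yourself, a quick one-liner (assuming the file is tab-separated, as shown above) is:

## Print the number of tab-separated columns in the header line
head -n 1 AncientMetagenomeDir_nf_core_eager_input_table.tsv | awk -F'\t' '{print NF}'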

Other than the raw FASTQ files, we will need a reference genome and annotation coordinates of genes present on the genome. In this case you will use a Yersinia pestis (accession: GCF_001293415.1) reference genome (.fna) and gene feature file (.gff) from NCBI Genome.

These have already been placed in the reference/ directory for you.

To download the required reference genome and annotation file yourself, you can run the following commands, which retrieve them from the NCBI Genome database.

## Download from NCBI
curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_001293415.1/download?include_annotation_type=GENOME_FASTA,GENOME_GFF,RNA_FASTA,CDS_FASTA,PROT_FASTA,SEQUENCE_REPORT&filename=GCF_001293415.1.zip" -H "Accept: application/zip"

unzip *.zip
mv ncbi_dataset/data/GCF_001293415.1/* .

## We have to sort the gff file to make it eager compatible
gffread genomic.gff -o GCF_001293415.1_ASM129341v1_genomic.gff

With all of these, we can run the pipeline!

First, let’s enter a screen session so we can leave the pipeline running in the background and continue using our terminal.

screen -R eager
conda activate ancient-metagenomic-pipelines

Now we can construct an eager command from within the eager/ directory, so that it looks like this:

nextflow run nf-core/eager -r 2.4.7 \
-profile docker \
--fasta ../reference/GCF_001293415.1_ASM129341v1_genomic.fna \
--input AncientMetagenomeDir_nf_core_eager_input_table.tsv \
--anno_file ../reference/GCF_001293415.1_ASM129341v1_genomic.gff \
--run_bam_filtering --bam_unmapped_type discard \
--skip_preseq \
--run_genotyping --genotyping_tool ug --gatk_ug_out_mode EMIT_ALL_SITES \
--run_bcftools_stats --run_bedtools_coverage

When the run starts, you will see a long list of process lines with progress bars slowly appearing over time. You can press ctrl + a and then [ to access screen’s ‘navigation’ (called ‘copy’) mode, and then use the arrow keys on your keyboard to scroll up and down. To quit this mode, just press q.

Tip

We don’t normally recommend running analyses in the directory your data is in! It is better to keep data and analysis results in separate directories. Here we are just running eager alongside the data for convenience (i.e., you don’t have to modify the downloaded input TSV)

Warning

This run will take between 20 and 40 minutes on a standard laptop! Time for a break, or you can continue reading this chapter, as the critical output data has already been provided for you in the data directory.

So what is this command doing? The different parameters do the following:

  1. Tell nextflow to run the nf-core/eager pipeline with version 2.4.7
  2. Specify which computing and software environment to use with -profile
    • In this case we are running locally so we don’t specify a computing environment (such as a configuration for an institutional HPC)
    • We use docker as our container engine, which downloads all the software and specific versions needed for nf-core/eager in immutable ‘containers’, to ensure nothing get broken and is as nf-core/eager expects
  3. Provide the various paths to the input files (TSV with paths to FASTQ files, a reference fasta, and the reference fasta’s annotations)
  4. Activate the various steps of the pipeline you’re interested in
    • We turn off preseq (e.g. when you know you can’t sequence more)
    • We turn on BAM filtering, and specify what to do with unmapped reads via --bam_unmapped_type (here discard; setting it to fastq instead would let you check off-target reads, e.g. for other pathogens)
    • We turn on genotyping using GATK UnifiedGenotyper (preferred over HaplotypeCaller due to the incompatibility of that method with low-coverage data)
    • We turn on variant statistics (from the GATK output) using bcftools, and coverage statistics of gene features using bedtools

For full parameter documentation, click here.

And now we wait…

Tip

As a reminder, to detach from the screen session type ctrl + a then d. To reattach, run screen -r eager

Specifically for ancient (meta)genomic data, the following parameters and options may be useful to consider when running your own data:

  • --mapper circularmapper:
    • This aligner is a variant of bwa aln that allows even mapping across the ends of linear reference sequences of circular genomes
    • Allows reads spanning the start/end of the sequence to be correctly placed, and this provides better coverage estimates across the entire genome
  • --hostremoval_input_fastq:
    • Allows re-generation of input FASTQ files but with any reads aligned to the reference genome removed
    • Can be useful when dealing with modern or ancient data where there are ethical restrictions on the publication of host DNA
    • The output can be used as ‘raw data’ upload to public data repositories
  • --run_bam_filtering --bam_unmapped_type:
    • A pre-requisite for performing the metagenomic analysis steps of nf-core/eager
    • Generates FASTQ files off the unmapped reads present in the reference mapping BAM file
  • --run_bedtools_coverage --anno_file /<path>/<to>/<genefeature>.bed
    • Turns on calculating depth/breadth of annotations in the provided bed or GFF file, useful for generating e.g. virulence gene presence/absence plots
  • --run_genotyping --genotyping_tool ug --gatk_ug_out_mode EMIT_ALL_SITES
    • Turns on genotyping with GATK UnifiedGenotyper
    • A pre-requisite for running MultiVCFAnalyzer for consensus sequencing creation
    • It’s recommended to use either GATK UnifiedGenotyper or freeBayes (not compatible with MultiVCFAnalyzer!) for low-coverage data
    • GATK HaplotypeCaller is not recommended for low coverage data as it performs local de novo assembly around possible variant sites, and this will fail with low-coverage short-read data
  • --run_multivcfanalyzer --write_allele_frequencies
    • Turns on SNP table and FASTA consensus sequence generation with MultiVCFAnalyzer (see pre-requisites above)
    • By providing --write_allele_frequencies, the SNP table will also provide the percentage of reads at each position supporting the call. This can help you evaluate the level of cross-mapping from related (e.g. contaminating environmental) species, which may call into question the reliability of any resulting downstream analyses.
  • --metagenomic_complexity_filtering
    • An additional preprocessing step of raw reads before going into metagenomic screening to remove low-complexity reads (e.g. mono- or di-nucleotide repeats)
    • Removing these will speed up and lower computational requirements during classification, and will not bias profiles as these sequences provide no taxon-specific information (i.e., can be aligned against thousands to millions of genomes)
  • --run_metagenomic_screening
    • Turns on metagenomic screening with either MALT or Kraken2
    • If running with MALT, can supply --run_maltextract to get authentication statistics and plots (damage, read lengths etc.) for evaluation.
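As a hedged illustration only (do not run; the database paths are placeholders, and you should check the nf-core/eager parameter documentation for the exact options in your version), a command enabling the metagenomic screening branch might look like:

## Example only - do not run!
nextflow run nf-core/eager -r 2.4.7 \
-profile docker \
--input AncientMetagenomeDir_nf_core_eager_input_table.tsv \
--fasta ../reference/GCF_001293415.1_ASM129341v1_genomic.fna \
--run_bam_filtering --bam_unmapped_type fastq \
--run_metagenomic_screening --metagenomic_tool malt \
--database /<path>/<to>/<malt_database>/ \
--run_maltextract \
--maltextract_ncbifiles /<path>/<to>/<ncbi_resources>/ \
--maltextract_taxon_list /<path>/<to>/<taxon_list>.txt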

18.3.2 Top Tips for nf-core/eager success

  1. Screen sessions

    Depending on the size of your input data, infrastructure, and analyses, running nf-core/eager can take hours or even days (e.g. the example command above will likely take around 20 minutes!). To avoid crashes due to loss of power or network connectivity, try running nf-core/eager in a screen or tmux session:

    screen -R eager
  2. Multiple ways to supply input data

    In this tutorial, we used a TSV file to specify our input data files and formats. This is a powerful approach that allows nf-core/eager to intelligently apply analyses to certain files only (e.g. merging for paired-end but not single-end libraries). However, inputs can also be specified using wildcards, which can be useful for quick analyses with simple input data types (e.g. the same sequencing configuration, file location, etc.).

    See the online nf-core/eager documentation for more details.

Example command - do not run!
nextflow run nf-core/eager -r 2.4.7 -profile docker --fasta ../reference/GCF_001293415.1_ASM129341v1_genomic.fna \
--input "data/*_{1,2}.fastq.gz" <...> \
--udg_type half
  3. Get your MultiQC report via email

    If you have GNU mail or sendmail set up on your system, you can add the following flag to send the MultiQC html to your email upon run completion:

    --email "your_address@something.com"

  4. Check out the EAGER GUI

    For researchers who might be less comfortable with the command line, check out the nf-core/eager launch GUI! The GUI also provides a full list of all pipeline options with short explanations for those interested in learning more about what the pipeline can do.

  5. When something fails, all is not lost!

    When individual jobs fail, nf-core/eager will try to automatically resubmit that job with increased memory and CPUs (up to two times per job).

    When the whole pipeline crashes, you can save time and computational resources by resubmitting with the -resume flag. nf-core/eager will retrieve cached results from previous steps as long as the input is the same.

  6. Monitor your pipeline in real time with the Nextflow Tower

    Regular users may be interested in checking out the Nextflow Tower, a tool for monitoring the progress of Nextflow pipelines in real time. Check here for more information.

18.3.3 nf-core/eager output

18.3.3.1 Results files

Once completed, the results/ directory of your nf-core/eager run will contain a range of directories that will have output files and tool logs of all the steps of the pipeline. nf-core/eager tries to only save the ‘useful’ files.

Everyday useful files for ancient (microbial) (meta)genomics typically are in folders such as:

  • deduplication/ or merged_bams/
    • These contain the most basic BAM files you would want to use for downstream analyses (and used in the rest of the genomic workflow of the pipeline)
    • They contain deduplicated BAM files (i.e., with PCR artefacts removed)
    • merged_bams/
      • This directory will contain BAMs where multiple libraries of one sample have been merged into one final BAM file, when these have been supplied
  • damage_rescaling/ or trimmed_bam/
    • These contain the output BAM files from in silico damage removal, if you have turned on e.g. BAM trimming or damage rescaling
    • If you have multiple libraries of one sample, the final BAMs you want will be in merged_bams/
  • genotyping/
    • This directory will contain your final VCF files from the genotyping step, e.g. from the GATK or freeBayes tools
  • MultiVCFAnalyzer/
    • This will contain consensus sequences and SNP tables from MultiVCFAnalyzer, which also allows generation of ‘multi-allelic’ SNP position statistics (useful for the evaluation of cross-mapping from contaminants or relatives)
  • bedtools/
    • This directory will contain the depth and breadth statistics of genomic features if a gff or bed file has been provided to the pipeline
    • These files can be used to generate gene heatmaps to visualise the comparative presence/absence of virulence factors across genomes (e.g. for microbial pathogens)
  • metagenomic_classification or malt_extract
    • This directory contains the output RMA6 files from MALT, the profiles and taxon count tables from Kraken2, or the aDNA authentication output statistics from MaltExtract.

Most other folders contain either intermediate files that are only useful for technical evaluation in the case of problems, or statistics files that are otherwise summarised in the run report.

So, before you delve into these folders, it’s normally a good idea to do a ‘quality check’ of the pipeline run. You can do this using the interactive MultiQC pipeline run report.

18.3.3.2 MultiQC General Statistics

cd multiqc/
Note

If you’re impatient, and your nf-core/eager run hasn’t finished yet, you can cancel the run with ctrl + c (possibly a couple of times), and you can open a premade file in the data directory under ancient-metagenomic-pipelines/multiqc_report.html

In here you should see a bunch of files; the one to open in your browser is multiqc_report.html. You can either do this via the command line (e.g., for Firefox: firefox multiqc_report.html) or navigate to the file using your file browser and double-click on the HTML file.

Once opened you will see a table, and below it many figures and other tables (Figure 18.2). All of these statistics can help you evaluate the quality of your data, pipeline run, and also possibly some initial results!

Figure 18.2: Screenshot of initial view of a MultiQC report. The left hand sidebar provides links to various sections of the report, most of them containing summary images of the outputs of all the different tools. On the main panel you see the MultiQC and nf-core/eager logos, some descriptive information about how the report was generated, and finally the top of the ‘General Statistics’ tables, that has columns such as Sample Name, Nr. Input reads, length of input reads, %trimmed, etc. Each numeric column contains a coloured barchart to help readers to quickly evaluate the number of a given sample against all others in the table

Typically you will look at the General Statistics table to get a rough overview of the pipeline run. If you hover your cursor over the column headers, you can see which tool each column’s numbers came from. Generally, the columns are ordered left to right following the pipeline, from the earliest steps (e.g. removing adapters) through to variant-calling statistics (e.g. number of variants called). You can also configure which columns to display using the Configure columns button. It’s important to note that in MultiQC tables you may have duplicate rows for the same library or sample. This is because MultiQC tries to squish in as many statistics from as many steps of the pipeline as possible (for example, statistics on each of a pair of FASTQ files, and then statistics on the single merged and mapped BAM file), so you should play around with the column configuration to find the best way to initially evaluate your data.

The bulk of the MultiQC report is made up of per-tool summary plots (e.g., barcharts, linecharts etc). Most of these will be interactive, i.e. you can hover over lines and bars to get more specific numbers of each plot. However the visualisations are aimed at helping you identify possible outliers that may represent failed samples or libraries.

Evaluating how good the data is and how well the run went will vary depending on the dataset and the options selected. However, the nf-core/eager tutorials have a good overview of questions you can ask of your MultiQC report to see whether your data is good or not. We shamelessly copy these questions here (as the overlap between the authors of the nf-core/eager documentation and this textbook is rather high).

Once the run has completed, you can try going through the MultiQC report from the command you executed above, and compare against the questions below. Keep in mind you have a sample N of two, so many of the questions regarding identifying ‘outliers’ may be difficult.

The following questions are by no means comprehensive, but rather represent the ‘typical’ questions the nf-core/eager developers asked of their own data and reports. However they can act as a good framework for thinking critically about your own results.

18.3.3.2.1 General Stats Table
  • Do I see the expected number of raw sequencing reads (summed across each set of FASTQ files per library) that was requested for sequencing?
  • Does the percentage of trimmed reads look normal for aDNA, and do lengths after trimming look short as expected of aDNA?
  • Does the Endogenous DNA (%) column look reasonable (high enough to indicate you have received enough coverage for downstream analyses), and/or do you lose an unusually high number of reads after filtering?
  • Does ClusterFactor or ‘% Dups’ look high (e.g. >2 or >10% respectively - high values suggest over-amplified or badly preserved samples, i.e. low complexity; note that genome-enrichment libraries may by their nature look higher)?
  • Do you see an increased frequency of C>Ts on the 5’ end of molecules in the mapped reads?
  • Do median read lengths look relatively low (normally <= 100 bp) indicating typically fragmented aDNA?
  • Does the % coverage decrease relatively gradually at each depth of coverage, rather than dropping extremely drastically?
  • Does the Median coverage and percent >3x (or whatever you set) show sufficient coverage for reliable SNP calls and that a good proportion of the genome is covered indicating you have the right reference genome?
  • Do you see a high proportion of % Hets, indicating many multi-allelic sites (and possibly presence of cross-mapping from other species, that may lead to false positive or less confident SNP calls)?
18.3.3.2.2 FastQC (pre-AdapterRemoval)
  • Do I see any very early drop off of sequence quality scores suggesting problematic sequencing run?
  • Do I see outlier GC content distributions?
  • Do I see high sequence duplication levels?
18.3.3.2.3 AdapterRemoval
  • Do I see high numbers of singletons or discarded read pairs?
18.3.3.2.4 FastQC (post-AdapterRemoval)
  • Do I see improved sequence quality scores along the length of reads?
  • Do I see reduced adapter content levels?
18.3.3.2.5 Samtools Flagstat (pre/post Filter)
  • Do I see outliers, e.g. with unusually low levels of mapped reads, (indicative of badly preserved samples) that require downstream closer assessment?
18.3.3.2.6 DeDup/Picard MarkDuplicates
  • Do I see large numbers of duplicates being removed, possibly indicating over-amplified or badly preserved samples?
18.3.3.2.7 MALT (metagenomics only)
  • Do I have a reasonable level of mappability?
    • Somewhere between 10-30% can be pretty normal for aDNA, whereas e.g. <1% requires careful manual assessment
  • Do I have a reasonable taxonomic assignment success?
    • You hope to have a large number of the mapped reads (from the mappability plot) that also have taxonomic assignment.
18.3.3.2.8 PreSeq (genomics only)
  • Do I see a large drop off of a sample’s curve away from the theoretical complexity? If so, this may indicate it’s not worth performing deeper sequencing as you will get few unique reads (vs. duplicates that are not any more informative than the reads you’ve already sequenced)
18.3.3.2.9 DamageProfiler (genomics only)
  • Do I see evidence of damage on the microbial DNA (i.e. a % C>T of more than ~5% in the first few nucleotide positions)? If not, your mapped reads may derive from modern contamination.
18.3.3.2.10 QualiMap (genomics only)
  • Do you see a peak of coverage (X) at a good level, e.g. >= 3x, indicating sufficient coverage for reliable SNP calls?
18.3.3.2.11 MultiVCFAnalyzer (genomics only)
  • Do I have a good number of called SNPs that suggest the samples have genomes with sufficient nucleotide diversity to inform phylogenetic analysis?
  • Do you have a large number of discarded SNP calls?
  • Are the % Hets very high, indicating possible cross-mapping from off-target organisms that may confound variant calling?

As above, how you evaluate these outputs will vary greatly depending on the data and/or pipeline settings. However, the extensive output documentation of nf-core/eager can guide you through every single table and plot to assist you in continuing any type of ancient DNA project, assisted by fun little cartoony schematic diagrams (Figure 18.3)!

Figure 18.3: Example of a cartoon schematic diagram of the output from DamageProfiler in different contexts. The six panels show different types of ancient DNA damage line-plots, ranging from no damage (flat red/blue lines, with a speech bubble noting that if the library was UDG treated the flat lines might be valid) all the way to the ‘classic’ ancient DNA damage plot, with both red and blue lines showing an exponential decay from the ends of reads to the middle, accompanied by a motivational speech bubble. Source: Zandra Fagernäs, CC-BY 4.0 via nf-core/eager documentation

18.3.4 Clean up

Before continuing on to the next section of this chapter, you will need to deactivate the conda environment

conda deactivate 

If the nf-core/eager run is still not finished, you should enter the screen session with screen -r eager, press ctrl + c until you drop to the prompt, and type exit.

You may also need to delete the contents of the eager directory if you are low on hard drive space.

cd /<path>/<to>/ancient-metagenomic-pipelines/eager
rm -r *

18.4 What is aMeta?

Tip

For this chapter’s exercises, if not already performed, you will need to create the conda environment from the following yml file, and activate the environment:
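The creation command itself is not shown above; a minimal sketch, assuming the aMeta repository bundled in this chapter’s directory ships its conda environment file at the location documented in the aMeta README (check there if the path differs), would be:

## Path to the environment yml is an assumption - adjust if your copy differs
conda env create -f /<path>/<to>/ancient-metagenomic-pipelines/aMeta/workflow/envs/environment.yaml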

conda activate aMeta

While nf-core/eager is a solid pipeline for microbial genomics, and can also perform metagenomic screening via the integrated HOPS pipeline or Kraken2, in some cases you may wish to have a more accurate and resource-efficient pipeline. In this section, we will demonstrate an example of using aMeta, a Snakemake workflow proposed by Pochon et al. (2022) that aims to minimise resource usage by combining low-resource k-mer-based taxonomic profiling with accurate read alignment (Figure 18.4).

Figure 18.4: Schematic diagram of the aMeta pipeline. Input samples initially undergo generalised screening using the k-mer-based KrakenUniq. The genome of every hit that the reads match inside this database is then built into a MALT database, where the reads undergo a mapping step to generate alignments, a lowest-common-ancestor (LCA) algorithm step to refine taxonomic assignments, and ancient DNA authentication statistics generation

Rather than the very computationally heavy HOPS pipeline (Hübler et al. 2019), which requires extremely large computational nodes with large amounts of RAM (>1 TB) to load MALT databases into memory, aMeta does this via a two-step approach. Firstly, it uses KrakenUniq (a k-mer-based and thus memory-efficient method) to screen the sequencing reads against a broad, generalised microbial database. Once all the possible taxa have been detected, aMeta then makes a new MALT database of just the genomes of the taxa that were reported by KrakenUniq (i.e. a specific database). MALT, on this much-reduced database, is then used to perform the computationally much heavier alignment against the reference genomes and LCA taxonomic reassignment. The output from MALT is then sent to the MaltExtract program of the HOPS pipeline for ancient DNA authentication statistics.

18.4.1 Running aMeta

In this tutorial we will try running the small test data that comes with aMeta.

aMeta has been written in Snakemake, which means the pipeline has to be installed and run in a slightly different manner to the nextflow pull command that can be used for nf-core/eager.

cd /<path>/<to>/ancient-metagenomic-pipelines/aMeta

As aMeta also includes tools that normally require very large computational resources that cannot fit on a standard laptop, we will instead re-use the very small internal ‘fake’ data the aMeta developers use to test the pipeline.

You don’t have to worry about trying to understand what the following commands are doing; they will not be important for the rest of the chapter. Generally, the commands pull all the relevant software (via conda), make a fake database, download other required files, and then reconstruct the basic directory and file structure required for running the pipeline.

Warning

The next steps, particularly setting up the conda environments, will take a very long time!

If you are impatient, you can speed this process up by using mamba rather than conda.

## Download installation script and run
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh

## When asked: Accept the license
## When asked: Press 'y' to run conda init

## Turn off base
conda config --set auto_activate_base false

## Re-add original conda environments to be recognised by mamba
conda config --append envs_dirs /home/ubuntu/bin/miniconda3/envs/
## Change into ~/.test to set up all the required test resources (Databases etc.)
cd ~/.test

## Set up conda envs
## If you get an error about a 'non-default solver backend', then run `conda config --set solver classic` and re-start the command
## If you have installed the standalone mamba, change --conda-frontend to mamba
snakemake --use-conda --show-failed-logs -j 2 --conda-cleanup-pkgs cache --conda-create-envs-only -s ../workflow/Snakefile --conda-frontend conda
source $(dirname $(dirname $CONDA_EXE))/etc/profile.d/conda.sh

## If you had to change the solver back to classic, you can switch to libmamba with `conda config --set solver libmamba`

## Build dummy KrakenUniq database
env=$(grep krakenuniq .snakemake/conda/*yaml | awk '{print $1}' | sed -e "s/.yaml://g")
conda activate $env
krakenuniq-build --db resources/KrakenUniq_DB --kmer-len 21 --minimizer-len 11 --jellyfish-bin $(pwd)/$env/bin/jellyfish
conda deactivate

## Get Krona taxonomy tax dump
env=$(grep krona .snakemake/conda/*yaml | awk '{print $1}' | sed -e "s/.yaml://g" | head -1)
conda activate $env
cd $env/opt/krona
./updateTaxonomy.sh taxonomy
cd -
conda deactivate

## Adjust malt max memory usage
env=$(grep hops .snakemake/conda/*yaml | awk '{print $1}' | sed -e "s/.yaml://g" | head -1)
conda activate $env
version=$(conda list malt --json | grep version | sed -e "s/\"//g" | awk '{print $2}')
cd $env/opt/malt-$version
sed -i -e "s/-Xmx64G/-Xmx3G/" malt-build.vmoptions
sed -i -e "s/-Xmx64G/-Xmx3G/" malt-run.vmoptions
cd -
conda deactivate

touch .initdb

## Run a quick test
snakemake --use-conda --show-failed-logs --conda-cleanup-pkgs cache -s ../workflow/Snakefile $@ --conda-frontend conda

snakemake -s ../workflow/Snakefile --report --report-stylesheet ../workflow/report/custom.css --conda-frontend conda

## Now we move back into the main repository where we can symlink all the database files back to try running our 'own' test

cd ../
cd resources/
ln -s ../.test/resources/* .
cd ../config
mv config.yaml config.yaml.bkp
mv samples.tsv samples.tsv.bkp
cd ../
ln -s .test/data/ .

## Again get the taxonomy tax dump for Krona, but this time for a real run
env=$(grep krona .snakemake/conda/*yaml | awk '{print $1}' | sed -e "s/.yaml://g" | head -1)
conda activate $env
cd $env/opt/krona
./updateTaxonomy.sh taxonomy
cd -
conda deactivate

OK, now that aMeta is all set up, we can simulate running a ‘real’ pipeline job!

18.4.2 aMeta configuration

First, we will need to configure the workflow by creating a tab-delimited samples.tsv file inside aMeta/config/ that provides the names of and paths to the input FASTQ files. In a text editor (e.g. nano), write the following names and paths in TSV format.

sample  fastq
foo data/bar.fq.gz
bar data/foo.fq.gz
Warning

Make sure when copy-pasting into your text editor that tabs are not replaced with spaces, otherwise the file might not be read!
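One way to sidestep the tab problem entirely is to write the file from the command line with printf, which emits literal tab characters (run this from the main aMeta directory):

printf 'sample\tfastq\nfoo\tdata/bar.fq.gz\nbar\tdata/foo.fq.gz\n' > config/samples.tsv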

Then we need to write a config file. This tells aMeta where to find things such as database files and other settings.

These paths and settings go inside a config.yaml file inside aMeta/config/. A minimal example config.yaml file can look like this.

samplesheet: "config/samples.tsv"

krakenuniq_db: resources/KrakenUniq_DB

bowtie2_db: resources/ref.fa
bowtie2_seqid2taxid_db: resources/seqid2taxid.pathogen.map
pathogenomesFound: resources/pathogenomesFound.tab

malt_nt_fasta: resources/ref.fa
malt_seqid2taxid_db: resources/KrakenUniq_DB/seqid2taxid.map
malt_accession2taxid: resources/accession2taxid.map

ncbi_db: resources/ncbi

n_unique_kmers: 1000
n_tax_reads: 200
Note

As this is only a dummy run (due to the large-ish computational resources required for KrakenUniq), we re-use some of the resource files here. While this will produce nonsense output, it is used here to demonstrate how you would execute the pipeline.

18.4.3 Prepare and run aMeta

Make sure you’re still in the aMeta conda environment, and in the main aMeta directory with

cd /<path>/<to>/ancient-metagenomic-pipelines/aMeta/

And, finally, we are ready to run aMeta!

## change conda-frontend to mamba if you set this up earlier!
snakemake --snakefile workflow/Snakefile --use-conda -j 10 --conda-frontend conda
Note

You can modify -j to represent the number of available CPUs you have on your machine.

If you are using conda as the frontend, this could be quite slow!

Firstly, you’ll see a bunch of messages displayed on the console about JSON schemas, building the DAG of jobs, and downloading and installing conda environments. You will know the pipeline has started when you get green messages saying things like ‘Checkpoint’ and ‘Finished job XX’.

You’ll know the job is complete once you get the following messages:

Finished job 0.
31 of 31 steps (100%) done
Complete log: .snakemake/log/2023-10-05T155051.524987.snakemake.log

18.4.4 aMeta output

All output files of the workflow are located in the aMeta/results directory. To get a quick overview of the ancient microbes present in your samples, you should check the heatmap in results/overview_heatmap_scores.pdf.

Figure 18.5: Example microbiome profiling summary heatmap from aMeta. The columns represent different samples, and the rows different species. The cells of the heatmap are coloured from blue, to yellow, to red, representing aMeta authentication scores from 0 to 10, with higher numbers indicating greater confidence that the hit is both the correct taxonomic assignment and ancient. The numbers in the coloured cells also provide the exact score.

The heatmap shows the microbial species (in rows) authenticated for each sample (in columns) (Figure 18.5). The colors and the numbers in the heatmap represent authentication scores, i.e. a numeric quantification of seven quality metrics that provide information about microbial presence and ancient status. The authentication scores can vary from 0 to 10; the higher the score, the more likely that a microbe is present in a sample and is ancient.

Typically, scores from 8 to 10 (red color in the heatmap) provide good confidence of ancient microbial presence in a sample. Scores from 5 to 7 (yellow and orange colors in the heatmap) can imply that either: a) a microbe is present but not ancient, i.e. a modern contaminant, or b) a microbe is ancient (the reads are damaged) but was perhaps aligned to the wrong reference, i.e. it is not the microbe you think it is. The former is the more common scenario. The latter often happens when an ancient microbe is correctly detected at the genus level but we are not confident about the exact species, and might be aligning the damaged reads to a non-optimal reference, which leads to a lot of mismatches or poor evenness of coverage. Scores from 0 to 4 (blue color in the heatmap) typically mean that we have very little statistical evidence (very few reads) to claim the presence of a microbe in a sample.

The seven quality metrics are:

  • deamination profile,
  • evenness of coverage,
  • edit distance (amount of mismatches) for all reads,
  • edit distance (amount of mismatches) for damaged reads,
  • read length distribution,
  • PMD scores distribution,
  • number of assigned reads (depth of coverage),

To visually examine these metrics, which correspond to the numbers and colors of the heatmap, you can find them in results/AUTHENTICATION/sampleID/taxID/authentic_<Sample>_<sampleID>.trimmed.rma6_<TaxID>_<taxID>.pdf for each sample sampleID and each authenticated microbe taxID. An example of these quality metrics is shown below in Figure 18.6.

Figure 18.6: Example of the sample/hit-specific PDF output from aMeta. The 9 panels represent different figures that are useful for evaluating the authenticity of an ancient metagenomic hit. From left to right, top to bottom, the panels consist of: 1. Edit distance (all reads) bar plot, 2. Edit distance (ancient reads) bar plot, 3. Breadth of coverage line plot, 4. Deamination line plot, 5. Read length distribution bar plot, 6. PMD score histogram, 7. Percent identity bar plot, 8. a table of similarity to hits ordered from closest to least close, and 9. a general statistics table including the name of the taxonomic node, number of reads, duplicates, and mean read length etc.
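To list all of the per-sample, per-taxon authentication PDFs your run produced (assuming the default results/ location), you can use find:

find results/AUTHENTICATION -name "*.pdf"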

18.4.5 Clean up

Before continuing on to the next section of this chapter, you will need to deactivate the conda environment

conda deactivate 

18.5 What is nf-core/mag?

nf-core/mag (Krakau et al. 2022) is another Nextflow best-practice pipeline, this time for the de novo assembly, binning, and annotation of metagenomes (Figure 18.7). While it was originally designed for modern metagenomes (with optional support for co-assembly with Nanopore long reads), it has since been updated to include specific functionality and parameters for ancient metagenomes too.

Figure 18.7: Overview diagram of the main steps and tools of the nf-core/mag pipeline. Taking reads in FASTQ format or a samplesheet as input, reads can go through adapter and/or quality trimming with specific tools for both short and long reads. These reads can optionally go into taxonomic classification and visualisation, at the same time as the prepared reads go into sample- or group-wise assembly, evaluation, and the ancient DNA validation subworkflow. Resulting contigs can be functionally annotated at the same time as optionally going through binning and binning refinement and evaluation (with statistics generation). Bins and refined bins can be taxonomically classified and annotated, with all final reports going into MultiQC.

nf-core/mag covers the same major steps as you will have run if you followed the chapter on De Novo Assembly. Firstly, as with nf-core/eager, nf-core/mag does some initial sequencing QC and clean-up with fastp or AdapterRemoval, running FastQC before and after this step to make sure that adapters and overly short reads have been removed. It can then optionally do basic taxonomic classification of the raw reads with Kraken2 or Centrifuge. The processed reads then undergo assembly with MEGAHIT, SPAdes, or SPAdeshybrid (the latter being for combining short and long reads), with contig annotation with Prodigal for gene prediction and assembly statistics with QUAST. A unique feature of nf-core/mag over other de novo assembly pipelines is an aDNA validation and correction step: contigs undergo ‘reference-free’ damage profiling using pyDamage, and damaged bases mistakenly incorporated by the assembler are then corrected using freebayes and BCFtools. Contigs are then optionally grouped into genomic ‘bins’ with a range of metagenomic binners (MetaBAT2, MaxBin2, and CONCOCT), with optional binning refinement with DAS Tool. Bins are then re-annotated with Prokka and taxonomically classified with either CAT or GTDB-Tk. Evaluation of the resulting bins is carried out with BUSCO, CheckM, GUNC and/or QUAST to assess bin completeness. And, as with all nf-core pipelines, all the results are wrapped up into a MultiQC report. Detailed documentation can be found on the nf-core/mag parameters and output pages.

18.5.1 Running nf-core/mag

First re-activate the chapter’s conda environment.

conda activate ancient-metagenomic-pipelines

In this tutorial, we will use the same data and aim to replicate the steps taken in the De Novo Assembly chapter. It is important to note that, generally, de novo assembly is very computationally intensive. In many cases you will not be able to run a de novo assembly on a standard laptop, therefore some of the parameters used here have been tweaked to make the pipeline runnable on powerful laptop-level machines.

This tutorial will re-use the data from the de novo assembly chapter. If you have not run that practical, or have deleted the data, please follow the instructions at the top of the de novo assembly page to download the denovo-assembly.tar.gz file and unpack it into the denovo-assembly directory.

Once you’ve finished, make sure to change back into this chapter’s directory with

cd /<path>/<to>/ancient-metagenomic-pipelines

The manual steps of the previous chapter included:

  1. Read clean up with fastp,
  2. Assembly with MEGAHIT,
  3. Ancient DNA assessment and correction with freebayes and pyDamage
  4. Binning using MetaBAT2 and MaxBin2 (originally using the MetaWRAP pipeline),
  5. Bin assessment with CheckM
  6. Contig taxonomic classification with MMSeqs2 via the GTDB database
  7. Genome annotation with prokka

So, what command would we use to recreate these?

## 🛑 Don't run this yet! 🛑
nextflow run nf-core/mag -r 2.3.2 \
-profile test,docker \
--input '../../denovo-assembly/seqdata_files/*{R1,R2}.fastq.gz' --outdir ./results \
--ancient_dna  --binning_map_mode own \
--binqc_tool checkm --checkm_db data/checkm_data_2015_01_16/ \
--centrifuge_db false --kraken2_db false --skip_krona --skip_spades --skip_spadeshybrid --skip_maxbin2 --skip_concoct --skip_prodigal --gtdb false  --busco_reference false 

In fact, nf-core/mag will do most of the steps for you! In this case, we mostly just need to deactivate most of the additional steps (the last line). Otherwise, as with nf-core/eager, we have done the following:

  1. Tell nextflow to run the nf-core/mag pipeline with version 2.3.2

  2. Specify which computing and software environment to use with -profile

    • In this case we are running locally so we don’t specify a computing environment (such as a configuration for an institutional HPC)
    • We use docker as our container engine, which downloads all the software and specific versions needed for nf-core/mag in immutable ‘containers’, to ensure nothing get broken and is as nf-core/mag expects
  3. Provide the various paths to the input files - the raw FASTQ files and the output directory

  4. Specify we want to run the aDNA mode with mapping of reads back to the original contigs rather than for co-assembly

    The aDNA mode of nf-core/mag only supports mapping reads back to the original contigs. The other mapping modes are used when doing co-assembly, where you assume there are similar organisms across all your metagenomes and wish to map reads from all samples in a group to a single sample’s contigs (for example). Doing this on aDNA reads would risk creating false-positive damage patterns, as damaged reads may derive from a different sample and not actually be present in the sample used for generating the given contig.

  5. Specify we want to generate completeness statistics with CheckM (rather than BUSCO), with the associated pre-downloaded database

    Most nf-core pipelines will actually download reference databases, build reference indices for alignment, and in some cases download reference genomes for you, if you do not specify them as pipeline input. These download and reference-building steps are often very time consuming, so it’s recommended that once you’ve let the pipeline download a database once, you move the files somewhere safe so you can re-use them in subsequent pipeline runs.

  6. Skip a few steps that are either ‘nice to have’ (e.g. read taxonomic classification), or require large computational resources (e.g. metaSPAdes, CONCOCT or BUSCO)

    The steps we are skipping are: host/artefact removal (Bowtie2 removal of phiX), taxonomic classification of reads (Centrifuge, Kraken2, Krona), the additional assemblers (metaSPAdes and SPAdeshybrid, as these require very large computational resources), additional binners (MaxBin2 and CONCOCT; while they are used in the de novo assembly chapter, they take a long time to run), raw contig annotation (with Prodigal, as we do this at the bin level), and contig taxonomic classification (with GTDB and BUSCO, as they require very large databases).

With this we are almost ready for running the pipeline!

18.5.2 Configuring Nextflow pipelines

Before we execute the command, however, we need to again consider the resource requirements. As described above, de novo assembly in particular (but also many other bioinformatic analyses) requires a lot of computational resources, depending on the type of data and analyses you wish to run.

For most pipelines you must tweak the resource requirements to ensure that a) they will fit on your cluster, and b) they will run optimally so you don’t under- or over-use your computational resources, where the latter will make either yourself (it takes too long) or your cluster/server system administrators (it blocks other users) unhappy!

While nf-core pipelines all have ‘reasonable default’ settings for memory, CPU, and time limits, they will not work in all contexts. Here we will give a brief example of how to tweak the parameters of the pipeline steps so that they can run on our compute node or laptop.

For Nextflow, we must make a special ‘config’ that defines the names of each step of the pipeline you wish to tweak and the corresponding resources.

For our nf-core/mag run, you will need to open an empty file called custom.config in your text editor, and copy and paste the contents of the next code block into it.

process {

  withName: FASTP { 
    cpus = 8
  }

  withName: BOWTIE2_PHIX_REMOVAL_ALIGN { 
    cpus = 8
  }

  withName: BOWTIE2_ASSEMBLY_ALIGN {    
    cpus = 8
  }

  withName: PYDAMAGE_ANALYZE {      
    cpus = 8
  }

  withName: PROKKA {
    cpus = 4
  }

  withName: MEGAHIT {
    cpus = 8
  }

  withName: CHECKM_LINEAGEWF{
    memory = 24.GB
    cpus   = 8
    ext.args = "--reduced_tree"
  }

}

Here we set the number of CPUs for most tools to a maximum of 8, and we limit the amount of memory for CheckM to 24 GB (down from the default of 36 GB, which exceeds most laptops’ memory). We also give a specific CheckM flag to use a smaller database.

We can save this custom.config file and supply it to the Nextflow command with the -c parameter. The good thing about the config file is that you can re-use it across runs on the same machine! So set it once, and never think about it again.

And with that, you can run your nf-core/mag pipeline!

As recommended in the nf-core/eager section above, start a screen session.

screen -R mag

Then execute the command.

nextflow run nf-core/mag -r 2.3.2 \
-profile test,docker \
--input '../../denovo-assembly/seqdata_files/*{R1,R2}.fastq.gz' --outdir ./results \
--ancient_dna  --binning_map_mode own \
--binqc_tool checkm --checkm_db data/checkm_data_2015_01_16/ \
--centrifuge_db false --kraken2_db false --skip_krona --skip_spades --skip_spadeshybrid --skip_maxbin2 --skip_concoct --skip_prodigal --gtdb false  --busco_reference false \
-c custom.config

Again, as with nf-core/eager, when the run starts you will see a long list of process lines with progress bars slowly appearing over time. If in a screen session, you can press ctrl + a and then [ to access the ‘navigation’ (called ‘copy’) mode, then use the arrow keys on your keyboard to scroll up and down. To quit this mode, just press q.

The pipeline run above should take around 20 minutes or so to complete.

Tip

Configuration files don’t need to be personal nor custom!

You can also generate a HPC / server specific profile which can be used on all nf-core pipelines (and even your own!).

See nf-co.re/configs for all existing institutional configs

18.5.3 nf-core/mag output

As with nf-core/eager, you will have a results directory containing many folders representing different stages of the pipeline, and a MultiQC report.

18.5.3.1 Results Files

Relevant directories for evaluating your assemblies are as follows:

  • QC_shortreads/
    • Contains various folders from the sequencing QC and raw read processing (e.g., FastQC, fastp/AdapterRemoval, host-removal results, etc.)
  • Taxonomy/
    • If you have not already performed taxonomic classification of your reads, the output of the Kraken2 or Centrifuge steps integrated into nf-core/mag will be deposited here, including an optional Krona pie chart
    • If activated, the contig-level taxonomic assignments will also be deposited here (e.g. from CAT or GTDB)
  • Assembly/
    • This directory will contain one directory per chosen assembler, each with per-sample output files for each assembly
    • Within each assembler subdirectory, a QC folder will contain the QUAST output
  • Ancient_DNA/
    • This folder will contain the output of the PyDamage contig authentication, allowing you to select or filter for just the putatively ancient contigs (one way of doing so is sketched after this list)
  • variant_calling/
    • This directory will contain the ‘damage-corrected’ contigs that will be used in downstream steps of the pipeline, if the aDNA mode is turned on
  • GenomeBinning/
    • This directory will contain all the output from each of the selected contig binning tools. As well as the binned FASTAs, you will also find read-depth statistics from when the input reads are re-mapped back to the contigs
    • The directory will also contain the output from the DASTool binning refinement, if turned on
    • In the QC sub-directory of this folder, you will find the QUAST results on the genome bins (rather than the raw contigs as above), as well as BUSCO/CheckM/GUNC MAG completeness estimates
  • Prodigal / Prokka
    • These directories will contain the genome annotation output for the raw contigs (Prodigal) or for the bins (Prokka)
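
As a rough sketch of how you might use the PyDamage output mentioned above to pull out the putatively ancient contigs: the file path and the qvalue column name below are assumptions based on default PyDamage output, so check your own results directory and the CSV header before running it.

# Path to the PyDamage results table (adjust to your own assembler/sample names)
RESULTS_CSV=results/Ancient_DNA/pydamage/analyze/pydamage_results.csv

# Keep the header plus any contig whose q-value (assumed column name 'qvalue')
# is below 0.05, i.e. contigs with significant evidence of aDNA damage
awk -F',' 'NR==1 { for (i = 1; i <= NF; i++) if ($i == "qvalue") col = i; print; next }
           col && $col < 0.05' "$RESULTS_CSV" > putatively_ancient_contigs.csv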

18.5.3.2 Report Table

As of v2.3.*, the MultiQC report is unfortunately broken; however, this has been fixed for v2.4. Many of the questions asked in the preprocessing section of the nf-core/eager results interpretation will also apply here.

Other than the QUAST results in the Assembly/*/QC/ or GenomeBinning/*/QC/ directories, the main table used for evaluating the output files is the bin_summary.tsv table in the GenomeBinning/ directory.

In this file you would typically want to assess the following columns (a quick command-line sketch for inspecting this table follows the list):

  • % Complete * columns - the more expected domain-specific genes recovered, the better the quality of the resulting bin is normally considered to be.
  • # contigs (>=* bp) columns - which represent the number of contigs above different length thresholds; the fewer short and the more long contigs you have, the better the assembly.
  • This can also be evaluated by the N50 or L75 columns (as described in the De novo assembly chapter).
  • fastani and closest_placement_taxonomy - these can tell you if your particular bin has a genome very similar to existing species in taxonomic databases
  • warnings - for specific comments on each assignment
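
As mentioned above, here is a quick command-line sketch for inspecting this table - the column positions are assumptions, so use the first command to check which column actually holds the completeness values before sorting on it:

cd results/GenomeBinning/

# Print the column names with their positions, so you know which column is which
head -1 bin_summary.tsv | tr '\t' '\n' | cat -n

# Sort the bins by an assumed completeness column (replace the 2 with the column
# number you identified above), keeping the header line on top
{ head -1 bin_summary.tsv; tail -n +2 bin_summary.tsv | sort -t$'\t' -k2,2gr; } | head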

18.6 (Optional) clean-up

Let’s clean up your working directory by removing all the data and output from this chapter.

The command below will remove the /<PATH>/<TO>/ancient-metagenomic-pipelines directory as well as all of its contents.

Pro Tip

Always be VERY careful when using rm -r. Check 3x that the path you are specifying is exactly what you want to delete and nothing more before pressing ENTER!

rm -rf /<PATH>/<TO>/ancient-metagenomic-pipelines*

Once deleted you can move elsewhere (e.g. cd ~).

We can also get out of the conda environment with

conda deactivate

To delete the conda environment

conda remove --name ancient-metagenomic-pipelines --all -y

18.7 Questions to think about

  1. Why is it important to use a pipeline for genomic analysis of ancient data?
  2. How can the design of pipelines such as nf-core/eager help researchers comply with the FAIR principles for management of scientific data?
  3. What metrics do you use to evaluate the success/failure of ancient DNA sequencing experiments? How can these measures be evaluated when using nf-core/eager for data preprocessing and analysis?

18.8 References

Andrades Valtueña, Aida, Gunnar U Neumann, Maria A Spyrou, Lyazzat Musralina, Franziska Aron, Arman Beisenov, Andrey B Belinskiy, et al. 2022. “Stone Age Yersinia Pestis Genomes Shed Light on the Early Evolution, Diversity, and Ecology of Plague.” Proceedings of the National Academy of Sciences of the United States of America 119 (17): e2116722119. https://doi.org/10.1073/pnas.2116722119.
Dimopoulos, Evangelos A, Alberto Carmagnini, Irina M Velsko, Christina Warinner, Greger Larson, Laurent A F Frantz, and Evan K Irving-Pease. 2022. “HAYSTAC: A Bayesian Framework for Robust and Rapid Species Identification in High-Throughput Sequencing Data.” PLoS Computational Biology 18 (9): e1010493. https://doi.org/10.1371/journal.pcbi.1010493.
Fellows Yates, James A, Thiseas C Lamnidis, Maxime Borry, Aida Andrades Valtueña, Zandra Fagernäs, Stephen Clayton, Maxime U Garcia, Judith Neukamm, and Alexander Peltzer. 2021. “Reproducible, Portable, and Efficient Ancient Genome Reconstruction with Nf-Core/Eager.” PeerJ 9 (March): e10947. https://doi.org/10.7717/peerj.10947.
Hübler, Ron, Felix M Key, Christina Warinner, Kirsten I Bos, Johannes Krause, and Alexander Herbig. 2019. “HOPS: Automated Detection and Authentication of Pathogen DNA in Archaeological Remains.” Genome Biology 20 (1): 280. https://doi.org/10.1186/s13059-019-1903-0.
Krakau, Sabrina, Daniel Straub, Hadrien Gourlé, Gisela Gabernet, and Sven Nahnsen. 2022. “Nf-Core/Mag: A Best-Practice Pipeline for Metagenome Hybrid Assembly and Binning.” NAR Genomics and Bioinformatics 4 (1). https://doi.org/10.1093/nargab/lqac007.
Louvel, Guillaume, Clio Der Sarkissian, Kristian Hanghøj, and Ludovic Orlando. 2016. “metaBIT, an Integrative and Automated Metagenomic Pipeline for Analysing Microbial Profiles from High-Throughput Sequencing Shotgun Data.” Molecular Ecology Resources 16 (6): 1415–27. https://doi.org/10.1111/1755-0998.12546.
Neuenschwander, Samuel, Diana I Cruz Dávalos, Lucas Anchieri, Bárbara Sousa da Mota, Davide Bozzi, Simone Rubinacci, Olivier Delaneau, Simon Rasmussen, and Anna-Sapfo Malaspinas. 2023. “Mapache: A Flexible Pipeline to Map Ancient DNA.” Bioinformatics (Oxford, England) 39 (2): btad028. https://doi.org/10.1093/bioinformatics/btad028.
Peltzer, Alexander, Günter Jäger, Alexander Herbig, Alexander Seitz, Christian Kniep, Johannes Krause, and Kay Nieselt. 2016. “EAGER: Efficient Ancient Genome Reconstruction.” Genome Biology 17 (1): 1–14. https://doi.org/10.1186/s13059-016-0918-z.
Pochon, Zoé, Nora Bergfeldt, Emrah Kırdök, Mário Vicente, Thijessen Naidoo, Tom van der Valk, N Ezgi Altınışık, et al. 2022. “aMeta: An Accurate and Memory-Efficient Ancient Metagenomic Profiling Workflow.” bioRxiv. https://doi.org/10.1101/2022.10.03.510579.
Schubert, Mikkel, Luca Ermini, Clio Der Sarkissian, Hákon Jónsson, Aurélien Ginolhac, Robert Schaefer, Michael D Martin, et al. 2014. “Characterization of Ancient and Modern Genomes by SNP Detection and Phylogenomic and Metagenomic Analysis Using PALEOMIX.” Nature Protocols 9 (5): 1056–82. https://doi.org/10.1038/nprot.2014.063.
Wratten, Laura, Andreas Wilm, and Jonathan Göke. 2021. “Reproducible, Scalable, and Shareable Analysis Pipelines with Bioinformatics Workflow Managers.” Nature Methods 18 (10): 1161–68. https://doi.org/10.1038/s41592-021-01254-9.
Yu, He, Maria A Spyrou, Marina Karapetian, Svetlana Shnaider, Rita Radzevičiūtė, Kathrin Nägele, Gunnar U Neumann, et al. 2020. “Paleolithic to Bronze Age Siberians Reveal Connections with First Americans and Across Eurasia.” Cell 181 (6): 1232–1245.e20. https://doi.org/10.1016/j.cell.2020.04.037.