by Nikolay Oskolkov* and Zoé Pochon*
*These authors contributed equally
Back in Part 1, we showed you some of the many identification pitfalls of exploring ancient metagenomic datasets. Reads can align to the wrong reference genome for many reasons, e.g., short read length, lack of specificity, sequencing errors, database bias, contamination, and shared homology across organisms. This makes it essential to double-check the true presence of a detected species and not just its ancient status. We discussed how competitive mapping against large databases (instead of mapping to a single species) drastically reduces the false discovery rate, and we showed that evenness of coverage is more important than depth of coverage alone when validating a species’ true presence in a metagenomic dataset. Here is where aMeta steps in: a pipeline designed to reduce false positive identifications and ensure that both the species assignment and its “ancientness” are properly authenticated.
Here in Part 2, we take a deeper dive into how aMeta implements solutions to these pitfalls, focusing on authentication plots and authentication scores, with a reminder of the importance of database size, an overview of the workflow implementation, and a final comparison with HOPS [1][2]. One of the key advantages of aMeta is its ability to handle larger, more diverse reference databases while maintaining RAM efficiency, an essential improvement over HOPS in environments with limited computational resources. This makes aMeta particularly suited for labs with restricted access to large memory nodes on high-performance computers, while still ensuring high detection sensitivity and reducing false positives via the use of large databases.
In the aMeta pipeline [1], evenness of coverage plays a central role in verifying species identification. While competitive mapping helps filter out reads that misalign to the wrong references, evenness of coverage serves as an additional safeguard, ensuring that the detected organism is truly present and not an artefact of misalignment. If a species is truly present, its genome should be evenly covered by sequenced reads. Uneven coverage may suggest the presence of a closely related species instead. By examining how consistently reads are distributed across the reference genome, evenness of coverage allows us to confirm the authenticity of a species’ presence and further reduces the likelihood of false positive discoveries. Below is an example of a typical aMeta output (Figure 1), featuring various quality metrics, including the evenness of coverage plot in the top-right corner (Figure 1c).
Figure 1: Typical authentication plots generated by aMeta for each detected organism over set thresholds. a) edit distance of all mapped reads, b) edit distance of deaminated mapped reads, c) evenness of coverage plot, d) deamination plot, e) read length distribution, f) PMD scores, g) average nucleotide identity, h) top 10 reference genomes, i) summary metrics.
Figures 1a and 1b represent the so-called ‘edit distance’, i.e., the number of mismatches between the mapped reads and the reference genome, for all reads (Fig 1a) and specifically for damaged reads (Fig 1b). Ideally, you should observe a decreasing profile, indicating that most of your reads match the reference perfectly. Figure 1c is the coverage plot: the reference genome is divided into 100 bins, and the plot shows how evenly the mapped reads cover them. It also provides the overall breadth of coverage as a summary metric (%) and the NCBI accession of the reference genome. Figure 1d is a deamination profile for both ends of the reads, alongside the read length distribution shown in Figure 1e. Figure 1f is a histogram of post-mortem damage (PMD) scores computed with PMDtools [3]; the more positive the PMD score, the more likely the read is to be ancient, and a score above 3 is used as a reasonable marker in aMeta. Figure 1g is a histogram of read similarity (or identity) to the reference, together with an estimate of the average nucleotide identity (ANI); commonly used in modern metagenomics, this metric resembles the edit distance in that it indicates how similar the nucleotides in the reads are to the reference genome. Figure 1h is a table of the top 10 best reference genomes for the reads, generated by MaltExtract, a tool implemented in HOPS [2], and the table in Figure 1i contains several summary metrics, e.g., the number of mapped reads.
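To make the evenness-of-coverage idea concrete, here is a minimal Python sketch, not aMeta’s actual implementation (which works on real BAM alignments), that bins hypothetical read start positions into 100 equal windows along the reference and reports the fraction of windows hit by at least one read:

```python
from collections import Counter

def coverage_evenness(read_starts, genome_length, n_bins=100):
    """Count reads per bin along the reference; read_starts is a
    hypothetical list of 0-based mapping positions."""
    bin_size = genome_length / n_bins
    counts = Counter(int(pos // bin_size) for pos in read_starts)
    # Fraction of bins receiving at least one read: a truly present
    # organism should cover most bins, not pile reads into a few.
    covered = sum(1 for b in range(n_bins) if counts[b] > 0)
    return [counts[b] for b in range(n_bins)], covered / n_bins

# Evenly spread reads cover every bin; clustered reads do not,
# even though both cases have the same number of mapped reads.
even = list(range(0, 100000, 500))           # 200 reads, spread out
clustered = [42000 + i for i in range(200)]  # 200 reads in one hotspot
_, frac_even = coverage_evenness(even, 100000)
_, frac_clustered = coverage_evenness(clustered, 100000)
```

Both toy samples contain 200 reads, yet only the evenly spread one covers the whole genome, which is exactly the distinction depth of coverage alone cannot make.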
An advantage of aMeta is that it implements all these tests and then calculates a summary authentication score for each organism found in your sample. This saves a lot of time! Imagine having a heatmap overview of the promising hits (Figure 2) and only having to assess the plots for the promising candidates! One point is given for each decreasing edit distance profile (Fig 1a and 1b), 2 points for evenness of coverage if enough bins are covered (at most 3 bins covered at less than 1%), 1 point per read end for a deamination frequency above 0.05 (2 points in total), 1 point if at least 90% of the reads are shorter than 100 bp, 1 point if at least 10% of the reads have a PMD score above 3, 1 point for an average nucleotide identity above 95%, and finally 1 point for a read depth of at least 200 reads. This gives a maximum aMeta authentication score of 10 points.
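The scoring rules above can be sketched as a simple function. The dictionary keys below are illustrative placeholders, not aMeta’s internal variable names; only the thresholds follow the description above:

```python
def ameta_score(m):
    """Sketch of the aMeta summary authentication score (max 10),
    assuming a dict of pre-computed per-organism metrics."""
    score = 0
    score += 1 if m["edit_distance_decreasing_all"] else 0      # Fig 1a
    score += 1 if m["edit_distance_decreasing_damaged"] else 0  # Fig 1b
    score += 2 if m["bins_below_1pct"] <= 3 else 0              # evenness
    score += 1 if m["deamination_5prime"] > 0.05 else 0         # 5' end
    score += 1 if m["deamination_3prime"] > 0.05 else 0         # 3' end
    score += 1 if m["frac_reads_under_100bp"] >= 0.9 else 0     # read length
    score += 1 if m["frac_pmd_above_3"] >= 0.1 else 0           # PMD scores
    score += 1 if m["mean_identity"] > 0.95 else 0              # ANI
    score += 1 if m["n_reads"] >= 200 else 0                    # read depth
    return score

# A hypothetical well-authenticated hit scores the full 10 points.
perfect = dict(edit_distance_decreasing_all=True,
               edit_distance_decreasing_damaged=True,
               bins_below_1pct=0, deamination_5prime=0.12,
               deamination_3prime=0.10, frac_reads_under_100bp=0.97,
               frac_pmd_above_3=0.25, mean_identity=0.99, n_reads=1500)
```

Each failed test simply subtracts its points, so a hit with, say, too few mapped reads would score 9 instead of 10.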
Figure 2: An example of a heatmap of aMeta authentication scores for each detected microbe in a sample. This one was generated on simulated data for the aMeta paper, warm colours represent high authentication scores and cool colours are low scores. The columns represent the FASTQ files provided to aMeta and the rows are the detected microbial species.
While evenness of coverage and other metrics help eliminate false positive hits (specificity of the analysis), they do not aid in discovering organisms absent from your database (sensitivity of the analysis). To address this limitation, it is crucial to incorporate as many candidate reference genomes into your database as possible. This point is exemplified in the aMeta publication [1], where we demonstrate that larger databases not only enhance sensitivity but also help reduce false positive discoveries. Consider the Yersinia pestis example from Part 1: expanding the database resulted in fewer mismapped reads thanks to competitive mapping. To make searches against such large databases tractable, aMeta uses KrakenUniq [4] as a broad brush-stroke metagenomic classifier. KrakenUniq uses k-mers, short DNA fragments of a fixed length, to classify DNA sequences by matching them to comprehensive reference databases, such as the complete NCBI NT, available at https://doi.org/10.17044/scilifelab.20205504. Recent advancements in KrakenUniq have made it feasible to handle large databases with reduced memory requirements, allowing it to run efficiently even on systems with limited memory [4]. By enabling an unbiased, hypothesis-free species detection approach, KrakenUniq enhances the sensitivity of metagenomic analysis, narrowing down the list of likely present organisms early in the process.
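KrakenUniq’s distinguishing feature is that it counts distinct k-mers per taxon rather than raw read hits, which helps flag taxa whose “abundance” comes from a few repeated sequences. A toy Python sketch of the idea (the real tool uses a HyperLogLog sketch to keep memory constant and matches k-mers against a taxonomy-labelled database; a plain set suffices for illustration):

```python
def unique_kmers(reads, k=31):
    """Exact count of distinct k-mers across a set of reads."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return len(kmers)

# Many copies of one repetitive fragment contribute few unique
# k-mers, however many reads map, hinting at a spurious hit.
repetitive = unique_kmers(["ACGTACGTACGT"] * 100, k=5)  # 400 "reads"
diverse = unique_kmers(["AAAAACCCCC"], k=5)             # 1 read
```

Here 100 identical repetitive reads yield fewer unique k-mers than a single non-repetitive read, which is why the unique k-mer count is a more robust presence signal than read counts alone.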
To further highlight the importance of searching against large databases, the aMeta publication [1] includes simulations of metagenomic samples using KrakenUniq [4] as the first detection step. The results, shown in Figure 3, demonstrate how larger databases lead to microbial identification results that better match the simulated ground truth. This step is important because KrakenUniq allows aMeta to detect a wide range of potential organisms early in the pipeline. However, using these large databases directly for alignment is computationally demanding.
Figure 3: As the database size increases, microbial identification using KrakenUniq as the first detection step in aMeta shows an improved correspondence** with a simulated ground truth. **The Jaccard similarity metric is used to assess the overlap between identified microbes and the ground truth.
To handle the memory challenge, KrakenUniq is first used in aMeta to classify the sample in a memory-efficient manner. After KrakenUniq reduces the list of candidate species, a custom MALT database is built for the final alignment and authentication steps. MALT [5] uses the Lowest Common Ancestor (LCA) method, which assigns each read to the most specific taxonomic group that all possible matches share, reducing the risk of misclassification by taking evolutionary relationships into account. By combining KrakenUniq’s broad detection capabilities with MALT’s focused LCA-based refinement, aMeta balances sensitivity and specificity while avoiding the heavy computational cost of aligning against the entire database at once. We chose MALT for the final authentication step because its LCA-based alignment minimizes false positives compared to LCA-free aligners like Bowtie2 [6], which rely solely on direct sequence matching without accounting for taxonomic relationships. While Bowtie2 is faster and is included in aMeta for a preliminary overview, MALT’s slower but more accurate LCA approach ensures higher precision. Finally, authentication and abundance statistics are generated, and aMeta reports species classifications for your metagenomic samples. The aMeta workflow is summarised in Figure 4.
Figure 4: Overview of the aMeta workflow. FASTQ files of the samples are classified with KrakenUniq and filtered for the best candidates. Detection results are visualized with Kronatools pie charts [7]. Bowtie2 [6] is used to quickly align the reads to the candidate species, and mapDamage [8] analysis is performed for a quick authenticity check. Simultaneously, a custom database is generated based on the KrakenUniq results for MALT analysis, which incorporates LCA-based alignments and the pipeline finally outputs authentication plots, authentication scores, and read abundance quantification.
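The LCA rule itself is easy to state in code. Here is a toy Python sketch over a hypothetical parent-pointer taxonomy (not the real NCBI taxonomy that MALT consults): a read hitting several references is assigned to the most specific node all of its hits share.

```python
def lca(taxa, parent):
    """Lowest Common Ancestor of a read's hits; parent maps each
    taxon to its parent, roots are absent from the map."""
    def path(t):
        # Walk from a taxon up to the root.
        p = [t]
        while t in parent:
            t = parent[t]
            p.append(t)
        return p

    common = set(path(taxa[0]))
    for t in taxa[1:]:
        common &= set(path(t))
    # The deepest shared node is the first common node on any path.
    for node in path(taxa[0]):
        if node in common:
            return node

# Toy taxonomy echoing the Yersinia pestis example from Part 1.
parent = {
    "Yersinia pestis": "Yersinia",
    "Yersinia pseudotuberculosis": "Yersinia",
    "Yersinia": "Enterobacterales",
    "Escherichia coli": "Enterobacterales",
}
```

A read matching only Y. pestis stays at species level, while a read matching both Yersinia species is lifted to the genus, exactly the conservative behaviour that keeps ambiguous reads from inflating species-level counts.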
In our benchmarking, we demonstrate that aMeta’s focus on likely present organisms significantly reduces RAM usage compared to profiling directly with HOPS using a database of similar size (Figure 5). This efficiency enables researchers to process large datasets on high-performance computing systems, even those with limited access to large memory nodes. However, for human shotgun metagenomics projects, it is common to require up to 1 TB of memory during the MALT step. To stay within memory limits, it is crucial to run aMeta on a project-by-project basis. By tailoring the custom MALT database to contain only the relevant candidates for each specific project, you can significantly reduce memory usage and prevent overloading your system. Additionally, the size of the input FASTQ files can significantly impact memory usage during the MALT step. If you encounter memory issues and are limited to nodes with a maximum of 1 TB, consider splitting your FASTQ files into smaller chunks to reduce the memory load.
Figure 5: Comparison of the computer memory (RAM) amount used by aMeta and HOPS alone, with and without multithreading.
In our study, we conducted a detailed comparison of aMeta and HOPS, evaluating both detection and authentication accuracy. The following results show that aMeta provides improved performance in these areas (Figures 6 and 7).
Figure 6: Comparison between aMeta (default settings), HOPS and KrakenUniq at different assigned reads thresholds. These Jaccard similarity and F1 score plots show that there is more similarity between the aMeta microbial detection results and the simulated ground truth.
Figure 7: Receiver operating characteristic (ROC) curve comparison between aMeta and HOPS, showcasing authentication performance on a simulated dataset. The ROC curve illustrates the trade-off between sensitivity (true positive rate) and specificity (1 - false positive rate). aMeta demonstrates higher accuracy with a ROC AUC (Area Under the Curve) of 0.94 compared to HOPS, which has a ROC AUC of 0.82. The diagonal line represents random classification for reference.
In method papers, it is standard practice to compare new tools with existing ones to demonstrate advancements. However, we would like to emphasize the significant role HOPS has played in the development of aMeta. HOPS served as an inspiration, and aMeta continues to incorporate several tools from HOPS [2], such as MALT [5] and MaltExtract. Our aim, however, has been to enhance these methods by improving RAM efficiency and balancing sensitivity and specificity.
By integrating advanced techniques like competitive mapping and evenness of coverage, aMeta provides a comprehensive, efficient, and scalable solution for ancient metagenomics. These innovative approaches not only improve detection accuracy but also address common computational challenges, making aMeta a powerful tool for researchers. We encourage scientists to explore our public databases and incorporate aMeta into their workflows to enhance sensitivity and reduce false positives.
P.S. We have made a few large KrakenUniq databases and Bowtie2 indexes publicly available for the community. You are welcome to download them and use them for your research; we only ask that you kindly cite our aMeta publication when using the databases in your scientific projects.
- KrakenUniq database based on the complete NCBI NT (used by default in BlastN): https://doi.org/10.17044/scilifelab.20205504
- KrakenUniq database based on microbial part of the NCBI NT database: https://doi.org/10.17044/scilifelab.20518251
- KrakenUniq database based on complete genomes of microbial NCBI RefSeq: https://doi.org/10.17044/scilifelab.21299541
- Bowtie2 index for the complete NCBI NT database (used by default in BlastN): https://doi.org/10.17044/scilifelab.21070063
- Bowtie2 index for pathogenic microbial species of the NCBI NT: https://doi.org/10.17044/scilifelab.21185887
References
[1] Pochon, Zoé, Nora Bergfeldt, Emrah Kırdök, Mário Vicente, Thijessen Naidoo, Tom Van Der Valk, N. Ezgi Altınışık, Maja Krzewińska, Love Dalén, Anders Götherström, Claudio Mirabello, Per Unneberg and Nikolay Oskolkov. “aMeta: an accurate and memory-efficient ancient metagenomic profiling workflow.” Genome Biology 24, no. 1 (2023): 242. DOI: https://doi.org/10.1186/s13059-023-03083-9
[2] Hübler, Ron, Felix M. Key, Christina Warinner, Kirsten I. Bos, Johannes Krause, and Alexander Herbig. “HOPS: automated detection and authentication of pathogen DNA in archaeological remains.” Genome Biology 20, no. 1 (2019): 1-13. DOI: https://doi.org/10.1186/s13059-019-1903-0
[3] Skoglund, Pontus, Bernd H. Northoff, Michael V. Shunkov, Anatoli P. Derevianko, Svante Pääbo, Johannes Krause, and Mattias Jakobsson. “Separating endogenous ancient DNA from modern day contamination in a Siberian Neandertal.” Proceedings of the National Academy of Sciences 111, no. 6 (2014): 2229-2234. DOI: https://doi.org/10.1073/pnas.1318934111
[4] Pockrandt, Christopher, Aleksey V. Zimin, and Steven L. Salzberg. “Metagenomic classification with KrakenUniq on low-memory computers.” Journal of Open Source Software 7, no. 80 (2022): 4908. DOI: https://doi.org/10.21105/joss.04908
[5] Herbig, Alexander, et al. “MALT: Fast alignment and analysis of metagenomic DNA sequence data applied to the Tyrolean Iceman.” bioRxiv (2016): 050559. DOI: https://doi.org/10.1101/050559
[6] Langmead, Ben, and Steven L. Salzberg. “Fast gapped-read alignment with Bowtie 2.” Nature Methods 9, no. 4 (2012): 357-359. DOI: https://doi.org/10.1038/nmeth.1923
[7] Ondov, Brian D., Nicholas H. Bergman, and Adam M. Phillippy. “Interactive metagenomic visualization in a Web browser.” BMC Bioinformatics 12 (2011): 385. DOI: https://doi.org/10.1186/1471-2105-12-385
[8] Jónsson, Hákon, et al. “mapDamage2.0: fast approximate Bayesian estimates of ancient DNA damage parameters.” Bioinformatics 29, no. 13 (2013): 1682-1684. DOI: https://doi.org/10.1093/bioinformatics/btt193