Slack Export - #authentication-standards

James Fellows Yates (james_fellows_yates@eva.mpg.de)

2020-09-23 08:08:18

@James Fellows Yates has joined the channel

Ele (eg715@york.ac.uk)

2020-09-23 08:09:22

@Ele has joined the channel

Sterling Wright (sterlingwright2016@utexas.edu)

2020-09-23 08:09:22

@Sterling Wright has joined the channel

Miriam Bravo (bravolomiriam@gmail.com)

2020-09-23 08:09:22

@Miriam Bravo has joined the channel

Jessica Hider (hiderj@mcmaster.ca)

2020-09-23 08:09:22

@Jessica Hider has joined the channel

Lucy van Dorp (lucy.dorp.12@ucl.ac.uk)

2020-09-23 08:09:22

@Lucy van Dorp has joined the channel

Ophélie Lebrasseur (ophelie.lebrasseur@liverpool.ac.uk)

2020-09-23 08:09:23

@Ophélie Lebrasseur has joined the channel

Pooja Swali (swalipooja@gmail.com)

2020-09-23 08:09:23

@Pooja Swali has joined the channel

Kun Huang (kun.huang@unitn.it)

2020-09-23 08:09:23

@Kun Huang has joined the channel

Anna F. (annakfos@gmail.com)

2020-09-23 08:09:23

@Anna F. has joined the channel

Pete Heintzman (peteheintzman@gmail.com)

2020-09-23 08:09:23

@Pete Heintzman has joined the channel

Shreya (shreya23@uchicago.edu)

2020-09-23 08:09:23

@Shreya has joined the channel

Nora Bergfeldt (nora.bergfeldt@gmail.com)

2020-09-23 08:09:23

@Nora Bergfeldt has joined the channel

Lena G (lena.granehall@gmail.com)

2020-09-23 08:09:23

@Lena G has joined the channel

Jaelle Brealey (jcbrealey@gmail.com)

2020-09-23 08:09:24

@Jaelle Brealey has joined the channel

Jaelle Brealey (jaelle.brealey@ntnu.no)

2020-09-23 08:09:24

@Jaelle Brealey has joined the channel

Christian Carøe (christian.caroe@sund.ku.dk)

2020-09-23 08:09:24

@Christian Carøe has joined the channel

aidanva (aida.andrades@gmail.com)

2020-09-23 08:28:13

@aidanva has joined the channel

Freddi Scheib (freddischeib@gmail.com)

2020-09-23 08:51:20

@Freddi Scheib has joined the channel

Katerina Guschanski (katerina.guschanski@ebc.uu.se)

2020-09-23 11:37:51

@Katerina Guschanski has joined the channel

Maria Zicos (m.zicos@qmul.ac.uk)

2020-09-23 14:14:56

@Maria Zicos has joined the channel

James Fellows Yates (james_fellows_yates@eva.mpg.de)

2020-09-23 14:23:01

@Lucy van Dorp can you post the STROBE link again (and any similar reporting things)?

Lucy van Dorp (lucy.dorp.12@ucl.ac.uk)

2020-09-23 15:00:52

Recent Lancet ID paper providing a set of guidelines for modern metagenomics following a community discussion: https://www.thelancet.com/journals/laninf/article/PIIS1473-3099(20)30199-7/fulltext

👍 James Fellows Yates, Sterling Wright, Jessica Hider

Lucy van Dorp (lucy.dorp.12@ucl.ac.uk)

2020-09-23 15:02:46

And an inter-lab 'metrology project' on antimicrobial resistance predictions from genotype data. Everyone was sent the same fastqs, purposely designed to test standards, and chaos ensues: https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000335 Something similar could work for taxonomic profiling

microbiologyresearch.org

Discordant bioinformatic predictions of antimicrobial resistance from whole-genome sequencing data of bacterial isolates: an inter-laboratory study | Microbiology Society

Antimicrobial resistance (AMR) poses a threat to public health. Clinical microbiology laboratories typically rely on culturing bacteria for antimicrobial-susceptibility testing (AST). As the implementation costs and technical barriers fall, whole-genome sequencing (WGS) has emerged as a ‘one-stop’ test for epidemiological and predictive AST results. Few published comparisons exist for the myriad analytical pipelines used for predicting AMR. To address this, we performed an inter-laboratory study providing sets of participating researchers with identical short-read WGS data from clinical isolates, allowing us to assess the reproducibility of the bioinformatic prediction of AMR between participants, and identify problem cases and factors that lead to discordant results. We produced ten WGS datasets of varying quality from cultured carbapenem-resistant organisms obtained from clinical samples sequenced on either an Illumina NextSeq or HiSeq instrument. Nine participating teams (‘participants’) were provided these sequence data without any other contextual information. Each participant used their choice of pipeline to determine the species, the presence of resistance-associated genes, and to predict susceptibility or resistance to amikacin, gentamicin, ciprofloxacin and cefotaxime. We found participants predicted different numbers of AMR-associated genes and different gene variants from the same clinical samples. The quality of the sequence data, choice of bioinformatic pipeline and interpretation of the results all contributed to discordance between participants. Although much of the inaccurate gene variant annotation did not affect genotypic resistance predictions, we observed low specificity when compared to phenotypic AST results, but this improved in samples with higher read depths. Had the results been used to predict AST and guide treatment, a different antibiotic would have been recommended for each isolate by at least one participant. 
These challenges, at the final analytical stage of using WGS to predict AMR, suggest the need for refinements when using this technology in clinical settings. Comprehensive public resistance sequence databases, full recommendations on sequence data quality and standardization in the comparisons between genotype and resistance phenotypes will all play a fundamental role in the successful implementation of AST prediction using WGS in clinical microbiology laboratories.

Original URL: https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000335

👍 James Fellows Yates, Pete Heintzman

James Fellows Yates (james_fellows_yates@eva.mpg.de)

2020-09-23 15:05:51

I also saw this recently: https://www.biorxiv.org/content/10.1101/2020.06.24.167353v1

bioRxiv

Strengthening The Organizing and Reporting of Microbiome Studies (STORMS)

Background: Human microbiome research is interdisciplinary, making concise organizing and reporting of results across the different styles of epidemiology, biology, bioinformatics, and statistics a challenge. Commonly used reporting guidelines for observational or genetic studies lack key aspects specific to microbiome studies. Methods: A multidisciplinary group of microbiome researchers reviewed elements of available reporting guidelines for observational and genetic studies, and adapted these for application to human microbiome studies. New reporting elements were developed for laboratory, bioinformatic, and statistical analysis specific to microbiome studies, and other parts of these checklists were streamlined to keep reporting manageable. Results: STORMS is an 18-item checklist for reporting on human microbiome studies, organized into six sections covering all sections of a scientific publication, presented as a table with space for author-provided details and intended for inclusion in supplementary materials. Conclusions: STORMS provides guidance for authors and standardization for interdisciplinary microbiome studies, facilitating complete and concise reporting. Availability: STORMS is downloadable as a versioned spreadsheet from http://storms.waldronlab.io. Competing Interest Statement: The authors have declared no competing interest.

Original URL: https://www.biorxiv.org/content/10.1101/2020.06.24.167353v1

👍 Jessica Hider

James Fellows Yates (james_fellows_yates@eva.mpg.de)

2020-09-29 21:08:26

As a parallel thing, I was just pondering about defining standards, given that Warinner 2017 sort of already does that (although imo is difficult to read).

Maybe as well as defining reporting standards, what about writing a paper for non-geneticists on how to review ancient metagenomics studies? Something like the PLoS comp. bio. 10 point checklist thing?

😀 Jessica Hider

Jessica Hider (hiderj@mcmaster.ca)

2020-10-12 20:26:13

*Thread Reply:* I think that's a great idea James, I've been wanting to do something like that for a while now, as it's always hard for non aDNAers or non geneticists to get all the intense concepts/formulae/etc. because it requires so much background knowledge

:mask_parrot: James Fellows Yates

James Fellows Yates (james_fellows_yates@eva.mpg.de)

2020-10-12 20:28:01

*Thread Reply:* Well then give me a couple of months to actually finish my thesis and we can start ramping this up. Unless you wanna spearhead organising this one?

James Fellows Yates (james_fellows_yates@eva.mpg.de)

2020-09-29 21:09:33

But aimed at people like archaeologists and palaeontologists? It would be slimmer and more accessible (i.e. less detailed than the Warinner one, which is more for people who actually want to work in the field)

James Fellows Yates (james_fellows_yates@eva.mpg.de)

2020-09-29 21:10:06

This would be relevant for @Clio Der Sarkissian's discussion point about 'dissemination'

Clio Der Sarkissian (clio.dersarkissian@gmail.com)

2020-09-29 21:24:33

@Clio Der Sarkissian has joined the channel

Meriam van Os (meriam.vanos@postgrad.otago.ac.nz)

2021-10-05 03:25:58

Hi all, I'm using the authenticity criteria of Warinner et al. (2017) to analyse samples for the presence of TB. For the second one (percent identity distributions), does anyone know if there is an efficient way to create a table and/or graph showing the number of mapped reads according to sequence identity (something like c-e in the attached figure)? I'm using BWA mapping and MALT, so maybe a script/tool that works on bam/sam files?

image.png

Nico Rascovan (nicorasco@gmail.com)

2021-10-05 09:41:58

What if you use the edit distance for that? (i.e., you calculate how many changes are needed in each read to be the same as reference) and then you build a histogram with that info

Nico Rascovan (nicorasco@gmail.com)

2021-10-05 09:42:01

https://gigabaseorgigabyte.wordpress.com/2017/04/14/getting-the-edit-distance-from-a-bam-alignment-a-journey/

Gigabase or gigabyte

wdecoster (https://gigabaseorgigabyte.wordpress.com/author/wouterdecoster/)

Getting the edit distance from a bam alignment: a journey

A relevant parameter when looking at sequencing and alignment quality is the edit distance to the reference genome, roughly equivalent to the accuracy of the reads. (Roughly, because we ignore true variants between the sample and reference). The edit distance or Levenshtein distance can be defined as the number of single letter (nucleotide) changes that have to be made to one string (read) for it to be equal to another string (reference genome). Since the error profile of Nanopore sequencing is dominated by insertions and deletions, the edit distance isn’t just the number of single nucleotide mismatches. This technology results in a wide range of read lengths, therefore I scale the edit distance to the length of the aligned fragment. Longer reads shouldn’t get penalised more than shorter reads. It’s important to take the alignment length and not the read length since the ends of reads can be clipped substantially by the aligner, sometimes tens of kb.  The experiments below were done using MinION data from the human genome in a bottle sample NA12878. My scripts are in Python, so I’ll add some code snippets to this post. For parsing bam files and extracting the relevant bits of information, I use pysam which is pretty convenient and well-documented. Those snippets are not the full script, but just a minimal example to get the job done. A full example of the code is at the bottom of the post. Getting this information from an alignment done using bwa mem  is trivial since bwa sets the NM bam tag which is an integer with precisely what I need: the edit distance. A function to extract the NM tag using a list comprehension is added below. 
```python
import pysam


def extractNMFromBam(bam):
    '''
    loop over a bam file and get the edit distance to the reference genome
    stored in the NM tag
    scale by aligned read length
    '''
    samfile = pysam.AlignmentFile(bam, "rb")
    return [read.get_tag("NM") / read.query_alignment_length for read in samfile.fetch()]
```

For aligners such as GraphMap this is less trivial since the NM tag is not set. However, another tag comes to the rescue: the MD tag, which stores a string containing matching numbers of nucleotides and non-matching nucleotides. Interesting information about the MD tag can be found here. I found it a quite tough representation of the read to wrap my head around, which resulted in some wrong interpretations. My first naive implementation counted the number of matching nucleotides (the integers in the MD string) and subtracted those matches from the total alignment length to get the number of mismatched nucleotides. The list comprehension is quite long, but essentially I split the MD string on all occurrences of A, C, T, G or ^ (indicating a deletion) and sum the obtained integers. This sum is subtracted from the aligned read length and divided by the same.

```python
import pysam
import re


def extractMDFromBam(bam):
    '''
    loop over a bam file and get the edit distance to the reference genome
    mismatches are stored in the MD tag
    scale by aligned read length
    '''
    samfile = pysam.AlignmentFile(bam, "rb")
    return [
        (read.query_alignment_length
         - sum([int(item) for item in re.split('[ACTG^]', read.get_tag("MD")) if not item == '']))
        / read.query_alignment_length
        for read in samfile.fetch()
    ]
```

As a sanity check I looped over a bwa mem aligned bam file to extract both the scaled NM tag and scaled MD-derived edit distance and plot those against each other, for which you can see the code below followed by the images.
Between gathering the information from the bam file in lists and plotting the data I convert my lists to a numpy array and create a pandas DataFrame (see bottom of post). Since the first plot is rather overcrowded I also added a kernel density estimation to get a better idea of the density of the dots.

```python
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats


def makePlot(datadf):
    # stat_func= and size= are kept as written in the 2017 post; later seaborn
    # releases dropped stat_func and renamed size to height.
    plot = sns.jointplot(
        x='editDistancesNM',
        y='editDistancesMD',
        data=datadf,
        kind="scatter",
        color="#4CB391",
        stat_func=stats.pearsonr,
        space=0,
        joint_kws={"s": 1},
        size=10)
    plot.savefig('EditDistancesCompared_scatter.png', format='png', dpi=1000)
    plot = sns.jointplot(
        x='editDistancesNM',
        y='editDistancesMD',
        data=datadf,
        kind="kde",
        color="#4CB391",
        stat_func=stats.pearsonr,
        space=0,
        size=10)
    plot.savefig('EditDistancesCompared_kde.png', format='png', dpi=1000)
```

The Pearson correlation coefficient is not too bad, but obviously, we want to get exactly the same as the NM tag. It’s clear that my implementation of getting the edit distance from the MD tag returns an underestimation of the edit distance. After thinking a while I decided to create a question on biostars where Santosh Anand joined my quest. He suggested counting the mismatches in the MD string, rather than the matches, which I implemented below. So this time I split the MD string on all numbers and sum the length of the mismatched nucleotides. The plots obtained by this approach are shown below the code.
```python
import pysam
import re


def extractMDFromBam(bam):
    '''
    loop over a bam file and get the edit distance to the reference genome
    mismatches are stored in the MD tag
    scale by aligned read length
    '''
    samfile = pysam.AlignmentFile(bam, "rb")
    return [
        sum([len(item) for item in re.split('[0-9^]', read.get_tag("MD"))])
        / read.query_alignment_length
        for read in samfile.fetch()
    ]
```

That’s an improvement, but we’re not yet there. At this point Santosh asked for a few examples of reads which showed a clear deviation in NM and MD-derived edit distance. I wrote a function to do just that, which prints out, in this order: the NM tag; the MD tag derived edit distance; the MD tag; the CIGAR string.

```python
import sys
import re
import pysam


def extractDisagreement(bam):
    samfile = pysam.AlignmentFile(bam, "rb")
    for read in samfile.fetch():
        NMdef = read.get_tag("NM") / read.query_alignment_length
        MDdef = sum([len(item) for item in re.split('[0-9^]', read.get_tag("MD"))]) / read.query_alignment_length
        if NMdef - MDdef > 0.2:
            print('\t'.join(
                [
                    str(read.get_tag("NM")),
                    str(sum([len(item) for item in re.split('[0-9^]', read.get_tag("MD"))])),
                    read.get_tag("MD"),
                    read.cigarstring,
                ])
            )
```

This is an example of a read showing a clear mismatch between the NM tag and the MD-derived edit distance.

NM tag: 73
MD-derived edit distance: 34
MD tag: 21G1C3A1T1A2G4A8T2G4A2G0G0G19^G2G10T6C1A2A0G0G0A5T6^GA5A0G0C0C1G3A5A6C1C2
CIGAR string: 1684H20M3I7M3I10M1I5M2I1M1I1M3I6M1I4M1I13M2I5M4I3M7I6M1D6M2I1M2I26M1I3M1I2M1I3M2D16M1I6M3I10M10942H

The hero on my quest then found that we were still missing the insertions, which are not present in the MD tag but can only be counted by parsing the CIGAR string. I copied the image below from his post. So my successful implementation also parses the CIGAR string to get the insertions and add those to the mismatches.
In my final code piece below I show the full code of my eva…

Original URL: https://gigabaseorgigabyte.wordpress.com/2017/04/14/getting-the-edit-distance-from-a-bam-alignment-a-journey/
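
The preview above is cut off before the post's final code, so the following is only a minimal sketch of the idea it describes (MD-tag mismatches and deletions plus CIGAR insertions), not the post author's implementation. The function name `edit_distance_from_md_and_cigar` is made up for illustration, and the sketch works on plain tag strings with the standard library only, so it does not need pysam. It is checked against the example read quoted in the preview, whose NM tag is 73.

```python
import re


def edit_distance_from_md_and_cigar(md, cigar):
    """Rebuild the edit distance from an MD tag and a CIGAR string.

    Letters in the MD tag are mismatched or deleted reference bases;
    inserted bases only appear in the CIGAR string as I operations.
    (Hypothetical helper, written for illustration.)
    """
    mismatches_and_deletions = len(re.findall(r"[A-Za-z]", md))
    insertions = sum(int(n) for n in re.findall(r"(\d+)I", cigar))
    return mismatches_and_deletions + insertions


# The example read from the preview above (NM tag = 73).
md = ("21G1C3A1T1A2G4A8T2G4A2G0G0G19^G2G10T6C1A2A0G0G0A5T6"
      "^GA5A0G0C0C1G3A5A6C1C2")
cigar = ("1684H20M3I7M3I10M1I5M2I1M1I1M3I6M1I4M1I13M2I5M4I3M7I6M1D6M2I"
         "1M2I26M1I3M1I2M1I3M2D16M1I6M3I10M10942H")

print(edit_distance_from_md_and_cigar(md, cigar))  # prints 73
```

For that read the MD tag contributes 34 and the CIGAR insertions contribute 39, which together match the NM value of 73.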

Nico Rascovan (nicorasco@gmail.com)

2021-10-05 09:48:32

and if you divide the number of variants in the read by the trimmed read length you can easily get the %
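
A rough sketch of that calculation, assuming SAM text input with the NM tag set (as bwa does): percent identity is taken here as 100 * (aligned length - NM) / aligned length, where the aligned length counts the read bases inside the alignment (CIGAR M, =, X and I). That definition, the function names and the 1 % bin width are all assumptions made for illustration, not something taken from Warinner et al. (2017) or from any tool mentioned in this thread.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_read_length(cigar):
    """Read bases inside the alignment, i.e. excluding soft/hard clips."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MI=X")


def percent_identity(nm, cigar):
    """Percent identity from the NM edit distance (assumed definition)."""
    length = aligned_read_length(cigar)
    return 100.0 * (length - nm) / length


def identity_histogram(sam_lines, bin_width=1):
    """Count mapped reads per percent-identity bin."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, cigar = int(fields[1]), fields[5]
        if flag & 4 or cigar == "*":  # skip unmapped reads
            continue
        nm = next((int(f[5:]) for f in fields[11:] if f.startswith("NM:i:")), None)
        if nm is None:  # skip reads without an NM tag
            continue
        pid = percent_identity(nm, cigar)
        counts[int(pid // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))


# Made-up SAM records, only to show the input and output shapes.
example = [
    "@HD\tVN:1.6",
    "r1\t0\tref\t1\t60\t50M\t*\t0\t0\t*\t*\tNM:i:0",
    "r2\t0\tref\t1\t60\t50M\t*\t0\t0\t*\t*\tNM:i:1",
    "r3\t0\tref\t1\t60\t5S40M5S\t*\t0\t0\t*\t*\tNM:i:2",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
print(identity_histogram(example))  # prints {95: 1, 98: 1, 100: 1}
```

On real data the same function could be fed the lines of `samtools view -h` output read from standard input, and the resulting table plotted as a bar chart.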

aidanva (aida.andrades@gmail.com)

2021-10-05 10:08:44

You can also get the percent identity from MALT, maybe in a more convoluted way. But essentially you will open your results in MEGAN and extract the info for all the reads in a specific node. This should contain the number of mismatches to the reference, which can be used to calculate the percentage identity (or edit distance for that matter?). I then plot the identities in a histogram in R. If I am not mistaken @James Fellows Yates had a script for calculating the identities and even doing the plotting. I should also have it and I could try to fish it out from my old files.

James Fellows Yates (james_fellows_yates@eva.mpg.de)

2021-10-05 11:06:25

*Thread Reply:* Did I!?!

aidanva (aida.andrades@gmail.com)

2021-10-05 11:06:47

*Thread Reply:* or did I create the script?? I can't remember 😅

James Fellows Yates (james_fellows_yates@eva.mpg.de)

2021-10-05 11:07:52

*Thread Reply:* From MEGAN? 🤔

James Fellows Yates (james_fellows_yates@eva.mpg.de)

2021-10-05 11:08:13

*Thread Reply:* Must be ANCIENT

James Fellows Yates (james_fellows_yates@eva.mpg.de)

2021-10-05 11:43:27

*Thread Reply:* If I did, I don't anymore. Sorry

James Fellows Yates (james_fellows_yates@eva.mpg.de)

2021-10-05 11:06:13

@Nico Rascovan is right: percent identity is basically edit distance, and you'll probably find more tools that calculate the latter (MaltExtract can do this for you, if you still have your RMA6 files).

👍 aidanva

Meriam van Os (meriam.vanos@postgrad.otago.ac.nz)

2021-10-05 21:22:14

Cheers @Nico Rascovan @aidanva and @James Fellows Yates! Will try these! 😁

James Fellows Yates (james_fellows_yates@eva.mpg.de)

2021-10-06 07:56:33

*Thread Reply:* Not meant to be self-promotion, but if you do go the RMA6 -> maltExtract route, and have lots of samples I made a small R shiny app that allows you to visualise the tabular/PDF results much faster: https://github.com/jfy133/MEx-IPA

GitHub

GitHub - jfy133/MEx-IPA: Interactive results viewer for maltExtract

Interactive results viewer for maltExtract . Contribute to jfy133/MEx-IPA development by creating an account on GitHub.

Original URL: https://github.com/jfy133/MEx-IPA

🙌 Nico Rascovan

Meriam van Os (meriam.vanos@postgrad.otago.ac.nz)

2021-10-06 21:32:22

*Thread Reply:* Nice, that sounds very useful, thanks! 😁
