James Fellows Yates (james_fellows_yates@eva.mpg.de)
2020-09-22 10:18:07

@James Fellows Yates has joined the channel

aidanva (aida.andrades@gmail.com)
2020-09-22 10:19:41

@aidanva has joined the channel

Maria Spyrou (spyrou@shh.mpg.de)
2020-09-22 10:19:41

@Maria Spyrou has joined the channel

Nico Rascovan (nicorasco@gmail.com)
2020-09-22 10:19:41

@Nico Rascovan has joined the channel

Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2020-09-22 10:19:41

@Nikolay Oskolkov has joined the channel

Åshild (Ash) (ashild.v@gmail.com)
2020-09-22 10:19:42

@Åshild (Ash) has joined the channel

Karen Giffin (giffin@shh.mpg.de)
2020-09-22 10:19:42

@Karen Giffin has joined the channel

Kun Huang (kun.huang@unitn.it)
2020-09-22 10:19:42

@Kun Huang has joined the channel

Lena G (lena.granehall@gmail.com)
2020-09-22 10:19:42

@Lena G has joined the channel

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2020-09-22 10:21:06

@channel I've just made a channel for people working on ancient genomics (e.g. single genomes of pathogens, commensals etc.): #microbial-genomics. Please join if you work in this area!

ivelsko (velsko@shh.mpg.de)
2020-09-22 10:21:14

@ivelsko has joined the channel

Clio Der Sarkissian (clio.dersarkissian@gmail.com)
2020-09-22 10:21:31

@Clio Der Sarkissian has joined the channel

Meriam Guellil (meriam.guellil.ac@gmail.com)
2020-09-22 10:22:15

@Meriam Guellil has joined the channel

Miriam Bravo (bravolomiriam@gmail.com)
2020-09-22 10:24:33

@Miriam Bravo has joined the channel

Antonio Fernandez-Guerra (antonio@metagenomics.eu)
2020-09-22 10:25:36

@Antonio Fernandez-Guerra has joined the channel

Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2020-09-22 10:30:54

Hi guys, yesterday, I was following the genotyping discussion with a lot of interest. We talked e.g. about how to distinguish between true genetic variants and the ones due to the damage, and how this could bias phylogenetic analysis. Would simple masking / clipping the ends of the reads solve the problem? You will still get enough variants (~100 000) for building a phylogeny, won't you? You really do not need more because BEAST and other phylogeny tools do not really scale for millions of genetic variants, also because of the curse of dimensionality problem, i.e. the more variants you use the more biased variance estimates you get from the Maximum Likelihood principle if you have a limited number of samples (the curse of dimensionality problem in mathematics of high-dimensional data). I wonder why people prefer to compute probabilistic genotypes instead of concentrating on most reliable ones by ignoring the ends of the reads? Is this because in the latter case you get too few variants and you want to rescue as many variants as possible?
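To make the masking idea above concrete, here is a minimal sketch of hard-masking the read ends before variant calling. The function name `mask_read_ends` and the default of 5 masked bases are assumptions chosen for illustration, not the behaviour of any specific tool; how many bases to mask depends on how far the damage extends into the molecule.

```python
def mask_read_ends(seq: str, n_mask: int = 5) -> str:
    """Replace the first and last n_mask bases of a read with 'N'.

    Masked bases can then be ignored by a downstream caller, so that
    deamination-prone terminal positions do not contribute to genotypes.
    """
    if n_mask <= 0:
        return seq
    if len(seq) <= 2 * n_mask:
        # Very short reads (common in aDNA) would be masked entirely.
        return "N" * len(seq)
    return "N" * n_mask + seq[n_mask:-n_mask] + "N" * n_mask


def count_unmasked(seq: str) -> int:
    """Number of bases that survive masking and can still support a variant."""
    return sum(base != "N" for base in seq)


read = "TTGACCGTAGGCTAGCTAGGATCCA"  # toy 25 bp read
masked = mask_read_ends(read, n_mask=3)
print(masked, count_unmasked(masked))
```

The short-read branch shows the cost of this strategy: the shorter the fragments, the larger the share of each read that is thrown away.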

aidanva (aida.andrades@gmail.com)
2020-09-22 10:38:44

I think it is a good idea, but its potential application will depend on: coverage of your sample and how far into the molecule the damage goes (particularly for non-UDG treated data, sometimes you may lose 7-8 bases). I can see that as a good strategy for rather younger samples where you observe very little damage. I am not sure how this will work on old samples. Also, the amount of variant sites will depend on the organism you are interested in. What I am trying to say is that it is a bit hard to predict the effect and it is worth testing

๐Ÿ‘ Nikolay Oskolkov
Mike Martin (sameoldmike@gmail.com)
2020-09-22 10:41:41

@Mike Martin has joined the channel

Gunnar Neumann (gunnar_neumann@eva.mpg.de)
2020-09-22 10:42:43

@Gunnar Neumann has joined the channel

Nico Rascovan (nicorasco@gmail.com)
2020-09-22 10:46:40

Hey Nikolay! Nice thoughts. In my experience, for the genomes of microbial strains within a species you have at least one order of magnitude fewer SNVs than what you said (like ~10 000 or even ~1 000), so there are not that many. On the other hand, sure, excluding the variants at the ends of reads (e.g., 10-20bp from each side) will solve part of the problem, but you'll still have a lot of transitions (most SNPs are transitions) for which you don't know whether they are true variants or not. Then you can filter out positions where at least a certain % or # of reads do not support your variant; that would exclude some potential deamination or environmental reads, but you may also lose some true variants there as well. In my case, for a particular ancient microbial genome with very low coverage (high coverage, >10X, helps a lot in resolving these situations), I just do a manual curation to decide what to do, and I play a bit with different genotypes to see how much they affect the results (phylogenies and molecular clocks)
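The support-based filter described above could be sketched roughly as follows. The thresholds (`min_depth=3`, `min_support=0.9`) and the function name `evaluate_site` are hypothetical, picked only for illustration; the sketch flags transitions instead of discarding them, since those are the calls that would still need manual curation.

```python
from collections import Counter

# Substitution pairs (reference, observed) that are transitions.
TRANSITIONS = {("C", "T"), ("T", "C"), ("G", "A"), ("A", "G")}


def evaluate_site(ref_base, pileup_bases, min_depth=3, min_support=0.9):
    """Return (called_allele_or_None, flags) for one reference position.

    pileup_bases is a string with one base per overlapping read.
    """
    counts = Counter(b for b in pileup_bases.upper() if b in "ACGT")
    depth = sum(counts.values())
    if depth < min_depth:
        return None, ["low_depth"]
    allele, n_supporting = counts.most_common(1)[0]
    if n_supporting / depth < min_support:
        # Too many reads disagree: possible damage or environmental reads.
        return None, ["mixed_support"]
    flags = []
    if allele != ref_base and (ref_base, allele) in TRANSITIONS:
        # Well supported, but still compatible with deamination.
        flags.append("transition")
    return allele, flags


print(evaluate_site("C", "TTTTT"))
print(evaluate_site("C", "TTTCC"))
print(evaluate_site("A", "CC"))
```

At low coverage most sites end up as `low_depth`, which is the situation where manual curation takes over.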

๐Ÿ‘ Nikolay Oskolkov
Pooja Swali (swalipooja@gmail.com)
2020-09-22 10:50:21

@Pooja Swali has joined the channel

Freddi Scheib (freddischeib@gmail.com)
2020-09-22 10:58:33

@Freddi Scheib has joined the channel

Maria Spyrou (spyrou@shh.mpg.de)
2020-09-22 11:08:31

A student in our lab recently developed a tool for filtering SNPs, where one can use different parameters for evaluating the regions around variants (e.g. coverage and heterozygosity). This has been quite helpful for evaluation, but I would say that quite some manual work is still required (e.g. read blasting), and it currently works more efficiently for UDG data.

๐Ÿ‘ aidanva, Nikolay Oskolkov
Maria Spyrou (spyrou@shh.mpg.de)
2020-09-22 11:09:34

https://github.com/andreasKroepelin/SNP_Evaluation

aidanva (aida.andrades@gmail.com)
2020-09-22 11:10:33

Not really optimised for loads of ancient genomes; as @Maria Spyrou said, loads of manual work is still required. But hopefully, we can push people to make it more scalable 😉

Maria Spyrou (spyrou@shh.mpg.de)
2020-09-22 11:10:47

quite thoroughly tested in Marcel’s recent paper: https://www.pnas.org/content/pnas/116/25/12363.full.pdf

Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2020-09-22 11:31:01

I see, thanks a lot @aidanva and @Nico Rascovan for your feedback. I understand your concerns better now

Katerina Guschanski (katerina.guschanski@ebc.uu.se)
2020-09-22 11:39:02

@Katerina Guschanski has joined the channel

Åshild (Ash) (ashild.v@gmail.com)
2020-09-22 11:45:36

@Maria Spyrou One downside to that tool is that there is no proper documentation released yet. That makes it difficult for people who are new to it to use. There are only very, very limited instructions on the tool’s help page when you open it.

aidanva (aida.andrades@gmail.com)
2020-09-22 11:46:05

I agree

Maria Spyrou (spyrou@shh.mpg.de)
2020-09-22 11:57:50

Indeed, but I think it is still worth a try for people that do not want to visually inspect hundreds of SNPs in multiple samples. Happy to help anyone that might have questions.

Lucy van Dorp (lucy.dorp.12@ucl.ac.uk)
2020-09-22 11:58:39

@Lucy van Dorp has joined the channel

Maria Spyrou (spyrou@shh.mpg.de)
2020-09-22 11:58:45

Do people know of other tools used for qualitative evaluation of SNPs?

Nico Rascovan (nicorasco@gmail.com)
2020-09-22 12:04:15

other than our sharp eyes?

Marcel Keller (marcel.keller@ut.ee)
2020-09-22 12:04:35

@Marcel Keller has joined the channel

Nico Rascovan (nicorasco@gmail.com)
2020-09-22 12:04:46

😉

Maria Spyrou (spyrou@shh.mpg.de)
2020-09-22 12:04:58

that’s gotten soooo tired by now..

Åshild (Ash) (ashild.v@gmail.com)
2020-09-22 12:05:42

@Maria Spyrou @Marcel Keller Maybe you two, as experienced users, could write a tutorial/some more explanation of the parameter options for the SNPEval tool for others to learn from? Makes it much more accessible to a wider audience 😉

๐Ÿ‘ aidanva, Emrah Kฤฑrdรถk
Maria Spyrou (spyrou@shh.mpg.de)
2020-09-22 12:14:53

I agree, would be good to include some basic usage guidelines.

Marcel Keller (marcel.keller@ut.ee)
2020-09-22 12:19:58

Yes, we could do that. @Maria Spyrou do you know if Andreas has been working on it since then, i.e. will there be a newer version available?

Maria Spyrou (spyrou@shh.mpg.de)
2020-09-22 12:24:55

As far as I know he is not, I am not aware of a newer version

Marcel Keller (marcel.keller@ut.ee)
2020-09-22 12:26:09

okay

aidanva (aida.andrades@gmail.com)
2020-09-22 13:01:04

he is not working on it anymore. There has been discussion about reimplementing it but 🤷

Betsy Nelson (nelson@shh.mpg.de)
2020-09-22 19:46:12

@Betsy Nelson has joined the channel

โค๏ธ Betsy Nelson
Kelly Blevins (blevinske1@gmail.com)
2020-09-22 19:47:16

@Kelly Blevins has joined the channel

Meriam van Os (meriam.vanos@postgrad.otago.ac.nz)
2022-03-29 07:25:52

Hey team, I was hoping to pick your brain on the following problem I'm encountering with MALT/MEGAN...

We built a MALT database with about 6,000 (mainly) bacterial genomes with the most recent mapping file. I'm now screening coprolite data and encountered the following. When I open the rma file directly in MEGAN I get different results to when I open the sam file (also paired with the same, new mapping file).

For the same sample, the rma file gives me quite a lot of reads for Shigella boydii (140,000). This seems to be a real signal (according to Warinner's authenticity criteria) when I filter the reads with maltExtract/HOPS. Then when I open the sam file (generated in the same MALT run) I get ~85,000 reads for Enterococcus faecium, and ~105,000 reads for Citrobacter freundii, and much fewer Shigella boydii reads (~6500, which is actually about the same number after filtering of the rma files with maltExtract...).

All these species make sense in this context, but I'm so confused that they don't show up in both methods, so I don't know which results to trust? I also noticed that if I use the sam file, about double the amount of reads actually get assigned. I think the parameters might be slightly different in MEGAN, but surely that shouldn't make such a drastic difference? Any ideas what else it could be?

Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2022-03-30 05:07:37

Hi @Meriam van Os, the SAM-file from MALT / HOPS unfortunately does not have LCA in it, in contrast to rma6 MEGAN files; that is why you see such a difference. In other words, in MALT / HOPS SAM-files, all reads assigned to a genus / family level (multi-mappers in alignment terminology) will be randomly spread across species from that genus / family in your database.
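As an illustration of the difference described above, here is a minimal sketch of a naive lowest-common-ancestor (LCA) assignment next to a random per-read species choice. This is not MALT's or MEGAN's actual implementation (MEGAN's LCA has additional parameters); the toy `PARENT` taxonomy, which skips intermediate ranks, and the function names are assumptions made up for the example.

```python
import random

# Toy taxonomy (child -> parent); intermediate ranks are left out on purpose.
PARENT = {
    "Shigella boydii": "Shigella",
    "Shigella flexneri": "Shigella",
    "Citrobacter freundii": "Citrobacter",
    "Shigella": "Enterobacteriaceae",
    "Citrobacter": "Enterobacteriaceae",
    "Enterobacteriaceae": "root",
}


def lineage(taxon):
    """Path from a taxon up to the root, starting with the taxon itself."""
    path = [taxon]
    while taxon in PARENT:
        taxon = PARENT[taxon]
        path.append(taxon)
    return path


def naive_lca(hit_taxa):
    """Lowest node shared by the lineages of all taxa a read aligns to."""
    paths = [lineage(t) for t in hit_taxa]
    shared = set(paths[0]).intersection(*(set(p) for p in paths[1:]))
    for node in paths[0]:  # ordered from leaf to root
        if node in shared:
            return node
    raise ValueError("taxa do not share a root")


def random_assignment(hit_taxa, rng):
    """What a plain per-read alignment record amounts to: one reference is picked."""
    return rng.choice(hit_taxa)


hits = ["Shigella boydii", "Citrobacter freundii"]
print(naive_lca(hits))                             # Enterobacteriaceae
print(random_assignment(hits, random.Random(0)))   # one of the two species
```

Counting species from the second function inflates species-level read counts, whereas the first pushes ambiguous reads up the tree, which would be consistent with the SAM file showing roughly twice as many species-level assignments.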

In this sense, the SAM-file is not different from a SAM-file produced by e.g. bowtie2 / BWA. Moreover, the SAM-file from MALT / HOPS is hard to use for any filtering because it does not have a proper MAPQ: it sets MAPQ=255 for all reads (please have a look). So MALT / HOPS does not actually deliver a proper SAM, unfortunately, which is somehow hidden from the community 😞 The reason why you would like to have a proper SAM-file is that you would really want to compute a breadth / evenness of coverage metric for your microbe of interest, but this is not straightforward with the SAM output from MALT / HOPS. The rma6 MEGAN file, in contrast, has LCA but cannot be used for breadth / evenness of coverage computation, so you will still need to run e.g. bowtie2 / BWA on top of the MALT / HOPS alignments (but then why run MALT / HOPS at all?). Unfortunately, rma6 is not a good format for performing any bioinformatic analysis (with e.g. samtools); it can only be opened by other MALT / MEGAN tools such as MaltExtract, which is a huge limitation.

Taking into account how much effort it takes to run MALT / HOPS (you have only 6000 bacterial genomes in your database, but you perhaps would really like to increase this number to have a less bias in your analysis, and then you run into RAM and time issues), we were very disappointed when we discovered that SAM-files delivered by MALT / HOPS were pretty useless. Bowtie2 is so much faster, so much less RAM demanding and can screen against the full NT / NR database, no need to cherry pick (6000 or any other limited number of) genomes. RMA6-file from MALT / HOPS has LCA but it can only give an information about read numbers assigned to each microbe. Exactly like e.g. Kraken, but Kraken can do it after a few seconds while MALT... hmmm, after a few days. However, just the number of reads is unfortunately not a good criterion of microbe detection without a breadth / evenness of coverage metric. To compute the latter you still need a proper SAM. An alternative is the KrakenUniq tool that delivers both depth and breadth of coverage and has an LCA. But then why to run MALT / HOPS at all?

Taking into account all these shortcomings, I am actually wondering why people still run MALT / HOPS? Is that only a "tradition"? I believe this issue should be more widely discussed in the community, so I would be curious to know what @James Fellows Yates, @Alex Hübner, @aidanva, @Betsy Nelson, @Maxime Borry, @Hannes, @Gunnar Neumann, @Christina Warinner, @Felix Key and other guys think. When using MALT / HOPS how do you guys prove the presence of a microbe without breadth / evenness of coverage?

Meriam van Os (meriam.vanos@postgrad.otago.ac.nz)
2022-03-30 08:11:12

*Thread Reply:* Thanks @Nikolay Oskolkov for your answer! So if I understand correctly, the rma6 files are a more accurate representation of what is in my sample, as this does use the LCA algorithm? It does seem though that when I input my sam files into MEGAN, it also applies the LCA algorithm, as there is a tab with the parameters you can choose? At least when I click on "Import from BLAST".

I did also find that the MALT sam files aren't very useful for any other analysis, for example I tried samtools on them, but it didn't work. What I have been doing occasionally, is that I extract the reads I'm interested in with MEGAN, and align those with the respective species in Geneious. Does also take some extra work though, as it requires to input the fastq files into MEGAN. Or I map it subsequently to the respective species with BWA as well (same as you).

For your question why I'm using MALT. In the first instance, to be honest, I started using it as it seemed to be the kind of "standard" (and I'm still a newbie really, so MALT was an obvious choice). Gave that a go, and I have been finding it useful for screening for pathogenic/microbiome bacteria, but I use the MALT/MaltExtract results in combination with @James Fellows Yates' MEx-IPA tool (https://github.com/jfy133/MEx-IPA). It generates nice reports from the rma files with the sequence identity, read distribution, damage profiles and more, which I use to authenticate the presence of species. I hope this is an okay method? If not, someone should really tell me, because this is what I've been doing for my Masters, and I'm hoping to move onto a PhD, so I want to be doing it right! 😬

Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2022-03-30 08:17:01

*Thread Reply:* Mex-IPA seems like a nice tool, I did not know about it, thanks for pointing me! I wonder, how does it compute the breadth of coverage, i.e. what sam-file does it use? The native MALT, or the tool runs an extra alignment? @James Fellows Yates?

Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2022-03-30 08:22:28

*Thread Reply:* @Meriam van Os to answer your question, any tool (there are not so many though) that gives breadth of coverage information in addition to read numbers is good. To my knowledge most of the tools (including MALT / HOPS) do not output breadth of coverage, so there is quite a lot of manual work you need to do afterwards. Unfortunately, read numbers alone are not enough to report a microbe detection. There are plenty of examples of false discoveries where thousands of reads aligned to just a few conserved regions of a microbial ref genome

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-30 08:17:16

@James Fellows Yates has joined the channel

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-30 08:40:59

To summarise:

1) Nikolay is correct: SAM cannot hold LCA information, so the 'extra' reads you see in your SAM file are actually assigned higher up the LCA tree. In a way this is more accurate in terms of reads 'unique' to your specific species, which may be useful in some cases
2) Yes, MALT SAM files are shit, we never used them. Yes, RMA6 files are shit because you cannot do anything else with them. Unfortunately MALT was more of a side project from the developer (long story), and HOPS was not developed with good software dev. practises in mind, so there was no effort to make the RMA6 parser a library -> but this is irrelevant to this conversation
3) From the thread: the breadth/depth coverage information is still calculable from the RMA6 information, which is why HOPS/MEx-IPA can report it.
4) The only real benefit of using MALT is that it gives you both alignment (for damage calculation) AND LCA at the same time. KrakenUniq wouldn't give you the alignment information to quickly evaluate if it's worth pursuing from an aDNA point of view

I wouldn't worry @Meriam van Os about using MALT/MEGAN/HOPS/MEx-IPA - they are still valid tools. Just not scalable or extendable.

And @Nikolay Oskolkov the reason why it's still used because at the time it was developed (2016!), we could still fit in Nt databases, and the people who developed it had enough computing resources to run it. We only ever use it as a screening tool though, not for downstream purposes

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-30 08:41:15

And also partly tradition

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-30 08:43:54

Oh and I should also say it's a bit more sensitive/accurate than standard mappers, as it's more adaptable for aDNA (spaced seeds, some of the parameters)

Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2022-03-30 08:56:46

Very nice summary @James Fellows Yates, thanks! Wait, you are writing "the breadth/depth coverage information is still calculable from the RMA6 information, which is why HOPS/MexIPA can report it". Does HOPS report breadth of coverage? How do you compute breadth of coverage from RMA6?

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-30 08:57:33

At least I think it does, I should check! I've not looked at MEx-IPA for a long time 😆

RMA6 still holds all the alignment information, that's why you can open the alignment viewer in MEGAN and see the exact alignments

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-30 08:58:34
James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-30 08:58:37

Left plot

Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2022-03-30 09:00:48

@James Fellows Yates I am quite sure that MALT / HOPS does not produce breadth of coverage, otherwise I would not complain 🙂 Hmm, the left plot is an average coverage, isn't it? I mean when you say that e.g. 6% of genome is covered with 1X, you do not know how the reads are distributed across the ref genome. What you do is compute N_reads * L_read / L_ref_genome

Meriam van Os (meriam.vanos@postgrad.otago.ac.nz)
2022-03-30 09:30:13

*Thread Reply:* Doesn't the read distribution present that? That's given as a number in the table at the top

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-30 09:01:32

No, but that's the value of breadth coverage no?

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-30 09:01:40

Percentage of genome covered, like you said?

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-30 09:03:01

Evenness of coverage is something else, and that is not something reported indeed (and something I've wanted for a while)

Antonio Fernandez-Guerra (antonio@metagenomics.eu)
2022-03-30 09:03:39

You can use https://github.com/genomewalker/bam-filter to get different types of stats related to the BAM files, for example the expected breadth or the ratio between observed/expected
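A rough sketch of the observed/expected idea, assuming the common approximation that reads placed uniformly at random give an expected breadth of 1 - exp(-mean depth), with mean depth = N_reads * L_read / L_ref_genome. This is an assumption for illustration and not necessarily the exact formula bam-filter uses; the function names and toy numbers are made up.

```python
import math


def mean_depth(n_reads, read_len, genome_len):
    """Average depth: N_reads * L_read / L_ref_genome."""
    return n_reads * read_len / genome_len


def expected_breadth(n_reads, read_len, genome_len):
    """Expected covered fraction if reads were placed uniformly at random."""
    return 1.0 - math.exp(-mean_depth(n_reads, read_len, genome_len))


def observed_breadth(read_starts, read_len, genome_len):
    """Fraction of positions covered by at least one read (0-based starts)."""
    covered = [False] * genome_len
    for start in read_starts:
        for pos in range(start, min(start + read_len, genome_len)):
            covered[pos] = True
    return sum(covered) / genome_len


genome_len, read_len = 10_000, 50
spread = [i * 800 for i in range(12)]   # 12 reads spread along the genome
piled = list(range(0, 24, 2))           # 12 reads stacked on one region

exp_b = expected_breadth(12, read_len, genome_len)
for name, starts in (("spread", spread), ("piled", piled)):
    obs_b = observed_breadth(starts, read_len, genome_len)
    print(name, obs_b, round(obs_b / exp_b, 2))
```

Both layouts have the same read count and the same mean depth (0.06), but the observed breadth is 0.06 for the spread reads and only 0.0072 for the piled-up ones, so only the ratio to the expectation separates them.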

Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2022-03-30 09:09:08

You compute breadth of coverage with "samtools depth", which reports, for every base of the ref genome, how many reads cover it. The majority of bases have zero coverage, which might result in e.g. 6% of bases covered on average. However, if the reads are spread uniformly, this still implies that the microbe is likely present. Basically, you compute a profile of which genomic locations are covered. This can be done e.g. by computing N_reads_in_window * L_read / L_window in a sliding window

I think, I am indeed referring now to evenness of coverage. I typically use breadth and evenness of coverage as synonyms. But perhaps I am wrong
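The per-window profile described above can be sketched in a few lines. This assumes a single reference sequence and input from `samtools depth -a` (the `-a` flag makes samtools report zero-coverage positions too), i.e. tab-separated lines of reference name, 1-based position and depth. The function names and the choice of non-overlapping windows are assumptions for illustration; the sketch reports the covered fraction per window, not the per-window depth.

```python
def parse_depth(lines):
    """Extract the depth column from `samtools depth -a` style lines."""
    depths = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        _ref, _pos, depth = line.split("\t")
        depths.append(int(depth))
    return depths


def windowed_breadth(depths, n_windows=100):
    """Fraction of covered positions in each of n_windows consecutive windows."""
    n = len(depths)
    if n == 0:
        return []
    n_windows = min(n_windows, n)
    profile = []
    for w in range(n_windows):
        start = w * n // n_windows
        end = (w + 1) * n // n_windows
        window = depths[start:end]
        profile.append(sum(d > 0 for d in window) / len(window))
    return profile


# Toy example: all covered bases sit in the first 10% of a 100 bp reference.
toy = [f"ref\t{pos}\t{5 if pos <= 10 else 0}" for pos in range(1, 101)]
profile = windowed_breadth(parse_depth(toy), n_windows=10)
print(profile)
print(sum(b > 0 for b in profile) / len(profile))  # share of windows with any coverage
```

A profile that is non-zero in only a few windows is the false-positive pattern, while low but roughly uniform values across windows point towards a real, shallowly sequenced genome.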

Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2022-03-30 09:11:06

@Antonio Fernandez-Guerra with a proper BAM / SAM file there is no problem of getting breadth / evenness of coverage information. The problem is that MALT / HOPS does not deliver a proper SAM / BAM file 😞

Antonio Fernandez-Guerra (antonio@metagenomics.eu)
2022-03-30 09:14:23

You can calculate it with anything that provides coordinates. I am doing something similar here with blastx-like output: https://github.com/genomewalker/x-filter

Antonio Fernandez-Guerra (antonio@metagenomics.eu)
2022-03-30 09:15:31

if the RMA6 file contains this info, it is trivial to have a tool that does it (never used MALT though)

Nico Rascovan (nicorasco@gmail.com)
2022-03-30 09:18:19

Very nice thread!

Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2022-03-30 09:18:50

@James Fellows Yates on the left plot you posted, would you say that 6% of bases covered at 1X is a good indication of microbial presence? This sounds very little, but if the reads are spread uniformly, theoretically by sequencing deeper you can get a greater fraction of ref genome covered, i.e. 6% can still imply that a microbe is truly present. What I typically draw (and what I meant by breadth / evenness of coverage) is the third plot first row:

🙌 Meriam van Os
Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2022-03-30 09:32:43

*Thread Reply:* the sample is not deeply sequenced, this is why only 3% of Y.pestis ref genome is covered, but the profile still shows a more or less uniform read distribution. An opposite example would be something like this

Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2022-03-30 09:33:06

*Thread Reply:* the y-axis is the # of reads, and the x-axis is a genomic position. There are quite many reads aligned but they all come from the beginning of ref genome, so highly likely that this is a false-positive hit

Meriam van Os (meriam.vanos@postgrad.otago.ac.nz)
2022-03-30 09:45:39

*Thread Reply:* Yeah, I had that with all my samples that we were screening for TB.. Got quite excited at first by the amount of hits, but all was just mapping in the 16S and 23S regions

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 08:45:04

*Thread Reply:* Very sorry everyone, had a busy day yesterday with meetings and then cut short for childcare

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 08:45:55

*Thread Reply:* > would you say that 6% of bases covered at 1X is a good indication of microbial presence?
Not alone no! Normally 6% would be an indicator you may have a low coverage check, and go check in IGV for evenness of coverage, like your third plot shows (how do you generate that actually?)

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 08:46:53

*Thread Reply:* I think this is a very good point though - to me breadth of coverage is meant to be a very simplistic value representing %, evenness of coverage is by far more useful but more difficult to represent as a single value

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 08:47:07

*Thread Reply:* But this could cause confusion in the field!

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 08:47:37

*Thread Reply:* I wonder if this could be good scope for a little paper! Find various terms/phrases used in the field and make a standardised dictionary? What would you think? 🤔

๐Ÿ‘ Meriam van Os
James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 08:48:33

*Thread Reply:* > Yeah, I had that with all my samples that we were screening for TB.. Got quite excited at first by the amount of hits, but all was just mapping in the 16S and 23S regions
Yes, this is very common. Again, the purpose of MALT in our department ultimately is to find the, let's say, 10s of samples out of 1000s that may have hits to a specific species, and then use mapping or capture to find the 1 real positive

Antonio Fernandez-Guerra (antonio@metagenomics.eu)
2022-03-31 09:05:55

*Thread Reply:* @James Fellows Yates Those concepts are very well defined in the field. I think that a more helpful paper would be one that describes good practices applied to ancient metagenomics; there you might have the opportunity to introduce those concepts in a more intelligible way and show how they can be used

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:06:20

*Thread Reply:* I'm not sure they always are

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:06:41

*Thread Reply:* A clear one is just there: Nikolay and I disagree on the definition of breadth of coverage already

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:06:55

*Thread Reply:* Also do you know what cluster factor is?

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:07:19

*Thread Reply:* And if it differs from complexity? Or clonality?

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:07:20

*Thread Reply:* 😛

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:08:02

*Thread Reply:* Or we are talking about the same thing, and just describing it differently 😆

Antonio Fernandez-Guerra (antonio@metagenomics.eu)
2022-03-31 09:08:45

*Thread Reply:* Yes, I read your eager documentation

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:08:59

*Thread Reply:* Maybe I'm being misleading about what I mean by standardised... I literally mean describe what they mean in one place, and in cases where there are multiple we pick the best one and recommend people use that

Antonio Fernandez-Guerra (antonio@metagenomics.eu)
2022-03-31 09:09:37

*Thread Reply:* the main problem is that usually we coin new terms that have been described many years ago

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:10:01

*Thread Reply:* Yes - that's what I mean by dictionary, I don't mean define new terms

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:10:09

*Thread Reply:* (sorry the standardised was probably completely misleading here)

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:10:37

*Thread Reply:* Just describe the existing ones and where there have been multiple new terms of the same concept we try and emphasise the use of one

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:11:01

*Thread Reply:* And having such a dictionary will hopefully STOP the 'new term coining' as you rightly said

Antonio Fernandez-Guerra (antonio@metagenomics.eu)
2022-03-31 09:13:01

*Thread Reply:* Yes, this is fine but can be made in a wiki page like we do in anvi’o https://anvio.org/vocabulary/

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:13:19

*Thread Reply:* But I dunno... maybe it's not so useful. What do @Nikolay Oskolkov, as an outsider moving laterally into aDNA, and @Meriam van Os, who is (sort of) new to aDNA, think?

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:13:35

*Thread Reply:* > Yes, this is fine but can be made in a wiki page like we do in anvi’o https://anvio.org/vocabulary/
Yes, I considered that, but it's not very visible

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:14:09

*Thread Reply:* A paper would be picked up by more people, and make it easier to cite if people want to say 'use this term following the definition by XYZ (et al)'

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:15:02

*Thread Reply:* The ANVI'O wiki is nice, but is that generated from a consensus of a field? Or an 'opinionated' selection? Also I don't see much background behind the terms either

Antonio Fernandez-Guerra (antonio@metagenomics.eu)
2022-03-31 09:15:24

*Thread Reply:* You need to add value to the paper; a good-practices paper will accomplish the same and be more valuable for the community

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:15:30

*Thread Reply:* (although it is very nice with the descriptions, looking through it 👍)

Antonio Fernandez-Guerra (antonio@metagenomics.eu)
2022-03-31 09:16:19

*Thread Reply:* for example at f1000 we can have a rolling paper that gets updated as the field advances

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:17:38

*Thread Reply:* That would be PERFECT

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:17:45

*Thread Reply:* Very good idea

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:18:03

*Thread Reply:* What exactly do you mean by best practises though, in this case?

Antonio Fernandez-Guerra (antonio@metagenomics.eu)
2022-03-31 09:18:08

*Thread Reply:* You can get stickers https://twitter.com/merenbey/status/1499737607940542471?s=21&t=BM7Lui3tGkWWVMUpVF9Eg

😆 James Fellows Yates
Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2022-03-31 09:19:09

*Thread Reply:* @James Fellows Yates by best practices I would mean a primer (with all the code) of how a typical analysis is done and what the pros and cons of different filtering / screening strategies can be

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:20:07

*Thread Reply:* Oofff, that's a much bigger and 'controversial' project. It would be much more difficult to coordinate.

๐Ÿ‘ Nikolay Oskolkov
Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2022-03-31 09:20:26

*Thread Reply:* I agree, it is not easy

Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2022-03-31 09:21:33

*Thread Reply:* @James Fellows Yates to answer your question about evenness of coverage, it is just visualizing the output of "samtools depth" where I count reads in a sliding window along ref genome

๐Ÿ‘ James Fellows Yates
James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:22:07

*Thread Reply:* But do you think terms would be useful?

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:22:12

*Thread Reply:* (at a minimum)

Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2022-03-31 09:23:55

*Thread Reply:* I do think terminology is useful but I would add it to the primer, i.e. not as a separate paper

๐Ÿ‘ James Fellows Yates
Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2022-03-31 09:24:21

*Thread Reply:* A super simple R script that reads the "samtools depth" output and plots it is here:

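    # BREADTH OF COVERAGE
    # FIXME: empty due to failed samtools sort
    # outdir and RefID must be defined beforehand; the input is "samtools depth" output
    # (V1 = reference name, V2 = position, V3 = depth)
    df <- read.delim(paste0(outdir, "/", RefID, ".breadthofcoverage"), header = FALSE, sep = "\t")
    Ntiles <- 100
    step <- (max(df$V2) - min(df$V2)) / Ntiles
    tiles <- c(0:Ntiles) * step
    V4 <- vector()
    for (i in 1:length(tiles)) {
      # fraction of covered positions within the current tile, repeated once per position
      dftemp <- df[df$V2 >= tiles[i] & df$V2 < tiles[i + 1], ]
      V4 <- append(V4, rep(sum(dftemp$V3 > 0) / length(dftemp$V3), dim(dftemp)[1]))
    }
    V4[is.na(V4)] <- 0
    df$V4 <- V4
    plot(df$V4 ~ df$V2, type = "s", xlab = "Genome position", ylab = "Fraction of covered genome", main = paste0("Breadth of coverage: ", RefID, " reference"))
    abline(h = 0, col = "red", lty = 2)
    # overall percentage of reference positions with at least one read
    mtext(paste0(round((sum(df$V3 > 0) / length(df$V3)) * 100, 2), "% of genome covered"), cex = 0.8)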

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:24:30

*Thread Reply:* What is the issue with: Warinner, Christina, Alexander Herbig, Allison Mann, James A. Fellows Yates, Clemens L. Weiß, Hernán A. Burbano, Ludovic Orlando, and Johannes Krause. 2017. “A Robust Framework for Microbial Archaeology.” Annual Review of Genomics and Human Genetics 18 (August): 321–56. https://doi.org/10.1146/annurev-genom-091416-035526.

in this case?

Meriam van Os (meriam.vanos@postgrad.otago.ac.nz)
2022-03-31 09:24:35

*Thread Reply:* I think this all sounds super helpful, speaking as someone moving into this field, and someone who doesn't have anyone around at their university who does this kind of analysis! I agree with Antonio that it would be good to combine it with a discussion of "good" practices, i.e. what sorts of analyses are needed to be very sure about a positive signal. I've been using Warinner's paper that talks about some of the authenticity criteria, but it doesn't go into all those terms (from memory).

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:25:29

*Thread Reply:* Ok. That's good to know, but the problem is there are not really 'good practices' tbh

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:25:54

*Thread Reply:* It's all too new, so we will have to wait a long time for that to crystallise (just as a warning)

Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2022-03-31 09:25:54

*Thread Reply:* @James Fellows Yates I read this paper, my understanding was that it was too abstract and not so much hands-on, but I might be mistaken, need to perhaps have another look

👍 James Fellows Yates
Meriam van Os (meriam.vanos@postgrad.otago.ac.nz)
2022-03-31 09:26:04

*Thread Reply:* They discuss the criteria very nicely though, I've been using those criteria as my baseline

👍 James Fellows Yates
James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:26:20

*Thread Reply:* we don't have enough dedicated tools to compare, not enough benchmarking tools to find the consensus unfortunately

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:26:38

*Thread Reply:* Section 7.3 should cover everything I think

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:26:52

*Thread Reply:* Evenness of coverage, percent identity, etc.

Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2022-03-31 09:34:56

*Thread Reply:* @James Fellows Yates perhaps people in the field are aware of the problem of false discoveries (as you say 1 out of 100 can have a real hit), but talking about e.g. ISBA last year I saw only two talks showing the evenness of coverage (Lucy van Dorp and someone else), however everyone shows the deamination profile. In my opinion it should be the opposite. It is quite trivial to get a good looking deamination profile from an ancient sample, but super difficult to prove that the microbe you mean is actually there. What I saw at ISBA was that everyone (like Hannes) says "we got ~500 reads assigned to this microbe / eukaryote" without demonstrating the evenness of coverage. sedaDNA folks never demonstrate the evenness of coverage (sometimes they even do not demonstrate deamination profile). If people are aware of the problem, why is this somehow under-emphasized and almost hidden?

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:41:39

*Thread Reply:* There is a variety of reasons:

1) (Modern) reviewers are often much more skeptical of aDNA in the first place, so proving that it's ancient often takes precedence
2) Each microbe has its own characteristics you need to prove. For example, take Y. pestis: evenness of coverage of the genome is actually somewhat irrelevant for showing a person was carrying it. Proving it has the ymt gene, or the pPCP1 or pMT1 plasmid, is MUCH better proof. But of course only that species has those plasmids
3) Most people are targeting high fold coverage rather than evenness/breadth. This is why people have switched to captures, at which point evenness is also secondary. You can have a patchy genome but with more confident SNPs that allow you to place it on a tree (assuming you have proof, e.g. of species-specific plasmids)
4) People at conferences are often presenting preliminary data, and will only report read counts to show these are candidate taxa (and they will subsequently do deeper sequencing/whole genome stuff)
5) People often don't have enough genomes for it to be worthwhile, or sufficient knowledge, to programmatically check evenness of coverage. They will check this manually in IGV and assume that anyone who doubts their claims will check it themselves

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:41:49

*Thread Reply:* 6) and in some cases indeed naïvety of course 😉

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:41:55

*Thread Reply:* IGV manual checking is a big one

Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2022-03-31 09:43:14

*Thread Reply:* I happened to recently work a lot on sedaDNA projects and can honestly say that it is complete chaos. Since you do not have a host, i.e. a clear #1 candidate, the screening results are very difficult to interpret; it is a horrible mixture of microbial, human, plant and mammalian reads. The evenness of coverage is perhaps the most helpful metric to separate false from true discoveries. However, when I read sedaDNA papers, I need to go to a supplementary of a supplementary to find (sometimes!) an evenness of coverage plot (in the vast majority of cases I do not find it). The evenness of coverage should be Figure 1 of a paper. Instead, there are lots of phylogeny plots and lots of words about what animals / plants were discovered, without showing a single piece of evidence (not even a deamination profile is shown).

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:43:47

*Thread Reply:* I can't really comment on sedaDNA unfortunately.

But how do you make evenness of coverage plots across 100s of taxa 😉

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:43:53

*Thread Reply:* (to play devil's advocate)

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:44:12

*Thread Reply:* My understanding is human-associated microbial genomics is a little more routine in that manner

Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2022-03-31 09:45:23

*Thread Reply:* You make an evenness of coverage plot per species of course. So if the paper says "we have discovered a mammoth in a sedaDNA sample" there should be some evidence shown, not just reporting the number of reads

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:45:51

*Thread Reply:* But you have figure limitations, whereas you can fit a single read-count value for many taxa in a table 😉

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:45:55

*Thread Reply:* (again devil's advocate)

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:46:10

*Thread Reply:* But you are ultimately right

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:46:27

*Thread Reply:* Might make a good little <#C02D3DJP3MY|spaam-blog> post if you're up for it

Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2022-03-31 09:47:01

*Thread Reply:* No-no, I am talking about most interesting hits. Say you found a few interesting mammals in the sand of Denisova cave, there should be some proof shown

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:47:24

*Thread Reply:* Ok but in that case, they did mitochondrial capture

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:47:36

*Thread Reply:* and got 10X with 98% covered, why is evenness necessary?

Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2022-03-31 09:47:46

*Thread Reply:* you should start with a proof that the animal is actually there, and not with phylogenies

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:47:48

*Thread Reply:* (sorry @Meriam van Os for the spam, maybe we should move this outside the thread)

👍 Nikolay Oskolkov
James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:47:56

*Thread Reply:* > and got 10X with 98% covered, why is evenness necessary?

is that not proof?

Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2022-03-31 09:50:51

*Thread Reply:* I do not recall "10X with 98%" but this should be Figure 1 in my opinion and not hidden in a supplementary of a supplementary. People should not search for it

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:51:57

*Thread Reply:* How do you do that when you have 500 samples, across 30 species 😉

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:52:12

*Thread Reply:* (sorry, the 10X and 98% was theoretical)

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:52:38

*Thread Reply:* Reviewers are less and less interested in proof unfortunately. That doesn't make an interesting Nature paper nowadays...

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:52:44

*Thread Reply:* I do agree with you though

Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2022-03-31 09:53:14

*Thread Reply:* a typical presentation / paper is like this: they show pictures of the historical place (e.g. a cave), then a table of animals / plants found, and then phylogeny. I don't get why people (who are aware of the false discovery problem) do not try to convince the audience that the animal / plant is actually there

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:53:59

*Thread Reply:* But anyway, to conclude: I think evenness of coverage is underrepresented, and I think you could make a very good argument and helpful tutorial if you were to show how to programmatically generate plots like in your example above, because manual checking in IGV is indeed very suboptimal
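
As a rough illustration of what a programmatic check could look like across many taxa, here is a minimal Python sketch. It assumes per-position depth lines for a single reference sequence with zero-depth positions included (for example from `samtools depth -a`); the function names `tile_breadth` and `summarise` and the toy data are hypothetical, not part of any existing tool.

```python
import io

def tile_breadth(depth_lines, n_tiles=100):
    """Fraction of positions with depth > 0 in each of n_tiles equal tiles.

    depth_lines: iterable of 'reference<TAB>position<TAB>depth' lines for ONE
    reference sequence, with zero-depth positions included (assumption).
    """
    depths = [int(line.split("\t")[2]) for line in depth_lines if line.strip()]
    n = len(depths)
    fractions = []
    for t in range(n_tiles):
        lo, hi = t * n // n_tiles, (t + 1) * n // n_tiles
        tile = depths[lo:hi]
        # An empty tile (fewer positions than tiles) is reported as uncovered
        fractions.append(sum(d > 0 for d in tile) / len(tile) if tile else 0.0)
    return fractions

def summarise(fractions):
    """Compact per-taxon row: how many tiles contain any covered position."""
    return {"tiles_with_reads": sum(f > 0 for f in fractions),
            "n_tiles": len(fractions)}

# Toy input: 100 positions, only the first 20 are covered
toy = io.StringIO("".join(
    f"ref1\t{pos}\t{1 if pos <= 20 else 0}\n" for pos in range(1, 101)))
fr = tile_breadth(toy, n_tiles=10)
print(fr)
print(summarise(fr))
```

One row of `summarise` output per sample and taxon would fit in a table even with hundreds of taxa, and the full `fractions` list can still be plotted for the interesting hits.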

Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2022-03-31 09:54:27

*Thread Reply:* Since I know how hard it is to prove that say a bison is actually a true hit, I would concentrate my talk on this 🙂

Antonio Fernandez-Guerra (antonio@metagenomics.eu)
2022-03-31 09:54:46

*Thread Reply:* We are developing many tools to deal with what @Nikolay Oskolkov is saying about sedaDNA

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:55:03

*Thread Reply:* > a typical presentation / paper is like this: they show pictures of the historical place (e.g. a cave), then a table of animals / plants found, and then phylogeny. I don't get why people (who are aware of the false discovery problem) do not try to convince the audience that the animal / plant is actually there

But, ok, if you have sufficient coverage to place it in a phylogeny, and it falls correctly within Bison when you have a, I dunno... sheep outgroup, why is that not sufficient proof that it's a Bison?

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:55:54

*Thread Reply:* From that perspective, if anything, evenness of coverage is again not really relevant. If you have picked up enough regions that hold SNP diversity that allows (sensible) placement in the phylogeny, that's enough, no?

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:56:06

*Thread Reply:* > We are developing many tools to deal with what @Nikolay Oskolkov is saying about sedaDNA

Where are they then 😉 (teasing)

Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2022-03-31 09:56:56

*Thread Reply:* @James Fellows Yates the conclusion about a hypothetical bison is often made based on ~20-30 reads. Does it sound crazy to you? I am not kidding. Really, they make phylogenies based on at max 100-200 reads

Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2022-03-31 09:58:17

*Thread Reply:* The SNPs that determine the placement in the phylogenetic tree can be from the damage or sequencing errors if we talk about that low amount of reads

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:58:59

*Thread Reply:* Depends, that's exactly what they can do with Neanderthals.

100-200 reads can give you pretty good coverage on a mitogenome even.

But ultimately this comes back to my point it depends very highly on context.

Evenness of coverage I consider to be more important for microbial stuff, as if you have large missing regions these may contain gene-level species 'identifiers'

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 09:59:33

*Thread Reply:* > The SNPs that determine the placement in the phylogenetic tree can be from the damage or sequencing errors if we talk about that low amount of reads

Depends! If you have sufficient informative SNPs and you know the approximate damage

Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2022-03-31 10:04:38

*Thread Reply:* Perhaps! From my testing, indeed, ~20-30 reads can have e.g. mammoth-specific alleles. So they might very well be indeed mammoth reads. But I wonder how many such "mammoth-looking reads" you can get if you e.g. slice a wolf genome, damage it and inject sequencing errors? This should be discussed openly. And I see a lack of such discussions 😞

👍 James Fellows Yates
James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 10:05:11

*Thread Reply:* But OK, sorry @Nikolay Oskolkov, I'm being a bit facetious now. It's not personal.

I just find it frustrating that people come in complaining about this sometimes, and reviewers can be extremely dismissive without reading specifically how each group validates these things. Some groups can be VERY careful. But equally, the people who are very dismissive don't bother to actually make the tools that would make this easier...

And unfortunately the few tools that exist are not well made (HOPS for example), so this is a pain point for me too.

This could be a good in-person chat to have over SPAAM4 if you're joining that?

Maybe a best-practices validation publication with more practical examples is a good idea... but often I've seen a lack of interest from the experts in this matter

🙌 Nikolay Oskolkov, Meriam van Os
James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 10:05:52

*Thread Reply:* > Perhaps! From my testing, indeed, ~20-30 reads can have e.g. mammoth-specific alleles. So they might very well be indeed mammoth reads. But I wonder how many such "mammoth-looking reads" you can get if you e.g. slice a wolf genome, damage it and inject sequencing errors? This should be discussed openly. And I see a lack of such discussions 😞

Very true... but often they are tricky things to describe concisely, so they have to be forced into the SI 😞

👍 Nikolay Oskolkov
James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 10:06:04

*Thread Reply:* You saw the size of my SI in my PNAS paper?

👍 Nikolay Oskolkov
James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 10:07:58

*Thread Reply:* Actually...

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 10:07:59

*Thread Reply:* ACTUALLY

Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2022-03-31 10:08:22

*Thread Reply:* No, I am afraid not 🙂

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 10:08:45

*Thread Reply:* @Nikolay Oskolkov So I had actually previously proposed a paper like this: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009823

but for ancient microbial genomics validation to one of the research groups I'm in. But there was little interest from the leadership

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 10:08:58

*Thread Reply:* You are welcome to take the idea and run it in SPAAM?

James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 10:09:41

*Thread Reply:* > No, I am afraid not

120 pages

😂 Nikolay Oskolkov
James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 10:09:50

*Thread Reply:* https://paperpile.com/shared/evliy7

👍 Nikolay Oskolkov
Antonio Fernandez-Guerra (antonio@metagenomics.eu)
2022-03-30 09:25:05

@Nikolay Oskolkov this is what we usually compute using https://www.sciencedirect.com/science/article/pii/0888754388900079?via%3Dihub and https://www.nature.com/articles/jhg201621

Antonio Fernandez-Guerra (antonio@metagenomics.eu)
2022-03-30 09:25:54

and summarizes your third plot

aidanva (aida.andrades@gmail.com)
2022-03-30 09:28:28

I'm a bit confused, wouldn't it be possible to calculate evenness of coverage from the rma6? (I'm sorry if this is a stupid question, I haven't worked with rma6 directly)

aidanva (aida.andrades@gmail.com)
2022-03-30 09:28:53

The rma6 files should contain the positions where it is mapping in a reference

Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2022-03-30 09:35:21

*Thread Reply:* @aidanva in theory yes, but nobody knows how except for MALT / HOPS developers 🙂 This is why MALT / HOPS is a bit of a black box, there is no way the community can customize MALT output 🙂

aidanva (aida.andrades@gmail.com)
2022-03-30 09:36:14

*Thread Reply:* Yeah, ok, that is totally true. We should push them to report evenness of coverage

👍 Nikolay Oskolkov
Antonio Fernandez-Guerra (antonio@metagenomics.eu)
2022-03-30 09:29:14

if they have it, yes

aidanva (aida.andrades@gmail.com)
2022-03-30 09:29:15

And as Antonio says, that's what you need, right?

aidanva (aida.andrades@gmail.com)
2022-03-30 09:29:23

I am sure they must have it

Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2022-03-30 09:37:12

*Thread Reply:* Internally this information should be there, yes, but you need to be the author of MALT / HOPS in order to know how to extract it. At least I never managed 🙂

aidanva (aida.andrades@gmail.com)
2022-03-30 09:29:43

Since HOPS also has a destacking (like a duplicate removal) option

Antonio Fernandez-Guerra (antonio@metagenomics.eu)
2022-03-30 09:29:44

I have no idea about the RMA6 format

aidanva (aida.andrades@gmail.com)
2022-03-30 09:29:54

And for that it needs to know the positions

aidanva (aida.andrades@gmail.com)
2022-03-30 09:30:38

To evaluate if a read is a duplicate or not

aidanva (aida.andrades@gmail.com)
2022-03-30 09:32:07

That could also give you an indication of whether all of your reads are coming from similar regions, by comparing the total number to the destacked number reported by HOPS. If the second one is considerably smaller than the total, then you are mostly mapping to conserved regions and it is probably a bad candidate for further work

Meriam van Os (meriam.vanos@postgrad.otago.ac.nz)
2022-03-30 09:40:38

Wow, didn't expect so much involvement. Thanks everyone! This is super helpful!!

@Nikolay Oskolkov @James Fellows Yates Isn't the evenness of coverage represented in the read distribution in the table at the top? I always thought it was calculated as: number of covered bases on reference genome / total bases mapped to reference. So if there are duplicates/clusters, the total mapped bases will be higher than the number of covered bases. In other words, if there is an even distribution, this number should be 1, and the lower it is, the more clustering. I think that's right?

Hugh Cross (hughbcross@gmail.com)
2022-03-30 09:40:51

@Hugh Cross has joined the channel

Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2022-03-30 09:43:22

@Meriam van Os the piles of reads do not necessarily come from duplicates, there might be unique reads all coming from just one chromosome (now I am talking about eukaryotes for simplicity), so other chromosomes are not covered at all, not a good sign 😞
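
A toy illustration of this point, as a minimal Python sketch. The reads, genome length and helper names (`read_distribution`, `windows_hit`) are invented for illustration, and `read_distribution` simply follows the covered bases / total mapped bases definition given in the message above, which may not be exactly how HOPS computes it. Ten unique reads piled into one region score the same as ten reads spread along the genome:

```python
def read_distribution(reads):
    """Covered bases / total mapped bases for (start, length) reads."""
    covered = set()
    total = 0
    for start, length in reads:
        covered.update(range(start, start + length))
        total += length
    return len(covered) / total

def windows_hit(reads, genome_len, n_windows=10):
    """Number of equal-sized windows that contain at least one read start."""
    size = genome_len // n_windows
    return len({min(start // size, n_windows - 1) for start, _ in reads})

genome_len = 1000
piled = [(i * 10, 10) for i in range(10)]    # ten unique reads, all in the first 100 bp
spread = [(i * 100, 10) for i in range(10)]  # ten unique reads, one per 100 bp

for name, reads in (("piled", piled), ("spread", spread)):
    print(name, read_distribution(reads), windows_hit(reads, genome_len))
```

Both layouts give a read distribution of 1.0, but the piled reads touch only 1 of 10 windows while the spread reads touch all 10, so the single value cannot tell them apart.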

Meriam van Os (meriam.vanos@postgrad.otago.ac.nz)
2022-03-30 09:53:05

*Thread Reply:* Aah, that is a very good point, I actually never really thought about that oops 😅 why does no-one talk about this in papers haha? Or I may have missed it... 😬 Okay, I'll go map the potential present microbes and make some nice coverage plots!

👍 Nikolay Oskolkov
Nikolay Oskolkov (nikolay.oskolkov@scilifelab.se)
2022-03-30 09:58:43

*Thread Reply:* Sorry @Meriam van Os, I need to explain properly. An uneven ref genome coverage looks (if you check in IGV) like reads forming "piles", i.e. lots of reads mapped here and lots of reads mapped there but nothing in between. What you really want to see is that all reads are mapped at unique genomic positions, no matter how many at each position, one is enough. In this case, shallow sequenced samples can give on average maybe 6% of bases covered. This is simply because there are not enough reads to cover all bases of a long ref genome. So 6% of bases covered might imply that the reads are spread non-uniformly (as you said, cluster at certain places), or uniformly but there are simply too few reads to get a higher percentage of ref genome covered. So you need to plot how coverage varies vs genomic position to really see that the majority of reads come from unique genomic positions
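
To put a number on "too few reads to cover more", here is a minimal Python sketch. It uses the standard idealised approximation that, if reads land uniformly at random, the expected fraction of covered bases is 1 - exp(-mean depth). The read count, read length and genome size below are hypothetical, and real data (mappability, GC bias, conserved regions) will deviate from this model.

```python
import math

def expected_breadth(n_reads, read_len, genome_len):
    """Expected fraction of bases covered under uniform random read placement."""
    mean_depth = n_reads * read_len / genome_len
    return 1.0 - math.exp(-mean_depth)

def breadth_ratio(observed_breadth, n_reads, read_len, genome_len):
    """Observed / expected breadth; values well below 1 suggest piled-up reads."""
    return observed_breadth / expected_breadth(n_reads, read_len, genome_len)

# Hypothetical shallow sample: 5,000 reads of 60 bp on a 4.8 Mb reference
exp_b = expected_breadth(5000, 60, 4_800_000)
print(f"expected breadth: {exp_b:.3f}")
print(f"ratio if 6% observed: {breadth_ratio(0.06, 5000, 60, 4_800_000):.2f}")
print(f"ratio if 1% observed: {breadth_ratio(0.01, 5000, 60, 4_800_000):.2f}")
```

Under these assumptions about 6% breadth is roughly what evenly spread reads would give, whereas 1% breadth from the same number of reads would point to clustering. The ratio is only a screening aid; the per-position plot is still what shows where the reads actually sit.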

👍 Meriam van Os
James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 08:51:14

*Thread Reply:* @Meriam van Os you're very right, but that is the problem: it's very difficult to represent unevenness in a single value.

So the purpose of that field in HOPS output is to remove the clearly false positive samples - you should always either do mapping and check with IGV or make nifty little plots as Nikolay has done

Meriam van Os (meriam.vanos@postgrad.otago.ac.nz)
2022-03-31 09:01:19

*Thread Reply:* Yeah that makes sense!

Antonio Fernandez-Guerra (antonio@metagenomics.eu)
2022-03-30 09:44:16

also you have all the mobile genetic elements, prophages and others that recruit many reads

👍 Meriam van Os
James Fellows Yates (james_fellows_yates@eva.mpg.de)
2022-03-31 08:52:21

Also would like to point out @Nikolay Oskolkov (to be a bit cheeky) - MALT/HOPS is not a blackbox - all the code is open-source and viewable!

https://github.com/danielhuson/malt https://github.com/rhuebler/HOPS https://github.com/rhuebler/MaltExtract https://github.com/keyfm/amps

You just need to learn Java first 😉

👍 Nikolay Oskolkov