Back to our discussion @aidanva about the different classifications of the data types for single-genomes 🙂
The categories we were considering are: raw, assembly (assembled contigs) and consensus. Also MAGs?
I will argue that any assembly is actually a MAG
since we always have a metagenomic background even when we capture the data
so I guess that should go into our definition of assembly
so I think those categories are good. however, does anyone else think we should include more categories?
we could have “assembly_MAG” as a category since contigs form a MAG
So for consensus I think we should define it as any sequence derived from a consensus calling algorithm
and for assembly_MAG: anything derived from an assembly process, independent of completeness
are you guys aware of any paper where they only published the reads mapping to the specific genome, as opposed to the raw data?
yes, a few of the Mühlemann virus papers have done that
Actually maybe they published just the assembly, not sure. Will have to check
but I guess it is a potential scenario… should we call this mapped? I can't think of a better name now…
for the definition of raw data, how about something like: raw data: shotgun or capture data in fastq format without any type of depletion or computational manipulation of read/data composition
although, sometimes the data does not contain adapters… but I think we can consider that raw. So maybe add: “with the exception of adapters being trimmed”?
*Thread Reply:* This is how it should be
*Thread Reply:* In theory, any data at ENA/SRA should not contain technical sequences https://ena-docs.readthedocs.io/en/latest/submit/fileprep/reads.html#fastq-format and soon SRA will remove qualities from fastq
*Thread Reply:* well they have a call to ask the community
*Thread Reply:* Yes, unfortunately people ignore that -.-. Surprised that's not checked for, actually.
Yes I saw... I hope ENA doesn't follow suit
*Thread Reply:* Terrible for us
*Thread Reply:* They have CRAM as an alternative
*Thread Reply:* let’s see, the community is not so happy
instead of mapped how about aligned or reference_aligned ?
@James Fellows Yates has joined the channel
What is the final list of categories then?
raw, assembly_MAG, reference_aligned, consensus
@Antonio Fernandez-Guerra has joined the channel
I’m putting together the final definitions for each category, feedback welcome.
I would add another category that is called binned or something similar. We are developing methods to recruit reads that might belong to the same genome using protein assembly methods
Sure! Could you provide a short definition as well? 🙂
So far we have: raw: shotgun or capture data in fastq format without any type of depletion or computational manipulation of read/data composition, with the exception of adapters being trimmed
assembly_MAG: anything derived from an assembly process, independent of completeness
consensus: Any sequence derived from a consensus calling algorithm
@aidanva what do you think about this for reference_aligned?
reference_aligned: target reads derived from alignment (mapping) of metagenomic data to a reference sequence
binned: reads recruited by ancient co-abundant genes
although binned might be too general; MAGs are binned, mapped reads are binned, let me think of a better name
maybe we can use the prefix binned_ for MAGs and CAGs, as they are creating bins, then we could have binned_cag, binned_mag, just a suggestion
Then I suppose we should have assembly instead of assembly_MAG?
I would have, as not all assemblies are MAGs, but all MAGs are assemblies (most of them)
@Antonio Fernandez-Guerra could you provide a definition for the binned_cag as well?
binned_cag: reads recruited by ancient co-abundant genes
now we have some flexibility if new techniques appear to bin reads of similar genomes
and for binned_mag ? Something along the lines of:
binned_mag: a single-taxon assembly based on one or more binned metagenomes that is a close representation of a known isolate or represents a novel isolate.
I would remove the isolate part
so, just binned_mag: a single-taxon assembly based on one or more binned metagenomes
the problem with MAGs is that you are not sure it is a single taxon; you might have contigs (reads) from different populations. For example, Meren has in his glossary:
Metagenome-assembled genome (MAG)
A genome bin that meets certain quality requirements and can be assumed to represent contigs from one bin of a metagenome, which collectively represent the DNA of (what we think is) a single population.
You also have the Segata-style species-level genome bin (SGB)
(at least I mostly see them in their papers)
I would keep to the standard terms: https://www.nature.com/articles/nbt.3893
Do these seem ok now?
raw: shotgun or capture data in fastq format without any type of depletion or computational manipulation of read/data composition, with the exception of adapters being trimmed
assembly: anything derived from an assembly process, independent of completeness
binned_mag: a single-taxon assembly based on one or more binned metagenomes that meets certain quality requirements and can be assumed to represent contigs from one bin of a metagenome
binned_cag: reads recruited by ancient co-abundant genes
reference_aligned: target reads derived from alignment (mapping) of metagenomic data to a reference sequence
consensus: Any sequence derived from a consensus calling algorithm
Aida (sitting next to me) says she approves. My only comments are to define the typical output for assembly and consensus, to make sure not to leave ambiguity
For example a mag is also intrinsically an assembly
And some people consider mapping to a reference an assembly
an assembly doesn’t have binning
you are adding an extra layer on top of assembly
again, I'm not saying it's correct but it's how some people will use it, so we need to teach them to use the right term
and mapping to a reference is not assembly either, as you are not using any approach like de Bruijn, OLC, …
then in the description we make the ambiguities clear
maybe for assembly we can add some of the assembly techniques
...assembly process (de Bruijn, overlap-layout-consensus...)...
assembly: anything derived from an assembly process (e.g. de Bruijn, overlap-layout-consensus), independent of completeness
Maybe single or metagenome assembler? And by assembly are you specifically meaning de novo?
If it's de novo assembly just putting that in the definition would be enough for me
and you can use an assembler that can work with a single genome or a mix
I meant that for the assembly category itself, I would be happy with the definition of:
Anything that is derived from a de novo assembly process, independent of completeness
Then we can skip the whole methods thing
Because that is what distinguishes it from 'reference guided assembly' (what I've seen in the literature for reference mapping)
And then for consensus... That's the more tricky one
then you can have reference-guided de novo assembly
I think we then get lost in details, and we should take a compromise
I would take assembly as both types, maybe add the program you used for the assembly
then how do we separate consensus sequences from reference assemblies?
What about: consensus sequence derived from a consensus calling algorithm applied to reference-aligned data?
need to jump now, catch up later
What about assembly_denovo, assembly_guided, and consensus, where the last step of the process takes priority
I.e. reference-guided + consensus = consensus
To use @Åshild (Ash)'s definitions, are my suggestions for assembly and consensus OK?:
raw: shotgun or capture data in fastq format without any type of depletion or computational manipulation of read/data composition, with the exception of adapters being trimmed
assembly: anything derived from a de novo assembly process, independent of completeness
binned_mag: a single-taxon assembly based on one or more binned metagenomes that meets certain quality requirements and can be assumed to represent contigs from one bin of a metagenome
binned_cag: reads recruited by ancient co-abundant genes
reference_aligned: target reads derived from alignment (mapping) of metagenomic data to a reference sequence
consensus: Any sequence derived from a consensus calling algorithm applied to reference-aligned data (typically a FASTA-style file, as can be found on e.g. NCBI GenBank)
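If it helps for the repo checks, here's a rough Python sketch of how the agreed data_type vocabulary could be validated on submission — just an illustration, not the actual AncientMetagenomeDir JSON schema (the column name and helper function are made up):
```python
# Sketch only: validating the proposed data_type vocabulary.
# Column/function names are hypothetical, not the real AncientMetagenomeDir schema.
ALLOWED_DATA_TYPES = {
    "raw",                # untouched FASTQs (adapter trimming allowed)
    "assembly",           # contigs from a de novo assembly, any completeness
    "binned_mag",         # single-taxon assembly from binned metagenome(s)
    "binned_cag",         # reads recruited by ancient co-abundant genes
    "reference_aligned",  # target reads from mapping metagenomic data to a reference
    "consensus",          # sequence from consensus calling on reference-aligned data
}

def check_data_type(value: str) -> None:
    """Raise if a submitted data_type is not in the agreed vocabulary."""
    if value not in ALLOWED_DATA_TYPES:
        raise ValueError(
            f"Unknown data_type {value!r}; expected one of {sorted(ALLOWED_DATA_TYPES)}"
        )

check_data_type("reference_aligned")  # passes; a typo like "assembly_mag" would raise
```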
https://github.com/SPAAM-workshop/AncientMetagenomeDir/pull/146
Ignore validation fails, this will happen as the JSONs aren't integrated into master yet
Would like two approvals for this one
In a meeting until noon, can have a look after lunch
There are 2 krause-kyora 2018 publications btw 😉
*Thread Reply:* We have the two siseus, right?
*Thread Reply:* Oh you mean with the keys
*Thread Reply:* I've already sneakily fixed that a couple of days ago 😉
We have a problem... Aida just added a paper and it's now missing from the new singlegenome-hostassociated list by the looks of it?
After we merged the new column thing
I can re-enter the data, I have saved it on my computer
This has the new data_type column
well, I'll re-enter them then, no problem
done, if any of you does the review I will merge it
it seems that all open pull requests will have the same issue. I am just doing a PR review of Barquera
Not sure what happened there 🤔
I will fix the two open PRs
Just need to wait for Aida to merge
Oops lost the Barquera latlongs
Or didn't add them in the first place 🤦 sorry Ash
I can’t do anything with this pull request, right?
ok now you can review 🙂
Krause Kyora also good now too (or should be)
*Thread Reply:* review done
*Thread Reply:* Fixed, check again ❤️
*Thread Reply:* approved, there are some conflicts though
Can I rename to “single-genome-ancientmetagenomedir”? Reason why I didn’t start with “ancientmetagenomedir” is because it now looks like this on my sidebar 😉 😉
it also can be shortened as amd- or -amd
Please also rename the environment
*Thread Reply:* how? can’t seem to find on google
*Thread Reply:* Don't worry, found it
*Thread Reply:* It was the other channel, maybe not visible as you weren't there
*Thread Reply:* Sorry I mean the AncientMetagenomeDir-environment slack channel
*Thread Reply:* <#C018UBC9T47|dir-environmental> <- which is there. Would be the same procedure as for this one
Or the suffix could be: the 'Dir, as @Becky Cribdon referred to it today 🤣
Maybe put dir first though, so they are all together?
Another question, as @aidanva brought up/predicted before, but it has now come up in https://github.com/SPAAM-workshop/AncientMetagenomeDir/pull/170 from @Becky Cribdon: in de Dios 2020 they report hits to a eukaryotic pathogen, but they only recover/analyse a mitochondrial genome for one of the pathogens
How should we indicate this, if we should?
And then what should we allow to be included?
Maybe with genome_type: chromosome vs plasmid, although someone would need to do a lit. review for what we have missed with the latter...
maybe organelle to keep it generic?
it should be chromosome, plasmid, organelle
I meant organelle instead of mtDNA
@channel I am working on the Kay 2015 paper; here there are several individuals with mixed infections of TB strains, for example one individual may have 3 TB strains, according to the authors’ analysis. Do I enter every strain separately OR do I make one entry per individual?
Another issue, which relates to the Krause-Kyora 2018 paper. Here they publish data for several leprosy genomes. Some are novel and some are additional data for samples already published by Schuenemann 2018, but using a slightly different sample name. How can I make it clear that the data from the Krause-Kyora 2018 paper belongs to some of the same samples from Schuenemann 2018?
> @channel I am working on the Kay 2015 paper; here there are several individuals with mixed infections of TB strains, for example one individual may have 3 TB strains, according to the authors’ analysis. Do I enter every strain separately OR do I make one entry per individual?
As we only list species, then one entry per individual (IMO). People should still check the original publication for what they are looking for
> Another issue, which relates to the Krause-Kyora 2018 paper. Here they publish data for several leprosy genomes. Some are novel and some are additional data for samples already published by Schuenemann 2018, but using a slightly different sample name. How can I make it clear that the data from the Krause-Kyora 2018 paper belongs to some of the same samples from Schuenemann 2018?
How slightly different is the sample name? If it's just like _new or something, I would re-use the old names and try and duplicate exactly (as far as possible) the info from the previous paper
Maybe put G507 (Jørgen_507)?
Are we including single microbial genomes from ancient herbaria specimens? Thinking Phytophthora infestans etc.
Aida tried dealing with Yoshida 2013 today
Just added another: Martin 2016
I emailed the authors... let’s see what they say
I hate when that happens! Why do people publish papers and not release the data, even though they provide an accession number in the paper?
*Thread Reply:* They are so fed up of the whole thing they forget
*Thread Reply:* (speaking from experience)
*Thread Reply:* shifty eyes
These are all genomes?!
*Thread Reply:* yes, should be
Saville 2016 sequenced mito-genomes of Phytophthora infestans and did genotyping, do we include it? https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0168381
This was the thing @aidanva brought up earlier, I guess we need to discuss
Can you make a PR with a new column (and README) with something like genome_level
With two options: chromosome and organelle
I don't want to get smaller than that I think
Everything already on there should be then chromosome
I think that's the simplest, @aidanva?
Everything in the list is chromosomes as far as I am aware
Can you do that @Åshild (Ash)?
Thanks @Miriam Bravo for the PR for Kerudin! I've made a few comments on the PR, https://github.com/SPAAM-workshop/AncientMetagenomeDir/pull/195
Does this sound ok for the Readme, for the genome_level column?
Specify one of two options: chromosome or organelle
organelle: if a study has only published a microbial mitochondrial genome (e.g. for Plasmodium sp.) or a bacterial plasmid sequence
chromosome: if a study has published the complete genome (chromosome and plasmid) or just the chromosome
@Åshild (Ash) Thank you! I ended up tweaking it slightly after discussion with Aida this morning, but otherwise merged (and dealt with the stupid Python indentation/JSON bollocks).
The main change was: I switched it to genome_type, because I realised genome_level might get slightly confused with assembly level in NCBI/EBI (contigs, scaffolds, chromosome, complete etc.), and I added a little more documentation (Aida pointed out we should emphasise that the study must have aimed to reconstruct 'whole' versions of each genome, not the gene stuff, like in the issue you closed last night)
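And in case it helps with backfilling the open PRs, a minimal pandas sketch of what adding and checking the new genome_type column could look like — file and column names here are placeholders, not necessarily the real repo layout:
```python
# Minimal sketch: backfill and sanity-check the new genome_type column.
# Paths/column names are placeholders, not necessarily the real repo files.
import pandas as pd

ALLOWED_GENOME_TYPES = {"chromosome", "organelle"}

df = pd.read_csv("ancientsinglegenome-hostassociated.tsv", sep="\t")

# Everything already on the list targeted whole chromosomes, so default to that.
if "genome_type" not in df.columns:
    df["genome_type"] = "chromosome"

unexpected = set(df["genome_type"]) - ALLOWED_GENOME_TYPES
if unexpected:
    raise ValueError(f"Unexpected genome_type values: {sorted(unexpected)}")

df.to_csv("ancientsinglegenome-hostassociated.tsv", sep="\t", index=False)
```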
So... I have a question, I am now putting in Martin 2016. Most of their samples are from 1950s until 2000s. Should they all be considered ancient?
based on what we discussed yesterday, everything older than 10 years should be considered ancient. If that's the case, then we need to add a bunch more studies that dealt with very young samples
Ooof. Good point. We can either exclude anything younger than 1950 (when did Sanger sequencing start?), Or just say include them if they report them as ancient.
I'm currently feeling like the former might be safer, as @aidanva says we might end up with a massive flood of stuff and it'll get out of hand...
Maybe we should move this to #ancientmetagenomedir as this would cover that
Maybe anything older than that?
let's see what others think in the general channel; even if that's when it was invented, I dunno if it was widely used
@aidanva could you copy and paste your original message then my response there?
Or maybe 1950 is also easier because it's our date limit
I don’t understand why the date of sequencing technology should be used to decide if a sample is “ancient”. Isn’t it better to use the 10 years you specified earlier?
but 10 years since now, or 10 years since the study was published?
I think the second makes more sense
I guess @aidanva's point is that there is a lot of medical papers we are missing
And suddenly we will have loads of things to add. On the other hand they are valid I guess, but I think maybe out of scope for the review by Tina and co. Maybe we could prioritise for the first release anything older than 1950
Then we can write the Preprint/paper based on that, and then continue with the additions after that
(then there is less stress)
I can make milestones for release to help track that too
I agree w/ a hard date cut-off to start, and really stick with it. So don’t include border-line samples, b/c then you have to decide how far you want to stretch the border
Ok, then I will complete Martin 2016 with all the samples prior to 2006, however I won't make a PR because none of the samples are prior to 1950
Make it a draft PR with a comment
Ok, want me to write in the ancientmetagenomedir channel that we decided that for the first release we will only consider samples prior to 1950?
Yes please. And say I'll be tagging all papers with that date and so people should only assign themselves to those
@channel should I merge the 5/6 Iceman H. pylori samples into one so it represents a single genome (as they actually merge all into one, but I think I was thinking too much of microbiome samples at the time...)?
Could someone review this: https://github.com/SPAAM-workshop/AncientMetagenomeDir/pull/216
Fixes Maixner and changes versioning system to a vYY.MM based system, as I think this is more readable for humans when we will have time based releases
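For reference, a vYY.MM tag is just the two-digit year and month of the release; a trivial sketch (helper name made up) of how one could be generated:
```python
from datetime import date

def release_tag(d=None):
    """Return a calendar-based release tag in vYY.MM form, e.g. 'v20.09'."""
    d = d or date.today()
    return d.strftime("v%y.%m")

print(release_tag(date(2020, 9, 1)))  # -> v20.09
```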
New pathogen genomes: https://bmcbiol.biomedcentral.com/track/pdf/10.1186/s12915-020-00839-8
*Thread Reply:* Or, rather it is an issue and Aida is already assigned to it 🙂
@Maria Spyrou could you also post that in #papers!
And thanks for also posting it here 😄
@Åshild (Ash) @aidanva you need to come up with a team name
Environmental core people are now @thedir-team-dirt on github
Irina and I now: @thedir-team-bugparty
what about thedir-team-pathogen_peeps @aidanva?
not 100% sure what that means but 🤷
I had a mental image of Maia joining github
I've also removed myself, and made you both maintainers, so you can make teams/modify from your team and under as you wish
I’ve just added two new issues for papers on Plasmodium mtDNA, which were previously not included.
Should we include those in the current release or should they go in the next one?
Next one. Release is fixed now
Only updates to the release are major breaking things (like when I missed the pathogen list from the ZIP file 😅 )
Also, there is the Chan 2013 paper, which published the first evidence for TB in one of the Hungarian mummies also included in the Kay 2015 paper (included in the list already); should we include/exclude this paper?
What exactly did they do in Chan 2013 vs Kay 2015?
Chan paper did metagenomic analysis of body_68
which was further sequenced, I believe, in Kay 2015
The same individual (body_68) was re-sampled by Kay etal 2015
@Katerina Guschanski has joined the channel
@Gunnar Neumann has joined the channel
@Åshild (Ash) @aidanva let me know once you've looked through the manuscript/response letter (I want at least 'Go's from the core team at a minimum) 🙂
@aidanva @James Fellows Yates I’m getting ready to add the papers that I signed up for on the ’dir. However, several of these were the ‘problematic’ papers where the genomes date after 1950, i.e. 1960s etc. Do we want to include these still? How should I enter these with regard to date, assuming we want to include them?
Now our paper is accepted I guess we can loosen the restrictions for now...
I would still indicate them as 100 (i.e. within the last 100 years)
I honestly don't know. It depends if you would use them or not
Like the HIV (was it?) one, do you think people would really use that much aDNA?
I personally would lean against it as I think it's barely aDNA.
If someone did find e.g. a 10,000 year old HIV virus they would have to build modern datasets anyway and would pick that up
so I think keeping the 1950 cut-off might be sufficient, as then we can use the limit of the most common dating technique in archaeology.
Exceptions could be papers that have a mixture of both >1950 and younger stuff (but still sort of ancient)
Yeah, it’s two HIV papers with genomes from the 60's
I can leave them out for now. We need to draw a line I suppose, and 1950 is as good as any
@channel for papers with a mixture, what about a cut-off if more than 50% of samples are older than 1950?
50% of ancient samples I mean
(as we would exclude modern ones)
Have the same issue with @Miriam Bravo’s latest PR
I don't think I understand what you mean. So you won't include a paper if less than half of the samples are ancient?
Ok, so criteria would be
1) does the paper say they include ancient samples (i.e. report damage) 2) of the ones they consider 'ancient', are >= 50% of these older than 1950
I would just include the ones that are pre-1950…and then people can read the paper and make their own decisions about what they want to include 🤷♀️
I agree with Ash, it just seems a random cut off
I meant more that if the majority of the 'ancient' samples are from 1970 or 1980 or something, then we don't include the paper at all
I will include any paper that has samples predating to 1950. However, only samples predating 1950 should be included in the database. Then people should read the paper and decide if they wanna include anything else in their analysis. I think that's what @Åshild (Ash) was saying
The thing is I don't particularly like this half splitting of actual samples...
I'll put it up as a vote then that 1950 is the hard cut off
in some papers they sequence modern strains alongside the ancient ones
(will need to go back and check everything then)
but then we don't include the modern ones, right?
Didn't we already agree that 1950 was the cut-off for ancient?
*Thread Reply:* Not definitively
Bos 2014 and Schuenemann 2013 include modern genomes (off the top of my head)
That is a lot of work to go back and add, and it makes no sense to include these modern genomes in the dir
We wouldn't include modern genomes
But I guess then the same issue of a cut-off would apply, true
the philosophical question here is: What is ancient? and depending on what you consider as defining features of aDNA, you can set different cut offs. We should just decide on one and stick to it, and be upfront that this is what is recorded in this database
1950 is a good cut off, as that’s what the 14C dates are calculated off, and then we can still include genomes from the early 1900s, which are now over 100 years old
Well, but that is where my uncertainty comes in. Should we use 14C as our gold standard... I dunno if that is regularly done e.g. on sedaDNA
what is considered as ancient is subjective, so I think it’s fine that we decide an arbitrary number, even if it coincides with the 14C date.
@Åshild (Ash) @aidanva in case you didn't see the doodle email, planning for the open data award application thing will be July 30th 9-10
*Thread Reply:* I did a quick check, it seems that the library was selected only for siRNA. I would say that that’s not a complete genome and I’m not sure how metagenomic it is. So I wouldn’t include it at the moment, but it makes me think: did we add other RNA studies before?
*Thread Reply:* Good question
*Thread Reply:* I think there was a couple yes 🤔
*Thread Reply:* Looks pretty complete genome to me?
*Thread Reply:* I’ve only had time to look at it very briefly, but I think we should add it. We have added other RNA genomes before and since it is at such high coverage, as James points out, I don’t see any reason not to add it
*Thread Reply:* I agree, I misread it; now that I look at it more in depth we should add it.
https://www.ncbi.nlm.nih.gov/labs/pmc/articles/PMC8553777/
it is quite low, but I don't remember if we established a cut off by coverage?
Iirc initially we were aiming for reasonable coverage to say a 'partial genome' (so people can do some form of phylogenetic analysis), but then we included some stuff from that paper from the Turkish group (forgetting the name now) that Anders Götherström works with..., I think?
Does this paper have 1x coverage of a plasmid at least?
they had more reads, Kilinç 2021 (doi: 10.1126/sciadv.abc4587) right?
but the enrichment looks good
since we included Kilinç, we should include this one too
@aidanva could you make the issue?
The number of unique reads in Kilinç 2021 was higher though: yak022: 1640 yak023: 3070 irk050: 3289
for one of the samples after capture the read count is higher: https://www.nature.com/articles/s41598-021-98214-2/tables/5
I think we discussed previously a threshold of minimum 0.1X coverage for a “genome”/sample to be included in the list
@Laura Carrillo Olivas has joined the channel
https://www.sciencedirect.com/science/article/pii/S2666517421000699#sec0013
*Thread Reply:* Is any one else’s mind blown that bacteria from 100-year-old teeth from non-permafrost burial conditions could be cultured?
*Thread Reply:* 🐟 🐟 🐟 🐟 🐟
@Nikolay Oskolkov has joined the channel
Y/N for the TB hits?
mmmm... they have 161 reads mapping to the TB chromosome... that's very low...
No, too few reads unfortunately. They state that they cannot properly authenticate it, although they are probably real reads. However, not enough for 0.1X coverage to qualify this as a “genome” and for inclusion in the dir
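For the record, the back-of-envelope numbers (read length and genome size are assumptions: ~75 bp reads and the ~4.4 Mb M. tuberculosis chromosome):
```python
# Rough coverage estimate for the reported TB hit.
# Read length and genome size are assumed values, not taken from the paper.
n_reads = 161
read_length_bp = 75          # assumed average aDNA read length
genome_size_bp = 4_400_000   # approx. M. tuberculosis chromosome

coverage = n_reads * read_length_bp / genome_size_bp
print(f"~{coverage:.4f}X")   # ~0.0027X, far below the 0.1X threshold
```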
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9090925/
@Meriam van Os has joined the channel
@aidanva @Åshild (Ash) Should we just drop the Worobey (HIV) thing?
I say this as actually the sample is younger than 1950 (our rule of thumb), also while it does have consensus sequences, I'm finding it hard to work out which is associated with which
No wait, found a way to associate each one
@aidanva Gok4 didn't have a Y. pestis genome in the end, right?
It's mentioned in the methods but mysteriously disappears from the rest of the paper 😬
but very few reads, so no phylogeny is possible
OK, so not genome reconstruction-level
but edit distances, etc. clearly show it is pestis
@Nikolaos Psonis (Nikos) has joined the channel