@James Fellows Yates has joined the channel
@Brooklynn Scott has joined the channel
@Darío Alejandro Ramirez has joined the channel
@Laura Carrillo Olivas has joined the channel
@Maria Lopopolo has joined the channel
@Merlin Szymanski has joined the channel
@Mohamed Sarhan has joined the channel
@Nora Bergfeldt has joined the channel
@Nikolay Oskolkov has joined the channel
Good idea @Maria Lopopolo!
Could you please provide us with the link to the gather.town space?
*Thread Reply:* Also good idea 🙂
All information here: https://spaam-community.github.io/wss-summer-school/#/2022/
Gather.Town link: https://app.gather.town/app/PlXjb0deog0B4JCq/spaam-community
@channel Hi All! We’re meeting in the GatherTown Lecture Hall
Hi All, I misspoke earlier - we have an index set of 195 F and 195 R that we use on rotation. Most labs don’t use quite this many - it really depends on your throughput
@channel please can everyone move to the purple room!
Hi All, there were some great questions earlier about sequencing. Here are two additional tips:
You want to match your expected DNA length to your sequencing chemistry. Illumina sequencing kits come in a few flavors depending on the instrument model, typically 100 cycle, 150 cycle, and 300 cycle kits. You can use these to sequence single end (SE) 100 bp, 150 bp, or 300 bp; or you can use them to do paired end (PE) sequencing of 2x50 bp, 2x75 bp, or 2x150 bp. Some sequencing centers also allow you to use the 300 cycle kit to do 2x100 bp.
For example, imagine you are sequencing a really important set of libraries on a NextSeq (with 2-color chemistry). Your TapeStation says that your modal DNA length is 60 bp (with some spread on either side). I’d recommend using 2x75 bp sequencing. That will allow you to sequence everything up to 150 bp, which is probably almost all of your data. This will maximize your sequencing efficiency and give you high-quality, high-confidence data.
You DO NOT want to sequence with 2x150 bp chemistry. Beyond being a waste of money (because you are paying for sequence data you won’t get), it will also reduce the calculated basecalling quality of the run and could cause it to fail. This is because by the time the instrument reaches cycle 120 or so, you will probably have already sequenced almost everything that is there, and most of the clusters will just be showing black. This will cause the instrument to have trouble locating the clusters, and the software is likely to interpret this as an error and stop the run or flag it as failed.
So, a good rule of thumb is as follows: 2x75 bp or 2x100 bp for ancient microbial DNA; 2x150 bp for modern DNA (for genomic DNA sheared to 500 bp). That will give you the best possible data.
Use paired end sequencing for microbial DNA. Although people doing human aDNA sequencing sometimes use SE sequencing, I STRONGLY recommend that you only use PE sequencing for microbial DNA. This is because de novo assembly works best with PE data. You can force de novo assemblers to use SE data, but it doesn’t work very well and gives lower quality results. PE data also gives you higher quality basecalling. So if you always PE sequence microbial DNA, you will have high quality data that you can also use to make MAGs (metagenome-assembled genomes).
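The matching rule in the first tip can be sketched as a small helper. This is only an illustration under stated assumptions: the function name and the list of read lengths are hypothetical choices based on the kit options mentioned above, and the rule used (a 2xN run covers fragments up to 2N bp) follows the "2x75 bp covers everything up to 150 bp" reasoning in the message. It ignores the overlap that read merging needs in practice.

```python
# Paired-end read lengths (bp per read) taken from the options listed in the message.
# 2x100 is only offered by some sequencing centers, so treat this list as an assumption.
PAIRED_END_READ_LENGTHS = (50, 75, 100, 150)


def choose_paired_end_read_length(longest_expected_fragment_bp):
    """Return the shortest read length N such that 2 x N covers the fragment.

    Falls back to the longest available option when no configuration
    covers the whole fragment (e.g. modern DNA sheared to ~500 bp).
    """
    for read_length in PAIRED_END_READ_LENGTHS:
        if 2 * read_length >= longest_expected_fragment_bp:
            return read_length
    return PAIRED_END_READ_LENGTHS[-1]


if __name__ == "__main__":
    # Ancient library where almost everything is at most 150 bp long.
    print(choose_paired_end_read_length(150))
    # Modern genomic DNA sheared to about 500 bp.
    print(choose_paired_end_read_length(500))
```

The input here is the upper end of the fragment distribution, not the mode, because the aim is to avoid paying for cycles after most clusters have run out of template.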
@channel Hi All, we’re gathering now for the Roundtable! Please return and turn on your webcams
And everyone please take a seat at one of the tables
Since yesterday I have tried a couple of times to do this step, but I'm stuck:
$ mkdir images
$ while read filepath; do
> echo "${filepath}" images/$(basename ${filepath})
> # mv ${filepath} images/$(basename ${filepath})
> done < File_names.txt
It created the file, but it is empty.
Is filepath a variable? I didn't find it in the presentation. Or do I need to put a specific path?
*Thread Reply:* so ${filepath} is indeed a variable that stores one line of File_names.txt
*Thread Reply:* the while loop will read one line at a time from File_names.txt, setting ${filepath} to that line on each iteration
*Thread Reply:* what does your file File_names.txt contain?
*Thread Reply:* is empty
*Thread Reply:* ah, that's why the while loop is not printing anything, you will need to run this before:
suffix="jpg"
find Boosted-BBB/ -type f -name "*${suffix}" > File_names.txt
*Thread Reply:* and check that File_names.txt contains the paths to your jpg files
*Thread Reply:* let me know if not, and we can continue to debug 🙂
*Thread Reply:* Thank you very much!! I did all the steps again and it worked!! To which email do I send the new script for the image sorting with new folders?
*Thread Reply:* aida_andrades@eva.mpg.de
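For comparison, the find-then-loop steps from this thread can be sketched in Python. This is a hedged illustration, not the course solution: the function name, its parameters, and the dry-run default are hypothetical. It mirrors the shell logic of listing every file with the given suffix, then printing (or performing) the move into one folder.

```python
from pathlib import Path
import shutil


def collect_images(source_dir, dest_dir, suffix="jpg", dry_run=True):
    """List files ending in .<suffix> under source_dir and plan their move to dest_dir.

    With dry_run=True nothing is moved; the planned (source, target) pairs
    are only returned, like the echo line in the shell loop.
    """
    source = Path(source_dir)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    planned = []
    # sorted() materialises the listing first, like writing File_names.txt
    # before the loop starts.
    for path in sorted(source.rglob(f"*.{suffix}")):
        if not path.is_file():
            continue
        target = dest / path.name
        planned.append((path, target))
        if not dry_run:
            shutil.move(str(path), str(target))
    return planned


if __name__ == "__main__":
    for src, dst in collect_images("Boosted-BBB", "images"):
        print(src, dst)
```

As in the shell version, an empty listing means the loop body never runs, so checking the list first is the quickest way to debug.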
@channel I realized after the talk that I described PCoA and CLR slightly incorrectly in the talk, so I updated the corresponding slide in the presentation (slide 70) with the correct information. Sorry for the confusion!
I just saw that we have a few gather.town twins!
@Ina Wasmuth and @Sierra Blunt as blondies
And Markus and @Tre Blohm as beardy bros
@channel for those who were interested in the Git setup re-do next week: please indicate your availability here - https://www.when2meet.com/?16311979-XsKID (all times are Berlin time, the session would only be 1h tops, and we could try to do it on your own personal laptops/servers)
Thank you so much for the outstanding organization and content! I learnt so many exciting things....and I love gathertown! 🙂
Thankssssss for organising this boot camp, I was searching for materials to learn how to do ancient microbial analysis and then saw this summer school, too good to be true! I got inspired a lot and it’s so fun to see people from many different places sharing similar scientific interests! 🍻
I'm glad you enjoyed it and found it useful! Now go out and spread the knowledge!!
@channel the time (other than this afternoon) that suits most people who were interested in the re-do of the git setup session is this Thursday, August 11th, at 11am!
Please mark that in your calendars. We will meet in gather.town again 🙂
You're also welcome to join if you didn't fill in the when2meet or the poll!
We will be setting up the SSH keys properly for you all on your laptops/computers/servers, wherever you want 😉
PM your google account if you want a google calendar invite
*Thread Reply:* Invited! You can delete this message if you want 🙂
@channel we are starting the git v2 thing (/personal git set up)
@Pooja Mehta @Andrea Musso @Kadri Irdt @Laura Carrillo Olivas if you're around
Oops sorry wrong Pooja - Sorry @Pooja Swali you can leave this channel 🙂
We are still here for another twenty minutes!
Thanks a lot @James Fellows Yates and @Megan Michel for the Github session today ☺️
I'm so sorry!! I didn't hear my alarm 🙁and didn't wake me up!! But I will check everything and if I have any doubts I will not hesitate to tell you, thank you so much!! ❤️
*Thread Reply:* Don't apologise, I was surprised you indicated you could make that time! If you have time in your morning/my afternoon today, we could quickly meet if you want?
*Thread Reply:* I just saw the message, I can do any day after 3:00 pm Berlin time 😊
*Thread Reply:* Shall we book 15:00-15:45 on Monday (15th)?
*Thread Reply:* yes! Perfect!
Ok, if anyone else is interested in getting help setting up their GitHub SSH keys on their servers/laptops, we will do one more git session on Monday 15th August at 15:00-15:45 Berlin time
I'll be in gather in 2 minutes
Hi All, a nice summary of how to approach functional analysis was just published in PLOS Computational Biology
I would add to Tip 5 - always perform an effect size calculation. Remember that a p-value has NO biological meaning. Smaller p-values are not "more significant", they mean your observations are less likely to be due to chance. Effect size tests tell you the size of the difference between groups, so look for bigger effect sizes. A gene/pathway with a big effect size and p > 0.05 may be more interesting/informative than a gene/pathway with a small effect size and a very small p-value. (p < 0.05 was arbitrarily selected as a standard cut-off anyway, so don't throw out "non-significant" data till you've looked at effect sizes)
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010348
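As a minimal sketch of the effect-size idea in the tip above: Cohen's d is used here only as one familiar example, since the message does not name a specific measure, and other measures may suit compositional or non-normal data better. The function name and the toy abundance values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardised mean difference between two groups (pooled standard deviation)."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_variance)


if __name__ == "__main__":
    # Toy abundances of one pathway in two sample groups (made-up numbers).
    print(cohens_d([2, 3, 4], [1, 2, 3]))
```

Reporting a value like this next to each p-value makes it easier to spot features with a large difference between groups that narrowly miss the 0.05 cut-off.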
Hi folks!
I need your help 🙂
We have received comments on one of my PhD manuscripts and one of the reviewers is insisting that we have to use MALT for taxonomic classification. I tried to use malt-build (v 0.5.0) on ~16,000 bacterial genomes, but it keeps giving a memory error. I allocated even more memory, up to 1TB 🤯, and am still getting the same error (as shown below).
Do you have any advice?
Best
Number input files: 16,637
Loading FastA files:
10% 20% 30% 40% 50% 60% 70% 100% (827.0s)
java.lang.OutOfMemoryError: Java heap space
at malt.io.FastAFileIteratorBytes.next(FastAFileIteratorBytes.java:155)
at malt.data.ReferencesDBBuilder.loadFastAFile(ReferencesDBBuilder.java:151)
at malt.data.ReferencesDBBuilder.loadFastAFiles(ReferencesDBBuilder.java:134)
at malt.MaltBuild.run(MaltBuild.java:226)
at malt.MaltBuild.main(MaltBuild.java:57)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.exe4j.runtime.LauncherEngine.launch(LauncherEngine.java:65)
at com.install4j.runtime.launcher.UnixLauncher.main(UnixLauncher.java:57)
*Thread Reply:* Hi Mohammed - depending on how you initially did your classification you may be able to push back against using MALT.
MALT is infamous for memory requirements that exceed most labs' computational resources, which is a major blocker. You can easily say so in your reply to the editor and reviewer.
In this case your only solution is to reduce the number of reference genomes you're inputting into your database. If you have not done it already, you could make sure you're only picking one representative genome per species, for example.
However, you have to remember that MALT is not particularly special. The only benefit is that it's maybe slightly more specific because it does an alignment rather than slightly fuzzy k-mer matching, and with that alignment you can generate damage plots etc. But you can also just generate alignments by mapping yourself afterwards. It performs LCA just as e.g. Kraken does; Kraken is just more sensitive, so you'll pick up more false positives, but you can raise your support threshold a bit.
Furthermore, MALT 0.5.* and MALT 0.4.* are both actually broken:
http://megan.informatik.uni-tuebingen.de/t/lca-placement-failure-with-malt-v-0-5-2-and-0-5-3/1996 http://megan.informatik.uni-tuebingen.de/t/unable-to-change-the-default-coverage-parameter-for-naive-lca-assignment-with-malt-v-0-4-1/2032
This would force you to use an ancient version (0.3.8) but this will mean you will have to use a very out of date taxonomy as the version doesn't have updated Megan files anymore.
So ultimately, depending on why your reviewer wants you to run MALT you can quite robustly just say: we can't because it's broken and is outside our computational capacity (and that's with a 1TB node!)
*Thread Reply:* Hi @Mohamed Sarhan, I would also add that you can argue that MALT produces a similar taxonomic table to Kraken, especially after filtering out low abundance taxa, despite performing alignment rather than k-mer matching, by citing this paper. I used CLARK-S, which performs k-mer matching highly similar to Kraken. https://journals.asm.org/doi/full/10.1128/mSystems.00080-18
*Thread Reply:* Hi @Mohamed Sarhan, I agree with @James Fellows Yates and @irinavelsko that the lists of detected microbes are typically rather similar for Kraken and MALT, provided you use similar databases (same organisms) for both. However, what Kraken absolutely cannot do is authentication, so you need some alignments for following up detected microbes. Bowtie2 alignments might be good enough for validation / authentication, but they lack LCA, which is a drawback.
Now, regarding your technical issue, you can increase MALT java heap space by manually modifying the -Xmx flag in malt-build.vmoptions file which is located in /opt/malt/class folder in your malt installation. By default, I believe it is "-Xmx512m" but you can specify "-Xmx1000G" for example. Please try it and let me know whether it has worked
*Thread Reply:* Oh you can also try increasing the step size of the seeds in malt build
*Thread Reply:* I think Ron and Felix found you can reasonably go down to 8 with minimal loss of sensitivity
*Thread Reply:* Hmm, yes, I think --step 1 is the default. We tried --step 2 and --step 3, if I remember correctly, it does decrease the database size but from what we saw, it also decreases the accuracy of taxonomic classification. I do not recognize --step 8 🙂
*Thread Reply:* Thank you so much for your constructive replies. Really appreciate your help 😊 Thank you @James Fellows Yates, that's so convincing - It makes no sense to go back and use an outdated taxonomy. We used DIAMOND/MEGAN, MetaPhlAn3, and Kraken2/Bracken for taxonomic classification check, but include in the manuscript only the DIAMOND/MEGAN results. I will include these arguments and the paper @irinavelsko linked here. I hope that would be enough to convince the reviewer. Thank you @Nikolay Oskolkov - The "Xmx" was set to 64G, now changed it to 800G and it is running with the default step size 👍
*Thread Reply:* Ok - DIAMOND could be your issue there and why the reviewer is asking for MALT
*Thread Reply:* DIAMOND will not work well with short aDNA reads
*Thread Reply:* because it translates to very short amino acid sequences and will not be specific enough
*Thread Reply:* https://peerj.com/preprints/27166/
*Thread Reply:* (which uses a BLASTX mode of MALT, but similar thing)
*Thread Reply:* Agree, this could be the reason - We will add the output of MetaPhlAn3 and Kraken2 as well. For the DIAMOND, we use it against the NCBI-nr database, that's why we like it because it gives a comprehensive picture on everything we have in our samples (Human DNA, microbiome, and dietary components). Then we keep going with further confirmation with specialized curated databases.
*Thread Reply:* Just to inform you, here is a comparison between the bacterial assigned reads using MALT/BLASTn against the representative bacterial genomes (~16,600 genomes) and DIAMOND/BLASTx against the NCBI-nr database. The numbers of assigned reads differ from sample to sample. Looking forward to discussing this in more detail during the upcoming SPAAM4 🙂
*Thread Reply:* @Mohamed Sarhan so did you manage to build the Malt DB?
*Thread Reply:* Yes, @Nikolay Oskolkov.. Thanks to your suggestion 🙏. It worked once we changed the file you referred to. It needed ~700GB to build it.
*Thread Reply:* Good! It is interesting that MALT / BLASTN does not always assign more reads than DIAMOND / BLASTX
*Thread Reply:* Indeed... Do you have any read length stats?
*Thread Reply:* Here is a read-length distribution for 4 of the samples (Just plotted them now 😄)
*Thread Reply:* Hm ok that's not what I expected
*Thread Reply:* Hmm, I would not use reads below 30 bp, but those I believe should not be assigned by DIAMOND to any organism at all because they are too short. So this does not explain why DIAMOND gives more assigned reads than MALT for some samples. Were all adapters trimmed prior to mapping with MALT?
*Thread Reply:* Yes, these are already quality-filtered adapter-trimmed merged deduplicated reads
*Thread Reply:* In my opinion it might have to do with the database itself and the sample microbial composition. I think the default word-size for the BLASTx is 6 and can be adjusted to 3 or 2 (I'm not sure about DIAMOND/BLASTx word size), but if it is so, it could mean the short-fragments of ~18 nt can be still seeded and assigned (Just in theory). What do you think?
*Thread Reply:* @Mohamed Sarhan I do not know about word size, but it seems plausible to me that this effect has to do with the NR/NT (used for Diamond) vs. 16 000 genomes (used for Malt). Since the former is much bigger, that might indeed result in more reads assigned by Diamond (higher sensitivity)
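The arithmetic behind the short-read concern in this thread can be sketched as follows. It only counts codons and possible seed words in a single reading frame. The word sizes used in the example (6 and 3) are the values floated in the thread, not verified DIAMOND defaults, and the function names are hypothetical.

```python
def translated_length(read_length_nt):
    """Number of complete codons (amino acids) in one reading frame of a read."""
    return read_length_nt // 3


def seed_word_count(read_length_nt, word_size_aa):
    """Number of distinct word start positions in the translated read."""
    peptide_length = translated_length(read_length_nt)
    return max(0, peptide_length - word_size_aa + 1)


if __name__ == "__main__":
    for read_length in (18, 30, 60):
        print(
            read_length,
            translated_length(read_length),
            seed_word_count(read_length, 6),
            seed_word_count(read_length, 3),
        )
```

Under these assumptions a 30 bp read yields only 10 amino acids, and an 18 nt fragment yields exactly one 6-residue word, which matches the "can still be seeded in theory" point above while showing how little sequence is left for a specific match.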
Hello everyone! I was looking, but I only found the pdf of the lessons in the link that you sent us, could you share the link with the recordings please? I really wish I could see them again =)
They are coming VERY soon! Don't worry!
Very soon being ™️, only one and a half things missing now
Hi EVA sediment people! SPAAM4 is happening this week and some of the pathogen and microbiome people are setting up a group viewing, maybe at the institute, that you're welcome to attend. We didn't see anyone from the EVA sedaDNA groups registered, but also don't know how to reach everyone in one go, so if you can let the rest of your groups know we'd appreciate it (it's a no-PI conference, and a perfect chance to hear about the latest in ancient metagenomics from the PhDs and postdocs doing the work, and to meet people in your field across the world!).
Follow-up for EVA people, we'll be in either the Aquarium or Terrarium, depending on what's available, at 4pm when SPAAM4 starts today. Feel free to join us there!
Hi!! Help please! Could you teach me how I can get and plot the percent identity of my bam file? Is it calculated before or after the capture of the pathogen?
Hey @Laura Carrillo Olivas
I’d encourage you to ask further questions in the #no-stupid-questions channel, you’re more likely to get an answer there 😉
Regarding the percent identity, you can retrieve it using the NM/MD tags of your bam file, the alignment length, and the read length.
You can do it either by using samtools and writing a parser yourself, or better, by using a library such as pysam https://pysam.readthedocs.io/en/stable/index.html
The doc of the sam/bam file format is also very informative when you look for this kind of information https://samtools.github.io/hts-specs/SAMtags.pdf
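A minimal sketch of the "write a parser yourself" route, working on plain SAM text lines such as those printed by samtools view. It assumes one common definition of identity, (alignment columns - NM) / alignment columns, where alignment columns are the M, =, X, I and D operations of the CIGAR string. Other definitions exist, so treat the formula and the function name as assumptions.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OPERATION = re.compile(r"(\d+)([MIDNSHP=X])")


def percent_identity(sam_line):
    """Percent identity of one SAM record, or None if it cannot be computed."""
    fields = sam_line.rstrip("\n").split("\t")
    cigar = fields[5]
    if cigar == "*":
        return None  # unmapped read or missing CIGAR

    # Optional tags start at column 12; NM is the edit distance to the reference.
    edit_distance = None
    for tag in fields[11:]:
        if tag.startswith("NM:i:"):
            edit_distance = int(tag[5:])
            break
    if edit_distance is None:
        return None

    # Alignment columns: matches/mismatches plus inserted and deleted bases.
    alignment_columns = sum(
        int(length)
        for length, operation in CIGAR_OPERATION.findall(cigar)
        if operation in "MID=X"
    )
    if alignment_columns == 0:
        return None
    return 100.0 * (alignment_columns - edit_distance) / alignment_columns


if __name__ == "__main__":
    record = "\t".join(
        ["read1", "0", "ref", "100", "37", "50M", "*", "0", "0",
         "A" * 50, "I" * 50, "NM:i:2"]
    )
    print(percent_identity(record))
```

With pysam the same two inputs are available on each AlignedSegment through get_tag("NM") and cigartuples, so the calculation carries over directly. Collecting the values per read then gives the distribution to plot.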
*Thread Reply:* thank you!!! ❤️