by Olivia Smith
Don’t forget to check out Parts 1 and 2!
tl;dr There are tons of options for working with sequencing reads and consensus sequences, so pick whatever floats your boat and/or whatever the scholar nearest you uses. You should learn SAMTools and BCFTools for alignment files. My best advice is (1) read the documentation and (2) read all the options/flags for a function before you use it.
Sequencing reads: fastp, FastQC, GATK
The first step in many analyses is processing the raw sequencing reads in some way. A huge number of tools are out there for this, but I like fastp (Chen et al., 2018), and many folks use GATK from the Broad Institute or SeqKit (Shen et al., 2016). FastQC is another nice QC tool that creates really readable HTML reports!
A few examples of things you might do in ancient metagenomics with reads:
- Filter on base call quality
- Trim adapters that were needed for sequencing and barcoding but that we don’t want in our analysis
- Merge paired-end reads
- Align them!
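To make the first bullet concrete, here is a minimal sketch of filtering on base call quality in plain Python. It is not how fastp or any other tool does it internally; it assumes standard four-line FASTQ records with Phred+33 quality encoding, and the function names, the toy reads, and the mean-quality threshold of 20 are all made up for illustration.

```python
import io


def parse_fastq(handle):
    """Yield (name, sequence, quality_string) for each 4-line FASTQ record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line) stops the parser
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line, ignored here
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual  # drop the leading '@' from the name


def mean_phred(qual, offset=33):
    """Mean Phred score of a quality string (assumes Phred+33 encoding)."""
    return sum(ord(char) - offset for char in qual) / len(qual)


def filter_reads(handle, min_mean_quality=20.0):
    """Keep reads whose mean base quality is at least min_mean_quality."""
    return [
        (name, seq, qual)
        for name, seq, qual in parse_fastq(handle)
        if qual and mean_phred(qual) >= min_mean_quality
    ]


# Toy data: 'I' encodes Phred 40, '!' encodes Phred 0.
toy_fastq = "@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nACGTACGT\n+\n!!!!!!!!\n"
kept = filter_reads(io.StringIO(toy_fastq))
print([name for name, _, _ in kept])  # ['read1']
```

For real data you would reach for one of the tools above, which also handle adapters, paired-end reads, and compressed files; the point here is just what "filter on quality" means.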
Alignment and index files: SAMTools, BCFTools
You need to know how to use SAMTools and BCFTools. They have great documentation and are really powerful, so they are worth the investment to understand. They are also how you will generate the index file that goes with your alignment, which most tools that work with alignments require.
Some examples of what you can do with SAMTools and BCFTools:
- Call variants
- Build a consensus sequence
- Make pileups to count the base calls at each position or compute genotype likelihoods
- Take a random subset or fraction of reads to test on (an option when using view)
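If pileups and consensus sequences feel abstract, this toy sketch shows the idea in plain Python. It is a conceptual illustration only, not how SAMTools or BCFTools work under the hood: it assumes ungapped reads given as hypothetical (start position, sequence) pairs with 0-based coordinates, and it ignores base qualities, mapping qualities, and indels entirely.

```python
from collections import Counter, defaultdict


def pileup_counts(aligned_reads):
    """Count base calls at each reference position.

    aligned_reads: iterable of (start, sequence) pairs, 0-based and ungapped.
    """
    counts = defaultdict(Counter)
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            counts[start + offset][base] += 1
    return counts


def majority_consensus(counts, min_depth=1):
    """Most common base at each position; 'N' where depth is below min_depth."""
    if not counts:
        return ""
    consensus = []
    for pos in range(max(counts) + 1):
        column = counts.get(pos, Counter())
        depth = sum(column.values())
        if depth < min_depth:
            consensus.append("N")  # too little coverage to make a call
        else:
            consensus.append(column.most_common(1)[0][0])
    return "".join(consensus)


# Three toy reads overlapping positions 2-5 of an imaginary reference.
toy_reads = [(0, "ACGT"), (2, "GTTA"), (2, "GATA")]
counts = pileup_counts(toy_reads)
print(dict(counts[3]))                          # {'T': 2, 'A': 1}
print(majority_consensus(counts, min_depth=2))  # NNGTTA
```

Position 3 is the interesting one: two reads say T and one says A, which is exactly the kind of column a variant caller has to make a decision about.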
Consensus sequences: BLAST, Geneious, Benchling, ApE
You’ve perhaps seen these before, as these are the most accessible tools. They often have straightforward user interfaces via applications or browsers! These are great for things like homology analysis (BLAST), or plasmid visualization and genetic engineering (Geneious, Benchling, ApE). These tools are super powerful and flexible, so anything you can imagine doing with a consensus sequence is probably possible in one of these.
Managing environments
As you can tell, there’s a broad range of tools and software used in the field. It is good practice to “manage your computational environments”; you can think of this as organizing boxes of stuff. Environments are isolated computational setups that keep your software dependencies from conflicting and causing problems, and each environment contains only what is necessary for its task. Conda is a widely used package and environment manager that makes installing lots of the tools I’ve mentioned here easy (I use miniconda). You can do similar environment management for Python packages with Python’s built-in virtual environments (venv). Also available are container platforms like Docker or Singularity, which are more common in industry settings or high-performance computing environments.
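As a small illustration of the "boxes" idea, here is a sketch that creates an isolated Python environment with the standard library's venv module. The environment name "reads-env" and the helper function are hypothetical, and note the limitation: a venv only isolates Python packages, so it won't install command-line tools like SAMTools for you the way Conda can.

```python
import tempfile
import venv
from pathlib import Path


def create_project_env(parent_dir, name="reads-env"):
    """Create an isolated Python environment at parent_dir/name."""
    env_dir = Path(parent_dir) / name
    # with_pip=False keeps this fast; set it to True if you want pip inside.
    venv.create(env_dir, with_pip=False)
    return env_dir


# Build the environment in a scratch directory that is cleaned up afterwards.
with tempfile.TemporaryDirectory() as scratch:
    env_dir = create_project_env(scratch)
    # Every venv has a pyvenv.cfg file at its root describing the environment.
    created = (env_dir / "pyvenv.cfg").exists()
    print(created)
```

In everyday use you would more likely create and activate environments from the command line, one per project or per analysis, rather than from a script.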
Olivia Smith is a PhD candidate at The University of Texas at Austin working under the supervision of Dr. Arbel Harpak. You can find her on Twitter at @SmithOliviaS.