Wrangling Sequencing Data. Part 3: Tools

Don’t forget to check out Parts 1 and 2!

tl;dr There are tons of options for working with sequencing reads and consensus sequences, so pick whatever floats your boat and/or whatever the nearest-to-you scholar uses. For alignment files, you should learn SAMTools and BCFTools. My best advice is (1) read the documentation and (2) read all the options/flags for a command before you use it.

Sequencing reads: fastp, fastqc, GATK

The first step in many analyses is processing the raw sequencing reads in some way. A huge number of tools exist for this. I like fastp (Chen et al., 2018), and many folks use GATK from the Broad Institute or SeqKit (Shen et al., 2016). FastQC is another nice QC tool that creates really readable HTML reports!

A few examples of things you might do in ancient metagenomics with reads:

Filter on base call quality
Trim the adapters used for sequencing and barcoding, which we don’t want in our analysis
Merge paired-end reads
Align them!
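
To make the first two items on that list concrete, here is a minimal pure-Python sketch of quality filtering and adapter trimming. It is a toy for building intuition, not how fastp or any other tool works internally. The function names, the mean-quality threshold of 20, the minimum length, and the example adapter sequence are all my own assumptions, and it assumes Phred+33 quality encoding and exact adapter matches. For real data, use a dedicated tool.

```python
from typing import Iterable, Iterator, NamedTuple


class Read(NamedTuple):
    name: str
    seq: str
    qual: str  # quality string, assumed to be Phred+33 encoded


def parse_fastq(lines: Iterable[str]) -> Iterator[Read]:
    """Yield reads from simple 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # skip blank lines between or after records
        seq = next(it).strip()
        next(it)  # the '+' separator line
        qual = next(it).strip()
        yield Read(header.strip().lstrip("@"), seq, qual)


def mean_quality(qual: str) -> float:
    """Mean Phred score of a Phred+33 quality string (0.0 if empty)."""
    return sum(ord(c) - 33 for c in qual) / len(qual) if qual else 0.0


def trim_adapter(read: Read, adapter: str) -> Read:
    """Cut the read at the first exact occurrence of the adapter, if any."""
    pos = read.seq.find(adapter)
    if pos == -1:
        return read
    return Read(read.name, read.seq[:pos], read.qual[:pos])


def clean_reads(reads: Iterable[Read], adapter: str,
                min_mean_q: float = 20.0, min_len: int = 5) -> Iterator[Read]:
    """Trim the adapter, then keep reads that are long and good enough."""
    for read in reads:
        read = trim_adapter(read, adapter)
        if len(read.seq) >= min_len and mean_quality(read.qual) >= min_mean_q:
            yield read


# Hypothetical example data: read1 carries an adapter, read2 has poor quality.
ADAPTER = "AGATCGGAAGAGC"  # example adapter sequence, chosen for illustration
example = [
    "@read1", "ACGTACGT" + ADAPTER, "+", "I" * (8 + len(ADAPTER)),
    "@read2", "ACGTACGTACGT", "+", "!" * 12,
]
kept = list(clean_reads(parse_fastq(example), adapter=ADAPTER))
for read in kept:
    print(read.name, read.seq, f"mean Q = {mean_quality(read.qual):.1f}")
```

Real tools do much more than this: they tolerate mismatches in the adapter, trim by sliding-window quality, and handle paired-end reads, which is exactly why it pays to read their option lists.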

Alignment and index files: SAMTools, BCFTools

You need to know how to use SAMTools and BCFTools. They have great documentation and are really powerful, so they are worth the investment to understand. SAMTools is also how you will generate the index file that goes with your alignment (with samtools index), which most tools that read alignments require.

Some examples of what you can do with SAMTools and BCFTools:

Call variants
Build a consensus sequence
Make pileups to count the base calls at each position or compute genotype likelihoods
Take a random subset or fraction of reads to test on (an option when using samtools view)
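
As a rough sketch of how a few of the tasks above map onto actual commands, the Python below builds (but does not run) some common samtools and bcftools invocations. The file names and helper function names are hypothetical, and the flags shown are just one reasonable choice, so check them against the documentation for your installed version before relying on them. To execute a command, you could pass one of the lists to subprocess.run.

```python
import shlex
from typing import List


def index_commands(bam: str, sorted_bam: str) -> List[List[str]]:
    """Coordinate-sort an alignment, then build its index file."""
    return [
        ["samtools", "sort", "-o", sorted_bam, bam],
        ["samtools", "index", sorted_bam],
    ]


def subsample_command(bam: str, out_bam: str, seed: int, fraction: float) -> List[str]:
    """Keep a random fraction of reads with `samtools view -s SEED.FRACTION`.

    The integer part of the -s value is the random seed and the part after
    the decimal point is the fraction of reads to keep.
    """
    frac = round(fraction, 4)
    if not 0 < frac < 1:
        raise ValueError("fraction must be strictly between 0 and 1")
    digits = f"{frac:.4f}".split(".")[1]
    return ["samtools", "view", "-b", "-s", f"{seed}.{digits}", "-o", out_bam, bam]


def variant_call_pipeline(ref: str, bam: str, out_vcf: str) -> str:
    """Shell pipeline: pileup with genotype likelihoods, then variant calling."""
    mpileup = ["bcftools", "mpileup", "-f", ref, bam]
    call = ["bcftools", "call", "-m", "-v", "-O", "z", "-o", out_vcf]
    return f"{shlex.join(mpileup)} | {shlex.join(call)}"


def consensus_commands(ref: str, vcf: str, out_fasta: str) -> List[List[str]]:
    """Index the compressed VCF, then apply its variants to the reference."""
    return [
        ["bcftools", "index", vcf],
        ["bcftools", "consensus", "-f", ref, "-o", out_fasta, vcf],
    ]


# Hypothetical file names, for illustration only.
for cmd in index_commands("sample.bam", "sample.sorted.bam"):
    print(shlex.join(cmd))
print(shlex.join(subsample_command("sample.sorted.bam", "subset.bam", seed=42, fraction=0.1)))
print(variant_call_pipeline("ref.fa", "sample.sorted.bam", "calls.vcf.gz"))
for cmd in consensus_commands("ref.fa", "calls.vcf.gz", "consensus.fa"):
    print(shlex.join(cmd))
```

Building the commands as lists first makes it easy to print and double-check them against the documentation before anything touches your data.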

Consensus sequences: BLAST, Geneious, Benchling, ApE

You’ve perhaps seen these before, as these are the most accessible tools. They often have straightforward user interfaces via applications or browsers! These are great for things like homology analysis (BLAST), or plasmid visualization and genetic engineering (Geneious, Benchling, ApE). These tools are super powerful and flexible, so anything you can imagine doing with a consensus sequence is probably possible in one of these.

Managing environments

As you can tell, there’s a broad range of tools and software used in the field. It is good practice to “manage your computational environments”; you can imagine this as organizing boxes of stuff. Environments are isolated computational setups that keep your software dependencies from conflicting and causing problems, and each environment contains only what is necessary for its task. Conda is a widely used package and environment manager that makes it easy to install lots of the tools I’ve mentioned here (I use miniconda). You can do similar environment management for Python packages with Python virtual environments. Container platforms like Docker or Singularity are also available; these are more common in industry settings or on high-performance computing systems.
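
One quick way to see what “only what is necessary for that task” means in practice is to ask which tools the currently active environment can actually find. The sketch below uses only Python’s standard library; the helper names and the list of executable names are my assumptions, so adjust them to whatever your environment is supposed to contain.

```python
import shutil
from typing import Dict, Iterable, List, Optional


def find_tools(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each executable name to its path on PATH, or None if not found."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names: Iterable[str]) -> List[str]:
    """Return the sorted names of executables not found on PATH."""
    return sorted(name for name, path in find_tools(names).items() if path is None)


# Assumed executable names for the tools mentioned in this post.
tools = ["fastp", "fastqc", "samtools", "bcftools"]
for name, path in find_tools(tools).items():
    print(f"{name:10s} {path or 'NOT FOUND in this environment'}")
```

Running this inside two different environments should give different answers, which is the whole point: each box holds only its own stuff.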

Olivia Smith is a PhD candidate at The University of Texas at Austin working under the supervision of Dr. Arbel Harpak. You can find her on Twitter at @SmithOliviaS.

