Biocontainer Tools

Astral Tree

The astral-tree tool estimates an unrooted species tree from a set of unrooted gene trees. It searches for the species tree that shares the maximum number of induced quartet trees with the input gene trees. The method is statistically consistent under the multi-species coalescent model, and its constrained search space grows at most linearly with the number of species and genes.

The main input for this tool is a set of unrooted gene trees in Newick format, and it returns an unrooted species tree. It is recommended that you contract very low support branches (e.g. below 10% bootstrap support) before running it. When several individuals are sampled per species, the accompanying mapping file should have one species per line, and each species name must differ from the names of its individuals. Astral estimates terminal branch lengths only for species for which more than one individual is supplied. Branch lengths are reported in coalescent units and are a direct measure of the amount of discordance in the gene trees.

For additional information please see the documentation

ml biocontainers astral-tree
astral -i song_mammals.424.gene.tre -o song_mammals.tre

Parallel Capabilities: Single core only.
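For datasets with several individuals per species, the naming rule described above can be illustrated with a small mapping file. This is only a sketch: the species and individual names are hypothetical, the `species:individual,individual` layout and the `-a` option are taken from the ASTRAL documentation rather than from this page, and `name_clashes` is a helper invented here to report any name used both for a species and for an individual.

```shell
# Hypothetical mapping file: species name, colon, comma-separated individuals.
cat > species_map.txt <<'EOF'
human:human_1,human_2
chimp:chimp_1
gorilla:gorilla_1,gorilla_2,gorilla_3
EOF

# name_clashes FILE -> print every name used both as a species and as an individual
name_clashes() {
    awk -F: '
        { species[$1] = 1
          n = split($2, ind, ",")
          for (i = 1; i <= n; i++) seen[ind[i]] = 1 }
        END { for (s in species) if (s in seen) print s }
    ' "$1"
}

if [ -z "$(name_clashes species_map.txt)" ]; then
    echo "species and individual names are distinct"
fi

# With the module loaded, the mapping file would be passed with -a (assumed option):
# astral -i gene_trees.tre -a species_map.txt -o species.tre
```

The check is deliberately separate from the astral call so that a naming problem is caught before a long run starts.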

BCFTools

BCFTools is a suite of utilities that manipulate variant calls in the Variant Call Format (VCF) and its binary counterpart BCF. The tool is designed to work on data streams and can read input from stdin and write output to stdout. When several files are passed to a command, they must be compressed and indexed.

This suite consists of the following commands:

Command	Description
`annotate`	add or remove annotations of a VCF
`call`	SNP/indel calling
`cnv`	Copy number variation caller
`concat`	concatenate VCF/BCF files from the same set of samples
`consensus`	create a consensus sequence by applying VCF variants
`convert`	convert VCF to BCF and vice versa
`csq`	haplotype aware consequence caller
`filter`	filter VCF/BCF files using fixed thresholds
`gtcheck`	check sample concordance, detect sample swaps and contamination
`head`	view file headers
`index`	index files
`isec`	intersections of files
`merge`	merge files from non-overlapping sample sets
`mpileup`	multi-way pileup producing genotype likelihoods
`norm`	normalize indels
`plugin`	run user-defined plugin
`polysomy`	detect contaminations and whole-chromosome aberrations
`query`	transform files into user-defined formats
`reheader`	modify header and change sample names
`roh`	identify runs of homo/auto-zygosity using an HMM
`sort`	sort files
`stats`	produce stats, which can be visualized with `plot-vcfstats`
`view`	subset, filter, and convert files

Two additional scripts are bundled with bcftools: gff2gff and plot-vcfstats. gff2gff converts a GFF file to the format required by csq, while plot-vcfstats plots the output of stats.

For additional information please see the documentation

ml biocontainers bcftools
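# "-s -" includes all samples in the per-sample statistics
bcftools stats -s - <input.vcf.gz> > file.vchk
plot-vcfstats -p outdir file.vchk
# decompress the GFF before converting it, then recompress the result
zcat in.gff.gz | gff2gff | gzip -c > out.gff.gz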

Parallel Capabilities: Single core only.
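The requirement that multiple input files be compressed and indexed can be sketched as a small helper. This is a minimal sketch under stated assumptions: `run` and `merge_vcfs` are hypothetical helper names, the file names are placeholders, and `merge` is used only as one example of a multi-file command (it expects files with non-overlapping sample sets). Setting `DRY_RUN=1` prints the commands instead of executing them.

```shell
# run CMD ARGS... -> print the command, and execute it unless DRY_RUN=1
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

# merge_vcfs OUT.vcf.gz IN1.vcf IN2.vcf ... (hypothetical helper)
merge_vcfs() {
    out=$1
    shift
    for vcf in "$@"; do
        run bcftools view -Oz -o "$vcf.gz" "$vcf" || return 1   # write a compressed copy
        run bcftools index "$vcf.gz" || return 1                # index the compressed copy
    done
    # Rebuild the argument list so that every input carries the .gz suffix.
    for vcf in "$@"; do
        set -- "$@" "$vcf.gz"
        shift
    done
    run bcftools merge -Oz -o "$out" "$@"
}
```

For example, running `merge_vcfs merged.vcf.gz a.vcf b.vcf` with `DRY_RUN=1` set prints the five bcftools commands that would otherwise be executed.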

Bowtie2

Bowtie2 is a tool for aligning sequencing reads to long reference genomes. It indexes the reference genome with an FM index and then aligns the reads. The resulting alignments are output in SAM format. By default alignments are end-to-end, but local alignment can be selected instead. In local mode bowtie2 may trim some read characters from either end to maximize the alignment score. Two additional tools are loaded with bowtie2: bowtie2-build and bowtie2-inspect.

When searching for alignments, bowtie2 looks for both concordant and discordant alignments, meaning those where the pairs align in the expected orientation within the expected range as well as those that do not. Additionally, it does NOT guarantee that the alignment reported is the best possible in terms of alignment score. When two alignments are equally good, it uses a pseudo-random number to choose between them.

For additional information and examples please see the documentation

ml biocontainers bowtie2
# for unpaired reads
bowtie2 -x <basename of the index for reference genome> -U longreads.fq
# for paired reads
bowtie2 -x <basename of the index for reference genome> -1 forward_reads.fq -2 reverse_reads.fq

Parallel Capabilities: Single core default, Multithreading options supported.
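The index-then-align workflow and the multithreading options can be combined as in the following sketch. It assumes a Slurm job, where `SLURM_CPUS_PER_TASK` holds the number of allocated cores (an assumption about the scheduler, not something stated here); `align_local` and all file names are hypothetical.

```shell
# Threads granted by the scheduler; falls back to 1 outside a Slurm job (assumption).
threads=${SLURM_CPUS_PER_TASK:-1}

# align_local REF.fa INDEX_BASENAME READS.fq OUT.sam (hypothetical helper)
align_local() {
    # Build the FM index, then align unpaired reads in local mode.
    bowtie2-build --threads "$threads" "$1" "$2" &&
    bowtie2 --local -p "$threads" -x "$2" -U "$3" -S "$4"
}
```

With the module loaded, `align_local reference.fa ref_index reads.fq aligned.sam` would build the index and write local alignments to aligned.sam.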

Busco

Busco is a suite that provides a quantitative assessment of the completeness, in terms of expected gene content, of a genome assembly, transcriptome, or annotated gene set. Its output file reports counts of complete single-copy, complete duplicated, fragmented, and missing genes.

For additional information see the documentation

ml biocontainers busco
# Genome mode
busco -m genome -i INPUT.nucleotideFile -o OUTPUTNAME -l SpeciesLineage
# Protein mode
busco -m proteins -i INPUT.amino_acidsFile -o OUTPUTNAME -l SpeciesLineage
# Transcriptome mode
busco -m transcriptome -i INPUT.nucleotideFile -o OUTPUTNAME -l SpeciesLineage
# can also have it auto determine lineage with --auto-lineage

Parallel Capabilities: Single core default, Multithreading options supported.
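The four categories are usually reported on one compact line in BUSCO notation, which is easy to pick apart in a script. The sketch below assumes that one-line notation; the sample values are made up for illustration and `busco_field` is a hypothetical helper.

```shell
# Illustrative summary line in BUSCO notation (values are made up).
summary='C:95.0%[S:93.0%,D:2.0%],F:3.0%,M:2.0%,n:255'

# busco_field LETTER LINE -> percentage reported for that category
busco_field() {
    printf '%s\n' "$2" | sed -n "s/.*$1:\([0-9.]*\)%.*/\1/p"
}

echo "complete:    $(busco_field C "$summary")%"
echo "single-copy: $(busco_field S "$summary")%"
echo "duplicated:  $(busco_field D "$summary")%"
echo "fragmented:  $(busco_field F "$summary")%"
echo "missing:     $(busco_field M "$summary")%"
```

In practice the line would be read from the short summary file that a run writes to its output directory rather than from a hard-coded string.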

Burrows-Wheeler Aligner (BWA)

BWA is a software package for mapping low-divergent sequences against a large reference genome. It consists of three algorithms: BWA-backtrack, BWA-SW, and BWA-MEM. BWA-backtrack is designed for Illumina sequence reads up to 100 bp, while BWA-SW and BWA-MEM are designed for longer sequences ranging from 70 bp to 1 Mbp. It accepts a reference genome and a FASTQ file and outputs a SAM file.

For additional information see the documentation

ml biocontainers bwa
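bwa index <reference>
# aln writes suffix-array coordinates (.sai); samse converts them to SAM
bwa aln <reference> <short_reads.fq> > <reads.sai>
bwa samse <reference> <reads.sai> <short_reads.fq> > <output.sam>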

Parallel Capabilities: Single core default, Multithreading options supported.

Cutadapt

Cutadapt finds and removes adapter sequences, primers, poly-A tails, and other types of unwanted sequence from high-throughput sequencing reads. It can also modify, filter, and demultiplex reads in various ways. Its output contains every processed read, including reads trimmed to zero length. As input you provide the adapter sequence to be trimmed and a FASTQ file.

For additional information see the documentation

ml biocontainers cutadapt
cutadapt -a <Adapter Seq> -o output.fastq input.fastq

Parallel Capabilities: Single core default, Multithreading options supported.

Diamond

Diamond is a sequence aligner designed for aligning protein or translated DNA sequences against a protein reference database such as NR. First, generate a binary protein database file with diamond. You can then query FASTA files against the database using blastp. By default it writes hits with >70% sequence identity to the output .tsv file; this can be adjusted using command options. By default it detects and uses the available virtual cores on the machine.

It is also capable of clustering proteins in a similar way to CD-HIT and UCLUST: based upon input criteria, it finds a set of centroid (representative) sequences and assigns each input sequence to the cluster of one representative.

For additional information see the documentation

ml biocontainers diamond
diamond makedb --in <referencedb> -d <dbname>
diamond blastp -q <query.fa> -d <dbname> -o <output.tsv> --<optional commands>
# clustering
diamond cluster -d <query.fa> -o <output.tsv> --<optional commands>

Parallel Capabilities: Multithreading options supported.

FastQC

FastQC aims to provide a simple way to do quality control checks on raw sequencing data from high throughput sequencing pipelines. To do so it provides a modular set of analyses which can be used to give a quick impression whether or not the data has any issues. It outputs HTML graphs and reports showing the quality of your reads.

For additional information see the documentation

ml biocontainers fastqc
fastqc <seq.fastq>

Parallel Capabilities: Single core only.

GATK4

The Genome Analysis Toolkit is a suite of tools handling a variety of tasks within the genome analysis pipeline. The suite consists of more than 100 tools, so complete documentation is beyond the scope of this document. Each tool has different input requirements, and users are urged to see the full documentation.

The areas of analysis covered by this tool are:

Area of Analysis	Description
Basecalling	Tools that process sequencing machine data, e.g. Illumina base calls, and detect sequencing-level attributes, e.g. adapters
Copy Number Variant Discovery	Tools that analyze read coverage to detect copy number variants
Coverage analysis	Tools that count coverage e.g. depth per allele
Diagnostics and QC	Tools that collect sequencing quality related and comparative metrics
Genotyping arrays manipulation	Tools that manipulate data generated by genotyping arrays
Intervals manipulation	Tools that process genomic intervals in various formats
Metagenomics and pathogen detection	Tools that perform metagenomic analysis e.g. microbial community composition
Methylation-specific tools	Tools that perform methylation calling, processing bisulfite sequences, methylation-aware aligned BAM
Other	Miscellaneous tools such as those that aid in data streaming
Read data manipulation	Tools that manipulate read data in SAM, BAM, or CRAM formats
Reference	Tools that analyze and manipulate fasta format references
Short variant discovery	Tools that perform variant calling and genotyping for short variants such as SNPs, SNVs, and Indels
Structural variant discovery	Tools that detect structural variants
Variant evaluation and refinement	Tools that evaluate and refine variant calls such as with annotations not offered by the engine
Variant filtering	Tools that filter variants by annotating the Filter column
Variant manipulation	Tools that manipulate variant call format data

Parallel Capabilities: Varies by tool. See documentation.

`gmap`

The Genomic Mapping and Alignment Program (GMAP) handles several different tasks in alignment and mapping workflows. It lets the user map and align a single cDNA against large genomes, switch between genomes at will, generate accurate gene models despite the presence of significant polymorphisms or sequencing errors, locate splice sites, detect microexons, and handle mapping across genomes with alternative assemblies.

Tip: several other tools load with this module. List them by running the following command:

report_subtools.sh gmap

Parallel Capabilities: Single core default. Multithreading options supported.

HISAT2

HISAT2 is a sensitive spliced alignment program for mapping RNA-sequencing reads. It combines many small FM indexes with a single large FM index to collectively cover the whole genome and align effectively across exons. It is built upon bowtie2 and hisat.

Tip: several other tools load with this module. List them by running the following command:

report_subtools.sh hisat

For additional information see the documentation

ml biocontainers hisat2
# paired end alignments
hisat2 -x <index> -1 <forwardseqs.fq> -2 <reverseseqs.fq> -S <output.sam>
# unpaired alignments
hisat2 -x <index> -U <list of files.fq> -S <output.sam>
# SRA access
hisat2 -x <index> --sra-acc <SRA Accession> -S <output.sam>

Parallel Capabilities: Single core default, Multithreading options supported.

hmmer3

HMMER is used for searching sequence databases for sequence homologs and for making sequence alignments. Each tool has different input and output requirements; please see the documentation below. By default it detects POSIX threads and is configured in multithreaded mode, in which case, unless specified otherwise, it uses 3 threads: a master and two worker threads.

It includes the following tools:

Tool	Description
`hmmbuild`	build profile from input multiple alignment
`hmmalign`	make multiple sequence alignment using a profile
`hmmsearch`	search profile against sequence database
`hmmscan`	search sequence against profile database
`hmmpress`	prepare profile database for `hmmscan`
`phmmer`	search single sequence against sequence database
`jackhmmer`	iteratively search single sequence against database
`nhmmer`	search DNA query against DNA sequence database
`nhmmscan`	search DNA sequence against a DNA profile database
`hmmfetch`	retrieve profiles from a profile file
`hmmstat`	show summary statistics for a profile file
`hmmemit`	generate sample sequences from a profile
`hmmlogo`	produce a conservation logo graphic from a profile
`hmmconvert`	convert between different profile file formats
`hmmpgmd`	search daemon for the HMMER website
`hmmpgmd_shard`	sharded search daemon for the HMMER website
`makehmmerdb`	prepare an `nhmmer` binary database
`hmmsim`	collect score distributions on random sequences
`alimask`	add column mask to a multiple sequence alignment

Tip: several other tools load with this module. List them by running the following command:

report_subtools.sh hmmer

For additional information and running examples see the documentation

Parallel Capabilities: Multithreading options supported.

HTSeq

HTSeq is a package for high-throughput sequencing data analysis, covering everything from BAM file manipulation to counting reads. This module consists of htseq-count, htseq-count-barcodes, and htseq-qa. htseq-count accepts a list of alignment files and a GFF file as input and outputs a counts file. htseq-count-barcodes is similar to htseq-count but is designed for a single SAM, BAM, or CRAM file containing barcoded reads.

For additional information and examples please refer to the documentation

Execution:

ml biocontainers htseq
htseq-count <alignment files> <gff file>

Parallel Capabilities: Single core default, Multithreading options supported.

InterProScan

InterProScan provides functional analysis of proteins by classifying them into families and predicting domains and important sites. To do this it takes FASTA formatted sequences and scans the InterPro database, which integrates predictive information about proteins function from a number of partner resources. It outputs any matches to any of the databases used.

For additional information and examples please see the documentation

ml biocontainers interproscan
interproscan.sh -i INPUT.fasta -f tsv

Parallel Capabilities: Single core default, Multithreading options supported.

IQ-TREE

IQ-TREE is a high-throughput phylogenetic inference program that takes multiple sequence alignment files as input. It reconstructs the evolutionary tree that best explains the data. Note that it will throw an error if the length of codon sequences is not divisible by 3.

For additional information and examples see the documentation

ml biocontainers iqtree
iqtree -s input.phy

Parallel Capabilities: Single core default, Multithreading options supported.

jellyfish

Jellyfish is a tool for fast, memory-efficient k-mer counting in DNA sequence files. It can use the compare-and-swap CPU instruction to improve parallelism. The output depends on the subcommand used; by default, k-mer counts are written to mer_counts.jf.

For additional information and examples see the documentation

Execution:

ml biocontainers jellyfish
jellyfish count -m 21 -s 100M -t 10 -C input.fasta

Parallel Capabilities: Single core default, Multithreading options supported.

minimap2

Minimap2 is a sequence mapping and alignment program that can find overlaps between long noisy reads, or map long reads (or their assemblies) to a reference genome. It works efficiently with query sequences ranging from a few kilobases to ~100 megabases with an error rate of ~15%. It outputs either PAF or SAM format.

It can multithread with the following limitations: it uses at most 3 threads when indexing target sequences, and when mapping reads with N threads requested it can use N+1 threads, where the extra thread is used for I/O operations.

Parallel Capabilities: Single core default, Multithreading options supported.
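Because mapping can occupy one thread more than requested, it can help to request one thread fewer than the cores allocated to the job. The following sketch only computes the value and prints the resulting command; `mapping_threads` is a hypothetical helper, the fallback of 4 cores and the file names are placeholders, `SLURM_CPUS_PER_TASK` assumes a Slurm job, and the `map-ont` preset is just one example choice.

```shell
# mapping_threads CORES -> value for -t so that workers plus the I/O thread fit in CORES
mapping_threads() {
    if [ "$1" -gt 1 ]; then
        echo $(( $1 - 1 ))
    else
        echo 1
    fi
}

cores=${SLURM_CPUS_PER_TASK:-4}   # placeholder default outside a Slurm job
t=$(mapping_threads "$cores")

# Print the command that would be run once the module is loaded.
echo "minimap2 -a -x map-ont -t $t reference.fa reads.fq > aligned.sam"
```

Here `-a` selects SAM output; without it minimap2 writes PAF.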

MultiQC

MultiQC is a reporting tool that parses results and statistics from bioinformatics tool outputs, such as log files and console outputs. It searches a given directory for analysis logs and compiles an HTML report.

For additional information and examples please see the documentation

ml biocontainers multiqc
multiqc ResultsDir/

Parallel Capabilities: Single core only.

muscle

Muscle is a multiple sequence alignment tool capable of producing high-accuracy alternative alignments. It accepts FASTA as input and outputs an aligned FASTA. It detects the number of CPUs available for processing and will use up to 20 by default.

For more information see the documentation

ml biocontainers muscle
muscle --align input.fasta --output output.fasta

Parallel Capabilities: Multithreading options supported.

picard

Picard is a set of tools for the manipulation of high-throughput sequencing data and formats such as SAM, BAM, CRAM, and VCF.

It contains tools for the following areas of analysis:

Area of Analysis	Description
Base calling	Tools that process sequencing machine data e.g. Illumina base calls and detect sequencing level attributes, e.g. adapters
Diagnostics and QC	Tools that collect sequencing quality related and comparative metrics
Genotyping Arrays Manipulation	Tools that manipulate data generated by Genotyping arrays
Intervals Manipulation	Tools that process genomic intervals in various formats
Other	Miscellaneous tools e.g. those that aid in data streaming
Read Data Manipulation	Tools that manipulate read data in SAM, BAM, or CRAM format
Reference	Tools that analyze and manipulate FASTA format references
Variant Evaluation and Refinement	Tools that evaluate and refine variant calls, e.g. with annotations not offered by the engine
Variant filtering	Tools that filter variants by annotating the FILTER column
Variant manipulation	Tools that manipulate variant call format data

Users are urged to refer to the documentation for additional information and examples.

Parallel Capabilities: Single core only.

Prokka

Prokka is a bacterial, archaeal, and viral genome annotator. It draws on several databases and tools to determine its annotations, including ISfinder, UniProtKB, HMMER, and the NCBI bacterial antimicrobial resistance reference gene database. It takes a contigs file and outputs several files in a PROKKA_YYYYMMDD directory.

For additional information and execution examples please see the documentation

ml biocontainers prokka
prokka contigs.fa

Parallel Capabilities: Single core default, Multithreading options supported.

Qiime

Qiime2 is an end-to-end microbiome analysis pipeline consisting of several tools that handle everything from demultiplexing to phylogenetic analysis.

For additional information and a plethora of examples see the documentation

ml biocontainers qiime2
qiime demux emp-single --i-seqs emp-single-end-sequences.qza \
  --m-barcodes-file sample_metadata.tsv \
  --m-barcodes-column barcode-sequence \
  --o-per-sample-sequences demux.qza \
  --o-error-correction-details demux-details.qza

Parallel Capabilities: Single core default, Some tools support Multithreading, please see documentation.

QUAST

QUAST evaluates genome assemblies by computing various metrics. For input it takes assemblies and a reference genome in FASTA format. It outputs several files containing informative plots and statistics. This module also loads metaquast.py.

For additional information and examples please see the documentation

ml biocontainers quast
quast.py contigs1.fasta contigs2.fasta -r reference.fasta.gz -g genes.gff

Parallel Capabilities: Single core only.

RepeatMasker

RepeatMasker screens DNA sequences for interspersed repeats and low complexity sequences. At minimum the input requires a sequence and return format. For output it provides a detailed annotation of the repeats that are present in the query sequence as well as a modified version of the query sequence in which the annotated repeats have been masked. The documentation page breaks down the output files in more detail and users are recommended to review that material.

For additional information please see the documentation

ml biocontainers repeatmasker
RepeatMasker --species <species> <Sequence.fa>

Parallel Capabilities: Single core only.

RepeatModeler

RepeatModeler is a de novo transposable element family identification and modeling package. It consists of three core programs (RECON, RepeatScout, and LTRharvest/LTR_retriever) which identify repeat element boundaries and family relationships from sequence data.

As input it takes a database and an input sequence. It outputs three primary files: families.fa, families.stk, and rmod.log. families.fa contains the consensus sequences, families.stk contains the seed alignments, and rmod.log contains a summarized log of the run. Note that this program generates A LOT of temporary files that ARE NOT cleaned up between runs, so you will have to clean up your directory manually. Also note that runtimes are long, so plan your resource request accordingly. This module also loads BuildDatabase and RepeatClassifier.

For additional information and examples see the documentation

ml biocontainers repeatmodeler
BuildDatabase -name fish fish.fa
# replace <N> with the number of threads to use
RepeatModeler -database fish -threads <N> -LTRStruct

Parallel Capabilities: Single core default, Multithreading options supported.
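The manual cleanup between runs can be scripted. The sketch below assumes that each run leaves its temporary files in a working directory whose name starts with `RM_` inside the directory where RepeatModeler was launched; that naming is an assumption and should be checked against your own run directory. Both helper names are hypothetical.

```shell
# list_rm_workdirs DIR -> print leftover working directories (assumed to be named RM_*)
list_rm_workdirs() {
    find "$1" -maxdepth 1 -type d -name 'RM_*'
}

# clean_rm_workdirs DIR -> delete those directories; review the list first
clean_rm_workdirs() {
    find "$1" -maxdepth 1 -type d -name 'RM_*' -exec rm -rf {} +
}
```

Running `list_rm_workdirs .` before `clean_rm_workdirs .` shows what would be removed, which matters because the deletion cannot be undone.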

RSEM

RSEM estimates gene and isoform expression levels from RNA sequencing data. The workflow consists of several programs that handle everything from database building to data visualization.

Tip: several other tools load with this module. List them by running the following command:

report_subtools.sh rsem

For more information see the documentation

Parallel Capabilities: Single core default, Multithreading options supported.

Samtools

Samtools is a set of utilities that manipulate alignments in the SAM, BAM, and CRAM formats. It converts between formats, sorts, merges, and indexes files, and can swiftly retrieve reads in any region. It can read input from stdin and write output to stdout.

Tip: several other tools load with this module. List them by running the following command:

report_subtools.sh samtools

For additional information please see the documentation

Parallel Capabilities: Single core only.
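Because the utilities read from stdin and write to stdout, they chain naturally in a pipeline. The following is a minimal sketch of a common conversion, with a hypothetical helper name and placeholder file names.

```shell
# sam_to_sorted_bam IN.sam OUT.bam (hypothetical helper)
sam_to_sorted_bam() {
    # Convert SAM to BAM on stdout, then sort from stdin ("-") into the output file.
    samtools view -b "$1" | samtools sort -o "$2" -
    # Index the sorted BAM so that reads in any region can be retrieved quickly.
    samtools index "$2"
}
```

With the module loaded, `sam_to_sorted_bam aligned.sam aligned.sorted.bam` would leave a sorted BAM file and its index next to each other.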

skewer

Skewer implements a bit-masked k-difference matching algorithm dedicated to the task of adapter trimming and is specifically designed for processing NGS paired-end sequences. As input it accepts the adapter sequence to be trimmed and the forward/reverse reads.

ml biocontainers skewer
skewer -x <adapter sequence> -q 3 <forward reads.gz> <reverse reads.gz>

Parallel Capabilities: Single core default, Multithreading options supported.

spades

Spades is an assembly toolkit consisting of several pipelines. It takes paired-end, mate-pair, and unpaired reads as input and outputs several files that contain everything from the scaffolds to the assembly graph. By default the thread count is 16.

Tip: several other tools load with this module. List them by running the following command:

report_subtools.sh spades

For additional information and examples please see the documentation

ml biocontainers spades
spades.py --help

Parallel Capabilities: Multithreading options supported.

sra-tools

The sra-toolkit consists of several tools for using data in the Sequence Read Archive.

Tip: several other tools load with this module. List them by running the following command:

report_subtools.sh sra-tools

For additional information see the documentation

ml biocontainers sra-tools
fasterq-dump -O <output dir> <SRR#>

Parallel Capabilities: Single core only.

STAR

STAR is an ultrafast universal RNA sequence alignment tool designed to address the challenges of RNA-sequencing data mapping using a strategy that accounts for spliced alignments. Note that, although fast, it requires a lot of memory, so plan resource requests accordingly. STARlong also loads with this module.

For additional information see the documentation

Parallel Capabilities: Single core default, Multithreading options supported.

Subread

Subread acts as a general alignment tool which can align both genomic DNA and RNA-sequencing reads. It can also discover genomic mutations such as indels and structural variants. When multithreading, the maximum number of usable threads is 32.

Tip: several other tools load with this module. List them by running the following command:

report_subtools.sh subread

For additional information and examples please see the documentation

ml biocontainers subread
subread-buildindex -o <index name> <list of ref.fa>
subread-align -i <index name> -r <reads.txt.gz> -o <results.bam>

Parallel Capabilities: Single core default, Multithreading options supported.

trim-galore

Trim-galore is a wrapper script that automates quality and adapter trimming as well as quality control, with some added functionality to remove biased methylation positions for RRBS sequencing files.

For additional information please see the documentation

ml biocontainers trim-galore
trim_galore -a AAAAAAAAA input.fq

Parallel Capabilities: Single core default, Multithreading options supported.

Trimmomatic

Trimmomatic performs a variety of trimming tasks for Illumina paired-end and single-end data. If no thread count is provided, it is determined automatically. The available commands include:

Command	Description
`ILLUMINACLIP`	cut adapter and other Illumina-specific sequences from the read
`SLIDINGWINDOW`	perform a sliding window trimming, cutting once the average quality within the window falls below a threshold
`LEADING`	cut bases off the start of a read, if below a threshold quality
`TRAILING`	cut bases off the end of a read, if below a threshold quality
`CROP`	cut the read to a specified length
`HEADCROP`	cut the specified number of bases from the start of the read
`MINLEN`	drop the read if it is below a specified length
`TOPHRED33`	convert quality scores to phred-33
`TOPHRED64`	convert quality scores to phred-64

For additional information please see the documentation

ml biocontainers trimmomatic
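# SE selects single-end mode; at least one trimming step (here MINLEN) follows the file names
trimmomatic SE -threads 20 -phred33 -trimlog log.txt input.fq output.fq MINLEN:36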

Parallel Capabilities: Some commands have Multithreading options supported. Please see documentation.
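The two quality-score conversion commands in the table above differ only in the ASCII offset used to encode each base quality. As a worked illustration, the sketch below decodes a single quality character under either offset; `phred_score` is a hypothetical helper and is not part of Trimmomatic.

```shell
# phred_score CHAR OFFSET -> numeric base quality encoded by CHAR
phred_score() {
    code=$(printf '%d' "'$1")   # ASCII code of the character
    echo $(( code - $2 ))
}

phred_score I 33   # prints 40 (ASCII 73 minus 33)
phred_score h 64   # prints 40 (ASCII 104 minus 64)
```

The same quality of 40 is therefore written as `I` in a phred-33 file and as `h` in a phred-64 file, which is why the encoding has to be declared or converted before trimming by quality.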

trinity

Trinity allows efficient and robust de novo reconstruction of transcriptomes from RNA-sequencing data. It requires you to define the sequence file type, the left and right read files, the number of CPU cores, and the maximum memory. It outputs a Trinity.fasta file. Note that the Inchworm stage is capped at 6 threads.

Tip: several other tools load with this module. List them by running the following command:

report_subtools.sh trinity

For additional information and examples please see the documentation

ml biocontainers trinity
Trinity --seqType fa --left left_1.fa --right right_2.fa --CPU 12 --max_memory 10G

Parallel Capabilities: Single core default, Multithreading options supported.

VCFtools

VCFtools is a program package designed for manipulating VCF files. It is capable of filtering out specific variants, comparing files, summarizing variants, converting between file formats, and validating and merging files. As such, the input depends on the usage.

For additional information please see the documentation

Parallel Capabilities: Single core only.

VSEARCH

VSEARCH is a tool capable of handling a variety of tasks ranging from chimera detection to global alignment searches. As such the input of this tool varies depending upon the usage.

For additional information and examples please see the documentation

Parallel Capabilities: Single core default, Multithreading options supported.
