Biocontainer Tools
Astral Tree
The astral-tree tool estimates an unrooted species tree from a set of unrooted gene trees. It searches for the species tree that shares the maximum number of induced quartet trees with the input gene trees. Each model generated is statistically consistent under the multi-species coalescent model and grows at most linearly with the number of species and genes.
The main input for this tool is a set of unrooted gene trees in Newick format, and it returns an unrooted species tree. It is recommended that you contract very low support branches, e.g. those with less than 10% bootstrap support. When the species name differs from the individual's name, the input file should have one species per line. Astral only estimates branch lengths for species in which more than one individual is supplied. Branch lengths are reported in coalescent units and are a direct measure of the amount of discordance in the gene trees.
For additional information please see the documentation
ml biocontainers astral-tree
astral -i song_mammals.424.gene.tre -o song_mammals.tre
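The gene trees are passed to `astral` as a single file. As a minimal sketch, assuming each gene tree currently sits in its own one-line Newick file with a `.tre` extension inside a hypothetical `gene_trees/` directory, the input file could be assembled as shown below. The function name and all paths are illustrative only.

```shell
# Combine per-gene Newick files into one input file, one tree per line.
# The directory layout and file names here are hypothetical.
combine_gene_trees() {
    tree_dir=$1
    out_file=$2
    # "awk 1" prints every line and supplies a trailing newline where one
    # is missing, so trees from different files never run together.
    awk 1 "$tree_dir"/*.tre > "$out_file"
    # Report how many lines (trees) were written.
    wc -l < "$out_file" | tr -d ' '
}

# Usage (after `ml biocontainers astral-tree`):
#   combine_gene_trees gene_trees all_genes.tre
#   astral -i all_genes.tre -o species.tre
```

Write the combined file outside the tree directory so that it is not picked up by the `*.tre` glob on a later run.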
Parallel Capabilities: Single core only.
BCFTools
BCFTools is a suite of utilities that manipulate variant calls in the Variant Call Format (VCF) and its binary counterpart BCF. The tool is designed for handling data streams and as such treats input and output files as stdin and stdout. Therefore, when passing multiple files, they must be compressed and indexed.
This suite consists of the following commands:
Command | Description |
---|---|
annotate | add or remove annotations of a VCF |
call | SNP/indel calling |
cnv | Copy number variation caller |
concat | concatenate VCF/BCF files from the same set of samples |
consensus | create a consensus sequence by applying VCF variants |
convert | convert VCF to BCF and vice versa |
csq | haplotype aware consequence caller |
filter | filter VCF/BCF files using fixed thresholds |
gtcheck | check sample concordance, detect sample swaps and contamination |
head | view file headers |
index | index files |
isec | intersections of files |
merge | merge files from non-overlapping sample sets |
mpileup | multi-way pileup producing genotype likelihoods |
norm | normalize indels |
plugin | run user-defined plugin |
polysomy | detect contaminations and whole-chromosome aberrations |
query | transform files into user-defined formats |
reheader | modify header and change sample names |
roh | identify runs of homo/auto-zygosity utilizing a HMM. |
sort | sort files |
stats | produce stats and can be visualized with plot-vcfstats |
view | subset, filter, and convert files |
Two additional scripts are bundled with bcftools: `gff2gff` and `plot-vcfstats`. `gff2gff` converts a GFF file to the format required by `csq`, while `plot-vcfstats` plots the output of `stats`.
For additional information please see the documentation
ml biocontainers bcftools
bcftools stats -s <stdin> > file.vchk
plot-vcfstats -p outdir file.vchk
cat in.gff.gz | gff2gff | gzip -c > out.gff.gz
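Commands that read several files at once, such as `merge`, need compressed and indexed inputs. The sketch below shows one way to prepare two plain VCF files and merge them. The function name and file names are placeholders, and the two files are assumed to hold non-overlapping sample sets.

```shell
# Compress and index two VCF files, then merge them.
# File names are placeholders; the sample sets must not overlap.
merge_two_vcfs() {
    first=$1
    second=$2
    merged=$3
    for vcf in "$first" "$second"; do
        # -Oz writes bgzip-compressed VCF; index enables random access.
        bcftools view -Oz -o "$vcf.gz" "$vcf"
        bcftools index "$vcf.gz"
    done
    bcftools merge -Oz -o "$merged" "$first.gz" "$second.gz"
}

# Usage (after `ml biocontainers bcftools`):
#   merge_two_vcfs cohort_a.vcf cohort_b.vcf merged.vcf.gz
```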
Parallel Capabilities: Single core only.
Bowtie2
Bowtie2 is a tool for aligning sequencing reads to long genomes. It indexes the reference genome using FM indexing and then aligns the reads. The resulting alignments are output in the SAM format. By default these alignments are end to end, but they can be set to local. In local mode bowtie2 will trim some read characters on either end to maximize the alignment score. Two additional tools are loaded with bowtie2: `bowtie2-build` and `bowtie2-inspect`.
When searching for alignments of paired reads, bowtie2 looks for both concordant alignments, where the pair aligns in the expected orientation and within the expected distance range, and discordant alignments, where it does not. Additionally, it does NOT guarantee that the alignment reported is the best possible in terms of alignment score. When two alignments are equally good, it uses a pseudo-random number to choose between them.
For additional information and examples please see the documentation
ml biocontainers bowtie2
# for unpaired reads
bowtie2 -x <basename of the index for reference genome> -U longreads.fq
# for paired reads
bowtie2 -x <basename of the index for reference genome> -1 forward_reads.fq -2 reverse_reads.fq
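The commands above assume an index already exists. As a rough sketch, the function below builds the index with `bowtie2-build` and then aligns unpaired reads in local mode with several threads. The function name, file names, index basename, and default thread count are placeholders chosen for illustration.

```shell
# Build an FM index, then align unpaired reads in local mode.
# File names, the index basename, and the thread count are placeholders.
align_local() {
    reference=$1
    reads=$2
    out_sam=$3
    threads=${4:-4}
    # Derive an index basename from the reference name (ref.fa -> ref_index).
    index_base=${reference%.*}_index
    bowtie2-build "$reference" "$index_base"
    # --local lets read ends be trimmed; -p sets the thread count;
    # -S names the SAM output file.
    bowtie2 --local -p "$threads" -x "$index_base" -U "$reads" -S "$out_sam"
}

# Usage (after `ml biocontainers bowtie2`):
#   align_local reference.fa reads.fq aligned.sam 8
```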
Parallel Capabilities: Single core default, Multithreading options supported.
Busco
Busco is a suite that attempts to provide a quantitative assessment of the completeness, in terms of expected gene content, of a genome assembly, transcriptome, or annotated gene set. It outputs a file reporting counts of complete single-copy, complete duplicated, fragmented, and missing genes.
For additional information see the documentation
ml biocontainers busco
# Genome mode
busco -m genome -i INPUT.nucleotideFile -o OUTPUTNAME -l SpeciesLineage
# Protein mode
busco -m protein -i INPUT.amino_acidsFile -o OUTPUTNAME -l SpeciesLineage
# Transcriptome mode
busco -m transcriptome -i INPUT.nucleotideFile -o OUTPUTNAME -l SpeciesLineage
# can also have it auto determine lineage with --auto-lineage
Parallel Capabilities: Single core default, Multithreading options supported.
Burrows-Wheeler Aligner (BWA)
BWA is a software package for mapping low-divergent sequences against a large reference genome. It consists of 3 algorithms: `BWA-backtrack`, `BWA-SW`, and `BWA-MEM`. Backtrack is designed for Illumina sequence reads up to 100 bp, while SW and MEM are designed for reads ranging from 70 bp to 1 Mbp. It accepts a reference genome and a FASTQ file and outputs a SAM file.
For additional information see the documentation
ml biocontainers bwa
bwa index <reference>
bwa aln <reference> <short seq.fq> > <reads.sai>
bwa samse <reference> <reads.sai> <short seq.fq> > <output.sam>
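For the longer reads that BWA-MEM targets, mapping is a single `bwa mem` call once the reference is indexed. The sketch below wraps both steps. The function name, file names, and default thread count are placeholders.

```shell
# Index the reference if needed, then map reads with BWA-MEM.
# File names and the thread count are placeholders.
map_with_bwa_mem() {
    reference=$1
    reads=$2
    out_sam=$3
    threads=${4:-4}
    # bwa index writes <reference>.bwt (among others); skip if present.
    [ -f "$reference.bwt" ] || bwa index "$reference"
    # bwa mem writes SAM to stdout; -t sets the thread count.
    bwa mem -t "$threads" "$reference" "$reads" > "$out_sam"
}

# Usage (after `ml biocontainers bwa`):
#   map_with_bwa_mem reference.fa reads.fq aligned.sam 8
```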
Parallel Capabilities: Single core default, Multithreading options supported.
Cutadapt
Cutadapt finds and removes adapter sequences, primers, poly-A tails, and other types of unwanted sequences from high-throughput sequencing reads. Additionally, it can modify, filter, and demultiplex reads in various ways. Its output contains every processed read, including reads trimmed to zero length. For input you provide the adapter sequence to be trimmed and an input FASTQ file.
For additional information see the documentation
ml biocontainers cutadapt
cutadapt -a <Adapter Seq> -o output.fastq input.fastq
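Because reads trimmed down to nothing stay in the output, it can be useful to check how many there are before passing the file downstream. The sketch below counts them with `awk`, assuming a standard FASTQ file with four lines per record. The function name is illustrative only.

```shell
# Count reads whose sequence line is empty in a FASTQ file
# (assumes exactly four lines per record).
count_empty_reads() {
    awk 'NR % 4 == 2 && length($0) == 0 { n++ } END { print n + 0 }' "$1"
}

# Usage:
#   count_empty_reads output.fastq
# To drop such reads at trimming time instead, cutadapt's -m option
# (minimum length) can be added:
#   cutadapt -a <Adapter Seq> -m 1 -o output.fastq input.fastq
```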
Parallel Capabilities: Single core default, Multithreading options supported.
Diamond
Diamond is a sequence aligner designed for aligning translated DNA or protein sequences against a protein reference database such as NR. To begin you generate a protein database binary file through diamond. Then you can query FASTA files against the database using `blastp`. By default it returns hits with >70% sequence homology to the output .tsv file. This can be adjusted using command options. By default it will detect and use the available virtual cores on the machine.
It is also capable of clustering proteins in a similar way to `CD-HIT` and `UCLUST` where, based upon input criteria, it finds a set of centroid or representative sequences and assigns each input sequence to the cluster of one representative.
For additional information see the documentation
ml biocontainers diamond
diamond makedb --in <referencedb> -d <dbname>
diamond blastp -q <query.fa> -d <dbname> -o <output.tsv> --<optional commands>
# clustering
diamond cluster -d <query.fa> -o <output.tsv> --<optional commands>
Parallel Capabilities: Multithreading options supported.
FastQC
FastQC aims to provide a simple way to do quality control checks on raw sequencing data from high throughput sequencing pipelines. To do so it provides a modular set of analyses which can be used to give a quick impression whether or not the data has any issues. It outputs HTML graphs and reports showing the quality of your reads.
For additional information see the documentation
ml biocontainers fastqc
fastqc <seq.fastq>
Parallel Capabilities: Single core only.
GATK4
The Genome Analysis Toolkit is a suite of tools handling a variety of tasks within the genome analysis pipeline. This suite consists of more than 100 tools, and as such complete documentation is beyond the scope of this document. Each tool has different input requirements and the user is urged to see the full documentation.
The areas of analysis covered by this tool are:
Area of Analysis | Description |
---|---|
Basecalling | Tools that process sequencing machine data e.g. Illumina base calls, and detect sequencing level attributes e.g. adapters |
Copy Number Variant Discovery | Tools that analyze read coverage to detect copy number variants |
Coverage analysis | Tools that count coverage e.g. depth per allele |
Diagnostics and QC | Tools that collect sequencing quality related and comparative metrics |
Genotyping arrays manipulation | Tools that manipulate data generated by genotyping arrays |
Intervals manipulation | Tools that process genomic intervals in various formats |
Metagenomics and pathogen detection | Tools that perform metagenomic analysis e.g. microbial community composition |
Methylation-specific tools | Tools that perform methylation calling, processing bisulfite sequences, methylation-aware aligned BAM |
Other | Miscellaneous tools such as those that aid in data streaming |
Read data manipulation | Tools that manipulate read data in SAM, BAM, or CRAM formats |
Reference | Tools that analyze and manipulate fasta format references |
Short variant discovery | Tools that perform variant calling and genotyping for short variants such as SNPs, SNVs, and Indels |
Structural variant discovery | Tools that detect structural variants |
Variant evaluation and refinement | Tools that evaluate and refine variant calls such as with annotations not offered by the engine |
Variant filtering | Tools that filter variants by annotating the Filter column |
Variant manipulation | Tools that manipulate variant call format data |
Parallel Capabilities: Varies by tool. See documentation.
gmap
The genomic mapping and alignment program handles several different tasks in alignment and mapping workflows. It allows the user to map and align single cDNA sequences against large genomes, switch between genomes at will, generate accurate gene models despite the presence of significant polymorphisms or sequencing errors, locate splice sites, detect microexons, and handle mapping across genomes with alternative assemblies.
Note that several other tools load with this module; they can be listed by running the following command:
report_subtools.sh gmap
Parallel Capabilities: Single core default. Multithreading options supported.
HISAT2
HISAT2 is a sensitive spliced alignment program for mapping RNA-sequencing data. It utilizes a combination of small FM indexes with a singular large FM index to collectively cover the whole genome and effectively align across exons. It is built upon bowtie2 and hisat.
Note that several other tools load with this module; they can be listed by running the following command:
report_subtools.sh hisat
For additional information see the documentation
ml biocontainers hisat2
# paired end alignments
hisat2 -x <index> -1 <forwardseqs.fa> -2 <reverseseqs.fa> -S <output.sam>
# unpaired alignments
hisat2 -x <index> -U <list of files.fa> -S <output.sam>
# SRA access
hisat2 -x <index> --sra-acc <SRA Accession> -S <output.sam>
Parallel Capabilities: Single core default, Multithreading options supported.
hmmer3
HMMER is used for searching sequence databases for sequence homologs, and for making sequence alignments. Each tool has different input and output requirements; please see the documentation below. By default it detects POSIX threads and configures itself in multithreaded mode, in which case, unless otherwise specified, it uses 3 threads: a master and two worker threads.
It includes the following tools:
Tool | Description |
---|---|
hmmbuild | build profile from input multiple alignment |
hmmalign | make multiple sequence alignment using a profile |
hmmsearch | search profile against sequence database |
hmmscan | search sequence against profile database |
hmmpress | prepare profile database for hmmscan |
phmmer | search single sequence against sequence database |
jackhmmer | iteratively search single sequence against database |
nhmmer | search DNA query against DNA sequence database |
nhmmscan | search DNA sequence against a DNA profile database |
hmmfetch | retrieve profiles from a profile file |
hmmstat | show summary statistics for a profile file |
hmmemit | generate sample sequences from a profile |
hmmlogo | produce a conservation logo graphic from a profile |
hmmconvert | convert between different profile file formats |
hmmpgmd | search daemon for the HMMER website |
hmmpgmd_shard | sharded search daemon for the HMMER website |
makehmmerdb | prepare an nhmmer binary database |
hmmsim | collect score distributions on random sequences |
alimask | add column mask to a multiple sequence alignment |
Note that several other tools load with this module; they can be listed by running the following command:
report_subtools.sh hmmer
For additional information and running examples see the documentation
Parallel Capabilities: Multithreading options supported.
HTSeq
HTSeq is for high throughput sequencing data analysis. Covering everything from BAM file manipulation to counting reads, this package is a helpful tool for sequence analysis. This module consists of `htseq-count`, `htseq-count-barcodes`, and `htseq-qa`. `htseq-count` accepts a list of alignment files and a GFF file as input and outputs a counts file. `htseq-count-barcodes` is similar to `htseq-count` but is designed for SAM, BAM, and CRAM file formats.
For additional information and examples please refer to the documentation
Execution:
ml biocontainers htseq
htseq-count <alignment files> <gff file>
Parallel Capabilities: Single core default, Multithreading options supported.
InterProScan
InterProScan provides functional analysis of proteins by classifying them into families and predicting domains and important sites. To do this it takes FASTA formatted sequences and scans the InterPro database, which integrates predictive information about protein function from a number of partner resources. It outputs any matches to any of the databases used.
For additional information and examples please see the documentation
ml biocontainers interproscan
interproscan.sh -i INPUT.fasta -f tsv
Parallel Capabilities: Single core default, Multithreading options supported.
IQ-TREE
IQ-TREE is a high-throughput phylogenetic tree software capable of handling multiple sequence alignment files. It reconstructs the evolutionary tree as best explained by the data. Note it will throw an error if the exons are not divisible by 3.
For additional information and examples see the documentation
ml biocontainers iqtree
iqtree -s input.phy
Parallel Capabilities: Single core default, Multithreading options supported.
jellyfish
Jellyfish is a tool for memory-efficient, fast k-mer counting in DNA files. It can use the compare-and-swap CPU instruction to improve parallelism. The output depends on the subcommand used, but by default k-mer counts are written to `mer_counts.jf`.
For additional information and examples see the documentation
Execution:
ml biocontainers jellyfish
jellyfish count -m 21 -s 100M -t 10 -C input.fasta
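The `count` output is a binary file, so a second step is needed to read it. The sketch below uses the `histo` and `dump` subcommands to turn it into text. The function name and the output prefix are placeholders.

```shell
# Summarize a Jellyfish count file as text.
# The default input matches the file written by `jellyfish count`.
summarize_kmers() {
    counts=${1:-mer_counts.jf}
    prefix=${2:-kmers}
    # histo: how many distinct k-mers occur once, twice, and so on.
    jellyfish histo "$counts" > "$prefix.histo.txt"
    # dump -c: one "k-mer count" pair per line.
    jellyfish dump -c "$counts" > "$prefix.dump.txt"
}

# Usage (after `ml biocontainers jellyfish` and a `jellyfish count` run):
#   summarize_kmers mer_counts.jf sample1
```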
Parallel Capabilities: Single core default, Multithreading options supported.
minimap2
Minimap2 is a sequence mapping and alignment program that can find overlaps between long noisy reads, or map long reads (or their assemblies) to a reference genome. This program works efficiently with query sequences ranging from a few kilobases to ~100 megabases with an error rate ~15%. Outputs to either PAF or SAM file formats.
It supports multithreading with the following limitations: it uses at most 3 threads when indexing target sequences, and when mapping reads it can use N+1 threads, where the extra thread is used for I/O operations.
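A minimal mapping run might look like the sketch below, which maps long reads to a reference and writes SAM. The function name, file names, and default thread count are placeholders, and the `map-ont` preset assumes Oxford Nanopore reads; other data types need a different preset.

```shell
# Map long reads to a reference with minimap2 and write SAM.
# File names and the thread count are placeholders.
map_long_reads() {
    reference=$1
    reads=$2
    out_sam=$3
    threads=${4:-4}
    # -a selects SAM output (PAF is the default); -x map-ont is the
    # preset for Oxford Nanopore reads; -t sets the thread count.
    minimap2 -a -x map-ont -t "$threads" "$reference" "$reads" > "$out_sam"
}

# Usage (after `ml biocontainers minimap2`):
#   map_long_reads reference.fa reads.fq aligned.sam 8
```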
Parallel Capabilities: Single core default, Multithreading options supported.
MultiQC
MultiQC is a reporting tool that parses results and statistics from bioinformatics tool outputs, such as log files and console outputs. It searches a given directory for analysis logs and compiles an HTML report.
For additional information and examples please see the documentation
ml biocontainers multiqc
multiqc ResultsDir/
Parallel Capabilities: Single core only.
muscle
Muscle is a multiple sequence alignment tool capable of assembling high accuracy alternative alignments. It accepts FASTA as input and outputs an aligned FASTA. It detects the number of CPUs available for processing and will use up to 20 by default.
For more information see the documentation
ml biocontainers muscle
muscle --align input.fasta --output output.fasta
Parallel Capabilities: Multithreading options supported.
picard
Picard is a set of tools for the manipulation of high-throughput sequencing data and formats such as SAM, BAM, CRAM, and VCF.
It contains tools for the following areas of analysis:
Area of Analysis | Description |
---|---|
Base calling | Tools that process sequencing machine data e.g. Illumina base calls and detect sequencing level attributes, e.g. adapters |
Diagnostics and QC | Tools that collect sequencing quality related and comparative metrics |
Genotyping Arrays Manipulation | Tools that manipulate data generated by Genotyping arrays |
Intervals Manipulation | Tools that process genomic intervals in various formats |
Other | Miscellaneous tools e.g. those that aid in data streaming |
Read Data Manipulation | Tools that manipulate read data in SAM, BAM, or CRAM format |
Reference | Tools that analyze and manipulate FASTA format references |
Variant Evaluation and Refinement | Tools that evaluate and refine variant calls, e.g. with annotations not offered by the engine |
Variant filtering | Tools that filter variants by annotating the FILTER column |
Variant manipulation | Tools that manipulate variant call format data |
Users are urged to refer to the documentation for additional information and examples.
Parallel Capabilities: Single core only.
Prokka
Prokka is a prokaryotic, archaeal, and viral genome annotator. It utilizes
several databases to determine its annotations including ISfinder, UniProtKB,
HMMER, and NCBI bacterial antimicrobial resistance reference gene databases. It
takes a contigs file and outputs several different files in a PROKKA_YYYYMMDD
directory.
For additional information and execution examples please see the documentation
ml biocontainers prokka
prokka contigs.fa
Parallel Capabilities: Single core default, Multithreading options supported.
Qiime
Qiime2 is an end-to-end microbiome analysis pipeline consisting of several tools that handle everything from demultiplexing to phylogenetic analysis.
For additional information and a plethora of examples see the documentation
ml biocontainers qiime2
qiime demux emp-single --i-seqs emp-single-end-sequences.qza \
--m-barcodes-file sample_metadata.tsv --m-barcodes-column \
barcode-sequence --o-per-sample-sequences demux.qza \
--o-error-correction-details demux-details.qza
Parallel Capabilities: Single core default, Some tools support Multithreading, please see documentation.
QUAST
QUAST evaluates genome assemblies by computing various metrics. For input it takes assemblies and a reference genome in FASTA format. It outputs several files containing informative plots and statistics. This module also loads `metaquast.py`.
For additional information and examples please see the documentation
ml biocontainers quast
quast.py contigs1.fasta contigs2.fasta -r reference.fasta.gz -g genes.gff
Parallel Capabilities: Single core only.
RepeatMasker
RepeatMasker screens DNA sequences for interspersed repeats and low complexity sequences. At minimum the input requires a sequence and return format. For output it provides a detailed annotation of the repeats that are present in the query sequence as well as a modified version of the query sequence in which the annotated repeats have been masked. The documentation page breaks down the output files in more detail and users are recommended to review that material.
For additional information please see the documentation
ml biocontainers repeatmasker
RepeatMasker --species <species> <Sequence.fa>
Parallel Capabilities: Single core only.
RepeatModeler
RepeatModeler is a de novo transposable element family identification and modeling package. It consists of 3 core programs (RECON, RepeatScout, and LtrHarvest/Ltr_retriever) which identify repeat element boundaries and family relationships from sequencing data.
For input it takes a database and an input sequence. It outputs 3 primary files: `families.fa`, `families.stk`, and `rmod.log`. `families.fa` contains the consensus sequences, `families.stk` contains the seed alignments, and `rmod.log` contains a summarized log of the run. Note that this program generates A LOT of temporary files that ARE NOT cleaned up between runs, so you will have to clean up your directory manually between runs. Also note that it has long runtimes, so plan your resource request accordingly. This module also loads `BuildDatabase` and `RepeatClassifier`.
For additional information and examples see the documentation
ml biocontainers repeatmodeler
BuildDatabase -name fish fish.fa
RepeatModeler -database fish -threads <N> -LTRStruct
Parallel Capabilities: Single core default, Multithreading options supported.
RSEM
RSEM estimates gene and isoform expression levels from RNA sequencing data. The workflow consists of several programs handling everything from database building to data visualization.
Note that several other tools load with this module; they can be listed by running the following command:
report_subtools.sh rsem
For more information see the documentation
Parallel Capabilities: Single core default, Multithreading options supported.
Samtools
Samtools is a set of utilities that manipulate alignments in the SAM, BAM, and CRAM file formats. It manages conversions between formats, handles sorting, merging, and indexing, and can swiftly retrieve reads in any region. It treats any input as stdin and output as stdout.
Note that several other tools load with this module; they can be listed by running the following command:
report_subtools.sh samtools
For additional information please see the documentation
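Because samtools commands read from stdin and write to stdout, they chain naturally with pipes. The sketch below converts a SAM file to a coordinate-sorted, indexed BAM file. The function name and file names are placeholders.

```shell
# Convert SAM to a sorted, indexed BAM file using a pipe.
# File names are placeholders.
sam_to_sorted_bam() {
    in_sam=$1
    out_bam=$2
    # view -b emits BAM on stdout; sort reads it from stdin ("-").
    samtools view -b "$in_sam" | samtools sort -o "$out_bam" -
    # The index allows fast retrieval of reads from any region.
    samtools index "$out_bam"
}

# Usage (after `ml biocontainers samtools`):
#   sam_to_sorted_bam aligned.sam aligned.sorted.bam
```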
Parallel Capabilities: Single core only.
skewer
Skewer is a bit-masked k-difference matching algorithm dedicated to the task of adapter trimming and is specifically designed for processing NGS paired-end sequences. For input it accepts the adapter sequence to be trimmed and the forward/reverse reads.
ml biocontainers skewer
skewer -x <adapter sequence> -q 3 <forward reads.gz> <reverse reads.gz>
Parallel Capabilities: Single core default, Multithreading options supported.
spades
Spades is an assembly toolkit consisting of several pipelines. It takes paired-end, mate-pairs, and unpaired reads as input. It then outputs several files that contain everything from the scaffolds to the assembly graph. By default the thread count is 16.
Note that several other tools load with this module; they can be listed by running the following command:
report_subtools.sh spades
For additional information and examples please see the documentation
ml biocontainers spades
spades.py --help
Parallel Capabilities: Multithreading options supported.
sra-tools
The sra-toolkit consists of several tools for using data in the Sequence Read Archive.
Note that several other tools load with this module; they can be listed by running the following command:
report_subtools.sh sra-tools
For additional information see the documentation
ml biocontainers sra-tools
fasterq-dump -O <output dir> <SRR#>
Parallel Capabilities: Single core only.
STAR
STAR is an ultra fast universal RNA sequence alignment tool designed to address the challenges of RNA-sequencing data mapping using a strategy to account for spliced alignments. Note that, though fast, it requires a lot of memory, so plan resource requests accordingly. STARlong also loads with this module.
For additional information see the documentation
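A STAR run has two stages: generating a genome index, then aligning reads against it. The sketch below shows both with a thread count set explicitly. The function name, file names, index directory, and the `aligned_` output prefix are placeholders, and real runs usually need more options (for example an annotation file).

```shell
# Generate a STAR genome index, then align reads against it.
# File names, the index directory, and the output prefix are placeholders.
star_index_and_align() {
    genome_fasta=$1
    genome_dir=$2
    reads=$3
    threads=${4:-4}
    mkdir -p "$genome_dir"
    # Stage 1: build the genome index (memory intensive).
    STAR --runMode genomeGenerate --runThreadN "$threads" \
         --genomeDir "$genome_dir" --genomeFastaFiles "$genome_fasta"
    # Stage 2: align reads; outputs are written with the given prefix.
    STAR --runThreadN "$threads" --genomeDir "$genome_dir" \
         --readFilesIn "$reads" --outFileNamePrefix aligned_
}

# Usage (after `ml biocontainers star`):
#   star_index_and_align genome.fa star_index reads.fq 8
```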
Parallel Capabilities: Single core default, Multithreading options supported.
Subread
Subread acts as a general alignment tool which can align both genomic DNA and RNA-sequencing reads. It can also discover genomic mutations such as indels and structural variants. When multithreading, the maximum number of usable threads is 32.
Note that several other tools load with this module; they can be listed by running the following command:
report_subtools.sh subread
For additional information and examples please see the documentation
ml biocontainers subread
subread-buildindex -o <index name> <list of ref.fa>
subread-align -i <index name> -r <reads.txt.gz> -o <results.bam>
Parallel Capabilities: Single core default, Multithreading options supported.
trim-galore
Trim-galore is a wrapper script that automates quality and adapter trimming as well as quality control, with some added functionality to remove biased methylation positions for RRBS sequencing files.
For additional information please see the documentation
ml biocontainers trim-galore
trim_galore -a AAAAAAAAA input.fa
Parallel Capabilities: Single core default, Multithreading options supported.
Trimmomatic
Trimmomatic is capable of performing a variety of trimming tasks for Illumina-generated paired-end and single-end data. If no thread count is provided it is determined automatically. The available commands include:
Command | Description |
---|---|
illuminaclip | cut adapter and other illumina-specific sequences from the read |
slidingwindow | perform a sliding window trimming, cutting once the average quality within the window falls below a threshold |
leading | cut bases off the start of a read, if below a threshold quality |
trailing | cut bases off the end of a read, if below a threshold quality |
crop | cut the read to a specified length |
headcrop | cut the specified number of bases from the start of the read |
minlen | drop the read if it is below a specified length |
tophred33 | convert quality scores to phred-33 |
tophred64 | convert quality scores to phred-64 |
For additional information please see the documentation
ml biocontainers trimmomatic
trimmomatic SE -threads 20 -phred33 -trimlog log.txt input.fq output.fq SLIDINGWINDOW:4:15 MINLEN:36
Parallel Capabilities: Some commands have Multithreading options supported. Please see documentation.
trinity
Trinity allows for the efficient and robust de novo reconstruction of transcriptomes from RNA-sequencing data. It requires you to define the sequence file type, left and right read files, number of CPU cores, and max memory. It outputs to a `Trinity.fasta` file. Note that the inchworm stage is capped at 6 threads.
Note that several other tools load with this module; they can be listed by running the following command:
report_subtools.sh trinity
For additional information and examples please see the documentation
ml biocontainers trinity
Trinity --seqType fa --left left_1.fa --right right_2.fa --CPU 12 --max_memory 10G
Parallel Capabilities: Single core default, Multithreading options supported.
VCFtools
VCFtools is a program package designed for manipulating VCF files. It is capable of filtering out specific variants, comparing files, summarizing variants, file conversions, and validation/merging. As such input depends on the usage.
For additional information please see the documentation
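As one example of the filtering use case, the sketch below keeps only sites at or above a minor allele frequency threshold and writes them to a new VCF file. The function name, file names, and the default threshold of 0.05 are placeholders chosen for illustration.

```shell
# Keep only sites with a minor allele frequency at or above a threshold.
# File names and the default threshold are placeholders.
filter_by_maf() {
    in_vcf=$1
    out_prefix=$2
    min_maf=${3:-0.05}
    # --recode writes the filtered sites to <out_prefix>.recode.vcf;
    # --recode-INFO-all keeps the INFO fields in that file.
    vcftools --vcf "$in_vcf" --maf "$min_maf" \
             --recode --recode-INFO-all --out "$out_prefix"
}

# Usage (after `ml biocontainers vcftools`):
#   filter_by_maf input.vcf filtered 0.01
```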
Parallel Capabilities: Single core only.
VSEARCH
VSEARCH is a tool capable of handling a variety of tasks ranging from chimera detection to global alignment searches. As such the input of this tool varies depending upon the usage.
For additional information and examples please see the documentation
Parallel Capabilities: Single core default, Multithreading options supported.