Diamond

Diamond is a sequence aligner for aligning protein or translated DNA sequences against a protein reference database such as NR. To begin, build a binary protein database from a reference FASTA file with diamond makedb. You can then query FASTA files against that database, using blastp for protein queries or blastx for DNA queries that should be translated. By default it runs in a sensitivity mode tuned for hits with >70% sequence identity and writes the results to the output .tsv file. Sensitivity can be raised with command options such as --sensitive. By default it also detects and uses all available virtual cores on the machine.
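As a rough illustration of working with the output .tsv file, the sketch below keeps only hits at or above a chosen percent identity. It assumes the file is in Diamond's default BLAST-style tabular format, where the third column is the percent identity (pident). The function name filter_by_identity is hypothetical and not part of Diamond.

```shell
# Hypothetical helper, not part of Diamond.
# Assumes the default tabular output, where column 3 is percent identity.
filter_by_identity() {
    # $1 = tabular output file, $2 = minimum percent identity
    # Adding 0 forces a numeric comparison on both sides.
    awk -F '\t' -v min="$2" '$3 + 0 >= min + 0' "$1"
}
```

For example, filter_by_identity output.tsv 90 > output.id90.tsv would keep hits with at least 90% identity. A similar cut-off can also be applied at search time with Diamond's --id option, which avoids writing the weaker hits in the first place.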

It can also cluster proteins in a similar way to CD-HIT and UCLUST: based on the input criteria, it finds a set of centroid (representative) sequences and assigns each input sequence to the cluster of exactly one representative.

For additional information, see the Diamond documentation.

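# load the module
ml biocontainers diamond
# build a binary protein database from a reference FASTA file
diamond makedb --in <referencedb> -d <dbname>
# align protein queries against the database
diamond blastp -q <query.fa> -d <dbname> -o <output.tsv> --<optional arguments>
# clustering
diamond cluster -d <query.fa> -o <output.tsv> --<optional arguments>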
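
To see how the input sequences were distributed across representatives, the clustering output can be summarized with a small script. The sketch below assumes the output is a two-column, tab-separated file with the representative in the first column and a member sequence in the second. The function name cluster_sizes is hypothetical.

```shell
# Hypothetical helper, not part of Diamond.
# Assumes a two-column file: representative<TAB>member.
cluster_sizes() {
    # $1 = clustering output file
    # Count the lines per representative, then list the largest clusters first
    # (ties are ordered by representative name).
    awk -F '\t' '{ n[$1]++ } END { for (r in n) print r "\t" n[r] }' "$1" |
        sort -t "$(printf '\t')" -k2,2nr -k1,1
}
```

Running cluster_sizes output.tsv prints one line per representative followed by the number of lines assigned to it, which gives a quick check on whether the chosen clustering criteria are too strict or too loose.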

Parallel Capabilities: Multithreading is supported. The number of threads can be set with --threads (-p).
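
On a shared node it may be preferable to match the thread count to the cores actually allocated to the job rather than relying on automatic detection. The sketch below assumes a Slurm scheduler that sets SLURM_CPUS_PER_TASK (an assumption about the cluster, not something stated above) and falls back to nproc otherwise. The function name pick_threads is hypothetical.

```shell
# Hypothetical helper, not part of Diamond.
# Assumes Slurm exports SLURM_CPUS_PER_TASK inside a job; falls back to nproc.
pick_threads() {
    if [ -n "${SLURM_CPUS_PER_TASK:-}" ]; then
        echo "$SLURM_CPUS_PER_TASK"
    else
        nproc
    fi
}
```

It could then be used as: diamond blastp -q <query.fa> -d <dbname> -o <output.tsv> --threads "$(pick_threads)"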