Displaying sequence statistics for next generation sequencing

Works with large fasta, fastq and SAM/BAM files.

Version 1.5.1 - still *Beta*

Posted on July 1st 2015 by Timo Lassmann

Version 1.5.1 addresses a number of bug reports. The most prominent changes include:

1) Added a test suite to verify the output on a variety of SAM files.

*Beta* Version 1.5

Posted on August 14th 2014 by Timo Lassmann

Version 1.5 addresses several issues with the now four-year-old original version of SAMStat. The most prominent changes include:

1) Better support for long reads. Version 1.5 uses a simple hidden Markov model to calculate the position-specific nucleotide over-representation profiles.

2) Better visualization using the excellent Chartjs library for drawing plots.

Version 1.09

Posted on July 8th 2013 by Timo Lassmann

Version 1.09: fixed compilation issues on Ubuntu Linux.

Posted on March 23rd 2011 by Timo Lassmann

Version 1.08 allows users to print a summary over multiple input files. Additionally, several minor bugs were fixed.


Posted on September 21st 2010 by Timo Lassmann

Next generation sequencing is being applied to understand individual variation, the RNA output of a cell and epigenetic regulation. The millions of sequenced reads are commonly stored in fasta and fastq files and, after mapping to a reference genome, in the alignment / map format (SAM/BAM). To monitor sequence quality over time and to identify problems, it is necessary to report various statistics of the reads at different stages of processing.

SAMStat is an efficient C program that quickly displays statistics of large sequence files from next generation sequencing projects. When applied to SAM/BAM files, all statistics are reported separately for unmapped, poorly mapped and accurately mapped reads. This helps to identify a variety of problems that cause poor mapping, such as remaining linker and adaptor sequences. In addition, SAMStat can be used to verify individual processing steps in large analysis pipelines.
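To make the three read categories concrete, here is a minimal sketch that counts records in a plain-text SAM stream with awk. It is not SAMStat's implementation: the function name `classify_sam` and the `MIN_MAPQ` cutoff (default 30) are assumptions chosen for illustration, and SAMStat's own binning may differ. Only the SAM fields FLAG (column 2, bit 0x4 marks an unmapped read) and MAPQ (column 5) are used.

```shell
# Sketch only: count SAM records as unmapped, poorly mapped or accurately mapped.
# MIN_MAPQ is a hypothetical cutoff, not SAMStat's own binning.
classify_sam() {
    awk -F '\t' -v min_mapq="${MIN_MAPQ:-30}" '
        /^@/                 { next }                    # skip header lines
        int($2 / 4) % 2 == 1 { n["unmapped"]++; next }   # FLAG bit 0x4 is set
        $5 + 0 >= min_mapq   { n["accurate"]++; next }   # MAPQ at or above cutoff
                             { n["poor"]++ }             # mapped, but low MAPQ
        END {
            printf "unmapped\t%d\n", n["unmapped"]
            printf "poor\t%d\n",     n["poor"]
            printf "accurate\t%d\n", n["accurate"]
        }' "$@"
}
```

The function reads SAM text from the files given as arguments, or from standard input when none are given, so a BAM file would first have to be converted to SAM text by another tool.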

SAMStat reports nucleotide composition, length distribution, base quality distribution, mapping statistics, mismatch, insertion and deletion error profiles, and di-nucleotide and 10-mer over-representation. The output is a single HTML5 page which can be interpreted by a non-specialist.
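As a rough illustration of two of these statistics, the sketch below computes nucleotide composition and a read-length distribution for a fastq stream using awk. It is a simplified stand-in, not how SAMStat works internally: the function name `fastq_stats` is hypothetical, and the code assumes strict four-line fastq records with unwrapped sequences.

```shell
# Sketch only: per-base counts and read-length histogram for a fastq stream.
# Assumes 4-line records, so the sequence is always line 2 of each record.
fastq_stats() {
    awk 'NR % 4 == 2 {
             seq = toupper($0)
             len[length(seq)]++                      # read-length histogram
             for (i = 1; i <= length(seq); i++)
                 comp[substr(seq, i, 1)]++           # per-base counts
         }
         END {
             for (b in comp) printf "base\t%s\t%d\n", b, comp[b]
             for (l in len)  printf "length\t%d\t%d\n", l, len[l]
         }' "$@" | LC_ALL=C sort
}

# Example: two reads piped in on standard input.
printf '@r1\nACGT\n+\nIIII\n@r2\nAAC\n+\nIII\n' | fastq_stats
```

Each output line is tab-separated: the statistic name, the base or read length, and its count.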


samstat <file.sam> <file.bam> <file.fa> <file.fq> ...
For each input file, SAMStat creates a single HTML page named after the input file, with a .html suffix appended.
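The naming rule described above can be sketched as a small shell helper. `report_name` is a hypothetical function written for illustration and does not run samstat; it only derives the report name from the rule as stated here, and the exact suffix may differ between SAMStat versions.

```shell
# Hypothetical helper: derive the report name from the rule stated above
# (input file name plus ".html"). It does not invoke samstat.
report_name() {
    printf '%s.html\n' "$1"
}

# List the reports one would expect after running samstat on these inputs.
for f in reads.fq aligned.bam; do
    report_name "$f"
done
```

A helper like this can be useful in a pipeline script that needs to check whether a report exists before moving on to the next step.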


tar -zxvf samstat.tgz
cd samstat
make clean
make
sudo cp samstat /usr/local/bin/