[samtools] SAMtools FAQ

Bioinformatics/Biological data analysis 2012. 12. 24. 17:32

FROM: http://sourceforge.net/apps/mediawiki/samtools/index.php?title=SAM_FAQ

About SAMtools

How to convert SAM to BAM?
If your SAM file has header @SQ lines, you may get BAM by:
samtools view -bS aln.sam > aln.bam
If not, you need to have your reference file ref.fa and then do this:
samtools faidx ref.fa
samtools view -bt ref.fa.fai aln.sam > aln.bam
The second method also works if your SAM file has @SQ lines. After conversion, you would probably like to sort and index the alignment to enable fast random access:
samtools sort aln.bam aln-sorted
samtools index aln-sorted.bam

I want to call SNPs and short indels.
For a short answer, do this:
samtools pileup -vcf ref.fa aln.bam | tee raw.txt | samtools.pl varFilter -D100 > flt.txt
awk '($3=="*" && $6>=50)||($3!="*" && $6>=20)' flt.txt > final.txt
For a long answer, see SAM protocol. Please always remember to set the maximum depth (-D) in filtering.

I want to call SNPs from one chromosome only.
Index your alignment with the `index' command and:
samtools view -u aln.bam chr10 | samtools pileup -vcf ref.fa - > chr10.raw.txt
Please read [http://samtools.sourceforge.net/pipe.shtml this page] for more information.

The integer FLAG field is not friendly to eyes.
You may get string FLAG by:
samtools view -X aln.bam | less -S
For more information, please check out:
samtools view -?

Pileup output.

This is explained in the [http://samtools.sourceforge.net/samtools.shtml manual page]. Or briefly (when you invoke pileup with the -c option):
1. reference sequence name
2. reference coordinate
3. reference base, or `*' for an indel line
4. genotype where heterozygotes are encoded in the [http://biocorp.ca/IUB.php IUB code]: M=A/C, R=A/G, W=A/T, S=C/G, Y=C/T and K=G/T; indels are indicated by, for example, */+A, -A/* or +CC/-C. There is no difference between */+A or +A/*.
5. Phred-scaled likelihood that the genotype is wrong, which is also called `consensus quality'.
6. Phred-scaled likelihood that the genotype is identical to the reference, which is also called `SNP quality'. Suppose the reference base is A and in alignment we see 17 G and 3 A. We will get a low consensus quality because it is difficult to distinguish an A/G heterozygote from a G/G homozygote. We will get a high SNP quality, though, because the evidence of a SNP is very strong.
7. [http://en.wikipedia.org/wiki/Root_mean_square root mean square] (RMS) mapping quality
8. # reads covering the position
9. read bases at a SNP line (check the manual page for more information); the 1st indel allele otherwise
10. base quality at a SNP line; the 2nd indel allele otherwise
11. indel line only: # reads directly supporting the 1st indel allele
12. indel line only: # reads directly supporting the 2nd indel allele
13. indel line only: # reads supporting a third indel allele
If pileup is invoked without `-c', indel lines and columns between 3 and 7 inclusive will not be outputted.

I see `*' in the pileup sequence column. What are they?
A star at the sequence column represents a deletion. It is a place holder to make sure the number of bases at that column matches the read depth column. Simply ignore `*' if you do not use this information.

Does samtools generate the consensus sequence like Maq?

Yes. Try this:
samtools pileup -cf ref.fa aln.bam | samtools.pl pileup2fq -D100 > cns.fastq
Again, remember to set -D according to your read depth. Note that pileup2fq applies fewer filters in comparison to varFilter, and you may see tiny inconsistency between the two outputs.

I want to get `unique' alignments from SAM/BAM.
We prefer to say an alignment is `reliable' rather than `unique' as `uniqueness' is not well defined in general cases. You can get reliable alignments by setting a threshold on mapping quality:
samtools view -bq 1 aln.bam > aln-reliable.bam
You may want to set a more stringent threshold to get more reliable alignments.

In repetitive regions, SAMtools call all bases as 'A' although there are no 'A' bases in reads.
This is due to a floating underflow in the MAQ SNP calling model used by default and only happens in repetitive regions. These calls are always filtered out. However, if you are uncomfortable with this, you may use the simplified SOAPsnp model with:
samtools -avcf ref.fa aln.bam > raw.txt
The MAQ model and SOAPsnp model usually deliver very similar SNP calls.

How are SNPs and indels called and filtered by SAMtools?
By default, SNPs are called with a Bayesian model identical to the one used in MAQ. A simplified SOAPsnp model is implemented, too. Indels are called with a simple Bayesian model. The caller does local realignment to recover indels that occur at the end of a read but appear to be contiguous mismatches. For an example, see [http://samtools.sourceforge.net/images/seq2-156.png this picture].

The varFilter filters SNPs/indels in the following order:
* d: low depth
* D: high depth
* W: too many SNPs in a window (SNP only)
* G: close to a high-quality indel (SNP only)
* Q: low root-mean-square (RMS) mapping quality (SNP only)
* g: close to another indel with more evidence (indel only)
The first letter indicates the reason why SNPs/indels are filtered when you invoke varFilter with the `-p' option. A SNP/indel filtered by a rule higher in the list will not be tested against other rules.

'Bioinformatics > Biological data analysis' 카테고리의 다른 글

[TMAP]TMAP설치 (0)	2013.02.04
[samtools] samtools sam bam (0)	2012.12.24
[bam] MD tag and cigar (0)	2012.12.24
[python] a method to reduce ID length using ascii value (0)	2012.12.24
[python] decimal to binary (0)	2012.12.24

Posted by 옥탑방람보

일	월	화	수	목	금	토
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

옥탑방람보

[samtools] SAMtools FAQ

'Bioinformatics > Biological data analysis' 카테고리의 다른 글

카테고리

태그목록

최근에 올라온 글

최근에 달린 댓글

최근에 받은 트랙백

글 보관함

달력

링크

티스토리툴바


	by 옥탑방람보