reference site: http://samtools.sourceforge.net/
manual : http://samtools.sourceforge.net/SAM1.pdf

The Samtools is an essential tool to handle files with sam (bam) format.

99 (decimal) -> 1100011 (binary) : paired, proper pair, mapped, forward, mate reverse .... (Watch the number one by one by reverse order)

'Bioinformatics > Biological data analysis' 카테고리의 다른 글

[TMAP]TMAP설치  (0) 2013.02.04
[samtools] SAMtools FAQ  (0) 2012.12.24
[bam] MD tag and cigar  (0) 2012.12.24
[python] a method to reduce ID length using ascii value  (0) 2012.12.24
[python] decimal to binary  (0) 2012.12.24
Posted by 옥탑방람보
,

FROM: http://sourceforge.net/apps/mediawiki/samtools/index.php?title=SAM_FAQ

About SAMtools

How to convert SAM to BAM?
If your SAM file has header @SQ lines, you may get BAM by:
samtools view -bS aln.sam > aln.bam
If not, you need to have your reference file ref.fa and then do this:
samtools faidx ref.fa
samtools view -bt ref.fa.fai aln.sam > aln.bam
The second method also works if your SAM file has @SQ lines. After conversion, you would probably like to sort and index the alignment to enable fast random access:
samtools sort aln.bam aln-sorted
samtools index aln-sorted.bam

I want to call SNPs and short indels.
For a short answer, do this:
samtools pileup -vcf ref.fa aln.bam | tee raw.txt | samtools.pl varFilter -D100 > flt.txt
awk '($3=="*" && $6>=50)||($3!="*" && $6>=20)' flt.txt > final.txt
For a long answer, see SAM protocol. Please always remember to set the maximum depth (-D) in filtering.

I want to call SNPs from one chromosome only.
Index your alignment with the `index' command and:
samtools view -u aln.bam chr10 | samtools pileup -vcf ref.fa - > chr10.raw.txt
Please read [http://samtools.sourceforge.net/pipe.shtml this page] for more information.

The integer FLAG field is not friendly to eyes.
You may get string FLAG by:
samtools view -X aln.bam | less -S
For more information, please check out:
samtools view -?


Pileup output.

This is explained in the [http://samtools.sourceforge.net/samtools.shtml manual page]. Or briefly (when you invoke pileup with the -c option):
1. reference sequence name
2. reference coordinate
3. reference base, or `*' for an indel line
4. genotype where heterozygotes are encoded in the [http://biocorp.ca/IUB.php IUB code]: M=A/C, R=A/G, W=A/T, S=C/G, Y=C/T and K=G/T; indels are indicated by, for example, */+A, -A/* or +CC/-C. There is no difference between */+A or +A/*.
5. Phred-scaled likelihood that the genotype is wrong, which is also called `consensus quality'.
6. Phred-scaled likelihood that the genotype is identical to the reference, which is also called `SNP quality'. Suppose the reference base is A and in alignment we see 17 G and 3 A. We will get a low consensus quality because it is difficult to distinguish an A/G heterozygote from a G/G homozygote. We will get a high SNP quality, though, because the evidence of a SNP is very strong.
7. [http://en.wikipedia.org/wiki/Root_mean_square root mean square] (RMS) mapping quality
8. # reads covering the position
9. read bases at a SNP line (check the manual page for more information); the 1st indel allele otherwise
10. base quality at a SNP line; the 2nd indel allele otherwise
11. indel line only: # reads directly supporting the 1st indel allele
12. indel line only: # reads directly supporting the 2nd indel allele
13. indel line only: # reads supporting a third indel allele
If pileup is invoked without `-c', indel lines and columns between 3 and 7 inclusive will not be outputted.

I see `*' in the pileup sequence column. What are they?
A star at the sequence column represents a deletion. It is a place holder to make sure the number of bases at that column matches the read depth column. Simply ignore `*' if you do not use this information.


Does samtools generate the consensus sequence like Maq?

Yes. Try this:
samtools pileup -cf ref.fa aln.bam | samtools.pl pileup2fq -D100 > cns.fastq
Again, remember to set -D according to your read depth. Note that pileup2fq applies fewer filters in comparison to varFilter, and you may see tiny inconsistency between the two outputs.

I want to get `unique' alignments from SAM/BAM.
We prefer to say an alignment is `reliable' rather than `unique' as `uniqueness' is not well defined in general cases. You can get reliable alignments by setting a threshold on mapping quality:
samtools view -bq 1 aln.bam > aln-reliable.bam
You may want to set a more stringent threshold to get more reliable alignments.

In repetitive regions, SAMtools call all bases as 'A' although there are no 'A' bases in reads.
This is due to a floating underflow in the MAQ SNP calling model used by default and only happens in repetitive regions. These calls are always filtered out. However, if you are uncomfortable with this, you may use the simplified SOAPsnp model with:
samtools -avcf ref.fa aln.bam > raw.txt
The MAQ model and SOAPsnp model usually deliver very similar SNP calls.

How are SNPs and indels called and filtered by SAMtools?
By default, SNPs are called with a Bayesian model identical to the one used in MAQ. A simplified SOAPsnp model is implemented, too. Indels are called with a simple Bayesian model. The caller does local realignment to recover indels that occur at the end of a read but appear to be contiguous mismatches. For an example, see [http://samtools.sourceforge.net/images/seq2-156.png this picture].

The varFilter filters SNPs/indels in the following order:
* d: low depth
* D: high depth
* W: too many SNPs in a window (SNP only)
* G: close to a high-quality indel (SNP only)
* Q: low root-mean-square (RMS) mapping quality (SNP only)
* g: close to another indel with more evidence (indel only)
The first letter indicates the reason why SNPs/indels are filtered when you invoke varFilter with the `-p' option. A SNP/indel filtered by a rule higher in the list will not be tested against other rules.

'Bioinformatics > Biological data analysis' 카테고리의 다른 글

[TMAP]TMAP설치  (0) 2013.02.04
[samtools] samtools sam bam  (0) 2012.12.24
[bam] MD tag and cigar  (0) 2012.12.24
[python] a method to reduce ID length using ascii value  (0) 2012.12.24
[python] decimal to binary  (0) 2012.12.24
Posted by 옥탑방람보
,
Posted by 옥탑방람보
,
MD tag and cigar
10A5^AC6

REF:         ATCGTAGCTAATTTGGACATCGGT
READ:        ATCGTAGCTATTTTGG--ATCGGT
MD TAG:      10        A5   ^AC6
CIGAR:       16M             2D6M
READ:        atcGTAGCTATTTTGGATA..GGT (ATCGTAGCTATTTTGGATAAAGGT)
MD TAG:      17               C1TC3
CIGAR:       3S 16M             2N3M
READ:        ATCGTAGCTAATTTGGACATCGGT (ATCGTGGAGCTAATTTGGACATCGGT)
CIGAR:       5M   2I19M


MD TAG
The MD eld aims to achieve SNP/indel calling without looking at the reference. For example, a string `10A5^AC6' means from the leftmost reference base in the alignment, there are 10 matches followed by an A on the reference which is di erent from the aligned read base; the next 5 reference bases are matches followed by a 2bp deletion from the reference; the deleted sequence is AC; the last 6 bases are matches. The MD eld ought to match the CIGAR string.

CIGAR

M     alignment match (can be a sequence match or mismatch)
I     insertion to the reference
D     deletion from the reference
N     skipped region from the reference
S     soft clipping (clipped sequences present in SEQ)
H     hard clipping (clipped sequences NOT present in SEQ)
P     padding (silent deletion from padded reference)
=     sequence match
X     sequence mismatch

H can only be present as the rst and/or last operation.
S may only have H operations between them and the ends of the CIGAR string.
For mRNA-to-genome alignment, an N operation represents an intron. For other types of alignments, the interpretation of N is not de ned.
Sum of lengths of the M/I/S/=/X operations ought to equal the length of SEQ.
Posted by 옥탑방람보
,