reference site: http://samtools.sourceforge.net/
manual : http://samtools.sourceforge.net/SAM1.pdf

SAMtools is an essential tool for handling files in SAM (BAM) format.

99 (decimal) -> 1100011 (binary): paired, proper pair, read mapped, mate mapped, read forward, mate reverse, first in pair. (Read the bits one at a time from right to left, starting at the least significant bit.)
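The bit-by-bit reading above can be sketched in Python. This is a minimal illustration, not part of SAMtools: the bit meanings follow the SAM specification, while the names `FLAG_BITS` and `decode_flag` are made up for this example.

```python
# Bit values and meanings as defined in the SAM specification.
# FLAG_BITS and decode_flag are illustrative names, not a SAMtools API.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "secondary alignment"),
    (0x200, "fails quality checks"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_flag(flag):
    """Return the names of all bits that are set in an integer FLAG."""
    return [name for bit, name in FLAG_BITS if flag & bit]


print(format(99, "b"))   # 1100011
print(decode_flag(99))   # ['paired', 'proper pair', 'mate reverse strand', 'first in pair']
```

A bit that is not set implies the opposite state, so for FLAG 99 the read and its mate are both mapped and the read is on the forward strand.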

Posted by 옥탑방람보

FROM: http://sourceforge.net/apps/mediawiki/samtools/index.php?title=SAM_FAQ

About SAMtools

How to convert SAM to BAM?
If your SAM file has header @SQ lines, you may get BAM by:
samtools view -bS aln.sam > aln.bam
If not, you need to have your reference file ref.fa and then do this:
samtools faidx ref.fa
samtools view -bt ref.fa.fai aln.sam > aln.bam
The second method also works if your SAM file has @SQ lines. After conversion, you would probably like to sort and index the alignment to enable fast random access:
samtools sort aln.bam aln-sorted
samtools index aln-sorted.bam

I want to call SNPs and short indels.
For a short answer, do this:
samtools pileup -vcf ref.fa aln.bam | tee raw.txt | samtools.pl varFilter -D100 > flt.txt
awk '($3=="*" && $6>=50)||($3!="*" && $6>=20)' flt.txt > final.txt
For a long answer, see SAM protocol. Please always remember to set the maximum depth (-D) in filtering.

I want to call SNPs from one chromosome only.
Index your alignment with the `index' command and:
samtools view -u aln.bam chr10 | samtools pileup -vcf ref.fa - > chr10.raw.txt
Please read this page (http://samtools.sourceforge.net/pipe.shtml) for more information.

The integer FLAG field is not friendly to eyes.
You may get string FLAG by:
samtools view -X aln.bam | less -S
For more information, please check out:
samtools view -?


Pileup output.

This is explained in the manual page (http://samtools.sourceforge.net/samtools.shtml). Or briefly (when you invoke pileup with the -c option):
1. reference sequence name
2. reference coordinate
3. reference base, or `*' for an indel line
4. genotype where heterozygotes are encoded in the IUB code (http://biocorp.ca/IUB.php): M=A/C, R=A/G, W=A/T, S=C/G, Y=C/T and K=G/T; indels are indicated by, for example, */+A, -A/* or +CC/-C. There is no difference between */+A and +A/*.
5. Phred-scaled likelihood that the genotype is wrong, which is also called `consensus quality'.
6. Phred-scaled likelihood that the genotype is identical to the reference, which is also called `SNP quality'. Suppose the reference base is A and in alignment we see 17 G and 3 A. We will get a low consensus quality because it is difficult to distinguish an A/G heterozygote from a G/G homozygote. We will get a high SNP quality, though, because the evidence of a SNP is very strong.
7. root mean square (RMS) mapping quality (see http://en.wikipedia.org/wiki/Root_mean_square)
8. # reads covering the position
9. read bases at a SNP line (check the manual page for more information); the 1st indel allele otherwise
10. base quality at a SNP line; the 2nd indel allele otherwise
11. indel line only: # reads directly supporting the 1st indel allele
12. indel line only: # reads directly supporting the 2nd indel allele
13. indel line only: # reads supporting a third indel allele
If pileup is invoked without `-c', indel lines and columns 3 to 7 inclusive are not output.
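As a rough sketch of how the columns listed above map onto a record, the following Python function splits one tab-separated line of `pileup -c` output. The function name, the dictionary keys, and both sample lines are hypothetical and only illustrate the column layout described in this FAQ.

```python
def parse_pileup_line(line):
    """Split one tab-separated `pileup -c` line into named fields.

    Column order follows the list in this FAQ; the key names are made up.
    """
    f = line.rstrip("\n").split("\t")
    rec = {
        "chrom": f[0],
        "pos": int(f[1]),
        "ref": f[2],                 # '*' marks an indel line
        "genotype": f[3],
        "consensus_qual": int(f[4]),
        "snp_qual": int(f[5]),
        "rms_mapq": int(f[6]),
        "depth": int(f[7]),
    }
    if rec["ref"] == "*":
        # Indel line: two alleles, then the supporting read counts (columns 11-13).
        rec["allele1"], rec["allele2"] = f[8], f[9]
        rec["support"] = [int(x) for x in f[10:13]]
    else:
        # SNP line: read bases and base qualities.
        rec["bases"], rec["base_quals"] = f[8], f[9]
    return rec


# Hypothetical example lines, not real SAMtools output.
snp = parse_pileup_line("seq1\t272\tT\tY\t30\t45\t60\t5\t.,C.c\tIIIII")
indel = parse_pileup_line("seq1\t300\t*\t*/+A\t40\t50\t60\t8\t*\t+A\t5\t3\t0")
print(snp["genotype"], snp["depth"])
print(indel["allele2"], indel["support"])
```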

I see `*' in the pileup sequence column. What are they?
A star at the sequence column represents a deletion. It is a place holder to make sure the number of bases at that column matches the read depth column. Simply ignore `*' if you do not use this information.


Does samtools generate the consensus sequence like Maq?

Yes. Try this:
samtools pileup -cf ref.fa aln.bam | samtools.pl pileup2fq -D100 > cns.fastq
Again, remember to set -D according to your read depth. Note that pileup2fq applies fewer filters than varFilter, so you may see minor inconsistencies between the two outputs.

I want to get `unique' alignments from SAM/BAM.
We prefer to say an alignment is `reliable' rather than `unique' as `uniqueness' is not well defined in general cases. You can get reliable alignments by setting a threshold on mapping quality:
samtools view -bq 1 aln.bam > aln-reliable.bam
You may want to set a more stringent threshold to get more reliable alignments.
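For readers working with text SAM rather than BAM, the same idea (keep alignments whose mapping quality is at least a threshold) can be sketched in Python. MAPQ is the fifth column of a SAM record; the function name and the example records below are hypothetical.

```python
def reliable_alignments(sam_lines, min_mapq=1):
    """Yield header lines and alignments with MAPQ >= min_mapq."""
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        mapq = int(line.split("\t")[4])  # MAPQ is SAM column 5
        if mapq >= min_mapq:
            yield line


# Hypothetical records for illustration only.
records = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t37\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr1\t200\t0\t4M\t*\t0\t0\tACGT\tIIII",
]
kept = list(reliable_alignments(records, min_mapq=1))
print(len(kept))
```

Raising `min_mapq` corresponds to the more stringent threshold mentioned above.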

In repetitive regions, SAMtools calls all bases as 'A' although there are no 'A' bases in reads.
This is due to a floating-point underflow in the MAQ SNP calling model used by default and only happens in repetitive regions. These calls are always filtered out. However, if you are uncomfortable with this, you may use the simplified SOAPsnp model with:
samtools pileup -avcf ref.fa aln.bam > raw.txt
The MAQ model and SOAPsnp model usually deliver very similar SNP calls.

How are SNPs and indels called and filtered by SAMtools?
By default, SNPs are called with a Bayesian model identical to the one used in MAQ. A simplified SOAPsnp model is implemented, too. Indels are called with a simple Bayesian model. The caller does local realignment to recover indels that occur at the end of a read but appear to be contiguous mismatches. For an example, see this picture (http://samtools.sourceforge.net/images/seq2-156.png).

The varFilter filters SNPs/indels in the following order:
* d: low depth
* D: high depth
* W: too many SNPs in a window (SNP only)
* G: close to a high-quality indel (SNP only)
* Q: low root-mean-square (RMS) mapping quality (SNP only)
* g: close to another indel with more evidence (indel only)
The first letter indicates the reason why SNPs/indels are filtered when you invoke varFilter with the `-p' option. A SNP/indel filtered by a rule higher in the list will not be tested against other rules.

MD tag and CIGAR

Example MD string: 10A5^AC6

REF:         ATCGTAGCTAATTTGGACATCGGT
READ:        ATCGTAGCTATTTTGG--ATCGGT
MD TAG:      10        A5   ^AC6
CIGAR:       16M             2D6M
READ:        atcGTAGCTATTTTGGATA..GGT (ATCGTAGCTATTTTGGATAAAGGT)
MD TAG:      17               C1TC3
CIGAR:       3S 16M             2N3M
READ:        ATCGTAGCTAATTTGGACATCGGT (ATCGTGGAGCTAATTTGGACATCGGT)
CIGAR:       5M   2I19M




MD TAG
The MD field aims to achieve SNP/indel calling without looking at the reference. For example, a string `10A5^AC6' means that, from the leftmost reference base in the alignment, there are 10 matches followed by an A on the reference which is different from the aligned read base; the next 5 reference bases are matches, followed by a 2bp deletion from the reference; the deleted sequence is AC; the last 6 bases are matches. The MD field ought to match the CIGAR string.
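The reading of `10A5^AC6' described above can be reproduced with a small tokenizer. This is a minimal sketch, assuming the MD string contains only run lengths, single mismatched reference bases, and `^`-prefixed deletions; `parse_md` and `md_reference_span` are illustrative names, not part of any library.

```python
import re

# A number is a run of matches, ^BASES is a deletion from the reference,
# and a single letter is a reference base that differs from the read.
MD_TOKEN = re.compile(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])")


def parse_md(md):
    """Turn an MD string into a list of (operation, value) pairs."""
    ops = []
    for num, deleted, mismatch in MD_TOKEN.findall(md):
        if num:
            ops.append(("match", int(num)))
        elif deleted:
            ops.append(("deletion", deleted))
        else:
            ops.append(("mismatch", mismatch))
    return ops


def md_reference_span(md):
    """Number of reference bases covered by the MD string."""
    total = 0
    for op, value in parse_md(md):
        total += value if op == "match" else len(value)
    return total


print(parse_md("10A5^AC6"))
# [('match', 10), ('mismatch', 'A'), ('match', 5), ('deletion', 'AC'), ('match', 6)]
print(md_reference_span("10A5^AC6"))  # 24
```

The span of 24 equals the length of the REF sequence in the example, which is the sense in which the MD string and the alignment have to agree.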

CIGAR

M     alignment match (can be a sequence match or mismatch)
I     insertion to the reference
D     deletion from the reference
N     skipped region from the reference
S     soft clipping (clipped sequences present in SEQ)
H     hard clipping (clipped sequences NOT present in SEQ)
P     padding (silent deletion from padded reference)
=     sequence match
X     sequence mismatch

H can only be present as the first and/or last operation.
S may only have H operations between them and the ends of the CIGAR string.
For mRNA-to-genome alignment, an N operation represents an intron. For other types of alignments, the interpretation of N is not defined.
Sum of lengths of the M/I/S/=/X operations ought to equal the length of SEQ.
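The length rule above can be checked with a few lines of Python. This is a sketch under the assumption that the CIGAR string is well formed; the function names are made up for illustration. The set of operations that consume reference bases (M, D, N, =, X) comes from the SAM specification rather than from the text above.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def query_length(cigar):
    """Sum of M/I/S/=/X lengths; should equal len(SEQ)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


def reference_length(cigar):
    """Sum of M/D/N/=/X lengths; the span covered on the reference."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


# The reads from the example above, with gaps removed.
print(query_length("16M2D6M"), len("ATCGTAGCTATTTTGGATCGGT"))       # 22 22
print(query_length("5M2I19M"), len("ATCGTGGAGCTAATTTGGACATCGGT"))   # 26 26
print(reference_length("16M2D6M"), reference_length("5M2I19M"))     # 24 24
```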

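For SOLiD BAM files, the base quality is obtained by applying ord() to the quality character (for example ord('A')) and then subtracting 33. (The values range from 0 to 40.)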

QUAL: ASCII of base QUALity plus 33 (same as the quality string in the Sanger FASTQ format). A base quality is the phred-scaled base error probability which equals -10 log10 Pr{base is wrong}. This field can be a `*' when quality is not stored. If not a `*', SEQ must not be a `*' and the length of the quality string ought to equal the length of SEQ.
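The encoding described here (ASCII code minus 33, then the Phred relation to an error probability) can be illustrated as follows. This is a minimal sketch; `decode_qual` and `error_probability` are illustrative names, and the sample quality string is made up.

```python
def decode_qual(qual, offset=33):
    """Convert a QUAL string into a list of integer Phred scores."""
    return [ord(c) - offset for c in qual]


def error_probability(q):
    """Invert Q = -10 * log10(P): P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


print(decode_qual("!5I"))       # [0, 20, 40]
print(ord("A") - 33)            # 32
print(error_probability(20))    # about 0.01, a 1% chance the base is wrong
```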


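Basically, all platforms follow the concept of the phred quality score that sequencers have used for a long time. A score of 10 means a 10% error probability, 20 means 1%, and 30 means 0.1%. For example, if a sequencer is said to have reached 99.99% accuracy, it means that most of the data (reads) it produced were QV40 or higher. Each instrument derives its scores from experience(?) gained by training in advance: given its own mechanism, which signals appear in which form, and how accurate they turn out to be when a base call or color call is made. As far as I know, data from a variety of species are prepared and run on several machines of the same model to build a kind of score table.

So although the concept is the same, directly comparing QVs across different platforms is somewhat risky, and some platforms may report more favorable QVs than others. QV is an important clue for a first assessment after sequencing, but it does not show the accuracy of the final sequence. Note that, unlike other NGS platforms, SOLiD does not filter on QV and outputs everything as raw data. Complex genomes may contain regions where QV is unusually low, so the idea is to carry that information into the analysis stage rather than lose it entirely.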