< 선행설치>

1. autoconf 설치 (2.65 이상의 버전)

    다운로드 - http://ftp.gnu.org/gnu/autoconf/

    설치 - ./configure, make, make install

    (필요할경우) vi /etc/profile 에서 export PATH=/usr/local/bin:$PATH 추가

2. automake 설치

3. m4 설치

4. SSE2 지원 확인

    cat /proc/cpuinfo 시에 flags 부분에 sse2 가 있는지 확인

5. git 설치 및 TMAP 다운로드

    $> yum install git-XXX

    $> git clone git://github.com/iontorrent/TMAP.git  (방화벽 9418 오픈되어 있어야함)

    $> cd TMAP

    $> git submodule init

    $> git submodule update

<TMAP 설치>

1. 설치

    $> cd TMAP

    $> chmod +x autogen.sh

    $> chown -R user.group ../TMAP

    $root> sudo ./autogen.sh (perl모듈 사용으로 root로 해야하는 경우 있음)

    $> mkdir ~/install/TMAP_130122

    $>./configure --prefix=~/install/TMAP_130122

    $root> sudo make (perl모듈 사용으로 root로 해야하는 경우 있음)

    $> make install

 

'Bioinformatics > Biological data analysis' 카테고리의 다른 글

[samtools] samtools sam bam  (0) 2012.12.24
[samtools] SAMtools FAQ  (0) 2012.12.24
[bam] MD tag and cigar  (0) 2012.12.24
[python] a method to reduce ID length using ascii value  (0) 2012.12.24
[python] decimal to binary  (0) 2012.12.24
Posted by 옥탑방람보
,

reference site: http://samtools.sourceforge.net/
manual : http://samtools.sourceforge.net/SAM1.pdf

The Samtools is an essential tool to handle files with sam (bam) format.

99 (decimal) -> 1100011 (binary) : paired, proper pair, mapped, forward, mate reverse .... (Watch the number one by one by reverse order)

'Bioinformatics > Biological data analysis' 카테고리의 다른 글

[TMAP]TMAP설치  (0) 2013.02.04
[samtools] SAMtools FAQ  (0) 2012.12.24
[bam] MD tag and cigar  (0) 2012.12.24
[python] a method to reduce ID length using ascii value  (0) 2012.12.24
[python] decimal to binary  (0) 2012.12.24
Posted by 옥탑방람보
,

FROM: http://sourceforge.net/apps/mediawiki/samtools/index.php?title=SAM_FAQ

About SAMtools

How to convert SAM to BAM?
If your SAM file has header @SQ lines, you may get BAM by:
samtools view -bS aln.sam > aln.bam
If not, you need to have your reference file ref.fa and then do this:
samtools faidx ref.fa
samtools view -bt ref.fa.fai aln.sam > aln.bam
The second method also works if your SAM file has @SQ lines. After conversion, you would probably like to sort and index the alignment to enable fast random access:
samtools sort aln.bam aln-sorted
samtools index aln-sorted.bam

I want to call SNPs and short indels.
For a short answer, do this:
samtools pileup -vcf ref.fa aln.bam | tee raw.txt | samtools.pl varFilter -D100 > flt.txt
awk '($3=="*" && $6>=50)||($3!="*" && $6>=20)' flt.txt > final.txt
For a long answer, see SAM protocol. Please always remember to set the maximum depth (-D) in filtering.

I want to call SNPs from one chromosome only.
Index your alignment with the `index' command and:
samtools view -u aln.bam chr10 | samtools pileup -vcf ref.fa - > chr10.raw.txt
Please read [http://samtools.sourceforge.net/pipe.shtml this page] for more information.

The integer FLAG field is not friendly to eyes.
You may get string FLAG by:
samtools view -X aln.bam | less -S
For more information, please check out:
samtools view -?


Pileup output.

This is explained in the [http://samtools.sourceforge.net/samtools.shtml manual page]. Or briefly (when you invoke pileup with the -c option):
1. reference sequence name
2. reference coordinate
3. reference base, or `*' for an indel line
4. genotype where heterozygotes are encoded in the [http://biocorp.ca/IUB.php IUB code]: M=A/C, R=A/G, W=A/T, S=C/G, Y=C/T and K=G/T; indels are indicated by, for example, */+A, -A/* or +CC/-C. There is no difference between */+A or +A/*.
5. Phred-scaled likelihood that the genotype is wrong, which is also called `consensus quality'.
6. Phred-scaled likelihood that the genotype is identical to the reference, which is also called `SNP quality'. Suppose the reference base is A and in alignment we see 17 G and 3 A. We will get a low consensus quality because it is difficult to distinguish an A/G heterozygote from a G/G homozygote. We will get a high SNP quality, though, because the evidence of a SNP is very strong.
7. [http://en.wikipedia.org/wiki/Root_mean_square root mean square] (RMS) mapping quality
8. # reads covering the position
9. read bases at a SNP line (check the manual page for more information); the 1st indel allele otherwise
10. base quality at a SNP line; the 2nd indel allele otherwise
11. indel line only: # reads directly supporting the 1st indel allele
12. indel line only: # reads directly supporting the 2nd indel allele
13. indel line only: # reads supporting a third indel allele
If pileup is invoked without `-c', indel lines and columns between 3 and 7 inclusive will not be outputted.

I see `*' in the pileup sequence column. What are they?
A star at the sequence column represents a deletion. It is a place holder to make sure the number of bases at that column matches the read depth column. Simply ignore `*' if you do not use this information.


Does samtools generate the consensus sequence like Maq?

Yes. Try this:
samtools pileup -cf ref.fa aln.bam | samtools.pl pileup2fq -D100 > cns.fastq
Again, remember to set -D according to your read depth. Note that pileup2fq applies fewer filters in comparison to varFilter, and you may see tiny inconsistency between the two outputs.

I want to get `unique' alignments from SAM/BAM.
We prefer to say an alignment is `reliable' rather than `unique' as `uniqueness' is not well defined in general cases. You can get reliable alignments by setting a threshold on mapping quality:
samtools view -bq 1 aln.bam > aln-reliable.bam
You may want to set a more stringent threshold to get more reliable alignments.

In repetitive regions, SAMtools call all bases as 'A' although there are no 'A' bases in reads.
This is due to a floating underflow in the MAQ SNP calling model used by default and only happens in repetitive regions. These calls are always filtered out. However, if you are uncomfortable with this, you may use the simplified SOAPsnp model with:
samtools -avcf ref.fa aln.bam > raw.txt
The MAQ model and SOAPsnp model usually deliver very similar SNP calls.

How are SNPs and indels called and filtered by SAMtools?
By default, SNPs are called with a Bayesian model identical to the one used in MAQ. A simplified SOAPsnp model is implemented, too. Indels are called with a simple Bayesian model. The caller does local realignment to recover indels that occur at the end of a read but appear to be contiguous mismatches. For an example, see [http://samtools.sourceforge.net/images/seq2-156.png this picture].

The varFilter filters SNPs/indels in the following order:
* d: low depth
* D: high depth
* W: too many SNPs in a window (SNP only)
* G: close to a high-quality indel (SNP only)
* Q: low root-mean-square (RMS) mapping quality (SNP only)
* g: close to another indel with more evidence (indel only)
The first letter indicates the reason why SNPs/indels are filtered when you invoke varFilter with the `-p' option. A SNP/indel filtered by a rule higher in the list will not be tested against other rules.

'Bioinformatics > Biological data analysis' 카테고리의 다른 글

[TMAP]TMAP설치  (0) 2013.02.04
[samtools] samtools sam bam  (0) 2012.12.24
[bam] MD tag and cigar  (0) 2012.12.24
[python] a method to reduce ID length using ascii value  (0) 2012.12.24
[python] decimal to binary  (0) 2012.12.24
Posted by 옥탑방람보
,
10A5^AC6

REF:         ATCGTAGCTAATTTGGACATCGGT
READ:        ATCGTAGCTATTTTGG--ATCGGT
MD TAG:      10        A5   ^AC6
CIGAR:       16M             2D6M
READ:        atcGTAGCTATTTTGGATA..GGT (ATCGTAGCTATTTTGGATAAAGGT)
MD TAG:      17               C1TC3
CIGAR:       3S 16M             2N3M
READ:        ATCGTAGCTAATTTGGACATCGGT (ATCGTGGAGCTAATTTGGACATCGGT)
CIGAR:       5M   2I19M




MD TAG
The MD eld aims to achieve SNP/indel calling without looking at the reference. For example, a string `10A5^AC6' means from the leftmost reference base in the alignment, there are 10 matches followed by an A on the reference which is di erent from the aligned read base; the next 5 reference bases are matches followed by a 2bp deletion from the reference; the deleted sequence is AC; the last 6 bases are matches. The MD eld ought to match the CIGAR string.

CIGAR

M     alignment match (can be a sequence match or mismatch)
I     insertion to the reference
D     deletion from the reference
N     skipped region from the reference
S     soft clipping (clipped sequences present in SEQ)
H     hard clipping (clipped sequences NOT present in SEQ)
P     padding (silent deletion from padded reference)
=     sequence match
X     sequence mismatch

H can only be present as the rst and/or last operation.
S may only have H operations between them and the ends of the CIGAR string.
For mRNA-to-genome alignment, an N operation represents an intron. For other types of alignments, the interpretation of N is not de ned.
Sum of lengths of the M/I/S/=/X operations ought to equal the length of SEQ.

 

Posted by 옥탑방람보
,
def asciiID(i,one,two,three,four,five):
   if one >= len(chrtb): one=0;two+=1
   if two >= len(chrtb): two=0;three+=1
   if three >= len(chrtb): three=0;four+=1
   if four >= len(chrtb): four=0;five+=1
   pk = chrtb[five]+chrtb[four]+chrtb[three]+chrtb[two]+chrtb[one]
   one+=1
   return pk,one,two,three,four,five

chrtb=[]
for ch in range(33,127): chrtb.append( chr(ch) )
one=0;two=0;three=0;four=0;five=0

for inputInteger in range(1,3000000000):
   pk,one,two,three,four,five = asciiID(inputInteger,one,two,three,four,five)
   print inputInteger,pk


When a ID is formatted with sequencial interger, it is able to convert to characters based on ascii code. Only 5 characters are able to present 7.3 billon IDs (33~126: 94, 94*94*94*94*94).

 

Posted by 옥탑방람보
,
def decimalToBinary(i):
   b = ''
   while i > 0:
      j = i & 1
      b = str(j) + b
      i >>= 1
   return b

 

Posted by 옥탑방람보
,
def subsets(x):  
    if x == []:
        return [[]]
    else:
        s = [x]
        for elem in x:
            tmp = x[:]
            tmp.remove(elem)
            new_sub = subsets(tmp)
            for y in new_sub:  # <-- can't just make it a set(), because lists aren't hashable.
                if y not in s:
                    s.append(y)
        return s
>>> s = [1,2,3]
>>> subsets(s)
[[1, 2, 3], [2, 3], [3], [], [2], [1, 3], [1], [1, 2]]


from http://www.daniweb.com/forums/thread89048.html

 

Posted by 옥탑방람보
,
import os
os.system( 'ls -al' )

a = os.popen( 'ls -al' ).read()
print a

------------------------------------------------

Here's a summary of the ways to call external programs and the advantages and disadvantages of each:

os.system("some_command with args") passes the command and arguments to your system's shell. This is nice because you can actually run multiple commands at once in this manner and set up pipes and input/output redirection. For example,

os.system("some_command < input_file another_command > output_file")

However, while this is convenient, you have to manually handle the escaping of shell characters such as spaces, etc. On the other hand, this also lets you run commands which are simply shell commands and not actually external programs.

http://docs.python.org/lib/os-process.html

stream = os.popen("some_command with args") will do the same thing as os.system except that it gives you a file-like object that you can use to access standard input/output for that process. There are 3 other variants of popen that all handle the i/o slightly differently. If you pass everything as a string, then your command is passed to the shell; if you pass them as a list then you don't need to worry about escaping anything.
http://docs.python.org/lib/os-newstreams.html

The Popen class of the subprocess module. This is intended as a replacement for os.popen but has the downside of being slightly more complicated by virtue of being so comprehensive. For example, you'd say

print Popen("echo Hello World", stdout=PIPE, shell=True).stdout.read()
instead of
print os.popen("echo Hello World").read()

but it is nice to have all of the options there in one unified class instead of 4 different popen functions.
http://docs.python.org/lib/node528.html

The call function from the subprocess module. This is basically just like the Popen class and takes all of the same arguments, but it simply wait until the command completes and gives you the return code. For example:
return_code = call("echo Hello World", shell=True)
http://docs.python.org/lib/node529.html

The os module also has all of the fork/exec/spawn functions that you'd have in a C program, but I don't recommend using them directly.

The subprocess module should probably be what you use.

 

Posted by 옥탑방람보
,
* 1000 genome proejct
http://www.1000genomes.org/wiki/Analysis/variant-call-format
  • AA    ancestral allele
  • AC    allele count in genotypes, for each ALT allele, in the same order as listed
  • AF    allele frequency for each ALT allele in the same order as listed: use this when estimated from primary data, not called genotypes
  • AN    total number of alleles in called genotypes
  • BQ    RMS base quality at this position
  • CIGAR    cigar string describing how to align an alternate allele to the reference allele
  • DB    dbSNP membership
  • DP    combined depth across samples, e.g. DP=154
  • END    end position of the variant described in this record (esp. for CNVs)
  • H2    membership in hapmap2
  • MQ    RMS mapping quality, e.g. MQ=52
  • MQ0    Number of MAPQ == 0 reads covering this record
  • NS    Number of samples with data
  • SB    strand bias at this position
  • SOMATIC    indicates that the record is a somatic mutation, for cancer genomics
  • VALIDATED    validated by follow-up experiment

* samtools mpileup
http://samtools.sourceforge.net/mpileup.shtml

Tag Description
  • I16
    • 1 #reference Q13 bases on the forward strand 2 #reference Q13 bases on the reverse strand
    • 3 #non-ref Q13 bases on the forward strand 4 #non-ref Q13 bases on the reverse strand
    • 5 sum of reference base qualities 6 sum of squares of reference base qualities
    • 7 sum of non-ref base qualities 8 sum of squares of non-ref base qualities
    • 9 sum of ref mapping qualities 10 sum of squares of ref mapping qualities
    • 11 sum of non-ref mapping qualities 12 sum of squares of non-ref mapping qualities
    • 13 sum of tail distance for ref bases 14 sum of squares of tail distance for ref bases
    • 15 sum of tail distance for non-ref bases 16 sum of squares of tail distance for non-ref
  • INDEL    Indicating the variant is an INDEL.
  • DP    The number of reads covering or bridging POS.
  • DP4    Number of 1) forward ref alleles; 2) reverse ref; 3) forward non-ref; 4) reverse non-ref alleles, used in variant calling. Sum can be smaller than DP because low-quality bases are not counted.
  • PV4    P-values for 1) strand bias (exact test); 2) baseQ bias (t-test); 3) mapQ bias (t); 4) tail distance bias (t)
  • FQ    Consensus quality. If positive, FQ equals the phred-scaled probability of there being two or more different alleles. If negative, FQ equals the minus phred-scaled probability of all chromosomes being identical. Notably, given one sample, FQ is positive at hets and negative at homs.
  • AF1    EM estimate of the site allele frequency of the strongest non-reference allele.
  • CI95    Equal-tail (Bayesian) credible interval of the site allele frequency at the 95% level.
  • PC2    Phred-scaled probability of the alternate allele frequency of group1 samples being larger (,smaller) than of group2 samples.
  • PCHI2    Posterior weighted chi^2 P-value between group1 and group2 samples. This P-value is conservative.
  • QCHI2    Phred-scaled PCHI2
  • RP    Number of permutations yeilding a smaller PCHI2
 
Specifications for VCF format are different in Info field.

 

Posted by 옥탑방람보
,
consider embedding the octal value of the tab character, 011, inside the sed command
sed 's/,/'"$(printf '\011')"'/g' data.file

 

Posted by 옥탑방람보
,