[python] a method to reduce ID length using ascii value

Bioinformatics/Biological data analysis 2012. 12. 24. 17:29

def asciiID(i,one,two,three,four,five):
   if one >= len(chrtb): one=0;two+=1
   if two >= len(chrtb): two=0;three+=1
   if three >= len(chrtb): three=0;four+=1
   if four >= len(chrtb): four=0;five+=1
   pk = chrtb[five]+chrtb[four]+chrtb[three]+chrtb[two]+chrtb[one]
   one+=1
   return pk,one,two,three,four,five

chrtb=[]
for ch in range(33,127): chrtb.append( chr(ch) )
one=0;two=0;three=0;four=0;five=0

for inputInteger in range(1,3000000000):
   pk,one,two,three,four,five = asciiID(inputInteger,one,two,three,four,five)
   print inputInteger,pk

When a ID is formatted with sequencial interger, it is able to convert to characters based on ascii code. Only 5 characters are able to present 7.3 billon IDs (33~126: 94, 94*94*94*94*94).

'Bioinformatics > Biological data analysis' 카테고리의 다른 글

[samtools] SAMtools FAQ (0)	2012.12.24
[bam] MD tag and cigar (0)	2012.12.24
[python] decimal to binary (0)	2012.12.24
[python] universal set - computing subsets from a set (list) (0)	2012.12.24
[python] the ways to call external programs (0)	2012.12.24

Posted by 옥탑방람보

,

[python] decimal to binary

Bioinformatics/Biological data analysis 2012. 12. 24. 17:28

def decimalToBinary(i):
   b = ''
   while i > 0:
      j = i & 1
      b = str(j) + b
      i >>= 1
   return b

'Bioinformatics > Biological data analysis' 카테고리의 다른 글

[bam] MD tag and cigar (0)	2012.12.24
[python] a method to reduce ID length using ascii value (0)	2012.12.24
[python] universal set - computing subsets from a set (list) (0)	2012.12.24
[python] the ways to call external programs (0)	2012.12.24
[File Format] VCF format (0)	2012.12.24

Posted by 옥탑방람보

,

[python] universal set - computing subsets from a set (list)

Bioinformatics/Biological data analysis 2012. 12. 24. 17:28

def subsets(x):
   if x == []:
   return [[]]
   else:
   s = [x]
   for elem in x:
   tmp = x[:]
   tmp.remove(elem)
   new_sub = subsets(tmp)
   for y in new_sub: # <-- can't just make it a set(), because lists aren't hashable.
   if y not in s:
   s.append(y)
   return s
>>> s = [1,2,3]
>>> subsets(s)
[[1, 2, 3], [2, 3], [3], [], [2], [1, 3], [1], [1, 2]]

from http://www.daniweb.com/forums/thread89048.html

'Bioinformatics > Biological data analysis' 카테고리의 다른 글

[python] a method to reduce ID length using ascii value (0)	2012.12.24
[python] decimal to binary (0)	2012.12.24
[python] the ways to call external programs (0)	2012.12.24
[File Format] VCF format (0)	2012.12.24
[linux] replace comma to tab (0)	2012.12.24

Posted by 옥탑방람보

,

[python] the ways to call external programs

Bioinformatics/Biological data analysis 2012. 12. 24. 17:27

import os
os.system( 'ls -al' )
a = os.popen( 'ls -al' ).read()
print a

------------------------------------------------

Here's a summary of the ways to call external programs and the advantages and disadvantages of each:

os.system("some_command with args") passes the command and arguments to your system's shell. This is nice because you can actually run multiple commands at once in this manner and set up pipes and input/output redirection. For example,

os.system("some_command < input_file another_command > output_file")

However, while this is convenient, you have to manually handle the escaping of shell characters such as spaces, etc. On the other hand, this also lets you run commands which are simply shell commands and not actually external programs.

http://docs.python.org/lib/os-process.html

stream = os.popen("some_command with args") will do the same thing as os.system except that it gives you a file-like object that you can use to access standard input/output for that process. There are 3 other variants of popen that all handle the i/o slightly differently. If you pass everything as a string, then your command is passed to the shell; if you pass them as a list then you don't need to worry about escaping anything.
http://docs.python.org/lib/os-newstreams.html

The Popen class of the subprocess module. This is intended as a replacement for os.popen but has the downside of being slightly more complicated by virtue of being so comprehensive. For example, you'd say

print Popen("echo Hello World", stdout=PIPE, shell=True).stdout.read()
instead of
print os.popen("echo Hello World").read()

but it is nice to have all of the options there in one unified class instead of 4 different popen functions.
http://docs.python.org/lib/node528.html

The call function from the subprocess module. This is basically just like the Popen class and takes all of the same arguments, but it simply wait until the command completes and gives you the return code. For example:
return_code = call("echo Hello World", shell=True)
http://docs.python.org/lib/node529.html

The os module also has all of the fork/exec/spawn functions that you'd have in a C program, but I don't recommend using them directly.

The subprocess module should probably be what you use.

'Bioinformatics > Biological data analysis' 카테고리의 다른 글

[python] decimal to binary (0)	2012.12.24
[python] universal set - computing subsets from a set (list) (0)	2012.12.24
[File Format] VCF format (0)	2012.12.24
[linux] replace comma to tab (0)	2012.12.24
[liftover] convert hg18 to hg19 - liftOver (0)	2012.12.24

Posted by 옥탑방람보

,

[File Format] VCF format

Bioinformatics/Biological data analysis 2012. 12. 24. 17:26

* 1000 genome proejct
http://www.1000genomes.org/wiki/Analysis/variant-call-format

AA ancestral allele
AC allele count in genotypes, for each ALT allele, in the same order as listed
AF allele frequency for each ALT allele in the same order as listed: use this when estimated from primary data, not called genotypes
AN total number of alleles in called genotypes
BQ RMS base quality at this position
CIGAR cigar string describing how to align an alternate allele to the reference allele
DB dbSNP membership
DP combined depth across samples, e.g. DP=154
END end position of the variant described in this record (esp. for CNVs)
H2 membership in hapmap2
MQ RMS mapping quality, e.g. MQ=52
MQ0 Number of MAPQ == 0 reads covering this record
NS Number of samples with data
SB strand bias at this position
SOMATIC indicates that the record is a somatic mutation, for cancer genomics
VALIDATED validated by follow-up experiment

* samtools mpileup
http://samtools.sourceforge.net/mpileup.shtml

Tag Description

I16
- 1 #reference Q13 bases on the forward strand 2 #reference Q13 bases on the reverse strand
- 3 #non-ref Q13 bases on the forward strand 4 #non-ref Q13 bases on the reverse strand
- 5 sum of reference base qualities 6 sum of squares of reference base qualities
- 7 sum of non-ref base qualities 8 sum of squares of non-ref base qualities
- 9 sum of ref mapping qualities 10 sum of squares of ref mapping qualities
- 11 sum of non-ref mapping qualities 12 sum of squares of non-ref mapping qualities
- 13 sum of tail distance for ref bases 14 sum of squares of tail distance for ref bases
- 15 sum of tail distance for non-ref bases 16 sum of squares of tail distance for non-ref
INDEL Indicating the variant is an INDEL.
DP The number of reads covering or bridging POS.
DP4 Number of 1) forward ref alleles; 2) reverse ref; 3) forward non-ref; 4) reverse non-ref alleles, used in variant calling. Sum can be smaller than DP because low-quality bases are not counted.
PV4 P-values for 1) strand bias (exact test); 2) baseQ bias (t-test); 3) mapQ bias (t); 4) tail distance bias (t)
FQ Consensus quality. If positive, FQ equals the phred-scaled probability of there being two or more different alleles. If negative, FQ equals the minus phred-scaled probability of all chromosomes being identical. Notably, given one sample, FQ is positive at hets and negative at homs.
AF1 EM estimate of the site allele frequency of the strongest non-reference allele.
CI95 Equal-tail (Bayesian) credible interval of the site allele frequency at the 95% level.
PC2 Phred-scaled probability of the alternate allele frequency of group1 samples being larger (,smaller) than of group2 samples.
PCHI2 Posterior weighted chi^2 P-value between group1 and group2 samples. This P-value is conservative.
QCHI2 Phred-scaled PCHI2
RP Number of permutations yeilding a smaller PCHI2

Specifications for VCF format are different in Info field.

'Bioinformatics > Biological data analysis' 카테고리의 다른 글

[python] universal set - computing subsets from a set (list) (0)	2012.12.24
[python] the ways to call external programs (0)	2012.12.24
[linux] replace comma to tab (0)	2012.12.24
[liftover] convert hg18 to hg19 - liftOver (0)	2012.12.24
[python] Writing An Hadoop MapReduce Program In Python (0)	2012.07.27

Posted by 옥탑방람보

,

[linux] replace comma to tab

Bioinformatics/Biological data analysis 2012. 12. 24. 17:19

consider embedding the octal value of the tab character, 011, inside the sed command

sed 's/,/'"$(printf '\011')"'/g' data.file

'Bioinformatics > Biological data analysis' 카테고리의 다른 글

[python] the ways to call external programs (0)	2012.12.24
[File Format] VCF format (0)	2012.12.24
[liftover] convert hg18 to hg19 - liftOver (0)	2012.12.24
[python] Writing An Hadoop MapReduce Program In Python (0)	2012.07.27
[bam] Soft Clipping / Hard Clipping in Sequence Alignment (by Smith-Waterman dyanamic programming algorithm) (0)	2012.07.25

Posted by 옥탑방람보

,

[flex] The Solution to Flex Remoting Over SSL

Bioinformatics/Informatics 2012. 12. 24. 17:18

http://blog.crankybit.com/flex-remoting-over-ssl/

'Bioinformatics > Informatics' 카테고리의 다른 글

[oracle] installation of oracle instant client (0)	2012.12.24
[blazeds] Blazeds 를 이용하여 자바로 부터 데이터 전달 받는 방법 (0)	2012.06.13
[erwin] SQL 문으로 ERWIN 에서 ERD 그리기 (0)	2011.07.19

Posted by 옥탑방람보

,

[oracle] installation of oracle instant client

Bioinformatics/Informatics 2012. 12. 24. 17:17

1. download instant client
http://www.oracle.com/technetwork/topics/linuxx86-64soft-092277.html
select ones matched verion of oracle server. (rpm)

2.
> mkdir /usr/lib/oracle
> cd /usr/lib/oracle
> rpm -Uvh oracle-instantclient-basic-10.2.0.4-1.x86_64.rpm
> rpm -Uvh oracle-instantclient-sqlplus-10.2.0.4-1.x86_64.rpm
> vi /usr/lib/oracle/10.2.0.4/client64/tnsnames.ora
HDA =
    (DESCRIPTION =
         (ADDRESS = (PROTOCOL = TCP)(HOST =IP_ADRESS)(PORT = 1521))
         (CONNECT_DATA =
              (SERVER = DEDICATED)
              (SERVICE_NAME = hda)
         )
    )
> vi /etc/profile.d/oracle.sh
# for Oracle Instant Client
if [ -d /usr/lib/oracle ]
then
     LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/oracle/10.2.0.4/client64/lib/;export   LD_LIBRARY_PATH
     TNS_ADMIN=/usr/lib/oracle/10.2.0.4/client64; export TNS_ADMIN;
     NLS_LANG=AMERICAN_AMERICA.KO16KSC5601;export NLS_LANG;
#   PATH=$PATH:/usr/lib/oracle;export PATH;
#   SQLPATH=/usr/lib/oracle; export SQLPATH;
fi
> source /etc/profile/oracle.sh
> sqlplus64 DBNAME@HDA

== sqlldr ==
> scp orahda@111.111.111.11:/oracle/orahda/product/10.2.0/bin/sqlldr /usr/bin/
> mkdir /usr/lib/oracle/10.2.0.4/client64/rdbms/mesg/
> scp orahda@111.111.111.11:/oracle/orahda/product/10.2.0/rdbms/mesg/ulus.msb /usr/lib/oracle/10.2.0.4/client64/rdbms/mesg/
> vi /etc/profile.d/oracle.sh
ORACLE_HOME=/usr/lib/oracle/10.2.0.4/client64; export ORACLE_HOME
> source /etc/profile.d/oracle.sh

'Bioinformatics > Informatics' 카테고리의 다른 글

[flex] The Solution to Flex Remoting Over SSL (0)	2012.12.24
[blazeds] Blazeds 를 이용하여 자바로 부터 데이터 전달 받는 방법 (0)	2012.06.13
[erwin] SQL 문으로 ERWIN 에서 ERD 그리기 (0)	2011.07.19

Posted by 옥탑방람보

,

[liftover] convert hg18 to hg19 - liftOver

Bioinformatics/Biological data analysis 2012. 12. 24. 17:10

Use liftOver provided from UCSC web site.
The liftOver is designed to work between differenct assemblies of the same organism. This program does not have multiple alignment function but allows to lift genomic coordinates in the same organism. So, this requires a map.chain file which has the old genome as the target and the new genome as the query. For human genome you can get it at UCSC download page.

Usage: liftOver oldFile map.chain newFile unMapped
oldFile : BED format (http://genome.ucsc.edu/FAQ/FAQformat.html#format1) (optional)
==> It works with a input file having first 3 columns only. All positions MUST be zero base.
map.chain: (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/liftOver/)
newFile : output file name
unMapped: output file name for un-lifted positions
ex) ./liftOver hg18.positions hg18ToHg19.over.chain ./output/hg19.positions ./output/unmapped.positions

Download liftOver: (http://hgdownload.cse.ucsc.edu/admin/exe/)

'Bioinformatics > Biological data analysis' 카테고리의 다른 글

[File Format] VCF format (0)	2012.12.24
[linux] replace comma to tab (0)	2012.12.24
[python] Writing An Hadoop MapReduce Program In Python (0)	2012.07.27
[bam] Soft Clipping / Hard Clipping in Sequence Alignment (by Smith-Waterman dyanamic programming algorithm) (0)	2012.07.25
[trio analysis] de novo dominant variant (0)	2012.03.13

Posted by 옥탑방람보

,

[python] Writing An Hadoop MapReduce Program In Python

Bioinformatics/Biological data analysis 2012. 7. 27. 11:08

http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/

Writing An Hadoop MapReduce Program In Python

by Michael G. Noll on September 21, 2007 (last updated: June 17, 2012)

In this tutorial, I will describe how to write a simple MapReduce program for Hadoop in the Python programming language.

Table of Contents:

Motivation

Even though the Hadoop framework is written in Java, programs for Hadoop need not to be coded in Java but can also be developed in other languages like Python or C++ (the latter since version 0.14.1). However, the documentation and the most prominent Python example on the Hadoop home page could make you think that youmust translate your Python code using Jython into a Java jar file. Obviously, this is not very convenient and can even be problematic if you depend on Python features not provided by Jython. Another issue of the Jython approach is the overhead of writing your Python program in such a way that it can interact with Hadoop – just have a look at the example in<HADOOP_INSTALL>/src/examples/python/WordCount.py and you see what I mean. I still recommend to have at least a look at the Jython approach and maybe even at the new C++ MapReduce API called Pipes, it’s really interesting.

Having that said, the ground is prepared for the purpose of this tutorial: writing a Hadoop MapReduce program in a more Pythonic way, i.e. in a way you should be familiar with.

What we want to do

We will write a simple MapReduce program (see also Wikipedia) for Hadoop in Python but without using Jython to translate our code to Java jar files.

Our program will mimick the WordCount example, i.e. it reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occured, separated by a tab.

Note: You can also use programming languages other than Python such as Perl or Ruby with the “technique” described in this tutorial. I wrote some words about what happens behind the scenes. Feel free to correct me if I’m wrong.

Prerequisites

You should have an Hadoop cluster up and running because we will get our hands dirty. If you don’t have a cluster yet, my following tutorials might help you to build one. The tutorials are tailored to Ubuntu Linux but the information does also apply to other Linux/Unix variants.

Running Hadoop On Ubuntu Linux (Single-Node Cluster)
How to set up a single-node Hadoop cluster using the Hadoop Distributed File System (HDFS) on Ubuntu Linux

Running Hadoop On Ubuntu Linux (Multi-Node Cluster)
How to set up a multi-node Hadoop cluster using the Hadoop Distributed File System (HDFS) on Ubuntu Linux

Python MapReduce Code

The “trick” behind the following Python code is that we will use HadoopStreaming (see also the wiki entry) for helping us passing data between our Map and Reduce code via STDIN (standard input) and STDOUT (standard output). We will simply use Python’s sys.stdin to read input data and print our own output to sys.stdout. That’s all we need to do because HadoopStreaming will take care of everything else! Amazing, isn’t it? Well, at least I had a “wow” experience…

Map: mapper.py

Save the following code in the file /home/hduser/mapper.py. It will read data from STDIN (standard input), split it into words and output a list of lines mapping words to their (intermediate) counts to STDOUT (standard output). The Map script will not compute an (intermediate) sum of a word’s occurrences. Instead, it will output “<word> 1″ immediately – even though the <word> might occur multiple times in the input – and just let the subsequent Reduce step do the final sum count. Of course, you can change this behavior in your own scripts as you please, but we will keep it like that in this tutorial because of didactic reasons :-)

Make sure the file has execution permission (chmod +x /home/hduser/mapper.py should do the trick) or you will run into problems.

01
#!/usr/bin/env python 

02
  
03
import sys 

04
  
05
# input comes from STDIN (standard input) 

06
for line in sys.stdin: 

07
    # remove leading and trailing whitespace 

08
    line = line.strip() 

09
    # split the line into words 

10
    words = line.split() 

11
    # increase counters 

12
    for word in words: 

13
        # write the results to STDOUT (standard output); 

14
        # what we output here will be the input for the 

15
        # Reduce step, i.e. the input for reducer.py 

16
        # 

17
        # tab-delimited; the trivial word count is 1 

18
        print '%s\t%s' % (word, 1)

Reduce: reducer.py

Save the following code in the file /home/hduser/reducer.py. It will read the results of mapper.py from STDIN (standard input), and sum the occurrences of each word to a final count, and output its results to STDOUT (standard output).

Make sure the file has execution permission (chmod +x /home/hduser/reducer.py should do the trick) or you will run into problems.

01
#!/usr/bin/env python 

02
  
03
from operator import itemgetter 

04
import sys 

05
  
06
current_word = None

07
current_count = 0

08
word = None

09
  
10
# input comes from STDIN 

11
for line in sys.stdin: 

12
    # remove leading and trailing whitespace 

13
    line = line.strip() 

14
  
15
    # parse the input we got from mapper.py 

16
    word, count = line.split('\t', 1) 

17
  
18
    # convert count (currently a string) to int 

19
    try: 

20
        count = int(count) 

21
    except ValueError: 

22
        # count was not a number, so silently 

23
        # ignore/discard this line 

24
        continue

25
  
26
    # this IF-switch only works because Hadoop sorts map output 

27
    # by key (here: word) before it is passed to the reducer 

28
    if current_word == word: 

29
        current_count += count 

30
    else: 

31
        if current_word: 

32
            # write result to STDOUT 

33
            print '%s\t%s' % (current_word, current_count) 

34
        current_count = count 

35
        current_word = word 

36
  
37
# do not forget to output the last word if needed! 

38
if current_word == word: 

39
    print '%s\t%s' % (current_word, current_count)

Test your code (cat data | map | sort | reduce)

I recommend to test your mapper.py and reducer.py scripts locally before using them in a MapReduce job. Otherwise your jobs might successfully complete but there will be no job result data at all or not the results you would have expected. If that happens, most likely it was you (or me) who screwed up.

Here are some ideas on how to test the functionality of the Map and Reduce scripts.

 # very basic test
 hduser@ubuntu:~$ echo "foo foo quux labs foo bar quux" | /home/hduser/mapper.py
 foo     1
 foo     1
 quux    1
 labs    1
 foo     1
 bar     1
 quux    1

hduser@ubuntu:~$ echo "foo foo quux labs foo bar quux" | /home/hduser/mapper.py | sort -k1,1 | /home/hduser/reducer.py
 bar     1
 foo     3
 labs    1
 quux    2

 # using one of the ebooks as example input
 # (see below on where to get the ebooks)
 hduser@ubuntu:~$ cat /tmp/gutenberg/20417-8.txt | /home/hduser/mapper.py
 The     1
 Project 1
 Gutenberg       1
 EBook   1
 of      1
 [...]
 (you get the idea)

Running the Python Code on Hadoop

Download example input data

We will use three ebooks from Project Gutenberg for this example:

Download each ebook as text files in Plain Text UTF-8 encoding and store the files in a temporary directory of choice, for example /tmp/gutenberg.

hduser@ubuntu:~$ ls -l /tmp/gutenberg/
total 3604
-rw-r--r-- 1 hduser hadoop  674566 Feb  3 10:17 pg20417.txt
-rw-r--r-- 1 hduser hadoop 1573112 Feb  3 10:18 pg4300.txt
-rw-r--r-- 1 hduser hadoop 1423801 Feb  3 10:18 pg5000.txt
hduser@ubuntu:~$

Copy local example data to HDFS

Before we run the actual MapReduce job, we first have to copy the files from our local file system to Hadoop’s HDFS.

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls
Found 1 items
drwxr-xr-x   - hduser supergroup          0 2010-05-08 17:40 /user/hduser/gutenberg
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser/gutenberg
Found 3 items
-rw-r--r--   3 hduser supergroup     674566 2011-03-10 11:38 /user/hduser/gutenberg/pg20417.txt
-rw-r--r--   3 hduser supergroup    1573112 2011-03-10 11:38 /user/hduser/gutenberg/pg4300.txt
-rw-r--r--   3 hduser supergroup    1423801 2011-03-10 11:38 /user/hduser/gutenberg/pg5000.txt
hduser@ubuntu:/usr/local/hadoop$

Run the MapReduce job

Now that everything is prepared, we can finally run our Python MapReduce job on the Hadoop cluster. As I said above, we useHadoopStreaming for helping us passing data between our Map and Reduce code via STDIN (standard input) and STDOUT (standard output).

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar -file /home/hduser/mapper.py -mapper /home/hduser/mapper.py -file /home/hduser/reducer.py -reducer /home/hduser/reducer.py -input /user/hduser/gutenberg/* -output /user/hduser/gutenberg-output

If you want to modify some Hadoop settings on the fly like increasing the number of Reduce tasks, you can use the -Doption:

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar -D mapred.reduce.tasks=16 ...

An important note about mapred.map.tasks: Hadoop does not honor mapred.map.tasks beyond considering it a hint. But it accepts the user specified mapred.reduce.tasks and doesn’t manipulate that. You cannot force mapred.map.tasks but can specify mapred.reduce.tasks.

The job will read all the files in the HDFS directory /user/hduser/gutenberg, process it, and store the results in the HDFS directory /user/hduser/gutenberg-output. In general Hadoop will create one output file per reducer; in our case however it will only create a single file because the input files are very small.

Example output of the previous command in the console:

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar -mapper /home/hduser/mapper.py -reducer /home/hduser/reducer.py -input /user/hduser/gutenberg/* -output /user/hduser/gutenberg-output
 additionalConfSpec_:null
 null=@@@userJobConfProps_.get(stream.shipped.hadoopstreaming
 packageJobJar: [/app/hadoop/tmp/hadoop-unjar54543/]
 [] /tmp/streamjob54544.jar tmpDir=null
 [...] INFO mapred.FileInputFormat: Total input paths to process : 7
 [...] INFO streaming.StreamJob: getLocalDirs(): [/app/hadoop/tmp/mapred/local]
 [...] INFO streaming.StreamJob: Running job: job_200803031615_0021
 [...]
 [...] INFO streaming.StreamJob:  map 0%  reduce 0%
 [...] INFO streaming.StreamJob:  map 43%  reduce 0%
 [...] INFO streaming.StreamJob:  map 86%  reduce 0%
 [...] INFO streaming.StreamJob:  map 100%  reduce 0%
 [...] INFO streaming.StreamJob:  map 100%  reduce 33%
 [...] INFO streaming.StreamJob:  map 100%  reduce 70%
 [...] INFO streaming.StreamJob:  map 100%  reduce 77%
 [...] INFO streaming.StreamJob:  map 100%  reduce 100%
 [...] INFO streaming.StreamJob: Job complete: job_200803031615_0021
 [...] INFO streaming.StreamJob: Output: /user/hduser/gutenberg-output
hduser@ubuntu:/usr/local/hadoop$

As you can see in the output above, Hadoop also provides a basic web interface for statistics and information. When the Hadoop cluster is running, go to http://localhost:50030/ and browse around. Here’s a screenshot of the Hadoop web interface for the job we just ran.

A screenshot of Hadoop's web interface, showing the details of the MapReduce job we just ran.

Check if the result is successfully stored in HDFS directory /user/hduser/gutenberg-output:

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser/gutenberg-output
 Found 1 items
 /user/hduser/gutenberg-output/part-00000     <r 1>   903193  2007-09-21 13:00
 hduser@ubuntu:/usr/local/hadoop$

You can then inspect the contents of the file with the dfs -cat command:

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat /user/hduser/gutenberg-output/part-00000
 "(Lo)cra"       1
 "1490   1
 "1498," 1
 "35"    1
 "40,"   1
 "A      2
 "AS-IS".        2
 "A_     1
 "Absoluti       1
 [...]
 hduser@ubuntu:/usr/local/hadoop$

Note that in this specific output above the quote signs (“) enclosing the words have not been inserted by Hadoop. They are the result of how our Python code splits words, and in this case it matched the beginning of a quote in the ebook texts. Just inspect the part-00000 file further to see it for yourself.

Improved Mapper and Reducer code: using Python iterators and generators

The Mapper and Reducer examples above should have given you an idea of how to create your first MapReduce application. The focus was code simplicity and ease of understanding, particularly for beginners of the Python programming language. In a real-world application however, you might want to optimize your code by using Python iterators and generators (an even better introduction in PDF) as some readers have pointed out.

Generally speaking, iterators and generators (functions that create iterators, for example with Python’s yield statement) have the advantage that an element of a sequence is not produced until you actually need it. This can help a lot in terms of computational expensiveness or memory consumption depending on the task at hand.

Note: The following Map and Reduce scripts will only work “correctly” when being run in the Hadoop context, i.e. as Mapper and Reducer in a MapReduce job. This means that running the naive test “cat DATA | ./mapper.py | sort -k1,1 | ./reducer.py” will not work correctly anymore because some functionality is intentionally outsourced to Hadoop.

Precisely, we compute the sum of a word’s occurrences, e.g. (“foo”, 4), only if by chance the same word (“foo”) appears multiple times in succession. In the majority of cases, however, we let the Hadoop group the (key, value) pairs between the Map and the Reduce step because Hadoop is more efficient in this regard than our simple Python scripts.

mapper.py

01
#!/usr/bin/env python 

02
"""A more advanced Mapper, using Python iterators and generators."""

03
  
04
import sys 

05
  
06
def read_input(file): 

07
    for line in file: 

08
        # split the line into words 

09
        yield line.split() 

10
  
11
def main(separator='\t'): 

12
    # input comes from STDIN (standard input) 

13
    data = read_input(sys.stdin) 

14
    for words in data: 

15
        # write the results to STDOUT (standard output); 

16
        # what we output here will be the input for the 

17
        # Reduce step, i.e. the input for reducer.py 

18
        # 

19
        # tab-delimited; the trivial word count is 1 

20
        for word in words: 

21
            print '%s%s%d' % (word, separator, 1) 

22
  
23
if __name__ == "__main__": 

24
    main()

reducer.py

01
#!/usr/bin/env python 

02
"""A more advanced Reducer, using Python iterators and generators."""

03
  
04
from itertools import groupby 

05
from operator import itemgetter 

06
import sys 

07
  
08
def read_mapper_output(file, separator='\t'): 

09
    for line in file: 

10
        yield line.rstrip().split(separator, 1) 

11
  
12
def main(separator='\t'): 

13
    # input comes from STDIN (standard input) 

14
    data = read_mapper_output(sys.stdin, separator=separator) 

15
    # groupby groups multiple word-count pairs by word, 

16
    # and creates an iterator that returns consecutive keys and their group: 

17
    #   current_word - string containing a word (the key) 

18
    #   group - iterator yielding all ["<current_word>", "<count>"] items 

19
    for current_word, group in groupby(data, itemgetter(0)): 

20
        try: 

21
            total_count = sum(int(count) for current_word, count in group) 

22
            print "%s%s%d" % (current_word, separator, total_count) 

23
        except ValueError: 

24
            # count was not a number, so silently discard this item 

25
            pass

26
  
27
if __name__ == "__main__": 

28
    main()

'Bioinformatics > Biological data analysis' 카테고리의 다른 글

[linux] replace comma to tab (0)	2012.12.24
[liftover] convert hg18 to hg19 - liftOver (0)	2012.12.24
[bam] Soft Clipping / Hard Clipping in Sequence Alignment (by Smith-Waterman dyanamic programming algorithm) (0)	2012.07.25
[trio analysis] de novo dominant variant (0)	2012.03.13
[trio analysis] simple recessive (0)	2012.03.13

Posted by 옥탑방람보

,

'Bioinformatics'에 해당되는 글 40건

'Bioinformatics > Biological data analysis' 카테고리의 다른 글

'Bioinformatics > Biological data analysis' 카테고리의 다른 글

'Bioinformatics > Biological data analysis' 카테고리의 다른 글

'Bioinformatics > Biological data analysis' 카테고리의 다른 글

'Bioinformatics > Biological data analysis' 카테고리의 다른 글

'Bioinformatics > Biological data analysis' 카테고리의 다른 글

'Bioinformatics > Informatics' 카테고리의 다른 글

'Bioinformatics > Informatics' 카테고리의 다른 글

'Bioinformatics > Biological data analysis' 카테고리의 다른 글

Writing An Hadoop MapReduce Program In Python

by Michael G. Noll on September 21, 2007 (last updated: June 17, 2012)

Motivation

What we want to do

Prerequisites

Python MapReduce Code

Map: mapper.py

Reduce: reducer.py

Test your code (cat data | map | sort | reduce)

Running the Python Code on Hadoop

Download example input data

Copy local example data to HDFS

Run the MapReduce job

Improved Mapper and Reducer code: using Python iterators and generators

mapper.py

reducer.py

'Bioinformatics > Biological data analysis' 카테고리의 다른 글

카테고리

태그목록

최근에 올라온 글

최근에 달린 댓글

최근에 받은 트랙백

글 보관함

달력

링크

티스토리툴바