Hadoop MapReduce

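Example: counting how often each digit appears among the decimal places of an irrational number (a non-terminating, non-repeating decimal).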

Mapper function
import sys

linecount = 0
# Read input lines from stdin
for line in sys.stdin:
    # Strip whitespace from both ends of the line
    line = line.strip()

    # On the first line, drop the integer part and the decimal point
    # (the first two characters) and keep only the decimals
    if linecount == 0:
        line = line[2:]

    # Split the line into a list of single digits
    numbers = list(line)
    # Emit one tab-separated (digit, 1) pair per digit on stdout
    for number in numbers:
        print('%s\t%s' % (number, "1"))

    linecount += 1

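The mapper emits a large number of key-value pairs: each key is the digit found at one decimal position, and each value is the number of times it was seen there (always 1).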
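The reduce step is not shown in these notes, so the following is only a minimal sketch of how the mapper's pairs could be summed per digit. The function name `reduce_counts` and the `sample_pairs` list are hypothetical, chosen for illustration; the sample stands in for the mapper output on the first eight decimals of sqrt(2), "41421356".

```python
from collections import Counter


def reduce_counts(lines):
    """Sum the counts per key for tab-separated key/value lines."""
    totals = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            # Skip blank lines instead of failing on them
            continue
        key, value = line.split("\t", 1)
        totals[key] += int(value)
    return totals


# Hypothetical sample: what the mapper would emit for the decimals "41421356"
sample_pairs = ["%s\t1" % digit for digit in "41421356"]

totals = reduce_counts(sample_pairs)
for digit, total in sorted(totals.items()):
    # One tab-separated (digit, total) pair per line, like the mapper output
    print("%s\t%d" % (digit, total))
```

In a Hadoop Streaming job the same function could be fed `sys.stdin` instead of the sample list. Streaming sorts the mapper output by key before it reaches the reducer, but this sketch does not rely on that ordering because it keeps a running total for every key.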

The result is a set of key-value pairs. If sqrt2 is processed with the MapReduce job above, the result is
