I am processing a text file of 121983 lines with Hadoop, but at the MapReduce stage I ran into a strange problem.

This is my mapper function:
Code:

    #!/usr/bin/env python
    import sys
    import re

    # Match runs of ASCII letters and digits as words.
    pattern = r'\b[a-zA-Z0-9]+\b'
    numline = 0
    for line in sys.stdin:
        numline += 1
        line = line.strip()
        words = re.findall(pattern, line)
        # Emit one "word<TAB>1" record per token.
        for word in words:
            print('%s\t%s' % (word, 1))
    # After all input, emit the line count under a special key.
    print("%s\t%s" % ("num line", numline))
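The tokenizing regex can be checked in isolation. This sketch applies the same pattern to a single made-up line (the sample line is mine, not from the post) and builds the records the mapper would emit:

```python
import re

# Same token pattern as the mapper above.
pattern = r'\b[a-zA-Z0-9]+\b'

# A hypothetical input line; each matched word is emitted with a count of 1.
line = "Friends, Romans, countrymen, lend me your ears"
emitted = ['%s\t%s' % (word, 1) for word in re.findall(pattern, line)]

for record in emitted:
    print(record)
```

Note the pattern drops punctuation but also anything with apostrophes split apart ("don't" becomes "don" and "t"), which matters for word counts on real text.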
And this is my reducer function:

Code:

    #!/usr/bin/env python
    import sys

    worddict = {}
    nrow = 0
    totalwords = 0
    for line in sys.stdin:
        line = line.strip()
        word, count = line.split('\t')
        # The mapper's special record carries the input line count.
        if word == "num line":
            nrow = int(count)
            continue
        totalwords += 1
        if word not in worddict:
            worddict[word] = 1
        else:
            worddict[word] += 1

    wd_sorted = sorted(worddict.items(), key=lambda item: item[1], reverse=True)
    print("There are %s lines in the text." % nrow)
    print("The 100 most frequently used words are:")
    for wd, cnt in wd_sorted[:100]:
        print("%s\t%s" % (wd, cnt))
    print("There are %s words in the text." % totalwords)
    print("There are %s unique words in the text." % len(wd_sorted))
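As an aside, the counting loop above can be written with `collections.Counter`; this also sidesteps the `word not in worddict.keys()` membership test, which on Python 2 builds a fresh list of keys on every iteration. A sketch (the `records` list is a made-up stand-in for the reducer's stdin, not the poster's data):

```python
from collections import Counter

# Hypothetical tab-separated "word<TAB>1" records, as the mapper emits them.
records = ["apple\t1", "banana\t1", "apple\t1"]

counts = Counter()
for record in records:
    word, count = record.strip().split('\t')
    counts[word] += int(count)

# most_common() returns the (word, count) pairs sorted by count, descending.
print(counts.most_common(2))  # [('apple', 2), ('banana', 1)]
```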
This is how I run it:

Code:

    cat shakespere.txt | python WDmapper.py

I am pretty sure this isn't a limit of some numeric data type. And I am also pretty sure that the loop in the reducer function went through all 121983 iterations, because at the end the total word count (910915) comes out correct.
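When debugging Hadoop Streaming jobs locally, the usual pattern is to chain the mapper and reducer with a `sort` in between, because the reducer assumes its input arrives grouped by key, a guarantee that Hadoop's shuffle/sort phase provides and that a bare pipe does not. A minimal in-memory sketch of that three-stage pipeline, using a tiny made-up input in place of shakespere.txt:

```python
import re
from collections import Counter

# A hypothetical two-line stand-in for shakespere.txt.
text = ["to be or not to be", "that is the question"]

# Map phase: emit one (word, 1) pair per token, like WDmapper.py does.
pattern = r'\b[a-zA-Z0-9]+\b'
pairs = [(w, 1) for line in text for w in re.findall(pattern, line)]

# Shuffle/sort phase: Hadoop sorts records by key between map and reduce;
# `sort` plays this role when testing streaming scripts in a shell pipe.
pairs.sort()

# Reduce phase: sum the counts per word.
counts = Counter()
for word, one in pairs:
    counts[word] += one

print(counts.most_common(3))
```

If the reducer is ever rewritten to rely on consecutive identical keys (the usual streaming idiom), skipping the `sort` step would silently undercount, which is worth ruling out here.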
It might be some special feature of the MapReduce process, but I am a complete novice, so could someone help me with it?
Source: https://stackoverflow.com/questions/781 ... -mapreduce