I am processing a text file of 121983 lines with Hadoop, but at the MapReduce stage I ran into a strange problem.

This is my mapper function:
Code:

    #!/usr/bin/env python
    import sys
    import re

    # Match runs of ASCII letters and digits as words.
    pattern = r'\b[a-zA-Z0-9]+\b'
    numline = 0
    for line in sys.stdin:
        numline += 1
        line = line.strip()
        words = re.findall(pattern, line)
        # Emit one "word<TAB>1" record per token.
        for word in words:
            print('%s\t%s' % (word, 1))
    # After all input, emit the line count under a special key.
    print("%s\t%s" % ("num line", numline))
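The tokenizing regex can be checked in isolation. This sketch applies the same pattern to a single made-up line (the sample line is mine, not from the post) and builds the records the mapper would emit:

```python
import re

# Same token pattern as the mapper above.
pattern = r'\b[a-zA-Z0-9]+\b'

# A hypothetical input line; each matched word is emitted with a count of 1.
line = "Friends, Romans, countrymen, lend me your ears"
emitted = ['%s\t%s' % (word, 1) for word in re.findall(pattern, line)]

for record in emitted:
    print(record)
```

Note the pattern drops punctuation but also anything with apostrophes split apart ("don't" becomes "don" and "t"), which matters for word counts on real text.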
And this is my reducer function:

Code:

    #!/usr/bin/env python
    import sys

    worddict = {}
    nrow = 0
    totalwords = 0
    for line in sys.stdin:
        line = line.strip()
        word, count = line.split('\t')
        # The mapper's special record carries the input line count.
        if word == "num line":
            nrow = int(count)
            continue
        totalwords += 1
        if word not in worddict:
            worddict[word] = 1
        else:
            worddict[word] += 1

    wd_sorted = sorted(worddict.items(), key=lambda item: item[1], reverse=True)
    print("There are %s lines in the text." % nrow)
    print("The 100 most frequently used words are:")
    for wd, cnt in wd_sorted[:100]:
        print("%s\t%s" % (wd, cnt))
    print("There are %s words in the text." % totalwords)
    print("There are %s unique words in the text." % len(wd_sorted))
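As an aside, the counting loop above can be written with `collections.Counter`; this also sidesteps the `word not in worddict.keys()` membership test, which on Python 2 builds a fresh list of keys on every iteration. A sketch (the `records` list is a made-up stand-in for the reducer's stdin, not the poster's data):

```python
from collections import Counter

# Hypothetical tab-separated "word<TAB>1" records, as the mapper emits them.
records = ["apple\t1", "banana\t1", "apple\t1"]

counts = Counter()
for record in records:
    word, count = record.strip().split('\t')
    counts[word] += int(count)

# most_common() returns the (word, count) pairs sorted by count, descending.
print(counts.most_common(2))  # [('apple', 2), ('banana', 1)]
```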
This is how I run it:

Code:

    cat shakespere.txt | python WDmapper.py

I am pretty sure this isn't a limit of some numeric data type. And I am also pretty sure that the loop in the reducer function went through all 121983 iterations, because at the end the total word count (910915) comes out correct.
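When debugging Hadoop Streaming jobs locally, the usual pattern is to chain the mapper and reducer with a `sort` in between, because the reducer assumes its input arrives grouped by key, a guarantee that Hadoop's shuffle/sort phase provides and that a bare pipe does not. A minimal in-memory sketch of that three-stage pipeline, using a tiny made-up input in place of shakespere.txt:

```python
import re
from collections import Counter

# A hypothetical two-line stand-in for shakespere.txt.
text = ["to be or not to be", "that is the question"]

# Map phase: emit one (word, 1) pair per token, like WDmapper.py does.
pattern = r'\b[a-zA-Z0-9]+\b'
pairs = [(w, 1) for line in text for w in re.findall(pattern, line)]

# Shuffle/sort phase: Hadoop sorts records by key between map and reduce;
# `sort` plays this role when testing streaming scripts in a shell pipe.
pairs.sort()

# Reduce phase: sum the counts per word.
counts = Counter()
for word, one in pairs:
    counts[word] += one

print(counts.most_common(3))
```

If the reducer is ever rewritten to rely on consecutive identical keys (the usual streaming idiom), skipping the `sort` step would silently undercount, which is worth ruling out here.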
It might be some special feature of the MapReduce process, but I am a complete novice, so could someone help me with it?
Source: https://stackoverflow.com/questions/781 ... -mapreduce