In this scenario, how can I improve Python's parallel processing performance?

extend · posted 2017-11-15 10:18:57

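I am counting the lines in a log file of a bit over 800 MB. I have two versions, one single-threaded and one parallel, but the parallel one is far slower than the single-threaded one. How can I make the parallel version faster?
Could someone with experience help me analyze this?
The single-threaded version and its result: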
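import time

start_t = time.perf_counter()  # time.clock() was removed in Python 3.8

def block(file, size=65536):
    # Yield the file in fixed-size chunks until EOF.
    while True:
        nb = file.read(size)
        if not nb:
            break
        yield nb

with open(r"D:\Centos7\catalina.out", "r", encoding="UTF-8") as f:
    # Count newline characters chunk by chunk.
    print(sum(chunk.count("\n") for chunk in block(f)))
print(time.perf_counter() - start_t)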

D:\new\Python\excersise>python test4.py
404325
4.501643689510964
That takes a bit over 4 seconds.

The parallel version and its result:
D:\new\Python\excersise>python test4.py
404325 11.875990500941482

import multiprocessing as mp
import time
from functools import reduce

def run(fn):
    # fn is one batch (a list of lines); its length is that batch's line count.
    return len(fn)

if __name__ == "__main__":
    start_t = time.perf_counter()  # time.clock() was removed in Python 3.8
    fp = open(r"D:\Centos7\catalina.out", "r", encoding="UTF-8")
    ff = []
    while True:
        fb = fp.readlines(1048576)  # size hint in bytes: 64 KB * 16 = 1024 KB = 1 MB
        if not fb:
            break
        ff.append(fb)
    fp.close()
    pool = mp.Pool(16)
    # Each batch is sent to a worker process, which returns its line count.
    sum_t = reduce(lambda x, y: x + y, pool.map(run, ff))
    pool.close()
    pool.join()
    print(sum_t, time.perf_counter() - start_t)
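The result is about 12 seconds, and it stays roughly the same whether I set the pool size to 4, 8, or 16.
How can I improve the parallel efficiency?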

iacxc · posted 2018-3-3 10:52:54

Try profiling it. Most of the time may well be spent reading the file.
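One way to do that profiling is with the standard library's cProfile and pstats modules. The following is only a minimal sketch, not the original poster's script: the helper names read_chunks, count_lines and profile_run are made up for illustration, and a small temporary file stands in for the real log. It covers just the read-and-count steps in a single process, so it shows how much time goes to reading before any pool is involved.

```python
import cProfile
import io
import os
import pstats
import tempfile


def read_chunks(path, size=1048576):
    """Read the file into a list of line batches, like the parallel version does."""
    chunks = []
    with open(path, "r", encoding="UTF-8") as fp:
        while True:
            batch = fp.readlines(size)  # size is a hint in bytes, not a line count
            if not batch:
                break
            chunks.append(batch)
    return chunks


def count_lines(chunks):
    """Count lines in batches that are already in memory."""
    return sum(len(batch) for batch in chunks)


def profile_run(path):
    """Profile reading plus counting; return (line_count, report_text)."""
    profiler = cProfile.Profile()
    profiler.enable()
    total = count_lines(read_chunks(path))
    profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(10)  # ten most expensive entries
    return total, out.getvalue()


if __name__ == "__main__":
    # Hypothetical sample file; replace tmp.name with the real log path.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".log", delete=False, encoding="UTF-8"
    ) as tmp:
        tmp.write("line\n" * 1000)
    try:
        total, report = profile_run(tmp.name)
        print(total)
        print(report)
    finally:
        os.remove(tmp.name)
```

If read_chunks dominates the cumulative column in the report, the job is limited by reading and decoding the file, and adding more worker processes to count lines that are already loaded is unlikely to help.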
