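A simple multithreaded Python web scraper fails with a list index error (IndexError: list index out of range). The code and the error message are given below. Any help would be much appreciated.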
Error log:
Every time execution reaches the content = each.xpath() call inside the spider function, the following error is raised:
- Traceback (most recent call last):
- File "E:/data/C++/Python/����ѧԺ/XPath-and-multithreading-crawler_v1/Դ��/tiebaspider.py", line 50, in <module>
- results = pool.map(spider, page)
- File "E:\data\C++\Python\python2.7.9\lib\multiprocessing\pool.py", line 251, in map
- return self.map_async(func, iterable, chunksize).get()
- File "E:\data\C++\Python\python2.7.9\lib\multiprocessing\pool.py", line 558, in get
- raise self._value
- IndexError: list index out of range
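The traceback above ends inside pool.py because `pool.map` stores an exception raised in a worker thread and re-raises it from `get()` in the main thread, so the failing line inside `spider` is not shown. Below is a minimal sketch of one way to see the worker-side traceback. The wrapper `traced` and the stand-in worker `fake_spider` are hypothetical names used only for illustration; neither is part of the original code.

```python
from multiprocessing.dummy import Pool as ThreadPool
import traceback


def traced(func):
    """Wrap a worker so each call returns (arg, result, error_text)."""
    def wrapper(arg):
        try:
            return (arg, func(arg), None)
        except Exception:
            # format_exc() captures the traceback from inside the worker thread
            return (arg, None, traceback.format_exc())
    return wrapper


def fake_spider(url):
    # Stand-in for the real spider: fails on one page the same way.
    if url.endswith('pn=2'):
        return [][0]  # raises IndexError: list index out of range
    return 'ok'


if __name__ == '__main__':
    pool = ThreadPool(2)
    pages = ['http://tieba.baidu.com/p/3522395718?pn=' + str(i)
             for i in range(1, 4)]
    results = pool.map(traced(fake_spider), pages)
    pool.close()
    pool.join()
    for url, value, error in results:
        if error is not None:
            print(url)    # which page failed
            print(error)  # full traceback from the worker
```

With a wrapper like this, one failing page no longer aborts the whole `map` call, and the printed traceback points at the line inside the worker rather than at pool.py.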
The source code follows:
- #-*-coding:utf8-*-
- from lxml import etree
- from multiprocessing.dummy import Pool as ThreadPool
- import requests
- import json
- import sys
- reload(sys)
- sys.setdefaultencoding('utf-8')
- '''Delete content.txt before re-running: the file is opened in append mode, so output from earlier runs would otherwise accumulate.'''
- def towrite(contentdict):
-     f.writelines(u'回帖时间:' + str(contentdict['topic_reply_time']) + '\n')
-     f.writelines(u'回帖内容:' + unicode(contentdict['topic_reply_content']) + '\n')
-     f.writelines(u'回帖人:' + contentdict['user_name'] + '\n\n')
- def spider(url):
-     html = requests.get(url)
-     selector = etree.HTML(html.text)
-     print selector
-     content_field = selector.xpath('//div[@class="l_post j_l_post l_post_bright "]')
-     item = {}
-     for each in content_field:
-         reply_info = json.loads(each.xpath('@data-field')[0].replace('&quot', ''))
-         author = reply_info['author']['user_name']
-         reply_time = reply_info['content']['date']
-         content = each.xpath('div[@class="d_post_content_main"]/div/cc/\
-             div[@class="d_post_content j_d_post_content"]/text()')[0]
-         print (content)
-         print (reply_time)
-         print (author)
-         item['user_name'] = author
-         item['topic_reply_content'] = content
-         item['topic_reply_time'] = reply_time
-         towrite(item)
- if __name__ == '__main__':
-     pool = ThreadPool(2)
-     f = open('content.txt', 'a')
-     page = []
-     for i in range(1, 21):
-         newpage = 'http://tieba.baidu.com/p/3522395718?pn=' + str(i)
-         page.append(newpage)
-     results = pool.map(spider, page)
-     pool.close()
-     pool.join()
-     f.close()
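For node and text queries, `xpath()` in lxml returns a list, and that list is empty when nothing matches. This can happen, for example, when a reply's content div has no direct text node, and indexing the empty list with `[0]` then raises IndexError. A minimal sketch of a guard follows, using a hypothetical helper named `first_or_default`. Plain lists stand in for `xpath()` results so the snippet runs without lxml.

```python
def first_or_default(results, default=u''):
    """Return the first item of a list-like query result, or default if it is empty."""
    return results[0] if results else default


# Plain lists standing in for what each.xpath(...) might return.
matched = [u'first text node', u'second text node']
unmatched = []

print(first_or_default(matched))
print(repr(first_or_default(unmatched)))
```

In `spider`, the same idea would mean passing the result of `each.xpath(...)` to the helper instead of indexing it with `[0]`, so a reply without matching text yields an empty string rather than an exception.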