Scraping Qiushibaike with Python 2.7.6 and BeautifulSoup 4.4.1, comments included

pythonLearner · posted 2015-12-14 12:22:30

# -*- coding: utf-8 -*-
__author__ = 'KS'

import urllib2
import re
from bs4 import BeautifulSoup


def getContentOrComment(urlJoin):
    # Send a browser-like User-Agent and return the page as unicode text
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    req = urllib2.Request(url=urlJoin, headers=headers)
    response = urllib2.urlopen(req)
    content = response.read().decode('utf-8', 'ignore')
    return content


# Article list URL (takes a page number)
articleUrl = "http://www.qiushibaike.com/textnew/page/%d"
# Comment page URL (takes an article id)
commentUrl = "http://www.qiushibaike.com/article/%s"

page = 0
while True:
    getFromCustomer = raw_input("Next page? Press Enter to continue, or type 'exit' to stop\n")
    if getFromCustomer == "exit":
        break
    page += 1
    urlJoin = articleUrl % page
    print urlJoin
    articlePage = getContentOrComment(urlJoin)
    soupArticle = BeautifulSoup(articlePage, 'html.parser')
    # print soupArticle.prettify()
    articleFloor = 1
    # A string passed as attrs is matched against the element's class attribute
    for string in soupArticle.find_all(attrs="article block untagged mb15"):
        # Drop the first 11 characters of the element id to get the article id
        commentId = str(string.get('id')).strip()[11:]
        print "\n"
        print articleFloor, ".", string.find(attrs="content").get_text().strip()
        articleFloor += 1
        # Fetch the article's own page and print each comment body
        commentPage = getContentOrComment(commentUrl % commentId)
        soupComment = BeautifulSoup(commentPage, 'html.parser')
        commentFloor = 1
        for comment in soupComment.find_all(attrs="body"):
            print "  reply #%d:" % commentFloor, comment.get_text().strip()
            commentFloor += 1
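
For anyone on Python 3, here is a rough sketch of the same helper pieces. It is a port under assumptions, not the code from the post: urllib2 is replaced by urllib.request, the names build_request, fetch and comment_id_from_tag_id are made up for illustration, and the "qiushi_tag_" prefix is a guess based on the script dropping exactly 11 characters from the element id.

```python
# Python 3 sketch of the helpers (the script in the post targets Python 2).
import urllib.request

ARTICLE_URL = "http://www.qiushibaike.com/textnew/page/%d"
COMMENT_URL = "http://www.qiushibaike.com/article/%s"
USER_AGENT = "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"
# Assumed id prefix: the original script simply slices off 11 characters.
ID_PREFIX = "qiushi_tag_"


def build_request(url):
    """Build a request carrying the same User-Agent header as the original."""
    return urllib.request.Request(url=url, headers={"User-Agent": USER_AGENT})


def fetch(url):
    """Download a page and decode it as UTF-8, ignoring undecodable bytes."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read().decode("utf-8", "ignore")


def comment_id_from_tag_id(tag_id):
    """Return the part after the assumed prefix, or None if it is absent."""
    if tag_id and tag_id.startswith(ID_PREFIX):
        return tag_id[len(ID_PREFIX):]
    return None
```

Checking the prefix instead of slicing blindly means an element without an id gives None, so the loop can skip it rather than request an empty article URL. The BeautifulSoup calls themselves should carry over unchanged; only the print statements and raw_input need the usual Python 3 replacements (print() and input()).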


wangcj456 · posted 2016-8-26 15:29:43

Learned something, thanks OP.

mongo · posted 2016-8-30 16:58:56

Learned something.

