[Crawler] View Qiushibaike's Hot Jokes Forever

北极 · Posted on 2016-11-2 10:25:57

Last edited by 北极 on 2016-11-2 10:25

****************************************************************************************************
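Key points:
1. urllib2 (basic familiarity)
2. re (basic familiarity)
3. BeautifulSoup (basic familiarity)

BeautifulSoup is used so that we can still scrape the content no matter how Qiushibaike changes the structure around its span tags. Write it once and be done!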

****************************************************************************************************
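To illustrate the point above about surviving markup changes, here is a minimal sketch, assuming Python 3 and the bs4 package. It reads the text of each content div with get_text() rather than matching a fixed span pattern. SAMPLE_HTML and extract_jokes are hypothetical names made up for this example, and the sample markup is invented, not copied from the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (an assumption for illustration only).
# The second joke nests its text differently from the first.
SAMPLE_HTML = """
<div class="content"><span>First joke</span></div>
<div class="content"><p><span class="text">Second <b>joke</b></span></p></div>
<div class="author">not a joke</div>
"""


def extract_jokes(html):
    """Return the visible text of every <div class="content"> in html."""
    soup = BeautifulSoup(html, "html.parser")
    jokes = []
    for item in soup.find_all("div", attrs={"class": "content"}):
        # Join nested text fragments with a space and trim whitespace,
        # so the result does not depend on which tags wrap the text.
        text = item.get_text(" ", strip=True)
        if text:
            jokes.append(text)
    return jokes


jokes = extract_jokes(SAMPLE_HTML)
for number, joke in enumerate(jokes, start=1):
    print("NO", number, ":", joke)
```

Because only the class of the outer div is relied on, adding attributes to the span or wrapping it in extra tags does not change what is extracted.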

Code:

#!/usr/bin/python
# coding:utf-8
import urllib2
import re
from bs4 import BeautifulSoup

number = 0  # running joke counter
page = raw_input("Please input page:")
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
request = urllib2.Request('http://www.qiushibaike.com/hot/page/' + str(page) + '/?s=4915651', headers=headers)
response = urllib2.urlopen(request)
html = response.read()  # fetch the page source
soup = BeautifulSoup(html, "html.parser")
items = soup.find_all('div', attrs={"class": "content"})  # find div tags with class="content"
pattern = re.compile('<span>(.*?)</span>', re.S)  # compiled once, outside the loop
for item in items:
    number += 1
    # in Python 2, str(item) is a UTF-8 byte string, hence the decode below
    for joke in re.findall(pattern, str(item)):
        print 'NO', number, ':', joke.decode('utf-8'), '\n'
print "End..."
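The script above targets Python 2, where the request is built with urllib2. As a rough sketch only, assuming Python 3, the same request step could be written with urllib.request. build_request and fetch_html are hypothetical helper names introduced here. The URL and User-Agent are the ones from the script. fetch_html is defined but not called, since it needs network access and the page may no longer respond as it did in 2016.

```python
import urllib.request

# Same User-Agent string as in the original script.
USER_AGENT = "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"


def build_request(page):
    """Build the request for one page of the hot list without sending it."""
    url = "http://www.qiushibaike.com/hot/page/" + str(page) + "/?s=4915651"
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(page, timeout=10):
    """Send the request and return the page source as text (needs network)."""
    with urllib.request.urlopen(build_request(page), timeout=timeout) as response:
        # Assumes the page is served as UTF-8, as the original decode suggests.
        return response.read().decode("utf-8")


request = build_request(1)
print(request.full_url)
print(request.get_header("User-agent"))
```

In Python 3, raw_input becomes input and print becomes a function, so the rest of the script would need the same kind of small adjustments.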


Result:

blueelwang · Posted on 2016-11-2 19:31:24

Very practical!

sqlfeng · Posted on 2016-11-4 12:20:21

Very practical! Thanks for sharing.

whydo1 · Posted on 2016-11-4 21:04:18

Thumbs up!

aliali · Posted on 2016-11-12 18:54:00

I'll use this as a reference and try it out.

小鱼 · Posted on 2017-1-26 00:54:25

Which Python 2.x version is this?

北极 · Posted on 2017-3-23 14:50:30

小鱼 posted on 2017-1-26 00:54
Which Python 2.x version is this?

Python 2.7
