This post was last edited by 北极 on 2016-11-2 10:25
****************************************************************************************************
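Key points:
1. urllib2 (basic familiarity)
2. re (basic familiarity)
3. BeautifulSoup (basic familiarity)
BeautifulSoup is used so that the content can still be scraped however Qiushibaike changes the structure around the span, which saves rewriting the scraper every time the page changes.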
****************************************************************************************************
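To make the last point concrete, here is a minimal sketch of pulling the text out with BeautifulSoup alone, without a regular expression. It assumes Python 3 and the bs4 package. The sample HTML and the function name `extract_contents` are made up for illustration; the sample only imitates the `div class="content"` / `span` layout that the script in this post relies on, and the real page may look different.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example. It imitates the layout
# the post relies on: each item sits in a <div class="content">.
SAMPLE_HTML = """
<div class="content"><span>first joke</span></div>
<div class="content"><p><span>second <b>joke</b></span></p></div>
<div class="other"><span>not a joke</span></div>
"""


def extract_contents(html):
    """Return the visible text of every <div class="content">."""
    soup = BeautifulSoup(html, "html.parser")
    items = soup.find_all("div", attrs={"class": "content"})
    # get_text() joins all text nodes inside the div, so extra wrapper
    # tags around or inside the span do not matter here.
    return [item.get_text().strip() for item in items]


if __name__ == "__main__":
    for number, text in enumerate(extract_contents(SAMPLE_HTML), start=1):
        print("NO", number, ":", text)
```

Because `get_text()` ignores the tags inside the div, the second item is still extracted even though its span is wrapped in a `p` and contains a `b`. A pattern such as `<span>(.*?)</span>` would return the inner `<b>` markup along with the text.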
Code:
#!/usr/bin/python
# coding:utf-8
import urllib2
import re
from bs4 import BeautifulSoup

number = 0  # running index of the printed items
page = raw_input("Please input page:")
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
request = urllib2.Request('http://www.qiushibaike.com/hot/page/' + str(page) + '/?s=4915651', headers=headers)
response = urllib2.urlopen(request)
html = response.read()  # fetch the page source
soup = BeautifulSoup(html, "html.parser")
items = soup.find_all('div', attrs={"class": "content"})  # find every div whose class includes "content"
pattern = re.compile('<span>(.*?)</span>', re.S)  # compiled once, outside the loop
for item in items:
    number += 1
    texts = re.findall(pattern, str(item))
    for text in texts:  # renamed from "list" so the built-in is not shadowed
        print 'NO', number, ':', text.decode('utf-8'), '\n'

print "End..."
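The script above is Python 2 (`urllib2`, `raw_input`, the `print` statement). For readers on Python 3, the request and regex steps could be sketched roughly as follows. This is an assumption-laden port, not the original author's code: the helper names `build_request`, `fetch_html` and `extract_spans` are hypothetical, the page is assumed to be UTF-8 as in the original `decode('utf-8')` call, and nothing here checks whether the URL still serves the same markup.

```python
import re
import urllib.request

USER_AGENT = "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"
# re.S lets "." match newlines, so a span spread over several lines still matches.
SPAN_PATTERN = re.compile(r"<span>(.*?)</span>", re.S)


def build_request(page):
    """Build the request for one page of the hot list, with a browser-like User-Agent."""
    url = "http://www.qiushibaike.com/hot/page/" + str(page) + "/?s=4915651"
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(page):
    """Download the page source (performs a real network call)."""
    with urllib.request.urlopen(build_request(page)) as response:
        return response.read().decode("utf-8")  # assumes a UTF-8 page


def extract_spans(fragment):
    """Return the text between each bare <span>...</span> pair in an HTML fragment."""
    return [text.strip() for text in SPAN_PATTERN.findall(fragment)]


if __name__ == "__main__":
    page = input("Please input page:")
    for number, text in enumerate(extract_spans(fetch_html(page)), start=1):
        print("NO", number, ":", text)
    print("End...")
```

Unlike the original, this sketch runs the pattern over the whole page instead of over each `div class="content"` found by BeautifulSoup, so it would also pick up bare spans outside the content blocks.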
Result: