Code:
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import json

url = 'http://www.qiushibaike.com/'
wb_data = requests.get(url)
soup = BeautifulSoup(wb_data.text, 'lxml')

# Select the author name, age, post text, and thumbnail image of each post
names = soup.select('h2')
ages = soup.select('div.author.clearfix > div.articleGender')
contents = soup.select('div.content > span')
pics = soup.select('div.thumb > a > img')

# Combine the parallel lists into one dict per post
for name, age, content, pic in zip(names, ages, contents, pics):
    data = {
        'name': name.get_text(),
        'age': age.get_text(),
        'content': content.get_text(),
        'pic': pic.get('src')
    }
    print data
Output:
{'content': u'\u5728\u5355\u4f4d\u63a5\u5230\u8001\u5a46\u7535\u8bdd:\u201c\u5bb6\u91cc\u7684\u6c34\u7ba1\u574f\u4e86\uff0c\u6211\u627exx(\u5c0f\u8205\u5b50)\u8fc7\u6765\u4fee\u4e0a\u4e86\uff0c\u8fd9\u8981\u662f\u627e\u4e13\u95e8\u7684\u6c34\u7535\u5de5\u5f97100\u5757\u94b1\u3002\u201d\u6211\u8bf4:\u201c\u662f\u554a\uff0c\u90a3\u4f60\u4eec\u5728\u5bb6\u5403\u7684\u996d\u554a\uff1f\u201d\u201c\u6ca1\u6709\uff0c\u6211\u5e26\u4ed6\u4e0b\u7684\u996d\u5e97\uff0c\u82b1\u4e86\u4e00\u767e\u591a\uff0c\u53c8\u7ed9\u4ed6\u4e70\u4e86\u4e00\u6761\u70df\u4e8c\u767e\u591a\uff01\u201d', 'age': u'21', 'pic': 'http://pic.qiushibaike.com/system/pictures/11870/118705213/medium/app118705213.jpg', 'name': u'\u8303\u95f2\u32a3'}
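This is my code, but the scraped Chinese characters are not displayed correctly in the output. How can I get the Chinese characters to print properly?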
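One likely explanation: in Python 2, printing a dict shows the repr() of each value, so unicode strings appear as \uXXXX escapes even though the scraped text itself is intact. A minimal sketch of two possible workarounds follows. It assumes a terminal that can display UTF-8, and the `data` dict is a shortened sample taken from the output above rather than a live scrape.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function  # lets the same code run on Python 2 and 3
import json

# Sample record (assumption: shortened from the output above, not a live scrape)
data = {'name': u'\u8303\u95f2', 'age': u'21'}

# Option 1: serialize with ensure_ascii=False so non-ASCII characters
# are kept as-is instead of being written as \uXXXX escapes
readable = json.dumps(data, ensure_ascii=False)
print(readable)

# Option 2: print each value on its own; print converts a string with
# str()/unicode() rather than repr(), so the characters display directly
for key, value in data.items():
    print(key, value)
```

Either option could replace the `print data` line inside the loop. The json route keeps one record per line, which may be convenient if the results are later written to a file.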