How do I write scraped content to a txt file? Any help from the experts would be much appreciated.

淼妙 · posted on 2018-1-15 21:33:09

I'm trying to scrape the content of the Python entry on Baidu Baike, but I'm stuck at the last step, writing it to a txt file. Learning to program is hard for a Python beginner, so I'd appreciate help with this small problem. Here is the code:
import urllib2
import re
from bs4 import BeautifulSoup
url='https://baike.baidu.com/item/Python/407313?fr=aladdin'
f=open('python.text','w')
webpage=urllib2.urlopen(url).read()
soup=BeautifulSoup(webpage,'html.parser',from_encoding='utf-8')
ds=soup.find_all('div')
for content in ds:
    if content.get('class')==['para']:
        f.write(content.get_text())

Spyder reports the following error: runfile('C:/Users/123/Desktop/爬虫练习/1.14.py', wdir='C:/Users/123/Desktop/爬虫练习')
Traceback (most recent call last):

  File "<ipython-input-10-905b2147af03>", line 1, in <module>
runfile('C:/Users/123/Desktop/爬虫练习/1.14.py', wdir='C:/Users/123/Desktop/爬虫练习')

  File "C:\Users\123\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 585, in runfile
execfile(filename, namespace)

  File "C:/Users/123/Desktop/爬虫练习/1.14.py", line 25, in <module>
f.write(content.get_text())

UnicodeEncodeError: 'gbk' codec can't encode character u'\xa0' in position 9: illegal multibyte sequence

剑心无痕 · posted on 2018-1-16 09:49:49

\xa0 is a Latin-1 character; it is just the &nbsp; (non-breaking space) that is so common in HTML. Replacing it is enough:
f.write(content.get_text().replace('\xa0', ''))
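
An alternative that keeps the character instead of deleting it is to open the file with an explicit encoding, so the platform default (gbk in the traceback) is never involved. A minimal sketch, assuming UTF-8 output is acceptable; `save_text` and the sample string are made up for illustration, and `io.open` is used because it accepts `encoding` on both Python 2 and 3:

```python
import io
import os
import tempfile


def save_text(path, text):
    """Write unicode text to path as UTF-8, regardless of the locale default."""
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(text)


# Hypothetical sample standing in for content.get_text();
# \xa0 is the non-breaking space that gbk cannot encode.
sample = u'Python\xa0is a programming language'
out_path = os.path.join(tempfile.mkdtemp(), 'python.txt')
save_text(out_path, sample)

# Read the file back with the same encoding to confirm nothing was lost.
with io.open(out_path, 'r', encoding='utf-8') as f:
    round_trip = f.read()
```

Whether to keep \xa0, replace it with a plain space, or drop it is a matter of taste; the explicit encoding only guarantees that the write itself does not fail.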

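风乎舞雩 · posted on 2018-1-16 11:50:42

When I hit this kind of error I just skip the offending item with pass, though that can drop some content. The approach in the reply above is a good one.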

淼妙 · posted on 2018-1-19 15:22:53

It still doesn't work: runfile('C:/Users/123/.spyder2/temp.py', wdir='C:/Users/123/.spyder2')
Traceback (most recent call last):

  File "<ipython-input-23-a8e3f3c420ce>", line 1, in <module>
runfile('C:/Users/123/.spyder2/temp.py', wdir='C:/Users/123/.spyder2')

  File "C:\Users\123\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 585, in runfile
execfile(filename, namespace)

  File "C:/Users/123/.spyder2/temp.py", line 16, in <module>
f.write(content.get_text().replace('\xa0', ''))

UnicodeDecodeError: 'gbk' codec can't decode byte 0xa0 in position 0: incomplete multibyte sequence

淼妙 · posted on 2018-1-19 15:26:24

I tried a different way of scraping the Python entry on Baidu Baike. Spyder shows no errors, but nothing gets written to the Excel sheet; it's blank.

Could someone take a look at the code:
import urllib2
from bs4 import BeautifulSoup
import xlwt
url='https://baike.baidu.com/item/Python/407313?fr=aladdin'
wbk=xlwt.Workbook()
sheet=wbk.add_sheet('sheet 1',cell_overwrite_ok=True)
webpage=urllib2.urlopen(url).read()
soup=BeautifulSoup(webpage,'html.parser',from_encoding='utf-8')
links=soup.find_all('class')
m=1
for link1 in links:
    n=o
    if link2.get('class')==('main-content'):
        sheet.write(m,n,link2.get_text())
        n=n+1
    m=m+1
wbk.save(r'C:\Users\123\Desktop/python2.xls')

淼妙 · posted on 2018-1-19 15:28:13

I switched to another approach, still scraping the Python entry on Baidu Baike. Spyder shows no errors, but the Excel file that should receive the data has nothing in it; it's blank.

Could someone take a look at the code:
import urllib2
from bs4 import BeautifulSoup
import xlwt
url='https://baike.baidu.com/item/Python/407313?fr=aladdin'
wbk=xlwt.Workbook()
sheet=wbk.add_sheet('sheet 1',cell_overwrite_ok=True)
webpage=urllib2.urlopen(url).read()
soup=BeautifulSoup(webpage,'html.parser',from_encoding='utf-8')
links=soup.find_all('class')
m=1
for link1 in links:
    n=o
    if link2.get('class')==('main-content'):
        sheet.write(m,n,link2.get_text())
        n=n+1
    m=m+1
wbk.save(r'C:\Users\123\Desktop/python2.xls')

剑心无痕 · posted on 2018-1-19 15:47:12

Last edited by 剑心无痕 on 2018-1-19 15:48

淼妙 posted on 2018-1-19 15:22
It still doesn't work: runfile('C:/Users/123/.spyder2/temp.py', wdir='C:/Users/123/.spyder2')
Traceback ( ...

Your scraper isn't decoding correctly: what you're getting is the byte 0xa0, not the str \xa0.
Check whether your content.get_text() is now returning a byte array rather than a str.
The previous error:
UnicodeEncodeError: 'gbk' codec can't encode character u'\xa0' in position 9: illegal multibyte sequence
This time:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa0 in position 0: incomplete multibyte sequence
Look closely at the difference in types: one is the character u'\xa0', the other is the byte 0xa0.
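
The two messages can be reproduced in isolation, which may make the difference easier to see. A small sketch, independent of the scraper (all variable names are illustrative): encoding goes from text to bytes, decoding goes from bytes to text, and gbk fails on \xa0 in both directions.

```python
# Text (unicode) -> bytes: gbk has no mapping for U+00A0, so encoding fails.
try:
    u'\xa0'.encode('gbk')
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc

# Bytes -> text: a lone 0xa0 byte is not a complete gbk sequence, so decoding fails.
try:
    b'\xa0'.decode('gbk')
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

# The same byte decodes fine as Latin-1, and the character encodes fine as UTF-8.
as_text = b'\xa0'.decode('latin-1')
as_utf8 = u'\xa0'.encode('utf-8')
```

On Python 2, one possible way to hit the second error is passing a byte-string literal such as '\xa0' to a method of a unicode string, which triggers an implicit decode with the default codec; writing the literal as u'\xa0' avoids that. This is a guess about the case in this thread, not something the traceback confirms.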

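淼妙 · posted on 2018-1-19 16:00:42

剑心无痕 posted on 2018-1-19 15:47
Your scraper isn't decoding correctly: what you're getting is the byte 0xa0, not the str \xa0.
Check whether your content.get_text() is now returning a byte ...

But that is the decoding the system shows.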

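剑心无痕 · posted on 2018-1-19 16:31:13

淼妙 posted on 2018-1-19 16:00
But that is the decoding the system shows.

Different web pages use different encodings, so you can't just decode everything as utf-8.
And even if you do use utf-8 for everything, content has to have a consistent type; it can't be bytes one moment and str the next.
if isinstance(content, bytes):
    content = content.decode('utf-8')  # at least normalize the type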
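
The check above can be wrapped in a small helper so that every value is text before it reaches replace() or write(). A sketch only: `to_text` is a hypothetical name, and UTF-8 is an assumed default that would need to match the page's actual encoding.

```python
def to_text(value, encoding='utf-8'):
    """Return value as unicode text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Both inputs end up as the same text type, so replace() never mixes bytes and text.
from_bytes = to_text(b'Python\xc2\xa0language')  # UTF-8 bytes containing a non-breaking space
from_text = to_text(u'Python\xa0language')       # already text, returned unchanged
cleaned = from_bytes.replace(u'\xa0', u' ')
```

Normalizing once, right after the data is read, tends to be simpler than checking the type at every place the value is used.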

淼妙 · posted on 2018-1-20 21:22:02

Solved. I switched to a different way of scraping:
import urllib2
from bs4 import BeautifulSoup
import xlwt

url='https://baike.baidu.com/item/Python/407313?fr=aladdin'
wbk=xlwt.Workbook()
sheet=wbk.add_sheet('sheet 1',cell_overwrite_ok=True)
webpage=urllib2.urlopen(url)
soup=BeautifulSoup(webpage,'lxml',from_encoding='utf-8')
links=soup.find_all('div',class_='main-content')
m=0
for link2 in links:
    n=0
    sheet.write(m,n,link2.get_text())
    n=n+1
    m=m+1
wbk.save(r'C:\Users\123\Desktop/python2.xls')
