Web scraper

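wx_Z9LTRnIn · posted 2021-6-5 18:02:38

Could someone please post working code that scrapes one specific paragraph of text from an article online? The page is
http://www.ruiwen.com/wenxue/zhuziqing/419754.html
It is an essay by Zhu Ziqing, and I only want the opening paragraph.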

我爱喝奶茶 · posted 2021-6-14 16:52:18

这几天心里颇不宁静。今晚在院子里坐着乘凉，忽然想起日日走过的荷塘，在这满月的光里，总该另有一番样子吧。月亮渐渐地升高了，墙外马路上孩子们的欢笑，已经听不见了；妻在屋里拍着闰儿，迷迷糊糊地哼着眠歌。我悄悄地披了大衫，带上门出去。

wanghan519 · posted 2021-6-16 09:28:19

curl "http://www.ruiwen.com/wenxue/zhuziqing/419754.html" -s | iconv -f gb18030 -t utf-8 | grep 'class="content' -A1
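For anyone who would rather stay in Python, the pipeline above (decode the GB18030 bytes, find the line containing class="content, and print it together with the line that follows) can be sketched roughly as below. This is only a sketch: grep_after and the sample bytes are hypothetical names and data made up for illustration, and the "--" separators that grep prints between non-adjacent groups are left out.

```python
def grep_after(raw, needle, after=1, encoding="gb18030"):
    """Return every line containing `needle` plus the `after` lines following it.

    Mirrors `iconv -f gb18030 -t utf-8 | grep NEEDLE -A1` on raw page bytes.
    """
    lines = raw.decode(encoding, errors="replace").splitlines()
    keep = set()
    for i, line in enumerate(lines):
        if needle in line:
            # Keep the matching line and its trailing context, as grep -A does.
            keep.update(range(i, min(i + after + 1, len(lines))))
    return [lines[i] for i in sorted(keep)]


# Made-up sample standing in for the downloaded page bytes (an assumption,
# not the real page source).
sample = (
    '<html>\n<div class="content">\n'
    '<p>这几天心里颇不宁静。</p>\n<p>second paragraph</p>\n</div>\n</html>'
).encode("gb18030")

for line in grep_after(sample, 'class="content'):
    print(line)
```

On the live page, raw would be the undecoded response body, for example response.content from requests.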

noobyxg · posted 2021-6-18 14:41:15

这几天心里颇不宁静。今晚在院子里坐着乘凉，忽然想起日日走过的荷塘，在这满月的光里，总该另有一番样子吧。月亮渐渐地升高了，墙外马路上孩子们的欢笑，已经听不见了；妻在屋里拍着闰儿，迷迷糊糊地哼着眠歌。我悄悄地披了大衫，带上门出去。

一杆钓起满天星 · posted 2021-7-4 11:15:36

import requests
from lxml import etree

url = 'http://www.ruiwen.com/wenxue/zhuziqing/419754.html'

def getcontent(url):
    # Send a browser-like User-Agent with the request.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    # Set the encoding before reading .text so the Chinese text decodes correctly.
    response.encoding = "gbk"
    html = etree.HTML(response.text)
    # Text nodes of every child element of <div class="content">.
    content = html.xpath('/html/body//div[@class="content"]/*/text()')
    return content

datalist = getcontent(url)
for line in datalist:
    # Strip the full-width spaces (U+3000) used to indent each paragraph.
    print(line.strip('\u3000'))
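The code above prints every paragraph, while the original question only asked for the opening one. A minimal sketch of picking out just the first non-empty paragraph follows, using only the standard library so it can be tried offline. It assumes the article body is a run of <p> elements inside <div class="content">, which is what the XPath above suggests but has not been checked against the live page. FirstParagraph, first_paragraph and the sample HTML are hypothetical names and data for illustration.

```python
from html.parser import HTMLParser


class FirstParagraph(HTMLParser):
    """Collect the text of the first non-empty <p> inside <div class="content">."""

    def __init__(self):
        super().__init__()
        self.in_content = False  # True once the target <div> has been opened
        self.in_p = False        # True while inside a candidate <p>
        self.done = False        # True once a non-empty paragraph is captured
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "div" and dict(attrs).get("class") == "content":
            self.in_content = True
        elif tag == "p" and self.in_content:
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self.in_p = False
            if "".join(self.parts).strip():
                self.done = True      # keep this paragraph and stop collecting
            else:
                self.parts.clear()    # empty paragraph, try the next one

    def handle_data(self, data):
        if self.in_p:
            self.parts.append(data)


def first_paragraph(html):
    parser = FirstParagraph()
    parser.feed(html)
    # str.strip() also removes the full-width U+3000 indentation.
    return "".join(parser.parts).strip()


# Made-up sample HTML (an assumption about the page layout).
sample = (
    '<html><body><div class="content"><p> </p>'
    '<p>\u3000\u3000这几天心里颇不宁静。</p>'
    '<p>沿着荷塘，是一条曲折的小煤屑路。</p></div></body></html>'
)
print(first_paragraph(sample))
```

With the live page, the argument would be response.text after response.encoding has been set as in the code above.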


