```python
# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from master.items import MasterItem


class CsdnMasterSpider(CrawlSpider):
    name = 'book_master'
    allowed_domains = ['douban.com']
    start_urls = ['https://book.douban.com/tag/']

    rules = (
        # Follow tag pages whose path segment is Chinese characters
        Rule(LinkExtractor(allow=(r'https://book\.douban\.com/tag/[\u4e00-\u9fa5]+',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        for li in response.css('#subject_list > ul > li'):
            href = li.css('div.info > h2 > a::attr(href)').extract_first()
            # Create a fresh item per link; sharing a single class-level
            # instance would make every yielded item the same object
            item = MasterItem()
            item['url'] = href
            yield item
```
This is a distributed-crawling exercise; the code above is the master spider, which just pushes all the book links into Redis.
I hooked it up to the tunnel proxy from Moguproxy (蘑菇代理), and a test confirmed the proxy is working.
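For context, a tunnel proxy is usually wired into Scrapy through a downloader middleware that sets `request.meta['proxy']` on each outgoing request. A minimal sketch, assuming a hypothetical tunnel endpoint and credentials (the endpoint, user, and secret below are placeholders, not real Moguproxy values):

```python
import base64

# Hypothetical tunnel-proxy endpoint and credentials -- replace with
# the values from your proxy provider's dashboard.
PROXY_SERVER = "http://tunnel.example-proxy.com:9740"
PROXY_USER = "your_app_key"
PROXY_PASS = "your_app_secret"


class TunnelProxyMiddleware:
    """Downloader middleware that routes every request through the tunnel."""

    def __init__(self):
        creds = f"{PROXY_USER}:{PROXY_PASS}".encode()
        self.auth = "Basic " + base64.b64encode(creds).decode()

    def process_request(self, request, spider):
        # Scrapy reads the proxy for a request from request.meta['proxy']
        request.meta['proxy'] = PROXY_SERVER
        request.headers['Proxy-Authorization'] = self.auth
```

The middleware is then enabled via `DOWNLOADER_MIDDLEWARES` in `settings.py`.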
But after a few hundred to a thousand items it still gets redirected to a CAPTCHA page,
and it throws an error as well.
If I pause and wait a minute, it can crawl a few hundred more before it happens again.
How can I make the crawl more "stable"?
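For reference, the settings usually tuned first when a site starts serving CAPTCHAs are request rate and concurrency. A sketch of the relevant `settings.py` entries; the values are illustrative and not tested against Douban:

```python
# settings.py -- illustrative values, tune for the target site

# Slow down and randomize the request rate
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Let AutoThrottle back off automatically when the site slows down
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 30

# Retry (e.g. through a different tunnel-proxy IP) on throttling responses
RETRY_HTTP_CODES = [403, 429, 503]
```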