Scrapy: Controlling Crawler Behavior (Crawling, Middleware, Start/Stop, etc.)

1. Stopping the spider from inside spider code:

Execute in the spider code: self.crawler.engine.close_spider(self, 'reason')
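
A minimal sketch of calling this from a callback; the spider name and the stop condition here are made up for illustration:

import scrapy


class StopExampleSpider(scrapy.Spider):
    # Hypothetical spider, used only to illustrate close_spider.
    name = 'stop_example'
    start_urls = ['http://example.com/']

    def parse(self, response):
        if response.status != 200:
            # Ask the engine to shut the spider down gracefully;
            # the reason string shows up in the crawl stats as finish_reason.
            self.crawler.engine.close_spider(self, 'bad_status')
            return
        yield {'url': response.url}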

2. Custom ways to start a crawl:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner

runner = CrawlerRunner(settings)
runner.crawl(spiderobj)
d = runner.join()  # deferred that fires when all crawls are done
d.addBoth(lambda _: reactor.stop())
reactor.run()  # blocks here until the crawl finishes
Or:
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings)
process.crawl(spiderobj)
process.start(stop_after_crawl=True)  # stop the reactor once the crawl finishes
Running two crawls back to back in the same process raises twisted.internet.error.ReactorNotRestartable, because a Twisted reactor cannot be started twice. Two workarounds:
1. Run each crawl in a separate process (see the sketch after this list):
Process(target=self.start_crawl, args=(spiderobj, settings)).start()
2. Re-exec the current interpreter after each crawl: os.execl(sys.executable, sys.executable, *sys.argv)
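
A minimal sketch of the first workaround; start_crawl and run_sequentially are illustrative names, not Scrapy APIs:

from multiprocessing import Process

from scrapy.crawler import CrawlerProcess


def start_crawl(spider_cls, settings):
    # Each call runs in a fresh child process, so each crawl
    # gets a fresh, startable reactor.
    process = CrawlerProcess(settings)
    process.crawl(spider_cls)
    process.start(stop_after_crawl=True)


def run_sequentially(spider_classes, settings):
    for spider_cls in spider_classes:
        p = Process(target=start_crawl, args=(spider_cls, settings))
        p.start()
        p.join()  # wait for this crawl before launching the next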

3. Custom request deduplication

Override the default dedup rule in settings.py:
DUPEFILTER_CLASS = 'www.dupefilters.WwwDupeFilter'

The custom dupe filter class:

from scrapy.utils.request import request_fingerprint
from scrapy.dupefilters import BaseDupeFilter


class WwwDupeFilter(BaseDupeFilter):
    def __init__(self):
        # Initialize visited_fd as an in-memory set (it could also live in Redis)
        self.visited_fd = set()

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        # request_fingerprint hashes the request (method, canonical URL,
        # body) into a fixed-length fingerprint, so URLs that differ only
        # in query-parameter order, e.g.
        #   http://www.baidu.com?a=1&b=2
        #   http://www.baidu.com?b=2&a=1
        # produce the same fingerprint.
        fd = request_fingerprint(request=request)
        # Return True if this fingerprint has been seen before
        if fd in self.visited_fd:
            return True
        # First time: record the fingerprint and let the request through
        self.visited_fd.add(fd)
        return False

    def open(self):  # can return a deferred
        print('dupefilter opened')

    def close(self, reason):  # can return a deferred
        print('dupefilter closed')

    def log(self, request, spider):  # log that a request has been filtered
        print('request filtered:', request)
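
The comment in __init__ notes that visited_fd could live in Redis instead of process memory, which lets the seen set survive restarts and be shared across processes. A minimal sketch of that variant, assuming a local Redis server and the redis-py package (the key name and connection details are made up):

import redis

from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint


class RedisDupeFilter(BaseDupeFilter):
    def __init__(self):
        # Connection details are illustrative; tune for your deployment.
        self.server = redis.Redis(host='localhost', port=6379)
        self.key = 'www:dupefilter'  # Redis set of seen fingerprints

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        fd = request_fingerprint(request=request)
        # SADD returns 0 when the member was already in the set,
        # i.e. the request has been seen before.
        return self.server.sadd(self.key, fd) == 0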

4. Adding handlers for spider open/close

Connect signal handlers in the spider (see https://doc.scrapy.org/en/latest/topics/signals.html?highlight=closed):

from scrapy import signals
from scrapy import Spider


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(DmozSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)  # on close
        crawler.signals.connect(spider.on_spider_opened, signal=signals.spider_opened)  # on open
        return spider

    def on_spider_opened(self, spider):
        spider.logger.info('Spider opened: %s', spider.name)

    def spider_closed(self, spider):
        spider.logger.info('Spider closed: %s', spider.name)

    def parse(self, response):
        pass
