1. Stopping the spider automatically from inside the spider:
Call this from the spider code: self.crawler.engine.close_spider(self, 'reason')
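For context, a minimal sketch of where that call usually lives: inside a callback, once some stop condition is met. The spider name, start URL, and page limit below are made up for illustration, not from the original:

import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['http://example.com']

    def parse(self, response):
        self.page_count = getattr(self, 'page_count', 0) + 1
        # Hypothetical stop condition: close the spider after 100 pages
        if self.page_count > 100:
            self.crawler.engine.close_spider(self, 'page limit reached')
            return
        yield {'url': response.url}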
2. Custom ways to launch a crawl:
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner

runner = CrawlerRunner(settings)
runner.crawl(spiderobj)
d = runner.join()                      # Deferred that fires when all crawls finish
d.addBoth(lambda _: reactor.stop())    # stop the reactor on success or failure
reactor.run()                          # blocks until the reactor stops
Or:
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings)
process.crawl(spiderobj)
process.start(stop_after_crawl=True)   # stop the reactor once the crawl finishes
Running two crawls back to back this way raises twisted.internet.error.ReactorNotRestartable, because a Twisted reactor cannot be restarted once it has stopped.
Two workarounds:
1. Run each crawl in its own process (see the sketch after this list):
Process(target=self.start_crawl, args=(spiderobj, settings)).start()
2. Re-exec the script with os.execl(sys.executable, sys.executable, *sys.argv), which replaces the current process with a fresh interpreter (and hence a fresh reactor).
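A minimal sketch of the separate-process approach. start_crawl here is a module-level helper corresponding to the self.start_crawl target above; the names are illustrative:

from multiprocessing import Process
from scrapy.crawler import CrawlerProcess


def start_crawl(spiderobj, settings):
    # Runs in a child process, so every crawl gets a fresh reactor
    process = CrawlerProcess(settings)
    process.crawl(spiderobj)
    process.start(stop_after_crawl=True)


def run_in_subprocess(spiderobj, settings):
    p = Process(target=start_crawl, args=(spiderobj, settings))
    p.start()
    p.join()   # wait for this crawl to finish before starting the next one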
3. Custom request deduplication
Override the default dedup rule in settings.py:
DUPEFILTER_CLASS = 'www.dupefilters.WwwDupeFilter'
The custom dedup filter class looks like this:
from scrapy.utils.request import request_fingerprint
from scrapy.dupefilters import BaseDupeFilter


class WwwDupeFilter(BaseDupeFilter):
    def __init__(self):
        # Initialize visited_fd as an in-memory set (it could also live in Redis)
        self.visited_fd = set()

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        # request_fingerprint hashes the request (much like an MD5 digest) after
        # normalizing it, so URLs that differ only in query-parameter order get
        # the same fingerprint, e.g.:
        #   http://www.baidu.com?k1=123&k2=456
        #   http://www.baidu.com?k2=456&k1=123
        fd = request_fingerprint(request=request)
        # Return True if this fingerprint has been seen before
        if fd in self.visited_fd:
            return True
        # Otherwise record it (returning None means "not seen")
        self.visited_fd.add(fd)

    def open(self):  # can return a deferred
        print('opened')

    def close(self, reason):  # can return a deferred
        print('closed')

    def log(self, request, spider):  # log that a request has been filtered
        print('request filtered')
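Since the comment above notes the fingerprint set could also live in Redis, here is a sketch of that variant using redis-py; the connection details and key name are assumptions, not part of the original:

import redis
from scrapy.utils.request import request_fingerprint
from scrapy.dupefilters import BaseDupeFilter


class RedisDupeFilter(BaseDupeFilter):
    def __init__(self):
        # Assumed connection details and key name; adjust for your deployment
        self.conn = redis.Redis(host='localhost', port=6379)
        self.key = 'www:dupefilter'

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        fd = request_fingerprint(request=request)
        # SADD returns 0 when the member is already in the set, i.e. seen before
        return self.conn.sadd(self.key, fd) == 0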
4. Adding a spider-shutdown handler
Connect signal handlers in the spider: https://doc.scrapy.org/en/latest/topics/signals.html?highlight=closed

from scrapy import signals
from scrapy import Spider


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(DmozSpider, cls).from_crawler(crawler, *args, **kwargs)
        # Fires when the spider finishes
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        # Fires when the spider starts
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
        return spider

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s', spider.name)

    def spider_closed(self, spider):
        spider.logger.info('Spider closed: %s', spider.name)

    def parse(self, response):
        pass