1. Stopping a spider from inside the spider code:
Call from within the spider: self.crawler.engine.close_spider(self, 'reason')
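For example, a minimal sketch of a spider that shuts itself down after a fixed number of pages (the spider name, start URL and page limit are illustrative assumptions, not from the original; raising scrapy.exceptions.CloseSpider from a callback achieves the same effect):

from scrapy import Spider

class StopDemoSpider(Spider):
    # Hypothetical spider for illustration only.
    name = 'stop_demo'
    start_urls = ['http://quotes.toscrape.com/']
    pages_left = 10  # assumed page budget

    def parse(self, response):
        self.pages_left -= 1
        if self.pages_left <= 0:
            # Ask the engine to shut this spider down; the string becomes
            # the finish reason recorded in the crawl stats.
            self.crawler.engine.close_spider(self, 'page limit reached')
            return
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)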
2. Custom ways to launch a crawl:
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner

runner = CrawlerRunner(settings)
runner.crawl(spiderobj)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()  # blocks until the crawl ends and the deferred stops the reactor
Or:
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings)
process.crawl(spiderobj)
process.start(stop_after_crawl=True)  # stop the reactor once the crawl finishes
Running two crawls back to back in the same process raises twisted.internet.error.ReactorNotRestartable, because a Twisted reactor can only be started once per process. Two workarounds:
1. Run each crawl in its own process (see the sketch after this list):
Process(target=self.start_crawl, args=(spiderobj, settings)).start()
2. Re-exec the script after each crawl: os.execl(sys.executable, sys.executable, *sys.argv)
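A minimal sketch of option 1, assuming a module-level start_crawl helper (the name mirrors the self.start_crawl above; the rest is an assumption). Each child process starts with a reactor that has never run, so CrawlerProcess can start it:

import multiprocessing
from scrapy.crawler import CrawlerProcess

def start_crawl(spidercls, settings):
    # Runs inside the child process, where the reactor is still startable.
    process = CrawlerProcess(settings)
    process.crawl(spidercls)
    process.start(stop_after_crawl=True)

def run_sequentially(spider_classes, settings):
    for spidercls in spider_classes:
        p = multiprocessing.Process(target=start_crawl, args=(spidercls, settings))
        p.start()
        p.join()  # wait for this crawl to finish before launching the next one

Option 2 instead replaces the current process image with a fresh run of the same script, which likewise gets a new, startable reactor.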
3. Custom request deduplication
Override the default dedup filter in settings.py:
DUPEFILTER_CLASS = 'www.dupefilters.WwwDupeFilter'
The custom filter class:
from scrapy.utils.request import request_fingerprint
from scrapy.dupefilters import BaseDupeFilter

class WwwDupeFilter(BaseDupeFilter):
    def __init__(self):
        # Set of fingerprints already seen; a plain in-memory set here,
        # but it could just as well live in Redis (see the sketch after this class).
        self.visited_fd = set()

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        # request_fingerprint hashes the canonicalized request (a SHA1 digest,
        # in the spirit of an MD5 of the URL), so query-string order is ignored:
        # http://www.baidu.com?su=123&456 and
        # http://www.baidu.com?su=456&123 yield the same fingerprint.
        fd = request_fingerprint(request=request)
        # Already seen: return True so the scheduler drops the request.
        if fd in self.visited_fd:
            return True
        # First time: record the fingerprint.
        self.visited_fd.add(fd)

    def open(self):  # can return a deferred
        print('opened')

    def close(self, reason):  # can return a deferred
        print('closed')

    def log(self, request, spider):  # log that a request has been filtered
        print('filtered')
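The comment in __init__ mentions keeping the fingerprints in Redis; a minimal sketch of that variant, assuming the redis-py package and a Redis server on localhost (the key name 'www:dupefilter' is an assumption):

import redis
from scrapy.utils.request import request_fingerprint
from scrapy.dupefilters import BaseDupeFilter

class RedisDupeFilter(BaseDupeFilter):
    def __init__(self):
        # Fingerprints live in a Redis set, so they survive restarts
        # and can be shared between crawler processes.
        self.server = redis.Redis(host='localhost', port=6379)
        self.key = 'www:dupefilter'

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        fd = request_fingerprint(request=request)
        # SADD returns 0 when the member was already in the set.
        return self.server.sadd(self.key, fd) == 0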
4. Adding a handler for when the spider finishes
Connect signal handlers in the spider: https://doc.scrapy.org/en/latest/topics/signals.html?highlight=closed

from scrapy import signals
from scrapy import Spider

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(DmozSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)  # on finish
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)  # on start
        return spider

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s', spider.name)

    def spider_closed(self, spider):
        spider.logger.info('Spider closed: %s', spider.name)

    def parse(self, response):
        pass