Scrapy – Launching Spiders with Dynamic Configuration

Instead of the conventional scrapy crawl command, you can launch Scrapy programmatically through the core API that Scrapy provides.

Scrapy is built on top of the Twisted asynchronous networking framework, so your script must run inside a Twisted reactor.

1. Launching a spider with CrawlerProcess

The first option is the scrapy.crawler.CrawlerProcess class. It runs your spider, starts a Twisted reactor for you, and configures logging and shutdown handlers. All Scrapy commands use this class internally.

run.py

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class MySpider(scrapy.Spider):
    # Your spider definition
    name = 'myspider'

process = CrawlerProcess(get_project_settings())
process.crawl(MySpider)
process.start()  # the script will block here until the crawling is finished

You can then execute the script directly:

python run.py
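
If you are not inside a Scrapy project, or want to override project settings at run time, you can pass CrawlerProcess a plain settings dict instead of get_project_settings(). A minimal sketch; the export file name and format are illustrative values, and the FEEDS setting assumes Scrapy 2.1 or later:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    name = 'myspider'

process = CrawlerProcess(settings={
    'FEEDS': {'items.json': {'format': 'json'}},  # illustrative export target; requires Scrapy 2.1+
})
process.crawl(MySpider)
process.start()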

2. Launching a spider with CrawlerRunner

A more powerful option is scrapy.crawler.CrawlerRunner. Unlike CrawlerProcess, it does not start the Twisted reactor for you, which gives you finer control over the crawl but means you must start and stop the reactor yourself:

from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    # Your spider definition
    name = 'myspider'

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until the crawling is finished
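
crawl() also forwards any extra positional and keyword arguments to the spider's constructor, the programmatic counterpart of scrapy crawl -a. A minimal sketch, where the category argument and the URL built from it are purely illustrative:

from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, **kwargs):
        super().__init__(**kwargs)
        # build the start URL from the argument (illustrative)
        self.start_urls = ['https://example.com/categories/%s' % category]

runner = CrawlerRunner()
d = runner.crawl(MySpider, category='electronics')  # like: scrapy crawl myspider -a category=electronics
d.addBoth(lambda _: reactor.stop())
reactor.run()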

3. Running multiple spiders in the same process

By default, every scrapy crawl command starts a new process. With the core API, however, we can run multiple spiders concurrently in a single process:

import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    name = 'myspider1'

class MySpider2(scrapy.Spider):
    # Your second spider definition
    name = 'myspider2'

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()  # Deferred that fires once all crawls have finished
d.addBoth(lambda _: reactor.stop())

reactor.run()  # the script will block here until all crawling jobs are finished
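
The snippet above runs both spiders concurrently. If they should run one after another instead, chain the Deferreds returned by crawl(); the following sketch uses the inlineCallbacks pattern shown in the Scrapy documentation:

from twisted.internet import reactor, defer
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    name = 'myspider1'

class MySpider2(scrapy.Spider):
    # Your second spider definition
    name = 'myspider2'

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # each yield waits until the previous crawl has finished
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl finishes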
