You can use the core API that Scrapy provides to start a crawl programmatically, instead of launching it with the traditional scrapy crawl command.
Scrapy is built on top of the Twisted asynchronous networking framework, so your code has to run inside the Twisted reactor.
1. Starting a spider with CrawlerProcess
The first option is the scrapy.crawler.CrawlerProcess class. It runs your spider for you: it starts a Twisted reactor on your behalf and configures logging and shutdown handlers. All Scrapy commands use this class internally.
run.py
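A minimal run.py could look like the sketch below, following the standard CrawlerProcess usage; the spider class MySpider, its start_urls, and the settings dict are illustrative placeholders rather than part of the original article:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Placeholder spider; replace with your own definition
    name = 'myspider'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Extract your data here
        yield {'url': response.url}

# CrawlerProcess accepts a settings dict and starts the reactor itself
process = CrawlerProcess({
    'LOG_FORMAT': '%(levelname)s: %(message)s',
})
process.crawl(MySpider)
process.start()  # the script will block here until the crawling is finished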
Then launch the script: python run.py
2. Starting a spider with CrawlerRunner
Another, more powerful class is scrapy.crawler.CrawlerRunner. Unlike CrawlerProcess, it does not start the Twisted reactor for you: runner.crawl() returns a Twisted Deferred, and you are responsible for starting the reactor and stopping it once crawling has finished.
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until the crawling is finished
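If the script lives inside a Scrapy project and you want your project settings (pipelines, middlewares, and so on) to apply, one common pattern, not shown in the example above, is to construct the runner from get_project_settings():

# Load settings.py from the surrounding Scrapy project
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())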
3. Running multiple spiders in the same process
By default, each invocation of the scrapy crawl command starts a new process. With the core API, however, you can run multiple spiders concurrently in a single process:
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())

reactor.run()  # the script will block here until all crawling jobs are finished
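The two spiders above run concurrently. If you want them to run one after another instead, a sketch using Twisted's inlineCallbacks (reusing the same runner, MySpider1, and MySpider2 from the example above) looks like this:

from twisted.internet import defer

@defer.inlineCallbacks
def crawl():
    # Each yield waits for the previous crawl to finish before starting the next
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until both sequential crawls have finished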