Scrapy – Launching Spiders with Dynamic Configuration

Instead of the conventional scrapy crawl command, you can launch Scrapy programmatically through the core API that Scrapy provides.

Scrapy is built on top of the Twisted asynchronous networking framework, so your script must run inside a Twisted reactor.

1. Launching a spider with CrawlerProcess

The first option is the scrapy.crawler.CrawlerProcess class. It runs your spider, starts a Twisted reactor for you, and configures logging and shutdown handlers. All Scrapy commands use this class internally.

run.py

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class MySpider(scrapy.Spider):
    # Your spider definition
    name = 'myspider'

process = CrawlerProcess(get_project_settings())
process.crawl(MySpider)
process.start()  # the script will block here until the crawling is finished

You can then execute the script directly:

python run.py
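
If you are not inside a Scrapy project, or want to override project settings at run time, you can pass CrawlerProcess a plain settings dict instead of get_project_settings(). A minimal sketch; the export file name and format are illustrative values, and the FEEDS setting assumes Scrapy 2.1 or later:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    name = 'myspider'

process = CrawlerProcess(settings={
    'FEEDS': {'items.json': {'format': 'json'}},  # illustrative export target; requires Scrapy 2.1+
})
process.crawl(MySpider)
process.start()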

2. Launching a spider with CrawlerRunner

A more powerful option is scrapy.crawler.CrawlerRunner. Unlike CrawlerProcess, it does not start the Twisted reactor for you, which gives you finer control over the crawl but means you must start and stop the reactor yourself:

from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    # Your spider definition
    name = 'myspider'

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until the crawling is finished
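
crawl() also forwards any extra positional and keyword arguments to the spider's constructor, the programmatic counterpart of scrapy crawl -a. A minimal sketch, where the category argument and the URL built from it are purely illustrative:

from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, **kwargs):
        super().__init__(**kwargs)
        # build the start URL from the argument (illustrative)
        self.start_urls = ['https://example.com/categories/%s' % category]

runner = CrawlerRunner()
d = runner.crawl(MySpider, category='electronics')  # like: scrapy crawl myspider -a category=electronics
d.addBoth(lambda _: reactor.stop())
reactor.run()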

3. Running multiple spiders in the same process

By default, every scrapy crawl command starts a new process. With the core API, however, we can run multiple spiders concurrently in a single process:

import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    name = 'myspider1'

class MySpider2(scrapy.Spider):
    # Your second spider definition
    name = 'myspider2'

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()  # Deferred that fires once all crawls have finished
d.addBoth(lambda _: reactor.stop())

reactor.run()  # the script will block here until all crawling jobs are finished
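
The snippet above runs both spiders concurrently. If they should run one after another instead, chain the Deferreds returned by crawl(); the following sketch uses the inlineCallbacks pattern shown in the Scrapy documentation:

from twisted.internet import reactor, defer
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    name = 'myspider1'

class MySpider2(scrapy.Spider):
    # Your second spider definition
    name = 'myspider2'

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # each yield waits until the previous crawl has finished
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl finishes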
