Scrapy middleware
Published: 2019-06-14


1 Downloader Middleware

Uses of downloader middleware:
1. Do the download yourself inside process_request instead of letting Scrapy's downloader do it.
2. Post-process requests before they are downloaded, for example:
    set request headers
    set cookies
    add a proxy (a short sketch of the header/cookie/proxy case follows the skeleton below)
        Scrapy's built-in proxy component:
            from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware
            from urllib.request import getproxies
class DownMiddleware1(object):
    def process_request(self, request, spider):
        """
        Called for every request that is about to be downloaded, by each downloader middleware in turn.
        :param request:
        :param spider:
        :return:
            None: continue with the remaining middlewares and download as usual
            Response object: stop calling process_request and start calling process_response
            Request object: stop the middleware chain and send the Request back to the scheduler
            raise IgnoreRequest: stop calling process_request and start calling process_exception
        """
        pass

    def process_response(self, request, response, spider):
        """
        Called on the way back, once the spider-side processing of the download is due.
        :param request:
        :param response:
        :param spider:
        :return:
            Response object: handed on to the other middlewares' process_response
            Request object: stop the middleware chain; the request is rescheduled for download
            raise IgnoreRequest: Request.errback is called
        """
        print('response1')
        return response

    def process_exception(self, request, exception, spider):
        """
        Called when the download handler or a process_request (downloader middleware) raises an exception.
        :param request:
        :param exception:
        :param spider:
        :return:
            None: hand the exception on to the remaining middlewares
            Response object: stop calling the remaining process_exception methods
            Request object: stop the middleware chain; the request is rescheduled for download
        """
        return None
Downloader middleware
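The skeleton above only stubs out the three hooks. As a minimal, hedged sketch of the "post-processing" uses listed earlier (request headers, cookies, proxy), process_request might look like the following; the class name, User-Agent string, cookie value and proxy address are illustrative placeholders, not part of the original project:

# A minimal sketch: tweak every outgoing request inside process_request.
# The User-Agent string, cookie value and proxy URL below are placeholders.
class HeaderCookieProxyMiddleware(object):
    def process_request(self, request, spider):
        # set a request header
        request.headers['User-Agent'] = 'Mozilla/5.0 (placeholder UA)'
        # set a cookie
        request.cookies['sessionid'] = 'placeholder-session-id'
        # add a proxy; the download handler reads it from request.meta
        request.meta['proxy'] = 'http://127.0.0.1:8888'
        return None  # None lets the request continue down the middleware chain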
# 1. Create proxy_handle.py alongside middlewares.py
import requests

def get_proxy():
    return requests.get("http://127.0.0.1:5010/get/").text

def delete_proxy(proxy):
    requests.get("http://127.0.0.1:5010/delete/?proxy={}".format(proxy))

# 2. middlewares.py
from Amazon.proxy_handle import get_proxy, delete_proxy

class DownMiddleware1(object):
    def process_request(self, request, spider):
        """
        Called for every request that is about to be downloaded, by each downloader middleware in turn.
        :return:
            None: continue with the remaining middlewares and download as usual
            Response object: stop calling process_request and start calling process_response
            Request object: stop the middleware chain and send the Request back to the scheduler
            raise IgnoreRequest: stop calling process_request and start calling process_exception
        """
        proxy = "http://" + get_proxy()
        request.meta['download_timeout'] = 20
        request.meta["proxy"] = proxy
        print('Adding proxy %s for %s ' % (proxy, request.url), end='')
        print('request.meta:', request.meta)

    def process_response(self, request, response, spider):
        """
        Called on the way back, once the download has finished.
        :return:
            Response object: handed on to the other middlewares' process_response
            Request object: stop the middleware chain; the request is rescheduled for download
            raise IgnoreRequest: Request.errback is called
        """
        print('Response status code:', response.status)
        return response

    def process_exception(self, request, exception, spider):
        """
        Called when the download handler or a process_request (downloader middleware) raises an exception.
        :return:
            None: hand the exception on to the remaining middlewares
            Response object: stop calling the remaining process_exception methods
            Request object: stop the middleware chain; the request is rescheduled for download
        """
        print('Proxy %s failed while fetching %s: %s' % (request.meta['proxy'], request.url, exception))
        import time
        time.sleep(5)
        delete_proxy(request.meta['proxy'].split("//")[-1])
        request.meta['proxy'] = 'http://' + get_proxy()
        return request
Configuring a proxy
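The post does not show the settings entry that turns this middleware on. Assuming the project package is called Amazon, as the import above suggests, enabling it would look roughly like this (the priority value 543 is just a typical choice):

# settings.py -- a sketch; the module path follows the Amazon project layout assumed above
DOWNLOADER_MIDDLEWARES = {
    'Amazon.middlewares.DownMiddleware1': 543,
}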

2 Spider Middleware

1. Spider middleware methods

from scrapy import signals

class SpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)  # fires when the spider starts
        return s

    def spider_opened(self, spider):
        # spider.logger.info('Spider 1, sent by egon: %s' % spider.name)
        print('Spider 1, sent by egon: %s' % spider.name)

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        print('start_requests1')
        for r in start_requests:
            yield r

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Return value: should return None or raise an exception.
        # 1. None: continue with the other middlewares' process_spider_input
        # 2. Raise an exception:
        #    once an exception is raised, no further process_spider_input is called
        #    and the errback bound to the request is triggered;
        #    the errback's return value is passed backwards through process_spider_output,
        #    and if no errback is found, process_spider_exception runs backwards instead
        print("input1")
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        print('output1')
        # Yielding several times amounts to the same thing as returning once here.
        # If you are not comfortable with generators (calling a function that contains
        # yield gives you a generator and does not run the body immediately), the
        # generator form can easily mislead you about the execution order of the middlewares.
        # for i in result:
        #     yield i
        return result

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        print('exception1')
Spider middleware

2. When the spider starts and the initial requests are produced

# Step 1: in settings.py, uncomment and fill in:
'''
SPIDER_MIDDLEWARES = {
   'Baidu.middlewares.SpiderMiddleware1': 200,
   'Baidu.middlewares.SpiderMiddleware2': 300,
   'Baidu.middlewares.SpiderMiddleware3': 400,
}
'''

# Step 2: middlewares.py
from scrapy import signals

class SpiderMiddleware1(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)  # fires when the spider starts
        return s

    def spider_opened(self, spider):
        print('Spider 1, sent by egon: %s' % spider.name)

    def process_start_requests(self, start_requests, spider):
        # Must return only requests (not items).
        print('start_requests1')
        for r in start_requests:
            yield r

class SpiderMiddleware2(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)  # fires when the spider starts
        return s

    def spider_opened(self, spider):
        print('Spider 2, sent by egon: %s' % spider.name)

    def process_start_requests(self, start_requests, spider):
        print('start_requests2')
        for r in start_requests:
            yield r

class SpiderMiddleware3(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)  # fires when the spider starts
        return s

    def spider_opened(self, spider):
        print('Spider 3, sent by egon: %s' % spider.name)

    def process_start_requests(self, start_requests, spider):
        print('start_requests3')
        for r in start_requests:
            yield r

# Step 3: analyse the output
# 1. As soon as the spider starts:
#    Spider 1, sent by egon: baidu
#    Spider 2, sent by egon: baidu
#    Spider 3, sent by egon: baidu
# 2. Then an initial request is produced and passes through spider middlewares 1, 2, 3 in order:
#    start_requests1
#    start_requests2
#    start_requests3

3. When process_spider_input returns None

# Step 1: in settings.py, uncomment and fill in:
'''
SPIDER_MIDDLEWARES = {
   'Baidu.middlewares.SpiderMiddleware1': 200,
   'Baidu.middlewares.SpiderMiddleware2': 300,
   'Baidu.middlewares.SpiderMiddleware3': 400,
}
'''

# Step 2: middlewares.py
from scrapy import signals

class SpiderMiddleware1(object):
    def process_spider_input(self, response, spider):
        print("input1")

    def process_spider_output(self, response, result, spider):
        print('output1')
        return result

    def process_spider_exception(self, response, exception, spider):
        print('exception1')

class SpiderMiddleware2(object):
    def process_spider_input(self, response, spider):
        print("input2")
        return None

    def process_spider_output(self, response, result, spider):
        print('output2')
        return result

    def process_spider_exception(self, response, exception, spider):
        print('exception2')

class SpiderMiddleware3(object):
    def process_spider_input(self, response, spider):
        print("input3")
        return None

    def process_spider_output(self, response, result, spider):
        print('output3')
        return result

    def process_spider_exception(self, response, exception, spider):
        print('exception3')

# Step 3: analyse the output
# 1. When the response comes back it passes through spider middlewares 1, 2, 3 in order:
#    input1
#    input2
#    input3
# 2. After the spider has processed it, the result passes through middlewares 3, 2, 1:
#    output3
#    output2
#    output1

4. When process_spider_input raises an exception

# Step 1: in settings.py, uncomment and fill in:
'''
SPIDER_MIDDLEWARES = {
   'Baidu.middlewares.SpiderMiddleware1': 200,
   'Baidu.middlewares.SpiderMiddleware2': 300,
   'Baidu.middlewares.SpiderMiddleware3': 400,
}
'''

# Step 2: middlewares.py
from scrapy import signals

class SpiderMiddleware1(object):
    def process_spider_input(self, response, spider):
        print("input1")

    def process_spider_output(self, response, result, spider):
        print('output1')
        return result

    def process_spider_exception(self, response, exception, spider):
        print('exception1')

class SpiderMiddleware2(object):
    def process_spider_input(self, response, spider):
        print("input2")
        raise TypeError('input2 raised an exception')

    def process_spider_output(self, response, result, spider):
        print('output2')
        return result

    def process_spider_exception(self, response, exception, spider):
        print('exception2')

class SpiderMiddleware3(object):
    def process_spider_input(self, response, spider):
        print("input3")
        return None

    def process_spider_output(self, response, result, spider):
        print('output3')
        return result

    def process_spider_exception(self, response, exception, spider):
        print('exception3')

# Output:
#   input1
#   input2
#   exception3
#   exception2
#   exception1

# Analysis:
# 1. The response passes through middleware 1's process_spider_input, which returns None,
#    so it moves on to middleware 2's process_spider_input.
# 2. Middleware 2's process_spider_input raises an exception, so the remaining
#    process_spider_input calls are skipped and the exception is handed to the errback
#    bound to that request in the spider.
# 3. No errback is found, so the response is handled neither by the spider's normal
#    callback nor by an errback; the spider does nothing with it, and
#    process_spider_exception then runs in reverse order (3, 2, 1).
# 4. A process_spider_exception that returns None passes the buck: it has not handled
#    the exception and hands it to the next process_spider_exception. If they all
#    return None, the exception is finally raised by the Engine.
#    (A sketch of a middleware that does handle the exception follows this block.)
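Point 4 above deserves a concrete illustration: a process_spider_exception that actually handles the exception returns an iterable instead of None, which stops the rest of the backwards exception chain. This is a hedged sketch, not code from the original post:

# Sketch: a spider middleware that swallows the exception instead of passing the buck.
# Returning an iterable (even an empty list) stops the remaining process_spider_exception
# calls; the iterable is then fed into the following middlewares' process_spider_output.
class SwallowExceptionMiddleware(object):
    def process_spider_exception(self, response, exception, spider):
        spider.logger.warning('swallowing %r raised while processing %s', exception, response.url)
        return []  # handled here: nothing is passed on and the exception chain stops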

5. Specifying an errback

# Step 1: spider.py
import scrapy

class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.baidu.com/']

    def start_requests(self):
        yield scrapy.Request(url='http://www.baidu.com/',
                             callback=self.parse,
                             errback=self.parse_err,
                             )

    def parse(self, response):
        pass

    def parse_err(self, res):
        # res holds the failure information; the exception has now been handled by this
        # function, so it is not raised any further, and process_spider_output runs next
        return [1, 2, 3, 4, 5]  # pull anything useful out of the failure and return it as an
                                # iterable, waiting to be consumed by process_spider_output

# Step 2: in settings.py, uncomment and fill in:
'''
SPIDER_MIDDLEWARES = {
   'Baidu.middlewares.SpiderMiddleware1': 200,
   'Baidu.middlewares.SpiderMiddleware2': 300,
   'Baidu.middlewares.SpiderMiddleware3': 400,
}
'''

# Step 3: middlewares.py
from scrapy import signals

class SpiderMiddleware1(object):
    def process_spider_input(self, response, spider):
        print("input1")

    def process_spider_output(self, response, result, spider):
        print('output1', list(result))
        return result

    def process_spider_exception(self, response, exception, spider):
        print('exception1')

class SpiderMiddleware2(object):
    def process_spider_input(self, response, spider):
        print("input2")
        raise TypeError('input2 raised an exception')

    def process_spider_output(self, response, result, spider):
        print('output2', list(result))
        return result

    def process_spider_exception(self, response, exception, spider):
        print('exception2')

class SpiderMiddleware3(object):
    def process_spider_input(self, response, spider):
        print("input3")
        return None

    def process_spider_output(self, response, result, spider):
        print('output3', list(result))
        return result

    def process_spider_exception(self, response, exception, spider):
        print('exception3')

# Step 4: analyse the output
#   input1
#   input2
#   output3 [1, 2, 3, 4, 5]  # parse_err's return value goes into the pipeline; it can only be
#                            # consumed once, and inside output3 you could wrap the failure
#                            # information in a new request
#   output2 []
#   output1 []
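In a real project the errback usually inspects the failure object it receives rather than returning dummy values. A hedged sketch of what that might look like (the HttpError / DNS / timeout handling below is illustrative, not from the original post):

# Sketch: an errback that inspects the twisted Failure handed to it.
# (This method would live inside a spider class such as BaiduSpider above.)
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TimeoutError

def parse_err(self, failure):
    if failure.check(HttpError):
        # the response with the non-2xx status is attached to the failure
        self.logger.error('HttpError on %s', failure.value.response.url)
    elif failure.check(DNSLookupError):
        self.logger.error('DNSLookupError on %s', failure.request.url)
    elif failure.check(TimeoutError):
        self.logger.error('TimeoutError on %s', failure.request.url)
    return []  # nothing useful to hand on to process_spider_output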

 

Reposted from: https://www.cnblogs.com/lujiacheng-Python/p/10162645.html
