
  • Integrating CrawlSpider with Splash
  • CrawlSpider here means Scrapy's CrawlSpider
  • Splash here means Splash itself plus the scrapy-splash plugin

Preface

  • The official documentation only gives simple usage examples, so when integrating with CrawlSpider you have to read the source code yourself.
  • Integrating at the outermost edge, without breaking the framework's structure or call flow, is the ideal approach.
  • If you also use scrapy-redis and the DUPEFILTER_CLASS settings conflict, just keep scrapy-redis's dupefilter (see the settings sketch after this list).
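
For context, here is a minimal settings.py sketch wiring in scrapy-splash, following the scrapy-splash README (the Splash URL is a placeholder); when scrapy-redis is also in play, keep its dupefilter instead of the Splash-aware one:

# settings.py -- scrapy-splash wiring, as described in the scrapy-splash README
SPLASH_URL = 'http://localhost:8050'  # placeholder: your Splash instance

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Without scrapy-redis, use the Splash-aware dupefilter:
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# With scrapy-redis, keep its dupefilter instead, e.g.:
# DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'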

Source Code Analysis

  • Below is the crawl source code. I originally only meant to paste the parts being analyzed, but the file is short, so here it is in full.
  • Source path: scrapy/spiders/crawl.py
import copy
import six

from scrapy.http import Request, HtmlResponse
from scrapy.utils.spider import iterate_spider_output
from scrapy.spiders import Spider


def identity(x):
    return x


class Rule(object):

    def __init__(self, link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=identity):
        self.link_extractor = link_extractor
        self.callback = callback
        self.cb_kwargs = cb_kwargs or {}
        self.process_links = process_links
        self.process_request = process_request
        if follow is None:
            self.follow = False if callback else True
        else:
            self.follow = follow


class CrawlSpider(Spider):

    rules = ()

    def __init__(self, *a, **kw):
        super(CrawlSpider, self).__init__(*a, **kw)
        self._compile_rules()

    def parse(self, response):
        return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)

    def parse_start_url(self, response):
        return []

    def process_results(self, response, results):
        return results

    def _build_request(self, rule, link):
        r = Request(url=link.url, callback=self._response_downloaded)
        r.meta.update(rule=rule, link_text=link.text)
        return r

    def _requests_to_follow(self, response):
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = self._build_request(n, link)
                yield rule.process_request(r)

    def _response_downloaded(self, response):
        rule = self._rules[response.meta['rule']]
        return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)

    def _parse_response(self, response, callback, cb_kwargs, follow=True):
        if callback:
            cb_res = callback(response, **cb_kwargs) or ()
            cb_res = self.process_results(response, cb_res)
            for requests_or_item in iterate_spider_output(cb_res):
                yield requests_or_item

        if follow and self._follow_links:
            for request_or_item in self._requests_to_follow(response):
                yield request_or_item

    def _compile_rules(self):
        def get_method(method):
            if callable(method):
                return method
            elif isinstance(method, six.string_types):
                return getattr(self, method, None)

        self._rules = [copy.copy(r) for r in self.rules]
        for rule in self._rules:
            rule.callback = get_method(rule.callback)
            rule.process_links = get_method(rule.process_links)
            rule.process_request = get_method(rule.process_request)

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(CrawlSpider, cls).from_crawler(crawler, *args, **kwargs)
        spider._follow_links = crawler.settings.getbool(
            'CRAWLSPIDER_FOLLOW_LINKS', True)
        return spider

    def set_crawler(self, crawler):
        super(CrawlSpider, self).set_crawler(crawler)
        self._follow_links = crawler.settings.getbool('CRAWLSPIDER_FOLLOW_LINKS', True)

I first tried overriding make_requests_from_url and _requests_to_follow, but both attempts had problems: pages were either not parsed or links were not followed.
Then I noticed that Rule has a process_request parameter and got briefly excited; I wrote my own method for it, but it still made no difference.
Finally, reading the source, I found that what the generator ultimately yields is this:

r = self._build_request(n, link)
yield rule.process_request(r)

Stepping back one level, the _build_request method it calls:

def _build_request(self, rule, link):
    r = Request(url=link.url, callback=self._response_downloaded)
    r.meta.update(rule=rule, link_text=link.text)
    return r

It uses Scrapy's native Request, passing in rule, link.url, link.text, and _response_downloaded, and returns it. The _response_downloaded method:

def _response_downloaded(self, response):
    rule = self._rules[response.meta['rule']]
    return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)

includes rule.callback, i.e., the callback parameter given when writing the Rule, which ensures that once the downloader has fetched the page the callback is invoked to parse it.

Overriding

In summary, simply supplying a method as process_request is not enough, because the native Request is still constructed in the end. So we need to override the _build_request method and swap the native Request for SplashRequest, i.e.:

def _build_request(self, rule, link):
    r = SplashRequest(url=link.url, callback=self._response_downloaded)
    r.meta.update(rule=rule, link_text=link.text)
    return r

And that's it; this is the simplest way to do it.
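
Putting it together, a minimal sketch of such a spider; the spider name, start URL, and LinkExtractor pattern are illustrative placeholders:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashRequest


class ExampleSplashCrawlSpider(CrawlSpider):
    name = 'example_splash_crawl'            # placeholder name
    start_urls = ['https://example.com/']    # placeholder URL

    rules = (
        Rule(LinkExtractor(allow=r'/list/'), callback='parse_item', follow=True),
    )

    def _build_request(self, rule, link):
        # Same shape as the parent method, but followed links now go through Splash.
        r = SplashRequest(url=link.url, callback=self._response_downloaded)
        r.meta.update(rule=rule, link_text=link.text)
        return r

    def parse_item(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}

Note that requests for start_urls still go out as plain Requests; override start_requests as well if the start pages also need Splash rendering.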

Splash: Disabling Image Downloads

scrapy-splash does not directly expose an API for disabling image downloads, but Splash itself does.

Splash HTTP API

The render.html endpoint has a parameter:

images : integer : optional

  • Whether to download images. Possible values are 1 (download images) and 0 (don't download images). Default is 1.
  • Note that cached images may be displayed even if this parameter is 0. You can also use Request Filters to strip unwanted contents based on URL.

Just pass the corresponding argument when building the SplashRequest:

SplashRequest(args={"images": 0})

As for the Note under that parameter (about Request Filters), I'll leave it for you to explore on your own.
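
Combined with the _build_request override above, the argument can be passed like this (a sketch; wait is just another optional rendering argument shown for illustration):

from scrapy_splash import SplashRequest

def _build_request(self, rule, link):
    # Ask Splash not to download images while rendering followed links.
    r = SplashRequest(url=link.url,
                      callback=self._response_downloaded,
                      args={'images': 0, 'wait': 0.5})
    r.meta.update(rule=rule, link_text=link.text)
    return r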

Splash Scripts Reference

The second approach is to write a Lua script.

splash.images_enabled

  • Enable/disable images.

That is:

script = """
        function main(splash,args)
            splash.images_enabled = false
        end
        """
SplashRequest(endpoint='execute',args={"lua_source":script})

To keep the example simple and clear, only the essential parts are filled in; other parameters such as url, and script statements such as return splash:html(), need to be added as required (see the fuller sketch below).
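
For reference, a fuller sketch with those pieces filled in, as it might appear inside the _build_request override: the url is passed through the request's args, and the script returns the rendered HTML.

script = """
function main(splash, args)
    splash.images_enabled = false
    assert(splash:go(args.url))
    assert(splash:wait(0.5))
    return splash:html()
end
"""

SplashRequest(url=link.url, endpoint='execute',
              callback=self._response_downloaded,
              args={'lua_source': script})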

Splash: Setting the User-Agent

Likewise, scrapy-splash does not directly expose this, but there are plenty of ways to set the User-Agent.
splash:set_user_agent, splash:set_custom_headers, and splash:go can all set it from within a Lua script.
The examples below are taken straight from the official documentation, lightly reorganized.

splash:set_user_agent

  • Overwrite the User-Agent header for all further requests.
  • Signature: splash:set_user_agent(value)
  • Parameters:
    • value - string, a value of User-Agent HTTP header.
  • Returns: nil.
  • Async: no.

splash:set_user_agent("reki")

splash:set_custom_headers

  • Set custom HTTP headers to send with each request.
  • Signature: splash:set_custom_headers(headers)
  • Parameters:
    • headers - a Lua table with HTTP headers.
  • Returns: nil.
  • Async: no.
  • Headers are merged with WebKit default headers, overwriting WebKit values in case of conflicts.

splash:set_custom_headers({
    ["User-Agent"] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:61.0) Gecko/20100101 Firefox/61.0",
})

splash:go

  • Go to an URL. This is similar to entering an URL in a browser address bar, pressing Enter and waiting until the page loads.
  • Signature: ok, reason = splash:go{url, baseurl=nil, headers=nil, http_method="GET", body=nil, formdata=nil}
  • Parameters:
    • headers - a Lua table with HTTP headers to add/replace in the initial request.
  • Returns: ok, reason pair. If ok is nil then an error happened during page load; reason provides information about the error type.
  • Async: yes, unless the navigation is locked.
  • The headers argument allows adding or replacing default HTTP headers for the initial request. To set custom headers for all further requests (including requests to related resources) use splash:set_custom_headers or splash:on_request.
  • The User-Agent header is special: once used, it is kept for further requests. This is an implementation detail and it could change in future releases; to set the User-Agent header it is recommended to use the splash:set_user_agent method.

splash:go{"https://reki.me", headers={
    ["User-Agent"] = "Yohane",
}}
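
Tying this back to scrapy-splash, here is a sketch of setting the User-Agent from a SplashRequest by running splash:set_user_agent inside an execute script; the UA string and the extra ua argument are placeholders, and the call is shown as it might appear inside the _build_request override:

ua_script = """
function main(splash, args)
    splash:set_user_agent(args.ua)
    assert(splash:go(args.url))
    return splash:html()
end
"""

SplashRequest(url=link.url, endpoint='execute',
              callback=self._response_downloaded,
              args={'lua_source': ua_script,
                    'ua': 'Mozilla/5.0 (placeholder UA)'})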

References