
  • Integrating CrawlSpider with Splash
  • CrawlSpider here means Scrapy's CrawlSpider
  • Splash here means Splash itself plus the scrapy-splash plugin

Preface

  • The official documentation only gives simple usage examples, so when integrating with CrawlSpider you have to read the source code yourself.
  • Integrating at the outermost edge, without breaking the framework's structure or call flow, is the ideal approach.
  • If you also use scrapy-redis and the DUPEFILTER_CLASS settings conflict, just keep scrapy-redis's dupefilter (see the settings sketch after this list).
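
For context, here is a minimal settings.py sketch wiring in scrapy-splash, following the scrapy-splash README (the Splash URL is a placeholder); when scrapy-redis is also in play, keep its dupefilter instead of the Splash-aware one:

# settings.py -- scrapy-splash wiring, as described in the scrapy-splash README
SPLASH_URL = 'http://localhost:8050'  # placeholder: your Splash instance

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Without scrapy-redis, use the Splash-aware dupefilter:
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# With scrapy-redis, keep its dupefilter instead, e.g.:
# DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'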

Source Code Analysis

  • Below is the crawl source code. I originally only meant to paste the parts being analyzed, but the file is short, so here it is in full.
  • Source path: scrapy/spiders/crawl.py
import copy
import six

from scrapy.http import Request, HtmlResponse
from scrapy.utils.spider import iterate_spider_output
from scrapy.spiders import Spider


def identity(x):
    return x


class Rule(object):

    def __init__(self, link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=identity):
        self.link_extractor = link_extractor
        self.callback = callback
        self.cb_kwargs = cb_kwargs or {}
        self.process_links = process_links
        self.process_request = process_request
        if follow is None:
            self.follow = False if callback else True
        else:
            self.follow = follow


class CrawlSpider(Spider):

    rules = ()

    def __init__(self, *a, **kw):
        super(CrawlSpider, self).__init__(*a, **kw)
        self._compile_rules()

    def parse(self, response):
        return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)

    def parse_start_url(self, response):
        return []

    def process_results(self, response, results):
        return results

    def _build_request(self, rule, link):
        r = Request(url=link.url, callback=self._response_downloaded)
        r.meta.update(rule=rule, link_text=link.text)
        return r

    def _requests_to_follow(self, response):
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = self._build_request(n, link)
                yield rule.process_request(r)

    def _response_downloaded(self, response):
        rule = self._rules[response.meta['rule']]
        return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)

    def _parse_response(self, response, callback, cb_kwargs, follow=True):
        if callback:
            cb_res = callback(response, **cb_kwargs) or ()
            cb_res = self.process_results(response, cb_res)
            for requests_or_item in iterate_spider_output(cb_res):
                yield requests_or_item

        if follow and self._follow_links:
            for request_or_item in self._requests_to_follow(response):
                yield request_or_item

    def _compile_rules(self):
        def get_method(method):
            if callable(method):
                return method
            elif isinstance(method, six.string_types):
                return getattr(self, method, None)

        self._rules = [copy.copy(r) for r in self.rules]
        for rule in self._rules:
            rule.callback = get_method(rule.callback)
            rule.process_links = get_method(rule.process_links)
            rule.process_request = get_method(rule.process_request)

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(CrawlSpider, cls).from_crawler(crawler, *args, **kwargs)
        spider._follow_links = crawler.settings.getbool(
            'CRAWLSPIDER_FOLLOW_LINKS', True)
        return spider

    def set_crawler(self, crawler):
        super(CrawlSpider, self).set_crawler(crawler)
        self._follow_links = crawler.settings.getbool('CRAWLSPIDER_FOLLOW_LINKS', True)

I first tried overriding make_requests_from_url and _requests_to_follow, but both attempts had problems: pages were either not parsed or links were not followed.
Then I noticed that Rule has a process_request parameter and got briefly excited; I wrote my own method for it, but it still made no difference.
Finally, reading the source, I found that what the generator ultimately yields is this:

r = self._build_request(n, link)
yield rule.process_request(r)

Stepping back one level, the _build_request method it calls:

def _build_request(self, rule, link):
    r = Request(url=link.url, callback=self._response_downloaded)
    r.meta.update(rule=rule, link_text=link.text)
    return r

It uses Scrapy's native Request, passing in rule, link.url, link.text, and _response_downloaded, and returns it. The _response_downloaded method:

def _response_downloaded(self, response):
    rule = self._rules[response.meta['rule']]
    return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)

includes rule.callback, i.e., the callback parameter given when writing the Rule, which ensures that once the downloader has fetched the page the callback is invoked to parse it.

Overriding

In summary, simply supplying a method as process_request is not enough, because the native Request is still constructed in the end. So we need to override the _build_request method and swap the native Request for SplashRequest, i.e.:

def _build_request(self, rule, link):
    r = SplashRequest(url=link.url, callback=self._response_downloaded)
    r.meta.update(rule=rule, link_text=link.text)
    return r

And that's it; this is the simplest way to do it.
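
Putting it together, a minimal sketch of such a spider; the spider name, start URL, and LinkExtractor pattern are illustrative placeholders:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashRequest


class ExampleSplashCrawlSpider(CrawlSpider):
    name = 'example_splash_crawl'            # placeholder name
    start_urls = ['https://example.com/']    # placeholder URL

    rules = (
        Rule(LinkExtractor(allow=r'/list/'), callback='parse_item', follow=True),
    )

    def _build_request(self, rule, link):
        # Same shape as the parent method, but followed links now go through Splash.
        r = SplashRequest(url=link.url, callback=self._response_downloaded)
        r.meta.update(rule=rule, link_text=link.text)
        return r

    def parse_item(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}

Note that requests for start_urls still go out as plain Requests; override start_requests as well if the start pages also need Splash rendering.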

Splash: Disabling Image Downloads

scrapy-splash does not directly expose an API for disabling image downloads, but Splash itself does.

Splash HTTP API

The render.html endpoint has a parameter:

images : integer : optional

  • Whether to download images. Possible values are 1 (download images) and 0 (don't download images). Default is 1.
  • Note that cached images may be displayed even if this parameter is 0. You can also use Request Filters to strip unwanted contents based on URL.

Just pass the corresponding argument when building the SplashRequest:

SplashRequest(args={"images": 0})

As for the Note under that parameter (about Request Filters), I'll leave it for you to explore on your own.
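
Combined with the _build_request override above, the argument can be passed like this (a sketch; wait is just another optional rendering argument shown for illustration):

from scrapy_splash import SplashRequest

def _build_request(self, rule, link):
    # Ask Splash not to download images while rendering followed links.
    r = SplashRequest(url=link.url,
                      callback=self._response_downloaded,
                      args={'images': 0, 'wait': 0.5})
    r.meta.update(rule=rule, link_text=link.text)
    return r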

Splash Scripts Reference

The second approach is to write a Lua script.

splash.images_enabled

  • Enable/disable images.

That is:

script = """
        function main(splash,args)
            splash.images_enabled = false
        end
        """
SplashRequest(endpoint='execute',args={"lua_source":script})

To keep the example simple and clear, only the essential parts are filled in; other parameters such as url, and script statements such as return splash:html(), need to be added as required (see the fuller sketch below).
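
For reference, a fuller sketch with those pieces filled in, as it might appear inside the _build_request override: the url is passed through the request's args, and the script returns the rendered HTML.

script = """
function main(splash, args)
    splash.images_enabled = false
    assert(splash:go(args.url))
    assert(splash:wait(0.5))
    return splash:html()
end
"""

SplashRequest(url=link.url, endpoint='execute',
              callback=self._response_downloaded,
              args={'lua_source': script})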

Splash: Setting the User-Agent

Likewise, scrapy-splash does not directly expose this, but there are plenty of ways to set the User-Agent.
splash:set_user_agent, splash:set_custom_headers, and splash:go can all set it from within a Lua script.
The examples below are taken straight from the official documentation, lightly reorganized.

splash:set_user_agent

  • Overwrite the User-Agent header for all further requests.
  • Signature: splash:set_user_agent(value)
  • Parameters:
    • value - string, a value of User-Agent HTTP header.
  • Returns: nil.
  • Async: no.

splash:set_user_agent("reki")

splash:set_custom_headers

  • Set custom HTTP headers to send with each request.
  • Signature: splash:set_custom_headers(headers)
  • Parameters:
    • headers - a Lua table with HTTP headers.
  • Returns: nil.
  • Async: no.
  • Headers are merged with WebKit default headers, overwriting WebKit values in case of conflicts.

splash:set_custom_headers({
    ["User-Agent"] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:61.0) Gecko/20100101 Firefox/61.0",
})

splash:go

  • Go to an URL. This is similar to entering an URL in a browser address bar, pressing Enter and waiting until the page loads.
  • Signature: ok, reason = splash:go{url, baseurl=nil, headers=nil, http_method="GET", body=nil, formdata=nil}
  • Parameters:
    • headers - a Lua table with HTTP headers to add/replace in the initial request.
  • Returns: ok, reason pair. If ok is nil then an error happened during page load; reason provides information about the error type.
  • Async: yes, unless the navigation is locked.
  • The headers argument allows adding or replacing default HTTP headers for the initial request. To set custom headers for all further requests (including requests to related resources) use splash:set_custom_headers or splash:on_request.
  • The User-Agent header is special: once used, it is kept for further requests. This is an implementation detail and it could change in future releases; to set the User-Agent header it is recommended to use the splash:set_user_agent method.

splash:go{"https://reki.me", headers={
    ["User-Agent"] = "Yohane",
}}
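
Tying this back to scrapy-splash, here is a sketch of setting the User-Agent from a SplashRequest by running splash:set_user_agent inside an execute script; the UA string and the extra ua argument are placeholders, and the call is shown as it might appear inside the _build_request override:

ua_script = """
function main(splash, args)
    splash:set_user_agent(args.ua)
    assert(splash:go(args.url))
    return splash:html()
end
"""

SplashRequest(url=link.url, endpoint='execute',
              callback=self._response_downloaded,
              args={'lua_source': ua_script,
                    'ua': 'Mozilla/5.0 (placeholder UA)'})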

References