- Integrating CrawlSpider with Splash
- "CrawlSpider" here means Scrapy's CrawlSpider
- "Splash" here means Splash itself together with the scrapy-splash plugin
Preface
- The official docs only give simple usage examples, so when integrating with CrawlSpider you have to read the source yourself.
- Integrating at the outermost edge, without breaking the framework's structure or call flow, is the best approach.
- If you use scrapy-redis and the DUPEFILTER_CLASS settings conflict, just use scrapy-redis's value. A settings sketch follows below.
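As an aside, here is a sketch of how the pieces are usually wired up in settings.py. The middleware entries and priorities follow the scrapy-splash README, SPLASH_URL assumes a locally running Splash instance, and the last line reflects the scrapy-redis suggestion above:

# settings.py sketch (middleware entries follow the scrapy-splash README)
SPLASH_URL = 'http://localhost:8050'  # assumes a local Splash instance

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# scrapy-splash suggests its own SplashAwareDupeFilter here; when
# scrapy-redis is also in play, keep scrapy-redis's filter instead:
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'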
Source Code Analysis
- Below is the crawl.py source. I originally planned to paste only the parts being analyzed, but the code turned out to be short, so I pasted all of it.
- Source path: scrapy/spiders/crawl.py
import copy
import six

from scrapy.http import Request, HtmlResponse
from scrapy.utils.spider import iterate_spider_output
from scrapy.spiders import Spider


def identity(x):
    return x


class Rule(object):

    def __init__(self, link_extractor, callback=None, cb_kwargs=None, follow=None,
                 process_links=None, process_request=identity):
        self.link_extractor = link_extractor
        self.callback = callback
        self.cb_kwargs = cb_kwargs or {}
        self.process_links = process_links
        self.process_request = process_request
        if follow is None:
            self.follow = False if callback else True
        else:
            self.follow = follow


class CrawlSpider(Spider):

    rules = ()

    def __init__(self, *a, **kw):
        super(CrawlSpider, self).__init__(*a, **kw)
        self._compile_rules()

    def parse(self, response):
        return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)

    def parse_start_url(self, response):
        return []

    def process_results(self, response, results):
        return results

    def _build_request(self, rule, link):
        r = Request(url=link.url, callback=self._response_downloaded)
        r.meta.update(rule=rule, link_text=link.text)
        return r

    def _requests_to_follow(self, response):
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = self._build_request(n, link)
                yield rule.process_request(r)

    def _response_downloaded(self, response):
        rule = self._rules[response.meta['rule']]
        return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)

    def _parse_response(self, response, callback, cb_kwargs, follow=True):
        if callback:
            cb_res = callback(response, **cb_kwargs) or ()
            cb_res = self.process_results(response, cb_res)
            for requests_or_item in iterate_spider_output(cb_res):
                yield requests_or_item

        if follow and self._follow_links:
            for request_or_item in self._requests_to_follow(response):
                yield request_or_item

    def _compile_rules(self):
        def get_method(method):
            if callable(method):
                return method
            elif isinstance(method, six.string_types):
                return getattr(self, method, None)

        self._rules = [copy.copy(r) for r in self.rules]
        for rule in self._rules:
            rule.callback = get_method(rule.callback)
            rule.process_links = get_method(rule.process_links)
            rule.process_request = get_method(rule.process_request)

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(CrawlSpider, cls).from_crawler(crawler, *args, **kwargs)
        spider._follow_links = crawler.settings.getbool(
            'CRAWLSPIDER_FOLLOW_LINKS', True)
        return spider

    def set_crawler(self, crawler):
        super(CrawlSpider, self).set_crawler(crawler)
        self._follow_links = crawler.settings.getbool('CRAWLSPIDER_FOLLOW_LINKS', True)
I tried overriding make_requests_from_url and _requests_to_follow, and both turned out to be problematic: pages were not parsed, or links were not followed. Then I noticed that Rule has a process_request parameter, got excited for a moment, and wrote my own method for it, only to find it still made no difference. Finally, reading the source again, I saw that what the generator ultimately yields is this:
r = self._build_request(n, link)
yield rule.process_request(r)
Stepping back one level, the _build_request method it calls:
def _build_request(self, rule, link):
    r = Request(url=link.url, callback=self._response_downloaded)
    r.meta.update(rule=rule, link_text=link.text)
    return r
It uses Scrapy's native Request, built from the link's url with _response_downloaded as the callback, and stashes the rule (its index n, to be precise) and the link's text in meta before returning it. The _response_downloaded method:
def _response_downloaded(self, response):
    rule = self._rules[response.meta['rule']]
    return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)
It looks up the Rule by the index stored in meta and passes rule.callback, i.e. the callback argument you supplied when writing the Rule, to _parse_response; this is what guarantees that parsing is invoked after the downloader fetches the page.
The Override
To sum up: merely writing a method for process_request is not enough, because the request it receives is still the native Request built earlier. So we need to override the _build_request method and swap the native Request for SplashRequest, i.e.:
from scrapy_splash import SplashRequest

def _build_request(self, rule, link):
    r = SplashRequest(url=link.url, callback=self._response_downloaded)
    r.meta.update(rule=rule, link_text=link.text)
    return r
And that's it; this is the simplest possible integration.
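For context, a minimal sketch of a complete spider with the override in place; the spider name, start URL, and rule pattern are illustrative placeholders rather than anything from the original post:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashRequest


class ExampleSpider(CrawlSpider):
    name = 'example'                       # placeholder name
    start_urls = ['https://example.com/']  # placeholder URL

    rules = (
        Rule(LinkExtractor(allow=r'/article/'), callback='parse_item', follow=True),
    )

    def _build_request(self, rule, link):
        # Identical to the parent implementation, except that SplashRequest
        # replaces the native Request, so followed links render through Splash.
        r = SplashRequest(url=link.url, callback=self._response_downloaded)
        r.meta.update(rule=rule, link_text=link.text)
        return r

    def parse_item(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}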
Splash: Disabling Image Downloads
scrapy-splash does not directly expose an API for disabling image downloads, but Splash itself does.
Splash HTTP API
The render.html endpoint has an images parameter:
images : integer : optional
- Whether to download images. Possible values are 1 (download images) and 0 (don't download images). Default is 1.
- Note that cached images may be displayed even if this parameter is 0. You can also use Request Filters to strip unwanted contents based on URL.
When using SplashRequest, simply pass the parameter via args:
SplashRequest(url, args={"images": 0})
There is also a Note under that parameter (about cached images and Request Filters); I'll leave that for you to dig into.
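Combined with the _build_request override from earlier, a sketch of what passing the argument looks like when building each followed request:

def _build_request(self, rule, link):
    # Same override as before, now also disabling image downloads
    # via the render.html 'images' argument.
    r = SplashRequest(
        url=link.url,
        callback=self._response_downloaded,
        args={'images': 0},
    )
    r.meta.update(rule=rule, link_text=link.text)
    return r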
Splash Scripts Reference
The second method is to write a Lua script, using splash.images_enabled:
splash.images_enabled
- Enable/disable images.
That is:
script = """
function main(splash, args)
    splash.images_enabled = false
    -- setting the flag alone returns nothing from the execute endpoint;
    -- the script must also navigate and return the rendered HTML:
    assert(splash:go(args.url))
    return splash:html()
end
"""
SplashRequest(url, endpoint='execute', args={"lua_source": script})
Splash: Setting the User-Agent
Likewise, scrapy-splash does not expose this directly, but there are several ways to set the User-Agent: splash:set_user_agent, splash:set_custom_headers, and splash:go can all set it from within a Lua script.
The examples below are lifted from the official docs, lightly tidied up.
splash:set_user_agent
- Overwrite the User-Agent header for all further requests.
- Signature: splash:set_user_agent(value)
Parameters:
- value - string, a value of User-Agent HTTP header.
- Returns: nil.
- Async: no.
splash:set_user_agent("reki")
splash:set_custom_headers
- Set custom HTTP headers to send with each request.
- Signature: splash:set_custom_headers(headers)
Parameters:
- headers - a Lua table with HTTP headers.
- Returns: nil.
- Async: no.
- Headers are merged with WebKit default headers, overwriting WebKit values in case of conflicts.
splash:set_custom_headers({
    ["User-Agent"] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:61.0) Gecko/20100101 Firefox/61.0",
})
splash:go
- Go to an URL. This is similar to entering an URL in a browser address bar, pressing Enter and waiting until page loads.
- Signature: ok, reason = splash:go{url, baseurl=nil, headers=nil, http_method="GET", body=nil, formdata=nil}
Parameters:
- headers - a Lua table with HTTP headers to add/replace in the initial request.
- Returns: ok, reason pair. If ok is nil then an error happened during page load; reason provides information about the error type.
- Async: yes, unless the navigation is locked.
- headers argument allows to add or replace default HTTP headers for the initial request. To set custom headers for all further requests (including requests to related resources) use splash:set_custom_headers or splash:on_request.
- User-Agent header is special: once used, it is kept for further requests. This is an implementation detail and it could change in future releases; to set User-Agent header it is recommended to use splash:set_user_agent method.
splash:go{"https://reki.me", headers={
    ["User-Agent"] = "Yohane",
}}
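Tying this back to scrapy-splash, a sketch of a spider that sets the User-Agent through a Lua script; the spider name, URL, and UA string are placeholders, and scrapy-splash hands the request URL to the script as args.url:

from scrapy import Spider
from scrapy_splash import SplashRequest

UA_SCRIPT = """
function main(splash, args)
    splash:set_user_agent("Mozilla/5.0 (placeholder UA)")
    assert(splash:go(args.url))
    return splash:html()
end
"""


class UASpider(Spider):
    name = 'ua_example'  # placeholder name

    def start_requests(self):
        # endpoint='execute' runs the Lua script on the Splash side
        yield SplashRequest(
            'https://example.com/',  # placeholder URL
            endpoint='execute',
            args={'lua_source': UA_SCRIPT},
        )

    def parse(self, response):
        yield {'title': response.css('title::text').get()}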
Copyright: 羽子
Permalink: https://reki.me/studying/crawlspider-x-splash.html
This article is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.
You are free to reproduce and adapt it, but you must credit the source, must not use it commercially, and must share derivatives under the same license.