sspider
sspider Abstract Structure
This module is the abstract architecture of the crawler as a whole. You can extend its functionality by inheriting the classes in this module directly.
class sspider.spider.AbstractRequestManager
Bases: object

Request manager. Manages all requests.

add_new_request(request)
Add a Request object to the requestManager so that it can be managed.
@param :: request : the Request object
return : None
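For example, seeding a manager with an initial request (a minimal sketch using the concrete sspider.commons implementations described further below, since the abstract base only defines the interface):
>>> from sspider.commons import Request, RequestManager
>>> manager = RequestManager()
>>> manager.add_new_request(Request('get', 'http://www.baidu.com'))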
class sspider.spider.AbstractSpider(downloader=<sspider.spider.AbstractDownloader object>, parser=<sspider.spider.AbstractParser object>, requestManager=<sspider.spider.AbstractRequestManager object>, writter=<sspider.spider.AbstractWritter object>, logger=<sspider.spider.AbstractLogger object>)
Bases: object

Crawl scheduler.
@member :: downloader : downloader
@member :: parser : parser
@member :: requestManager : request manager
@member :: writter : text writer
@member :: logger : logger

attrs = ['downloader', 'parser', 'requestManager', 'writter', 'logger']

crawl(request)
The unit of work that fetches a request and parses the result. Subclasses may override this method to run it across multiple threads or processes, or to download and parse asynchronously. The downloader downloads the given request, the parser parses the downloaded document, the parsed requests are pushed into the requestManager for deeper crawling, and the parsed data is handed to the writter, which stores it to disk.
@param :: request : the request
return : None
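For instance, a subclass might hand each crawl unit to a thread pool so that several requests are fetched and parsed concurrently (a minimal sketch; ThreadedSpider and the use of concurrent.futures are illustrative assumptions, not part of the library):
>>> from concurrent.futures import ThreadPoolExecutor
>>> from sspider.spider import AbstractSpider
>>> class ThreadedSpider(AbstractSpider):   # hypothetical subclass
>>>     def __init__(self, *args, max_workers=8, **kwargs):
>>>         super().__init__(*args, **kwargs)
>>>         self._pool = ThreadPoolExecutor(max_workers=max_workers)
>>>     def crawl(self, request):
>>>         # submit the download-and-parse unit to the pool instead of running it inline
>>>         self._pool.submit(super().crawl, request)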
sspider Common Implementations
The commons module contains general-purpose implementations of the interfaces in the spider module. You can extend the components under commons directly, or inherit and implement the interfaces under spider yourself.
class sspider.commons.CommonWritter
Bases: sspider.spider.AbstractWritter

Data writer class. Writes data to disk in a specific format.

flush_buffer(**kwargs)
headers
items
remove_buffer(**kwargs)
write_buffer(**kwargs)
class sspider.commons.HtmlDownloader(timeout=3)
Bases: sspider.spider.AbstractDownloader

Downloader. Downloads the requests passed to it.
@member :: timeout : download timeout

Around the actual download you often need some middleware-style extensions. In this framework you can use the decorator pattern to extend the downloader, or inherit and implement the methods of the AbstractDownloader interface yourself.
For example, adding a proxy through a decorator:
>>> # We need a proxy pool (it can be made a singleton class), similar to the requestManager.
>>> # Assume our proxy pool looks like this:
>>> from functools import wraps
>>> class ProxyPool(object):
>>>     def get_new_proxy(self):
>>>         '''
>>>         Fetch a proxy: returns an available proxy that has not been used yet.
>>>         return : proxy
>>>         '''
>>>         pass
>>> def proxyWrapper():
>>>     '''
>>>     Decorator that dynamically attaches a proxy to the request.
>>>     '''
>>>     def decorate(func):
>>>         @wraps(func)
>>>         def wrapper(self, request):
>>>             # Request.proxies is a dict such as {'http': '10.10.154.23:10002'}
>>>             request.proxies = ProxyPool().get_new_proxy()
>>>             return func(self, request)
>>>         return wrapper
>>>     return decorate
Once the proxy decorator is in place, it can be applied directly to the download method:
>>> @typeassert(request=Request)
>>> @proxyWrapper()
>>> def download(self, request):
>>>     # sessions here is requests.sessions
>>>     with sessions.Session() as session:
>>>         return Response(request, cls=session.request(**request.TransRequestParam()))
In the same way, the download unit can be moved into its own threads or processes so that downloading runs asynchronously alongside parsing, improving crawl efficiency. Cookies, request headers, and similar settings can be injected just like the proxy decorator above.
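A minimal sketch of such a header decorator (headerWrapper is an illustrative name, not part of the library; it assumes request.headers is a dict or None, as in the Request signature below). It would be applied to download in the same way as @proxyWrapper():
>>> from functools import wraps
>>> def headerWrapper(headers):
>>>     '''Decorator factory that merges default headers into every request (illustrative).'''
>>>     def decorate(func):
>>>         @wraps(func)
>>>         def wrapper(self, request):
>>>             # merge the defaults with any headers already set on the request
>>>             request.headers = {**(request.headers or {}), **headers}
>>>             return func(self, request)
>>>         return wrapper
>>>     return decorate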
class sspider.commons.HtmlParser
Bases: sspider.spider.AbstractParser

Parser. Parses the text passed to it.
This part is hard to standardize when crawling: every site has its own peculiarities, so the parser is usually something the user has to rewrite for each site.
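A hedged sketch of what a site-specific parser might look like; the parse method name and the (new_requests, data) return shape are assumptions here, so match them to the real AbstractParser interface:
>>> from sspider.commons import HtmlParser, Request
>>> class NewsLinkParser(HtmlParser):   # hypothetical site-specific parser
>>>     def parse(self, response):      # method name assumed; check AbstractParser
>>>         doc = response.html()
>>>         # follow every link found on the page (relative hrefs would need joining with response.url)
>>>         new_requests = [Request('get', href) for href in doc.xpath('//a/@href')]
>>>         # collect the page's span texts as the extracted data
>>>         data = doc.xpath('//span/text()')
>>>         return new_requests, data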
class sspider.commons.Request(method, url, params=None, data=None, headers=None, cookies=None, files=None, auth=None, timeout=None, allow_redirects=True, proxies=None, hooks=None, stream=None, verify=None, cert=None, json=None, level=1)
Bases: object

Request object.
@member :: method : request method; one of GET, POST, PUT, DELETE, OPTIONS
@member :: url : request url
@member :: params : request query parameters
@member :: data : request body data
@member :: headers : request headers
@member :: cookies : cookies
@member :: files : files
@member :: auth : auth
@member :: timeout : timeout
@member :: allow_redirects : allow_redirects
@member :: proxies : proxies, a dict such as {'http': '10.10.154.23:10002', 'https': '10.10.154.23:10004'}
@member :: hooks : hooks
@member :: stream : stream
@member :: verify : verify
@member :: cert : cert
@member :: json : json
@member :: level : level
Usually, when starting a crawl, we need to initialise one Request object or a group of Request objects.
When initialising a Request object, its request method and request url are mandatory; everything else is optional:
>>> request = Request('get', 'http://www.baidu.com')
A Request object with request headers:
>>> headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
>>> request = Request('get', 'http://www.baidu.com', headers=headers)
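The other constructor arguments can be combined in the same way; for example, a POST request with body data routed through a proxy (a sketch with placeholder values; the URL and credentials are not from the original docs):
>>> proxies = {'http': '10.10.154.23:10002', 'https': '10.10.154.23:10004'}
>>> data = {'user': 'alice', 'password': 'secret'}
>>> request = Request('post', 'http://www.example.com/login', data=data, proxies=proxies, timeout=5)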
class sspider.commons.RequestManager(limit_level=1)
Bases: sspider.spider.AbstractRequestManager

Request manager. Manages all requests.

get_new_request(level=None)
Get a request that has not been requested yet.
@param :: level : take the request from the given level; defaults to None
return : request

has_new_request(level=None)
Check whether there are still requests waiting to be crawled. When a level is given, only that level is checked; by default all levels are checked.
@param :: level : the level to check; defaults to None
return : Bool

level
Get the current level.
return : int
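Taken together, a typical consumption loop pulls pending requests until the manager is empty (a sketch; it assumes `manager` has already been seeded via add_new_request):
>>> while manager.has_new_request():
>>>     request = manager.get_new_request()
>>>     # hand each pending request to the downloader / spider here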
class sspider.commons.Response(request, cls=None, **kwargs)
Bases: requests.models.Response

Response object.
@member :: _content : binary response data
@member :: _content_consumed : _content_consumed
@member :: _next : _next
@member :: status_code : response status code
@member :: headers : response headers
@member :: raw : raw
@member :: url : request url
@member :: encoding : response encoding
@member :: history : response history
@member :: reason : reason
@member :: cookies : response cookies
@member :: elapsed : elapsed
@member :: request : request
@member :: level : the level of the corresponding request
html(encoding=None, **kwargs)
Parse the response into an HtmlElement object; data can then be extracted with CSS selectors or XPath expressions.
For example:
>>> doc = response.html()
>>> # get the href of each a element via XPath
>>> links = doc.xpath('//a/@href')
>>> # get the text of each span element via XPath
>>> spans = doc.xpath('//span/text()')
>>> # for more, look up CSS selector and XPath syntax
Commonly used methods: find, findall, findtext, get, getchildren, getiterator, getnext, getparent, getprevious, getroottree, index, insert, items, iter, iterancestors, iterchildren, iterdescendants, iterfind, itersiblings, itertext, keys, makeelement, remove, replace, values, xpath
.drop_tree(): Drops the element and all its children. Unlike el.getparent().remove(el) this does not remove the tail text; with drop_tree the tail text is merged with the previous element.
.drop_tag(): Drops the tag, but keeps its children and text.
.find_class(class_name): Returns a list of all the elements with the given CSS class name. Note that class names are space separated in HTML, so doc.find_class('highlight') will find an element like <div class="sidebar highlight">. Class names are case sensitive.
.find_rel_links(rel): Returns a list of all the <a rel="{rel}"> elements. E.g., doc.find_rel_links('tag') returns all the links marked as tags.
.get_element_by_id(id, default=None): Return the element with the given id, or the default if none is found. If there are multiple elements with the same id (which there shouldn't be, but there often is), this returns only the first.
.text_content(): Returns the text content of the element, including the text content of its children, with no markup.
.cssselect(expr): Select elements from this element and its children, using a CSS selector expression. (Note that .xpath(expr) is also available as on all lxml elements.)
.label: Returns the corresponding <label> element for this element, if any exists (None if there is none). Label elements have a label.for_element attribute that points back to the element.
.base_url: The base URL for this element, if one was saved from the parsing. This attribute is not settable. Is None when no base URL was saved.
.classes: Returns a set-like object that allows accessing and modifying the names in the 'class' attribute of the element. (New in lxml 3.5).
.set(key, value=None): Sets an HTML attribute. If no value is given, or if the value is None, it creates a boolean attribute like <form novalidate></form> or <div custom-attribute></div>. In XML, attributes must have at least the empty string as their value like <form novalidate=""></form>, but HTML boolean attributes can also be just present or absent from an element without having a value.
class sspider.commons.Spider(downloader=<sspider.commons.HtmlDownloader object>, parser=<sspider.commons.HtmlParser object>, requestManager=<sspider.commons.RequestManager object>, writter=<sspider.commons.CommonWritter object>, logger=None)
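To tie the pieces together, a minimal end-to-end sketch could look as follows, assuming Spider follows the AbstractSpider interface described above (the crawl call and the seed URL are illustrative, not prescribed by these docs):
>>> from sspider.commons import Request, Spider
>>> spider = Spider()   # default downloader, parser, requestManager and writter
>>> spider.crawl(Request('get', 'http://www.baidu.com'))   # assumed to follow AbstractSpider.crawl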