sspider
sspider.api
simpleSpider, quick start:
- copyright: 2018 by pengr.
- license: GNU GENERAL PUBLIC LICENSE, see LICENSE for more details.
class sspider.commons.CommonWritter
Bases: sspider.spider.AbstractWritter
Data writer class.
Writes data to disk in a specific format.
- flush_buffer(**kwargs)
- headers
- items
- remove_buffer(**kwargs)
- write_buffer(**kwargs)
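A minimal usage sketch of the writer. The buffer methods above only expose **kwargs, so the keyword name used here is an assumption, not the documented API:

    from sspider.commons import CommonWritter

    writter = CommonWritter()
    # Buffer one parsed item, then flush the buffer to disk
    # (the 'item' keyword is assumed).
    writter.write_buffer(item={'title': 'example'})
    writter.flush_buffer()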
class sspider.commons.HtmlDownloader(timeout=3)
Bases: sspider.spider.AbstractDownloader
Downloader.
Downloads the requests passed to it.
@member :: timeout : download timeout
Around a download (before and after it), some middleware-style extension is often needed. In this framework you can extend the downloader with the decorator pattern, or subclass it and implement the methods of the AbstractDownloader interface yourself.
For example, adding a proxy via a decorator:
    # We need a proxy pool (it can be made a singleton class, similar to
    # RequestManager). Suppose our proxy pool looks like this:
    class ProxyPool(object):
        def get_new_proxy(self):
            '''Request a proxy; return a usable, not-yet-used proxy.
            return : proxy
            '''
            pass

    # The decorator dynamically attaches a proxy to each request before the
    # wrapped download runs (Request exposes a proxies member, see below).
    def proxyWrapper(func):
        def decorate(self, request):
            request.proxies = ProxyPool().get_new_proxy()
            return func(self, request)
        return decorate
With the proxy decorator in place, it can be used directly on the download method:

    @typeassert(request=Request)
    @proxyWrapper
    def download(self, request):
        with sessions.Session() as session:
            return Response(request, cls=session.request(**request.TransRequestParam()))
Likewise, the download unit can be moved into its own thread or into multiple processes so that downloading runs asynchronously with parsing, improving crawl efficiency. Cookies, request headers, and similar settings can be injected the same way as the proxy decorator above; a sketch follows.
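A minimal sketch of such a headers decorator, by analogy with proxyWrapper (the decorator name is hypothetical; the writable headers member and the User-Agent value come from the Request docs below):

    def headersWrapper(func):
        # Attach a default User-Agent when the request carries no headers.
        def decorate(self, request):
            request.headers = request.headers or {'User-Agent': 'Mozilla/5.0'}
            return func(self, request)
        return decorate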
class sspider.commons.HtmlParser
Bases: sspider.spider.AbstractParser
Parser.
Parses the text passed to it.
This is the hardest part of a crawl to standardize: every site has its own quirks, so users generally need to rewrite this part themselves.
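A sketch of such a site-specific parser. The parse(response) hook name is an assumption about the AbstractParser interface; the xpath usage follows the Response.html() docs below:

    from sspider.commons import HtmlParser

    class MyParser(HtmlParser):
        def parse(self, response):
            doc = response.html()
            # Collect every link and every span text on the page.
            links = doc.xpath('//a/@href')
            texts = doc.xpath('//span/text()')
            return links, texts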
class sspider.commons.Request(method, url, params=None, data=None, headers=None, cookies=None, files=None, auth=None, timeout=None, allow_redirects=True, proxies=None, hooks=None, stream=None, verify=None, cert=None, json=None, level=1)
Bases: object
Request object.
@member :: method : request method, one of GET, POST, PUT, DELETE, OPTION
@member :: url : request url
@member :: params : query parameters
@member :: data : request body data
@member :: headers : request headers
@member :: cookies : cookies
@member :: files : files
@member :: auth : auth
@member :: timeout : timeout
@member :: allow_redirects : allow_redirects
@member :: proxies : proxies, a dict such as {'http': '10.10.154.23:10002', 'https': '10.10.154.23:10004'}
@member :: hooks : hooks
@member :: stream : stream
@member :: verify : verify
@member :: cert : cert
@member :: json : json
@member :: level : level
Usually, to start a crawl we need to initialize one Request object or a group of them.
When initializing a Request, the request method and url are required; everything else is optional:

    request = Request('get', 'http://www.baidu.com')

A Request with request headers:

    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
    request = Request('get', 'http://www.baidu.com', headers=headers)
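A POST Request carrying body data and the proxies dict documented above (the payload and proxy addresses are placeholders):

    request = Request('post', 'http://www.baidu.com',
                      data={'q': 'example'},
                      proxies={'http': '10.10.154.23:10002', 'https': '10.10.154.23:10004'})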
class sspider.commons.RequestManager(limit_level=1)
Bases: sspider.spider.AbstractRequestManager
Request manager.
Manages all requests.
- get_new_request(level=None)
  Get a request that has not been requested yet.
  @param :: level : extract from the given level; defaults to None
  return : request
- has_new_request(level=None)
  Check whether there are still requests waiting to be crawled; when level is passed, only that level is checked. By default all levels are checked.
  @param :: level : the level to check; defaults to None
  return : Bool
- level
  Get the current level.
  return : int
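A minimal polling loop over the documented methods (this assumes requests were already queued through the manager's add-side API, which is not shown in this section):

    manager = RequestManager(limit_level=2)
    while manager.has_new_request():
        request = manager.get_new_request()
        # hand each pending request to the downloader here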
class sspider.commons.Response(request, cls=None, **kwargs)
Bases: requests.models.Response
Response object.
@member :: _content : binary response data
@member :: _content_consumed : _content_consumed
@member :: _next : _next
@member :: status_code : response status code
@member :: headers : response headers
@member :: raw : raw
@member :: url : request url
@member :: encoding : response encoding
@member :: history : response history
@member :: reason : reason
@member :: cookies : response cookies
@member :: elapsed : elapsed
@member :: request : request
@member :: level : the level of the corresponding request
- html(encoding=None, **kwargs)
  Parse the response into an HtmlElement object; data can then be extracted with CSS selectors or xpath syntax. For example:

    doc = response.html()
    # get the href of each a element via xpath
    links = doc.xpath('//a/@href')
    # get the text of each span element via xpath
    spans = doc.xpath('//span/text()')

  For more, consult references on CSS selectors and xpath syntax.
  Common methods: find, findall, findtext, get, getchildren, getiterator, getnext, getparent, getprevious, getroottree, index, insert, items, iter, iterancestors, iterchildren, iterdescendants, iterfind, itersiblings, itertext, keys, makeelement, remove, replace, values, xpath
  - .drop_tree(): Drops the element and all its children. Unlike el.getparent().remove(el) this does not remove the tail text; with drop_tree the tail text is merged with the previous element.
  - .drop_tag(): Drops the tag, but keeps its children and text.
  - .find_class(class_name): Returns a list of all the elements with the given CSS class name. Note that class names are space separated in HTML, so doc.find_class('highlight') will find an element like <div class="sidebar highlight">. Class names are case sensitive.
  - .find_rel_links(rel): Returns a list of all the <a rel="{rel}"> elements. E.g., doc.find_rel_links('tag') returns all the links marked as tags.
  - .get_element_by_id(id, default=None): Return the element with the given id, or the default if none is found. If there are multiple elements with the same id (which there shouldn't be, but there often is), this returns only the first.
  - .text_content(): Returns the text content of the element, including the text content of its children, with no markup.
  - .cssselect(expr): Select elements from this element and its children, using a CSS selector expression. (Note that .xpath(expr) is also available, as on all lxml elements.)
  - .label: Returns the corresponding <label> element for this element, if any exists (None if there is none). Label elements have a label.for_element attribute that points back to the element.
  - .base_url: The base URL for this element, if one was saved from the parsing. This attribute is not settable. Is None when no base URL was saved.
  - .classes: Returns a set-like object that allows accessing and modifying the names in the 'class' attribute of the element. (New in lxml 3.5.)
  - .set(key, value=None): Sets an HTML attribute. If no value is given, or if the value is None, it creates a boolean attribute like <form novalidate></form> or <div custom-attribute></div>. In XML, attributes must have at least the empty string as their value, like <form novalidate=""></form>, but HTML boolean attributes can also be just present or absent from an element without having a value.
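Putting downloader and response together (a sketch; the download method and Response.html() usage follow the docs above):

    downloader = HtmlDownloader(timeout=5)
    response = downloader.download(Request('get', 'http://www.baidu.com'))
    doc = response.html()
    title = doc.xpath('//title/text()')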
class sspider.commons.Spider(downloader=<sspider.commons.HtmlDownloader object>, parser=<sspider.commons.HtmlParser object>, requestManager=<sspider.commons.RequestManager object>, writter=<sspider.commons.CommonWritter object>, logger=None)
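A construction sketch: the Spider wires together the collaborators documented above, and each of them can be replaced with a custom subclass (the crawl entry point is not shown in this section, so it is omitted here):

    spider = Spider(
        downloader=HtmlDownloader(timeout=5),
        parser=MyParser(),                      # the custom parser sketched earlier
        requestManager=RequestManager(limit_level=2),
        writter=CommonWritter(),
    )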