Writing Your Own Python Crawler Framework (Part 3): The Parser

Preface

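The parser takes the response data that the downloader fetched and extracts the data we need from it. Websites differ in structure and in front-end implementation, and that variety is what makes parsers complicated. To cope with it, we can build a general-purpose parser plus parsers customized for specific sites. The general-purpose parser grabs the text of a page in bulk; it is crude but simple to implement. One caveat: many pages now load their content asynchronously, so the initial HTML contains little text, and JavaScript rendering has to be considered. A site-specific parser can inherit from the general-purpose parser and parse further, or it can clean the data the general-purpose parser returns to get what is needed.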

Our parser depends on lxml, a Python library for parsing XML and HTML documents.

lxml documentation

Inside the response we added a method that converts the response object into an HtmlElement object; under the hood it uses the lxml library to do the parsing.

The response object supports the following attributes and methods:

text: the response as text, for example:

>>> response = HtmlDownloader().downloader(request)
>>> response.text
...

content: the response data as raw bytes, for example:

>>> response = HtmlDownloader().downloader(request)
>>> response.content
...

json(): converts the response into a JSON object; raises an exception if parsing fails

>>> response = HtmlDownloader().downloader(request)
>>> response.json()
...

html(): converts the response into an HtmlElement object; raises an exception if parsing fails

>>> response = HtmlDownloader().downloader(request)
>>> response.html()
...

Our parser can therefore rely on the response's built-in attributes and methods to do its parsing.
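
To make the interface above concrete, here is a minimal sketch of how such a response wrapper might look. The class name `SketchResponse`, its constructor arguments, and the UTF-8 default are assumptions made for illustration; they are not the framework's actual code. Only `json.loads` and `lxml.html.fromstring` are real library calls.

```python
import json


class SketchResponse:
    """Hypothetical stand-in for the framework's Response class."""

    def __init__(self, content, encoding="utf-8"):
        self.content = content      # raw bytes as downloaded
        self.encoding = encoding    # assumed default, not taken from the article

    @property
    def text(self):
        # Decode the raw bytes into a str.
        return self.content.decode(self.encoding)

    def json(self):
        # json.loads raises json.JSONDecodeError (a ValueError) on invalid JSON.
        return json.loads(self.text)

    def html(self):
        # Imported here so the other accessors work even without lxml installed.
        import lxml.html
        # Builds an lxml.html.HtmlElement from the raw bytes; lxml raises
        # if it cannot build a tree, for example from an empty body.
        return lxml.html.fromstring(self.content)


resp = SketchResponse(b'{"page": 1}')
page_text = resp.text
page_json = resp.json()
```

Keeping `content` as the single source of truth and deriving `text`, `json()` and `html()` from it means a parser can pick whichever view suits the page without downloading it again.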

Parser implementation

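A parser takes the response passed to it and extracts two things: requests that can be crawled next, and the data we want. The return value should therefore be a map-like structure with the keys requests and data, holding the requests that can be crawled further and the parsed text respectively. (The implementations below return the two as a pair, in that order.)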

General-purpose implementation

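So how do we implement a general-purpose parser? It abstracts the features that most crawler parsers share: collecting the requests that can be crawled further and collecting the non-tag text in the document. For example: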

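import copy

# AbstractParser, typeassert and Response are assumed to be imported from the framework's own modules.
class HtmlParser(AbstractParser):

    '''
    Parser

    Parses the response passed to it.

    This is the part of a crawler that is hardest to unify: every website has its own quirks, so users generally need to override it themselves.
    '''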

    @typeassert(response=Response)
    def parse(self, response):
        htm = response.html()
        requests = []
        for item in htm.iter('a'):
            href = item.get('href')
            if not href:
                # Skip anchors without an href; they cannot be crawled
                continue
            request = copy.copy(response.request)
            request.method = 'get'
            request.url = href
            request.level = response.level+1
            requests.append(request)
        datas = [text.strip() for text in htm.itertext()]
        return requests, datas

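The concrete parser inherits from (implements) the parser interface defined earlier. @typeassert validates the types of the arguments passed in: the response argument must be an instance of Response, and the attributes and methods of Response are what our parser builds on.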
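
The article uses @typeassert without showing it. A decorator of that kind might be written roughly as follows; this is a sketch under the assumption that it takes keyword arguments mapping parameter names to types, and the framework's real implementation may differ. `Response` and `DemoParser` here are empty placeholders so the example runs on its own.

```python
import functools
import inspect


def typeassert(**expected_types):
    """Check selected arguments against the given types at call time."""
    def decorate(func):
        sig = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Map positional and keyword arguments to parameter names.
            bound = sig.bind(*args, **kwargs)
            for name, value in bound.arguments.items():
                expected = expected_types.get(name)
                if expected is not None and not isinstance(value, expected):
                    raise TypeError(
                        f"argument {name!r} must be {expected.__name__}, "
                        f"got {type(value).__name__}"
                    )
            return func(*args, **kwargs)
        return wrapper
    return decorate


class Response:
    """Placeholder for the framework's Response class."""


class DemoParser:
    """Placeholder parser used only to exercise the decorator."""

    @typeassert(response=Response)
    def parse(self, response):
        return [], []
```

With this in place, `DemoParser().parse(Response())` goes through, while `DemoParser().parse("some text")` raises a TypeError before any parsing starts, so a wrongly wired pipeline fails early with a readable message.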

Customized parser

Suppose we want the text inside span tags in the HTML, and we want to crawl only the URL that points to the next page as the next request.

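We can either post-process the data returned by the existing general-purpose parser, or inherit from it and override its parse method. Here we choose to inherit from the general-purpose parser and override parse: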

# Inherit from HtmlParser
class MyParser(HtmlParser):

    # Override the parse method
    def parse(self,response):
        # Get an HtmlElement object through the response's built-in method
        doc = response.html()

        '''
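        An HtmlElement object lets you extract data with CSS selectors or XPath expressions.

        For example:
            >>> doc = response.html()
            >>> # Get the href of every a element via XPath
            >>> links = doc.xpath('//a/@href')
            >>> # Get the text of every span element via XPath
            >>> spans = doc.xpath('//span/text()')
            >>> # For more, look up CSS selector and XPath syntax

            Common methods: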
            find, findall, findtext, get, getchildren, getiterator, getnext, getparent, getprevious, getroottree, index, insert, items, iter, iterancestors, iterchildren, iterdescendants, iterfind, itersiblings, itertext, keys, makeelement, remove, replace, values, xpath

            >>> .drop_tree():
            Drops the element and all its children. Unlike el.getparent().remove(el) this does not remove the tail text; with drop_tree the tail text is merged with the previous element.
            >>> .drop_tag():
            Drops the tag, but keeps its children and text.
            >>> .find_class(class_name):
            Returns a list of all the elements with the given CSS class name. Note that class names are space separated in HTML, so doc.find_class('highlight') will find an element like <div class="sidebar highlight">. Class names are case sensitive.
            >>> .find_rel_links(rel):
            Returns a list of all the <a rel="{rel}"> elements. E.g., doc.find_rel_links('tag') returns all the links marked as tags.
            >>> .get_element_by_id(id, default=None):
            Return the element with the given id, or the default if none is found. If there are multiple elements with the same id (which there shouldn't be, but there often is), this returns only the first.
            >>> .text_content():
            Returns the text content of the element, including the text content of its children, with no markup.
            >>> .cssselect(expr):
            Select elements from this element and its children, using a CSS selector expression. (Note that .xpath(expr) is also available as on all lxml elements.)
            >>> .label:
            Returns the corresponding <label> element for this element, if any exists (None if there is none). Label elements have a label.for_element attribute that points back to the element.
            >>> .base_url:
            The base URL for this element, if one was saved from the parsing. This attribute is not settable. Is None when no base URL was saved.
            >>> .classes:
            Returns a set-like object that allows accessing and modifying the names in the 'class' attribute of the element. (New in lxml 3.5).
            >>> .set(key, value=None):
            Sets an HTML attribute. If no value is given, or if the value is None, it creates a boolean attribute like <form novalidate></form> or <div custom-attribute></div>. In XML, attributes must have at least the empty string as their value like <form novalidate=""></form>, but HTML boolean attributes can also be just present or absent from an element without having a value.
        '''

        # Use XPath to select the text inside span elements
        spans = doc.xpath('//span/text()')
        # Then select the URLs that point to the next page and build request objects
        seed_url = 'www.***.com/reviews/'
        requests = [Request('get',seed_url+url) for url in doc.xpath('//*[@id="content"]/div[4]/div[1]/a/@href')]
        # Finally return requests and data
        return requests,spans
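
Concatenating seed_url and the href works only when every href is a bare relative path. Real pages mix relative, root-relative and absolute links, and the standard library's `urllib.parse.urljoin` handles all three. The sketch below shows the idea; the example.com URLs and the helper name `absolute_links` are placeholders chosen for illustration, not part of the framework.

```python
from urllib.parse import urljoin

page_url = "https://example.com/reviews/index.html"  # placeholder page URL
hrefs = ["page2.html", "/reviews/3", "https://example.com/other", None, ""]


def absolute_links(base_url, hrefs):
    """Resolve each href against base_url, skipping missing or empty ones."""
    return [urljoin(base_url, href) for href in hrefs if href]


links = absolute_links(page_url, hrefs)
for link in links:
    print(link)
```

Resolving against the URL of the page that was actually fetched, rather than a hard-coded seed, also keeps the links correct when the crawler has followed a redirect or moved several levels deep.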

Ideas for extension

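Likewise, we can add many operations before and after parsing the response to meet more complex crawling needs, such as cleaning the parsed data after parsing. In this framework, you can extend the parser with the decorator pattern, or inherit from AbstractParser and implement its methods yourself.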
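
As one possible shape for the decorator-pattern route, the sketch below wraps any parser and cleans its data after parsing. `CleaningParser` and `FakeParser` are hypothetical names introduced here; the sketch assumes only that the wrapped parser's parse method returns a (requests, data) pair like the parsers shown earlier.

```python
class CleaningParser:
    """Wrap any parser and clean its data after parsing (decorator pattern)."""

    def __init__(self, inner):
        # inner: any object whose parse(response) returns (requests, data)
        self.inner = inner

    def parse(self, response):
        requests, data = self.inner.parse(response)
        # Drop whitespace-only strings and collapse runs of whitespace.
        cleaned = [" ".join(text.split()) for text in data if text.strip()]
        return requests, cleaned


class FakeParser:
    """Hypothetical stand-in for HtmlParser so the sketch runs on its own."""

    def parse(self, response):
        return ["req-1"], ["  hello   world ", "", "\n", "next page"]


requests, data = CleaningParser(FakeParser()).parse(response=None)
print(data)
```

Because the wrapper exposes the same parse method as the parser it wraps, wrappers can be stacked (cleaning, deduplication, logging) without touching HtmlParser or any site-specific subclass.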