Refactor the crawling and configuration flow #6

zidoshare · 2018-11-27T16:49:23Z

Covers the configuration mechanism change in #4.

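The overall crawl flow is as follows:

Build the Spider object
Define the extractor
Add entry points

The overall flow changes little, but the details change considerably.

Previously, each request carried its own separate configuration. The new mechanism splits the configuration: part of it is provided by the extractor, and part of it is provided globally when the Spider is built.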
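
The split between extractor-level and global configuration can be sketched roughly as follows. This is only an illustration, assuming that a value set on the extractor takes precedence and that anything left unset falls back to the global value. CrawlConfig and withFallback are hypothetical names, not the project's API.

```java
class ConfigOverrideSketch {
    public static void main(String[] args) {
        // Global defaults, supplied when the Spider is built.
        CrawlConfig global = new CrawlConfig("Mozilla/5.0", 3, 1000L);
        // The extractor overrides retryTimes only; null means "not set here".
        CrawlConfig perExtractor = new CrawlConfig(null, 5, null);

        CrawlConfig effective = perExtractor.withFallback(global);
        System.out.println(effective.userAgent + ", " + effective.retryTimes + ", " + effective.sleepTime);
    }
}

/** Hypothetical config holder; field names are borrowed from the option list in this issue. */
final class CrawlConfig {
    final String userAgent;
    final Integer retryTimes;
    final Long sleepTime;

    CrawlConfig(String userAgent, Integer retryTimes, Long sleepTime) {
        this.userAgent = userAgent;
        this.retryTimes = retryTimes;
        this.sleepTime = sleepTime;
    }

    /** Returns a config in which every unset (null) value is taken from the fallback. */
    CrawlConfig withFallback(CrawlConfig fallback) {
        return new CrawlConfig(
                userAgent != null ? userAgent : fallback.userAgent,
                retryTimes != null ? retryTimes : fallback.retryTimes,
                sleepTime != null ? sleepTime : fallback.sleepTime);
    }
}
```

Using nullable fields keeps "not set" distinct from a real value such as 0, which matters when only some options are overridden.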

Build the Spider object

Components

Downloader: downloader
Scheduler: task scheduler
DuplicationProcessor: task deduplication processor
CountManager: counter
ProxyProvider: proxy provider
Saver: storage backend

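Event listeners

The supported events are listed below:

onSaveSuccess: callback for successful persistence
onSaveError: callback for failed persistence
onDownloadSuccess: callback for a successful download
onDownloadError: callback for a failed download
onSuccess: callback for task success (tasks have no failed state, which is worth discussing)
onPause(Task task): callback when a single task is paused
onRecover(Task task): callback when a single task is successfully resumed
onCancel(Task task): callback when a single task is cancelled
onCancel: callback when the spider is cancelled
onPause: callback when the spider is paused
onRecover: callback when the spider is resumed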
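
One possible shape for such a listener is an interface with no-op default methods, so that callers override only the events they care about. The sketch below is an assumption for discussion: the method names follow the list above, but the parameters, the SpiderEventListener interface, and the EventBus dispatcher are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class EventListenerSketch {
    public static void main(String[] args) {
        EventBus bus = new EventBus();
        List<String> log = new ArrayList<>();
        // Only the callbacks of interest are overridden.
        bus.addListener(new SpiderEventListener() {
            @Override public void onDownloadError(String url, Exception cause) {
                log.add("download failed: " + url);
            }
            @Override public void onPause() {
                log.add("spider paused");
            }
        });
        bus.fireDownloadError("https://example.com", new RuntimeException("timeout"));
        bus.firePause();
        System.out.println(log);
    }
}

/** Hypothetical listener: names come from the event list, parameters are assumed. */
interface SpiderEventListener {
    default void onDownloadSuccess(String url) {}
    default void onDownloadError(String url, Exception cause) {}
    default void onSaveSuccess(String url) {}
    default void onSaveError(String url, Exception cause) {}
    default void onPause() {}
    default void onRecover() {}
    default void onCancel() {}
}

/** Hypothetical dispatcher that forwards events to every registered listener. */
final class EventBus {
    private final List<SpiderEventListener> listeners = new ArrayList<>();

    void addListener(SpiderEventListener listener) { listeners.add(listener); }

    void fireDownloadError(String url, Exception cause) {
        for (SpiderEventListener l : listeners) l.onDownloadError(url, cause);
    }

    void firePause() {
        for (SpiderEventListener l : listeners) l.onPause();
    }
}
```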

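Configuration options

userAgent: user agent, the browser identifier
cookie: default cookie
charset: character encoding
sleepTime: wait time after each crawl
retryTimes: number of retries for each failed request
outTime: default request timeout
downloadMode: default download mode, one of auto (automatic), httpClient, or htmlunit; extensions may add new download modes
successCode: default code that marks a request as successful, specified as a numeric matching expression
disableCookie: whether to disable cookies
headers: default request headers
proxiable: whether to use a proxy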
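
As an illustration of how an option such as retryTimes might be applied, here is a minimal retry loop. It assumes that retryTimes counts the extra attempts made after the first failure, which the list above does not state explicitly. The Retry helper is hypothetical and not part of the project.

```java
import java.util.function.Supplier;

class RetrySketch {
    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated request: fails twice, then succeeds on the third attempt.
        String body = Retry.call(() -> {
            if (++calls[0] < 3) throw new IllegalStateException("simulated failure");
            return "ok";
        }, 3);
        System.out.println(body + " after " + calls[0] + " attempts");
    }
}

/** Hypothetical helper; assumes retryTimes means retries in addition to the first attempt. */
final class Retry {
    private Retry() {}

    static <T> T call(Supplier<T> task, int retryTimes) {
        if (retryTimes < 0) throw new IllegalArgumentException("retryTimes must be >= 0");
        RuntimeException last = null;
        for (int attempt = 0; attempt <= retryTimes; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again if attempts remain
            }
        }
        throw last; // every attempt failed, so surface the last error
    }
}
```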

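Define the extractor

The following ways of implementing a crawl configuration are provided:

ConfigurationExtractor: define a POJO in Java
ResponseHandler: a handler interface for crawling through the Java API

Crawl configuration methods that will be implemented:

AnnotationSupport: define an extractor with annotations on a POJO class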
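
To make the annotation plus POJO idea concrete, here is a rough sketch of how selectors could be declared on fields and read back with reflection. Nothing here is the planned API: the @Css annotation, the BlogPost class, and SelectorReader are hypothetical, and the selector strings are borrowed from the API sample later in this issue.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Map;
import java.util.TreeMap;

class AnnotationSketch {
    public static void main(String[] args) {
        // Prints the field-to-selector mapping, sorted by field name.
        System.out.println(SelectorReader.read(BlogPost.class));
    }
}

/** Hypothetical annotation carrying a CSS selector for one field. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Css {
    String value();
}

/** Example POJO; only annotated fields take part in extraction. */
class BlogPost {
    @Css("h2.blog-header-title") String title;
    @Css("p.blog-content") String description;
    String notExtracted;
}

/** Collects the selector declared on each annotated field. */
final class SelectorReader {
    private SelectorReader() {}

    static Map<String, String> read(Class<?> type) {
        Map<String, String> selectors = new TreeMap<>();
        for (Field field : type.getDeclaredFields()) {
            Css css = field.getAnnotation(Css.class);
            if (css != null) selectors.put(field.getName(), css.value());
        }
        return selectors;
    }
}
```

The annotation needs RUNTIME retention, otherwise getAnnotation returns null and no selectors are found.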

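Less likely crawl configuration methods:

scriptEngine: a script engine that exposes additional APIs for crawling; if implemented, the language would probably be chosen from JavaScript, Lua, and Groovy
xmlSupport: define an extractor in an XML document

Suggestions and discussion are welcome here...

Add entry points

Use the Spider object's methods for adding entry URLs directly.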


zidoshare · 2018-12-04T15:13:28Z

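Progress on the crawl flow refactoring:

Spider objects are constructed with the SpiderBuilder builder.
The Spider class becomes an interface and provides an of method that returns an operation handle for further operations on a single task. For example, spider.of(extractor).addEventListener(new SingleEventListener()).execute(url).pause() is a fluent chained call.
The extractor gets a new ExtractorBuilder builder for quickly constructing complex extractor objects. The extractor gains a config property that holds the configuration for the current crawl and can override most of the global configuration. A ConfigBuilder class is provided for constructing the config.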
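
The chained handle described above could work along these lines. This is a simplified sketch under stated assumptions: the extractor is reduced to a plain name, events are plain strings, and TaskHandle, TaskListener, and the State enum are hypothetical stand-ins rather than the project's real types.

```java
import java.util.ArrayList;
import java.util.List;

class FluentHandleSketch {
    public static void main(String[] args) {
        Spider spider = TaskHandle::new; // minimal in-memory Spider
        TaskHandle handle = spider.of("blog-extractor")
                .addEventListener(event -> System.out.println("event: " + event))
                .execute("https://zido.site/")
                .pause();
        System.out.println(handle.state()); // prints PAUSED
    }
}

/** Hypothetical stand-in for the Spider interface; of returns a per-task handle. */
interface Spider {
    TaskHandle of(String extractorName);
}

/** Hypothetical single-task listener; events are plain strings in this sketch. */
interface TaskListener {
    void on(String event);
}

/** Every operation returns the handle itself, which is what makes chaining possible. */
final class TaskHandle {
    enum State { NEW, RUNNING, PAUSED }

    private final String extractorName;
    private final List<TaskListener> listeners = new ArrayList<>();
    private State state = State.NEW;

    TaskHandle(String extractorName) { this.extractorName = extractorName; }

    TaskHandle addEventListener(TaskListener listener) {
        listeners.add(listener);
        return this;
    }

    TaskHandle execute(String url) {
        state = State.RUNNING;
        fire("execute " + url + " with " + extractorName);
        return this;
    }

    TaskHandle pause() {
        if (state != State.RUNNING) throw new IllegalStateException("task is not running");
        state = State.PAUSED;
        fire("pause");
        return this;
    }

    State state() { return state; }

    private void fire(String event) {
        for (TaskListener l : listeners) l.on(event);
    }
}
```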

zidoshare · 2018-12-18T05:28:42Z

After the refactoring, the API looks like this:

spider.of(response -> {
    response.modelName("blog");
    response.asTarget().matchUrl("zido.site/?$");
    response.asContent().url().save("source_url");
    PartitionDescriptor partition = response.asPartition(new CssSelector(".page-container>.blog"));
    partition.field().css("h2.blog-header-title").text().save("title");
    partition.field().css("p.blog-content").text().save("description");
    response.asContent().url().save("url").nullable(false);
    // After obtaining the task operation handle, add an event listener
})

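Specific updates:

The selector mechanism was completely rewritten
Model is used as the POJO transferred in memory and over the network
The semantic Model-construction API of SelectableResponse is implemented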

zidoshare added a commit that referenced this issue Dec 4, 2018

✨ operator feature support #6

7ef9fd6

zidoshare added a commit that referenced this issue Dec 4, 2018

✨ operator feature support #6

b936d33

zidoshare added a commit that referenced this issue Dec 5, 2018

🔨 ✨ SpiderBuilder and ConfigBuilder support #6

6d3c270

zidoshare added a commit that referenced this issue Dec 6, 2018

✨ ConfigBuilder finish #6

086f6e3

zidoshare added a commit that referenced this issue Dec 6, 2018

✨ ExtractorBuilder finish #6

3a37b07

✏️　fix run error.

zidoshare added a commit that referenced this issue Dec 6, 2018

✏️ api is stable #6

fd277b6

zidoshare added a commit that referenced this issue Dec 7, 2018

✏️ add RequestBuilder #6

9c96a2b

zidoshare added the enhancement, help wanted, and wontfix labels Dec 8, 2018

zidoshare closed this as completed Dec 9, 2018

zidoshare mentioned this issue Dec 18, 2018

Dev #13

Merged

zidoshare added a commit that referenced this issue Dec 18, 2018

🔨 refactor response selectors #6

cb2099c

zidoshare reopened this Dec 18, 2018

zidoshare closed this as completed Jan 25, 2019
