Refactor the crawling and configuration flow #6

Closed
zidoshare opened this issue Nov 27, 2018 · 2 comments
Labels: enhancement, help wanted, wontfix

Comments

@zidoshare
Member

Supersedes the configuration mechanism change in #4

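The overall crawl flow is as follows:

  1. Build the Spider object
  2. Define the extractor
  3. Add entry points

The overall flow changes little, but the details change substantially.

Previously, each request carried its own separate configuration. The new mechanism splits the configuration: part is supplied by the extractor, and part is supplied globally when the Spider is built.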
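One way to picture the split between global and extractor-level configuration is a simple layered lookup. The sketch below is only an illustration: `LayeredConfig`, `set`, `merge`, and `get` are hypothetical names, not the project's API, and it assumes that extractor-level values take precedence over the global ones.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: these names are not the project's API.
final class LayeredConfig {
    private final Map<String, Object> values = new HashMap<>();

    // Stores one option and returns this config so calls can be chained.
    LayeredConfig set(String key, Object value) {
        values.put(key, value);
        return this;
    }

    // Builds a new config: global entries first, then extractor entries on top.
    // Neither input is modified.
    static LayeredConfig merge(LayeredConfig global, LayeredConfig extractor) {
        LayeredConfig merged = new LayeredConfig();
        merged.values.putAll(global.values);
        merged.values.putAll(extractor.values);
        return merged;
    }

    Object get(String key) {
        return values.get(key);
    }
}

class LayeredConfigDemo {
    public static void main(String[] args) {
        LayeredConfig global = new LayeredConfig().set("charset", "UTF-8").set("retryTimes", 3);
        LayeredConfig perExtractor = new LayeredConfig().set("retryTimes", 5);
        LayeredConfig effective = LayeredConfig.merge(global, perExtractor);
        System.out.println(effective.get("charset") + " " + effective.get("retryTimes"));
    }
}
```

Returning a fresh object from `merge` keeps the global configuration reusable across extractors.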

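Building the Spider object

Components

  • Downloader: the downloader
  • Scheduler: the task scheduler
  • DuplicationProcessor: the task deduplication processor
  • CountManager: the counter
  • ProxyProvider: the proxy provider
  • Saver: the storage component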

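Event listeners

The following events are supported:

  • onSaveSuccess: called when persistence succeeds
  • onSaveError: called when persistence fails
  • onDownloadSuccess: called when a download succeeds
  • onDownloadError: called when a download fails
  • onSuccess: called when a task succeeds (tasks have no failed state, which is worth discussing)
  • onPause(Task task): called when a single task is paused
  • onRecover(Task task): called when a single task is resumed successfully
  • onCancel(Task task): called when a single task is cancelled
  • onCancel: called when the spider is cancelled
  • onPause: called when the spider is paused
  • onRecover: called when the spider is resumed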
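The event list above could map onto a listener interface with no-op defaults, so an implementer overrides only the callbacks it cares about. This is a minimal sketch under assumptions: `Task`, `SpiderEventListener`, and `RecordingListener` are hypothetical names, and the parameters of `onDownloadSuccess` and `onDownloadError` are guesses, since the issue only gives signatures for the task-level pause, recover, and cancel callbacks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the project's task type.
final class Task {
    final String url;

    Task(String url) {
        this.url = url;
    }
}

// Hypothetical listener shape: every callback defaults to doing nothing.
interface SpiderEventListener {
    default void onDownloadSuccess(Task task) {}                 // parameters assumed
    default void onDownloadError(Task task, Exception cause) {}  // parameters assumed
    default void onPause(Task task) {}   // a single task was paused
    default void onPause() {}            // the whole spider was paused
}

// Records only the two pause events and ignores everything else.
final class RecordingListener implements SpiderEventListener {
    final List<String> events = new ArrayList<>();

    @Override
    public void onPause(Task task) {
        events.add("task paused: " + task.url);
    }

    @Override
    public void onPause() {
        events.add("spider paused");
    }
}
```

The overloads `onPause(Task)` and `onPause()` mirror the distinction the list draws between task-level and spider-level events.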

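Configuration options

  • userAgent: the user agent string identifying the browser
  • cookie: the default cookie
  • charset: the character encoding
  • sleepTime: how long to wait after each crawl
  • retryTimes: how many times to retry a failed request
  • outTime: the default request timeout
  • downloadMode: the default download mode, one of auto (automatic), httpClient, or htmlunit; extensions may add new download modes
  • successCode: the default code that marks a request as successful, given as a numeric matching expression
  • disableCookie: whether to disable cookies
  • headers: the default request headers
  • proxiable: whether to use a proxy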
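To make the retryTimes option concrete, here is a rough sketch of a retry loop. It assumes retryTimes counts the extra attempts made after the first failure, which the issue does not spell out; `RetryingCall` and `execute` are hypothetical names, and the sketch retries immediately rather than guessing at a delay.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of the project.
final class RetryingCall {
    // Runs the request once, then up to retryTimes more times if it throws.
    // Rethrows the last failure when every attempt has failed.
    static <T> T execute(Supplier<T> request, int retryTimes) {
        if (retryTimes < 0) {
            throw new IllegalArgumentException("retryTimes must be >= 0");
        }
        RuntimeException last = null;
        for (int attempt = 0; attempt <= retryTimes; attempt++) {
            try {
                return request.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again if attempts remain
            }
        }
        throw last;
    }
}
```

With retryTimes set to 2, a request that fails twice and then succeeds is called three times in total.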

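Defining the extractor

The following ways of defining an extraction configuration are provided:

  • ConfigurationExtractor: define the extraction as a POJO in Java
  • ResponseHandler: a handler interface for extracting through the Java API

Extraction configuration that will be implemented:

  • AnnotationSupport: define an extractor with annotations on a POJO class

Less likely ways of defining an extraction configuration:

  • scriptEngine: a script engine that exposes the extraction API to other languages; if implemented, it would probably target one of JavaScript, Lua, or Groovy
  • xmlSupport: define the extractor in an XML document

Suggestions and discussion are welcome here...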

Adding entry points

Call the Spider object's methods for adding entry URLs directly.

@zidoshare
Member Author

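Progress on the crawl flow refactoring:

  • Spider objects are now built with the SpiderBuilder builder.
  • Spider becomes an interface and provides an of method that returns an operation handle for working further with a single task. For example, spider.of(extractor).addEventListener(new SingleEventListener()).execute(url).pause() gives a fluent, chained style.
  • The extractor gets a new ExtractorBuilder for quickly constructing complex extractor objects. The extractor also gains a config property that holds the configuration for that crawl and can override most of the global configuration. A ConfigBuilder class is provided for constructing config.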
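The chained style described above works when each method on the handle returns the handle itself. The sketch below shows that pattern only; `MiniSpider` and `TaskHandle` are hypothetical simplifications, the extractor is reduced to a name, and a plain `Consumer<String>` stands in for the project's listener type.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical per-task handle: every operation returns this, so calls chain.
final class TaskHandle {
    private final List<Consumer<String>> listeners = new ArrayList<>();
    private String url;
    private boolean paused;

    TaskHandle addEventListener(Consumer<String> listener) {
        listeners.add(listener);
        return this;
    }

    // Records the entry URL and notifies listeners; no real download happens here.
    TaskHandle execute(String url) {
        this.url = url;
        publish("execute " + url);
        return this;
    }

    TaskHandle pause() {
        paused = true;
        publish("pause " + url);
        return this;
    }

    boolean isPaused() {
        return paused;
    }

    private void publish(String event) {
        for (Consumer<String> listener : listeners) {
            listener.accept(event);
        }
    }
}

// Hypothetical spider interface: of(...) hands back a handle scoped to one task.
interface MiniSpider {
    TaskHandle of(String extractorName);
}
```

A call such as `spider.of("blog").addEventListener(events::add).execute(url).pause()` then reads the same way as the example in the progress note.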

zidoshare added a commit that referenced this issue Dec 4, 2018
zidoshare added a commit that referenced this issue Dec 4, 2018
zidoshare added a commit that referenced this issue Dec 6, 2018
zidoshare added a commit that referenced this issue Dec 6, 2018
✏️ fix run error.
zidoshare added a commit that referenced this issue Dec 6, 2018
zidoshare added a commit that referenced this issue Dec 7, 2018
@zidoshare zidoshare added enhancement New feature or request help wanted Extra attention is needed wontfix This will not be worked on labels Dec 8, 2018
@zidoshare zidoshare mentioned this issue Dec 18, 2018
Merged
zidoshare added a commit that referenced this issue Dec 18, 2018
@zidoshare
Member Author

After the refactoring, the API looks like this:

spider.of(response -> {
    response.modelName("blog");
    response.asTarget().matchUrl("zido.site/?$");
    response.asContent().url().save("source_url");
    PartitionDescriptor partition = response.asPartition(new CssSelector(".page-container>.blog"));
    partition.field().css("h2.blog-header-title").text().save("title");
    partition.field().css("p.blog-content").text().save("description");
    response.asContent().url().save("url").nullable(false);
    // after obtaining the task operation handle, add an event listener
})

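Specific updates:

  • Completely refactored the selector mechanism
  • Model is now the POJO passed around in memory and over the network
  • Implemented SelectableResponse, a semantic API for constructing a Model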

@zidoshare zidoshare reopened this Dec 18, 2018