{"id":13464489,"url":"https://github.com/code4craft/webmagic","last_synced_at":"2025-05-12T03:41:14.302Z","repository":{"id":8193523,"uuid":"9623064","full_name":"code4craft/webmagic","owner":"code4craft","description":"A scalable web crawler framework for Java.","archived":false,"fork":false,"pushed_at":"2025-05-10T12:07:26.000Z","size":17430,"stargazers_count":11551,"open_issues_count":365,"forks_count":4169,"subscribers_count":762,"default_branch":"develop","last_synced_at":"2025-05-10T13:20:51.394Z","etag":null,"topics":["crawler","framework","java","scraping"],"latest_commit_sha":null,"homepage":"http://webmagic.io/","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/code4craft.png","metadata":{"files":{"readme":"README-zh.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2013-04-23T12:57:36.000Z","updated_at":"2025-05-09T12:44:27.000Z","dependencies_parsed_at":"2023-01-13T14:40:31.443Z","dependency_job_id":"920e30cf-ab19-4a75-89f5-474a7d8cf064","html_url":"https://github.com/code4craft/webmagic","commit_stats":{"total_commits":1044,"total_committers":60,"mean_commits":17.4,"dds":0.210727969348659,"last_synced_commit":"244ade7b4c88d21bd676a5ea128a8ac2a8f53456"},"previous_names":[],"tags_count":37,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/code4craft%2Fwebmagic","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/code4craft%2Fwebmagic/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/code4craft%2Fwebmagic/releas
es","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/code4craft%2Fwebmagic/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/code4craft","download_url":"https://codeload.github.com/code4craft/webmagic/tar.gz/refs/heads/develop","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253670244,"owners_count":21945239,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","framework","java","scraping"],"created_at":"2024-07-31T14:00:44.515Z","updated_at":"2025-05-12T03:41:14.278Z","avatar_url":"https://github.com/code4craft.png","language":"Java","funding_links":[],"categories":["All","Java","Projects","Core Libraries","scraping","项目","III. Network and Integration"],"sub_categories":["Web Crawling","Java","Web爬行","7. 
Web Crawling and HTML parsering"],"readme":"![logo](http://webmagic.io/images/logo.jpeg)\n\n\n[![Maven Central](https://maven-badges.herokuapp.com/maven-central/us.codecraft/webmagic-parent/badge.svg?subject=Maven%20Central)](https://maven-badges.herokuapp.com/maven-central/us.codecraft/webmagic-parent/)\n[![License](https://img.shields.io/badge/License-Apache%20License%202.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0.html)\n[![Build Status](https://travis-ci.org/code4craft/webmagic.png?branch=master)](https://travis-ci.org/code4craft/webmagic)\n\nOfficial website: [http://webmagic.io/](http://webmagic.io/)\n\n\u003ewebmagic is an open-source vertical crawler framework for Java. It aims to simplify crawler development so that developers can concentrate on their own extraction logic. The core of webmagic is very simple yet covers the entire crawling workflow, which also makes it good material for learning how crawlers are built.\n\n\nMain features of webmagic:\n\n* Fully modular design with strong extensibility.\n* A simple core that still covers the full crawling workflow; flexible and powerful, and a good starting point for learning about crawlers.\n* A rich API for extracting data from pages.\n* No configuration files; a crawler can be defined with a POJO plus annotations.\n* Multithreading support.\n* Distributed crawling support.\n* Support for crawling pages rendered dynamically by JavaScript.\n* No framework dependencies, so it can be embedded flexibly in any project.\n\nThe architecture and design of webmagic draw on the following two projects; thanks to their authors:\n\nPython crawler **scrapy** [https://github.com/scrapy/scrapy](https://github.com/scrapy/scrapy)\n\nJava crawler **Spiderman** [http://git.oschina.net/l-weiwei/spiderman](http://git.oschina.net/l-weiwei/spiderman)\n\nwebmagic on GitHub: [https://github.com/code4craft/webmagic](https://github.com/code4craft/webmagic).\n\n## Quick Start\n\n### Using Maven\n\nwebmagic uses Maven to manage dependencies. Add the following dependencies to your project to use webmagic:\n\n```xml\n\u003cdependency\u003e\n    \u003cgroupId\u003eus.codecraft\u003c/groupId\u003e\n    \u003cartifactId\u003ewebmagic-core\u003c/artifactId\u003e\n    \u003cversion\u003e${webmagic.version}\u003c/version\u003e\n\u003c/dependency\u003e\n\u003cdependency\u003e\n    \u003cgroupId\u003eus.codecraft\u003c/groupId\u003e\n    \u003cartifactId\u003ewebmagic-extension\u003c/artifactId\u003e\n    \u003cversion\u003e${webmagic.version}\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\nWebMagic uses slf4j-log4j12 as its slf4j implementation. If you provide your own slf4j implementation, exclude this dependency from your project:\n\n```xml\n\u003cexclusions\u003e\n    \u003cexclusion\u003e\n        
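\u003cgroupId\u003eorg.slf4j\u003c/groupId\u003e\n        \u003cartifactId\u003eslf4j-log4j12\u003c/artifactId\u003e\n    \u003c/exclusion\u003e\n\u003c/exclusions\u003e\n```\n\n#### Project Structure\n\nwebmagic consists of two main packages:\n\n* **webmagic-core**\n\n\tThe core of webmagic, containing only the basic crawler modules and basic extractors. The goal of webmagic-core is to be a textbook implementation of a web crawler.\n\n* **webmagic-extension**\n\n\tThe extension module of webmagic, which provides tools that make crawlers more convenient to write, including annotation-based crawler definitions, JSON support, and distributed crawling support.\n\nwebmagic also includes two optional extension packages. Because both depend on relatively heavyweight tools, they are kept separate from the main packages, and you need to download the source and build them yourself:\n\n* **webmagic-saxon**\n\n\tThe module that integrates webmagic with Saxon. Saxon is an XPath and XSLT processor; webmagic relies on Saxon to support XPath 2.0 syntax.\n\n* **webmagic-selenium**\n\n\tThe module that integrates webmagic with Selenium. Selenium is a tool that simulates a browser to render pages; webmagic relies on Selenium to crawl dynamic pages.\n\nIn your project, depend on whichever packages you need.\n\n### Without Maven\n\nThe **lib** directory of the project contains all the required jar files; import them directly in your IDE.\n\n### Your First Crawler\n\n#### Writing a Custom PageProcessor\n\nPageProcessor is part of webmagic-core. Implementing your own PageProcessor is all it takes to define your crawler's logic. The following code crawls blog posts on oschina:\n\n```java\nimport java.util.List;\n\nimport us.codecraft.webmagic.Page;\nimport us.codecraft.webmagic.Site;\nimport us.codecraft.webmagic.Spider;\nimport us.codecraft.webmagic.pipeline.ConsolePipeline;\nimport us.codecraft.webmagic.processor.PageProcessor;\n\npublic class OschinaBlogPageProcessor implements PageProcessor {\n\n    private Site site = Site.me().setDomain(\"my.oschina.net\");\n\n    @Override\n    public void process(Page page) {\n        // Queue every blog post link found on this page.\n        List\u003cString\u003e links = page.getHtml().links().regex(\"http://my\\\\.oschina\\\\.net/flashsword/blog/\\\\d+\").all();\n        page.addTargetRequests(links);\n        // Store the extracted fields for the pipeline.\n        page.putField(\"title\", page.getHtml().xpath(\"//div[@class='BlogEntity']/div[@class='BlogTitle']/h1\").toString());\n        page.putField(\"content\", page.getHtml().$(\"div.content\").toString());\n        page.putField(\"tags\", page.getHtml().xpath(\"//div[@class='BlogTags']/a/text()\").all());\n    }\n\n    @Override\n    public Site getSite() {\n        return site;\n    }\n\n    public static void main(String[] args) {\n        Spider.create(new OschinaBlogPageProcessor()).addUrl(\"http://my.oschina.net/flashsword/blog\")\n             .addPipeline(new ConsolePipeline()).run();\n    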
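}\n}\n```\n\n\nHere, page.addTargetRequests() adds the URLs to be crawled, and page.putField() stores the extracted results. page.getHtml().xpath() extracts content according to a rule, and extraction calls can be chained. At the end of a chain, toString() returns a single String, while all() returns a list of Strings.\n\nSpider is the entry-point class of a crawler. Pipeline is the interface for outputting and persisting results; ConsolePipeline here prints the results to the console.\n\nRun the main method to see the crawl results in the console. By default webmagic waits 3 seconds between requests, so please be patient.\n\n#### Using Annotations\n\nwebmagic-extension supports writing crawlers with annotations: add annotations to a POJO and the crawler is complete. The following code also crawls oschina blog posts and provides the same functionality as OschinaBlogPageProcessor:\n\n```java\nimport java.util.List;\n\nimport us.codecraft.webmagic.Site;\nimport us.codecraft.webmagic.model.ConsolePageModelPipeline;\nimport us.codecraft.webmagic.model.OOSpider;\nimport us.codecraft.webmagic.model.annotation.ExtractBy;\nimport us.codecraft.webmagic.model.annotation.TargetUrl;\n\n@TargetUrl(\"http://my.oschina.net/flashsword/blog/\\\\d+\")\npublic class OschinaBlog {\n\n    @ExtractBy(\"//title\")\n    private String title;\n\n    @ExtractBy(value = \"div.BlogContent\", type = ExtractBy.Type.Css)\n    private String content;\n\n    @ExtractBy(value = \"//div[@class='BlogTags']/a/text()\", multi = true)\n    private List\u003cString\u003e tags;\n\n    public static void main(String[] args) {\n        OOSpider.create(\n            Site.me(),\n            new ConsolePageModelPipeline(), OschinaBlog.class).addUrl(\"http://my.oschina.net/flashsword/blog\").run();\n    }\n}\n```\n\nThis example defines a model class whose fields 'title', 'content', and 'tags' are the properties to extract. The class can be reused in a Pipeline.\n\n### Documentation\n\nSee [http://webmagic.io/docs/](http://webmagic.io/docs/).\n\n### Samples\n\nThe webmagic-samples directory contains examples of custom PageProcessors that extract data from various sites.\n\nFor an example of webmagic in use, see: [oschina openapi application: blog migration](https://git.oschina.net/yashin/MoveBlog)\n\n\n### License\n\nwebmagic is licensed under the [Apache 2.0 License](http://opensource.org/licenses/Apache-2.0).\n\n### Mailing lists:\n\nGmail:\n[https://groups.google.com/forum/#!forum/webmagic-java](https://groups.google.com/forum/#!forum/webmagic-java)\n\nQQ:\n[http://list.qq.com/cgi-bin/qf_invite?id=023a01f505246785f77c5a5a9aff4e57ab20fcdde871e988](http://list.qq.com/cgi-bin/qf_invite?id=023a01f505246785f77c5a5a9aff4e57ab20fcdde871e988)\n\n### QQ groups:\n\n373225642 (full) 542327088\n\n### Related projects:\n\n[Gather Platform](https://github.com/gsh199449/spider)\n\nGather Platform 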
is a data collection and search platform built on the Webmagic core, with a web interface for task configuration and task management.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcode4craft%2Fwebmagic","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcode4craft%2Fwebmagic","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcode4craft%2Fwebmagic/lists"}