{"id":17091573,"url":"https://github.com/wuxudong/rxcrawler","last_synced_at":"2025-08-02T03:11:51.688Z","repository":{"id":149183855,"uuid":"76010737","full_name":"wuxudong/rxcrawler","owner":"wuxudong","description":"a java crawler base on rx-java","archived":false,"fork":false,"pushed_at":"2016-12-18T02:48:13.000Z","size":304,"stargazers_count":13,"open_issues_count":0,"forks_count":5,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-12T22:41:36.488Z","etag":null,"topics":["crawler","nio","rxjava"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/wuxudong.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2016-12-09T07:06:14.000Z","updated_at":"2024-08-20T04:01:41.000Z","dependencies_parsed_at":null,"dependency_job_id":"c4682f55-f663-4054-acba-5d180f4b65f8","html_url":"https://github.com/wuxudong/rxcrawler","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/wuxudong/rxcrawler","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wuxudong%2Frxcrawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wuxudong%2Frxcrawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wuxudong%2Frxcrawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wuxudong%2Frxcrawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/wuxudong","download_url":"https://co
deload.github.com/wuxudong/rxcrawler/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wuxudong%2Frxcrawler/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":268330932,"owners_count":24233152,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-02T02:00:12.353Z","response_time":74,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","nio","rxjava"],"created_at":"2024-10-14T13:59:02.473Z","updated_at":"2025-08-02T03:11:51.676Z","avatar_url":"https://github.com/wuxudong.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# rxcrawler\n\n## Common crawlers\n\n* crawler4j   \nUsed in [proxy-checker](https://github.com/wuxudong/proxy-checker), a project that collects proxy IPs.\n\t* Pros:\n\t\t* Simple and easy to use\n\t\t* Supports resume (after the service stops, a restart continues the previous tasks)\n\t* Cons:\n\t\t* Handles GET requests only\n\t\t* Runs on a single machine only\n\n* webmagic  \n\t* Pros:\n\t\t* Clear structure\n\t\t* Good extensibility\n\t\t* Works out of the box\n\t* Cons:\n\n\t\t1. Resume is flawed for POST requests (by default only the URL is saved, not the POST body)\n\t\t1. When the service is terminated, in-flight requests may be lost (usually this is not a problem, but when crawling the products in a category page by page, for example, a single request lost at restart can cause the whole category to be lost)\n\t\t1. 
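It is built on BIO (blocking I/O), so highly concurrent crawling consumes a large number of threads.  \n\t\t\n\t\tThe first two drawbacks can largely be fixed through extensions, but BIO is part of the core architecture and cannot be changed.\n\n\n\n## Rewriting webmagic with rx-java\nrxcrawler borrows webmagic's structure and interfaces, but rewrites the core spider and downloader with NIO and rx-java, so that a small number of threads can sustain thousands of concurrent fetches. Combined with squid and the proxy IPs obtained by [proxy-checker](https://github.com/wuxudong/proxy-checker), this greatly improves crawling efficiency.\n\n## DEMO\n\n* [JD.com crawler](https://github.com/wuxudong/jdcrawler)\n\nAn example based on rxcrawler that crawls JD.com's mobile API.\n\n* [ITjuzi crawler](https://github.com/wuxudong/itjuzi_crawler)\n\nAn example based on rxcrawler that crawls _ITjuzi_'s mobile API.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwuxudong%2Frxcrawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwuxudong%2Frxcrawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwuxudong%2Frxcrawler/lists"}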