{"id":20783317,"url":"https://github.com/crawlab-team/webspot","last_synced_at":"2025-05-11T11:36:00.064Z","repository":{"id":180960206,"uuid":"588420914","full_name":"crawlab-team/webspot","owner":"crawlab-team","description":"An intelligent web service to automatically detect web content and extract information from it.","archived":false,"fork":false,"pushed_at":"2023-07-13T13:35:22.000Z","size":2829,"stargazers_count":84,"open_issues_count":1,"forks_count":12,"subscribers_count":5,"default_branch":"main","last_synced_at":"2024-11-09T00:52:50.288Z","etag":null,"topics":["crawlab","crawler","spider","web"],"latest_commit_sha":null,"homepage":"https://webspot.crawlab.net","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/crawlab-team.png","metadata":{"files":{"readme":"README-zh.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-01-13T04:21:18.000Z","updated_at":"2024-08-03T20:57:20.000Z","dependencies_parsed_at":null,"dependency_job_id":"24dbd6b7-43b7-40ed-8de3-737595e31785","html_url":"https://github.com/crawlab-team/webspot","commit_stats":null,"previous_names":["crawlab-team/webspot","tikazyq/webspot"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crawlab-team%2Fwebspot","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crawlab-team%2Fwebspot/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crawlab-team%2Fwebspot/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crawlab-team%2Fwebspot/manifests","owner_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/owners/crawlab-team","download_url":"https://codeload.github.com/crawlab-team/webspot/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225046970,"owners_count":17412552,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawlab","crawler","spider","web"],"created_at":"2024-11-17T14:18:15.647Z","updated_at":"2024-11-17T14:18:16.274Z","avatar_url":"https://github.com/crawlab-team.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Webspot\n\nWebspot is a service that intelligently detects and extracts web page content.\n\n[Demo](https://webspot.crawlab.net)\n\n[English](https://github.com/crawlab-team/webspot)\n\n## Screenshots\n\n### Detection Results\n\n![](./docs/screenshots/screenshot-result-list.png)\n\n### Extracted Fields\n\n![](./docs/screenshots/screenshot-extracted-fields.png)\n\n### Extracted Data\n\n![](./docs/screenshots/screenshot-extracted-data.png)\n\n## Quick Start\n\n### Docker\n\nMake sure [Docker](https://docs.docker.com/) and [Docker Compose](https://docs.docker.com/compose/) are installed.\n\n```bash\n# clone git repo\ngit clone https://github.com/crawlab-team/webspot\n\n# enter the cloned repo\ncd webspot\n\n# start docker containers\ndocker-compose up -d\n```\n\nYou can then open the web UI at http://localhost:9999.\n\n## API Reference\n\nOnce Webspot is running, you can view the API documentation at http://localhost:9999/redoc.\n\n## Architecture\n\nThe diagram below shows Webspot's overall workflow and its key components.\n\n```mermaid\ngraph LR\n    hr[HtmlRequester]\n    gl[GraphLoader]\n    d[Detector]\n    r[Results]\n\n    hr --\"html + json\"--\u003e gl --\"graph\"--\u003e d --\"output\"--\u003e r\n```\n\n## Development\n\nYou can follow the guide below to start developing.\n\n### Requirements\n\n- Python \u003e=3.8 and \u003c=3.10\n- Go 1.16 or 
higher\n- MongoDB 4.2 or higher\n\n### Install Dependencies\n\n```bash\n# dependencies\npip install -r requirements.txt\n```\n\n### Configure Environment Variables\n\nDatabase settings are stored in the `.env` file. You can copy the example file and modify it.\n\n```bash\ncp .env.example .env\n```\n\n### Start the Web Server\n\n```bash\n# start development server\npython main.py web\n```\n\n### Code Structure\n\nThe core code is in the `webspot` directory. `main.py` is the entry point of the web service.\n\n```\nwebspot\n├── cmd     # command line tools\n├── crawler # web crawler\n├── data    # data files (html, json, etc.)\n├── db      # database\n├── detect  # web content detection\n├── graph   # graph module\n├── models  # models\n├── request # request helper\n├── test    # test cases\n├── utils   # utilities\n└── web     # web server\n```\n\n## TODOs\n\nWebspot aims to automate the detection and extraction of web content. It is still at an early stage, and many features remain to be implemented.\n\n- [ ] Table detection\n- [ ] Nested list detection\n- [ ] Export to spiders\n- [ ] Advanced browser request\n\n## Disclaimer\n\nPlease comply with local laws and regulations when using Webspot. The author is not responsible for any legal issues arising from the use of Webspot. Please read the [Disclaimer](./DISCLAIMER-zh.md)\nfor details.\n\n## Community\n\nIf you are interested in Webspot, add the author's WeChat account \"tikazyq1\" with the note \"Webspot\" to join the discussion group.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://crawlab.oss-cn-hangzhou.aliyuncs.com/gitbook/qrcode.png\" height=\"360\"\u003e\n\u003c/p\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcrawlab-team%2Fwebspot","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcrawlab-team%2Fwebspot","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcrawlab-team%2Fwebspot/lists"}