{"id":19990609,"url":"https://github.com/ShichaoMa/structure_spider","last_synced_at":"2025-05-04T09:36:19.221Z","repository":{"id":24733832,"uuid":"100699889","full_name":"ShichaoMa/structure_spider","owner":"ShichaoMa","description":"Combines multiple requests to scrape structured data; built on scrapy components","archived":false,"fork":false,"pushed_at":"2022-12-08T06:39:13.000Z","size":1117,"stargazers_count":29,"open_issues_count":10,"forks_count":13,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-03T18:07:12.492Z","etag":null,"topics":["crawl","scrapy","spider","structure"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ShichaoMa.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-08-18T10:08:10.000Z","updated_at":"2022-11-25T05:31:44.000Z","dependencies_parsed_at":"2023-01-14T01:30:51.479Z","dependency_job_id":null,"html_url":"https://github.com/ShichaoMa/structure_spider","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ShichaoMa%2Fstructure_spider","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ShichaoMa%2Fstructure_spider/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ShichaoMa%2Fstructure_spider/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ShichaoMa%2Fstructure_spider/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ShichaoMa","download_url":"https://codeload.github.com/ShichaoMa/structure_spider/tar.gz/refs/heads/master","host":{"name":"
GitHub","url":"https://github.com","kind":"github","repositories_count":252273411,"owners_count":21721912,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawl","scrapy","spider","structure"],"created_at":"2024-11-13T04:51:21.973Z","updated_at":"2025-05-04T09:36:18.926Z","avatar_url":"https://github.com/ShichaoMa.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Structured Spider\nScrape structured data by building a request tree for each Item\n\n![](https://github.com/ShichaoMa/structure_spider/blob/master/resources/item-collector.jpg)\n# USAGE\n### Install structure_spider\n```\ndev@ubuntu:~$ pip install structure-spider\n```\n### Create a project\n```\ndev@ubuntu:~$ structure-spider create project -n myapp\nNew structure-spider project 'myapp', using template directory '/home/dev/.pyenv/versions/3.6.0/lib/python3.6/site-packages/structor/templates/project', created in:\n    /home/dev/myapp\n\nYou can start the spider with:\n    cd myapp\n    custom-redis-server -ll INFO -lf\n    scrapy crawl douban\n```\n### Start the simple redis server; to use a real redis instead, just comment out `CUSTOM_REDIS=True` in settings.py\n```\ndev@ubuntu:~$ custom-redis-server -ll INFO -lf\n```\n### Generate a custom spider and item\nUse `create spider` to generate a ready-to-run spider. Give the spider name (`-n` in the example below), then list the fields to scrape and their rules, joining each field and rule with `=`. A rule can be a regular expression, an XPath expression, or a CSS selector.\n\nFor more complex rules or data cleaning, see the wiki.\n```\ndev@ubuntu:~$ cd myapp/myapp/\ndev@ubuntu:~/myapp/myapp$ ls\nitems  settings.py  spiders\ndev@ubuntu:~/myapp/myapp$ structure-spider create spider -n zhaopin \"product_id=/(\\d+)\\\\.htm\" \"job=//h1/text()\" \"salary=//a/../../strong/text()\" 'city=//ul[@class=\"terminal-ul clearfix\"]//strong/a/text()' 
'education=//span[contains(text(), \"学历\")]/following-sibling::strong/text()' \"company=h2 \u003e a\" -ip '//td[@class=\"zwmc\"]/div/a[1]/@href' -pp '//li[@class=\"pagesDown-pos\"]/a/@href'\nZhaopinSpdier and ZhaopinItem have been created.\ndev@ubuntu:~/myapp/myapp$\n```\n\nReference: [Scraping structured data by combining multiple requests with structure_spider](https://zhuanlan.zhihu.com/p/28636195)\n### Start the spider\n```\ndev@ubuntu:~/myapp/myapp$ scrapy crawl zhaopin\n```\n### Feed tasks\n```\ndev@ubuntu:~/myapp$ structure-spider feed -s zhaopin -u \"https://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E6%B5%8E%E5%8D%97\u0026kw=%E9%94%80%E5%94%AE\u0026sm=0\u0026p=1\" -c zhaopin --custom # --custom means the simple redis is in use\n```\n### Check task status\n```\ndev@ubuntu:~/myapp$ structure-spider check zhaopin --custom\n```\nMore resources:\n\n[[structure_spider weekly exercise]: one-click download of Baidu mp3s](https://zhuanlan.zhihu.com/p/29076630)\n\n[Generate a custom spider in one click: point at whatever you want to scrape!](https://zhuanlan.zhihu.com/p/33561576)\n\n[Advanced scrapy: ItemCollector, a tool for scraping Items by combining multiple requests, explained in detail!](https://zhuanlan.zhihu.com/p/33699058)\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FShichaoMa%2Fstructure_spider","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FShichaoMa%2Fstructure_spider","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FShichaoMa%2Fstructure_spider/lists"}