{"id":19217026,"url":"https://github.com/shellvon/vp","last_synced_at":"2025-07-01T13:05:25.006Z","repository":{"id":37062790,"uuid":"136559973","full_name":"shellvon/vp","owner":"shellvon","description":"Very simple VideoPlayer","archived":false,"fork":false,"pushed_at":"2023-01-03T15:47:36.000Z","size":3182,"stargazers_count":10,"open_issues_count":29,"forks_count":1,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-06-16T08:11:47.008Z","etag":null,"topics":["scrapy-crawler"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/shellvon.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-06-08T03:11:20.000Z","updated_at":"2019-05-08T03:56:40.000Z","dependencies_parsed_at":"2023-02-01T07:30:37.205Z","dependency_job_id":null,"html_url":"https://github.com/shellvon/vp","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/shellvon/vp","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shellvon%2Fvp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shellvon%2Fvp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shellvon%2Fvp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shellvon%2Fvp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/shellvon","download_url":"https://codeload.github.com/shellvon/vp/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shellvon%2Fvp/sbom","host":{"name":"GitHub","url":"https:
//github.com","kind":"github","repositories_count":262969871,"owners_count":23392526,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["scrapy-crawler"],"created_at":"2024-11-09T14:19:53.670Z","updated_at":"2025-07-01T13:05:24.947Z","avatar_url":"https://github.com/shellvon.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# vp\n\nA small practice app built with Scrapy \u0026 Vue \u0026 Flask.\n\nA small site for watching movies on a phone.\n\nThe backend uses Scrapy to crawl several resource-collection sites.\nThe frontend is planned as a small WebUI built with Vue.\nFlask provides the RESTful API.\n\n# Notes\n\n - [x] Data collection\n - [x] WebUI   \u003c--- By Vuetify \u0026 Vue CLI3\n - [x] ~~RESTful~~ API \u003c---- By flask\n\n# Step 1. Collecting sites\n\nSearch for sites that aggregate video resources: [资源采集 (resource collection)](https://www.google.com/search?q=%E8%B5%84%E6%BA%90%E9%87%87%E9%9B%86\u0026oq=%E8%B5%84%E6%BA%90%E9%87%87%E9%9B%86+\u0026aqs=chrome..69i57j69i61l3.5331j0j1\u0026sourceid=chrome\u0026ie=UTF-8)\nThe search turned up the following sites:\n\n![Resource collection search results](./data/google.png)\n\n- https://www.yongjiuzy.com\n- http://www.131zyw.com/\n- http://www.caijizy.com/\n- http://www.wz80.com/\n\n# Step 2. 
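Writing the code\n\nCreate the crawler project with `scrapy startproject vp`.\n\nAccording to the Scrapy documentation, a [Scrapy.Item](http://doc.scrapy.org/en/latest/topics/items.html) can hold the scraped data. A simple film record usually has the following attributes:\n\n``` python\nimport scrapy\n\n\nclass FilmItem(scrapy.Item):\n    source = scrapy.Field()  # site the film was scraped from\n    name = scrapy.Field()  # film title\n    name_alias = scrapy.Field()  # alternative title\n    note = scrapy.Field()  # note\n    category = scrapy.Field()  # genre\n    region = scrapy.Field()  # region\n    cover = scrapy.Field()  # cover image\n    poster = scrapy.Field()  # first-frame image shown in the player\n    url = scrapy.Field()   # playback URL\n    actors = scrapy.Field()  # lead actors\n    director = scrapy.Field()  # director\n    synopsis = scrapy.Field()  # synopsis\n    language = scrapy.Field()  # language\n    year = scrapy.Field()  # release year\n```\n\nBecause we crawl several different sites, we need several Spiders. The parts the sites have in common are factored out into a base class for all Spiders; see `misc.spider.CommonSpider`.\n\n# Step 3. Storage\n\nThis project does not need URL deduplication, so the `dont_filter` parameter can probably be set to True, and there is no need for a distributed-crawling plugin such as scrapy-redis. Films from different sites do, however, need to be deduplicated.\nThis project stores data in MongoDB, mainly because storing and querying with Mongo is very simple; see the Results section below for examples.\n\nThe film name is the unique key: update the record if the film exists, insert it otherwise. This is already implemented by [Scrapy-Mongo](https://github.com/sebdah/scrapy-mongodb).\nThe configuration is as follows:\n\n```Python\nITEM_PIPELINES = {\n    'vp.pipelines.FilmPipeline': 300,\n    'scrapy_mongodb.MongoDBPipeline': 500, # store in Mongo\n}\n\n\nMONGODB_UNIQUE_KEY = 'name' # film name is unique\nMONGODB_DATABASE = 'vp'\nMONGODB_COLLECTION = 'films'\nMONGODB_ADD_TIMESTAMP = True\n```\n\nItems need cleaning and validation before they are stored in Mongo, so an [Item Pipeline](https://doc.scrapy.org/en/latest/topics/item-pipeline.html) is also required.\nFor the `vp.pipelines.FilmPipeline` and `scrapy_mongodb.MongoDBPipeline` entries above, the configured numbers set the processing order: pipelines with lower values run first.\n\n\n# Step 4. Debugging\n\nSee [Scrapy shell](https://doc.scrapy.org/en/latest/topics/shell.html).\nFor example, to have a given spider fetch a single URL and process it with a given callback:\n`python -m scrapy parse --spider=cjzy \"http://www.caijizy.com/?m=vod-detail-id-12738.html\" -c parse_film_detail`\n\n# Step 5. 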
Results\n\n```shell\n\n\u003e db.films.find().length()  # total number of films\n9492\n\u003e db.films.aggregate([{$unwind: '$category'}, {$group: {_id: '$category', cnt: {$sum: 1}}}, {$sort: {cnt: -1}}]) # count films per category\n{ \"_id\" : \"剧情\", \"cnt\" : 3298 }\n{ \"_id\" : \"喜剧\", \"cnt\" : 2387 }\n{ \"_id\" : \"动作\", \"cnt\" : 1813 }\n{ \"_id\" : \"爱情\", \"cnt\" : 1782 }\n{ \"_id\" : \"网络大电影\", \"cnt\" : 1597 }\n{ \"_id\" : \"内地\", \"cnt\" : 1490 }\n{ \"_id\" : \"伦理片\", \"cnt\" : 1388 }\n{ \"_id\" : \"国语\", \"cnt\" : 1279 }\n\u003e db.films.find({actors: {$all: [/刘德华/, /周润发/]}}, {name: 1, actors: 1, _id: 0}) # find films starring both 刘德华 (Andy Lau) and 周润发 (Chow Yun-fat)\n{ \"name\" : \"精装追女仔3之狼之一族 \", \"actors\" : [ \"刘德华\", \"张敏\", \"邱淑贞\", \"冯淬帆\", \"王晶\", \"黄霑\", \"周慧敏\", \"吴君如\", \"周润发\", \"郑丹瑞\" ] }\n{ \"name\" : \"江湖情 \", \"actors\" : [ \"周润发\", \"刘德华\", \"谭咏麟\", \"刘嘉玲\", \"万梓良\", \"李修贤\", \"王小凤\" ] }\n{ \"name\" : \"江湖情2英雄好汉 英雄好漢 \", \"actors\" : [ \"周润发\", \"刘德华\", \"万梓良\", \"刘嘉玲\", \"李修贤\", \"杨群\", \"柯俊雄\", \"王小凤\", \"成奎安\" ] }\n\n```\n\n**Screenshot**\n\n![Screenshot](./data/films.png)\n\n# Other\n\nAfter a quick run, it turned out, surprisingly, that all of the sites above can be crawled completely without a proxy.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshellvon%2Fvp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshellvon%2Fvp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshellvon%2Fvp/lists"}