{"id":15028679,"url":"https://github.com/nghuyong/weibospider","last_synced_at":"2025-04-10T03:49:05.473Z","repository":{"id":38375119,"uuid":"108714485","full_name":"nghuyong/WeiboSpider","owner":"nghuyong","description":"持续维护的新浪微博采集工具🚀🚀🚀","archived":false,"fork":false,"pushed_at":"2024-08-02T05:57:30.000Z","size":16396,"stargazers_count":3785,"open_issues_count":23,"forks_count":838,"subscribers_count":67,"default_branch":"master","last_synced_at":"2025-04-03T02:08:45.035Z","etag":null,"topics":["python","scrapy","weibo","weibospider"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nghuyong.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-10-29T07:37:08.000Z","updated_at":"2025-04-02T19:29:57.000Z","dependencies_parsed_at":"2023-11-10T12:48:31.258Z","dependency_job_id":"e1cc535e-6933-4a31-82f1-1aeb1765ec5c","html_url":"https://github.com/nghuyong/WeiboSpider","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nghuyong%2FWeiboSpider","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nghuyong%2FWeiboSpider/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nghuyong%2FWeiboSpider/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nghuyong%2FWeiboSpider/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nghuyong","download_url":"https://co
deload.github.com/nghuyong/WeiboSpider/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248154995,"owners_count":21056542,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["python","scrapy","weibo","weibospider"],"created_at":"2024-09-24T20:08:52.262Z","updated_at":"2025-04-10T03:49:05.455Z","avatar_url":"https://github.com/nghuyong.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n    \u003cbr\u003e\n    \u003cimg src=\"./.github/weibospider.png\" width=\"400\"/\u003e\n    \u003cbr\u003e\n\u003cp\u003e\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://www.codacy.com/gh/nghuyong/WeiboSpider/dashboard?utm_source=github.com\u0026amp;utm_medium=referral\u0026amp;utm_content=nghuyong/WeiboSpider\u0026amp;utm_campaign=Badge_Grade\"\u003e\n    \u003cimg src=\"https://app.codacy.com/project/badge/Grade/cf88a8b1e6e44c5d993d2cbea7d44c85\"\n         alt=\"Codacy Badge\"\u003e\n  \u003c/a\u003e\n    \u003ca href=\"https://scan.coverity.com/projects/nghuyong-weibospider\"\u003e\n    \u003cimg alt=\"Coverity Scan Build Status\"\n       src=\"https://scan.coverity.com/projects/26928/badge.svg\"/\u003e\n  \u003c/a\u003e\n    \u003ca href=\"https://github.com/nghuyong/WeiboSpider/stargazers\"\u003e\n    \u003cimg src=\"https://img.shields.io/github/stars/nghuyong/WeiboSpider.svg?colorA=orange\u0026colorB=orange\u0026logo=github\"\n         alt=\"GitHub stars\"\u003e\n  \u003c/a\u003e\n  \u003ca 
href=\"https://github.com/nghuyong/WeiboSpider/issues\"\u003e\n        \u003cimg src=\"https://img.shields.io/github/issues/nghuyong/WeiboSpider.svg\"\n             alt=\"GitHub issues\"\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://github.com/nghuyong/WeiboSpider/forks\"\u003e\n        \u003cimg src=\"https://img.shields.io/github/forks/nghuyong/WeiboSpider.svg\"\n             alt=\"GitHub forks\"\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://github.com/nghuyong/WeiboSpider/\"\u003e\n        \u003cimg src=\"https://img.shields.io/github/last-commit/nghuyong/WeiboSpider.svg\"\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://github.com/nghuyong/WeiboSpider/blob/master/LICENSE\"\u003e\n        \u003cimg src=\"https://img.shields.io/github/license/nghuyong/WeiboSpider.svg\"\n             alt=\"GitHub license\"\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\n\u003ch4 align=\"center\"\u003e\n    \u003cp\u003eAn actively maintained Sina Weibo scraping tool🚀🚀🚀\u003c/p\u003e\n\u003c/h4\u003e\n\n\n## Features\n\n- Built on the new weibo.com API, which exposes the richest set of fields\n- Multiple collection modes: users, tweets, fans, follows, reposts, comments, and keyword search\n- The core code is only 100 lines and highly readable, so it can be quickly customized as needed\n\n## Quick start\n\n### Clone \u0026\u0026 install\n\n```bash\ngit clone https://github.com/nghuyong/WeiboSpider.git --depth 1\ncd WeiboSpider\npip install -r requirements.txt\n```\n\n### Replace the Cookie\n\nVisit [https://weibo.com/](https://weibo.com/), log in, open the browser's developer tools, and refresh the page\n\n![](.github/cookie.png)\n\nIn the Network panel, copy the Cookie value from the `weibo.com` request. Edit `weibospider/cookie.txt` and replace its contents with the Cookie you just copied\n\n### Add a proxy IP (optional)\n\nOverride the [fetch_proxy](./weibospider/middlewares.py#6L)\nmethod so that it returns a proxy IP; see [here](https://github.com/nghuyong/WeiboSpider/issues/124#issuecomment-654335439) for details\n\n## Running the spiders\n\nRewrite the `start_requests` function in `./weibospider/spiders/*` to suit your needs\n\nCollected data is saved under `output`, in files named `{spider.name}_{datetime}.jsonl`\n\n### Collect user profiles\n\n```bash\ncd weibospider\npython run_spider.py user\n```\n\n```json\n{\n  \"crawl_time\": 1666863485,\n  \"_id\": \"1749127163\",\n  \"avatar_hd\": 
\"https://tvax4.sinaimg.cn/crop.0.0.1080.1080.1024/001Un9Srly8h3fpj11yjyj60u00u0q7f02.jpg?KID=imgbed,tva\u0026Expires=1666874283\u0026ssig=a%2FMfgFzvRo\",\n  \"nick_name\": \"雷军\",\n  \"verified\": true,\n  \"description\": \"小米董事长，金山软件董事长。业余爱好是天使投资。\",\n  \"followers_count\": 22756103,\n  \"friends_count\": 1373,\n  \"statuses_count\": 14923,\n  \"gender\": \"m\",\n  \"location\": \"北京 海淀区\",\n  \"mbrank\": 7,\n  \"mbtype\": 12,\n  \"verified_type\": 0,\n  \"verified_reason\": \"小米创办人，董事长兼CEO；金山软件董事长；天使投资人。\",\n  \"birthday\": \"\",\n  \"created_at\": \"2010-05-31 23:07:59\",\n  \"desc_text\": \"小米创办人，董事长兼CEO；金山软件董事长；天使投资人。\",\n  \"ip_location\": \"IP属地：北京\",\n  \"sunshine_credit\": \"信用极好\",\n  \"label_desc\": [\n    \"V指数 财经 75.30分\",\n    \"热门财经博主 数据飙升\",\n    \"昨日发博3，阅读数100万+，互动数1.9万\",\n    \"视频累计播放量9819.3万\",\n    \"群友 3132\"\n  ],\n  \"company\": \"金山软件\",\n  \"education\": {\n    \"school\": \"武汉大学\"\n  }\n}\n```\n\n### Collect a user's fan list\n\n```bash\npython run_spider.py fan\n```\n\n```json\n{\n  \"crawl_time\": 1666863563,\n  \"_id\": \"1087770692_5968044974\",\n  \"follower_id\": \"1087770692\",\n  \"fan_info\": {\n    \"_id\": \"5968044974\",\n    \"avatar_hd\": \"https://tvax1.sinaimg.cn/default/images/default_avatar_male_180.gif?KID=imgbed,tva\u0026Expires=1666874363\u0026ssig=UuzaeK437R\",\n    \"nick_name\": \"用户5968044974\",\n    \"verified\": false,\n    \"description\": \"\",\n    \"followers_count\": 0,\n    \"friends_count\": 195,\n    \"statuses_count\": 9,\n    \"gender\": \"m\",\n    \"location\": \"其他\",\n    \"mbrank\": 0,\n    \"mbtype\": 0,\n    \"credit_score\": 80,\n    \"created_at\": \"2016-06-25 22:30:13\"\n  }\n}\n...\n```\n\n### Collect a user's follow list\n\n```bash\npython run_spider.py follow\n```\n\n```json\n{\n  \"crawl_time\": 1666863679,\n  \"_id\": \"1087770692_7083568088\",\n  \"fan_id\": \"1087770692\",\n  \"follower_info\": {\n    \"_id\": \"7083568088\",\n    \"avatar_hd\": 
\"https://tvax4.sinaimg.cn/crop.0.0.1080.1080.1024/007JnVEcly8gyqd9jadjlj30u00u0gpn.jpg?KID=imgbed,tva\u0026Expires=1666874479\u0026ssig=9zhfeMPLzr\",\n    \"nick_name\": \"蒋昀霖\",\n    \"verified\": true,\n    \"description\": \"工作请联系：lijialun@kpictures.cn\",\n    \"followers_count\": 329216,\n    \"friends_count\": 58,\n    \"statuses_count\": 342,\n    \"gender\": \"m\",\n    \"location\": \"北京\",\n    \"mbrank\": 6,\n    \"mbtype\": 12,\n    \"credit_score\": 80,\n    \"created_at\": \"2019-04-17 16:25:43\",\n    \"verified_type\": 0,\n    \"verified_reason\": \"东申未来 演员\"\n  }\n}\n...\n```\n\n\n### Collect tweet comments\n\n```bash\npython run_spider.py comment\n```\n\n```json\n{\n  \"crawl_time\": 1666863805,\n  \"_id\": 4826279188108038,\n  \"created_at\": \"2022-10-19 13:41:29\",\n  \"like_counts\": 1,\n  \"ip_location\": \"来自河南\",\n  \"content\": \"五周年快乐呀，请坤哥哥继续保持这份热爱，奔赴下一场山海\",\n  \"comment_user\": {\n    \"_id\": \"2380967841\",\n    \"avatar_hd\": \"https://tvax4.sinaimg.cn/crop.0.0.888.888.1024/002B8iv7ly8gv647ipgxvj60oo0oojtk02.jpg?KID=imgbed,tva\u0026Expires=1666874604\u0026ssig=%2FdGaaIRkhf\",\n    \"nick_name\": \"流年执念的二瓜娇\",\n    \"verified\": false,\n    \"description\": \"蓝桉已遇释怀鸟，不爱万物唯爱你。\",\n    \"followers_count\": 238,\n    \"friends_count\": 1655,\n    \"statuses_count\": 12546,\n    \"gender\": \"f\",\n    \"location\": \"河南\",\n    \"mbrank\": 6,\n    \"mbtype\": 11\n  }\n}\n...\n```\n\n### Collect tweet reposts\n\n```bash\npython run_spider.py repost\n```\n\n```json\n{\n  \"_id\": \"4826312651310475\",\n  \"mblogid\": \"Mb2vL5uUH\",\n  \"created_at\": \"2022-10-19 15:54:27\",\n  \"geo\": null,\n  \"ip_location\": \"发布于 德国\",\n  \"reposts_count\": 0,\n  \"comments_count\": 0,\n  \"attitudes_count\": 0,\n  \"source\": \"iPhone客户端\",\n  \"content\": \"共享[鼓掌][太开心][鼓掌]五周年快乐！//@陈坤:#山下学堂五周年# 五年， 感谢同行。\",\n  \"pic_urls\": [],\n  \"pic_num\": 0,\n  \"user\": {\n    \"_id\": \"2717869081\",\n    \"avatar_hd\": 
\"https://tvax1.sinaimg.cn/crop.0.0.160.160.1024/a1ff6419ly8gz1xoq9oolj204g04g745.jpg?KID=imgbed,tva\u0026Expires=1666876939\u0026ssig=Cl93CLjdB%2F\",\n    \"nick_name\": \"YuFeeC\",\n    \"verified\": false,\n    \"mbrank\": 0,\n    \"mbtype\": 0\n  },\n  \"url\": \"https://weibo.com/2717869081/Mb2vL5uUH\",\n  \"crawl_time\": 1666866139\n}\n...\n```\n\n### Collect tweets by tweet ID\n\n```bash\npython run_spider.py tweet_by_tweet_id\n```\n\n```json\n{\n    \"_id\": \"4762810834227120\",\n    \"mblogid\": \"LqlZNhJFm\",\n    \"created_at\": \"2022-04-27 10:20:54\",\n    \"geo\": null,\n    \"ip_location\": null,\n    \"reposts_count\": 1890,\n    \"comments_count\": 1924,\n    \"attitudes_count\": 12167,\n    \"source\": \"三星Galaxy S22 Ultra\",\n    \"content\": \"生于乱世纵横四海，义之所在不计生死，孤勇者陈恭一生当如是。#风起陇西今日开播# #风起陇西#  今晚，恭候你！\",\n    \"pic_urls\": [],\n    \"pic_num\": 0,\n    \"isLongText\": false,\n    \"user\": {\n        \"_id\": \"1087770692\",\n        \"avatar_hd\": \"https://tvax1.sinaimg.cn/crop.0.0.1080.1080.1024/40d61044ly8gbhxwgy419j20u00u0goc.jpg?KID=imgbed,tva\u0026Expires=1682768013\u0026ssig=r1QurGoc2L\",\n        \"nick_name\": \"陈坤\",\n        \"verified\": true,\n        \"mbrank\": 7,\n        \"mbtype\": 12,\n        \"verified_type\": 0\n    },\n    \"video\": \"http://f.video.weibocdn.com/o0/CmQEWK1ylx07VAm0nrxe01041200YDIc0E010.mp4?label=mp4_720p\u0026template=1280x720.25.0\u0026ori=0\u0026ps=1CwnkDw1GXwCQx\u0026Expires=1682760813\u0026ssig=26udcPSXFJ\u0026KID=unistore,video\",\n    \"url\": \"https://weibo.com/1087770692/LqlZNhJFm\",\n    \"crawl_time\": 1682757213\n}\n...\n```\n\n### Collect tweets by user ID\n\n```bash\npython run_spider.py tweet_by_user_id\n```\n\n```json\n{\n  \"crawl_time\": 1666864583,\n  \"_id\": \"4762810834227120\",\n  \"mblogid\": \"LqlZNhJFm\",\n  \"created_at\": \"2022-04-27 10:20:54\",\n  \"geo\": null,\n  \"ip_location\": null,\n  \"reposts_count\": 1907,\n  \"comments_count\": 1924,\n  \"attitudes_count\": 12169,\n  \"source\": \"三星Galaxy S22 
Ultra\",\n  \"content\": \"生于乱世纵横四海，义之所在不计生死，孤勇者陈恭一生当如是。#风起陇西今日开播# #风起陇西#  今晚，恭候你！\",\n  \"pic_urls\": [],\n  \"pic_num\": 0,\n  \"video\": \"http://f.video.weibocdn.com/o0/CmQEWK1ylx07VAm0nrxe01041200YDIc0E010.mp4?label=mp4_720p\u0026template=1280x720.25.0\u0026ori=0\u0026ps=1CwnkDw1GXwCQx\u0026Expires=1666868183\u0026ssig=RlIeOt286i\u0026KID=unistore,video\",\n  \"url\": \"https://weibo.com/1087770692/LqlZNhJFm\"\n}\n...\n```\n\n\n### Collect tweets by keyword\n\n```bash\npython run_spider.py tweet_by_keyword\n```\n\n```json\n{\n  \"crawl_time\": 1666869049,\n  \"keyword\": \"丽江\",\n  \"_id\": \"4829255386537989\",\n  \"mblogid\": \"Mch46rqPr\",\n  \"created_at\": \"2022-10-27 18:47:50\",\n  \"geo\": {\n    \"type\": \"Point\",\n    \"coordinates\": [\n      26.962427,\n      100.248299\n    ],\n    \"detail\": {\n      \"poiid\": \"B2094251D06FAAF44299\",\n      \"title\": \"山野文创旅拍圣地\",\n      \"type\": \"checkin\",\n      \"spot_type\": \"0\"\n    }\n  },\n  \"ip_location\": \"发布于 云南\",\n  \"reposts_count\": 0,\n  \"comments_count\": 0,\n  \"attitudes_count\": 1,\n  \"source\": \"iPhone1314iPhone客户端\",\n  \"content\": \"丽江小漾日出\\n推出户外移动餐桌\\n接受私人定制\\n让美食融入美景心情自然美丽了！\\n#小众宝藏旅行地##超出片的艺术街区#  \",\n  \"pic_urls\": [\n    \"https://wx1.sinaimg.cn/orj960/4b138405gy1h7k1a56c4oj234022onph\",\n    \"https://wx1.sinaimg.cn/orj960/4b138405gy1h7k19eb2kxj22ts1vvb2a\",\n    \"https://wx1.sinaimg.cn/orj960/4b138405gy1h7k1a0wzglj22ua1w7hdw\",\n    \"https://wx1.sinaimg.cn/orj960/4b138405gy1h7k19wsafnj231x21a7wj\",\n    \"https://wx1.sinaimg.cn/orj960/4b138405gy1h7k19jd1xkj22oh1sbkjo\",\n    \"https://wx1.sinaimg.cn/orj960/4b138405gy1h7k19mma74j22ru1ukx6q\",\n    \"https://wx1.sinaimg.cn/orj960/4b138405gy1h7k19tf1bfj234022oe85\",\n    \"https://wx1.sinaimg.cn/orj960/4b138405gy1h7k19pk37pj234022okjm\",\n    \"https://wx1.sinaimg.cn/orj960/4b138405gy1h7k19g6nzfj20wi0lo7my\"\n  ],\n  \"pic_num\": 9,\n  \"user\": {\n    \"_id\": \"1259570181\",\n    \"avatar_hd\": 
\"https://tvax1.sinaimg.cn/crop.0.0.1080.1080.1024/4b138405ly8gzfkfikyqvj20u00u0ag1.jpg?KID=imgbed,tva\u0026Expires=1666879848\u0026ssig=6PUDG5RonQ\",\n    \"nick_name\": \"飞鸟与鱼\",\n    \"verified\": true,\n    \"mbrank\": 7,\n    \"mbtype\": 12,\n    \"verified_type\": 0\n  },\n  \"url\": \"https://weibo.com/1259570181/Mch46rqPr\"\n}\n...\n```\n\n## Changelog\n\n- 2024.02: Support collecting the read count of your own tweets [#313](https://github.com/nghuyong/WeiboSpider/issues/313)\n- 2024.02: Support collecting video play counts [#315](https://github.com/nghuyong/WeiboSpider/issues/315)\n- 2024.01: Support tracing reposted tweets back to the original tweet [#314](https://github.com/nghuyong/WeiboSpider/issues/314)\n- 2023.12: Support collecting second-level comments on tweets [#302](https://github.com/nghuyong/WeiboSpider/issues/302)\n- 2023.12: Support collecting a user's tweets within a specified time range [#308](https://github.com/nghuyong/WeiboSpider/issues/308)\n- 2023.04: Support collecting tweets by tweet ID [#272](https://github.com/nghuyong/WeiboSpider/issues/272)\n- 2022.11: Support retrieving more than 1200 pages of search results per day for a single keyword [#257](https://github.com/nghuyong/WeiboSpider/issues/257)\n- 2022.11: Support fetching the full text of long tweets\n- 2022.11: Keyword-based tweet search supports a specified time range\n- 2022.10: Add collection of IP location information for user, tweet, and comment data\n- 2022.10: Refactor the project around the weibo.com site\n\n## Citation\n```\n@inproceedings{hu-etal-2020-weibo,\n    title = \"{W}eibo-{COV}: A Large-Scale {COVID}-19 Social Media Dataset from {W}eibo\",\n    author = \"Hu, Yong  and\n      Huang, Heyan  and\n      Chen, Anfan  and\n      Mao, Xian-Ling\",\n    booktitle = \"Proceedings of the 1st Workshop on {NLP} for {COVID}-19 (Part 2) at {EMNLP} 2020\",\n    month = dec,\n    year = \"2020\",\n    address = \"Online\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://www.aclweb.org/anthology/2020.nlpcovid19-2.34\",\n    doi = \"10.18653/v1/2020.nlpcovid19-2.34\",\n}\n```\n\n## Other work\n\n- 
We have built WeiboCOV, a very large dataset available free of charge on request, containing 20 million active Weibo users and 60 million tweets; see [here](https://github.com/nghuyong/weibo-public-opinion-datasets)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnghuyong%2Fweibospider","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnghuyong%2Fweibospider","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnghuyong%2Fweibospider/lists"}