{"id":13617518,"url":"https://github.com/Ziazan/douyin_web","last_synced_at":"2025-04-14T06:34:09.536Z","repository":{"id":94783684,"uuid":"261616900","full_name":"Ziazan/douyin_web","owner":"Ziazan","description":"抖音用户分享页数据爬虫","archived":false,"fork":false,"pushed_at":"2020-05-24T16:08:52.000Z","size":2636,"stargazers_count":35,"open_issues_count":2,"forks_count":11,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-08-01T20:47:46.529Z","etag":null,"topics":["python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Ziazan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2020-05-06T00:52:37.000Z","updated_at":"2024-03-16T10:05:19.000Z","dependencies_parsed_at":"2023-03-21T19:32:34.445Z","dependency_job_id":null,"html_url":"https://github.com/Ziazan/douyin_web","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ziazan%2Fdouyin_web","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ziazan%2Fdouyin_web/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ziazan%2Fdouyin_web/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ziazan%2Fdouyin_web/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Ziazan","download_url":"https://codeload.github.com/Ziazan/douyin_web/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":2236
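21825,"owners_count":17174765,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["python"],"created_at":"2024-08-01T20:01:43.041Z","updated_at":"2024-11-08T02:30:33.824Z","avatar_url":"https://github.com/Ziazan.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\u003c!--\n * @Author: your name\n * @Date: 2020-05-03 18:04:10\n * @LastEditTime: 2020-05-25 00:07:38\n * @LastEditors: Please set LastEditors\n * @Description: In User Settings Edit\n * @FilePath: /python/douyin_web/README.md\n --\u003e\nNotes: (the signature rules have been updated, so the video list data can no longer be retrieved)\n1. The sign generated with selenium differs from the real sign\n2. Suspected that the encrypted JS checks for webdriver, so switched to the Firefox webdriver  ---- failed\n3. Since the signature has to be generated by JS, tried PyExecJS in Python  ---- failed: some JS variables are unavailable\n4. Tried pyppeteer  ---- failed: still detected\n\n# Scraping video information from Douyin share pages\n\n## Goal\nStarting from the share link of a Douyin user's profile page, for example [https://v.douyin.com/KhkbCq/](https://v.douyin.com/KhkbCq/),\ncollect the user's basic information, such as follower count, video count, comment count per video, video publish time, and like count per video.\n\n## Approach\n After visiting the share link, collect the user's basic information: Douyin ID, nickname, like count, following count, follower count.\n Analyze the rules of the video list API and build the URL for requesting the video list. The signature in this URL is obtained by generating an HTML file and opening it with `selenium`. (See [signature分析.md](https://github.com/Ziazan/douyin_web/blob/master/doc/signature%E5%88%86%E6%9E%90.md).)\n\n The JSON returned by the video list API contains each video's basic information and playback URL.\n Then use `requests` to fetch the video download URL and save the video locally.\n This project stores its data in MongoDB.\n\n## Project files\n### Option 1 (unfinished)\n1. Write the share page links of the Douyin users to crawl into `share_task.txt`\n2. Run `run.py` directly\n\n### Option 2\n1. Write the share page links of the Douyin users to crawl into `share_task.txt`\n2. Run `handle_share.py` to fetch the basic information (like count, following count, follower count) of the Douyin users listed in `share_task.txt`\n3. Run `video_list_url.py` to fetch each user's video list information: like count, following count, share count, comment count\n4. 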
Run `video_download.py` to download all watermark-free videos of the specified users into the `video` folder\n\n\n## Screenshots\n![https://github.com/Ziazan/douyin_web/blob/master/doc/img/user_info.png](https://github.com/Ziazan/douyin_web/blob/master/doc/img/user_info.png)\n\n![https://github.com/Ziazan/douyin_web/blob/master/doc/img/download_video.png](https://github.com/Ziazan/douyin_web/blob/master/doc/img/download_video.png)\n\n![https://github.com/Ziazan/douyin_web/blob/master/doc/img/video_lsit.png](https://github.com/Ziazan/douyin_web/blob/master/doc/img/video_lsit.png)\n\n\n## Analysis of the signature used to fetch video links\nSee [signature分析.md](https://github.com/Ziazan/douyin_web/blob/master/doc/signature%E5%88%86%E6%9E%90.md)\n\n## Errors encountered\nQ: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home\n\nA:\nUse ChromeDriverManager to download the ChromeDriver that matches the installed Chrome version\n```\npip install webdriver-manager\n```\n```python\nfrom selenium import webdriver\nfrom webdriver_manager.chrome import ChromeDriverManager\n\n# install() downloads a matching driver and returns its path (Selenium 3 call style; Selenium 4 takes service=Service(path))\ndriver = webdriver.Chrome(ChromeDriverManager().install())\n\n```\n\nReference: [https://stackoverflow.com/questions/29858752/error-message-chromedriver-executable-needs-to-be-available-in-the-path](https://stackoverflow.com/questions/29858752/error-message-chromedriver-executable-needs-to-be-available-in-the-path)\n\nQ: `signature.html` never yields a correct video list API URL \n![https://github.com/Ziazan/douyin_web/blob/master/doc/img/error1.png](https://github.com/Ziazan/douyin_web/blob/master/doc/img/error1.png)\n\nA: The signature generated under selenium differs from the one generated in a normally opened browser. This is possibly because 
the JS code checks which browser is running.\n[How to get around websites blocking selenium](https://blog.csdn.net/clf63082/article/details/100223126?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromBaidu-2\u0026depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromBaidu-2)\n\nMethod 1: Considered removing the relevant keywords from the JS, but the JS code is all obfuscated and encrypted. **Not feasible**\n\nMethod 2: Disguise the selenium browser as a real browser. The result still differs from the real signature. **Not feasible**\n\n[How to correctly remove the value of window.navigator.webdriver in Selenium](https://cloud.tencent.com/developer/article/1397806)\n\n```python\nfrom selenium.webdriver import Chrome\nfrom selenium.webdriver import ChromeOptions\n\noption = ChromeOptions()\noption.add_experimental_option('excludeSwitches', ['enable-automation'])\ndriver = Chrome(options=option)\n```\nSignatures generated with this method:\nResult generated in selenium:\nQ94FRRAeHXEwKA8qaryWr0PeAV\n\nResult in a normal browser:\nQ94FRRAeHXEwKA8qaryWr0PeBV\n\nNow only the second-to-last character differs, and it is off by one.\nSee the `get_video_list_url()` method in `video_list_url.py`\n\nQ: urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=62785): Max retries exceeded with url: /session/8a9ff6e4be66e9833b0a16750c5fe67e/url (Caused by NewConnectionError('\u003curllib3.connection.HTTPConnection object at 0x110acd7c0\u003e: Failed to establish a new connection: [Errno 61] Connection refused'))\nA: ...It looks like the IP has been blocked. Time to find a proxy.\n\nQ: Error when configuring a proxy for `requests`: urllib3.exceptions.ProxySchemeUnknown: Not supported proxy scheme None\nA:\n```python\nproxies = {\n    \"http\": 'http://' + ip_list.get_http_ip(),\n    \"https\": 'https://' + ip_list.get_https_ip(),\n}\n```\nThe format must be http:// + ip + :port\n\nQ: Message: 'chromedriver' executable needs to be in PATH\nA:\n[Fix for Windows](https://blog.csdn.net/su_2018/article/details/100127223)\n[Fix for Mac](https://blog.csdn.net/tymatlab/article/details/78649727)\n\n## References\n1. [xpath helper plugin](https://blog.csdn.net/love666666shen/article/details/72613143)\n2. [Online font editor](https://kekee000.github.io/fonteditor/)\n3. [How a Python crawler gets the URL after a redirect](https://blog.csdn.net/lclfeng/article/details/88647616)\n4. 
[Resolving the real address of watermark-free Douyin videos (2020)](https://blog.csdn.net/qq_36737934/article/details/104127835)\n5. [Python selenium: emulating mobile mode in the Chrome browser](https://www.cnblogs.com/yiwenrong/p/12664414.html)\n6. [(Latest version) How to correctly remove window.navigator.webdriver in Selenium](https://cloud.tencent.com/developer/article/1598082)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FZiazan%2Fdouyin_web","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FZiazan%2Fdouyin_web","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FZiazan%2Fdouyin_web/lists"}