Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/Cynthrial/butian_urls
Domain list of vendors on the Butian public-welfare SRC
- Host: GitHub
- URL: https://github.com/Cynthrial/butian_urls
- Owner: Cynthrial
- Created: 2020-06-22T05:44:57.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2022-12-08T10:47:13.000Z (about 2 years ago)
- Last Synced: 2024-08-05T17:35:45.755Z (5 months ago)
- Language: Python
- Size: 8.27 MB
- Stars: 8
- Watchers: 1
- Forks: 5
- Open Issues: 3
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome-hacking-lists - Cynthrial/butian_urls - Domain list of vendors on the Butian public-welfare SRC (Python)
README
### butian_urls
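A list of vendors on the Butian public-welfare SRC (vendor name plus domain or IP), crawled on 2020-06-19.
The main problem along the way was that Butian seems to have rolled out a new anti-crawler policy: when requests arrive too quickly, the server replies with a piece of obfuscated JS code that the client is expected to execute, sending back the result.
The idea is roughly that a browser will naturally run the JS and send the verification data, whereas ordinary code that just reads the server's response cannot interpret and execute JS automatically, which lets the server tell browsers apart from crawler code. A workaround can be found online: https://blog.csdn.net/qq_36783371/article/details/90760914
Of course, you can also just `time.sleep(xxx)` between requests.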
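
The `time.sleep` approach can be sketched roughly as follows. This is only an illustration, not the script in this repository: `crawl`, `looks_like_js_challenge`, the `fetch` callback and the default delay are all hypothetical names and values, and the challenge check is a guessed heuristic, since the exact shape of Butian's JS response is not documented here.

```python
import time
from typing import Callable, Dict, Iterable


def looks_like_js_challenge(body: str) -> bool:
    """Hypothetical heuristic: the reply is a bare script with no normal page body."""
    text = body.strip().lower()
    return text.startswith("<script") and "<body" not in text


def crawl(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay: float = 2.0,
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch each URL with a pause before every request.

    Items that raise (timeouts, connection errors) are skipped. If a reply
    looks like a JS challenge, the wait is doubled and the URL is retried.
    """
    results: Dict[str, str] = {}
    for url in urls:
        wait = delay
        for _ in range(max_retries):
            sleep(wait)  # throttle so the server is less likely to challenge us
            try:
                body = fetch(url)
            except Exception:
                break  # timeout or other error: leave this item out
            if not looks_like_js_challenge(body):
                results[url] = body
                break
            wait *= 2  # challenged: slow down before trying again
    return results
```

With the `requests` library, `fetch` could be something like `lambda u: requests.get(u, timeout=10).text`. Passing `fetch` and `sleep` in as parameters keeps the loop easy to try out without touching the network.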
After dropping the items that timed out or raised errors, the result set contains 4919 entries in total. A sample:
![Data sample](./数据样例.png)