Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/sangaline/advanced-web-scraping-tutorial
The Zipru scraper developed in the Advanced Web Scraping Tutorial.
- Host: GitHub
- URL: https://github.com/sangaline/advanced-web-scraping-tutorial
- Owner: sangaline
- Created: 2017-03-16T11:54:59.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2017-03-19T14:36:31.000Z (over 7 years ago)
- Last Synced: 2024-11-07T05:02:50.092Z (about 1 month ago)
- Topics: python, scraper, scrapy, tutorial-code
- Language: Python
- Size: 9.77 KB
- Stars: 426
- Watchers: 21
- Forks: 96
- Open Issues: 2
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome-security-collection - **359** stars
- awesome-hacking-lists - sangaline/advanced-web-scraping-tutorial - The Zipru scraper developed in the Advanced Web Scraping Tutorial. (Python)
README
# Advanced Web Scraping Tutorial Project
*This repository is a companion to the article [Advanced Web Scraping: Bypassing captcha, "403 Forbidden," and more](http://sangaline.com/post/advanced-web-scraping-tutorial).
Please refer to the article for further details.*

This is a [scrapy](https://scrapy.org/) web scraper for the fictional Zipru torrent site.
It is designed to bypass four distinct anti-scraping mechanisms:

1. User agent filtering.
2. Obfuscated javascript redirects.
3. Captchas.
4. Header consistency checks.

The scraper is not actually functional because Zipru is not a real site.
The code, however, is otherwise complete and can easily be adapted to work on other sites.
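
As a rough illustration of the first and fourth mechanisms listed above (user agent filtering and header consistency checks), the following is a minimal sketch of a Scrapy `settings.py` fragment. It is not taken from this repository. `USER_AGENT` and `DEFAULT_REQUEST_HEADERS` are standard Scrapy settings, but the specific user agent string and header values are illustrative assumptions and would need to match whatever browser you intend to imitate.

```python
# settings.py (sketch): assumed values, not the repository's actual configuration.

# Present a browser-like user agent instead of Scrapy's default one,
# which simple user agent filters tend to reject.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"
)

# Send the other headers a browser with that user agent would plausibly send,
# so that the request looks consistent as a whole.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.8",
    "Upgrade-Insecure-Requests": "1",
}
```

Obfuscated redirects and captchas cannot be handled through settings alone; they would typically need custom request-handling code, which the linked article covers.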