https://github.com/sangaline/advanced-web-scraping-tutorial

The Zipru scraper developed in the Advanced Web Scraping Tutorial.

python scraper scrapy tutorial-code

Last synced: 2 months ago

Host: GitHub
URL: https://github.com/sangaline/advanced-web-scraping-tutorial
Owner: sangaline
Created: 2017-03-16T11:54:59.000Z (about 8 years ago)
Default Branch: master
Last Pushed: 2017-03-19T14:36:31.000Z (about 8 years ago)
Last Synced: 2025-03-31T05:07:23.000Z (2 months ago)
Topics: python, scraper, scrapy, tutorial-code
Language: Python
Size: 9.77 KB
Stars: 430
Watchers: 20
Forks: 95
Open Issues: 2
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

awesome-security-collection - **359** stars
awesome-hacking-lists - sangaline/advanced-web-scraping-tutorial - The Zipru scraper developed in the Advanced Web Scraping Tutorial. (Python)

README

# Advanced Web Scraping Tutorial Project

*This repository is a companion to the article [Advanced Web Scraping: Bypassing captcha, "403 Forbidden," and more](http://sangaline.com/post/advanced-web-scraping-tutorial).
Please refer to the article for further details.*

This is a [Scrapy](https://scrapy.org/) web scraper for the fictional Zipru torrent site.
It is designed to bypass four distinct anti-scraping mechanisms:

1. User agent filtering.
2. Obfuscated JavaScript redirects.
3. Captchas.
4. Header consistency checks.

The scraper is not actually functional because Zipru is not a real site.
The code, however, is otherwise complete and can easily be adapted to work on other sites.
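
As a rough illustration of where such an adaptation might start, the first and fourth mechanisms (user agent filtering and header consistency checks) are usually addressed through project settings. The fragment below is a minimal sketch of a Scrapy `settings.py`, not the code from this repository: `USER_AGENT` and `DEFAULT_REQUEST_HEADERS` are standard Scrapy settings, but the specific browser string and header values are assumptions chosen for the example.

```python
# settings.py (illustrative sketch; the values below are assumptions,
# not taken from this repository)

# Replace Scrapy's default "Scrapy/x.y (+https://scrapy.org)" identifier,
# which simple user agent filters tend to reject.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)

# Headers sent with every request unless a request overrides them.
# Keeping them in line with what the claimed browser would send helps
# with header consistency checks.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}
```

Obfuscated redirects and captchas need request-level logic rather than settings alone; the companion article covers those cases.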
