Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/orf/cyborg
Python web scraping framework
- Host: GitHub
- URL: https://github.com/orf/cyborg
- Owner: orf
- Created: 2015-06-01T13:36:06.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2017-11-12T01:02:01.000Z (almost 7 years ago)
- Last Synced: 2024-07-24T14:01:17.333Z (2 months ago)
- Language: Python
- Size: 6.84 KB
- Stars: 315
- Watchers: 40
- Forks: 66
- Open Issues: 2
Metadata Files:
- Readme: readme.md
# Cyborg
[![](https://travis-ci.org/orf/cyborg.svg)](https://travis-ci.org/orf/cyborg)
Cyborg is an asyncio Python 3 web scraping framework that helps you write programs to extract information
from websites by reading and inspecting their HTML.

## What?
Scraping websites for data can be fairly complex when you are dealing with data across multiple pages, request limits
and error handling. Cyborg aims to handle all of this for you transparently, so that you can focus on the actual
extraction of data rather than all the stuff around it. It does this by helping you break the process down into
smaller chunks, which can be combined into a Pipeline. For example, below is a Pipeline that scrapes takeaway
reviews from Just-Eat (the complete example can be found in examples/just-eat):

    import string
    import sys

    # Job, unique, scrape_places and scrape_reviews are imported or defined in the full example.
    with open("output.json", "w") as output_fd:
        pipeline = Job("ReviewScraper") | scrape_places | unique("id") | scrape_reviews.parallel(5)
        pipeline < string.ascii_lowercase  # feed the letters a-z in as input
        pipeline > output_fd               # write the results to output.json

        pipeline.monitor() > sys.stdout    # report progress on stdout
        pipeline.run_until_complete()

The pipeline has several stages:
1. `scrape_places`
   - This scrapes the list of takeaways from a particular area. The area is found by the first letter of the postcode, so we brute-force this by inputting a-z (`pipeline < string.ascii_lowercase`).
2. `unique('id')`
   - Takeaways may serve more than one area, so this filters out any duplicate takeaways based on their ID.
3. `scrape_reviews.parallel(5)`
   - This starts 5 parallel tasks to scrape the reviews from a particular takeaway.
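
To make the three stages more concrete, here is a rough sketch of the same flow written with plain `asyncio` and no Cyborg code at all. It is not how Cyborg is implemented: `fake_scrape_places`, `fake_scrape_reviews`, `unique_by` and `run_parallel` are hypothetical names invented for this illustration, and the "scrapers" return made-up records instead of touching the network.

```python
import asyncio
import string


async def fake_scrape_places(letter):
    # Hypothetical stand-in for scrape_places: no network access, just
    # made-up records. Adjacent letters share one takeaway, so the
    # combined list contains duplicates on purpose.
    await asyncio.sleep(0)
    idx = string.ascii_lowercase.index(letter)
    return [{"id": idx}, {"id": idx + 1}]


async def fake_scrape_reviews(place):
    # Hypothetical stand-in for scrape_reviews.
    await asyncio.sleep(0)
    return {"id": place["id"], "reviews": []}


def unique_by(items, key):
    # Keep only the first item seen for each value of item[key].
    seen = set()
    for item in items:
        if item[key] not in seen:
            seen.add(item[key])
            yield item


async def run_parallel(func, items, workers):
    # Run func over items with at most `workers` calls in flight at once.
    queue = asyncio.Queue()
    for item in items:
        queue.put_nowait(item)
    results = []

    async def worker():
        # No await happens between empty() and get_nowait(), so another
        # worker cannot drain the queue in between.
        while not queue.empty():
            item = queue.get_nowait()
            results.append(await func(item))

    await asyncio.gather(*(worker() for _ in range(workers)))
    return results


async def main(letters=string.ascii_lowercase):
    # Stage 1: collect places for every starting letter.
    places = []
    for letter in letters:
        places.extend(await fake_scrape_places(letter))
    # Stage 2: drop places that were returned for more than one letter.
    deduped = list(unique_by(places, "id"))
    # Stage 3: fetch reviews with five concurrent workers.
    return await run_parallel(fake_scrape_reviews, deduped, workers=5)


if __name__ == "__main__":
    reviews = asyncio.run(main())
    print(len(reviews))
```

The point of the sketch is the amount of plumbing involved (queues, worker tasks, duplicate tracking) that the one-line pipeline above is meant to hide.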