https://github.com/itielshwartz/python-station-backend
A full pipeline for downloading, cleaning and enriching the history of planetpython.org
backend beautifulsoup pipeline praw python python-station
- Host: GitHub
- URL: https://github.com/itielshwartz/python-station-backend
- Owner: itielshwartz
- Created: 2017-07-08T10:05:51.000Z (almost 9 years ago)
- Default Branch: master
- Last Pushed: 2017-09-05T18:49:59.000Z (almost 9 years ago)
- Last Synced: 2025-01-31T22:11:49.577Z (over 1 year ago)
- Topics: backend, beautifulsoup, pipeline, praw, python, python-station
- Language: Python
- Homepage: http://python-station.etlsh.com
- Size: 4.88 KB
- Stars: 4
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
# Python station backend
# About
* The backend behind [python-station].
* A full data pipeline that downloads, cleans, and enriches the history of planetpython.org.
* Output: every GitHub (Python) project featured in the history of planetpython.org.
* Also includes data enrichment using the GitHub, Reddit, and Hacker News APIs.
## How does it work?
1. Download the pages from the planetpython.org clone
2. Use [BeautifulSoup] to transform each raw page into posts
3. Use [Github API] to get basic project data (and filter out non-Python projects)
4. Use [Praw] (Reddit) + [HN Api] + [Github Trending] to enrich the data
5. Show data using [Github pages + Vue.js]
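
Steps 2 and 3 above can be sketched roughly as follows. This is not the code from this repository: it swaps BeautifulSoup for the standard library's `html.parser` so it runs without extra dependencies, and `LinkCollector`, `github_slugs`, and `is_python_project` are hypothetical names. The only thing assumed about the GitHub API is that a repository object carries a `language` field.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect every href found in <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def github_slugs(html):
    """Return unique 'owner/repo' slugs for GitHub links, in first-seen order."""
    collector = LinkCollector()
    collector.feed(html)
    slugs = []
    for href in collector.hrefs:
        parsed = urlparse(href)
        if parsed.netloc.lower() not in ("github.com", "www.github.com"):
            continue
        # Keep only the first two path segments: /owner/repo/...
        parts = [part for part in parsed.path.split("/") if part]
        if len(parts) >= 2:
            slug = f"{parts[0]}/{parts[1]}"
            if slug not in slugs:
                slugs.append(slug)
    return slugs


def is_python_project(repo):
    """`repo` is a dict shaped like a GitHub REST API repository object."""
    return repo.get("language") == "Python"
```

Non-repository paths on github.com that happen to have two segments would also be picked up by this sketch, so a real pipeline would still need the API lookup from step 3 to confirm each project.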
# How to run?
- Clone the project
- `python3 -m venv ./venv && source venv/bin/activate && pip install -r requirements.txt`
- `venv/bin/python pipeline.py --pages-to-download 5`
- To download Reddit data, fill in your Reddit credentials in `requests_utils.py`
- If you hit the GitHub API rate limit, fill in your GitHub credentials in `requests_utils.py`
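
For readers wondering how a flag like `--pages-to-download` reaches the code, here is a minimal sketch. It assumes the flag is parsed with the standard `argparse` module, which is a guess about `pipeline.py` rather than something confirmed here; `parse_args` and the default value are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line flags; `argv=None` falls back to sys.argv."""
    parser = argparse.ArgumentParser(
        description="Run the scraping pipeline (illustrative sketch)."
    )
    parser.add_argument(
        "--pages-to-download",
        type=int,
        default=5,  # assumed default, not taken from the repository
        help="number of planetpython.org pages to fetch",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    # argparse exposes the flag with dashes turned into underscores.
    print(args.pages_to_download)
```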
# Pipeline Flow chart
```
+---------------------+
|   Download Pages    |
+----------+----------+
           |
+----------v----------+
| Transform to Posts  |
+----------+----------+
           |
+----------v----------+
|  Extract Projects   |
+----------+----------+
           |
+----------v----------+
|  Enrich Using APIs  |
+----------+----------+
           |
+----------v----------+
| Deploy Using GitHub |
|        Pages        |
+---------------------+
```
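
One way to read the chart is as a chain of functions where each stage consumes the previous stage's output. The sketch below is only an illustration of that shape: every stage is a hypothetical stand-in that does no network access, and ending with a JSON string is an assumption about what a static site would consume.

```python
import json
from functools import reduce


def run_pipeline(stages, initial):
    """Feed `initial` through each stage in order and return the final result."""
    return reduce(lambda data, stage: stage(data), stages, initial)


# Hypothetical stand-ins for the five boxes in the chart.
def download_pages(count):
    return [f"<html>page {i}</html>" for i in range(count)]


def transform_to_posts(pages):
    return [{"html": page} for page in pages]


def extract_projects(posts):
    return [{"source_post": index} for index, _ in enumerate(posts)]


def enrich(projects):
    # A real stage would call the GitHub, Reddit, and Hacker News APIs here.
    return [dict(project, enriched=True) for project in projects]


def deploy(projects):
    # Serialize to JSON, the kind of file a static front end could load.
    return json.dumps(projects)


STAGES = [download_pages, transform_to_posts, extract_projects, enrich, deploy]
```

Calling `run_pipeline(STAGES, 5)` would then mirror `--pages-to-download 5`, with each real stage replacing its stub.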
### Development
Want to contribute? Great!
Feel free to open a PR or an issue :)
License
----
MIT - **Free Software, Hell Yeah!**
[//]: #URLs
[python-station]:
[nginx]:
[BeautifulSoup]:
[Github API]:
[Praw]:
[HN Api]:
[Github Trending]:
[Github pages + Vue.js]: https://github.com/itielshwartz/python-station-website