https://github.com/itielshwartz/python-station-backend

A full pipeline for downloading, cleaning and enriching the history of planetpython.org

backend beautifulsoup pipeline praw python python-station


# Python station backend
# About
* The backend behind [python-station]

* Full data pipeline to scrape the history of planetpython.org

* Output: every GitHub (Python) project featured in the history of planetpython.org.

* Also includes data enrichment using the GitHub, Reddit, and Hacker News APIs.

## How does it work?
1. Download the pages from the planetpython.org clone

2. Use [BeautifulSoup] to transform each raw page into posts

3. Use [Github API] to get basic project data (and filter out non-Python projects)

4. Use [Praw] (Reddit) + [HN Api] + [Github Trending] to enrich data

5. Show data using [Github pages + Vue.js]
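
To make the step between "posts" and "projects" more concrete, here is a minimal sketch of how GitHub repositories could be pulled out of a post's HTML. It is not the code from this repo: the names `LinkCollector`, `extract_github_projects` and `GITHUB_REPO` are hypothetical, and it uses the standard library `html.parser` in place of BeautifulSoup so that it runs without extra dependencies.

```python
import re
from html.parser import HTMLParser

# Hypothetical pattern: matches links that point at a repository root,
# i.e. https://github.com/<owner>/<repo>, optionally ending in ".git" or "/".
GITHUB_REPO = re.compile(
    r"^https?://github\.com/([\w.-]+)/([\w.-]+?)(?:\.git)?/?$"
)


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_github_projects(post_html):
    """Return unique "owner/repo" names linked from a post, in order of appearance."""
    collector = LinkCollector()
    collector.feed(post_html)
    projects = []
    for link in collector.links:
        match = GITHUB_REPO.match(link)
        if match:
            full_name = f"{match.group(1)}/{match.group(2)}"
            if full_name not in projects:
                projects.append(full_name)
    return projects
```

Links to issues, files, or user profiles are skipped on purpose, since only repository roots identify a project that can then be looked up and enriched.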

# How to run?
- Clone the project
- `python3 -m venv ./venv && source venv/bin/activate && pip install -r requirements.txt`
- `venv/bin/python pipeline.py --pages-to-download 5`
- To download Reddit data, fill in your Reddit credentials in `requests_utils.py`
- If you hit GitHub's API rate limit, fill in your GitHub credentials in `requests_utils.py`
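
For readers curious how a flag such as `--pages-to-download` is typically handled, here is a minimal sketch using `argparse`. It is an assumption that `pipeline.py` parses its arguments this way; the helper names `positive_int` and `build_parser` and the default of 5 are hypothetical.

```python
import argparse


def positive_int(value):
    """Convert a CLI string to an int and reject zero or negative values."""
    number = int(value)
    if number < 1:
        raise argparse.ArgumentTypeError("must be at least 1")
    return number


def build_parser():
    parser = argparse.ArgumentParser(
        description="Download and process pages (illustrative sketch)."
    )
    parser.add_argument(
        "--pages-to-download",
        type=positive_int,
        default=5,  # assumed default, not taken from the project
        help="how many pages to fetch",
    )
    return parser


# argparse turns the dashes in the flag name into underscores.
args = build_parser().parse_args(["--pages-to-download", "5"])
print(args.pages_to_download)
```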

# Pipeline Flow chart
```
+-------------------+
|  Download Pages   |
+---------+---------+
          |
+---------v---------+
|Transform to Posts |
+---------+---------+
          |
+---------v---------+
| Extract projects  |
+---------+---------+
          |
+---------v---------+
| Enrich Using APIs |
+---------+---------+
          |
+---------v----------+
|Deploy Using GitHub |
|       Pages        |
+--------------------+
```
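
The chart above is a linear chain in which each stage consumes the output of the previous one. A minimal sketch of that idea follows; the runner `run_pipeline` and the toy stage functions are hypothetical stand-ins with made-up data shapes, not the actual functions in `pipeline.py`.

```python
from typing import Any, Callable, Sequence

Stage = Callable[[Any], Any]


def run_pipeline(stages: Sequence[Stage], initial: Any) -> Any:
    """Feed the output of each stage into the next, top to bottom."""
    data = initial
    for stage in stages:
        data = stage(data)
    return data


# Toy stand-ins for the first three boxes in the chart (hypothetical).
def download_pages(page_count: int) -> list[str]:
    # A real stage would fetch pages over HTTP.
    return [f"raw page {n}" for n in range(1, page_count + 1)]


def transform_to_posts(pages: list[str]) -> list[dict]:
    # A real stage would parse each page into individual posts.
    return [{"source": page} for page in pages]


def extract_projects(posts: list[dict]) -> list[str]:
    # A real stage would collect the GitHub projects linked in each post.
    return [f"project from {post['source']}" for post in posts]


result = run_pipeline([download_pages, transform_to_posts, extract_projects], 2)
print(result)
```

Keeping every stage as a plain function of its input makes it straightforward to rerun or test a single box of the chart in isolation.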

### Development

Want to contribute? Great!
Feel free to open a PR or an issue :)

License
----

MIT - **Free Software, Hell Yeah!**

[//]: #URLs

[python-station]:
[nginx]:
[BeautifulSoup]:
[Github API]:
[Praw]:
[HN Api]:
[Github Trending]:
[Github pages + Vue.js]: https://github.com/itielshwartz/python-station-website