https://github.com/shantoroy/medium_blog_post_scrape
A data-scraping project to build a dataset from Medium.com blog posts
- Host: GitHub
- URL: https://github.com/shantoroy/medium_blog_post_scrape
- Owner: shantoroy
- Created: 2022-04-20T17:52:34.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2022-05-02T15:46:49.000Z (about 4 years ago)
- Last Synced: 2025-07-07T08:08:16.648Z (12 months ago)
- Topics: data-analytics, data-scraping, dataset-generation, medium-article, python, scraping-websites, selenium, webdriver
- Language: Jupyter Notebook
- Homepage:
- Size: 7.15 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: Readme.md
README
# Medium Post Scrape for Blog Dataset (Under Development)
## Initial Setup
* Download `chromedriver` (the version must match your installed Chrome browser).
* Install Anaconda or Miniconda (recommended).
* Create and activate a Python virtual environment (Python version above 3.6).
* Run `pip install -r requirements.txt`.
## Medium Post Analysis
### Download Blog Posts
First, add the tags to scrape and their related file names. Then run the following.
```Bash
$ cd scrapping/scripts/v.Apr.2022/
$ python main.py
```
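The README does not show how tags and file names are configured, so the following is only a sketch under assumptions. The `TAG_FILES` dictionary, the file names, and both helper functions are hypothetical and are not taken from `main.py`. The only outside fact relied on is that Medium serves tag pages at `https://medium.com/tag/<tag>`.

```python
from urllib.parse import quote

# Hypothetical mapping of Medium tags to output files (not taken from the repo).
TAG_FILES = {
    "data-science": "data_science_posts.json",
    "machine learning": "machine_learning_posts.json",
}


def normalize_tag(tag: str) -> str:
    """Lower-case a tag and join its words with hyphens."""
    return "-".join(tag.strip().lower().split())


def tag_url(tag: str) -> str:
    """Return the public Medium tag page URL for a tag."""
    return "https://medium.com/tag/" + quote(normalize_tag(tag))


if __name__ == "__main__":
    for tag, filename in TAG_FILES.items():
        print(f"{tag_url(tag)} -> {filename}")
```

Keeping the tag list in one mapping like this makes it easy to add a tag without touching the scraping logic.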
### Integrate all posts
```Bash
$ cd dataset_building
$ python all_posts_integration.py
```
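As a rough illustration of what an integration step can look like, the sketch below merges every JSON file in a folder into one list. It assumes each scraped file holds a JSON list of post objects; the function name `integrate_posts` and the folder layout are hypothetical and may not match `all_posts_integration.py`.

```python
import json
from pathlib import Path


def integrate_posts(input_dir: str, output_file: str) -> int:
    """Merge the post lists from every *.json file in input_dir into output_file.

    Returns the total number of posts written.
    """
    output_path = Path(output_file).resolve()
    all_posts = []
    for path in sorted(Path(input_dir).glob("*.json")):
        # Skip the merged file itself if it lives in the input folder.
        if path.resolve() == output_path:
            continue
        with path.open(encoding="utf-8") as fh:
            # Assumption: each file contains a JSON list of post objects.
            all_posts.extend(json.load(fh))
    with open(output_file, "w", encoding="utf-8") as fh:
        json.dump(all_posts, fh, ensure_ascii=False, indent=2)
    return len(all_posts)
```

Files are read in sorted order so that repeated runs produce the same merged output.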
### Remove Duplicate posts
```Bash
$ cd dataset_building
$ python remove_duplicate_items_json.py
```
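One common way to remove duplicates from a list of JSON objects is to track an identifying field in a set. The sketch below assumes each post carries a `url` field that identifies it; that field name and the function `remove_duplicates` are assumptions, not the contents of `remove_duplicate_items_json.py`.

```python
def remove_duplicates(posts, key="url"):
    """Return posts with repeated `key` values dropped, keeping the first of each."""
    seen = set()
    unique = []
    for post in posts:
        identifier = post.get(key)
        if identifier is None:
            # Without an identifier we cannot tell whether this is a duplicate,
            # so the post is kept.
            unique.append(post)
            continue
        if identifier in seen:
            continue
        seen.add(identifier)
        unique.append(post)
    return unique


if __name__ == "__main__":
    sample = [{"url": "a"}, {"url": "b"}, {"url": "a"}]
    print(len(remove_duplicates(sample)))
```

Because the same post can appear under several tags, duplicates are expected after integration; keeping the first occurrence preserves the original order.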
### Post Analysis
```Bash
$ cd analysis/scripts
$ python post_analysis.py
```
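The README does not say which analyses `post_analysis.py` performs. As one simple example of the kind of summary such a dataset allows, the sketch below counts posts per tag. It assumes each post has a `tags` field holding a list of strings; both the field name and the function `posts_per_tag` are hypothetical.

```python
from collections import Counter


def posts_per_tag(posts):
    """Count how many posts carry each tag."""
    counts = Counter()
    for post in posts:
        # Assumption: "tags" is a list of strings; posts without it add nothing.
        counts.update(post.get("tags", []))
    return counts


if __name__ == "__main__":
    sample = [{"tags": ["python", "ml"]}, {"tags": ["python"]}]
    for tag, count in posts_per_tag(sample).most_common():
        print(tag, count)
```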
## Presentations
* Final Poster -> [Overleaf](https://www.overleaf.com/read/hybfbykbprzx)
* M3 Presentation -> [Drive](https://docs.google.com/presentation/d/1XZrFzwOyDV_hjNPUZVAXu3OJBAryUY2T/edit?usp=sharing&ouid=102574097582335023736&rtpof=true&sd=true)
* Example Data Portion -> [Drive](https://tinyurl.com/MEDAA-Dataset)
## Related Blog Posts
* [Web Scrapping: Finding Necessary Contents from a Medium Dot Com Blog Post](https://shantoroy.com/webscrapping/web-scrap-a-medium-dot-com-blog-post/)
* [Web Scrapping: Clicking the ‘Show More’ Button Multiple times in Medium.com Blog via Selenium](https://shantoroy.com/webscrapping/click-button-show-more-on-medium-dot-com-via-selenium/)
## N.B.
* Do not forget to download the `chromedriver` version that matches your installed Chrome browser.
* Miniconda is recommended because it is very lightweight.
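
The version check in the first note can be sketched as a small helper. It compares only the major version numbers, which is a common rule of thumb rather than a guarantee. The input strings would come from running `chromedriver --version` and the browser's own `--version` command; the function names and sample strings here are illustrative assumptions.

```python
import re


def major_version(version_output: str) -> int:
    """Extract the major version from text such as 'ChromeDriver 100.0.4896.60 (...)'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return int(match.group(1))


def versions_match(chrome_output: str, driver_output: str) -> bool:
    """Return True when the browser and the driver share a major version."""
    return major_version(chrome_output) == major_version(driver_output)


if __name__ == "__main__":
    # Illustrative strings, not captured from a real machine.
    print(versions_match("Google Chrome 100.0.4896.127",
                         "ChromeDriver 100.0.4896.60 (abc123)"))
```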