https://github.com/s1m0n38/medium-scraper

Simple scraper for posts on medium.com

Host: GitHub
URL: https://github.com/s1m0n38/medium-scraper
Owner: S1M0N38
License: mit
Created: 2020-02-12T20:37:55.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2022-11-04T19:44:28.000Z (over 3 years ago)
Last Synced: 2025-07-26T21:01:30.383Z (11 months ago)
Topics: dataset, datasets, kaggle, medium, nlp, scraper, scrapy
Language: Python
Homepage:
Size: 2.37 MB
Stars: 17
Watchers: 1
Forks: 5
Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE


# Medium Scraper

medium-scraper (*MS*) is a scraper built with the [Scrapy](https://scrapy.org/)
framework for scraping [Medium](https://medium.com/) posts.

## :wrench: How it works

*MS* consists of two Scrapy [spiders](https://docs.scrapy.org/en/latest/topics/spiders.html):
**post_id** and **post**.

First, the **post_id** spider looks for *post_id* values (a *post_id* is the
unique identifier of a post) in the sitemap of the Medium website and stores
them in a [SQLite](https://sqlite.org/index.html) database.

Then the second spider (**post**) takes each *post_id* from the database and
performs a request to obtain the data for that post. Data about a post is
divided into two groups, *posts* and *paragraphs*, which are stored in
different tables inside the database.
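
The hand-off between the two spiders can be pictured with a minimal sketch.
It is not the project's actual code: the two-column schema and the helper
names (`store_post_ids`, `pending_post_ids`, `mark_result`) are assumptions
made only for illustration, using Python's standard `sqlite3` module.

```python
import sqlite3

# Assumed minimal schema: only the two columns needed for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post (post_id TEXT PRIMARY KEY, available INTEGER)")


def store_post_ids(conn, post_ids):
    """Stage 1: record newly discovered ids; 'available' stays NULL."""
    conn.executemany(
        "INSERT OR IGNORE INTO post (post_id) VALUES (?)",
        [(pid,) for pid in post_ids],
    )


def pending_post_ids(conn):
    """Stage 2 input: ids that have not been tried yet."""
    rows = conn.execute(
        "SELECT post_id FROM post WHERE available IS NULL ORDER BY post_id"
    )
    return [row[0] for row in rows]


def mark_result(conn, post_id, success):
    """Stage 2 output: 1 when scraping succeeded, 0 when it failed."""
    conn.execute(
        "UPDATE post SET available = ? WHERE post_id = ?",
        (1 if success else 0, post_id),
    )


store_post_ids(conn, ["316d066db3d6", "5edbf9af44af", "fec8331faa9d"])
mark_result(conn, "316d066db3d6", True)
mark_result(conn, "5edbf9af44af", False)
remaining = pending_post_ids(conn)
```

After these calls only `fec8331faa9d` is still pending, which mirrors how a
post that was never attempted keeps a `NULL` status.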

## :books: Database structure

All scraped data is stored in a SQLite file (.db).
To create a new database file, duplicate `example.db` and rename the copy to
`medium.db` (`medium.db` is the default name for the database).
If you want a graphical interface for interacting with the database, I suggest
[DB Browser for SQLite](https://sqlitebrowser.org/).

As mentioned above, the database consists of two tables: *post* and *paragraph*.

### post

| post_id        | available | creator_id     | language | first_published_at | title      | word_count | claps  | tags      |
| :------------: |:---------:| :-------------:| :------: | :----------------: | :--------: | :--------: | :----: | :-------: |
| `316d066db3d6` | `1`       | `245c7224d0ce` | `en`     | `1577865630099`    | `Intro...` | `4231`     | `341`  | `cow,dog` |
| `5edbf9af44af` | `0`       | `NULL`         | `NULL`   | `NULL`             | `NULL`     | `NULL`     | `NULL` | `NULL`    |
| `fec8331faa9d` | `NULL`    | `NULL`         | `NULL`   | `NULL`             | `NULL`     | `NULL`     | `NULL` | `NULL`    |
| `...`          | `...`     | `...`          | `...`    | `...`              | `...`      | `...`      | `...`  | `...`     |

- **post_id**

  - a unique identifier for the post

- **available**

  - `NULL` the post spider has not yet tried to scrape this post_id

  - `1` the post spider successfully scraped this post_id

  - `0` the post spider failed to scrape this post_id

- **creator_id**

  - a unique identifier for the creator of the post

- **language**

  - the language of the content of the post (detected by Medium)

- **first_published_at**

  - timestamp (milliseconds) of the first publication of the post

- **title**

  - the title of the post (not necessarily unique)

- **word_count**

  - the number of words contained in the post content

- **claps**

  - the total number of claps (on medium.com claps == likes)

- **tags**

  - the tags related to the post (comma separated)
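
When reading rows back from the *post* table, two columns usually need
decoding: `first_published_at` and `tags`. The sketch below assumes the
timestamp counts milliseconds since the Unix epoch (consistent with the
sample value in the table); `parse_published` and `parse_tags` are
hypothetical helper names, not part of MS.

```python
from datetime import datetime, timedelta, timezone


def parse_published(ms):
    """Convert a millisecond timestamp to an aware UTC datetime."""
    if ms is None:
        return None
    # Split with integer arithmetic to avoid float rounding.
    seconds, millis = divmod(int(ms), 1000)
    base = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return base + timedelta(milliseconds=millis)


def parse_tags(tags):
    """Split the comma-separated tags column into a list of strings."""
    if not tags:
        return []
    return [tag.strip() for tag in tags.split(",") if tag.strip()]


published = parse_published(1577865630099)
tags = parse_tags("cow,dog")
```

With the sample row, `published` is 2020-01-01 08:00:30.099 UTC and `tags`
is `["cow", "dog"]`. Rows that were never scraped hold `NULL` in both
columns, so the helpers return `None` and an empty list for them.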

### paragraph

| post_id        | index | name   | type  | text                                 |
| :------------: |:-----:|:------:| :----:| :----------------------------------: |
| `316d066db3d6` | `0`   | `6f86` | `3`   | `One important thing productful ...` |
| `316d066db3d6` | `1`   | `eabd` | `1`   | `Quality ≠ Money`                    |
| `3526667dacfb` | `0`   | `94db` | `1`   | `Income in Development ...`          |
| `...`          | `...` | `...`  | `...` | `...`                                |

- **post_id**

  - a unique identifier for the post

- **index**

  - the order of the paragraphs of a post (starting from 0)

- **name**

  - a unique identifier for the paragraph (inside post)

- **type**

  - `1` normal

  - `3` big bold header

  - `6` quote

  - `7` quote bigger and in the center

  - `9` bullet list

  - `10` ordered list

  - `13` small bold header

- **text**

  - the text inside the paragraph

*Information about italic, bold, code and link formatting is stored in the
markup list, which is currently not scraped by MS.*
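
Since each paragraph carries an `index` and a `type`, the text of a post can
be reassembled from its rows. The following is only a sketch: the mapping
from type codes to Markdown prefixes is an illustrative choice, and
`render_post` is a hypothetical helper that expects `(index, type, text)`
tuples for a single post.

```python
# Prefix per paragraph type, following the type codes listed above.
# The Markdown rendering itself is an assumption made for illustration.
PREFIXES = {
    1: "",       # normal
    3: "# ",     # big bold header
    6: "> ",     # quote
    7: "> ",     # bigger, centered quote
    9: "- ",     # bullet list
    10: "1. ",   # ordered list
    13: "## ",   # small bold header
}


def render_post(rows):
    """Join (index, type, text) rows of one post into Markdown text."""
    ordered = sorted(rows, key=lambda row: row[0])
    # Unknown type codes fall back to plain text.
    lines = [PREFIXES.get(ptype, "") + text for _, ptype, text in ordered]
    return "\n\n".join(lines)


rows = [(1, 1, "Quality and money"), (0, 3, "One important thing")]
markdown = render_post(rows)
```

The rows are sorted by `index` first, so they can be passed in any order.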

## :arrow_down: Installation

1. Clone this repo: `git clone https://github.com/S1M0N38/medium-scraper.git`

2. Move inside the cloned repo: `cd medium-scraper`

3. Install dependencies with [pipenv](https://pipenv.readthedocs.io/en/latest/):

   `pipenv install`

4. Enter the virtualenv: `pipenv shell`

5. Check the installation: `scrapy version`

## :zap: Usage

First you need a .db file in which to store the data; see
[Database Structure](https://github.com/S1M0N38/medium-scraper#books-database-structure).
Then make sure you are at the root level of the medium-scraper repo and activate
the virtualenv with `pipenv shell`.

### post_id spider

- **Description** this spider populates the post_id column of the post table

- **Arguments** if no argument is provided, this spider scrapes the whole
  site, starting from the year Medium was founded, and saves all the data in
  the `medium.db` file. With spider arguments (`-a`) you can specify year,
  month and day. With settings arguments (`-s`) you can specify the name of
  the SQLite database. The .db file must already exist
  (e.g. `cp example.db another_database.db`)

- **Examples**

  - `scrapy crawl post_id`

    scrape the post_id of every post on the website (not recommended)

  - `scrapy crawl post_id -a year=2020`

    scrape the post_id of posts published in 2020

  - `scrapy crawl post_id -a year=2020 -a month=01`

    scrape the post_id of posts published in Jan 2020

  - `scrapy crawl post_id -a year=2020 -a month=01 -a day=01`

    scrape the post_id of posts published on 1 Jan 2020

  - `scrapy crawl post_id -a year=2020 -a month=01 -s DB=another_database.db`

    scrape the post_id of posts published in Jan 2020 and save to `another_database.db`
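
To run the spider for many dates from a script, the command lines above can
be generated programmatically. This is a small sketch, not part of MS:
`crawl_post_id_command` is a hypothetical helper, and the checks that `day`
needs `month` and `month` needs `year` are an assumption inferred from the
examples rather than documented behaviour.

```python
def crawl_post_id_command(year=None, month=None, day=None, db=None):
    """Build the argument list for `scrapy crawl post_id`."""
    # Assumed constraint: finer-grained arguments need the coarser ones.
    if day is not None and month is None:
        raise ValueError("day requires month")
    if month is not None and year is None:
        raise ValueError("month requires year")

    cmd = ["scrapy", "crawl", "post_id"]
    if year is not None:
        cmd += ["-a", f"year={year}"]
    if month is not None:
        cmd += ["-a", f"month={month:02d}"]  # zero-padded as in the examples
    if day is not None:
        cmd += ["-a", f"day={day:02d}"]
    if db is not None:
        cmd += ["-s", f"DB={db}"]
    return cmd


command = crawl_post_id_command(year=2020, month=1, db="another_database.db")
```

The resulting list can be passed to `subprocess.run(command, check=True)`
from the root of the repo, inside the pipenv virtualenv.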

### post spider

- **Description** looks in the database for post_id rows whose `available` is
  `NULL` and collects more information, which is saved in the post and
  paragraph tables.

- **Arguments** you can specify which database file to store the data in

- **Examples**

  - `scrapy crawl post`

    scrape post data and save to `medium.db`

  - `scrapy crawl post -s DB=another_database.db`

    scrape post data and save to `another_database.db`
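
Because the post spider records its outcome in the `available` column, the
progress of a crawl can be checked with a single query. The sketch below
assumes only the `post_id` and `available` columns and runs against an
in-memory database; `scrape_progress` is a hypothetical helper name.

```python
import sqlite3


def scrape_progress(conn):
    """Count posts per 'available' state: pending, scraped, failed."""
    labels = {None: "pending", 1: "scraped", 0: "failed"}
    counts = {"pending": 0, "scraped": 0, "failed": 0}
    query = "SELECT available, COUNT(*) FROM post GROUP BY available"
    for available, n in conn.execute(query):
        # Any unexpected value is grouped under "other".
        counts[labels.get(available, "other")] = n
    return counts


# Demo data with an assumed two-column schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post (post_id TEXT PRIMARY KEY, available INTEGER)")
conn.executemany(
    "INSERT INTO post VALUES (?, ?)",
    [("316d066db3d6", 1), ("5edbf9af44af", 0), ("fec8331faa9d", None)],
)
progress = scrape_progress(conn)
```

For a real crawl, replace the in-memory connection with
`sqlite3.connect("medium.db")` (or whichever file was passed with `-s DB=`).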
