{"id":22180544,"url":"https://github.com/s1m0n38/medium-scraper","last_synced_at":"2026-02-26T18:36:33.609Z","repository":{"id":42426756,"uuid":"240110065","full_name":"S1M0N38/medium-scraper","owner":"S1M0N38","description":"Simple scraper for posts on  medium.com","archived":false,"fork":false,"pushed_at":"2022-11-04T19:44:28.000Z","size":2490,"stargazers_count":17,"open_issues_count":6,"forks_count":5,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-07-26T21:01:30.383Z","etag":null,"topics":["dataset","datasets","kaggle","medium","nlp","scraper","scrapy"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/S1M0N38.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-02-12T20:37:55.000Z","updated_at":"2025-03-26T17:25:02.000Z","dependencies_parsed_at":"2023-01-21T00:47:45.280Z","dependency_job_id":null,"html_url":"https://github.com/S1M0N38/medium-scraper","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/S1M0N38/medium-scraper","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/S1M0N38%2Fmedium-scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/S1M0N38%2Fmedium-scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/S1M0N38%2Fmedium-scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/S1M0N38%2Fmedium-scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/S1M0N38","download_url":"https://codeload.github.com/S1M0N38/medium-
scraper/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/S1M0N38%2Fmedium-scraper/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279186950,"owners_count":26122169,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-16T02:00:06.019Z","response_time":53,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataset","datasets","kaggle","medium","nlp","scraper","scrapy"],"created_at":"2024-12-02T09:18:38.089Z","updated_at":"2025-10-16T11:54:40.856Z","avatar_url":"https://github.com/S1M0N38.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Medium Scraper\n\nmedium-scraper (*MS*) is a scraper built with the [Scrapy](https://scrapy.org/)\nframework for scraping [Medium](https://medium.com/) posts.\n\n## :wrench: How it works\n\n*MS* consists of two Scrapy [spiders](https://docs.scrapy.org/en/latest/topics/spiders.html):\n**post_id** and **post**.\n\nFirst, the **post_id** spider looks for *post_id* values (each a unique identifier of a post) inside the\nsitemap of the Medium website and stores them in a\n[SQLite](https://sqlite.org/index.html) database.\n\nThen the second spider (**post**) takes each *post_id* from the database and performs\na request to obtain the specific data for that post. Data about a post can\nbe divided into two groups: *posts* and *paragraphs*. 
These are stored in\ndifferent tables inside the database.\n\n## :books: Database structure\n\nAll scraped data is stored in a SQLite file (.db).\nTo create a new database file, duplicate `example.db` and\nrename the copy to `medium.db` (medium.db is the default name for the database).\nIf you want to use a graphical interface to interact with the database, I suggest\n[DB Browser for SQLite](https://sqlitebrowser.org/).\n\nAs mentioned above, the database consists of two tables: *post* and *paragraph*.\n\n### post\n\n| post_id        | available | creator_id     | language | first_published_at | title      | word_count | claps  | tags      |\n| :------------: |:---------:| :-------------:| :------: | :----------------: | :--------: | :--------: | :----: | :-------: |\n| `316d066db3d6` | `1`       | `245c7224d0ce` | `en`     | `1577865630099`    | `Intro...` | `4231`     | `341`  | `cow,dog` |\n| `5edbf9af44af` | `0`       | `NULL`         | `NULL`   | `NULL`             | `NULL`     | `NULL`     | `NULL` | `NULL`    |\n| `fec8331faa9d` | `NULL`    | `NULL`         | `NULL`   | `NULL`             | `NULL`     | `NULL`     | `NULL` | `NULL`    |\n| `...`          | `...`     | `...`          | `...`    | `...`              | `...`      | `...`      | `...`  | `...`     |\n\n- **post_id**\n  - a unique identifier for the post\n- **available**\n  - `NULL` the post spider has never tried to scrape this post_id\n  - `1` the post spider scraped this post_id successfully\n  - `0` the post spider failed to scrape this post_id\n- **creator_id**\n  - a unique identifier for the creator of the post\n- **language**\n  - the language of the content of the post (detected by Medium)\n- **first_published_at**\n  - timestamp (milliseconds) of the first publication of the post\n- **title**\n  - the title of the post (not necessarily unique)\n- **word_count**\n  - the number of words contained in the post content\n- **claps**\n  - the total number of claps (on medium.com claps == likes)\n- **tags**\n  - the 
tags related to the post (comma separated)\n\n### paragraph\n\n| post_id        | index | name   | type  | text                                 |\n| :------------: |:-----:|:------:| :----:| :----------------------------------: |\n| `316d066db3d6` | `0`   | `6f86` | `3`   | `One important thing productful ...` |\n| `316d066db3d6` | `1`   | `eabd` | `1`   | `Quality ≠ Money`                    |\n| `3526667dacfb` | `0`   | `94db` | `1`   | `Income in Development ...`          |\n| `...`          | `...` | `...`  | `...` | `...`                                |\n\n- **post_id**\n  - a unique identifier for the post\n- **index**\n  - the position of the paragraph within a post (starting from 0)\n- **name**\n  - an identifier for the paragraph (unique within the post)\n- **type**\n  - `1` normal\n  - `3` big bold header\n  - `6` quote\n  - `7` quote bigger and in the center\n  - `9` bullet list\n  - `10` ordered list\n  - `13` small bold header\n- **text**\n  - the text inside the paragraph\n\n*Information about italic, bold, code and links is stored in the markup list,\nwhich is currently not scraped by MS*\n\n## :arrow_down: Installation\n\n1. Clone this repo: `git clone https://github.com/S1M0N38/medium-scraper.git`\n2. Move inside the cloned repo: `cd medium-scraper`\n3. Install dependencies with [pipenv](https://pipenv.readthedocs.io/en/latest/):\n   `pipenv install`\n4. Enter the virtualenv: `pipenv shell`\n5. 
Check the installation: `scrapy version`\n\n## :zap: Usage\n\nFirst you need a .db file in which to store the data; see\n[Database Structure](https://github.com/S1M0N38/medium-scraper#books-database-structure).\nThen make sure you are at the root level of the medium-scraper repo and activate\nthe virtualenv with `pipenv shell`.\n\n### post_id spider\n\n- **Description** This spider populates the post_id column of the post table.\n\n- **Arguments** If no argument is provided, this spider scrapes the whole\n  site, starting from the year Medium was founded, and saves all the data in\n  the `medium.db` file. With spider arguments (`-a`) you can specify the year,\n  month and day. With settings arguments (`-s`) you can specify the name of\n  the SQLite database. You have to create the .db file beforehand\n  (e.g. `cp example.db another_database.db`).\n\n- **Examples**\n  - `scrapy crawl post_id`\n    scrape the post_id of every post on the website (not recommended)\n  - `scrapy crawl post_id -a year=2020`\n    scrape the post_id of posts published in 2020\n  - `scrapy crawl post_id -a year=2020 -a month=01`\n    scrape the post_id of posts published in Jan 2020\n  - `scrapy crawl post_id -a year=2020 -a month=01 -a day=01`\n    scrape the post_id of posts published on 1st of Jan 2020\n  - `scrapy crawl post_id -a year=2020 -a month=01 -s DB=another_database.db`\n    scrape the post_id of posts published in Jan 2020 and save them to another_database.db\n\n### post spider\n\n- **Description** Looks in the database for every post_id whose `available` value is `NULL` and\n  collects more information, which is saved in the post and paragraph tables.\n\n- **Arguments** You can specify the database in which to store the data.\n\n- **Examples**\n  - `scrapy crawl post`\n    scrape post data and save it to `medium.db`\n  - `scrapy crawl post -s DB=another_database.db`\n    scrape post data and save it to 
`another_database.db`\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fs1m0n38%2Fmedium-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fs1m0n38%2Fmedium-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fs1m0n38%2Fmedium-scraper/lists"}