Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/vaibhavyadav-dev/codeforces-problemset-scrapper
Web scraper that scrapes the whole Codeforces problemset into a CSV or JSON file.
codeforces competative competative-programming crawler problemset programming python scrapy-crawler scrapy-spider
- Host: GitHub
- URL: https://github.com/vaibhavyadav-dev/codeforces-problemset-scrapper
- Owner: vaibhavyadav-dev
- Created: 2024-07-23T12:04:03.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2024-08-03T06:38:07.000Z (4 months ago)
- Last Synced: 2024-10-09T17:43:18.218Z (about 1 month ago)
- Topics: codeforces, competative, competative-programming, crawler, problemset, programming, python, scrapy-crawler, scrapy-spider
- Language: Python
- Homepage:
- Size: 805 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
## Codeforces Problemset Scrapper
This is a Scrapy-based web scraper designed to extract detailed information about Codeforces problems. The Codeforces API lacks certain details, such as URLs, the number of people who have solved each problem, and problem ratings. This scraper fills that gap by extracting the data needed for better problem analysis.

### Insights from the data
The following tags (topics) appear in the problemset, each with the number of problems that carry it.
```
{
"2-sat": 34,
"bitmasks": 559,
"data structures": 1693,
"dp": 2074,
"graphs": 1045,
"matrices": 119,
"two pointers": 528,
"strings": 714,
"brute force": 1638,
"constructive algorithms": 1706,
"greedy": 2782,
"implementation": 2681,
"math": 2817,
"number theory": 722,
"binary search": 1030,
"sortings": 1054,
"combinatorics": 659,
"games": 220,
"hashing": 206,
"interactive": 222,
"dfs and similar": 914,
"trees": 803,
"dsu": 346,
"divide and conquer": 283,
"fft": 92,
"geometry": 387,
"string suffix structures": 90,
"probabilities": 231,
"meet-in-the-middle": 49,
"ternary search": 53,
"shortest paths": 262,
"flows": 143,
"*special problem": 437,
"graph matchings": 89,
"schedules": 10,
"expression parsing": 36,
"chinese remainder theorem": 16
}
Total problems:- 9678
Total types of problems:- 26744
```
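The totals above can be reproduced from the scraped records. Note that 26744 is exactly the sum of the per-tag counts, so "Total types of problems" counts tag assignments (one problem can carry several tags) rather than distinct problems. The following is a minimal sketch, not the repository's actual `main.py`; it assumes each scraped record is a dict with a `tags` list, and the field names and the `summarize_tags` helper are hypothetical.

```python
import json
from collections import Counter


def summarize_tags(problems):
    """Count problems per tag and report the two totals shown above.

    Assumes each record looks like {"name": ..., "tags": [...]};
    this schema is an assumption, not taken from the repository.
    """
    tag_counts = Counter()
    for problem in problems:
        # A problem with several tags is counted once under each tag.
        tag_counts.update(problem.get("tags", []))
    return {
        "tag_counts": dict(tag_counts),
        "total_problems": len(problems),
        # Sum of per-tag counts, i.e. the number of (problem, tag) pairs.
        "total_tag_assignments": sum(tag_counts.values()),
    }


if __name__ == "__main__":
    # Tiny illustrative sample; real input would be loaded from the
    # scraper's JSON output with json.load().
    sample = [
        {"name": "A", "tags": ["dp", "math"]},
        {"name": "B", "tags": ["dp"]},
    ]
    print(json.dumps(summarize_tags(sample), indent=2))
```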
Use `findproblems.py` to find problems that match your rating and topics.
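As an illustration of that kind of lookup, here is a minimal sketch of filtering scraped records by rating range and tags. It is not the code in `findproblems.py`; the `find_problems` function and the `rating`, `tags`, and `solved` field names are assumptions made for this example.

```python
def find_problems(problems, min_rating, max_rating, wanted_tags=()):
    """Return problems within a rating range that carry all wanted tags.

    Assumes records like {"name": ..., "rating": int, "tags": [...],
    "solved": int}; this schema is hypothetical.
    """
    wanted = set(wanted_tags)
    matches = []
    for problem in problems:
        rating = problem.get("rating")
        # Skip unrated problems and those outside the requested range.
        if rating is None or not (min_rating <= rating <= max_rating):
            continue
        # Require every wanted tag to be present on the problem.
        if wanted and not wanted.issubset(problem.get("tags", [])):
            continue
        matches.append(problem)
    # Most-solved problems first, as a rough proxy for approachability.
    return sorted(matches, key=lambda p: p.get("solved", 0), reverse=True)


if __name__ == "__main__":
    sample = [
        {"name": "A", "rating": 1200, "tags": ["dp", "math"], "solved": 500},
        {"name": "B", "rating": 1400, "tags": ["dp"], "solved": 900},
        {"name": "C", "rating": 2000, "tags": ["dp"], "solved": 100},
    ]
    for p in find_problems(sample, 1000, 1500, ["dp"]):
        print(p["name"], p["rating"])
```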
The `main.py` file is for analysis of the problemset.

### Implemented Features
- Random headers for requests
- Storage in MySQL database
- Extraction of data from multiple pages
- Data cleaning and pre-processing for the following fields:
  - URL
  - Name
  - Number of people who have solved the problem
  - Problem rating
  - Tags
- Rotating proxy support (you need to provide your own URL; sign up [here](https://proxy2.webshare.io/) for a dashboard)

> [!NOTE]
> - Refer to the `output.json` file for the latest results as of 23rd July 2024.
> - To use this scraper, clone the repository and install the required dependencies using the `requirements.txt` file.
> - Run `scrapy crawl cf -O problems.json` for a JSON file, or `scrapy crawl cf -O problems.csv` for a CSV file.
> - Feel free to contribute additional features by creating pull requests.

> [!WARNING]
> - Be cautious when using this scraper, as Codeforces may block your IP address. Consider using a rotating proxy for each request by adding the line `meta={"proxy":"protocol://yourusername:yourpassword@domainname:port/"}` in the `spiders/cf.py` file, specifically on line 27.

> [!TIP]
>
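
The random-header and proxy settings described above can be combined into the keyword arguments for a request. The sketch below uses only the standard library so it runs without Scrapy installed; `build_request_kwargs`, the `USER_AGENTS` list, and the example proxy URL are hypothetical and are not taken from `spiders/cf.py`.

```python
import random

# Illustrative User-Agent strings; a real pool would be larger.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
]


def build_request_kwargs(url, proxy_url=None, rng=random):
    """Build keyword arguments for a request with a random User-Agent.

    If proxy_url is given, it is placed under meta["proxy"], the key
    Scrapy's built-in HttpProxyMiddleware reads for per-request proxies.
    """
    kwargs = {
        "url": url,
        "headers": {"User-Agent": rng.choice(USER_AGENTS)},
    }
    if proxy_url:
        kwargs["meta"] = {"proxy": proxy_url}
    return kwargs


if __name__ == "__main__":
    # Placeholder credentials and host; substitute your own proxy URL.
    print(build_request_kwargs(
        "https://codeforces.com/problemset/page/1",
        proxy_url="http://yourusername:yourpassword@domainname:8080/",
    ))
```

Inside a spider, the result could be passed along as `scrapy.Request(callback=self.parse, **build_request_kwargs(url, proxy_url))`, so that each request picks a fresh header and goes through the proxy.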