https://github.com/pourmand1376/persiancrawler
Open source crawler for Persian websites.
- Host: GitHub
- URL: https://github.com/pourmand1376/persiancrawler
- Owner: pourmand1376
- License: mit
- Created: 2022-04-20T11:25:29.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2023-08-27T07:22:57.000Z (almost 3 years ago)
- Last Synced: 2025-02-01T16:47:46.176Z (over 1 year ago)
- Topics: crawler, machine-learning, news, python, scrapy, tasnim, text-classification
- Language: Python
- Homepage: https://www.kaggle.com/amirpourmand/datasets
- Size: 39.1 KB
- Stars: 19
- Watchers: 3
- Forks: 4
- Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE.md
README
# Crawler
Open source crawler for Persian websites. Websites crawled so far:
- [Asriran](https://www.kaggle.com/datasets/amirpourmand/asriran-news)
- [Fa-Wikipedia](https://www.kaggle.com/datasets/amirpourmand/fa-wikipedia)
- [Tasnim](https://www.kaggle.com/datasets/amirpourmand/tasnimdataset)
- [Isna](https://www.kaggle.com/datasets/amirpourmand/isna-news)
### Asriran
```bash
asriran/run_asriran.sh
```
> You can change some parameters of this crawler. See `run_asriran.sh`.
### Fa-Wikipedia
Due to some problems in crawling, I split this job into two stages: the first crawls all index pages, and the second uses those index pages as the starting point for the actual crawl.
```bash
wikipedia/run_wikipedia.sh
```
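The two-stage approach described for Fa-Wikipedia can be illustrated with a minimal, standard-library sketch. This is not the code in `wikipedia/`; the names `LinkCollector`, `collect_links`, `crawl_pages`, and the in-memory `FAKE_SITE` are all hypothetical and only stand in for real HTTP requests.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, Iterable, List


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def collect_links(index_urls: Iterable[str], fetch: Callable[[str], str]) -> List[str]:
    """Stage 1: read each index page and gather its links, skipping duplicates."""
    seen = set()
    ordered: List[str] = []
    for url in index_urls:
        parser = LinkCollector()
        parser.feed(fetch(url))
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                ordered.append(link)
    return ordered


def crawl_pages(page_urls: Iterable[str], fetch: Callable[[str], str]) -> Dict[str, str]:
    """Stage 2: download every page discovered in stage 1."""
    return {url: fetch(url) for url in page_urls}


# Hypothetical in-memory site; a real crawler would fetch over HTTP instead.
FAKE_SITE = {
    "/index/1": '<a href="/page/a">A</a><a href="/page/b">B</a>',
    "/index/2": '<a href="/page/b">B</a><a href="/page/c">C</a>',
    "/page/a": "text of A",
    "/page/b": "text of B",
    "/page/c": "text of C",
}

links = collect_links(["/index/1", "/index/2"], FAKE_SITE.__getitem__)
pages = crawl_pages(links, FAKE_SITE.__getitem__)
```

Keeping the stages separate means the list of links from stage 1 can be saved and reused, so a failure in stage 2 does not require walking the index pages again.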
### Tasnim News
This crawler saves [tasnim news](https://www.tasnimnews.com/) pages by category. This makes the dataset appropriate for text classification tasks, as the data is relatively balanced across all categories: I selected an equal number of pages per category.
> The `Number_of_pages` parameter in `tasnim.py` controls how many pages are crawled in each category.
```bash
tasnim/run_tasnim.sh
```
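As a rough illustration of how a single per-category page limit keeps the data balanced, here is a small sketch. It is not the logic in `tasnim.py`: the category names, the URL template, and the helper `build_start_urls` are assumptions made up for this example, and only the idea of a `Number_of_pages` setting comes from the crawler.

```python
from typing import Dict, List

# Plays the role of `Number_of_pages`; the value here is illustrative.
NUMBER_OF_PAGES = 3

# Hypothetical categories and URL template, not taken from tasnim.py.
CATEGORIES = ["politics", "sports", "economy"]
URL_TEMPLATE = "https://example.com/{category}?page={page}"


def build_start_urls(categories: List[str], pages_per_category: int) -> Dict[str, List[str]]:
    """Return the same number of listing-page URLs for every category."""
    return {
        category: [
            URL_TEMPLATE.format(category=category, page=page)
            for page in range(1, pages_per_category + 1)
        ]
        for category in categories
    }


start_urls = build_start_urls(CATEGORIES, NUMBER_OF_PAGES)
```

Because every category receives the same number of listing pages, the resulting dataset ends up with roughly the same number of articles per class.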
Datasets are all available for download at [Kaggle](https://www.kaggle.com/amirpourmand/datasets).
CSS selectors are mostly extracted via [Copy Css Selector](https://chrome.google.com/webstore/detail/copy-css-selector/kemkenbgbgodoglfkkejbdcpojnodnkg?hl=en).
- https://stackoverflow.com/questions/73859249/attributeerror-module-openssl-ssl-has-no-attribute-sslv3-method
- https://stackoverflow.com/a/73867925/4201765