Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/pourmand1376/persiancrawler
Open source crawler for Persian websites.
crawler machine-learning news python scrapy tasnim text-classification
Last synced: 3 months ago
JSON representation
Open source crawler for Persian websites.
- Host: GitHub
- URL: https://github.com/pourmand1376/persiancrawler
- Owner: pourmand1376
- License: mit
- Created: 2022-04-20T11:25:29.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2023-08-27T07:22:57.000Z (over 1 year ago)
- Last Synced: 2024-08-05T09:15:12.193Z (5 months ago)
- Topics: crawler, machine-learning, news, python, scrapy, tasnim, text-classification
- Language: Python
- Homepage: https://www.kaggle.com/amirpourmand/datasets
- Size: 39.1 KB
- Stars: 17
- Watchers: 3
- Forks: 0
- Open Issues: 4
- Metadata Files:
- Readme: README.md
- License: LICENSE.md
Awesome Lists containing this project
README
[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://www.kaggle.com/amirpourmand/datasets)
# Crawler
Open source crawler for Persian websites. Websites crawled so far:
- [Asriran](https://www.kaggle.com/datasets/amirpourmand/asriran-news)
- [Fa-Wikipedia](https://www.kaggle.com/datasets/amirpourmand/fa-wikipedia)
- [Tasnim](https://www.kaggle.com/datasets/amirpourmand/tasnimdataset)
- [Isna](https://www.kaggle.com/datasets/amirpourmand/isna-news)

### Asriran
```bash
asriran/run_asriran.sh
```

> You can change some parameters of this crawler; see `run_asriran.sh`.
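The README does not show what `run_asriran.sh` contains. As a rough sketch of how a Scrapy project like this is typically launched programmatically (the spider name `asriran` and the output file are assumptions, not taken from the repository):

```python
# Hypothetical sketch, not the project's actual script: launch a Scrapy spider
# named "asriran" from Python and write the scraped items to a JSON-lines file.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
settings.set("FEEDS", {"asriran.jsonl": {"format": "jsonlines"}})  # assumed output path

process = CrawlerProcess(settings)
process.crawl("asriran")  # spider name is an assumption
process.start()           # blocks until the crawl finishes
```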
### Fa-Wikipedia
Due to some problems while crawling, I split this job into two stages: first crawl all the index pages, then use those pages to crawl the articles themselves. A sketch of the two-stage idea follows the run command below.
```bash
wikipedia/run_wikipedia.sh
```
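The spiders themselves are not reproduced in the README; the following is only a sketch of the two-stage idea under assumed names (the spider names, the `index_urls.jsonl` file, and the CSS selectors are illustrative, not taken from the project):

```python
# Illustrative two-stage sketch (names, start URL, and selectors are assumptions).
# Stage 1: collect article URLs from index pages, e.g.
#   scrapy crawl fawiki_index -o index_urls.jsonl
# Stage 2: read that file and crawl each article.
import json
import scrapy

class FaWikiIndexSpider(scrapy.Spider):
    name = "fawiki_index"
    start_urls = ["https://fa.wikipedia.org/wiki/Special:AllPages"]

    def parse(self, response):
        # Yield every article link found on the index page.
        for href in response.css("ul.mw-allpages-chunk a::attr(href)").getall():
            yield {"url": response.urljoin(href)}

class FaWikiArticleSpider(scrapy.Spider):
    name = "fawiki_articles"

    def start_requests(self):
        # Use the URLs collected in stage 1 as the crawl frontier.
        with open("index_urls.jsonl", encoding="utf-8") as f:
            for line in f:
                yield scrapy.Request(json.loads(line)["url"], callback=self.parse_article)

    def parse_article(self, response):
        yield {
            "title": response.css("h1#firstHeading ::text").get(),
            "text": " ".join(response.css("div.mw-parser-output p ::text").getall()),
        }
```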
### Tasnim News

This crawler saves [Tasnim News](https://www.tasnimnews.com/) pages by category. It is appropriate for text-classification tasks because the data is relatively balanced: I crawl an equal number of pages per category.

> The `Number_of_pages` parameter in `tasnim.py` controls how many pages are crawled in each category.
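`tasnim.py` itself is not shown in the README, so the following is only a sketch of how a `Number_of_pages`-style limit keeps a crawl balanced across categories (the category slugs, listing-URL pattern, and selectors are assumptions, not the project's actual code):

```python
# Illustrative sketch only: category slugs, URL pattern, and selectors are
# assumptions. It shows how a per-category page limit keeps the dataset balanced.
import scrapy

NUMBER_OF_PAGES = 50  # pages crawled per category (cf. Number_of_pages in tasnim.py)

class TasnimSpider(scrapy.Spider):
    name = "tasnim"
    categories = ["politics", "economy", "sports", "culture"]  # hypothetical slugs

    def start_requests(self):
        # Request the same number of listing pages for every category,
        # which is what keeps the resulting dataset roughly balanced.
        for category in self.categories:
            for page in range(1, NUMBER_OF_PAGES + 1):
                url = f"https://www.tasnimnews.com/fa/service/{category}?page={page}"
                yield scrapy.Request(url, callback=self.parse_listing,
                                     cb_kwargs={"category": category})

    def parse_listing(self, response, category):
        # Follow each article link on the listing page.
        for href in response.css("article a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_article,
                                  cb_kwargs={"category": category})

    def parse_article(self, response, category):
        yield {
            "category": category,
            "title": response.css("h1::text").get(),
            "body": " ".join(response.css("div.story p::text").getall()),
        }
```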
```bash
tasnim/run_tasnim.sh
```

Datasets are all available for download at [Kaggle](https://www.kaggle.com/amirpourmand/datasets).
CSS selectors are mostly extracted via [Copy Css Selector](https://chrome.google.com/webstore/detail/copy-css-selector/kemkenbgbgodoglfkkejbdcpojnodnkg?hl=en).
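For context on how those copied selectors are used: Scrapy's `response.css()` is built on the parsel library, so a selector string pasted from the extension can be tested on its own (the HTML and selector here are illustrative only):

```python
# Illustrative only: test a copied CSS selector with parsel, the library
# behind Scrapy's response.css().
from parsel import Selector

html = "<html><body><h1 class='title'>سرخط خبر</h1></body></html>"
sel = Selector(text=html)
print(sel.css("h1.title::text").get())  # -> سرخط خبر
```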
Troubleshooting references for a related OpenSSL error (`AttributeError: module 'OpenSSL.SSL' has no attribute 'SSLv3_METHOD'`):

- https://stackoverflow.com/questions/73859249/attributeerror-module-openssl-ssl-has-no-attribute-sslv3-method
- https://stackoverflow.com/a/73867925/4201765