Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/pourmand1376/persiancrawler
Open source crawler for Persian websites.
crawler machine-learning news python scrapy tasnim text-classification
Last synced: 3 months ago
JSON representation
Open source crawler for Persian websites.
- Host: GitHub
- URL: https://github.com/pourmand1376/persiancrawler
- Owner: pourmand1376
- License: mit
- Created: 2022-04-20T11:25:29.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2023-08-27T07:22:57.000Z (over 1 year ago)
- Last Synced: 2024-08-05T09:15:12.193Z (5 months ago)
- Topics: crawler, machine-learning, news, python, scrapy, tasnim, text-classification
- Language: Python
- Homepage: https://www.kaggle.com/amirpourmand/datasets
- Size: 39.1 KB
- Stars: 17
- Watchers: 3
- Forks: 0
- Open Issues: 4
- Metadata Files:
- Readme: README.md
- License: LICENSE.md
Awesome Lists containing this project
README
[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://www.kaggle.com/amirpourmand/datasets)
# Crawler
Open source crawler for Persian websites. Websites crawled so far:
- [Asriran](https://www.kaggle.com/datasets/amirpourmand/asriran-news)
- [Fa-Wikipedia](https://www.kaggle.com/datasets/amirpourmand/fa-wikipedia)
- [Tasnim](https://www.kaggle.com/datasets/amirpourmand/tasnimdataset)
- [Isna](https://www.kaggle.com/datasets/amirpourmand/isna-news)

### Asriran
```bash
asriran/run_asriran.sh
```

> You can change some parameters of this crawler; see `run_asriran.sh`.
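The README does not show what `run_asriran.sh` contains. As a rough sketch of how a Scrapy project like this is typically launched programmatically (the spider name `asriran` and the output file are assumptions, not taken from the repository):

```python
# Hypothetical sketch, not the project's actual script: launch a Scrapy spider
# named "asriran" from Python and write the scraped items to a JSON-lines file.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
settings.set("FEEDS", {"asriran.jsonl": {"format": "jsonlines"}})  # assumed output path

process = CrawlerProcess(settings)
process.crawl("asriran")  # spider name is an assumption
process.start()           # blocks until the crawl finishes
```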
### Fa-Wikipedia
Due to some problems while crawling, I split this job into two stages: first crawl all the index pages, then use those pages to crawl the articles themselves. A sketch of the two-stage idea follows the run command below.
```bash
wikipedia/run_wikipedia.sh
```
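The spiders themselves are not reproduced in the README; the following is only a sketch of the two-stage idea under assumed names (the spider names, the `index_urls.jsonl` file, and the CSS selectors are illustrative, not taken from the project):

```python
# Illustrative two-stage sketch (names, start URL, and selectors are assumptions).
# Stage 1: collect article URLs from index pages, e.g.
#   scrapy crawl fawiki_index -o index_urls.jsonl
# Stage 2: read that file and crawl each article.
import json
import scrapy

class FaWikiIndexSpider(scrapy.Spider):
    name = "fawiki_index"
    start_urls = ["https://fa.wikipedia.org/wiki/Special:AllPages"]

    def parse(self, response):
        # Yield every article link found on the index page.
        for href in response.css("ul.mw-allpages-chunk a::attr(href)").getall():
            yield {"url": response.urljoin(href)}

class FaWikiArticleSpider(scrapy.Spider):
    name = "fawiki_articles"

    def start_requests(self):
        # Use the URLs collected in stage 1 as the crawl frontier.
        with open("index_urls.jsonl", encoding="utf-8") as f:
            for line in f:
                yield scrapy.Request(json.loads(line)["url"], callback=self.parse_article)

    def parse_article(self, response):
        yield {
            "title": response.css("h1#firstHeading ::text").get(),
            "text": " ".join(response.css("div.mw-parser-output p ::text").getall()),
        }
```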
### Tasnim News

This crawler saves [Tasnim News](https://www.tasnimnews.com/) pages by category. It is appropriate for text-classification tasks because the data is relatively balanced: I crawl an equal number of pages per category.

> The `Number_of_pages` parameter in `tasnim.py` controls how many pages are crawled in each category.
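`tasnim.py` itself is not shown in the README, so the following is only a sketch of how a `Number_of_pages`-style limit keeps a crawl balanced across categories (the category slugs, listing-URL pattern, and selectors are assumptions, not the project's actual code):

```python
# Illustrative sketch only: category slugs, URL pattern, and selectors are
# assumptions. It shows how a per-category page limit keeps the dataset balanced.
import scrapy

NUMBER_OF_PAGES = 50  # pages crawled per category (cf. Number_of_pages in tasnim.py)

class TasnimSpider(scrapy.Spider):
    name = "tasnim"
    categories = ["politics", "economy", "sports", "culture"]  # hypothetical slugs

    def start_requests(self):
        # Request the same number of listing pages for every category,
        # which is what keeps the resulting dataset roughly balanced.
        for category in self.categories:
            for page in range(1, NUMBER_OF_PAGES + 1):
                url = f"https://www.tasnimnews.com/fa/service/{category}?page={page}"
                yield scrapy.Request(url, callback=self.parse_listing,
                                     cb_kwargs={"category": category})

    def parse_listing(self, response, category):
        # Follow each article link on the listing page.
        for href in response.css("article a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_article,
                                  cb_kwargs={"category": category})

    def parse_article(self, response, category):
        yield {
            "category": category,
            "title": response.css("h1::text").get(),
            "body": " ".join(response.css("div.story p::text").getall()),
        }
```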
```bash
tasnim/run_tasnim.sh
```

Datasets are all available for download at [Kaggle](https://www.kaggle.com/amirpourmand/datasets).
CSS selectors are mostly extracted via [Copy Css Selector](https://chrome.google.com/webstore/detail/copy-css-selector/kemkenbgbgodoglfkkejbdcpojnodnkg?hl=en).
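For context on how those copied selectors are used: Scrapy's `response.css()` is built on the parsel library, so a selector string pasted from the extension can be tested on its own (the HTML and selector here are illustrative only):

```python
# Illustrative only: test a copied CSS selector with parsel, the library
# behind Scrapy's response.css().
from parsel import Selector

html = "<html><body><h1 class='title'>سرخط خبر</h1></body></html>"
sel = Selector(text=html)
print(sel.css("h1.title::text").get())  # -> سرخط خبر
```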
Troubleshooting references for a related OpenSSL error (`AttributeError: module 'OpenSSL.SSL' has no attribute 'SSLv3_METHOD'`):

- https://stackoverflow.com/questions/73859249/attributeerror-module-openssl-ssl-has-no-attribute-sslv3-method
- https://stackoverflow.com/a/73867925/4201765