https://github.com/elsayed-issa/the-moroccan-news-corpus
corpus crawlers news-websites scrapy spider
- Host: GitHub
- URL: https://github.com/elsayed-issa/the-moroccan-news-corpus
- Owner: elsayed-issa
- Created: 2019-11-06T02:25:17.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2019-11-06T19:58:04.000Z (almost 6 years ago)
- Last Synced: 2025-06-19T22:46:06.609Z (5 months ago)
- Topics: corpus, crawlers, news-websites, scrapy, spider
- Language: Python
- Size: 680 KB
- Stars: 1
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# The-Moroccan-News-Corpus
This corpus was built as part of a summer internship. We use Scrapy spiders/crawlers to crawl Moroccan newspaper websites and save the scraped data to JSON or text files. We built spiders/crawlers for the following news websites:
Moroccan News websites
## How to use the spiders/crawlers
Each folder is the Scrapy project folder for one newspaper.
To scrape data from any of the newspapers above, run the following command from inside that newspaper's project folder:
`scrapy crawl <name of the spider> -o <name of the file>.json`
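The same command can be launched from a Python script, which may be convenient when crawling several newspapers in a row. The following is only a minimal sketch: the helper names `build_crawl_command` and `run_crawl` are hypothetical and not part of this repository, and it assumes the `scrapy` executable is on the `PATH`.

```python
import shlex
import subprocess
from pathlib import Path


def build_crawl_command(spider_name: str, output_file: str) -> list[str]:
    """Return the argument list for `scrapy crawl <spider> -o <output file>`."""
    return ["scrapy", "crawl", spider_name, "-o", output_file]


def run_crawl(project_dir: str, spider_name: str, output_file: str) -> int:
    """Run one crawl from inside a newspaper's project folder.

    Returns the exit code of the scrapy process (0 means success).
    """
    command = build_crawl_command(spider_name, output_file)
    print("Running:", shlex.join(command))
    # Scrapy finds the project via scrapy.cfg, so the working directory
    # must be the project folder (or a folder below it).
    completed = subprocess.run(command, cwd=Path(project_dir), check=False)
    return completed.returncode
```

A call such as `run_crawl("some_newspaper_folder", "some_spider", "articles.json")` would then mirror the command above; both the folder name and the spider name here are placeholders to be replaced with the real ones from the repository.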
## Note
Every spider/crawler automatically saves a text file, in addition to the JSON or XML file that you specify on the command line when you run the spider.
About 2 gigabytes of scraped text can be downloaded from this link: https://drive.google.com/open?id=1w2-DTJF2phU3fVf4XkDh1tsN-O3N_baF
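A JSON file written with the `-o <name of the file>.json` option is a single JSON array with one object per scraped item, so it can be inspected with the standard library alone. The sketch below makes no assumption about the field names the spiders emit; it only reports which keys appear. The function names `load_items` and `field_counts` are illustrative, not part of this repository.

```python
import json
from collections import Counter
from pathlib import Path


def load_items(json_path: str) -> list[dict]:
    """Load the list of scraped items from a Scrapy JSON feed file."""
    # Read as UTF-8 so that Arabic and French text is decoded correctly.
    with Path(json_path).open(encoding="utf-8") as handle:
        return json.load(handle)


def field_counts(items: list[dict]) -> Counter:
    """Count how many items contain each field name."""
    counts = Counter()
    for item in items:
        counts.update(item.keys())
    return counts
```

Calling `field_counts(load_items("articles.json"))` gives a quick overview of which fields a given spider populated and how often, before the data is processed further.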