https://github.com/elsayed-issa/the-moroccan-news-corpus



# The-Moroccan-News-Corpus

The present corpus was built during a summer internship. We use Scrapy spiders/crawlers to crawl Moroccan newspaper websites and save all the scraped data to either JSON or TXT files. We built spiders/crawlers for the following news websites:

## Moroccan News Websites


  • http://ahdath.info/

  • https://www.akhbarona.com/

  • https://www.alayam24.com/

  • https://www.almaghribtoday.net/

  • https://www.barlamane.com/

  • https://dalil-rif.com/

  • https://www.febrayer.com/

  • https://www.goud.ma/

  • https://www.hespress.com/

  • https://ar.hibapress.com/

  • http://kifache.com/

  • www.maghress.com

  • https://www.menara.ma/

  • https://www.almaghreb24.com/

  • https://maroctelegraph.com/

  • https://www.nadorcity.com/

  • https://tanja24.com/

  • http://telexpresse.com/

  • http://ar.le360.ma/

  • http://www.alyaoum24.com/

  • http://www.2m.ma/ar/

  • https://ar.yabiladi.com/

## How to use the spiders/crawlers?


    Every folder is the Scrapy project folder for one newspaper.

    To scrape data from any of the newspapers above:

  • Download its project folder.

  • On the command line, change directory to the project folder.

  • Invoke the following command to start scraping the website: `scrapy crawl <name of the spider> -o <name of the file>.json`
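The steps above can be sketched as a short shell session. The folder and spider names below are hypothetical placeholders; the real ones are defined in each newspaper's project folder (look in its `spiders/` directory):

```shell
# Step 1-2: enter the downloaded project folder (name here is hypothetical)
cd the-moroccan-news-corpus/hespress

# Step 3: run the spider, writing scraped items to a JSON file
scrapy crawl hespress -o hespress.json
```

The `-o` flag tells Scrapy which feed-export file to write; using a `.xml` extension instead produces XML output.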
## Note


    Every spider/crawler automatically saves a text file in addition to the JSON or XML file that you specify when you run the spider on the command line.
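As a minimal sketch of working with the scraped output, the following loads a Scrapy JSON export using only Python's standard library. The field name `content` is an assumption for illustration; each spider defines its own item fields, so check the project's items before relying on it:

```python
import json


def load_articles(path):
    """Load a Scrapy JSON export: a file containing a list of scraped items."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def article_texts(articles, text_field="content"):
    """Collect the text of each article.

    The field name "content" is a hypothetical default; pass the field
    actually used by the spider that produced the file.
    """
    return [item[text_field] for item in articles if text_field in item]
```

This is convenient for quickly counting articles or feeding the texts into a tokenizer after a crawl finishes.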

    Use this link to download about 2 gigabytes of text: https://drive.google.com/open?id=1w2-DTJF2phU3fVf4XkDh1tsN-O3N_baF