https://github.com/elsayed-issa/the-moroccan-news-corpus



# The-Moroccan-News-Corpus

The present corpus was built during a summer internship. We use Scrapy spiders/crawlers to crawl Moroccan newspaper websites and save all the scraped data to either JSON or TXT files. We built spiders/crawlers for the following news websites:

## Moroccan News Websites


  • http://ahdath.info/

  • https://www.akhbarona.com/

  • https://www.alayam24.com/

  • https://www.almaghribtoday.net/

  • https://www.barlamane.com/

  • https://dalil-rif.com/

  • https://www.febrayer.com/

  • https://www.goud.ma/

  • https://www.hespress.com/

  • https://ar.hibapress.com/

  • http://kifache.com/

  • www.maghress.com

  • https://www.menara.ma/

  • https://www.almaghreb24.com/

  • https://maroctelegraph.com/

  • https://www.nadorcity.com/

  • https://tanja24.com/

  • http://telexpresse.com/

  • http://ar.le360.ma/

  • http://www.alyaoum24.com/

  • http://www.2m.ma/ar/

  • https://ar.yabiladi.com/

## How to use the spiders/crawlers?


    Every folder is the Scrapy project folder for one newspaper.

    To scrape data from any of the newspapers above:

  • Download its project folder.

  • On the command line, change directory to the project folder.

  • Invoke the following command to start scraping the website: `scrapy crawl <name of the spider> -o <name of the file>.json`
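The steps above can be sketched as a short shell session. The folder and spider names below are hypothetical placeholders; the real ones are defined in each newspaper's project folder (look in its `spiders/` directory):

```shell
# Step 1-2: enter the downloaded project folder (name here is hypothetical)
cd the-moroccan-news-corpus/hespress

# Step 3: run the spider, writing scraped items to a JSON file
scrapy crawl hespress -o hespress.json
```

The `-o` flag tells Scrapy which feed-export file to write; using a `.xml` extension instead produces XML output.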
## Note


    Every spider/crawler automatically saves a text file in addition to the JSON or XML file that you specify when you run the spider on the command line.
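As a minimal sketch of working with the scraped output, the following loads a Scrapy JSON export using only Python's standard library. The field name `content` is an assumption for illustration; each spider defines its own item fields, so check the project's items before relying on it:

```python
import json


def load_articles(path):
    """Load a Scrapy JSON export: a file containing a list of scraped items."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def article_texts(articles, text_field="content"):
    """Collect the text of each article.

    The field name "content" is a hypothetical default; pass the field
    actually used by the spider that produced the file.
    """
    return [item[text_field] for item in articles if text_field in item]
```

This is convenient for quickly counting articles or feeding the texts into a tokenizer after a crawl finishes.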

    Use this link to download about 2 gigabytes of text: https://drive.google.com/open?id=1w2-DTJF2phU3fVf4XkDh1tsN-O3N_baF