Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/livingbio/news_scraper
Scrape news and keywords to evaluate different Chinese segmenter
- Host: GitHub
- URL: https://github.com/livingbio/news_scraper
- Owner: livingbio
- License: mit
- Created: 2016-06-23T02:46:50.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2023-11-25T05:32:10.000Z (12 months ago)
- Last Synced: 2024-05-15T20:14:05.997Z (6 months ago)
- Language: Python
- Size: 769 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
README
[![Build Status](https://travis-ci.org/livingbio/django-template.svg?branch=master)](https://travis-ci.org/livingbio/django-template)
# Evaluation of Chinese Segmenters
Evaluate four Chinese segmenters: [jieba](https://github.com/fxsjy/jieba), jieba with a pre-defined dictionary, 大師's dictionary-based segmenter, and the [Stanford segmenter](http://nlp.stanford.edu/software/segmenter.shtml).
### Setting up virtualenv
First, make sure you have [virtualenv](http://www.virtualenv.org/) installed. Then create and activate a virtual environment and install the requirements:

    cd news_scraper
    virtualenv venv
    source venv/bin/activate
    pip install -r requirements.txt

### Setting up local environment variables
Settings are stored in environment variables via [django-environ](http://django-environ.readthedocs.org/en/latest/). The quickest way to start is to copy `local.sample.env` to `local.env`:

    cp src/news_scraper/settings/local.sample.env src/news_scraper/settings/local.env

Then edit `SECRET_KEY` in `local.env`, replacing `q+#ae^hxgz7o*lvdatnsu76365uwmspc$(vac%9(b8gck-(l^z` with any [Django Secret Key](http://www.miniwebtool.com/django-secret-key-generator/), for example:

    SECRET_KEY=twvg)o_=u&@6^*cbi9nfswwh=(&hd$bhxh9iq&h-kn-pff0&&3
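
If you would rather not rely on a website for the key, a minimal sketch like the following can generate one locally with the Python standard library. The function name `generate_secret_key` is hypothetical and not part of this project; the alphabet and the length of 50 are chosen to match what Django's own `get_random_secret_key()` helper uses.

```python
import secrets

# Assumption: same alphabet and length as Django's get_random_secret_key().
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789!@#$%^&*(-_=+)"
KEY_LENGTH = 50


def generate_secret_key(length=KEY_LENGTH):
    """Return a random key drawn from ALPHABET using a CSPRNG."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


if __name__ == "__main__":
    # Prints a line that can be pasted into local.env.
    print(f"SECRET_KEY={generate_secret_key()}")
```

Run it once and paste the printed line over the existing `SECRET_KEY` entry in `local.env`.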
### Run web server
After that, change to the `src` folder:

    cd src

Then apply the migrations and start the development server:

    python manage.py migrate
    python manage.py runserver

### Scrape news for evaluation
Scraping news from LTN takes about 10 minutes:

    cd src
    scrapy crawl news.ltn.com.tw

### Evaluate segmenters

    python manage.py eval_segmenter
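
This README does not describe the metric that `eval_segmenter` computes. As a rough illustration only, one way to use scraped keywords for evaluation is to measure how many of an article's keywords a segmenter emits as whole tokens. Everything in the sketch below is an assumption for illustration: the `keyword_recall` function, the toy segmenters, and the sample article are hypothetical and are not taken from this project.

```python
from typing import Callable, Iterable, List, Set, Tuple

Segmenter = Callable[[str], List[str]]


def keyword_recall(segmenter: Segmenter,
                   articles: Iterable[Tuple[str, List[str]]]) -> float:
    """Fraction of keywords that appear as whole tokens in the segmentation."""
    hits = total = 0
    for text, keywords in articles:
        tokens = set(segmenter(text))
        for keyword in keywords:
            total += 1
            hits += keyword in tokens
    return hits / total if total else 0.0


def char_segmenter(text: str) -> List[str]:
    """Baseline: treat every character as its own token."""
    return list(text)


def make_dict_segmenter(vocabulary: Set[str]) -> Segmenter:
    """Toy dictionary-based segmenter using forward maximum matching."""
    max_len = max(map(len, vocabulary), default=1)

    def segment(text: str) -> List[str]:
        tokens, i = [], 0
        while i < len(text):
            # Try the longest candidate first; fall back to one character.
            for size in range(min(max_len, len(text) - i), 0, -1):
                piece = text[i:i + size]
                if size == 1 or piece in vocabulary:
                    tokens.append(piece)
                    i += size
                    break
        return tokens

    return segment


if __name__ == "__main__":
    # Hypothetical sample: (article text, keywords attached to the article).
    articles = [("台北市長出席記者會", ["台北", "市長", "記者會"])]
    dict_segmenter = make_dict_segmenter({"台北", "市長", "出席", "記者會"})
    print("dictionary:", keyword_recall(dict_segmenter, articles))
    print("characters:", keyword_recall(char_segmenter, articles))
```

Any real segmenter can be plugged in the same way, as long as it is wrapped in a callable that takes a string and returns a list of tokens.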