https://github.com/tchaikov/open-gram
collect lexicon and build n-gram dataset for NLP in Chinese
- Host: GitHub
- URL: https://github.com/tchaikov/open-gram
- Owner: tchaikov
- Created: 2009-11-01T12:22:03.000Z (about 15 years ago)
- Default Branch: master
- Last Pushed: 2010-07-21T15:38:36.000Z (over 14 years ago)
- Last Synced: 2024-10-08T04:02:39.182Z (3 months ago)
- Language: Python
- Homepage:
- Size: 4.49 MB
- Stars: 9
- Watchers: 5
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.rst
README
open-gram
=========

open-gram is a project that tries to collect a lexicon and build an n-gram dataset for Chinese NLP. It tries to leverage existing open source resources such as crfpp and CC-CEDICT.

open-gram includes 4 parts:

- corpus collection
- segmentation
- (new) word extraction
- n-gram info counting

corpus collection
=================

1. crawl Chinese web sites using scrapy and grab the HTML body of each page
2. preprocess the pages

   - detect the encoding
   - remove HTML tags and other stuff we are not interested in
   - split the text into sentences

segmentation
============

there are two ways to segment tokens into words:

* tagging
* matching

word extraction
===============

n-gram info counting
====================
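
The README does not describe how the counting is done. As a rough illustration only, the sketch below counts unigrams and bigrams over sentences that have already been segmented into words. The function names `ngrams` and `count_ngrams`, the `max_n` parameter, and the sample sentences are all hypothetical and are not taken from the open-gram code base.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))


def count_ngrams(sentences, max_n=3):
    """Count all 1..max_n-grams over pre-segmented sentences.

    Each sentence is a list of words; n-grams never cross a
    sentence boundary because sentences are processed one by one.
    """
    counts = Counter()
    for words in sentences:
        for n in range(1, max_n + 1):
            counts.update(ngrams(words, n))
    return counts


# Hypothetical sample input: two sentences, already split into words.
sentences = [
    ["我", "喜欢", "自然", "语言"],
    ["我", "喜欢", "开源", "软件"],
]
counts = count_ngrams(sentences, max_n=2)
```

With this input, `counts[("我", "喜欢")]` is 2 and `counts[("自然", "语言")]` is 1. Keys are tuples so that unigrams and longer n-grams can share one `Counter`. A real dataset would likely also need sentence boundary markers and a cutoff for rare n-grams, which this sketch leaves out.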