https://github.com/jonghwanhyeon/open-korean-text-python
Python interface to Open Korean Text Processor inspired by KoNLPy
- Host: GitHub
- URL: https://github.com/jonghwanhyeon/open-korean-text-python
- Owner: jonghwanhyeon
- License: apache-2.0
- Created: 2017-03-06T14:48:24.000Z
- Default Branch: master
- Last Pushed: 2017-12-06T16:25:21.000Z
- Last Synced: 2024-10-16T09:26:15.362Z
- Language: Python
- Size: 6.29 MB
- Stars: 4
- Watchers: 1
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Open Korean Text Python
Python interface to [Open Korean Text Processor](http://openkoreantext.org) inspired by [KoNLPy](http://konlpy.org)
## Requirements
- Python 3+
- Java 8+
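Both requirements can be checked before installing. The following is a minimal sketch using only the standard library; `check_requirements` is a hypothetical helper, not part of this package, and it only detects that a `java` executable is on `PATH`, not that it is version 8 or newer.

```python
import shutil
import sys


def check_requirements(java_command="java"):
    """Return a dict saying whether each stated requirement looks satisfied."""
    return {
        # The README asks for Python 3 or newer.
        "python": sys.version_info >= (3,),
        # shutil.which returns None when the executable is not on PATH.
        # This does not verify the Java version.
        "java": shutil.which(java_command) is not None,
    }


for name, ok in check_requirements().items():
    print(f"{name}: {'found' if ok else 'missing'}")
```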
## Installation
    pip install open-korean-text-python
## Usage
### Normalizing text
    import openkoreantext
    openkoreantext.normalize('안녕하세욬ㅋㅋㅋㅋ') # 안녕하세요ㅋㅋㅋ
### Tagging part-of-speech
    openkoreantext.pos('대한민국은 민주공화국이다.')
    # [ ('대한민국', 'Noun'), ('은', 'Josa'), ('민주공화국', 'Noun'), ('이다', 'Josa'), ('.', 'Punctuation') ]
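Because the result is a plain list of pairs, it can be post-processed with ordinary Python. The sketch below groups morphemes by tag; it works on a copy of the sample output above so that it runs without Java, and `group_by_tag` is a hypothetical helper rather than part of the library.

```python
from collections import defaultdict

# Sample output copied from the example above (not a live call).
tagged = [
    ("대한민국", "Noun"),
    ("은", "Josa"),
    ("민주공화국", "Noun"),
    ("이다", "Josa"),
    (".", "Punctuation"),
]


def group_by_tag(pairs):
    """Map each part-of-speech tag to the morphemes carrying it, in order."""
    groups = defaultdict(list)
    for morpheme, tag in pairs:
        groups[tag].append(morpheme)
    return dict(groups)


groups = group_by_tag(tagged)
print(groups["Noun"])
```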
### Extracting morphemes
    openkoreantext.morphs('대한민국의 주권은 국민에게 있고, 모든 권력은 국민으로부터 나온다.')
    # [ '대한민국', '의', '주권', '은', '국민', '에게', '있고', ',', '모든', '권력', '은', '국민', '으로부터', '나온다', '.' ]
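A common next step is counting how often each morpheme occurs. This sketch applies `collections.Counter` to a copy of the sample output above; the variable names are illustrative only.

```python
from collections import Counter

# Sample output copied from the example above (not a live call).
morphemes = [
    "대한민국", "의", "주권", "은", "국민", "에게", "있고", ",",
    "모든", "권력", "은", "국민", "으로부터", "나온다", ".",
]

counts = Counter(morphemes)

# Keep only the morphemes that occur more than once.
repeated = {morpheme: n for morpheme, n in counts.items() if n > 1}
print(repeated)
```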
### Extracting nouns
    openkoreantext.nouns('대한민국은 민주공화국이다.')
    # [ '대한민국', '민주공화국' ]
### Extracting phrases
    openkoreantext.phrases('불법 토토 신고하는 방법 #포상금', filter_spam=False, include_hashtags=True)
    # [ '불법', '불법 토토', '불법 토토 신고', '불법 토토 신고하는 방법', '토토', '신고', '방법', '#포상금' ]
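When hashtags are included, they arrive in the same list as ordinary phrases. If a caller wants them separately, one possible approach is to split on the leading `#`. The sketch below does this on a copy of the sample output; `split_hashtags` is a hypothetical helper, and it assumes hashtags are always returned with their `#` prefix, as in the example.

```python
# Sample output copied from the example above (not a live call).
phrases = [
    "불법", "불법 토토", "불법 토토 신고", "불법 토토 신고하는 방법",
    "토토", "신고", "방법", "#포상금",
]


def split_hashtags(items):
    """Return (plain_phrases, hashtags), preserving the original order."""
    plain = [item for item in items if not item.startswith("#")]
    hashtags = [item for item in items if item.startswith("#")]
    return plain, hashtags


plain, hashtags = split_hashtags(phrases)
print(hashtags)
```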
### Splitting sentences
    openkoreantext.sentences('대한민국은 민주공화국이다. 대한민국의 주권은 국민에게 있고, 모든 권력은 국민으로부터 나온다.')
    # [ '대한민국은 민주공화국이다.', '대한민국의 주권은 국민에게 있고, 모든 권력은 국민으로부터 나온다.' ]
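Sentence splitting composes naturally with the other functions. The sketch below pairs each sentence with its nouns. So that it runs without Java or the package installed, the splitter and noun extractor are passed in as arguments and demonstrated with toy stand-ins; `nouns_per_sentence`, `fake_sentences`, and `fake_nouns` are all hypothetical, and in real use one would presumably pass `openkoreantext.sentences` and `openkoreantext.nouns` instead.

```python
def nouns_per_sentence(text, split_sentences, extract_nouns):
    """Return a list of (sentence, nouns) pairs for `text`."""
    return [
        (sentence, extract_nouns(sentence))
        for sentence in split_sentences(text)
    ]


# Toy stand-ins for the demo only. They do NOT reproduce the
# library's behaviour.
def fake_sentences(text):
    return [part.strip() + "." for part in text.split(".") if part.strip()]


def fake_nouns(sentence):
    known = ("대한민국", "민주공화국")  # tiny demo lexicon
    return [word for word in known if word in sentence]


result = nouns_per_sentence(
    "대한민국은 민주공화국이다. 대한민국의 주권은 국민에게 있다.",
    fake_sentences,
    fake_nouns,
)
for sentence, nouns in result:
    print(sentence, nouns)
```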
### Adding words to dictionary
    openkoreantext.add_words_to_dictionary('Noun', [ '앎읾슮', '앎멞릶칾놂' ])
    openkoreantext.add_words_to_dictionary('Adverb', '살랑설렁')
## API
### openkoreantext.normalize(text)
Normalizes `text`. Returns the normalized text.
#### Parameter
- `text`: text to normalize
### openkoreantext.pos(text, stem=False)
Tokenizes `text` into morphemes and tags each with its part-of-speech. Returns a list of (morpheme, part-of-speech) pairs.
#### Parameters
- `text`: text to tokenize
- `stem`: stem morphemes if True
### openkoreantext.morphs(text, stem=False)
Extracts morphemes from `text`. Returns a list of morphemes.
#### Parameters
- `text`: text to extract morphemes from
- `stem`: stem morphemes if True
### openkoreantext.nouns(text)
Extracts nouns from `text`. Returns a list of nouns.
#### Parameter
- `text`: text to extract nouns from
### openkoreantext.phrases(text, filter_spam=True, include_hashtags=True)
Extracts phrases from `text`. Returns a list of phrases.
#### Parameters
- `text`: text to extract phrases from
- `filter_spam`: ignore spam words if True
- `include_hashtags`: include hashtags if True
### openkoreantext.sentences(text)
Splits `text` into sentences. Returns a list of sentences.
#### Parameter
- `text`: text to split into sentences
### openkoreantext.add_words_to_dictionary(pos, words)
Adds user-defined `words` to the dictionary.
#### Parameters
- `pos`: part-of-speech of words (Noun, Verb, Adjective, Adverb, Determiner, Exclamation, Josa, Eomi, PreEomi, Conjunction, Modifier, VerbPrefix, Suffix)
- `words`: a word, or a list of words, to add
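Since `pos` must be one of the tags listed above and the usage examples pass `words` as either a list or a single string, a caller may want to validate both before touching the dictionary. The following is a minimal sketch; `VALID_POS` and `prepare_dictionary_entry` are hypothetical names, and whether the library itself performs such checks is not stated here.

```python
# Tags copied from the parameter list above.
VALID_POS = frozenset({
    "Noun", "Verb", "Adjective", "Adverb", "Determiner", "Exclamation",
    "Josa", "Eomi", "PreEomi", "Conjunction", "Modifier", "VerbPrefix",
    "Suffix",
})


def prepare_dictionary_entry(pos, words):
    """Validate `pos` and return (pos, list_of_words)."""
    if pos not in VALID_POS:
        raise ValueError(f"unknown part-of-speech: {pos!r}")
    # Accept a single word as well as an iterable of words.
    if isinstance(words, str):
        words = [words]
    return pos, list(words)


print(prepare_dictionary_entry("Adverb", "살랑설렁"))
```

The returned pair could then be forwarded as `openkoreantext.add_words_to_dictionary(*entry)`.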