https://github.com/timkam/compound-word-splitter
A compound word splitter for Python
- Host: GitHub
- URL: https://github.com/timkam/compound-word-splitter
- Owner: TimKam
- License: mit
- Created: 2017-01-07T12:37:37.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2021-08-18T13:03:10.000Z (almost 5 years ago)
- Last Synced: 2025-02-27T19:29:44.338Z (over 1 year ago)
- Topics: natural-language-processing, python
- Language: Python
- Size: 15.6 KB
- Stars: 48
- Watchers: 5
- Forks: 15
- Open Issues: 4
Metadata Files:
- Readme: README.rst
- License: LICENSE
compound-word-splitter
======================
.. image:: https://travis-ci.org/TimKam/compound-word-splitter.svg?branch=master
   :target: https://travis-ci.org/TimKam/compound-word-splitter

Splits words that are not recognized by *pyenchant* (a spell checker) into the largest possible compounds.
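
The idea of preferring the largest compounds can be sketched without the library. The snippet below is only an illustration, not the package's actual implementation: ``split_longest_first`` and the ``is_word`` callback are hypothetical names, and a small set stands in for the pyenchant dictionary.

```python
from typing import Callable, List


def split_longest_first(word: str, is_word: Callable[[str], bool]) -> List[str]:
    """Split `word` into known parts, preferring the longest leading part.

    Hypothetical helper for illustration; returns [] when no split exists.
    """
    # Try the longest proper prefix first, then shrink it one character at a time.
    for end in range(len(word) - 1, 0, -1):
        head, tail = word[:end], word[end:]
        if not is_word(head):
            continue
        if is_word(tail):
            return [head, tail]
        # The tail is not a word itself, so try to split it further.
        rest = split_longest_first(tail, is_word)
        if rest:
            return [head] + rest
    return []


# A tiny stand-in vocabulary replaces the spell checker here (assumption).
vocabulary = {"art", "factory", "fact", "or", "a"}
print(split_longest_first("artfactory", vocabulary.__contains__))  # prints ['art', 'factory']
```

Because the longest prefix is tried first, ``art`` + ``factory`` wins over shorter candidates such as ``a`` followed by an unrecognized remainder.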
Installation
------------
Make sure you have enchant installed before proceeding. Then run::

    pip install compound-word-splitter

Note that the languages that are available by default depend on your operating system's configuration and could be, for
example::

    ['en', 'en_CA', 'en_GB', 'en_US']

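
To see which languages are available on a given machine, pyenchant's ``enchant.list_languages()`` can be queried directly. The following is a minimal sketch; the ``available_languages`` wrapper is a hypothetical helper that returns an empty list when pyenchant (or the underlying enchant library) cannot be imported.

```python
from typing import List


def available_languages() -> List[str]:
    """Return the language tags pyenchant can load, or [] if it is unavailable."""
    try:
        import enchant  # pyenchant; importing fails if the enchant C library is missing
    except ImportError:
        return []
    return sorted(enchant.list_languages())


print(available_languages())
```

The printed list depends entirely on the dictionaries installed on the system.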
If you would like to use a different language, like ``de_de`` in the example below, you will have to install the
myspell dictionary for it (*myspell-de-de*).
Usage
-----
.. code:: python

    import splitter
    splitter.split('artfactory')

returns

.. code:: python

    ['art', 'factory']

A language can be passed as the second argument:

.. code:: python

    splitter.split('Glossarelement', 'de_de')

returns

.. code:: python

    ['Glossar', 'Element']

If the word cannot be split into compounds that pyenchant recognizes as words, the splitter returns an empty string.
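
Since a failed split yields an empty string rather than a list, callers may want to normalize the result. The sketch below assumes only the behavior described above; ``split_or_keep`` is a hypothetical helper, and ``fake_split`` stands in for ``splitter.split`` so that the example runs without enchant installed.

```python
from typing import Callable, List, Sequence, Union

SplitResult = Union[str, Sequence[str]]


def split_or_keep(word: str, split_fn: Callable[[str], SplitResult]) -> List[str]:
    """Return the compounds of `word`, or [word] when no split was found."""
    parts = split_fn(word)
    # An empty result (or any plain string) is treated as "no split found".
    if not parts or isinstance(parts, str):
        return [word]
    return list(parts)


def fake_split(word: str) -> SplitResult:
    # Stand-in for splitter.split (assumption), mimicking its empty-string result.
    return ['art', 'factory'] if word == 'artfactory' else ''


print(split_or_keep('artfactory', fake_split))  # ['art', 'factory']
print(split_or_keep('qwrtz', fake_split))       # ['qwrtz']
```

With the package installed, ``splitter.split`` could be passed in place of ``fake_split``.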