Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/tc64/spacyss
Sentence Segmentation for Spacy
- Host: GitHub
- URL: https://github.com/tc64/spacyss
- Owner: tc64
- License: mit
- Created: 2017-12-22T17:03:31.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2018-07-26T22:38:43.000Z (over 6 years ago)
- Last Synced: 2024-10-01T14:50:04.278Z (3 months ago)
- Topics: sentence-boundary-detection, sentence-segmentation, spacy, spacy-pipeline
- Language: Python
- Size: 12.7 KB
- Stars: 9
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Description
Custom sentence segmentation for spaCy.

```bash
pip install spacyss
```

*Tested on spacy 2.0.5*
# Segmenters Implemented
## Newline Segmenter
Sentences in the text are separated by one or more newline characters.

## Example
```python
# note that the pip package is called spacyss
from seg.newline.segmenter import NewLineSegmenter
import spacy

nlseg = NewLineSegmenter()
nlp = spacy.load('en')
nlp.add_pipe(nlseg.set_sent_starts, name='sentence_segmenter', before='parser')
doc = nlp(my_doc_text)
```

## Single Sentence (or "Trivial") Segmenter
The text is treated as a single sentence. This may be better for tweets or other short, informal texts, where over-segmentation can cause more problems than under-segmentation.

# Implementing more segmenters
* Create a package under `seg` named for your sentence segmentation approach.
* Create `segmenter.py` under that package.
* Create a class for your segmenter with a method called `set_sent_starts` that takes a doc as its single argument.
* The spaCy API may allow a more flexible argument signature here; corrections are welcome.
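As an illustration of the steps above, here is a minimal sketch of a hypothetical punctuation-based segmenter (the class name `PunctSegmenter` and its package are invented for this example and are not part of spacyss). It assumes the spaCy 2.x convention, used elsewhere in this README, that a pipeline component receives the doc, mutates token `sent_start` flags, and returns the doc:

```python
# Hypothetical example: seg/punct/segmenter.py
# Marks a new sentence after each '.', '!', or '?' token.

class PunctSegmenter:
    def set_sent_starts(self, doc):
        # The first token always begins a sentence.
        doc[0].sent_start = True
        # A token starts a sentence if the previous token ends one.
        for i, token in enumerate(doc[:-1]):
            doc[i + 1].sent_start = token.text in ('.', '!', '?')
        return doc
```

It would be registered the same way as the newline segmenter, e.g. `nlp.add_pipe(PunctSegmenter().set_sent_starts, name='punct_segmenter', before='parser')`.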