https://github.com/craigtrim/fast-sentence-segment
Fast and Efficient Sentence Segmentation
natural-language-processing nlp python segmentation sentence-segmentation spacy text-processing text-segmentation
- Host: GitHub
- URL: https://github.com/craigtrim/fast-sentence-segment
- Owner: craigtrim
- License: mit
- Created: 2022-08-20T17:14:40.000Z (almost 4 years ago)
- Default Branch: master
- Last Pushed: 2025-12-29T17:54:45.000Z (6 months ago)
- Last Synced: 2025-12-31T17:36:45.503Z (6 months ago)
- Topics: natural-language-processing, nlp, python, segmentation, sentence-segmentation, spacy, text-processing, text-segmentation
- Language: Python
- Homepage:
- Size: 12.1 MB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 3
Metadata Files:
- Readme: README.md
# Fast Sentence Segmentation
Fast and efficient sentence segmentation using spaCy with surgical post-processing fixes. Handles complex edge cases like abbreviations (Dr., Mr., etc.), ellipses, quoted text, and multi-paragraph documents.
## Why This Library?
1. **Keep it local**: LLM API calls cost money and send your data to third parties. Run sentence segmentation entirely on your machine.
2. **spaCy perfected**: spaCy is a great local model, but it missplits around abbreviations, ellipses, and `?`/`!` boundaries. This library applies targeted post-processing fixes for those cases.
## Features
- **Paragraph-aware segmentation**: Returns sentences grouped by paragraph
- **Abbreviation handling**: Correctly handles "Dr.", "Mr.", "etc.", "p.m.", "a.m." without false splits
- **Ellipsis preservation**: Keeps `...` intact while detecting sentence boundaries
- **Question/exclamation splitting**: Properly splits on `?` and `!` followed by capital letters
- **Cached processing**: LRU cache for repeated text processing
- **Flexible output**: Nested lists (by paragraph) or flattened list of sentences
- **Bullet point & numbered list normalization**: Cleans common list formats
- **CLI tool**: Command-line interface for quick segmentation
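The cached-processing feature can be pictured with Python's standard `functools.lru_cache`. The sketch below is an illustration only: `segment_cached` and its regex-based body are hypothetical stand-ins, not the library's internals, and the cache size is an arbitrary assumption.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=1024)  # cache size chosen arbitrarily for this sketch
def segment_cached(text: str) -> tuple[str, ...]:
    """Hypothetical stand-in for an expensive segmentation call."""
    # Naive split after ., ? or ! for illustration; the library uses spaCy.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Return a tuple so callers cannot mutate the cached result.
    return tuple(parts)


segment_cached("Frodo nodded. The Ring was heavy.")  # computed
segment_cached("Frodo nodded. The Ring was heavy.")  # served from the cache
print(segment_cached.cache_info())
```

Because the cache is keyed on the input string, the benefit only appears when identical text is processed more than once.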
## Installation
```bash
pip install fast-sentence-segment
```
After installation, download the spaCy model:
```bash
python -m spacy download en_core_web_sm
```
## Quick Start
```python
from fast_sentence_segment import segment_text
text = "Do you like Dr. Who? I prefer Dr. Strange! Mr. T is also cool."
results = segment_text(text, flatten=True)
```
```json
[
"Do you like Dr. Who?",
"I prefer Dr. Strange!",
"Mr. T is also cool."
]
```
Notice that "Do you like Dr. Who?" stays together as a single sentence: the library recognizes that a title followed by a single-word name ending in `?` or `!` is a name reference, so the period after "Dr." is not treated as a sentence boundary.
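One way to picture that rule is a merge pass over naively split fragments. The sketch below is not the library's implementation; the `TITLES` set and the `merge_title_fragments` helper are assumptions made for illustration.

```python
import re

# Assumed, non-exhaustive set of title abbreviations.
TITLES = {"Dr.", "Mr.", "Mrs.", "Ms."}


def merge_title_fragments(text: str) -> list[str]:
    """Split after ., ? or ! and re-join fragments that end in a title."""
    fragments = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences: list[str] = []
    carry = ""
    for fragment in fragments:
        if not fragment:
            continue
        candidate = f"{carry} {fragment}".strip()
        # A fragment ending in a title is incomplete: hold it for the next one.
        if candidate.split()[-1] in TITLES:
            carry = candidate
        else:
            sentences.append(candidate)
            carry = ""
    if carry:
        sentences.append(carry)
    return sentences


print(merge_title_fragments(
    "Do you like Dr. Who? I prefer Dr. Strange! Mr. T is also cool."
))
# ['Do you like Dr. Who?', 'I prefer Dr. Strange!', 'Mr. T is also cool.']
```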
## Usage
### Basic Segmentation
The `segment_text` function returns a list of lists, where each inner list represents a paragraph containing its sentences:
```python
from fast_sentence_segment import segment_text
text = """Gandalf spoke softly. "All we have to decide is what to do with the time given us."
Frodo nodded. The weight of the Ring pressed against his chest."""
results = segment_text(text)
```
```json
[
[
"Gandalf spoke softly.",
"\"All we have to decide is what to do with the time given us.\"."
],
[
"Frodo nodded.",
"The weight of the Ring pressed against his chest."
]
]
```
### Flattened Output
If you don't need paragraph boundaries, use the `flatten` parameter:
```python
from fast_sentence_segment import segment_text
text = "At 9 a.m. the hobbits set out. By 3 p.m. they reached Rivendell. Mr. Frodo was exhausted."
results = segment_text(text, flatten=True)
```
```json
[
"At 9 a.m. the hobbits set out.",
"By 3 p.m. they reached Rivendell.",
"Mr. Frodo was exhausted."
]
```
### Direct Segmenter Access
For more control, use the `Segmenter` class directly:
```python
from fast_sentence_segment import Segmenter
segmenter = Segmenter()
results = segmenter.input_text("Your text here.")
```
### Command Line Interface
Segment text directly from the terminal:
```bash
# Text piped in on standard input
echo "Have you seen Dr. Who? It's brilliant!" | segment
```
```
Have you seen Dr. Who?
It's brilliant!
```
```bash
# Numbered output
segment -n "Gandalf paused... You shall not pass! The Balrog roared."
```
```
1. Gandalf paused...
2. You shall not pass!
3. The Balrog roared.
```
```bash
# From file
segment -f silmarillion.txt
```
## API Reference
| Function | Parameters | Returns | Description |
|----------|------------|---------|-------------|
| `segment_text()` | `input_text: str`, `flatten: bool = False` | `list[list[str]]`, or `list[str]` when `flatten=True` | Main entry point for segmentation |
| `Segmenter.input_text()` | `input_text: str` | `list[list[str]]` | Cached paragraph-aware segmentation |
### CLI Options
| Option | Description |
|--------|-------------|
| `text` | Text to segment (positional argument) |
| `-f, --file` | Read text from file |
| `-n, --numbered` | Number output lines |
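For orientation, the options above map naturally onto `argparse`. This is a hypothetical sketch, not the library's actual CLI code: `build_parser`, `format_lines`, and the `nargs="?"` choice for the positional argument are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options."""
    parser = argparse.ArgumentParser(prog="segment")
    # Optional positional, so that -f can be used without inline text (assumed).
    parser.add_argument("text", nargs="?", help="Text to segment")
    parser.add_argument("-f", "--file", help="Read text from file")
    parser.add_argument("-n", "--numbered", action="store_true",
                        help="Number output lines")
    return parser


def format_lines(sentences: list[str], numbered: bool) -> list[str]:
    """Prefix each sentence with '1. ', '2. ', ... when numbered is set."""
    if numbered:
        return [f"{i}. {s}" for i, s in enumerate(sentences, start=1)]
    return list(sentences)


# Parse an explicit argument list instead of sys.argv so the sketch is runnable.
args = build_parser().parse_args(["-n", "Gandalf paused... You shall not pass!"])
print(format_lines(["Gandalf paused...", "You shall not pass!"], args.numbered))
```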
## Why Nested Lists?
The segmentation process preserves document structure by segmenting into both paragraphs and sentences. The outer list represents the document, and each inner list represents one paragraph and contains that paragraph's sentences. This is useful for:
- Document structure analysis
- Paragraph-level processing
- Maintaining original text organization
Use `flatten=True` when you only need sentences without paragraph context.
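To make the two shapes concrete, the sketch below works on a hand-written nested list in the documented shape (hard-coded so the snippet runs without the library) and derives a flat list with `itertools.chain`. Treating this as equivalent to what `flatten=True` returns is an assumption, not something the library's code confirms here.

```python
from itertools import chain

# Hand-written sample in the documented list-of-paragraphs shape.
nested = [
    ["Gandalf spoke softly."],
    ["Frodo nodded.", "The weight of the Ring pressed against his chest."],
]

# Paragraph-level processing: count the sentences in each paragraph.
sentence_counts = [len(paragraph) for paragraph in nested]

# Sentence-level processing: drop the paragraph grouping.
flat = list(chain.from_iterable(nested))

print(sentence_counts)  # [1, 2]
print(flat)
```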
## Requirements
- Python 3.9+
- spaCy 3.8+
- en_core_web_sm spaCy model
## How It Works
This library uses spaCy for initial sentence segmentation, then applies surgical post-processing fixes for cases where spaCy's default behavior is incorrect:
1. **Pre-processing**: Normalize numbered lists, preserve ellipses with placeholders
2. **spaCy segmentation**: Use spaCy's sentence boundary detection
3. **Post-processing**: Split on abbreviation boundaries, handle `?`/`!` + capital patterns
4. **Denormalization**: Restore placeholders to original text
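The stages above can be sketched end to end with the standard library. To keep the example self-contained, a regular expression stands in for spaCy in step 2, and both list normalization and the step 3 fixes are omitted, so this is a simplified illustration rather than the library's code. The `ELLIPSIS_TOKEN` placeholder and all function names are assumptions.

```python
import re

# Hypothetical placeholder: a private-use character unlikely to occur in text.
ELLIPSIS_TOKEN = "\ue000"

# Stand-in for spaCy: break after ., ?, ! or the placeholder when the next
# non-space character is an uppercase ASCII letter.
BOUNDARY = re.compile(rf"(?<=[.!?{ELLIPSIS_TOKEN}])\s+(?=[A-Z])")


def normalize(text: str) -> str:
    """Step 1: swap each ellipsis for a single placeholder character."""
    return text.replace("...", ELLIPSIS_TOKEN)


def naive_split(text: str) -> list[str]:
    """Step 2 (simplified): regex boundary detection instead of spaCy."""
    return BOUNDARY.split(text.strip())


def denormalize(sentences: list[str]) -> list[str]:
    """Step 4: restore the original ellipses."""
    return [s.replace(ELLIPSIS_TOKEN, "...") for s in sentences]


def segment_sketch(text: str) -> list[str]:
    return denormalize(naive_split(normalize(text)))


print(segment_sketch("Gandalf paused... You shall not pass! The Balrog roared."))
# ['Gandalf paused...', 'You shall not pass!', 'The Balrog roared.']
```

Replacing the ellipsis with one character before splitting means the splitter sees it as a single unit and cannot break inside it; the original three dots come back only after boundaries are fixed.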
## License
MIT License - see [LICENSE](LICENSE) for details.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Run tests (`make test`)
4. Commit your changes
5. Push to the branch
6. Open a Pull Request