https://github.com/craigtrim/fast-sentence-segment

Fast and Efficient Sentence Segmentation

Host: GitHub
URL: https://github.com/craigtrim/fast-sentence-segment
Owner: craigtrim
License: mit
Created: 2022-08-20T17:14:40.000Z (almost 4 years ago)
Default Branch: master
Last Pushed: 2025-12-29T17:54:45.000Z (6 months ago)
Last Synced: 2025-12-31T17:36:45.503Z (6 months ago)
Topics: natural-language-processing, nlp, python, segmentation, sentence-segmentation, spacy, text-processing, text-segmentation
Language: Python
Homepage:
Size: 12.1 MB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 3
Metadata Files:
- Readme: README.md

# Fast Sentence Segmentation

[![PyPI version](https://img.shields.io/pypi/v/fast-sentence-segment.svg)](https://pypi.org/project/fast-sentence-segment/)
[![Python versions](https://img.shields.io/pypi/pyversions/fast-sentence-segment.svg)](https://pypi.org/project/fast-sentence-segment/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![spaCy](https://img.shields.io/badge/spaCy-3.8-blue.svg)](https://spacy.io/)

Fast and efficient sentence segmentation using spaCy with surgical post-processing fixes. Handles complex edge cases like abbreviations (Dr., Mr., etc.), ellipses, quoted text, and multi-paragraph documents.

## Why This Library?

1. **Keep it local**: LLM API calls cost money and send your data to third parties. Run sentence segmentation entirely on your machine.
2. **spaCy perfected**: spaCy is a great local model, but its sentence boundaries are not always right. This library corrects common errors, such as false splits after abbreviations and missed splits after `?` or `!`.

## Features

- **Paragraph-aware segmentation**: Returns sentences grouped by paragraph
- **Abbreviation handling**: Correctly handles "Dr.", "Mr.", "etc.", "p.m.", "a.m." without false splits
- **Ellipsis preservation**: Keeps `...` intact while detecting sentence boundaries
- **Question/exclamation splitting**: Properly splits on `?` and `!` followed by capital letters
- **Cached processing**: LRU cache for repeated text processing
- **Flexible output**: Nested lists (by paragraph) or flattened list of sentences
- **Bullet point & numbered list normalization**: Cleans common list formats
- **CLI tool**: Command-line interface for quick segmentation

## Installation

```bash
pip install fast-sentence-segment
```

After installation, download the spaCy model:

```bash
python -m spacy download en_core_web_sm
```

## Quick Start

```python
from fast_sentence_segment import segment_text

text = "Do you like Dr. Who? I prefer Dr. Strange! Mr. T is also cool."

results = segment_text(text, flatten=True)
```

```json
[
  "Do you like Dr. Who?",
  "I prefer Dr. Strange!",
  "Mr. T is also cool."
]
```

Notice how "Dr. Who?" stays together: the library recognizes that a title followed by a single-word name ending in `?` or `!` is a name reference, so the period after the title is not treated as a sentence boundary.
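
To make the idea concrete, here is a minimal sketch of one way to repair false splits after titles. It is an illustration only, not the library's implementation: `TITLES`, `ends_with_title`, and `merge_title_splits` are hypothetical names, and the title list is deliberately tiny.

```python
# Illustrative title list (an assumption, not the library's actual list).
TITLES = {"Dr.", "Mr.", "Mrs.", "Ms."}


def ends_with_title(segment: str) -> bool:
    """Return True if the last whitespace-separated word is a known title."""
    words = segment.split()
    return bool(words) and words[-1] in TITLES


def merge_title_splits(segments: list[str]) -> list[str]:
    """Re-join any segment that was cut off right after a title."""
    merged: list[str] = []
    for segment in segments:
        if merged and ends_with_title(merged[-1]):
            # The previous piece ended in "Dr.", "Mr.", etc., so this piece
            # is the rest of the same sentence.
            merged[-1] = f"{merged[-1]} {segment}"
        else:
            merged.append(segment)
    return merged


# Output of a naive splitter that breaks after every period.
naive = ["Do you like Dr.", "Who?", "I prefer Dr.", "Strange!", "Mr.", "T is also cool."]
print(merge_title_splits(naive))
```

Matching on the whole last word, rather than on a suffix, keeps words that merely end in the same letters from being treated as titles.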

## Usage

### Basic Segmentation

The `segment_text` function returns a list of lists, where each inner list represents a paragraph containing its sentences:

```python
from fast_sentence_segment import segment_text

text = """Gandalf spoke softly. "All we have to decide is what to do with the time given us."

Frodo nodded. The weight of the Ring pressed against his chest."""

results = segment_text(text)
```

```json
[
  [
    "Gandalf spoke softly.",
    "\"All we have to decide is what to do with the time given us.\"."
  ],
  [
    "Frodo nodded.",
    "The weight of the Ring pressed against his chest."
  ]
]
```

### Flattened Output

If you don't need paragraph boundaries, use the `flatten` parameter:

```python
text = "At 9 a.m. the hobbits set out. By 3 p.m. they reached Rivendell. Mr. Frodo was exhausted."

results = segment_text(text, flatten=True)
```

```json
[
  "At 9 a.m. the hobbits set out.",
  "By 3 p.m. they reached Rivendell.",
  "Mr. Frodo was exhausted."
]
```

### Direct Segmenter Access

For more control, use the `Segmenter` class directly:

```python
from fast_sentence_segment import Segmenter

segmenter = Segmenter()
results = segmenter.input_text("Your text here.")
```

### Command Line Interface

Segment text directly from the terminal:

```bash
# Text piped via stdin
echo "Have you seen Dr. Who? It's brilliant!" | segment
```

```
Have you seen Dr. Who?
It's brilliant!
```

```bash
# Numbered output
segment -n "Gandalf paused... You shall not pass! The Balrog roared."
```

```
1. Gandalf paused...
2. You shall not pass!
3. The Balrog roared.
```

```bash
# From file
segment -f silmarillion.txt
```

## API Reference

| Function | Parameters | Returns | Description |
|----------|------------|---------|-------------|
| `segment_text()` | `input_text: str`, `flatten: bool = False` | `list[list[str]]`, or `list[str]` when `flatten=True` | Main entry point for segmentation |
| `Segmenter.input_text()` | `input_text: str` | `list[list[str]]` | Cached paragraph-aware segmentation |
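
The "cached" behavior noted for `Segmenter.input_text()` can be pictured with the standard library's `functools.lru_cache`. The sketch below is an assumption about the general technique, not the library's code: `cached_segment` is a hypothetical stand-in that treats blank lines as paragraph breaks and each non-empty line as a sentence.

```python
from functools import lru_cache


@lru_cache(maxsize=1024)
def cached_segment(text: str) -> tuple[tuple[str, ...], ...]:
    """Hypothetical stand-in segmenter, memoized on the exact input string.

    Paragraphs are split on blank lines and each non-empty line counts as
    one "sentence"; a real segmenter would do far more work here.
    """
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return tuple(
        tuple(line.strip() for line in p.splitlines() if line.strip())
        for p in paragraphs
    )


first = cached_segment("One.\nTwo.\n\nThree.")
second = cached_segment("One.\nTwo.\n\nThree.")  # served from the cache
print(first is second)              # the same cached object is returned
print(cached_segment.cache_info())  # reports hits and misses
```

The sketch returns tuples rather than lists so that a caller cannot mutate a result that later calls will receive from the cache.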

### CLI Options

| Option | Description |
|--------|-------------|
| `text` | Text to segment (positional argument) |
| `-f, --file` | Read text from file |
| `-n, --numbered` | Number output lines |
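
For readers who want to build a similar interface, the options in the table map naturally onto `argparse`. This is a hypothetical sketch, not the source of the `segment` command: `build_parser` and `format_sentences` are illustrative names, and the numbering format is copied from the example output shown earlier.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the three options listed in the table."""
    parser = argparse.ArgumentParser(
        prog="segment", description="Segment text into sentences."
    )
    # Positional text is optional so that -f can be used on its own.
    parser.add_argument("text", nargs="?", help="Text to segment")
    parser.add_argument("-f", "--file", help="Read text from file")
    parser.add_argument(
        "-n", "--numbered", action="store_true", help="Number output lines"
    )
    return parser


def format_sentences(sentences: list[str], numbered: bool = False) -> list[str]:
    """Return output lines, prefixed with '1. ', '2. ', ... when numbered."""
    if numbered:
        return [f"{i}. {s}" for i, s in enumerate(sentences, start=1)]
    return list(sentences)


args = build_parser().parse_args(["-n", "Gandalf paused... You shall not pass!"])
print(args.numbered, args.file, args.text)
print(format_sentences(["Gandalf paused...", "You shall not pass!"], args.numbered))
```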

## Why Nested Lists?

The segmentation process preserves document structure by segmenting into both paragraphs and sentences. The outer list represents the document, and each inner list holds the sentences of one paragraph. This is useful for:

- Document structure analysis
- Paragraph-level processing
- Maintaining original text organization

Use `flatten=True` when you only need sentences without paragraph context.
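
If you already hold the nested result, plain Python is enough to work with it. The sketch below uses hand-written sample data in the nested shape (not real library output), and `flatten_paragraphs` and `sentences_per_paragraph` are hypothetical helper names.

```python
from itertools import chain

# Hand-written sample in the nested shape: one inner list per paragraph.
paragraphs = [
    ["Gandalf spoke softly.", "He waited."],
    ["Frodo nodded.", "The weight of the Ring pressed against his chest."],
]


def flatten_paragraphs(nested: list[list[str]]) -> list[str]:
    """Collapse [[sentence, ...], ...] into one flat list of sentences."""
    return list(chain.from_iterable(nested))


def sentences_per_paragraph(nested: list[list[str]]) -> list[int]:
    """Count the sentences in each paragraph, in document order."""
    return [len(paragraph) for paragraph in nested]


print(flatten_paragraphs(paragraphs))
print(sentences_per_paragraph(paragraphs))
```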

## Requirements

- Python 3.9+
- spaCy 3.8+
- en_core_web_sm spaCy model

## How It Works

This library uses spaCy for initial sentence segmentation, then applies surgical post-processing fixes for cases where spaCy's default behavior is incorrect:

1. **Pre-processing**: Normalize numbered lists, preserve ellipses with placeholders
2. **spaCy segmentation**: Use spaCy's sentence boundary detection
3. **Post-processing**: Split on abbreviation boundaries, handle `?`/`!` + capital patterns
4. **Denormalization**: Restore placeholders to original text
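
The placeholder idea in steps 1 and 4 can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's code: a regex splitter stands in for spaCy, the post-processing step is omitted, and `PLACEHOLDER` and the function names are hypothetical.

```python
import re

# Hypothetical token that contains no sentence punctuation.
PLACEHOLDER = "__ELLIPSIS__"


def protect_ellipses(text: str) -> str:
    """Pre-processing: hide '...' so the splitter cannot break on it."""
    return text.replace("...", PLACEHOLDER)


def naive_split(text: str) -> list[str]:
    """Stand-in for spaCy: split on whitespace that follows '.', '?' or '!'."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text) if s]


def restore_ellipses(sentences: list[str]) -> list[str]:
    """Denormalization: put the original '...' back."""
    return [s.replace(PLACEHOLDER, "...") for s in sentences]


def segment(text: str) -> list[str]:
    """Run protect -> split -> restore in order."""
    return restore_ellipses(naive_split(protect_ellipses(text)))


text = "Well... that was close. Run!"
print(naive_split(text))  # without protection, the ellipsis causes a split
print(segment(text))      # with protection, the ellipsis stays in its sentence
```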

## License

MIT License - see [LICENSE](LICENSE) for details.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Run tests (`make test`)
4. Commit your changes
5. Push to the branch
6. Open a Pull Request
