https://github.com/pre63/super-maximal-repeats
- Host: GitHub
- URL: https://github.com/pre63/super-maximal-repeats
- Owner: pre63
- License: apache-2.0
- Created: 2025-11-16T17:42:26.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-11-16T19:14:18.000Z (8 months ago)
- Last Synced: 2025-11-16T21:10:20.720Z (8 months ago)
- Language: Python
- Size: 12.7 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Super Maximal Repeats
A Python library for computing super-maximal repeats in a string or a collection of documents, implemented in C++ and exposed to Python via pybind11. It is based on the enhanced suffix array algorithm, which finds the repeats in linear time once the suffix array is available; in practice the overall cost is O(n log² n) because the suffix array construction relies on sorting.
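To make the O(n log² n) figure concrete, here is a minimal pure-Python sketch of prefix doubling, one common sorting-based suffix array construction with that bound. It is an illustration only: the function name `suffix_array_prefix_doubling` is hypothetical, and the library's C++ code may build its suffix array differently.

```python
def suffix_array_prefix_doubling(s):
    """Return the suffix array of s (start indices of suffixes in sorted order).

    Illustrative sketch, not the library's implementation.
    Each round sorts suffixes by their first 2k characters using the ranks
    from the previous round. There are O(log n) rounds and each one does an
    O(n log n) comparison sort, hence O(n log^2 n) overall.
    """
    n = len(s)
    if n == 0:
        return []
    sa = list(range(n))
    rank = [ord(c) for c in s]  # round 0: rank by the first character
    k = 1
    while True:
        # Sort key: rank of the first k chars, then rank of the next k chars
        # (-1 when the suffix is shorter than that).
        def key(i):
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)
        new_rank = [0] * n
        for a, b in zip(sa, sa[1:]):
            new_rank[b] = new_rank[a] + (key(a) != key(b))
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all ranks distinct: fully sorted
            break
        k *= 2
    return sa


print(suffix_array_prefix_doubling("banana"))  # [5, 3, 1, 0, 4, 2]
```

The enhanced suffix array adds further tables (such as the longest-common-prefix array) on top of this ordering; those are not shown here.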
Super-maximal repeats are maximal repeats that are not substrings of any longer maximal repeat. They are useful for tasks like detecting machine-generated text, as described in the [paper "Unsupervised and Distributional Detection of Machine-Generated Text" by Matthias Gallé et al.](https://arxiv.org/abs/2111.02878).
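The definition can be checked on small inputs with a brute-force sketch. The code below is not the library's algorithm (it takes roughly cubic time) and the function names are hypothetical. It assumes the usual definition of a maximal repeat: a substring occurring at least twice whose occurrences are not all preceded by the same character and not all followed by the same character.

```python
from collections import defaultdict


def naive_maximal_repeats(s):
    """Return the set of maximal repeats of s by brute force."""
    n = len(s)
    starts = defaultdict(list)
    for i in range(n):
        for j in range(i + 1, n + 1):
            starts[s[i:j]].append(i)

    maximal = set()
    for w, positions in starts.items():
        if len(positions) < 2:
            continue  # occurs only once, so not a repeat
        # Characters just before / after each occurrence (None at a string boundary).
        left = {s[i - 1] if i > 0 else None for i in positions}
        right = {s[i + len(w)] if i + len(w) < n else None for i in positions}
        # Maximal: no single left or right extension covers every occurrence.
        if len(left) > 1 and len(right) > 1:
            maximal.add(w)
    return maximal


def naive_supermaximal_repeats(s):
    """Keep only maximal repeats that are not contained in another maximal repeat."""
    maximal = naive_maximal_repeats(s)
    return {w for w in maximal if not any(w != v and w in v for v in maximal)}


s = "abcdabcebc"
print(sorted(naive_maximal_repeats(s)))       # ['abc', 'bc']
print(sorted(naive_supermaximal_repeats(s)))  # ['abc']
```

In this example `bc` is a maximal repeat (it occurs three times with differing contexts), but it is contained in the longer maximal repeat `abc`, so only `abc` is super-maximal.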
## Installation
```
make install
```
## Usage
```python
import smr

# Compute super-maximal repeats for a single string (character-level by default)
repeats = smr.find_supermaximal_repeats("your_example_string_here", min_len=20, min_occ=3)

# Each repeat is a Repeat object with 'doc_idx' (always 0 for a single string),
# 'start' (starting position), 'len' (length), and 'text' (the repeat substring)
for r in repeats:
    print(f"Repeat: {r.text} (doc_idx: {r.doc_idx}, start: {r.start}, len: {r.len})")

# Compute super-maximal repeats across multiple documents
docs = ["your_first_document_here", "your_second_document_here"]
repeats_docs = smr.find_supermaximal_repeats_docs(docs, min_len=20, min_occ=3, mode="char")
for r in repeats_docs:
    print(f"Repeat: {r.text} (doc_idx: {r.doc_idx}, start: {r.start}, len: {r.len})")
```
- `min_len`: Minimum length of repeats (default: 1).
- `min_occ`: Minimum number of occurrences (default: 2).
- `mode` (documents only): `"char"` for character-level repeats (default), or `"word"` for word-level repeats. In word mode, documents are split on whitespace, positions and lengths are counted in words, and `text` is the words joined by single spaces.
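As a rough illustration of how word-level positions relate to the original document, the sketch below maps a word `start` and length back to text. It assumes tokenization equivalent to Python's `str.split()`; the helper `word_span_text` is hypothetical and not part of `smr`, and the library's actual tokenization may differ in edge cases.

```python
def word_span_text(doc, start, length):
    """Return the space-joined words doc_words[start : start + length].

    Assumes whitespace tokenization as done by str.split(); this is an
    illustration of the word-mode convention, not the library's code.
    """
    words = doc.split()
    return " ".join(words[start:start + length])


doc = "the quick  brown fox\njumps over the quick brown dog"
print(word_span_text(doc, 1, 2))  # quick brown
print(word_span_text(doc, 6, 3))  # the quick brown
```

Because the text is rebuilt with single spaces, the original spacing and line breaks are not preserved in the result.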
## Notes
- For large inputs (n > 10^6), performance may degrade because of the O(n log² n) suffix array construction; typical text sizes are handled efficiently.
- Both functions return a list of `Repeat` objects. Each object represents one unique super-maximal repeat and records a single example occurrence (document index, starting position, length, and text), not every occurrence.
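Since each result carries only one example occurrence, the remaining positions have to be recovered separately if they are needed. A minimal character-level sketch using only `str.find` is shown below; `all_occurrences` is a hypothetical helper, not part of `smr`, and it would be called with the `text` of a repeat and the document it came from.

```python
def all_occurrences(text, pattern):
    """Return every start index of pattern in text, including overlapping ones."""
    positions = []
    i = text.find(pattern)
    while i != -1:
        positions.append(i)
        i = text.find(pattern, i + 1)  # step by 1 so overlapping matches are kept
    return positions


print(all_occurrences("abcdabcebc", "bc"))  # [1, 5, 8]
print(all_occurrences("aaaa", "aa"))        # [0, 1, 2]
```

For word-mode results the same idea would apply to lists of words instead of characters.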