https://github.com/jhnwnstd/suxotin
Python script that distinguishes vowels from consonants using Suxotin's algorithm.
https://github.com/jhnwnstd/suxotin
cryptography decipherment suxotin
Last synced: 23 days ago
JSON representation
Python script that distinguishes vowels from consonants using Suxotin's algorithm.
- Host: GitHub
- URL: https://github.com/jhnwnstd/suxotin
- Owner: jhnwnstd
- License: gpl-3.0
- Created: 2024-04-27T15:54:01.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2026-03-27T05:47:17.000Z (3 months ago)
- Last Synced: 2026-03-27T17:42:05.348Z (3 months ago)
- Topics: cryptography, decipherment, suxotin
- Language: Python
- Homepage:
- Size: 10.5 MB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Vowel Identification Algorithms
Unsupervised, language-agnostic classification of letters as vowels or consonants from raw text. No linguistic priors — purely statistical. Tested on 502 languages.
## Algorithms
| Script | Method | F1 |
|---|---|---|
| `classify.py` | Ensemble of SVD + Sukhotin + accent propagation | **0.965** |
| `algorithm1.py` | SVD spectral decomposition ([Thaine & Penn 2017](https://aclanthology.org/W17-4109/)) | 0.942 |
| `suxotin.py` | Sukhotin's adjacency algorithm (1962) | 0.919 |
**`classify.py`** is the recommended entry point. It uses SVD as the primary classifier, Sukhotin to detect and correct SVD label flips, and Unicode decomposition to propagate vowel status to accented variants.
## Results
Across 502 languages, the combined classifier:
- Correctly identifies all 5 basic vowels in **85%** of languages
- Corrects **20** SVD label-flip errors using Sukhotin's orientation
- Adds accent-propagated vowels in **140** languages
- Filters out Suxotin false positives in **270** languages
- Achieves perfect F1 on English, German, Spanish, Finnish, Portuguese, Czech, Polish, Indonesian, Latin, Estonian, and 10+ others
## Usage
```bash
pip install -r requirements.txt
python classify.py # recommended — writes classification_output.txt
python algorithm1.py # SVD only — writes algorithm1_output.txt
python suxotin.py # Sukhotin only — writes suxotin_algorithm_output.txt
```
Evaluate against ground truth (32 languages):
```bash
python -c "from classify import run; from evaluate import evaluate; evaluate(run)"
```
## How It Works
**SVD** builds a binary matrix of letters vs. trigram contexts (p-frames), then uses the second right singular vector to split letters into two clusters. The cluster with higher mean frequency is labeled vowels.
**Sukhotin's** builds a character adjacency matrix from within-word bigrams (excluding whitespace — critical for avoiding false positives). It iteratively selects the highest-sum character as a vowel and subtracts its adjacency contributions until no positive sums remain.
**Ensemble** runs both, uses Sukhotin to correct SVD when the two disagree on cluster orientation, then propagates vowel status to accented variants via Unicode NFKD decomposition.
## Repository Structure
```
classify.py Combined classifier (recommended)
algorithm1.py SVD spectral decomposition
suxotin.py Sukhotin's adjacency algorithm
evaluate.py Evaluation harness (32 languages with ground truth)
utils.py Shared I/O and preprocessing
lang_code.csv Language code → name mapping
Test/data/ Text corpus per language (502 files)
visualizations/ Per-language classification score charts
Literature/ Reference papers
```
## References
- Sukhotin, B.V. (1962). *Eksperimental'noe vydelenie klassov bukv s pomoshch'ju elektronnoj vychislitel'noj mashiny*.
- Thaine, P. & Penn, G. (2017). [Vowel and Consonant Classification through Spectral Decomposition](https://aclanthology.org/W17-4109/). Workshop on Subword and Character Level Models in NLP, pp. 82–91.
## License
GPL-3.0 — see [LICENSE](LICENSE).