{"id":44537482,"url":"https://github.com/ysdede/trnorm","last_synced_at":"2026-02-13T18:51:45.341Z","repository":{"id":280501264,"uuid":"941104979","full_name":"ysdede/trnorm","owner":"ysdede","description":"Turkish text normalization tools for ASR (Automatic Speech Recognition) benchmarking and evaluation.","archived":false,"fork":false,"pushed_at":"2025-04-07T16:59:33.000Z","size":253,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-07T17:42:59.071Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ysdede.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-01T13:56:43.000Z","updated_at":"2025-04-07T16:59:37.000Z","dependencies_parsed_at":"2025-03-03T19:49:35.582Z","dependency_job_id":null,"html_url":"https://github.com/ysdede/trnorm","commit_stats":null,"previous_names":["ysdede/trnorm"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ysdede/trnorm","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ysdede%2Ftrnorm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ysdede%2Ftrnorm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ysdede%2Ftrnorm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ysdede%2Ftrnorm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ysdede","download_url":"https://codeload.github.com/ysdede/trnorm/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ysdede%2Ftrnorm/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29414285,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-13T06:24:03.484Z","status":"ssl_error","status_checked_at":"2026-02-13T06:23:12.830Z","response_time":78,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-02-13T18:51:44.789Z","updated_at":"2026-02-13T18:51:45.304Z","avatar_url":"https://github.com/ysdede.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# TRNorm\n\nTurkish text normalization tools for ASR (Automatic Speech Recognition) benchmarking and evaluation.\n\n[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)\n\n## Overview\n\nTRNorm is a specialized Python package designed for normalizing Turkish text in ASR evaluation contexts. It provides tools to standardize text representations of numbers, ordinals, and symbols to ensure fair comparison between ASR system outputs and reference transcriptions.\n\nThis package is specifically created for fairer ASR benchmarking, not as a comprehensive Turkish NLP solution. The primary goal is to normalize references and predictions to enable more accurate evaluation of ASR models and to mitigate errors from weakly labeled audio datasets.\n\n## Features\n\n- **Number to Text Conversion**: Convert numeric values to their Turkish text representation\n- **Ordinal Number Normalization**: Convert ordinal numbers to their Turkish text representation\n- **Roman Numeral Processing**: Convert Roman numerals to Arabic numbers and optionally normalize Roman ordinals in text\n- **Symbol Conversion**: Convert special symbols (like %, €, $) to their text representation\n- **Turkish Suffix Handling**: Add Turkish suffixes (ile, ise, iken) to words following vowel harmony rules\n- **Text Utilities**: Various text utility functions for Turkish language processing\n- **Metrics**: Text similarity metrics (WER, CER, Levenshtein distance)\n- **Legacy Normalizer**: Backward compatibility with previous normalizer implementation\n\n## Installation\n\n```bash\npip install trnorm\n```\n\n## Usage\n\n### Number to Text Conversion\n\n```python\nfrom trnorm import NumberToTextConverter, convert_numbers_to_words_wrapper\n\n# Convert a single number\nconverter = NumberToTextConverter()\nprint(converter.convert(\"42\"))  # \"kırk iki\"\n\n# Convert numbers in a text\ntext = \"Bugün 25 Nisan 2025 tarihinde 42 kişi katıldı.\"\nnormalized = convert_numbers_to_words_wrapper(text)\nprint(normalized)  # \"Bugün yirmi beş Nisan iki bin yirmi beş tarihinde kırk iki kişi katıldı.\"\n\n# Convert numbers with apostrophes\ntext = \"1960'lı yıllarda 100'lerce insan katıldı.\"\nnormalized = convert_numbers_to_words_wrapper(text)\nprint(normalized)  # \"bin dokuz yüz altmış'lı yıllarda yüz'lerce insan katıldı.\"\n\n# Convert numbers with divide symbols\ntext = \"7/24 hizmet veriyoruz ve işin 2/3'ü tamamlandı.\"\nnormalized = convert_numbers_to_words_wrapper(text)\nprint(normalized)  # \"yedi/yirmi dört hizmet veriyoruz ve işin iki/üç'ü tamamlandı.\"\n```\n\n### Ordinal Number Normalization\n\n```python\nfrom trnorm import normalize_ordinals\n\n# Arabic ordinals\ntext = \"1. sırada 2'nci kişi ve 3'üncü grup\"\nnormalized = normalize_ordinals(text)\nprint(normalized)  # \"birinci sırada ikinci kişi ve üçüncü grup\"\n\n# Roman ordinals (disabled by default)\ntext = \"XX. yüzyılda II. Dünya Savaşı yaşandı.\"\n# Default behavior - Roman ordinals are not converted\nnormalized = normalize_ordinals(text)\nprint(normalized)  # \"XX. yüzyılda II. Dünya Savaşı yaşandı.\"\n\n# Enable Roman ordinals conversion\nnormalized = normalize_ordinals(text, convert_roman_ordinals=True)\nprint(normalized)  # \"yirminci yüzyılda ikinci Dünya Savaşı yaşandı.\"\n```\n\n### Roman Numeral Processing\n\n```python\nfrom trnorm import roman_to_arabic, is_roman_numeral, find_roman_ordinals\n\n# Convert Roman numerals to Arabic numbers\nprint(roman_to_arabic(\"XIV\"))  # 14\nprint(roman_to_arabic(\"MCMXCIX\"))  # 1999\n\n# Check if a string is a valid Roman numeral\nprint(is_roman_numeral(\"XIV\"))  # True\nprint(is_roman_numeral(\"ABC\"))  # False\n\n# Find Roman ordinals in text\ntext = \"XX. yüzyılda II. Dünya Savaşı yaşandı.\"\nordinals = find_roman_ordinals(text)\nprint(ordinals)  # [('XX', 'yüzyılda', 0), ('II', 'Dünya', 12)]\n```\n\n### Symbol Conversion\n\n```python\nfrom trnorm import convert_symbols\n\n# Convert symbols to text\ntext = \"Ürün %20 indirimli ve fiyatı 50€.\"\nnormalized = convert_symbols(text)\nprint(normalized)  # \"Ürün yüzde yirmi indirimli ve fiyatı elli avro.\"\n```\n\n### Turkish Suffix Handling\n\n```python\nfrom trnorm import ekle\n\n# Add \"ile\" suffix (with)\nprint(ekle(\"Ankara\", \"ile\"))  # \"Ankarayla\"\nprint(ekle(\"İstanbul\", \"ile\"))  # \"İstanbulla\"\n\n# Add \"ise\" suffix (if)\nprint(ekle(\"Ankara\", \"ise\"))  # \"Ankaraysa\"\nprint(ekle(\"İstanbul\", \"ise\"))  # \"İstanbulsa\"\n\n# Add \"iken\" suffix (while/when)\nprint(ekle(\"çalışıyor\", \"iken\"))  # \"çalışıyorken\"\nprint(ekle(\"evde\", \"iken\"))  # \"evdeyken\"\n```\n\n### Text Utilities\n\n```python\nfrom trnorm import turkish_lower, turkish_upper, turkish_capitalize\n\nprint(turkish_lower(\"İSTANBUL\"))  # \"istanbul\"\nprint(turkish_upper(\"istanbul\"))  # \"İSTANBUL\"\nprint(turkish_capitalize(\"istanbul\"))  # \"İstanbul\"\n```\n\n### Metrics\n\n```python\nfrom trnorm import wer, cer, levenshtein_distance\n\nreference = \"bu bir test cümlesidir\"\nhypothesis = \"bu bir deneme cümlesi\"\n\nprint(wer(reference, hypothesis))  # Word Error Rate\nprint(cer(reference, hypothesis))  # Character Error Rate\nprint(levenshtein_distance(reference, hypothesis))  # Levenshtein Distance\n```\n\n### Legacy Normalizer\n\n```python\nfrom trnorm import normalize_text, replace_hatted_characters\n\n# Basic normalization\ntext = \"âîôû Çok iyi ve nazik biriydi. Prusya'daki ilk karşılaşmamızda onu konuşturmayı başarmıştım.\"\nnormalized = normalize_text(text)\nprint(normalized)  # \"aiou çok iyi ve nazik biriydi prusyadaki ilk karşılaşmamızda onu konuşturmayı başarmıştım\"\n\n# Only replace hatted characters\ntext_with_hats = \"âîôû Çok iyi\"\nprint(replace_hatted_characters(text_with_hats))  # \"aiou Çok iyi\"\n\n# Process a list of texts\ntexts = [\"Turner'ın 'Köle Gemisi' isimli tablosuna bakıyoruz.\", \"Turner'ın Köle Gemisi isimli tablosuna bakıyoruz.\"]\nnormalized_texts = normalize_text(texts)\n```\n\n## Examples\n\nThe package includes several example scripts in the `examples` directory:\n\n- `demo_ordinals.py`: Demonstrates ordinal number normalization\n- `demo_iken.py`: Shows Turkish suffix handling\n- `demo_text_utils.py`: Illustrates text utility functions\n- `demo_num_to_text.py`: Shows number to text conversion\n- `demo_metrics.py`: Demonstrates text similarity metrics\n- `demo_legacy_normalizer.py`: Shows legacy normalizer functionality\n\n## Development\n\n### Running Tests\n\n```bash\npython -m unittest discover\n```\n\n### Contributing\n\nContributions are welcome! Please feel free to submit a Pull Request.\n\n## License\n\nApache License 2.0\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fysdede%2Ftrnorm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fysdede%2Ftrnorm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fysdede%2Ftrnorm/lists"}