https://github.com/theskyc/l10n-expansion-data
Data-driven text expansion ratios for software localization (l10n) and UI planning.
- Host: GitHub
- URL: https://github.com/theskyc/l10n-expansion-data
- Owner: TheSkyC
- License: cc0-1.0
- Created: 2025-12-12T15:26:45.000Z (7 months ago)
- Default Branch: master
- Last Pushed: 2025-12-13T14:26:23.000Z (7 months ago)
- Last Synced: 2025-12-14T06:58:22.320Z (7 months ago)
- Topics: dataset, expansion-ratio, i18n, l10n, localization, opus-100
- Homepage:
- Size: 143 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# l10n-expansion-data



**Data-driven text expansion ratios for robust software localization and UI planning.**
Stop guessing how much space to leave for German or how much Vietnamese text will expand. This repository provides highly granular statistical expansion ratios derived from massive parallel corpora (currently **OPUS-100**), helping developers and designers prevent UI overflows before translation begins.
All data is generated using the [Expansion Rate Generator (ERG)](https://github.com/TheSkyC/expansion-ratio-generator) tool.
## Why Use This Data?
When translating software, text length changes unpredictably.
* **English → Chinese:** Text usually shrinks (~0.6x).
* **English → German:** Text often expands (~1.3x).
* **Short Strings:** UI labels (e.g., "OK", "New") behave very differently from long paragraphs.
This dataset offers:
1. **Granularity:** Ratios are grouped by source string length (0-20 chars, 20-50 chars, etc.).
2. **Reliability:** Based on millions of sentence pairs, not just heuristics.
3. **Flexibility:** Provides multiple statistical metrics (mean, median, percentiles) to fit different risk tolerances.
## Directory Structure
The data is organized by **Source Corpus** -> **Export Strategy** -> **File**.
```text
data/
└── opus-100/                 # Derived from the Helsinki-NLP OPUS-100 corpus
    ├── detailed/             # Bucketed data with full statistics
    │   ├── opus-100_detailed_mean.json
    │   ├── opus-100_detailed_med.json
    │   └── ... (and CSV/YAML variants)
    │
    └── simple/               # Single global value per language pair
        ├── opus-100_simple_wgt_mean.json
        ├── opus-100_simple_wgt_p75.json
        └── ... (and CSV/YAML variants)
```
## Data Formats & Strategies
We provide two main export strategies: **Simple** and **Detailed**.
### 1. Simple Strategy (`/simple`)
Provides a **single value** per language pair. Ideal for quick estimates or runtime checks.
* **File Naming:** `<corpus>_simple_<scope>_<metric>.json`
    * `<scope>`: `wgt` (weighted average), `max` (worst-case), `bkt0-20` (specific bucket).
    * `<metric>`: `mean`, `med` (median), `p25`, `p75`, `rng25-75` (range).
**Example (`opus-100_simple_bkt0-20_mean.json`):**
```json
{
  "en-sh": 1.1321,
  "en-se": 1.9319,
  "en-ro": 1.1275,
  "en-zh": 0.5633
}
```
**Example (`opus-100_simple_wgt_rng25-75.json`):**
```json
{
  "en-sh": "0.88-1.09",
  "en-se": "0.97-1.98",
  "en-ro": "0.83-1.15",
  "en-zh": "0.26-0.40"
}
```
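The range values are strings, so they have to be split before they can be used in arithmetic. A minimal sketch, assuming every range follows the `low-high` form shown above with non-negative bounds; `parse_range` and `length_bounds` are hypothetical helper names, not part of the dataset or of ERG:

```python
def parse_range(value: str) -> tuple[float, float]:
    """Split a 'low-high' string such as '0.88-1.09' into two floats."""
    low, high = value.split("-", 1)
    return float(low), float(high)


def length_bounds(text: str, value: str) -> tuple[float, float]:
    """Scale the source character count by both ends of the range."""
    low, high = parse_range(value)
    return len(text) * low, len(text) * high


# Sample values copied from the example above.
sample = {"en-ro": "0.83-1.15", "en-zh": "0.26-0.40"}
low, high = length_bounds("Save File", sample["en-ro"])
print(f"'Save File' in Romanian: roughly {low:.1f} to {high:.1f} characters")
```

The upper bound is the more useful number when reserving space, since it reflects the 75th percentile rather than the typical case.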
### 2. Detailed Strategy (`/detailed`)
Provides **bucketed data** with rich statistics. Ideal for dynamic UI layout engines or in-depth analysis.
* **File Naming:** `<corpus>_detailed_<metric>.json`
    * `<metric>`: The primary metric used for the `val` key (e.g., `mean`, `med`).
**Example (`opus-100_detailed_mean.json`):**
```json
{
  "en-zh": {
    "0.0-20.0": {
      "val": 0,
      "count": 176354,
      "std": 0.5426569950687576,
      "min": 0.1,
      "max": 10.0,
      "mean": 0.563295776173526,
      "median": 0.42857142857142855,
      "p25": 0.3333333333333333,
      "p75": 0.6
    },
    "20.0-50.0": {
      "val": 0,
      "count": 286113,
      "std": 0.2895320290537651,
      "min": 0.1,
      "max": 9.954545454545455,
      "mean": 0.38877682122354107,
      "median": 0.32558139534883723,
      "p25": 0.25925925925925924,
      "p75": 0.41379310344827586
    }
  }
}
```
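One way to use the bucketed structure is to pick the bucket that covers the source string's length and read a statistic from it. The sketch below is illustrative only: `find_bucket` and `estimate_length` are hypothetical names, the buckets are assumed to be half-open intervals (`low <= length < high`, which the README does not specify), and the 1.0 fallback for unknown pairs or lengths is also an assumption. It reads a named statistic such as `mean` or `p75` directly instead of relying on `val`.

```python
def find_bucket(buckets: dict, source_length: int):
    """Return the stats dict whose 'low-high' key covers source_length, or None."""
    for key, stats in buckets.items():
        low, high = (float(part) for part in key.split("-", 1))
        if low <= source_length < high:  # assumed half-open interval
            return stats
    return None


def estimate_length(detailed: dict, pair: str, text: str, stat: str = "mean") -> float:
    """Estimate the translated character count using the matching bucket."""
    stats = find_bucket(detailed.get(pair, {}), len(text))
    ratio = stats[stat] if stats else 1.0  # assumed fallback: no change in length
    return len(text) * ratio


# Trimmed copy of the example above.
detailed = {
    "en-zh": {
        "0.0-20.0": {"mean": 0.563295776173526, "p75": 0.6},
        "20.0-50.0": {"mean": 0.38877682122354107, "p75": 0.41379310344827586},
    }
}
print(estimate_length(detailed, "en-zh", "Save File"))         # uses the 0-20 bucket mean
print(estimate_length(detailed, "en-zh", "Save File", "p75"))  # more conservative estimate
```

Passing `stat="p75"` trades a tighter estimate for a lower risk of overflow, which matches the risk-tolerance idea behind the percentile metrics.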
## Which File Should I Use?
With so many files, here's a quick guide:
* **For general UI development (buttons, labels):**
    * Use `simple/opus-100_simple_bkt0-20_mean.json`. This gives you a ratio based on short strings (0-20 characters), which is the most common scenario for UI text.
* **For a single, balanced ratio for your entire app:**
    * Use `simple/opus-100_simple_wgt_mean.json`. This provides a weighted average across all text lengths.
* **For dynamic layout engines that adapt to text length:**
    * Use `detailed/opus-100_detailed_mean.json`. This allows you to look up the appropriate ratio based on the source text's character count.
* **For data analysis or academic research:**
    * Use the `.csv` files in the `detailed/` directory, which can be easily imported into Excel or Pandas.
## Usage Example
### Python
```python
import json

# For simple, quick checks, use the 'simple' weighted mean data.
with open('data/opus-100/simple/opus-100_simple_wgt_mean.json', 'r', encoding='utf-8') as f:
    ratios = json.load(f)


def get_estimated_length(text, target_lang_code):
    """Get a simple estimated length."""
    ratio = ratios.get(f"en-{target_lang_code}", 1.0)  # Fallback to 1.0
    return len(text) * ratio


print(f"Estimated length for 'Save File' in German: {get_estimated_length('Save File', 'de'):.2f}")
```
## Methodology
1. **Source:** We utilize the `train` split from the [OPUS-100](https://huggingface.co/datasets/Helsinki-NLP/opus-100) corpus, which contains up to 1 million sentence pairs per language pair.
2. **Filtering:**
    * Pairs with extreme expansion ratios (<0.1 or >10.0) are discarded as likely alignment errors.
    * Source strings with zero length are excluded.
3. **Calculation:**
    * `Ratio = CharacterLength(Target) / CharacterLength(Source)`
    * All calculations are based on character counts.
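The filtering and calculation steps above can be sketched in a few lines. This is not the ERG implementation: `expansion_ratios` is a hypothetical function, the toy pairs are invented for illustration, and "character length" is assumed to mean Python's `len()` (Unicode code points).

```python
from statistics import mean, median


def expansion_ratios(pairs, low=0.1, high=10.0):
    """Yield (source_length, ratio) for each pair that passes the filters."""
    for source, target in pairs:
        if len(source) == 0:
            continue  # zero-length source strings are excluded
        ratio = len(target) / len(source)
        if ratio < low or ratio > high:
            continue  # extreme ratio, treated as a likely alignment error
        yield len(source), ratio


# Hypothetical toy pairs; the real input is the OPUS-100 train split.
pairs = [
    ("OK", "OK"),
    ("Save File", "Datei speichern"),
    ("", "leer"),      # dropped: empty source
    ("Hi", "x" * 50),  # dropped: ratio 25.0 exceeds 10.0
]
ratios = [ratio for _, ratio in expansion_ratios(pairs)]
print(f"kept={len(ratios)} mean={mean(ratios):.4f} median={median(ratios):.4f}")
```

The yielded source length is what a bucketing step would use to assign each ratio to a range such as 0-20 or 20-50 characters.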
## License & Attribution
### Dataset License
The statistical data in this repository is released under the **CC0 1.0 Universal (Public Domain Dedication)**. You are free to use, modify, and distribute it in any commercial or open-source software without restriction.
### Disclaimer
This repository contains **statistical metadata only**. It does **not** contain any original sentence pairs or text content from the source corpora.
### Source Attribution
This data is derived from the **OPUS-100** corpus. If you use this data in academic work, please cite the original paper:
> **OPUS-100**
> *Zhang, Biao et al. "Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation." ACL (2020).*
```bibtex
@inproceedings{zhang-etal-2020-improving,
    title = "Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation",
    author = "Zhang, Biao and
      Williams, Philip and
      Titov, Ivan and
      Sennrich, Rico",
    editor = "Jurafsky, Dan and
      Chai, Joyce and
      Schluter, Natalie and
      Tetreault, Joel",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.acl-main.148",
    doi = "10.18653/v1/2020.acl-main.148",
    pages = "1628--1639",
}
```