https://github.com/thunderpoot/isogloss
ISO 639 and IETF Language Code Lookup Tool
- Host: GitHub
- URL: https://github.com/thunderpoot/isogloss
- Owner: thunderpoot
- License: mit
- Created: 2024-01-31T15:35:48.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-07-07T14:47:56.000Z (10 months ago)
- Last Synced: 2024-07-07T15:31:25.784Z (10 months ago)
- Topics: bcp47, command-line, command-line-tool, ietf-language-tag, ietf-language-tags, iso-3166-1, iso639, iso639-1, iso639-2, iso639-3, language-classification, languages, locales, localization, python, shell-script
- Language: Python
- Homepage: https://thunderpoot.github.io/isogloss/
- Size: 1.58 MB
- Stars: 5
- Watchers: 1
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

# 🌐 iso·gloss

### ISO 639 and IETF Language Code Lookup Tool
`isogloss` is a Python-based command-line tool for looking up language details from [ISO 639](https://www.iso.org/iso-639-language-code) codes and IETF ([BCP-47](https://www.rfc-editor.org/info/bcp47)) language tags. For each code or tag it reports the language's names, native names, scope, type, and equivalent codes in the other ISO 639 parts.

There is also a [web-based version](https://thunderpoot.github.io/isogloss). Its [BCP47 parser](https://thunderpoot.github.io/isogloss/bcp-index.html) has some known issues, documented below in the "Errata" section.

Elsewhere, [the word isogloss](https://en.wikipedia.org/wiki/Isogloss) means a boundary line on a map denoting the regional use of a particular linguistic characteristic, but in this case it just seemed to fit.
## Features
- Look up language details using ISO 639-1, 639-2/B, 639-2/T, or 639-3 codes.
- Look up language details by language name.
- Look up language details using IETF BCP-47 language tags.
  - Examples: `en-GB`, `en-US`, `sv-SE`, `zh-cmn-Hans-CN-pinyin-ud1-p9t4-x-private1`, and so on.

## Installation
Clone the repository to your local machine:
```
git clone https://github.com/thunderpoot/isogloss.git
```

Create a virtual environment and install the requirements:
```
python3.11 -m venv venv
source venv/bin/activate
pip install unidecode
```

## Usage
The script can be run directly from the command line. Below are some examples of how to use it:
To look up information by ISO 639 code:
```
$ isogloss/isogloss.py -c swe
{
"639-1": "sv",
"Scope": "Individual",
"Type": "Living",
"Native name(s)": "svenska",
"Other name(s)": "",
"639-2/T": "swe",
"639-2/B": "",
"639-3": "swe",
"Name(s)": "Swedish"
}
```

To look up information by language name:
```
$ isogloss/isogloss.py -n "egyptian arabic"
{
"Egyptian Arabic": "arz"
}
```

Example of a lookup via native name:
```
$ isogloss/isogloss.py -n 日本語
{
"\u65e5\u672c\u8a9e Nihongo": "jpn"
}
```

Example of multiple results being found:
```
$ isogloss/isogloss.py -n norwegian
{
"Norwegian Nynorsk": "nno",
"Nynorsk, Norwegian": "nno",
"Bokm\u00e5l, Norwegian": "nob",
"Norwegian Bokm\u00e5l": "nob",
"Norwegian": "nor",
"Norwegian Sign Language": "nsl",
"Traveller Norwegian": "rmg"
}
```

Language names are normalised, allowing for case-insensitive and accent-insensitive matching when searching:
```
$ isogloss/isogloss.py -n espanol
{
"Judeo-espa\u00f1ol": "lad",
"espa\u00f1ol": "spa"
}
```

To look up information by IETF language tag:
```
$ isogloss/isogloss.py -i fr-FR
{
"Language": {
"639-1": "fr",
"Scope": "Individual",
"Type": "Living",
"Native name(s)": "fran\u00e7ais",
"Other name(s)": "",
"639-2/T": "fra",
"639-2/B": "fre",
"639-3": "fra",
"Name(s)": "French"
},
"Region": "France"
}
```

```
$ isogloss/isogloss.py -i zh-cmn-Hans-CN-pinyin-ud1-p9t4-x-private1
{
"Primary Language": {
"639-1": "zh",
"639-2/B": "chi",
"639-2/T": "zho",
"639-3": "zho",
"Deprecated": false,
"Name(s)": "Chinese",
"Native name(s)": "\u4e2d\u6587 Zh\u014dngw\u00e9n; \u6c49\u8bed; \u6f22\u8a9e H\u00e0ny\u01d4",
"Other name(s)": "",
"Scope": "Macrolanguage",
"Type": "Living"
},
"Extended Languages": [
{
"639-1": "",
"639-2/B": "",
"639-2/T": "",
"639-3": "cmn",
"Deprecated": false,
"Name(s)": "Mandarin Chinese",
"Native name(s)": "",
"Other name(s)": "",
"Scope": "Individual",
"Type": "Living"
}
],
"Script": "Han (Simplified variant)",
"Region": "China",
"Variant": "pinyin",
"Extension": "ud1-p9t4",
"Private Use": "x-private1"
}
```

```
$ isogloss/isogloss.py -i ar-ajp-apc-apd-Arab-CV-arevela-g-231243-r-sdarre-x-private-x-private1 | jq
{
"Primary Language": {
"639-1": "ar",
"639-2/B": "",
"639-2/T": "ara",
"639-3": "ara",
"Deprecated": false,
"Name(s)": "Arabic",
"Native name(s)": "العربية; al'Arabiyyeẗ",
"Other name(s)": "",
"Scope": "Macrolanguage",
"Type": "Living"
},
"Extended Languages": [
{
"639-1": "",
"639-2/B": "",
"639-2/T": "",
"Deprecated": true,
"Language Name(s)": "South Levantine Arabic",
"Language Type": "Living",
"Native name(s)": "",
"Other name(s)": "",
"Scope": "Individual"
},
{
"639-1": "",
"639-2/B": "",
"639-2/T": "",
"639-3": "apc",
"Deprecated": false,
"Name(s)": "Levantine Arabic",
"Native name(s)": "",
"Other name(s)": "",
"Scope": "Individual",
"Type": "Living"
},
{
"639-1": "",
"639-2/B": "",
"639-2/T": "",
"639-3": "apd",
"Deprecated": false,
"Name(s)": "Sudanese Arabic",
"Native name(s)": "",
"Other name(s)": "",
"Scope": "Individual",
"Type": "Living"
}
],
"Script": "Arabic",
"Region": "Cabo Verde",
"Variant": "arevela",
"Extension": "g-231243-r-sdarre",
"Private Use": "x-private-x-private1"
}
```

## Files
- `data/consolidated_langs.json`: Contains language data in JSON format used for the lookup.
- `data/region_names.json`: Contains region data in JSON format used for the BCP47 lookup.
- `data/script_codes.json`: Contains script code data in JSON format used for the BCP47 lookup.
- `data/deprecated-639-3.csv`: Contains deprecated ISO 639-3 codes in CSV format, for quick reference.

## Errata
The BCP47 parser in the web interface has known issues. It validates input with regular expressions, with the following results:
### Examples of valid tags:
- `en`
- `fr-CA`
- `i-klingon`
- `az-Arab-IR`
- `sr-Cyrl-RS`
- `zh-cmn-Hans`
- `ja-JP-x-tokyo`
- `uz-Cyrl-UZ-1992`
- `bo-Tibt-x-dialect`
- `zh-cmn-Hans-CN-x-private1`
- `hy-Latn-IT-arevela-x-test`
### Examples of invalid tags (malformed):
- `en-GB-oed-x-private`
- `de-CH-1901-co-phonebk-sc-gothic-x-bavaria`
(and more)
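
The valid and malformed examples above can be reproduced with a small regular-expression check. The following is only a sketch of a simplified RFC 5646 grammar, not the regular expression used by the web interface; the names `GRANDFATHERED`, `LANGTAG`, and `parse_tag` are hypothetical, and private-use-only tags such as `x-whatever` are deliberately left out.

```python
import re

# Irregular grandfathered tags listed in RFC 5646; they bypass the grammar.
GRANDFATHERED = {
    "en-gb-oed", "i-ami", "i-bnn", "i-default", "i-enochian", "i-hak",
    "i-klingon", "i-lux", "i-mingo", "i-navajo", "i-pwn", "i-tao",
    "i-tay", "i-tsu", "sgn-be-fr", "sgn-be-nl", "sgn-ch-de",
}

# Simplified "langtag" production: language, extlangs, script, region,
# variants, extensions, private use (in that order).
LANGTAG = re.compile(
    r"""
    (?P<language>[a-z]{2,3}(?P<extlang>(?:-[a-z]{3}){0,3})|[a-z]{4,8})
    (?:-(?P<script>[a-z]{4}))?
    (?:-(?P<region>[a-z]{2}|[0-9]{3}))?
    (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)
    (?P<extensions>(?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*)
    (?:-(?P<privateuse>x(?:-[a-z0-9]{1,8})+))?
    """,
    re.IGNORECASE | re.VERBOSE,
)

# One extension: a singleton (any letter or digit except "x") plus subtags.
EXTENSION = re.compile(r"-([0-9a-wy-z](?:-[a-z0-9]{2,8})+)", re.IGNORECASE)


def parse_tag(tag):
    """Return the tag's components as a dict, or None if it is malformed."""
    if tag.lower() in GRANDFATHERED:
        return {"grandfathered": tag}
    match = LANGTAG.fullmatch(tag)
    if match is None:
        return None
    return {
        "language": match["language"].split("-")[0],
        "extlangs": (match["extlang"] or "").split("-")[1:],
        "script": match["script"],
        "region": match["region"],
        "variants": match["variants"].split("-")[1:],
        "extensions": EXTENSION.findall(match["extensions"]),
        "privateuse": match["privateuse"],
    }


print(parse_tag("ca-valencia-nedis")["variants"])   # ['valencia', 'nedis']
print(parse_tag("en-US-u-islamcal")["extensions"])  # ['u-islamcal']
print(parse_tag("en-GB-oed-x-private"))             # None
```

Under this stricter reading, the Usage example `zh-cmn-Hans-CN-pinyin-ud1-p9t4-x-private1` is also rejected, because `ud1` is not a single-character extension singleton; the command-line tool is evidently more lenient there.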
### Examples of inputs that reveal parsing bugs:
- `ca-valencia-nedis`
  (Highlighted input section is missing "valencia")
- `en-US-u-islamcal`
  (Variant "u" and Extension "islamcal", Extension section says "u - islamcal")
- `es-419-fonipa`
  (Extended languages blank)
- `de-Latf-1901`
  (Region undefined)
- `sl-rozaj`
  (rozaj is coloured differently in the result container to how it is in the highlighted input section)

## Contributing
Contributions, issues, and feature requests are welcome!
## Author
Written by T E Vaughan
## Sponsorship
[GitHub Sponsors](https://github.com/sponsors/thunderpoot)
If you find this project useful, please consider sponsoring my work. <3
## Related Standards and RFCs
The codes used in this program conform to the following standards and RFCs:
### Standards
- [ISO 639](https://www.iso.org/iso-639-language-code) Language codes
- [ISO 3166-1 alpha-2](https://www.iso.org/iso-3166-country-codes.html) Country codes
- [ISO 15924](https://www.unicode.org/iso15924/) Script codes

### RFCs
- [RFC 1766](https://www.ietf.org/rfc/rfc1766.txt) Tags for the Identification of Languages
- [RFC 4646](https://www.ietf.org/rfc/rfc4646.txt) Tags for Identifying Languages
- [RFC 4647](https://www.ietf.org/rfc/rfc4647.txt) Matching of Language Tags

## License
This project is [MIT licensed](https://opensource.org/licenses/MIT).