Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/stypox/dicio-numbers

Dicio's multilingual number parsing and formatting library, inspired by Mycroft's lingua-franca
https://github.com/stypox/dicio-numbers

date dicio dicio-assistant i18n multilanguage number-formatter number-parser numbers

Last synced: 9 days ago
JSON representation

Dicio's multilingual number parsing and formatting library, inspired by Mycroft's lingua-franca

Awesome Lists containing this project

README

        

# Number parser and formatter for Dicio assistant
This *multilingual* library implements methods to **extract** numbers, dates or durations from text and **format** numbers, dates or dutations into human-readable words. It is inspired by [Mycroft's lingua-franca](https://github.com/MycroftAI/lingua-franca), from which it borrows some resource files. Currently only Italian (it-it) and English (en-us) are supported, and methods to extract dates or durations are still unavailable (though formatting is).

This repository is part of the **Dicio** project. Also check out [`dicio-android`](https://github.com/Stypox/dicio-android), [`dicio-sentences-compiler`](https://github.com/Stypox/dicio-sentences-compiler/) and [`dicio-skill`](https://github.com/Stypox/dicio-skill/). *Open to contributions :-D*

## Adding a language

You will need to translate some resource files, containing words but also regex, and then adapt some Java code, so be prepared for that.

First of all you need to obtain the `language-country` pair for the language you want to add. This is important so that who is using the library can choose the language to use. For example, English is `en-us` and Italian is `it-it`. Let's call this `LANGUAGE_COUNTRY` from now on.

### Resources

Copy the whole folder `numbers/src/main/resources/config/en-us` into a new folder `numbers/src/main/resources/config/LANGUAGE_COUNTRY`. All of the resource files in the new folder should be translated into the new language. **DO NOT** rename any file, just edit their contents.

### `*.word` files

The files named like `ENGLISH_WORD.word` should contain only one line with the lowercase translation of the word `ENGLISH_WORD` in the new language. For example, the English `second.word` contains "second", the Italian one contains "secondo" and the new language one should contain the translation of "second" into that language. These files are **also present in Mycroft's [lingua-franca](https://github.com/MycroftAI/lingua-franca/blob/master/lingua_franca/res/text/), so copy them from there** to save time! Then just check if everything is fine.

### `date_time.json` file

This file contains the data needed to properly *format* dates, and in part also times, though that's also handled elsewhere. This file is **also present in Mycroft's [lingua-franca](https://github.com/MycroftAI/lingua-franca/blob/master/lingua_franca/res/text/), so copy it from there** to save time! Then just check if everything is fine.

Each of the formats provided in this file can (and shall) contain references to other already-formatted strings to be substituted, in the form `{FORMATTED_STRING_NAME}`. For example, English uses `"{formatted_date} at {formatted_time}"` as the way to format together both the date and the time.

#### `"weekday"`, `"date"`, `"month"`, `"number"`

These JSON objects should contain:
- `"weekday"`: a numbered list of how the 7 days in the week are pronounced, where monday is at position **0**
- `"date"`: a numbered list of how the at-most-31 days in the month are pronounced, where the first day is at position **1**
- `"month"`: a numbered list of how the 12 months are pronounced, where january is at position **1**
- `"number"`: a list of `"NUMBER": "PRONOUNCIATION"` pairs. It should contain the pronounciation of all numbers that you want to use in [`"decade_format"`, `"hundreds_format"` and `"thousand_format"`](#decade_format-hundreds_format-thousand_format).

#### `"decade_format"`, `"hundreds_format"`, `"thousand_format"`

These JSON objects should contain a numbered list (starting from 1, but it doesn't matter) of `(regex, format)` pairs, along with a `"default"` format for when none of the regexes match. Each `(regex, format)` pair should be a JSON object:
- `"match"` contains the JSON-escaped pattern to match, starting with `^` and ending with `$`. The pattern will be matched against the part of the year corresponding to `"decade_format"` (only last two digits), `"hundreds_format"` (last three digits) **or** `"thousand_format"` (last four digits).
- `"format"` (and the `"default"` mentioned before) contains the format to apply. Refer to the table below for which substitutions you can do.

In these JSON object you should put partial year formatting results that will be finally used by `"year_format"`. It doesn't matter if for some language, instead of just formatting e.g. the thousands digit in `"thousand_format"`, you also sometimes format the hundreds digit (this is what happens for English). The important thing is that, at the end, `"year_format"` spits out correct results.

This is the table of possible `FORMATTED_STRING_NAME`s you can use when having to do with `"decade_format"`, `"hundreds_format"` or `"thousand_format"`. Check out [NiceYearSubstitutionTableBuilder.java](./numbers/src/main/java/org/dicio/numbers/formatter/datetime/NiceYearSubstitutionTableBuilder.java). The examples are in English and relative to the years "2019" and "3865 b.c.".


FORMATTED_STRING_NAME
Explanation
e.g. "2019"
e.g. "-3865"


x
the last digit of the year
"nine"
"five"


xx
the last two digits of the year, if there is a corresponding entry in the "number" list
"nineteen"
"" (65 is not special and thus has no entry in the "number" list)


x0
the tens of the year
"ten"
"sixty"


x_in_x0
the tens digit of the year
"one"
"six"


xxx
the last three digits of the year, if there is a corresponding entry in the "number" list
"nineteen" (there is no hundreds digit)
"" (865 is not special and thus has no entry in the "number" list)


x00
the hundreds of the year, if there is a corresponding entry in the "number" list
"zero" (probably doesn't make much sense)
"" (English pronounces hundreds by putting "hundred" after the unit, so there is no entry for 800 in the "number" list)


x_in_x00 and x_in_0x00
the hundreds digit of the year
"zero"
"eight"


xx00
the year with its tens and its units digits set to 0, if there is a corresponding entry in the "number" list
"" (2000 is not special and thus has no entry in the "number" list)
"" (3800 is not special and thus has no entry in the "number" list)


xx_in_xx00
the year divided by 100, if there is a corresponding entry in the "number" list
"twenty"
"" (38 is not special and thus has no entry in the "number" list)


x000
the thousands of the year, if there is a corresponding entry in the "number" list
"" (English pronounces thousands by putting "thousand" after the unit, so there is no entry for 2000 in the "number" list)
"" (same reason, for 3000)


x_in_x000
the thousands digit of the year
"two"
"three"


x0_in_x000
the thousands digit of the year, multiplied by 10
"twenty"
"thirty"


number
the non-formatted part of the year corresponding to "decade_format" (only last two digits), "hundreds_format" (last three digits) or "thousand_format" (last four digits), to be used for "default" as a fallback
"19", "19" or "2019"
"65", "865" or "3865"

#### `"year_format"`

This JSON object follows the same structure as [`"decade_format"`, `"hundreds_format"`, `"thousand_format"`](#decade_format-hundreds_format-thousand_format), but there is also a `"bc"` field that should contain the translation of the shortened "Before Christ" ("b.c."). In this JSON object you should put how to fully format a number as a year, using the formatted strings already calculated using `"decade_format"`, `"hundreds_format"` and `"thousand_format"`.

The formats have at their disposal the full table from above plus the following items (which are the ones that should actually be used).


FORMATTED_STRING_NAME
Explanation
e.g. "2019"
e.g. "-3865"


formatted_decade
the decade formatted using "decade_format"
"nineteen"
"sixty five"


formatted_hundreds
the hundreds formatted using "hundreds_format"
"zero hundred" (yeah, it doesn't make sense)
"eight hundred"


formatted_thousand
the thousands formatted using "thousand_format"
"twenty"
"thirty eight"


bc
the translation of "b.c." if the year is before Christ, otherwise an empty string
""
"b.c."


number
the non-formatted full-year, to be used for "default" as a fallback
"2019"
"3865"

#### `"date_format"`

This JSON object should contain a format in these fields: `"date_full"`, `"date_full_no_year"`, `"date_full_no_year_month"`; and a translation of the field name in these fields: `"today"`, `"tomorrow"`, `"yesterday"`.

The formats have at their disposal this limited table.


FORMATTED_STRING_NAME
Explanation
e.g. "Tuesday, 2022/05/03"


day
the name of the day in the month
"third"


weekday
the name of the day in the week
"tuesday"


month
the name of the month
"may"


formatted_year
the year fully formatted using "year_format"
"twenty twenty two"

#### `"date_time_format"`

This JSON object should contain a format in this only field: `"date_time"`.

The format has at its disposal this limited table.


FORMATTED_STRING_NAME
Explanation
e.g. "Tuesday, 2022/05/03 13:22"


formatted_date
the date fully formatted using "date_format"
"tuesday, may second, twenty twenty two"


formatted_time
the time formatted using the java method niceTime
"one twenty two p.m."

### `tokenizer.json`

This JSON file contains the information the tokenizer uses to generate the token stream corresponding to an input string.

- `"spaces"`: a string containing all characters that should be considered spaces. This usually does not need to be *translated*.
- `"characters_as_word"`: a string containing all characters that should be considered as single-character words, and not as letters possibly part of a bigger word. An example for this is `%`, since the percent sign has a meaning on its own.
- `"compound_word_piece_category"`: the category that you will give in `"number_mappings"` to words that can be part of a compound word. A compound word is a word made of other words all connected together, and the tokenizer will split such big words using only words that can actually be part of compound words. While the value of this could be arbitrary, given that you then use the same value in `"number_mappings"`, there is no reason to set this to something different than "compound_word_piece", which is what is used in Italian. If the language you want to add has no compound words, like English, just don't add this field.
- `"raw_number_categories"`: an array of all the categories that the tokenizer should give to raw numbers when it encounters them. A raw number is e.g. 2384, -392, ...
- `"plural_endings"`: an array of the endings that the tokenizer should try to trim off a word if it can't recognize it otherwise. For example, in English "tens" would match with "ten" because there is "s" in the `"plural_endings"` array.
- `"word_matches"`: an array of JSON objects of this form:
- `"categories"`: an array of the categories that should be assigned to the words that match any of the values
- `"values"`: an array of the words that should be assigned all of the categories
- `"number_mappings"`: an array of JSON objects of this form:
- `"categories"`: an array of the categories that should be assigned to the words that match any of the values
- `"values"`: an JSON object with the words that should be assigned all of the categories paired with their corresponding (integer or decimal) numerical value
- `"duration_words"`: a JSON object used to pair words with their corresponding duration. The keys in the object should be the durations (formatted as `"number UNIT"`, where `number` is an integer and `UNIT` (uppercase) is one of NANOS, MICROS, MILLIS, SECONDS, MINUTES, HOURS, HALF_DAYS, DAYS, WEEKS, MONTHS, YEARS, DECADES, CENTURIES, MILLENNIA, ERAS), and the values are JSON arrays of words.
- `"duration_restrict_after_number"`: a list of all of the words present in `"duration_words"` that should not be recognized as a duration if they don't come with a number before them in the input. For example, "hello ms" should **not** be interpreted as "hello (1 millisecond)", while "hello millisecond" and "hello 1 ms" should.

### Test resources

Copy the whole folder `numbers/src/test/resources/config/en-us` into a new folder `numbers/src/test/resources/config/LANGUAGE_COUNTRY`. All of the resource files in the new folder are used for testing purposes and should be translated into the new language. **DO NOT** rename any file, just edit their contents.

### `date_time_test.json`

This file contains some tests for the `date_time.json` file. This file is **also present in Mycroft's [lingua-franca](https://github.com/MycroftAI/lingua-franca/blob/master/lingua_franca/res/text/), so copy it from there** to save time! Then just check if everything is fine.

Each of the JSON objects described below contains a numbered list of tests to run.

#### `"test_nice_year"`

These tests are for [`"year_format"`](#year_format). Each test has:
- `"datetime_param"` the datetime in this form: year, month, day, hour, minute, second
- `"bc"` whether the datetime is before Christ (`True`) or not (`False`)
- `"assertEqual"` what the provided datetime should be formatted as

#### `"test_nice_date"`

These tests are for [`"date_format"`](#date_format). Each test has:
- `"datetime_param"` the datetime in this form: year, month, day, hour, minute, second
- `"now"` the datetime corresponding to the "now" time for which to test relative dates (e.g. if `datetime_param=now`, then `today` will be outputted, not the full date), in the same form, or `None` if "now" is unknown or unwanted
- `"assertEqual"` what the provided datetime should be formatted as

#### `"test_nice_date_time"`

These tests are for [`"date_time_format"`](#date_time_format). Each test has:
- `"datetime_param"` the datetime in this form: year, month, day, hour, minute, second
- `"now"` the datetime corresponding to the "now" time for which to test relative dates (e.g. if `datetime_param=now`, then `today` will be outputted, not the full date), in the same form, or `None` if "now" is unknown or unwanted
- `"use_24hour"` whether to use the 24-hour format (`True`) or use the 12-hour one (`False`) (parameter passed to the java method `niceTime`)
- `"use_ampm"` whether to show AM/PM or not (parameter passed to the java method `niceTime`)
- `"assertEqual"` what the provided datetime should be formatted as