# sic

[![pypi][pypi-img]][pypi-url]

[pypi-img]: https://img.shields.io/pypi/v/sic?style=plastic
[pypi-url]: https://pypi.org/project/sic/

###### _(Latin)_ so, thus, such, in such a way, in this way
###### _(English)_ spelling is correct

`sic` is a module for string normalization. Given a string, it separates
sequences of alphabetical, numeric, and punctuation characters, and also
performs more complex transformations (i.e. separates or replaces specific
words or individual symbols).
It comes with a set of default normalization rules
to transliterate and tokenize Greek letters and to replace accented characters
with their base forms. It also allows using custom normalization configurations.

Basic usage:

```python
import sic

builder = sic.Builder()
machine = builder.build_normalizer()
x = machine.normalize('abc123xyzalphabetagammag')
print(x)
```

The output will be:

```bash
abc 123 xyz alpha beta gamma g
```

## Installation

- `sic` is designed to work in a Python 3 environment.
- `sic` only needs the Python Standard Library (no other packages).

To get a wheel for Windows (Python >= 3.6) or a source code package for Linux:

```bash
pip install sic
```

To get the source code package regardless of the OS:

```bash
pip install sic --no-binary sic
```

Wheels and .tar.gz archives can also be downloaded from the project's repository.

Wheels contain binaries compiled from cythonized code; the source code package is
pure Python. The cythonized version performs better on short strings, while the
non-cythonized version performs better on long strings, so one may be preferred
over the other depending on the usage scenario. A benchmark is below.

| STRING LENGTH | REPEATS | VERSION | MEAN TIME (s) |
|:-------------:|:-------:|:-------:|:-------------:|
| 71            | 10000   | .tar.gz | 1.8           |
| 71            | 10000   | wheel   | 0.5           |
| 710000        | 1       | .tar.gz | 2.7           |
| 710000        | 1       | wheel   | 15.9          |

## Tokenization configs

`sic` implements tokenization, i.e. it splits a given string into tokens and
processes those tokens according to the rules specified in a configuration
file. Basic tokenization separates groups of alphabetical, numerical,
and punctuation characters within a string, thus turning them into separate
words (for future reference, we'll call such words `tokens`).
For instance,
`abc-123` will be transformed into `abc - 123`, with tokens `abc`, `-`, and
`123`.

What happens next to the initially tokenized string must be defined using XML in
configuration file(s). The entry point for the default tokenizer applied to a
string is `sic/tokenizer.standard.xml`.

Below is the template and description for a tokenizer config.

```xml
<?xml version="1.0" encoding="UTF-8"?>

<!-- tokenizer.config.xml -->
<!--
  This is the description of the config file for the tokenizer.
  General structure:
  <tokenizer>
  +-<import>
  +-...
  +-<import>
  +-<setting>
  +-...
  +-<setting>
  +-<split>
  +-...
  +-<split>
  +-<token>
  +-...
  +-<token>
  +-<character>
  +-...
  +-<character>
-->

<!-- There must be a single root element, and it must be <tokenizer>: -->
<tokenizer name="$name">
<!-- $name: string label for this tokenizer -->

  <!--
    Direct children of <tokenizer> are <import>, <setting>, <split>,
      <token>, and/or <character> elements (there can be zero to many
      declarations of any of those)
  -->

  <!-- <import> elements point at other tokenizer configs to merge with -->
  <import file="$file" />
  <!-- $file: path to a file with another tokenizer config -->

  <!-- <setting> elements define high-level tokenizer settings -->
  <setting name="$name" value="$value" />
  <!--
    Names and value requirements for /tokenizer/setting elements:
    $name="cs": $value="0"|"1" (if "1", this tokenizer will be case-sensitive)
    $name="bypass": $value="0"|"1" (if "1", this tokenizer will do nothing,
      regardless of the rest of the content of this file)
  -->

  <!--
    <split> elements define substrings that should be separated from the text
      as tokens
  -->
  <split where="$where" value="$value" />
  <!--
    $where="l"|"r"|"m" ("l" for left, "r" for right, "m" for middle)
    $value: string that will be handled as a token when it is either at the
      beginning of a word ($where="l"), at the end of a word ($where="r"), or
      in the middle ($where="m")
  -->

  <!--
    <token> elements define tokens that should be replaced with other tokens
      (or with nothing => removed)
  -->
  <token to="$to" from="$from" />
  <!--
    $to: string that should replace the token specified in $from
    $from: token to be replaced by the string specified in $to
  -->

  <!--
    <character> elements define single characters that should be replaced with
      other single characters
  -->
  <character to="$to" from="$from" />
  <!--
    $to: character that should replace the character specified in $from
    $from: character to be replaced by the character specified in $to
  -->

</tokenizer>
```

Below are descriptions and examples of tokenizer config elements.

|    ELEMENT    |            ATTRIBUTES             | DESCRIPTION | EXAMPLE |
|:-------------:|:---------------------------------:|:-----------:|:-------:|
| `<import>`    | file="path/to/another/config.xml" | Imports tokenization rules from another tokenizer config. | |
| `<setting>`   | name="bypass" value="?"           | If present and value="1", all tokenization rules are ignored, as if there were no tokenization at all (left for debugging purposes). | |
| `<setting>`   | name="cs" value="?"               | If value="1", the string is processed case-sensitively; if value="0", case-insensitively; if not present, the tokenizer is case-insensitive. | |
| `<split>`     | where="l" value="?"               | Separates the token specified in `value` from the **left** part of a bigger token. | where="l" value="kappa": `nf kappab` --> `nf kappa b` |
| `<split>`     | where="m" value="?"               | Separates the token specified in `value` when it is found in the **middle** of a bigger token. | where="m" value="kappa": `nfkappab` --> `nf kappa b` |
| `<split>`     | where="r" value="?"               | Separates the token specified in `value` from the **right** part of a bigger token. | where="r" value="gamma": `ifngamma` --> `ifn gamma` |
| `<token>`     | to="?" from="?"                   | Replaces the token specified in `from` with the token specified in `to`. | to="protein" from="gene": `nf kappa b gene` --> `nf kappa b protein` |
| `<character>` | to="?" from="?"                   | Replaces the character specified in `from` with the character specified in `to`. | to="e" from="ë": `citroën` --> `citroen` |

Attribute `where` of the `<split>` element may have any combination of the `l`,
`m`, and `r` literals if the specified substring is required to be separated in
different places of a bigger string. So, instead of three different elements

```xml
<split where="l" value="word" />
<split where="m" value="word" />
<split where="r" value="word" />
```

using the following single one

```xml
<split where="lmr" value="word" />
```

will achieve the same result.

Transformations are applied in the following order:

1. Replacing characters
2. Splitting tokens
3. Replacing tokens

When splitting tokens, longer ones shadow shorter ones. Token replacement
instructions may contradict each other locally, but as an entire set they must
converge so that each token has only one replacement option (otherwise a
`ValueError` exception will be raised).

## Usage

```python
import sic
```

For a detailed description of all functions and methods, see the comments in
the source code.

### Class `sic.Model`

This class is designed to instantly create tokenization rules directly in
Python.
It is neither convenient nor recommended for complex normalization
tasks, but can be handy for small ones where using an external XML config might
seem like overkill.

```python
# instantiate Model
model = sic.Model()

# make the model case-sensitive
model.case_sensitive = True

# make the model do nothing
model.bypass = True
```

**Method** `sic.Model.add_rule` adds a single tokenization instruction to the
Model instance:

```python
# equivalent to XML <split where="lmr" value="beta" />
model.add_rule(sic.SplitToken('beta', 'lmr'))

# equivalent to XML <token to="good" from="bad" />
model.add_rule(sic.ReplaceToken('bad', 'good'))

# equivalent to XML <character to="z" from="a" />
model.add_rule(sic.ReplaceCharacter('a', 'z'))
```

> **NB**: in case a new `sic.ReplaceToken` or `sic.ReplaceCharacter` instruction
> contradicts something that is already in the model, the newer instruction
> overrides the older one:
>
> ```python
> model.add_rule(sic.ReplaceToken('bad', 'good'))
> model.add_rule(sic.ReplaceToken('bad', 'better'))
> ```
>
> "bad" --> "good" will not be used; "bad" --> "better" will be used instead

**Method** `sic.Model.remove_rule` removes a single tokenization instruction
from the Model instance if it is there:

```python
model.remove_rule(sic.ReplaceToken('bad', 'good'))
# the tokenization rule that fits the definition above will be removed from the model
```

### Class `sic.Builder`

**Method** `sic.Builder.build_normalizer()` reads a tokenization config,
instantiates a `sic.Normalizer` object that performs tokenization according to
the rules specified in the given config, and returns that `sic.Normalizer`
instance.

| ARGUMENT |    TYPE    | DEFAULT |                            DESCRIPTION                             |
|:--------:|:----------:|:-------:|:------------------------------------------------------------------:|
| endpoint | str, Model |  None   | Path to a tokenizer configuration file, or a `sic.Model` instance. |

```python
# create Builder object
builder = sic.Builder()

# create Normalizer object with the default set of rules
machine = builder.build_normalizer()

# create Normalizer object with a custom set of rules
machine = builder.build_normalizer('/path/to/config.xml')

# create Normalizer object using an ad hoc model
model = sic.Model()
model.add_rule(sic.SplitToken('beta', 'lmr'))
machine = builder.build_normalizer(model)
```

### Class `sic.Normalizer`

**Method** `sic.Normalizer.save()` saves the data structure from an instance of
the `sic.Normalizer` class to a specified file (pickle).

| ARGUMENT | TYPE | DEFAULT |             DESCRIPTION             |
|:--------:|:----:|:-------:|:-----------------------------------:|
| filename | str  |   n/a   | Path and name of the file to write. |

**Method** `sic.Normalizer.load()` reads a specified file (pickle) and places
the data structure in the `sic.Normalizer` instance.

| ARGUMENT | TYPE | DEFAULT |            DESCRIPTION             |
|:--------:|:----:|:-------:|:----------------------------------:|
| filename | str  |   n/a   | Path and name of the file to read. |

**Method** `sic.Normalizer.normalize()` performs string normalization according
to the rules ingested at the time of class initialization, and returns the
normalized string.

|     ARGUMENT      | TYPE | DEFAULT |                       DESCRIPTION                        |
|:-----------------:|:----:|:-------:|:--------------------------------------------------------:|
| source_string     | str  |   n/a   | String to normalize.                                     |
| word_separator    | str  |   ' '   | Word delimiter (single character).                       |
| normalizer_option | int  |    0    | Mode of post-processing.                                 |
| control_character | str  | '\x00'  | Character masking the word delimiter (single character). |

`word_separator`: The specified character will be considered a boundary between
tokens. The default value is `' '` (space), which seems a reasonable choice for
natural language. However, any character can be specified, which might be more
useful in certain contexts.

`normalizer_option`: The value can be one of `0`, `1`, `2`, or `3` and controls
the way the tokenized string is post-processed:

| VALUE |                             MODE                              |
|:-----:|:-------------------------------------------------------------:|
|   0   | No post-processing.                                           |
|   1   | Rearrange tokens in alphabetical order.                       |
|   2   | Rearrange tokens in alphabetical order and remove duplicates. |
|   3   | Remove all added word separators.                             |

`control_character`: An implementation detail: the character that is used to
mask the word delimiters inserted into the parsed string at run time. If the
parsed string initially included this character somewhere, normalization will
raise an error. The value is `\x00` by default.

**Property** `sic.Normalizer.result` retains the result of the last call to the
`sic.Normalizer.normalize` method as a dict object with the following keys:

|     KEY      |   VALUE TYPE    |                       DESCRIPTION                        |
|:------------:|:---------------:|:--------------------------------------------------------:|
| 'original'   | str             | Original string value that was processed.                |
| 'normalized' | str             | Returned normalized string value.                        |
| 'map'        | list(int)       | Map between the original and normalized strings.         |
| 'r_map'      | list(list(int)) | Reverse map between the original and normalized strings. |

`sic.Normalizer.result['map']`: `sic.Normalizer.normalize()` not only generates
a normalized string out of the originally provided one, it also tries to map
character indexes in the normalized string back to those in the original one.
This map is represented as a list of integers where the item index is the
character position in the normalized string and the item value is the character
position in the original string. This is only valid when the
`normalizer_option` argument for the `sic.Normalizer.normalize()` call has been
set to 0.

`sic.Normalizer.result['r_map']`: Reverse map between character locations in
the original string and its normalized reflection (the item index is the
character position in the original string; the item value is a list [`x`, `y`]
where `x` and `y` are respectively the lowest and highest indexes of the mapped
character in the normalized string).

### Function `sic.build_normalizer()`

`sic.build_normalizer()` implicitly creates a single instance of the
`sic.Normalizer` class accessible globally from the `sic` namespace. Arguments
are the same as for the `sic.Builder.build_normalizer()` method.

### Function `sic.save()`

`sic.save()` saves the data structure stored in the global instance of the
`sic.Normalizer` class to a specified file (pickle). Arguments are the same as
for the `sic.Normalizer.save()` method.

### Function `sic.load()`

`sic.load()` reads a specified file (pickle) and places the data structure
stored in that file into the global instance of the `sic.Normalizer` class.
Arguments are the same as
for the `sic.Normalizer.load()` method.

### Function `sic.normalize()`

`sic.normalize(*args, **kwargs)` either uses the global `sic.Normalizer`
instance or creates a new local `sic.Normalizer` instance on the fly, and uses
it to perform the requested string normalization.

|     ARGUMENT      | TYPE | DEFAULT |                       DESCRIPTION                        |
|:-----------------:|:----:|:-------:|:--------------------------------------------------------:|
| source_string     | str  |   n/a   | String to normalize.                                     |
| word_separator    | str  |   ' '   | Word delimiter (single character).                       |
| normalizer_option | int  |    0    | Mode of post-processing.                                 |
| control_character | str  | '\x00'  | Character masking the word delimiter (single character). |
| tokenizer_config  | str  |  None   | Path to a tokenizer configuration file.                  |

If the `tokenizer_config` argument is not provided, the function will use the
global instance of the `sic.Normalizer` class (and will create it if it is not
initialized).

### Function `sic.reset()`

`sic.reset()` resets the global `sic.Normalizer` instance to `None`, forcing a
subsequently called `sic.normalize()` to create a new global instance again if
it needs one.

### Attribute `sic.result`, function `sic.result()`

The `sic.result` attribute retains the value of the `sic.Normalizer.result`
property that belonged to the most recently used `sic.Normalizer` instance
accessed from the `sic.normalize()` function (either global or local).

Python 3.6 does not support [PEP-562](https://www.python.org/dev/peps/pep-0562/)
(module attributes).
So in Python 3.6, use the function `sic.result()` rather
than the attribute `sic.result`:

```python
sic.result() # will work in Python >= 3.6
sic.result   # will work in Python >= 3.7
```

## Examples

### Basic usage

```python
import sic

# create Builder object
builder = sic.Builder()
# create Normalizer object with the default set of rules
machine = builder.build_normalizer()

# using default word_separator and normalizer_option
x = machine.normalize('alpha-2-macroglobulin-p')
print(x) # 'alpha - 2 - macroglobulin - p'
print(machine.result)
"""
{
  'original': 'alpha-2-macroglobulin-p',
  'normalized': 'alpha - 2 - macroglobulin - p',
  'map': [
    0, 1, 2, 3, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 21, 22, 22
  ],
  'r_map': [
    [0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 6], [7, 8], [9, 10], [11, 12], [13, 13], [14, 14], [15, 15], [16, 16], [17, 17], [18, 18], [19, 19], [20, 20], [21, 21], [22, 22], [23, 23], [24, 24], [25, 26], [27, 28]
  ]
}
"""
```

### Custom word separator

```python
x = machine.normalize('alpha-2-macroglobulin-p', word_separator='|')
print(x) # 'alpha|-|2|-|macroglobulin|-|p'
```

### Post-processing options

```python
# using normalizer_option=1
x = machine.normalize('alpha-2-macroglobulin-p', normalizer_option=1)
print(x) # '- - - 2 alpha macroglobulin p'
```

```python
# using normalizer_option=2
x = machine.normalize('alpha-2-macroglobulin-p', normalizer_option=2)
print(x) # '- 2 alpha macroglobulin p'
```

```python
# using normalizer_option=3
# assuming the normalization config includes the following:
# <setting name="cs" value="0" />
# <split value="mis" where="l" />
# <token to="spelling" from="speling" />
x = machine.normalize('Misspeling', normalizer_option=3)
print(x) # 'Misspelling'
```

### Using implicitly instantiated classes

```python
# normalize() with the default instance
x = sic.normalize('alpha-2-macroglobulin-p', word_separator='|')
print(x) # 'alpha|-|2|-|macroglobulin|-|p'

# custom configuration for the implicitly instantiated normalizer
sic.build_normalizer('/path/to/config.xml')
x = sic.normalize('some string')
print(x) # will be normalized according to the config at /path/to/config.xml

# custom config and normalization in one line
x = sic.normalize('some string', tokenizer_config='/path/to/another/config.xml')
print(x) # will be normalized according to the config at /path/to/another/config.xml
```

### Saving and loading a compiled normalizer to/from disk

```python
machine.save('/path/to/file') # will write /path/to/file
another_machine = sic.Normalizer()
another_machine.load('/path/to/file') # will read /path/to/file
```

### Adding normalization rules to an already compiled model

```python
# (assuming `machine` is a sic.Normalizer instance armed with a tokenization ruleset)
new_ruleset = [sic.ReplaceToken('from', 'to'), sic.SplitToken('token', 'r')]
new_ruleset_string = ''.join([rule.decode() for rule in new_ruleset])
machine.make_tokenizer(new_ruleset_string, update=True) # rules from `new_ruleset` will be added to the normalizer
```
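### Appendix: how the index map works (illustrative)

The `'map'` structure described under `sic.Normalizer.result` can be pictured
with a toy re-implementation of the basic splitting step. This is **not** sic's
actual code (`toy_tokenize` is a hypothetical helper written for this sketch);
it only illustrates the convention that each inserted separator shares the
index of the original character that follows it, which for simple inputs
reproduces the `'map'` shown in the Basic usage example.

```python
def toy_tokenize(text):
    """Toy version of basic tokenization: split runs of alphabetical, numeric,
    and punctuation characters, while building a map from each character of
    the normalized string back to a character index in the original string.
    An inserted separator maps to the index of the character that follows it.
    """
    out = []       # characters of the normalized string
    out_map = []   # out_map[i] = index in `text` behind out[i]
    prev_kind = None
    for i, ch in enumerate(text):
        # classify the character: alphabetical, digit, or punctuation/other
        kind = 'a' if ch.isalpha() else 'd' if ch.isdigit() else 'p'
        if prev_kind is not None and kind != prev_kind:
            out.append(' ')      # inserted word separator
            out_map.append(i)    # separator maps to the following character
        out.append(ch)
        out_map.append(i)
        prev_kind = kind
    return ''.join(out), out_map

normalized, index_map = toy_tokenize('abc-123')
print(normalized)  # 'abc - 123'
print(index_map)   # [0, 1, 2, 3, 3, 4, 4, 5, 6]
```

For `'alpha-2-macroglobulin-p'` this toy happens to produce exactly the
`'normalized'` and `'map'` values shown in the Basic usage example above; sic's
real normalizer additionally applies the configured rules, case handling, and
the `'r_map'` reverse mapping.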