{"id":37707772,"url":"https://github.com/pgolo/pilsner","last_synced_at":"2026-01-16T13:10:37.096Z","repository":{"id":56058070,"uuid":"286345436","full_name":"pgolo/pilsner","owner":"pgolo","description":"Utility for dictionary-based named entity recognition","archived":false,"fork":false,"pushed_at":"2023-07-11T18:41:01.000Z","size":2250,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-10-30T00:55:44.004Z","etag":null,"topics":["named-entity-disambiguation","named-entity-linking","named-entity-recognition","rule-based-nlp","text-mining"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pgolo.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-08-10T01:08:29.000Z","updated_at":"2023-07-11T18:40:16.000Z","dependencies_parsed_at":"2022-08-15T12:20:40.091Z","dependency_job_id":null,"html_url":"https://github.com/pgolo/pilsner","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/pgolo/pilsner","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pgolo%2Fpilsner","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pgolo%2Fpilsner/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pgolo%2Fpilsner/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pgolo%2Fpilsner/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pgolo","download_url":"https://codeload.github.com/pgolo/pilsner/tar.gz/re
fs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pgolo%2Fpilsner/sbom","scorecard":{"id":730175,"data":{"date":"2025-08-11","repo":{"name":"github.com/pgolo/pilsner","commit":"82202227fd9d3f16a81d51cb75a5755c06f32e84"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":2.3,"checks":[{"name":"Binary-Artifacts","score":6,"reason":"binaries present in source code","details":["Warn: binary detected: dist/pilsner-0.1.0-cp36-cp36m-win_amd64.whl:1","Warn: binary detected: dist/pilsner-0.1.0-cp37-cp37m-win_amd64.whl:1","Warn: binary detected: dist/pilsner-0.1.0-cp38-cp38-win_amd64.whl:1","Warn: binary detected: dist/pilsner-0.1.0-cp39-cp39-win_amd64.whl:1"],"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and 
uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Code-Review","score":0,"reason":"Found 0/7 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security 
policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: MIT License: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"Vulnerabilities","score":8,"reason":"2 existing vulnerabilities detected","details":["Warn: Project is vulnerable to: PYSEC-2025-49 / GHSA-5rjg-fvgr-3xxf","Warn: Project is vulnerable to: GHSA-cx63-2mw6-8hw5"],"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"SAST","score":0,"reason":"SAST 
tool is not run on all commits -- score normalized to 0","details":["Warn: 0 commits out of 29 are checked with a SAST tool"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}}]},"last_synced_at":"2025-08-22T14:11:12.975Z","repository_id":56058070,"created_at":"2025-08-22T14:11:12.975Z","updated_at":"2025-08-22T14:11:12.975Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28478945,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-16T11:59:17.896Z","status":"ssl_error","status_checked_at":"2026-01-16T11:55:55.838Z","response_time":107,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["named-entity-disambiguation","named-entity-linking","named-entity-recognition","rule-based-nlp","text-mining"],"created_at":"2026-01-16T13:10:37.031Z","updated_at":"2026-01-16T13:10:37.084Z","avatar_url":"https://github.com/pgolo.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# pilsner\n\nPython implemented library servicing named entity recognition\n\n[![pypi][pypi-img]][pypi-url]\n\n[pypi-img]: https://img.shields.io/pypi/v/pilsner?style=plastic\n[pypi-url]: https://pypi.org/project/pilsner/\n\n## 1. 
Purpose\n\nThis library is a Python implementation of a toolkit for dictionary-based named\nentity recognition. It is intended to store any thesaurus in a trie-like\nstructure and identify any of the stored synonyms in a string.\n\n## 2. Installation and dependencies\n\n```bash\npip install pilsner\n```\n\n`pilsner` is tested in Python 3.6, 3.7, and 3.8.\n\nThe only dependency is the `sic` package. While it can be automatically installed\nat the time of `pilsner` installation, manual installation of `sic` beforehand\nmight also be considered (see benchmark of cythonized vs pure Python\nimplementation in `sic` documentation,\n[https://pypi.org/project/sic/](https://pypi.org/project/sic/)).\n\n## 3. Diagram\n\n`pilsner` consists of two major components: `Model` and `Utility`. `Model`\nclass provides storage for the dictionary and string normalization rules, as\nwell as low-level methods for populating this storage. `Utility` class provides\nhigh-level methods for storing and retrieving data to/from `Model` instance.\n\n![Diagram](https://github.com/pgolo/pilsner/blob/master/misc/pilsner-diagram.svg)\n\n## 4. Usage\n\n```python\nimport pilsner\n```\n\n### 4.1. Initialize model\n\n- To initialize empty model:\n\n```python\nm = pilsner.Model()\n```\n\n- To specify path to temporary database for empty model:\n\n```python\nm = pilsner.Model(storage_location='path/to/database.file')\n```\n\n- To create empty model that uses database created in memory rather than on\ndisk:\n\n```python\nm = pilsner.Model(storage_location=':memory:')\n```\n\n- To create empty model that does not store any attributes in a database at all:\n\n```python\nm = pilsner.Model(simple=True)\n```\n\n\u003e If database is created in memory, the model cannot be later saved on disk\n(can only be used instantly).\n\n- To load model from disk:\n\n```python\nm = pilsner.Model(filename='path/to/model')\n```\n\n\u003e More on how model is saved to and loaded from disk - see\n[4.6. Save model](#46-save-model) and [4.7. 
Load model](#47-load-model).\n\n### 4.2. Add string normalization units\n\n- Depending on the dictionary and nature of the text supposed to be parsed,\nstring normalization might not be required at all, and nothing specific is to\nbe done here in such case.\n- Without string normalization, synonyms from the dictionary will be stored as\nthey are and looked up by recognizer case-sensitively.\n- To add a single normalization unit:\n\n```python\n# Assuming m is pilsner.Model instance:\nm.add_normalizer(\n    normalizer_name='normalizer_tag',\n    filename='path/to/normalizer_config.xml'\n)\n```\n\n\u003e String normalization is technically done by `sic` component. See\n\u003e documentation for `sic` at\n\u003e [https://pypi.org/project/sic/](https://pypi.org/project/sic/) to learn how\n\u003e to design normalizer config.\n\n- Model can embed more than one normalization unit.\n- Default normalization unit for the model is the one added first or the last\none added with parameter `default` set to `True`.\n- Having multiple normalization units in one model makes perfect sense when the\nstored dictionary contains synonyms of different nature that should be\nnormalized in different ways (for example, abbreviations probably should not\nget normalized at all, while other synonyms might include tokens or punctuation\nmarks that should not affect entity recognition). 
For that purpose, Model class\nincludes `normalizer_map` dict that is supposed to map names of added\nnormalization units to values in specific field in a dictionary designating the\nway a synonym should be normalized (tokenizer field, or tokenizer column):\n\n```python\n# Assuming m is pilsner.Model instance:\nm.normalizer_map = {\n    'synonym_type_1': 'normalizer_1',\n    'synonym_type_2': 'normalizer_2'\n}\n```\n\n\u003e The snippet above instructs `pilsner` to normalize synonyms that have\n\u003e `synonym_type_1` value in `tokenizer` column with `normalizer_1`\n\u003e normalization unit, and normalize synonyms that have `synonym_type_2` value\n\u003e in `tokenizer` column with `normalizer_2` normalization unit. For more about\n\u003e fields in a dictionary, see [4.4. Define dictionary](#44-define-dictionary).\n\n### 4.3. Initialize utility\n\n- To load dictionary into `Model` instance, as well as to parse text, the\n`Utility` instance is required:\n\n```python\nr = pilsner.Utility()\n```\n\n### 4.4. Define dictionary\n\n- Source dictionary for `pilsner` must be delimited text file.\n- Along with the source dictionary, specifications of the columns (fields) must\nbe provided as list where each item corresponds to a column (from left to\nright). 
Each item in this list must be a dict object with string keys `name`,\n`include`, `delimiter`, `id_flag`, `normalizer_flag`, and `value_flag`, so\nthat:\n  - `field['name']` is a string for column title;\n  - `field['include']` is a boolean that must be set to `True` for the column\n  to be included in the model, otherwise `False`;\n  - `field['delimiter']` is a string that is supposed to split single cell into\n  list of values if the column holds concatenated lists rather than individual\n  values;\n  - `field['id_flag']` is a boolean that must be set to `True` if the column is\n  supposed to be used for grouping synonyms (generally, entity ID is such\n  column), otherwise `False`;\n  - `field['normalizer_flag']` is a boolean that must be set to `True` if the\n  column holds indication on what normalization unit must be applied to this\n  particular synonym, otherwise `False`;\n  - `field['value_flag']` is a boolean that must be set to `True` if the column\n  holds synonyms that are supposed to be looked up when parsing a text,\n  otherwise `False`.\n\n\u003e If dictionary has a column flagged with `normalizer_flag`, synonym in each\n\u003e row will be normalized with string normalization unit whose name is mapped to\n\u003e the value in this column using `pilsner.Model.normalizer_map` dict. If value is\n\u003e not among `pilsner.Model.normalizer_map` keys, default normalization unit\n\u003e will be used.\n\n### 4.5. Compile model\n\n- To store dictionary in `Model` instance, method `compile_model` of `Utility`\ninstance must be called with the following required parameters:\n  - `model`: pointer to initialized `Model` instance;\n  - `filename`: string with path and filename of source dictionary;\n  - `fields`: list of dict objects with definitions of columns (see\n  [4.4. 
Define dictionary](#44-define-dictionary));\n  - `word_separator`: string defining what is to be considered word separator\n  (generally, it should be whitespace);\n  - `column_separator`: string defining what is to be considered column\n  separator (e.g. `\\t` for tab-delimited file);\n  - `column_enclosure`: string defining what is to be stripped away from cell\n  after row has been split into columns (typically, it should be `\\n` for new\n  line character to be trimmed from the rightmost column).\n\n```python\n# Assuming m is pilsner.Model instance and r is pilsner.Utility instance:\nr.compile_model(\n    model=m,\n    filename='path/to/dictionary_in_a_text_file.txt',\n    fields=fields,\n    word_separator=' ',\n    column_separator='\\t',\n    column_enclosure='\\n'\n)\n```\n\n- To review optional parameters, see comments in the code.\n\n### 4.6. Save model\n\n- If `Model` instance has compiled dictionary, and if database location for the\n`Model` instance is not explicitly set to `':memory:'`, the data such instance\nis holding can be saved to disk:\n\n```python\n# Assuming m is pilsner.Model instance\nm.save('path/to/model_name')\n```\n\n- The snippet above will write the following files:\n  - `path/to/model_name.attributes`: database with attributes (fields from the\n  dictionary that are not synonyms) - will only be written if `Model` instance\n  is not created with `simple=True` parameter;\n  - `path/to/model_name.keywords`: keywords used for disambiguation;\n  - `path/to/model_name.normalizers`: string normalization units;\n  - `path/to/model_name.0.dictionary`: trie with synonyms;\n  - `path/to/model_name.\u003cN\u003e.dictionary`: additional tries with synonyms (`\u003cN\u003e`\n  being integer number of a trie) in case more than one trie was created (see\n  comments in the code - `pilsner.Utility.compile_model` method, `item_limit`\n  parameter).\n\n### 4.7. 
Load model\n\n- To initialize new `Model` instance using previously saved data:\n\n```python\nm = pilsner.Model(filename='path/to/model_name')\n```\n\n- Alternatively, data can be loaded to previously initialized `Model` instance:\n\n```python\nm = pilsner.Model()\nm.load('path/to/model_name')\n```\n\n- In both cases, the program will look for the following files:\n  - `path/to/model_name.attributes`: database with attributes (fields from the\n  dictionary that are not synonyms) - if not found, `Model` instance will work\n  as if it were initialized with `simple=True` parameter, meaning no attributes\n  other than primary IDs can be processed;\n  - `path/to/model_name.keywords`: keywords used for disambiguation;\n  - `path/to/model_name.normalizers`: string normalization units;\n  - `path/to/model_name.\u003cN\u003e.dictionary`: tries with synonyms (`\u003cN\u003e` being\n  integer).\n\n### 4.8. Parse string\n\n- To parse a string without filtering out any synonyms and output all\nattributes of spotted entities:\n\n```python\n# Assuming m is pilsner.Model instance, r is pilsner.Utility instance,\n# and text_to_parse is string to parse\nparsed = r.parse(\n    model=m,\n    source_string=text_to_parse\n)\n```\n\n- The output will be dict object where keys are tuples for location of spotted\nentity in a string (begin, end) and values are dicts for attributes that are\nassociated with identified entity (`{'attribute_name': {attribute_values}}`).\n- To ignore entity by its label rather than some of its attributes, compiled\nmodel can be adjusted using `pilsner.Utility.ignore_node()` method:\n\n```python\n# Assuming m is pilsner.Model instance, r is pilsner.Utility instance\nr.ignore_node(\n  model=m,\n  label='irrelevant substring'\n)\n# substring 'irrelevant substring' will not be found by pilsner.Utility.parse()\n# even if it is present in the model\n```\n\n- For details about optional parameters, see comments in the code -\n`pilsner.Utility.parse()` function.\n\n## 
5. Example\n\nEverything written above is put together in example code,\nsee **/misc/example/** directory in the project's repository.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpgolo%2Fpilsner","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpgolo%2Fpilsner","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpgolo%2Fpilsner/lists"}