{"id":22423835,"url":"https://github.com/statcan/canonym","last_synced_at":"2025-06-14T09:06:28.425Z","repository":{"id":195077318,"uuid":"691574703","full_name":"StatCan/Canonym","owner":"StatCan","description":"Canonym - Anonymization Package","archived":false,"fork":false,"pushed_at":"2024-12-04T16:14:19.000Z","size":214,"stargazers_count":9,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-14T09:04:50.424Z","etag":null,"topics":["anonymization","presidio","privacy-protection","private-information-retrieval"],"latest_commit_sha":null,"homepage":"https://github.com/StatCan/Canonym","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/StatCan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-09-14T13:05:59.000Z","updated_at":"2025-02-12T16:02:00.000Z","dependencies_parsed_at":"2024-12-03T23:22:31.199Z","dependency_job_id":"db7be441-1cc9-47f1-855e-cf9a76f8dda2","html_url":"https://github.com/StatCan/Canonym","commit_stats":null,"previous_names":["statcan/canonym"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/StatCan/Canonym","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StatCan%2FCanonym","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StatCan%2FCanonym/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StatCan%2FCanonym/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/
StatCan%2FCanonym/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/StatCan","download_url":"https://codeload.github.com/StatCan/Canonym/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StatCan%2FCanonym/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259790457,"owners_count":22911547,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["anonymization","presidio","privacy-protection","private-information-retrieval"],"created_at":"2024-12-05T18:13:12.735Z","updated_at":"2025-06-14T09:06:28.406Z","avatar_url":"https://github.com/StatCan.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"Le français suit l'anglais\n\n# Canonym - Anonymization Package\n\n## Installation\n\nCreate a virtual environment with Python \\\u003e=3.10 .. 
code:\n\n    conda create -n=canonym python=3.10\n    conda activate canonym\n\nYou can install Canonym using the wheel located in the dist folder:\n\n``` \npip install dist\\canonym_public-2024.12.3-py3-none-any.whl\n```\n\n## Usage\n\n### Basic usage\n\n``` python\nfrom canonym import Canonym\n\nanonymizer = Canonym()\n\ntext = \"This is John Doe, from Ottawa, his phone number is 123-456-7890\"\nanonymizer.anonymize(text)\n```\n\nCanonym accepts the following input types :  \n-   string\n-   list of strings\n-   TextItem object\n-   Pandas Series\n-   Pandas DataFrame\n\nYou can call the **anonymize()** method directly, or use the input-specific methods :  \n-   **str**: anonymizer.anonymize_text,\n-   **TextItem**: anonymizer.anonymize_text,\n-   **list**: anonymizer.anonymize_list,\n-   **DataFrame**: anonymizer.anonymize_dataframe,\n-   **Series**: anonymizer.anonymize_pd_series\n\n### Strategies\n\nThe following strategies are available by default:\n\n-   **replace_all_with_tag** : Replaces all entities with their entity\n    type\n-   **redact_all** : Redacts all PI\n-   **hash_all** : Hashes the PI entities\n-   **mask_all** : Masks all DEFAULT entities\n-   **scramble_all** : Scrambles (changes the order of letters in) all\n    entities\n-   **mixed_per_entity_type** : {hash : ALPHABET_ENTITIES, mask :\n    SPECIAL_ENTITIES, randomize : NUMERIC_ENTITIES, redact :\n    ALPHANUMERIC_ENTITIES}\n-   **hash_one** : {hash: \\[PERSON\\]}\n-   **mask_some**: {mask: \\[PERSON, FULL_ADDRESS\\], redact:\n    \\[PHONE_NUMBER\\]}\n-   **replace_custom**: {replace: CUSTOM_ENTITIES}\n-   **redact_custom**: {redact : CUSTOM_ENTITIES}\n-   **faker_custom**: {faker : CUSTOM_ENTITIES}\n-   **faker_all**: {faker : DEFAULT_ENTITIES}\n-   **replace_w_value_custom**: {replace_val : CUSTOM_ENTITIES}\n\nBy default, the anonymize method uses the *replace_all_with_tag*\nstrategy. To use a different strategy, pass the **strategy** parameter :\n\n``` python\n   anonymizer.anonymize(text, 
strategy='redact_all')\n```\n\n### Language\n\nThe **language** parameter specifies the language of the text and defaults\nto English. If *None* or \"auto\" is provided, Canonym searches for the\nright language of each text input automatically. By default,\nCanonym handles English and French text.\n\nIn the case of a Pandas DataFrame, the parameter also accepts a dict of the form\n**{column_name:language or auto, }**, so each column can be set to a\ndifferent language or to automatic search.\n\n``` python\n# a string\nanonymizer.anonymize(text, strategy='redact_all', language='fr')\n# a Pandas DataFrame\nanonymizer.anonymize(df, language={'column1': 'en', 'column2': 'fr', 'column3': 'auto'})\n```\n\n### Advanced Configuration\n\nThe behavior of Canonym can be modified by editing the two configuration files :  \n-   ner_config_default.yaml\n-   anonymizer_config_default.yaml\n\nIn **ner_config_default.yaml** the following can be defined :\n\n\u003e -   AVAILABLE_LANGS : Which languages Canonym can handle, defaults to\n\u003e     *en* and *fr*\n\u003e\n\u003e -   SCORE_THRESHOLD : The confidence score threshold above which an\n\u003e     entity is tagged, defaults to **0.4**\n\u003e\n\u003e -   DEFAULT_RECOGNIZERS : List of recognizers loaded by Canonym\n\u003e\n\u003e -   POST_PROCESSING_ENTITIES : Entity-specific post-processing\n\u003e\n\u003e -   PRESIDIO_NLP_ENGINE_CONFIG : Some entities are handled by a spaCy engine, which needs to be defined  \n\u003e     -   nlp_engine_name\n\u003e     -   models\n\u003e\n\u003e -   SPACY_ENTITIES : List of entities that need to be handled by\n\u003e     spaCy\n\u003e\n\u003e -   TRANSFORMER_MODELS_ENTITIES : List of entities that need to be\n\u003e     handled by the Transformers models\n\u003e\n\u003e -   TRANSFORMER_MODELS_ENHANCERS : Post-processing enhancements for the\n\u003e     tags provided by the Transformers models (extending partial words\n\u003e     or merging similar contiguous entities)\n\nIn 
**anonymizer_config_default.yaml** the following can be defined :\n\n\u003e -   AVAILABLE_LANGS : Which languages Canonym can handle, defaults to\n\u003e     *en* and *fr*\n\u003e\n\u003e -   DEFAULT_ENTITIES: List of all entities that can be anonymized\n\u003e\n\u003e -   ALPHABET_ENTITIES: Set of alphabetic entities\n\u003e\n\u003e -   SPECIAL_ENTITIES: Set of special entities (email, url, etc.)\n\u003e\n\u003e -   NUMERIC_ENTITIES:\n\u003e\n\u003e -   ALPHANUMERIC_ENTITIES:\n\u003e\n\u003e -   CUSTOM_ENTITIES: Custom set of entities to be redacted\n\u003e\n\u003e -   ALL_ANONYMIZER_STRATEGIES: List of strategies; a strategy is defined as :  \n\u003e     *strategy_name* : {anonymization_action_1 : SET_1_OF_ENTITIES,\n\u003e     anonymization_action_2 : SET_2_OF_ENTITIES}\n\n\n## Contributing\n\nBefore contributing, please read the instructions in CONTRIBUTING.md \n\nlink: [CONTRIBUTING.MD](https://github.com/StatCan/Canonym/blob/main/CONTRIBUTING.md)\n\n\n## License\n\n[MIT License](https://github.com/StatCan/Canonym/blob/main/LICENSE)\n\n\n# Canonym - Librairie d'anonymisation Statistique Canada\n\n## Installation\n\nCréer un environnement virtuel avec Python \\\u003e=3.10 :\n\n    conda create -n=canonym python=3.10\n    conda activate canonym\n\nVous pouvez installer Canonym en utilisant le fichier whl situé dans le\ndossier dist :\n\n    pip install dist\\canonym_public-2024.12.3-py3-none-any.whl\n\n## Utilisation\n\n### Utilisation 
de base\n\n    from canonym import Canonym\n\n    anonymizer = Canonym()\n\n    text = \"This is John Doe, from Ottawa, his phone number is 123-456-7890\"\n    anonymizer.anonymize(text)\n\nCanonym peut accepter comme intrants :  \n-   chaîne de caractères\n-   liste de chaînes de caractères\n-   objet TextItem\n-   Série Pandas\n-   DataFrame Pandas\n\nVous pouvez utiliser directement la méthode **anonymize()**, ou utiliser les méthodes spécifiques à chaque type d'intrants :  \n-   **str** : anonymizer.anonymize_text,\n-   **TextItem** : anonymizer.anonymize_text,\n-   **list** : anonymizer.anonymize_list,\n-   **DataFrame** : anonymizer.anonymize_dataframe,\n-   **Series** : anonymizer.anonymize_pd_series\n\n### Stratégies\n\nLes stratégies suivantes sont disponibles par défaut :\n\n-   **replace_all_with_tag** : Remplace toutes les entités par leur type\n    d'entité\n-   **redact_all** : Expurge tous les PI\n-   **hash_all** : Hache les entités PI\n-   **mask_all** : Masque toutes les entités DEFAULT\n-   **scramble_all** : Brouille (change l'ordre des lettres) toutes les\n    entités\n-   **mixed_per_entity_type** : {hash : ALPHABET_ENTITIES, mask :\n    SPECIAL_ENTITIES, randomize : NUMERIC_ENTITIES, redact :\n    ALPHANUMERIC_ENTITIES}\n-   **hash_one** : {hash : \\[PERSON\\]}\n-   **mask_some** : {mask : \\[PERSON, FULL_ADDRESS\\], redact :\n    \\[PHONE_NUMBER\\]}\n-   **replace_custom** : {replace : CUSTOM_ENTITIES}\n-   **redact_custom** : {redact : CUSTOM_ENTITIES}\n-   **faker_custom** : {faker : CUSTOM_ENTITIES}\n-   **faker_all** : {faker : DEFAULT_ENTITIES}\n-   **replace_w_value_custom** : {replace_val : CUSTOM_ENTITIES}\n\nPar défaut, la méthode anonymize utilise la stratégie\n*replace_all_with_tag*. Pour utiliser une stratégie différente, utilisez\nle paramètre **strategy** :\n\n``` python\n   anonymizer.anonymize(text, strategy='redact_all')\n```\n\n### Langue\n\nLe paramètre **language** indique la langue du texte, par 
défaut\nl'anglais. Si *None* ou \"auto\" est fourni, une recherche sera\neffectuée pour trouver automatiquement la bonne langue pour chaque\ntexte saisi. Par défaut, Canonym est capable de traiter les textes en\nanglais et en français.\n\nDans le cas d'un DataFrame Pandas, Canonym accepte également un dict de\nformat **{nom_de_colonne:langue ou auto, }**, afin que chaque colonne\npuisse être configurée pour une langue différente ou pour une recherche\nautomatique.\n\n``` python\n# une chaîne de caractères\nanonymizer.anonymize(text, strategy='redact_all', language='fr')\n# un DataFrame Pandas\nanonymizer.anonymize(df, language={'column1' : 'en', 'column2' : 'fr', 'column3' : 'auto'})\n```\n\n### Configuration avancée\n\nLe comportement de Canonym peut être modifié en éditant les deux fichiers de configuration :  \n-   ner_config_default.yaml\n-   anonymizer_config_default.yaml\n\nDans **ner_config_default.yaml**, les éléments suivants peuvent être\ndéfinis :\n\n\u003e -   AVAILABLE_LANGS : Les langues que Canonym peut gérer, par défaut\n\u003e     *en* et *fr*.\n\u003e\n\u003e -   SCORE_THRESHOLD : Le seuil de confiance à partir duquel une entité\n\u003e     est étiquetée, par défaut **0.4**.\n\u003e\n\u003e -   DEFAULT_RECOGNIZERS : Liste des outils de reconnaissance chargés\n\u003e     par Canonym\n\u003e\n\u003e -   POST_PROCESSING_ENTITIES : Post-traitement spécifique aux entités\n\u003e\n\u003e -   PRESIDIO_NLP_ENGINE_CONFIG : Certaines entités seront traitées par un moteur spaCy qui doit être défini.  
\n\u003e     -   nlp_engine_name\n\u003e     -   models\n\u003e\n\u003e -   SPACY_ENTITIES : Liste des entités qui doivent être gérées par\n\u003e     spaCy\n\u003e\n\u003e -   TRANSFORMER_MODELS_ENTITIES : Liste des entités qui doivent être\n\u003e     traitées par les modèles Transformers\n\u003e\n\u003e -   TRANSFORMER_MODELS_ENHANCERS : Amélioration du post-traitement\n\u003e     pour les étiquettes fournies par les modèles Transformers\n\u003e     (extension des mots partiels ou fusion d'entités contiguës\n\u003e     similaires)\n\nDans **anonymizer_config_default.yaml**, les éléments suivants peuvent\nêtre définis :\n\n\u003e -   AVAILABLE_LANGS : Les langues que Canonym peut gérer, par défaut\n\u003e     *en* et *fr*.\n\u003e\n\u003e -   DEFAULT_ENTITIES : Liste de toutes les entités qui peuvent être\n\u003e     anonymisées\n\u003e\n\u003e -   ALPHABET_ENTITIES : Ensemble d'entités alphabétiques\n\u003e\n\u003e -   SPECIAL_ENTITIES : Ensemble d'entités spéciales (email, url, etc.)\n\u003e\n\u003e -   NUMERIC_ENTITIES :\n\u003e\n\u003e -   ALPHANUMERIC_ENTITIES :\n\u003e\n\u003e -   CUSTOM_ENTITIES : Ensemble personnalisé d'entités à expurger\n\u003e\n\u003e -   ALL_ANONYMIZER_STRATEGIES : Liste des stratégies, une stratégie est définie comme suit :  \n\u003e     *nom_de_la_stratégie* : {anonymisation_action_1 :\n\u003e     SET_1_OF_ENTITIES, anonymisation_action_2 : SET_2_OF_ENTITIES}\n\n## Contribuer\n\nAvant de contribuer, merci de lire les instructions présentes dans CONTRIBUTING.md\n\nlien: [CONTRIBUTING.MD](https://github.com/StatCan/Canonym/blob/main/CONTRIBUTING.md)\n\n## Licence\n\n[MIT License](https://github.com/StatCan/Canonym/blob/main/LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstatcan%2Fcanonym","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstatcan%2Fcanonym","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstatcan%2Fcanonym/lists"}