{"id":24836678,"url":"https://github.com/jftuga/deidentification","last_synced_at":"2025-07-31T09:37:45.134Z","repository":{"id":271881006,"uuid":"910985235","full_name":"jftuga/deidentification","owner":"jftuga","description":"Deidentify people's names and gender specific pronouns","archived":false,"fork":false,"pushed_at":"2025-05-03T12:09:52.000Z","size":291,"stargazers_count":37,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-07-18T10:55:00.093Z","etag":null,"topics":["anonymization","data-anonymization","data-scrubbing","de-identification","deidentification","deidentify","named-entity-recognition","natural-language-processing","ner","nlp","pii","pii-anonymization","python","python3","text-anonymization"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/text-deidentification/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jftuga.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":"deidentification-html-demo.png","publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-01-02T01:25:08.000Z","updated_at":"2025-06-29T10:32:36.000Z","dependencies_parsed_at":"2025-01-10T14:27:44.597Z","dependency_job_id":"34e6b5a5-18da-48d3-ad42-7705903dd47a","html_url":"https://github.com/jftuga/deidentification","commit_stats":null,"previous_names":["jftuga/deidentification"],"tags_count":9,"template":false,"template_full_name":null,"purl":"pkg:github/jftuga/deidentification","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jftuga%2Fdeidentification","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jftuga%2Fdeidentification/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jftuga%2Fdeidentification/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jftuga%2Fdeidentification/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jftuga","download_url":"https://codeload.github.com/jftuga/deidentification/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jftuga%2Fdeidentification/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":268017357,"owners_count":24181669,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-31T02:00:08.723Z","response_time":66,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["anonymization","data-anonymization","data-scrubbing","de-identification","deidentification","deidentify","named-entity-recognition","natural-language-processing","ner","nlp","pii","pii-anonymization","python","python3","text-anonymization"],"created_at":"2025-01-31T05:02:01.786Z","updated_at":"2025-07-31T09:37:45.111Z","avatar_url":"https://github.com/jftuga.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Deidentification\n\nA Python module that removes personally identifiable information (PII) from text documents, focusing on personal names and gender-specific pronouns. This tool uses spaCy's Named Entity Recognition (NER) capabilities combined with custom pronoun handling to provide thorough text de-identification.\n\n## Key Features\n\n- Accurately identifies and replaces personal names using spaCy's NER\n- Handles gender-specific pronouns with customizable replacements\n- Supports both plain text and HTML output formats\n- Uses an optimized backward-processing strategy for accurate text replacements\n- Iterative processing ensures comprehensive PII removal\n- Configurable replacement tokens and debug output\n- GPU acceleration support through spaCy\n\n## Installation\n\n```bash\npip install text-deidentification\n\n# or...\n\npip install git+https://github.com/jftuga/deidentification.git\n```\n\n### Requirements\n\n- Python 3.10 or higher\n- spaCy's `en_core_web_trf` model (or another compatible model)\n\nDownload the required spaCy model:\n```bash\npython -m spacy download en_core_web_trf\n```\n\n## Usage\n\n### Command Line Interface\n\nThe package includes a command-line tool for quick de-identification of text files:\n\n```bash\ndeidentify input_file [options]\n# or:\npython -m deidentification.deidentify input_file [options]\n```\n\nOptions:\n- `-r, --replacement TEXT`: Specify replacement text for identified names (default: \"PERSON\")\n- `-o, --output FILE`: Output file (defaults to stdout)\n- `-H, --html`: Output in HTML format with highlighted replacements\n- `-d, --debug`: Enable debug mode\n- `-t, --tokens`: Save identified elements to a JSON file (filename--tokens.json)\n- `-x, --exclude EXCLUDE`: comma-delimited list of entities to exclude from de-identification; or change with `DEIDENTIFY_EXCLUDE_DELIM` env var\n- `-v, --version`: Display version information\n\nExample:\n```bash\n# De-identify a text file and save with HTML markup\ndeidentify input.txt -H -o output.html -r \"[REDACTED]\"\n```\n\n### Python API Usage\n\n```python\nfrom deidentification import Deidentification\n\n# Create a deidentification instance with default settings\ndeidentifier = Deidentification()\n\n# Process text\ntext = \"John Smith went to the store. He bought some groceries.\"\ndeidentified_text = deidentifier.deidentify(text)\nprint(deidentified_text)\n# Output: \"PERSON went to the store. HE/SHE bought some groceries.\"\n```\n\n### HTML Output\n\n```python\n# Generate HTML output with highlighted replacements\nhtml_output = deidentifier.deidentify_with_wrapped_html(text)\n```\n\n### HTML Output Demo\n\n![deidentification html demo](deidentification-html-demo.png)\n\n### Custom Configuration\n\n```python\nfrom deidentification import (\n    Deidentification,\n    DeidentificationConfig,\n    DeidentificationOutputStyle,\n)\n\nconfig = DeidentificationConfig(\n    spacy_model=\"en_core_web_trf\",\n    output_style=DeidentificationOutputStyle.HTML,\n    replacement=\"[REDACTED]\",\n    excluded_entities={\"Joe Smith\",\"Alice Jones\"},\n    debug=True\n)\ndeidentifier = Deidentification(config)\n```\n\n## Configuration Options\n\nThe `DeidentificationConfig` class supports the following options:\n\n- `spacy_load` (bool): Whether to load the spaCy model (default: True)\n- `spacy_model` (str): Name of the spaCy model to use (default: \"en_core_web_trf\")\n- `output_style` (DeidentificationOutputStyle): Output format - TEXT or HTML (default: TEXT)\n- `replacement` (str): Replacement text for identified names (default: \"PERSON\")\n- `debug` (bool): Enable debug output (default: False)\n\n## How It Works\n\nThe de-identification process follows these steps:\n\n1. Text is normalized for consistent processing\n2. spaCy processes the text to identify person entities\n3. Gender-specific pronouns are identified using a predefined list\n4. Entities and pronouns are sorted by their position in reverse order\n5. Replacements are made from end to beginning to maintain position accuracy\n6. The process repeats until no new entities are detected\n\nThe backward-processing strategy is key to accurate replacements, as it prevents position shifts from affecting subsequent replacements.\n\n## Debug Output\n\nWhen debug mode is enabled, the tool provides detailed information about:\n- Identified person entities\n- Found pronouns\n- Replacement positions and actions\n- Processing iterations\n\n## Contributing\n\nContributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.\n\n## License\n\nThis project is licensed under the MIT License - see the LICENSE file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjftuga%2Fdeidentification","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjftuga%2Fdeidentification","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjftuga%2Fdeidentification/lists"}