{"id":29084840,"url":"https://github.com/readytensor/rt_roberta_pii_redactor","last_synced_at":"2025-08-12T17:34:42.044Z","repository":{"id":224392064,"uuid":"763140717","full_name":"readytensor/rt_roberta_pii_redactor","owner":"readytensor","description":"Roberta for PII redaction ","archived":false,"fork":false,"pushed_at":"2024-08-06T15:33:53.000Z","size":151,"stargazers_count":1,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-06-27T22:12:04.339Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/readytensor.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2024-02-25T17:03:07.000Z","updated_at":"2024-08-06T15:33:57.000Z","dependencies_parsed_at":"2024-02-25T18:30:19.722Z","dependency_job_id":"9260fedc-6449-4b68-9c40-ec04d03acfd5","html_url":"https://github.com/readytensor/rt_roberta_pii_redactor","commit_stats":null,"previous_names":["readytensor/rt_roberta_pii_redactor"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/readytensor/rt_roberta_pii_redactor","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/readytensor%2Frt_roberta_pii_redactor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/readytensor%2Frt_roberta_pii_redactor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/readytensor%2Frt_roberta_pii_redactor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/readytensor
%2Frt_roberta_pii_redactor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/readytensor","download_url":"https://codeload.github.com/readytensor/rt_roberta_pii_redactor/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/readytensor%2Frt_roberta_pii_redactor/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270107482,"owners_count":24528669,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-12T02:00:09.011Z","response_time":80,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-27T22:11:24.316Z","updated_at":"2025-08-12T17:34:42.021Z","avatar_url":"https://github.com/readytensor.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Personally identifiable information (PII) Redactor with RoBERTa\n\nPII Redactor model using RoBERTa. The goal of this model is to remove personally identifiable information from text documents.\n\nPII includes:\n\n- Names\n- Dates\n- Emails\n- Phone numbers\n- Addresses\n- URLs\n\n---\nHere are the highlights of this implementation: \u003cbr/\u003e\n\n- **RoBERTa** model trained using **transformers** library. 
The model is trained to identify names, addresses and dates.\n- **Regex** logic to capture phone numbers, emails and URLs.\n- **FakeGenerator** module to generate fake information.\n- **Redactor** module to replace PII with fake information.\n\n## Project Structure\n\nThe following is the directory structure of the project:\n\n- **`model_inputs_outputs/`**: This directory contains files that are either inputs to, or outputs from, the model. This directory is further divided into:\n  - **`/inputs/`**: This directory contains the input .txt files to be redacted.\n  - **`/model`**: This directory stores the model used for redaction along with the tokenizer used for tokenizing the text files.\n  - **`/outputs/`**: This directory will contain the output files after running the model on the input files.\n- **`src/`**: This directory holds the source code for the project. It contains the following:\n  - **`config/`**: This directory holds configuration files for data preprocessing, model hyperparameters, paths, etc.\n  - **`main.py`**: This script runs the model on the text files inside the **inputs** directory.\n  - **`utils.py`**: This script contains utility functions used by the other scripts.\n- **`.gitignore`**: This file specifies the files and folders that should be ignored by Git.\n- **`LICENSE`**: This file contains the license for the project.\n- **`requirements.txt`**: This file lists the dependencies for the main code in the `src` directory.\n- **`label2id.json`**: This file contains the label encoding for the token classes that were used to train the model.\n- **`README.md`**: This file (this particular document) contains the documentation for the project, explaining how to set it up and use it.\n\n## Usage\n\n- Save the data you want to redact as .txt or .pdf files\n- Move the files into the **/model_inputs_outputs/inputs** directory\n- Run the **main.py** script\n- Collect the result files from the **/model_inputs_outputs/outputs** directory\n\n\n## 
Requirements\n\nDependencies for the main model implementation in `src` are listed in the file `requirements.txt`.\nYou can install these packages by running the following command from the root of your project directory:\n\n```bash\npip install -r requirements.txt\n```\n\n## LICENSE\n\nThis project is provided under the Apache-2.0 License. Please see the [LICENSE](LICENSE) file for more information.\n\n## Contact Information\n\nRepository created by Ready Tensor, Inc. Visit https://www.readytensor.ai/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Freadytensor%2Frt_roberta_pii_redactor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Freadytensor%2Frt_roberta_pii_redactor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Freadytensor%2Frt_roberta_pii_redactor/lists"}