https://github.com/edujbarrios/anonymizertool
A sophisticated tool that anonymizes content and produces embedding-ready JSON files for RAG-LLMs
Last synced: 8 months ago
- Host: GitHub
- URL: https://github.com/edujbarrios/anonymizertool
- Owner: edujbarrios
- License: mit
- Created: 2025-02-12T10:54:54.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-02-12T11:21:07.000Z (8 months ago)
- Last Synced: 2025-02-12T11:48:38.799Z (8 months ago)
- Language: Python
- Homepage:
- Size: 0 Bytes
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# PDF anonymizer tool
### https://anonymizertool.streamlit.app/
A tool that lets users upload a PDF and get back an anonymized PDF or embedding-ready JSON files.
Using regex, it can anonymize:
- Spanish IDs (NIF / DNI)
- Phone numbers
- Addresses

### Example:
```python
'email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
```

**More patterns can be added by editing the `aonymize_text()` function in `src/utils.py` and supplying the corresponding regex.**
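
As a rough illustration of the regex approach described above, here is a minimal sketch. It is not the code in `src/utils.py`: the names `PATTERNS`, `redact`, and `to_records` are hypothetical, the DNI and phone patterns are simplified assumptions, and the JSON layout is only one plausible shape for "embedding-ready" output.

```python
import json
import re

# Hypothetical pattern table (an assumption, not the tool's actual rules).
# The email regex is adapted from the README example; it uses [A-Za-z]
# because a `|` inside a character class matches a literal pipe.
PATTERNS = {
    "email": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
    "dni": r"\b\d{8}[A-Za-z]\b",   # simplified: 8 digits plus a control letter
    "phone": r"\b[6789]\d{8}\b",   # simplified: 9-digit Spanish number, no prefix
}


def redact(text: str, patterns: dict[str, str] = PATTERNS) -> str:
    """Replace every match of each pattern with a [LABEL] placeholder."""
    for label, pattern in patterns.items():
        text = re.sub(pattern, f"[{label.upper()}]", text)
    return text


def to_records(text: str, chunk_size: int = 200) -> list[dict]:
    """Split text into fixed-size chunks, one JSON-serializable dict per chunk."""
    return [
        {"id": i, "text": text[start:start + chunk_size]}
        for i, start in enumerate(range(0, len(text), chunk_size))
    ]


sample = "Contact Ana at ana.perez@example.com or 612345678. DNI: 12345678Z."
clean = redact(sample)
print(clean)
# Contact Ana at [EMAIL] or [PHONE]. DNI: [DNI].
print(json.dumps(to_records(clean), ensure_ascii=False, indent=2))
```

Adding a new category in this sketch only means adding one more entry to `PATTERNS`, which is likely similar in spirit to adding a regex in `src/utils.py`.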
## Set up guide
**A fully detailed setup guide will be added soon.** For now, note that this tool uses **Streamlit** for the UI and other libraries for the underlying processing. Check `pyproject.toml` for more details.
To run the program:
`streamlit run app.py`
## Contributing
Contributions are welcome. Open a pull request if you have updates to this code.
## ToDo:
- Add a setup guide to the README
- Support full anonymization across a large directory of PDFs
- Support Excel files