https://github.com/shreypandit/llm-database-cleaning
DBMS project for LLM-based cleaning of a dataset
- Host: GitHub
- URL: https://github.com/shreypandit/llm-database-cleaning
- Owner: ShreyPandit
- License: apache-2.0
- Created: 2024-12-11T02:18:35.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-12-11T15:13:03.000Z (over 1 year ago)
- Last Synced: 2025-02-07T06:13:50.996Z (over 1 year ago)
- Language: Jupyter Notebook
- Size: 2.42 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Text Data Noise Generator and Cleaner
A toolkit for injecting data quality issues into clean text and for cleaning noisy text, designed for educational and testing purposes.
## Project Overview
This project provides tools to:
1. Generate realistic data quality issues in clean text
2. Clean noisy text data using prompt engineering
3. Test and validate data cleaning approaches
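One way to approach the third goal is to score how close a cleaned string is to the clean reference, compared with the noisy version. The sketch below is only an illustration, not part of the repository: the helper names `similarity` and `evaluate_cleaning` are hypothetical, and the character-level ratio from the standard library's `difflib` is an assumed metric.

```python
from difflib import SequenceMatcher


def similarity(reference: str, candidate: str) -> float:
    """Return a similarity ratio between 0 and 1 for two strings."""
    return SequenceMatcher(None, reference, candidate).ratio()


def evaluate_cleaning(clean: str, noisy: str, cleaned: str) -> dict:
    """Compare the noisy and cleaned texts against the clean reference."""
    before = similarity(clean, noisy)
    after = similarity(clean, cleaned)
    return {"before": before, "after": after, "improved": after > before}


# Strings taken from the Example section of this README.
report = evaluate_cleaning(
    clean="The quick brown fox jumps over the lazy dog.",
    noisy="The quick brown\u200b fox jumps jumps jumps jumps over the lazy dog #R$T2k9pL@",
    cleaned="The quick brown fox jumps over the lazy dog",
)
print(report)
```

A ratio close to 1 for `after` suggests the cleaning step recovered most of the original text; an exact-match check may be stricter than needed when punctuation is lost during noising.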
## Components
### 1. Noise Generator (`noise_generator.py`)
Introduces controlled noise into clean text data:
- Unicode corruption and invisible characters
- Random string injection
- Word duplication patterns
- HTML/XML artifacts
- Control characters and encoding issues
- Whitespace corruption
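To make a few of the noise types above concrete, here is a minimal sketch of what such functions might look like. The names `add_unicode_noise` and `add_random_string` appear in this README's configuration section, but their signatures and bodies here are assumptions; `add_word_duplication` is a hypothetical name. The actual code in `noise_generator.py` may differ.

```python
import random
import string

# A few invisible characters commonly found in corrupted text.
ZERO_WIDTH = ["\u200b", "\u200c", "\u200d", "\ufeff"]


def add_unicode_noise(text: str, rng: random.Random) -> str:
    """Insert one invisible zero-width character at a random position."""
    pos = rng.randint(0, len(text))
    return text[:pos] + rng.choice(ZERO_WIDTH) + text[pos:]


def add_random_string(text: str, rng: random.Random, length: int = 10) -> str:
    """Append a random run of letters, digits, and symbols."""
    alphabet = string.ascii_letters + string.digits + "#$@%&"
    junk = "".join(rng.choice(alphabet) for _ in range(length))
    return f"{text} {junk}"


def add_word_duplication(text: str, rng: random.Random, max_repeats: int = 3) -> str:
    """Repeat one randomly chosen word between 1 and max_repeats extra times."""
    words = text.split()
    if not words:
        return text
    idx = rng.randrange(len(words))
    repeats = rng.randint(1, max_repeats)
    words[idx:idx + 1] = [words[idx]] * (repeats + 1)
    return " ".join(words)


rng = random.Random(42)  # seeded so runs are reproducible
sample = "The quick brown fox jumps over the lazy dog."
for noise_fn in (add_unicode_noise, add_random_string, add_word_duplication):
    print(repr(noise_fn(sample, rng)))
```

Passing an explicit `random.Random` instance is a design choice for this sketch: it keeps the generated noise reproducible, which helps when comparing cleaning approaches on the same corrupted input.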
### 2. Cleaning Prompt (`cleaning_prompt.txt`)
Prompt for Large Language Models to clean noisy text:
- Detailed cleaning instructions
- Two-shot examples
- Specific noise type handling
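The structure described above (instructions followed by two worked examples) could be assembled roughly as follows. This is a sketch only: the instruction wording, the two example pairs, and the `build_prompt` helper are all hypothetical and are not the contents of `cleaning_prompt.txt`.

```python
# Hypothetical instruction text; the real prompt lives in cleaning_prompt.txt.
INSTRUCTIONS = (
    "Clean the text below. Remove invisible Unicode characters, random "
    "injected strings, duplicated words, HTML/XML tags, control characters, "
    "and stray whitespace. Return only the cleaned text."
)

# Two illustrative noisy -> clean pairs (not taken from the repository).
EXAMPLES = [
    ("Hello\u200b   world world world", "Hello world"),
    ("<p>Data is  clean</p> xK#9@q", "Data is clean"),
]


def build_prompt(noisy_text: str) -> str:
    """Assemble instructions, two worked examples, and the new input."""
    parts = [INSTRUCTIONS, ""]
    for noisy, clean in EXAMPLES:
        parts += [f"Input: {noisy}", f"Output: {clean}", ""]
    parts += [f"Input: {noisy_text}", "Output:"]
    return "\n".join(parts)


print(build_prompt("The quick brown fox jumps jumps over the lazy dog"))
```

Ending the prompt with an open `Output:` line is one common way to nudge a model toward returning only the cleaned text.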
## Requirements
- Python 3.7+
## Installation
```bash
git clone https://github.com/shreypandit/llm-database-cleaning.git
cd llm-database-cleaning
pip install -r requirements.txt
```
## Usage
### Adding Noise to Clean Data
```python
from noise_generator import add_noise_to_text

add_noise_to_text(
    text_file='clean_data.txt',
    output_file='noisy_data.txt',
    noise_probability=0.3
)
```
### Cleaning Noisy Data
1. Copy the cleaning prompt from `cleaning_prompt.txt`
2. Use it with your preferred LLM
3. Input your noisy text for cleaning
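If you prefer to script these steps rather than paste text by hand, the loop might look like the sketch below. It is an assumption-laden illustration: `clean_lines` and `fake_llm` are hypothetical names, the `Input:`/`Output:` framing is assumed, and `fake_llm` is a stand-in that you would replace with a call to whichever LLM client you use.

```python
from pathlib import Path
from typing import Callable, List


def clean_lines(
    prompt_template: str,
    noisy_lines: List[str],
    call_llm: Callable[[str], str],
) -> List[str]:
    """Send each noisy line to the model together with the cleaning prompt."""
    cleaned = []
    for line in noisy_lines:
        full_prompt = f"{prompt_template}\n\nInput: {line}\nOutput:"
        cleaned.append(call_llm(full_prompt).strip())
    return cleaned


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: returns the last input minus zero-width spaces."""
    last_input = prompt.rsplit("Input: ", 1)[1].rsplit("\nOutput:", 1)[0]
    return last_input.replace("\u200b", "")


prompt_path = Path("cleaning_prompt.txt")
# Fall back to a one-line prompt when the file is not present.
template = (
    prompt_path.read_text(encoding="utf-8")
    if prompt_path.exists()
    else "Clean the text."
)
print(clean_lines(template, ["The quick brown\u200b fox"], fake_llm))
```

Keeping the model behind a plain callable means the same loop works with any provider and can be exercised offline with a stub.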
## Example
Input (Clean):
```
The quick brown fox jumps over the lazy dog.
```
With Noise:
```
The quick brown\u200B fox jumps jumps jumps jumps over the lazy dog #R$T2k9pL@
```
Cleaned:
```
The quick brown fox jumps over the lazy dog
```
## Configuration
Adjust noise parameters in `noise_generator.py`:
```python
noise_probability = 0.3 # 30% chance of noise per line
noise_functions = [add_unicode_noise, add_random_string] # Select noise types
```
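As a rough illustration of how these two settings could interact, the sketch below applies one randomly chosen noise function to each line with probability `noise_probability`. The `apply_noise` helper, its `seed` parameter, and the two toy noise functions are hypothetical; the real logic in `noise_generator.py` may combine the settings differently.

```python
import random
from typing import Callable, List, Sequence

NoiseFn = Callable[[str], str]


def apply_noise(
    lines: Sequence[str],
    noise_functions: Sequence[NoiseFn],
    noise_probability: float = 0.3,
    seed: int = 0,
) -> List[str]:
    """Corrupt each line with the given probability using one random noise function."""
    rng = random.Random(seed)
    noisy = []
    for line in lines:
        if noise_functions and rng.random() < noise_probability:
            line = rng.choice(noise_functions)(line)
        noisy.append(line)
    return noisy


# Toy single-argument noise functions, used only for this illustration.
def double_spaces(line: str) -> str:
    """Simulate whitespace corruption by doubling every space."""
    return line.replace(" ", "  ")


def append_marker(line: str) -> str:
    """Simulate random string injection with a fixed junk suffix."""
    return f"{line} #R$T"


lines = ["one two", "three four", "five six"]
print(apply_noise(lines, [double_spaces, append_marker], noise_probability=0.3))
```

With `noise_probability=0.0` every line passes through unchanged, and with `1.0` every line is corrupted, which gives two easy sanity checks when tuning the setting.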