https://github.com/maxime-cllt/datalint

Unsafe value program detection in CSV file

# 📊 DataLint

**High-performance CSV data validation and anomaly detection tool**
## 🚀 Overview

**DataLint** is a production-ready machine learning tool designed to prevent the ingestion of erroneous or malicious
data from CSV files. Built in Rust for performance, it validates CSV files by flagging erroneous, malicious,
or otherwise anomalous values using a pre-trained model.

### ✨ Key Features

- 🔍 **AI-Powered Detection**: Leverages a pre-trained neural network for intelligent anomaly detection,
  using the [TinyBERT](https://huggingface.co/prajjwal1/bert-tiny) tokenizer for efficient tokenization
- ⚡ **High Performance**: Built with Rust for maximum speed and memory efficiency
- 📝 **CSV Processing**: Specialized for CSV file validation and analysis
- 🛡️ **Security Focus**: Identifies potentially dangerous or malicious data patterns
- 🔧 **Production Ready**: Optimized for server-side deployment in production environments
- 📊 **JSON Output**: Generates detailed analysis reports in JSON format

## 🎯 Use Cases

- **Data Quality Assurance**: Validate CSV imports before processing
- **Security Scanning**: Detect potentially malicious data injections
- **Data Pipeline Integration**: Automated validation in ETL processes
- **Compliance Checking**: Ensure data meets quality standards
- **Anomaly Detection**: Identify outliers and unusual patterns

## 📋 Prerequisites

### Required Tools

- **[Rust](https://www.rust-lang.org/tools/install)** (latest stable version)
- **[Cargo](https://doc.rust-lang.org/cargo/getting-started/installation.html)** (included with Rust)

### External Dependencies

- **AI Model**: Pre-trained PyTorch model for data anomaly detection
- **Tokenizer**: JSON-formatted vocabulary file for data indexing and tokenization
- **PyTorch Runtime**: Required DLLs and libraries for model inference

## ๐Ÿ› ๏ธ Installation

### 1. Clone the Repository

```bash
git clone https://github.com/Maxime-Cllt/DataLint.git
cd DataLint
```

### 2. Build the Project

```bash
# Development build
cargo build

# Optimized release build (recommended for production)
cargo build --release
```

## โš™๏ธ Configuration

Create a `config.json` file in the same directory as the executable:

```json
{
  "model_path": "C:\\Users\\model\\neural\\perfage_ia",
  "vocabulary_path": "C:\\Users\\tokenizer\\tokenizer.json"
}
```

### Configuration Options

| Option            | Description                                         |
|-------------------|-----------------------------------------------------|
| `model_path`      | Path to the pre-trained PyTorch model directory     |
| `vocabulary_path` | Path to the tokenizer JSON file for data processing |
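Since both options point at files on disk, it can help to sanity-check the configuration before launching the tool. A minimal sketch in Python (the `check_config` helper is not part of DataLint; the key names follow the example above):

```python
import json
from pathlib import Path

def check_config(config_path: str) -> list:
    """Return a list of problems found in a DataLint-style config file."""
    problems = []
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    for key in ("model_path", "vocabulary_path"):
        value = config.get(key)
        if value is None:
            problems.append(f"missing key: {key}")
        elif not Path(value).exists():
            problems.append(f"{key} does not exist: {value}")
    return problems
```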

## 🚀 Usage

### Command Line Interface

```bash
# Using cargo (development; note the `--` separating cargo's options from the program's arguments)
cargo run --release -- "input_file.csv" "output_report.json"

# Using compiled executable (production)
./target/release/DataLint "input_file.csv" "output_report.json"

# On Windows
.\target\release\DataLint.exe "input_file.csv" "output_report.json"
```

### Parameters

- **Input File**: Path to the CSV file to be validated
- **Output File**: Path where the JSON analysis report will be saved

### Example Usage

```bash
# Analyze a customer data file
./DataLint "data/customers.csv" "reports/customer_analysis.json"

# Validate uploaded user data
./DataLint "uploads/user_data.csv" "validation/results.json"
```
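In an automated pipeline, a typical pattern is to run DataLint and gate the import on its report. A hypothetical Python wrapper (the binary path and the `anomalies` field follow the examples in this README; adjust to your environment):

```python
import json
import subprocess

def run_datalint(binary: str, input_csv: str, report_path: str) -> None:
    """Invoke the DataLint executable on a CSV file, raising on a non-zero exit."""
    subprocess.run([binary, input_csv, report_path], check=True)

def is_clean(report_path: str) -> bool:
    """True if the DataLint report contains no anomalies."""
    with open(report_path, encoding="utf-8") as f:
        report = json.load(f)
    return len(report.get("anomalies", [])) == 0

# Example gating step (paths are illustrative):
# run_datalint("./target/release/DataLint", "data/customers.csv", "report.json")
# if not is_clean("report.json"):
#     raise SystemExit("CSV rejected: anomalies detected")
```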

## 📊 Output Format

DataLint generates detailed JSON reports with the following structure:

```json
{
  "analysed_file": "file.csv",
  "ai_analyze": 1000,
  "regex_analyze": 1000,
  "time_ms": 1234,
  "anomalies": [
    {
      "value": "#ERROR!",
      "column": "\"Phone\"",
      "score": 0.9670525,
      "line": 71049
    },
    {
      "value": "??",
      "column": "\"Comment\"",
      "score": 0.90427655,
      "line": 75392
    }
  ]
}
```
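Downstream tooling can consume this report directly. As one sketch, grouping anomalies by column makes it easy to see which fields are problematic (field names taken from the sample report above):

```python
import json
from collections import Counter

def anomalies_per_column(report_path: str) -> Counter:
    """Count reported anomalies per CSV column from a DataLint report."""
    with open(report_path, encoding="utf-8") as f:
        report = json.load(f)
    return Counter(a["column"] for a in report.get("anomalies", []))
```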

## ๐Ÿ—๏ธ Dependencies Setup

### PyTorch Installation

1. **Install PyTorch**: Follow the [official installation guide](https://pytorch.org/get-started/locally/)
2. **Copy DLLs**: Place all PyTorch DLL files in the same directory as the DataLint executable

### Required PyTorch DLLs (Windows)

- `torch_cpu.dll`
- `torch_cuda.dll` (if using GPU)
- `c10.dll`
- `fbgemm.dll`
- Additional dependency DLLs as required

## 🔧 Development

### Building from Source

To build DataLint from source, ensure you have Rust and Cargo installed, then run:

```bash
cargo build --release
```

## 🧪 Code Quality

### Unit Tests

Run the unit tests in the `tests` directory with:

```bash
cargo test
```

### Benchmarking

Benchmarks are written with the `criterion` crate. To run them, use:

```bash
cargo bench
```

## ๐Ÿค Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## 📄 License

This project is licensed under the GPL-3.0 License - see the [LICENSE](LICENSE) file for details.