{"id":31052821,"url":"https://github.com/maxime-cllt/datalint","last_synced_at":"2026-04-16T11:04:47.889Z","repository":{"id":313034033,"uuid":"1003640183","full_name":"Maxime-Cllt/DataLint","owner":"Maxime-Cllt","description":"Unsafe value program detection in CSV file","archived":false,"fork":false,"pushed_at":"2025-09-03T13:45:39.000Z","size":16193,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-03T15:17:48.006Z","etag":null,"topics":["ai","csv","huggingface","pytorch","rust"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Maxime-Cllt.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-06-17T12:55:47.000Z","updated_at":"2025-09-03T13:45:42.000Z","dependencies_parsed_at":"2025-09-03T15:28:01.220Z","dependency_job_id":null,"html_url":"https://github.com/Maxime-Cllt/DataLint","commit_stats":null,"previous_names":["maxime-cllt/datalint"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/Maxime-Cllt/DataLint","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Maxime-Cllt%2FDataLint","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Maxime-Cllt%2FDataLint/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Maxime-Cllt%2FDataLint/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/h
osts/GitHub/repositories/Maxime-Cllt%2FDataLint/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Maxime-Cllt","download_url":"https://codeload.github.com/Maxime-Cllt/DataLint/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Maxime-Cllt%2FDataLint/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":275193670,"owners_count":25421410,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-14T02:00:10.474Z","response_time":75,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","csv","huggingface","pytorch","rust"],"created_at":"2025-09-15T01:25:35.264Z","updated_at":"2026-04-16T11:04:46.785Z","avatar_url":"https://github.com/Maxime-Cllt.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n    \u003ch1\u003e📊 DataLint\u003c/h1\u003e\n    \u003cp\u003e\u003cem\u003eHigh-performance CSV data validation and anomaly detection tool\u003c/em\u003e\u003c/p\u003e\n\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/Rust-dea584?style=for-the-badge\u0026logo=rust\u0026logoColor=white\" alt=\"Rust\" /\u003e\n    \u003cimg src=\"https://img.shields.io/badge/PyTorch-EE4C2C?style=for-the-badge\u0026logo=pytorch\u0026logoColor=white\" alt=\"PyTorch\" /\u003e\n    \u003cimg 
src=\"https://img.shields.io/badge/Version-1.0.0-informational?style=for-the-badge\" alt=\"Version\" /\u003e\n    \u003cimg src=\"https://img.shields.io/badge/License-GPL--3.0-blue?style=for-the-badge\" alt=\"License\" /\u003e\n\u003c/div\u003e\n\n## 🚀 Overview\n\n**DataLint** is a production-ready tool that uses a machine learning model to prevent the ingestion of erroneous or malicious\ndata in CSV files.\nBuilt with Rust for performance, it validates CSV files by detecting erroneous,\nmalicious,\nor anomalous data patterns with a pre-trained neural network.\n\n### ✨ Key Features\n\n- 🔍 **AI-Powered Detection**: Leverages pre-trained neural networks for data anomaly detection,\n  and uses the [TinyBERT](https://huggingface.co/prajjwal1/bert-tiny) tokenizer for efficient data indexing\n- ⚡ **High Performance**: Built with Rust for maximum speed and memory efficiency\n- 📁 **CSV Processing**: Specialized for CSV file validation and analysis\n- 🛡️ **Security Focus**: Identifies potentially dangerous or malicious data patterns\n- 🔧 **Production Ready**: Optimized for server-side deployment in production environments\n- 📊 **JSON Output**: Generates detailed analysis reports in JSON format\n\n## 🎯 Use Cases\n\n- **Data Quality Assurance**: Validate CSV imports before processing\n- **Security Scanning**: Detect potentially malicious data injections\n- **Data Pipeline Integration**: Automated validation in ETL processes\n- **Compliance Checking**: Ensure data meets quality standards\n- **Anomaly Detection**: Identify outliers and unusual patterns\n\n## 📋 Prerequisites\n\n### Required Tools\n\n- **[Rust](https://www.rust-lang.org/tools/install)** (latest stable version)\n- **[Cargo](https://doc.rust-lang.org/cargo/getting-started/installation.html)** (included with Rust)\n\n### External Dependencies\n\n- **AI Model**: Pre-trained PyTorch model for data anomaly detection\n- **Tokenizer**: JSON-formatted vocabulary file for data indexing and 
tokenization\n- **PyTorch Runtime**: Required DLLs and libraries for model inference\n\n## 🛠️ Installation\n\n### 1. Clone the Repository\n\n```bash\ngit clone https://github.com/Maxime-Cllt/DataLint.git\ncd DataLint\n```\n\n### 2. Build the Project\n\n```bash\n# Development build\ncargo build\n\n# Optimized release build (recommended for production)\ncargo build --release\n```\n\n## ⚙️ Configuration\n\nCreate a `config.json` file in the same directory as the executable:\n\n```json\n{\n  \"model_path\": \"C:\\\\Users\\\\model\\\\neural\\\\perfage_ia\",\n  \"vocabulary_path\": \"C:\\\\Users\\\\tokenizer\\\\tokenizer.json\"\n}\n```\n\n### Configuration Options\n\n\u003ctable\u003e\n        \u003cthead\u003e\n            \u003ctr\u003e\n                \u003cth\u003eOption\u003c/th\u003e\n                \u003cth\u003eDescription\u003c/th\u003e\n            \u003c/tr\u003e\n        \u003c/thead\u003e\n        \u003ctbody\u003e\n            \u003ctr\u003e\n                \u003ctd\u003e\u003ccode\u003emodel_path\u003c/code\u003e\u003c/td\u003e\n                \u003ctd\u003ePath to the pre-trained PyTorch model directory\u003c/td\u003e\n            \u003c/tr\u003e\n            \u003ctr\u003e\n                \u003ctd\u003e\u003ccode\u003evocabulary_path\u003c/code\u003e\u003c/td\u003e\n                \u003ctd\u003ePath to the tokenizer JSON file for data processing\u003c/td\u003e\n            \u003c/tr\u003e\n        \u003c/tbody\u003e\n\u003c/table\u003e\n\n## 🚀 Usage\n\n### Command Line Interface\n\n```bash\n# Using cargo (development)\ncargo run --release \"input_file.csv\" \"output_report.json\"\n\n# Using compiled executable (production)\n./target/release/DataLint \"input_file.csv\" \"output_report.json\"\n\n# On Windows\n.\\target\\release\\DataLint.exe \"input_file.csv\" \"output_report.json\"\n```\n\n### Parameters\n\n- **Input File**: Path to the CSV file to be validated\n- **Output File**: Path where the JSON analysis report will be saved\n\n### Example 
Usage\n\n```bash\n# Analyze a customer data file\n./DataLint \"data/customers.csv\" \"reports/customer_analysis.json\"\n\n# Validate uploaded user data\n./DataLint \"uploads/user_data.csv\" \"validation/results.json\"\n```\n\n## 📊 Output Format\n\nDataLint generates detailed JSON reports with the following structure:\n\n```json\n{\n  \"analysed_file\": \"file.csv\",\n  \"ai_analyze\": 1000,\n  \"regex_analyze\": 1000,\n  \"time_ms\": 1234,\n  \"anomalies\": [\n    {\n      \"value\": \"#ERROR!\",\n      \"column\": \"\\\"Phone\\\"\",\n      \"score\": 0.9670525,\n      \"line\": 71049\n    },\n    {\n      \"value\": \"??\",\n      \"column\": \"\\\"Comment\\\"\",\n      \"score\": 0.90427655,\n      \"line\": 75392\n    }\n  ]\n}\n```\n\n## 🏗️ Dependencies Setup\n\n### PyTorch Installation\n\n1. **Install PyTorch**: Follow the [official installation guide](https://pytorch.org/get-started/locally/)\n2. **Copy DLLs**: Place all PyTorch DLL files in the same directory as the DataLint executable\n\n### Required PyTorch DLLs (Windows)\n\n- `torch_cpu.dll`\n- `torch_cuda.dll` (if using GPU)\n- `c10.dll`\n- `fbgemm.dll`\n- Additional dependency DLLs as required\n\n## 🔧 Development\n\n### Building from Source\n\nTo build DataLint from source, ensure you have Rust and Cargo installed, then run:\n\n```bash\ncargo build --release\n```\n\n## 🧪 Code Quality\n\n### Unit Tests\n\nRun the unit tests in the `tests` directory with:\n\n```bash\ncargo test\n```\n\n### Benchmarking\n\nCode is benchmarked using the `criterion` crate. To run benchmarks, use:\n\n```bash\ncargo bench\n```\n\n## 🤝 Contributing\n\n1. Fork the repository\n2. Create a feature branch (`git checkout -b feature/amazing-feature`)\n3. Commit your changes (`git commit -m 'Add amazing feature'`)\n4. Push to the branch (`git push origin feature/amazing-feature`)\n5. 
Open a Pull Request\n\n## 📄 License\n\nThis project is licensed under the GPL-3.0 License - see the [LICENSE](LICENSE) file for details.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaxime-cllt%2Fdatalint","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmaxime-cllt%2Fdatalint","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaxime-cllt%2Fdatalint/lists"}