https://github.com/kevindebenedetti/dataset-generator
Automated web-scraping to LLM-powered question-answer dataset generator with duplicate detection and optional Langfuse export.
- Host: GitHub
- URL: https://github.com/kevindebenedetti/dataset-generator
- Owner: KevinDeBenedetti
- Created: 2025-08-28T21:54:20.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2026-01-10T22:28:26.000Z (5 months ago)
- Last Synced: 2026-01-11T06:50:38.111Z (5 months ago)
- Topics: fastapi, nextjs, openai
- Language: Python
- Size: 5.86 MB
- Stars: 2
- Watchers: 0
- Forks: 1
- Open Issues: 5
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
# Dataset Generator
[GitHub Actions](https://github.com/KevinDeBenedetti/dataset-generator/actions)
[Codecov](https://codecov.io/gh/KevinDeBenedetti/dataset-generator)
A web-scraping tool that automatically generates question-answer datasets with an LLM and exports them in several formats, including Langfuse.
## Objective
Create quality datasets for training AI models by automatically scraping reliable sources and generating contextualized question-answer pairs. Export datasets to multiple formats, including Langfuse, for training data management.
## Quick Start
```bash
# Configuration
cp .env.example .env
# Edit .env with your API keys
# Launch
make start
```
## Architecture
The project has a modular architecture that separates concerns into distinct components:
- **Scraper**: Retrieval of web content from specified URLs
- **LLM Client**: Interaction with language models to generate question-answer pairs
- **Data Manager**: Data management and dataset storage with multiple export formats
- **Pipeline**: Orchestration of the complete dataset generation process
- **Export Module**: Advanced dataset export to various platforms (Langfuse, JSON, CSV, etc.)
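
As a rough illustration of how the components above could fit together, here is a minimal sketch. All class and method names (`QAPair`, `Scraper.fetch`, `LLMClient.generate_pairs`, `Pipeline.run`, and the fake stand-ins) are hypothetical and chosen for this example only; they are not taken from the project's code.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class QAPair:
    question: str
    answer: str
    source_url: str


class Scraper(Protocol):
    def fetch(self, url: str) -> str: ...


class LLMClient(Protocol):
    def generate_pairs(self, text: str, source_url: str) -> list[QAPair]: ...


@dataclass
class DataManager:
    """Keeps generated pairs in memory; a real one would persist them."""
    pairs: list[QAPair] = field(default_factory=list)

    def add(self, new_pairs: list[QAPair]) -> None:
        self.pairs.extend(new_pairs)


@dataclass
class Pipeline:
    """Orchestrates scrape -> generate -> store for each URL."""
    scraper: Scraper
    llm: LLMClient
    data: DataManager

    def run(self, urls: list[str]) -> list[QAPair]:
        for url in urls:
            text = self.scraper.fetch(url)
            self.data.add(self.llm.generate_pairs(text, url))
        return self.data.pairs


# Offline stand-ins (assumptions) so the sketch runs without network or API keys.
class FakeScraper:
    def fetch(self, url: str) -> str:
        return f"Content of {url}"


class FakeLLM:
    def generate_pairs(self, text: str, source_url: str) -> list[QAPair]:
        question = f"What does {source_url} say?"
        return [QAPair(question=question, answer=text, source_url=source_url)]


pipeline = Pipeline(FakeScraper(), FakeLLM(), DataManager())
result = pipeline.run(["https://example.com/a", "https://example.com/b"])
print(len(result))
```

Typing the pipeline against protocols rather than concrete classes keeps each component replaceable, which is one common way to get the separation of concerns described above.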
## Key Features
- **Multi-source Scraping**: Support for various web sources and content types
- **AI-Powered QA Generation**: Use LLMs to create question-answer pairs from scraped content
- **Multi-language Support**: Generate datasets in French, English, Spanish, and German
- **Langfuse Integration**: Direct export to Langfuse for dataset management and training workflows
- **Multiple Export Formats**: JSON, CSV, JSONL, and platform-specific formats
- **Quality Control**: Automated validation and filtering of generated content
- **Batch Processing**: Efficient handling of large-scale data generation
- **API Interface**: RESTful API for programmatic access and integration
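
The project description mentions duplicate detection as part of quality control. One simple way to approach it, shown here purely as a sketch, is to fingerprint a normalized form of each question and keep only the first pair per fingerprint. The function names and the dict layout (`question`, `answer`) are assumptions for this example; the project may use a different technique, such as fuzzy or embedding-based matching.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def fingerprint(question: str) -> str:
    """Stable hash of the normalized question."""
    return hashlib.sha256(normalize(question).encode("utf-8")).hexdigest()


def drop_duplicates(pairs: list[dict]) -> list[dict]:
    """Keep the first pair seen for each question fingerprint."""
    seen = set()
    unique = []
    for pair in pairs:
        key = fingerprint(pair["question"])
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique


pairs = [
    {"question": "What is FastAPI?", "answer": "A web framework."},
    {"question": "what is  fastapi", "answer": "A Python web framework."},
    {"question": "What is Langfuse?", "answer": "An LLM engineering platform."},
]
unique_pairs = drop_duplicates(pairs)
print(len(unique_pairs))
```

Exact matching on a normalized string only catches near-verbatim repeats; paraphrased questions would pass through this filter.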
## Workflow
1. **Scraping**: Retrieving raw web data from multiple sources
2. **Cleaning**: Processing and normalizing text to extract relevant content
3. **QA Generation**: Creating high-quality question-answer pairs via LLMs with configurable prompts
4. **Quality Assurance**: Automated validation and filtering of generated datasets
5. **Export**: Multi-format export including Langfuse integration for seamless training workflows
6. **Storage**: Persistent storage with metadata tracking and version control
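
To make the cleaning step (step 2) more concrete, the sketch below extracts visible text from raw HTML using only the standard library. The `TextExtractor` class, the `clean_html` helper, and the set of skipped tags are assumptions for illustration; the project's actual cleaning logic is not documented here and may rely on a dedicated parsing library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring content inside non-content tags."""

    SKIPPED = {"script", "style", "nav", "footer"}  # assumed noise tags

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        # Join chunks and collapse any remaining runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def clean_html(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<nav>Menu</nav><p>Hello   <b>world</b></p>"
    "<script>var x = 1;</script></body></html>"
)
print(clean_html(sample))
```

The cleaned text would then be the input to the QA generation step.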
## Export Options
- **Langfuse**: Direct integration for training data management
- **JSON/JSONL**: Standard formats for data interchange
- **CSV**: Tabular format for analysis and review
- **Custom Formats**: Extensible export system for specific requirements
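
For the file-based formats, a minimal sketch using only the Python standard library might look like this. The helper names and the field names (`question`, `answer`, `source`) are assumptions for the example, and the Langfuse export is deliberately left out because it depends on the Langfuse SDK and project-specific settings.

```python
import csv
import io
import json


def to_jsonl(pairs: list[dict]) -> str:
    """Serialize one JSON object per line, keeping non-ASCII text readable."""
    return "".join(json.dumps(pair, ensure_ascii=False) + "\n" for pair in pairs)


def to_csv(pairs: list[dict], fields=("question", "answer", "source")) -> str:
    """Serialize pairs as CSV with a header row; unknown keys are ignored."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(fields),
        extrasaction="ignore",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(pairs)
    return buffer.getvalue()


pairs = [
    {"question": "Q1, with comma", "answer": "A1", "source": "https://example.com"},
    {"question": "Où est la gare ?", "answer": "Au centre.", "source": "https://example.org"},
]
jsonl_text = to_jsonl(pairs)
csv_text = to_csv(pairs)
print(csv_text)
```

JSONL suits streaming and appending, while CSV is easier to review in a spreadsheet; the `csv` module handles quoting of commas inside fields.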
## Configuration
The tool supports extensive configuration options for:
- LLM model selection and parameters
- Export format preferences
- Quality thresholds and validation rules
- Batch processing settings
- API rate limiting and retry policies
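
As an illustration of how such options could be grouped and validated, here is a sketch of a settings object loaded from environment-style key-value pairs. Every name in it is hypothetical: the `DATASET_*` keys, the `Settings` fields, the default values, and the `retry_delays` helper are assumptions for this example and do not reflect the project's actual `.env` keys.

```python
from dataclasses import dataclass
from typing import Mapping

SUPPORTED_LANGUAGES = {"fr", "en", "es", "de"}
SUPPORTED_FORMATS = {"json", "jsonl", "csv", "langfuse"}


@dataclass(frozen=True)
class Settings:
    model: str = "example-model"  # placeholder, not a real model name
    language: str = "en"
    export_format: str = "jsonl"
    batch_size: int = 10
    max_retries: int = 3

    def __post_init__(self):
        if self.language not in SUPPORTED_LANGUAGES:
            raise ValueError(f"unsupported language: {self.language}")
        if self.export_format not in SUPPORTED_FORMATS:
            raise ValueError(f"unsupported export format: {self.export_format}")
        if self.batch_size < 1 or self.max_retries < 0:
            raise ValueError("batch_size must be >= 1 and max_retries >= 0")


# Hypothetical environment keys mapped to (field name, converter).
ENV_KEYS = {
    "DATASET_MODEL": ("model", str),
    "DATASET_LANGUAGE": ("language", str),
    "DATASET_EXPORT_FORMAT": ("export_format", str),
    "DATASET_BATCH_SIZE": ("batch_size", int),
    "DATASET_MAX_RETRIES": ("max_retries", int),
}


def load_settings(env: Mapping[str, str]) -> Settings:
    """Build Settings from a mapping such as os.environ, using defaults for missing keys."""
    overrides = {}
    for key, (name, convert) in ENV_KEYS.items():
        if key in env:
            overrides[name] = convert(env[key])
    return Settings(**overrides)


def retry_delays(max_retries: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Exponential backoff delays in seconds, capped at `cap`."""
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries)]


settings = load_settings({"DATASET_LANGUAGE": "fr", "DATASET_BATCH_SIZE": "25"})
print(settings, retry_delays(settings.max_retries))
```

Passing the mapping explicitly (for example `load_settings(os.environ)`) keeps the loader easy to test, and validating in `__post_init__` surfaces bad values at startup rather than mid-run.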
## Supported Languages
- **French (fr)**: French language dataset generation
- **English (en)**: English language dataset generation
- **Spanish (es)**: Spanish language dataset generation
- **German (de)**: German language dataset generation
## Testing & Coverage
The test suite runs locally and in CI, and coverage is measured on every run.
```bash
# Run tests with coverage (HTML report)
make test
# Run tests for CI (XML report, enforces 70% minimum)
make test-ci
# Run pre-commit hooks (includes tests on push)
uv run prek run --all-files
```
### Coverage Reports
- **Local**: After running tests, open `htmlcov/index.html` for the detailed coverage report
- **CI/CD**: Coverage reports are automatically generated and uploaded on every PR
- **Codecov**: [View detailed coverage on Codecov](https://codecov.io/gh/KevinDeBenedetti/dataset-generator)
Current coverage threshold: CI requires a minimum of **70%** to pass.