{"id":35066652,"url":"https://github.com/devo8604/cicd_llm_data_scraper","last_synced_at":"2026-04-11T17:01:53.541Z","repository":{"id":328898937,"uuid":"1117257228","full_name":"devo8604/CICD_LLM_DATA_SCRAPER","owner":"devo8604","description":"Automated pipeline for generating high-quality Q\u0026A training data from Git repositories. Processes source code with LLMs to create fine-tuning datasets. Features smart caching, resume support, MLX (Apple Silicon) \u0026 llama.cpp backends, multiple export formats (Alpaca, ChatML, etc).","archived":false,"fork":false,"pushed_at":"2025-12-23T05:11:15.000Z","size":388,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-12-29T01:33:30.815Z","etag":null,"topics":["alpaca","code-analysis","data-pipeline","dataset-generation","fine-tuning","instruction-tuning","llamacpp","llm","machine-learning","mlx","python","question-answering","sqlite","synthetic-data","training-data"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/devo8604.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-12-16T04:00:41.000Z","updated_at":"2025-12-16T23:28:32.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/devo8604/CICD_LLM_DATA_SCRAPER","commit_stats":null,"previous_names":["devo8604/cicd_llm_data_scraper"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/devo8604/CICD_LLM_DATA_SCRAPER","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devo8604%2FCICD_LLM_DATA_SCRAPER","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devo8604%2FCICD_LLM_DATA_SCRAPER/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devo8604%2FCICD_LLM_DATA_SCRAPER/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devo8604%2FCICD_LLM_DATA_SCRAPER/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/devo8604","download_url":"https://codeload.github.com/devo8604/CICD_LLM_DATA_SCRAPER/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devo8604%2FCICD_LLM_DATA_SCRAPER/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31687881,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-11T13:07:20.380Z","status":"ssl_error","status_checked_at":"2026-04-11T13:06:47.903Z","response_time":54,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["alpaca","code-analysis","data-pipeline","dataset-generation","fine-tuning","instruction-tuning","llamacpp","llm","machine-learning","mlx","python","question-answering","sqlite","synthetic-data","training-data"],"created_at":"2025-12-27T11:31:36.890Z","updated_at":"2026-04-11T17:01:53.534Z","avatar_url":"https://github.com/devo8604.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# LLM Data Pipeline\n\nAn automated pipeline for generating high-quality question-and-answer training data from Git repositories. This tool processes source code files and uses Large Language Models to create Q\u0026A pairs suitable for fine-tuning code-focused LLMs.\n\n## Table of Contents\n\n- [Features](#features)\n- [Quick Start](#quick-start)\n  - [Prerequisites](#prerequisites)\n  - [Installation](#installation)\n  - [Basic Usage](#basic-usage)\n- [Configuration](#configuration)\n  - [Option 1: llama.cpp (Recommended)](#option-1-llamacpp-recommended-for-most-users)\n  - [Option 2: MLX (Apple Silicon)](#option-2-mlx-apple-silicon-only)\n  - [MLX Model Management](#mlx-model-management)\n- [Commands](#commands)\n  - [scrape - Clone/Update Repositories](#scrape---cloneupdate-repositories)\n  - [prepare - Generate Q\u0026A Pairs](#prepare---generate-qa-pairs)\n  - [retry - Retry Failed Files](#retry---retry-failed-files)\n  - [export - Export Training Data](#export---export-training-data)\n- [Directory Structure](#directory-structure)\n- [Database Schema](#database-schema)\n  - [TrainingSamples](#trainingsamples)\n  - [ConversationTurns](#conversationturns)\n  - [FileHashes](#filehashes)\n  - [FailedFiles](#failedfiles)\n- [Troubleshooting](#troubleshooting)\n  - [MLX Issues (Apple Silicon)](#mlx-issues-apple-silicon)\n  - [llama.cpp Issues](#llamacpp-issues)\n  - [General Issues](#general-issues)\n- [Performance Tips](#performance-tips)\n- [File Exclusions](#file-exclusions)\n- [Examples](#examples)\n  - [Process a Specific Repository](#process-a-specific-repository)\n  - [Resume After Interruption](#resume-after-interruption)\n  - [Retry Failed Files](#retry-failed-files-1)\n- [License](#license)\n- [Contributing](#contributing)\n- [Support](#support)\n\n## Features\n\n- 🔄 **Automated Repository Management**: Clone and update Git repositories from a simple text file\n- 🤖 **Intelligent Q\u0026A Generation**: Uses LLMs to generate contextual questions and answers from code\n- 💾 **Smart Caching**: Tracks file hashes to avoid reprocessing unchanged files\n- 🔁 **Resume Support**: Automatically resumes from where it left off if interrupted\n- 🍎 **MLX Support**: Native Apple Silicon acceleration (M1/M2/M3 Macs)\n- 🦙 **llama.cpp Compatible**: Works with any OpenAI-compatible API endpoint\n- 📊 **Multiple Export Formats**: CSV, Alpaca, ChatML, Llama3, Mistral, Gemma formats\n- 🔋 **Battery Management**: Automatically pauses on low battery (macOS)\n- 🗃️ **SQLite Storage**: All data stored in a portable SQLite database\n\n## Quick Start\n\n### Prerequisites\n\n- Python 3.14 or higher\n- Git\n\n### Installation\n\n1. **Clone the repository**:\n   ```bash\n   git clone \u003crepository-url\u003e\n   cd cicdllm\n   ```\n\n2. **Create and activate virtual environment**:\n   ```bash\n   python3 -m venv venv\n   source venv/bin/activate  # On Windows: venv\\Scripts\\activate\n   ```\n\n3. **Install dependencies**:\n   ```bash\n   pip install -r requirements.txt\n   ```\n\n### Basic Usage\n\n1. **Create a `repos.txt` file** with Git repository URLs (one per line):\n   ```\n   https://github.com/user/repo1\n   https://github.com/user/repo2\n   ```\n\n2. **Configure your LLM backend** (see [Configuration](#configuration))\n\n3. **Clone repositories**:\n   ```bash\n   python3 main.py scrape\n   ```\n\n4. **Generate Q\u0026A pairs**:\n   ```bash\n   python3 main.py prepare\n   ```\n\n5. **Export training data**:\n   ```bash\n   python3 main.py export --template alpaca-jsonl --output-file training_data.jsonl\n   ```\n\n## Configuration\n\nThe pipeline supports two LLM backends:\n\n### Option 1: llama.cpp (Recommended for most users)\n\nEdit `src/config.py`:\n\n```python\n# LLM Client Settings\nUSE_MLX = False  # Use llama.cpp\nLLM_BASE_URL = \"http://localhost:11454\"  # Your llama.cpp server\nLLM_MODEL_NAME = \"your-model-name\"\n```\n\nStart your llama.cpp server:\n```bash\n# Example with llama-server\nllama-server -m path/to/model.gguf --port 11454\n```\n\n### Option 2: MLX (Apple Silicon only)\n\nFor M1/M2/M3 Macs with native acceleration:\n\n```bash\n# Install MLX dependencies\npip install mlx mlx-lm\n```\n\nEdit `src/config.py`:\n```python\nUSE_MLX = True  # Enable MLX\nMLX_MODEL_NAME = \"mlx-community/Qwen2.5-Coder-14B-Instruct-4bit\"\n```\n\n#### MLX Model Management\n\n```bash\n# List locally cached models\npython3 main.py mlx list\n\n# Download a model\npython3 main.py mlx download mlx-community/Qwen2.5-Coder-14B-Instruct-4bit\n\n# Get model information\npython3 main.py mlx info mlx-community/Qwen2.5-Coder-14B-Instruct-4bit\n\n# Remove a model\npython3 main.py mlx remove mlx-community/Qwen2.5-Coder-14B-Instruct-4bit\n```\n\n## Commands\n\n### `scrape` - Clone/Update Repositories\n\n```bash\npython3 main.py scrape [OPTIONS]\n```\n\nClones or updates all repositories listed in `repos.txt`.\n\n**Options:**\n- `--data-dir`: Directory for data storage (default: `data`)\n- `--max-log-files`: Maximum log files to keep (default: 5)\n\n### `prepare` - Generate Q\u0026A Pairs\n\n```bash\npython3 main.py prepare [OPTIONS]\n```\n\nProcesses files and generates question-answer pairs.\n\n**Options:**\n- `--max-tokens`: Maximum tokens for LLM responses (default: 500)\n- `--temperature`: Sampling temperature 0.0-2.0 (default: 0.7)\n- `--max-file-size`: Maximum file size in bytes (default: 5MB)\n- `--data-dir`: Directory for data storage\n- `--max-log-files`: Maximum log files to keep\n\n**Features:**\n- ✅ Skips unchanged files (uses SHA256 hashing)\n- ✅ Automatically resumes from interruption\n- ✅ Tracks failed files for retry\n- ✅ Excludes binary files, images, and large files\n- ✅ Progress bars for repositories and files\n\n### `retry` - Retry Failed Files\n\n```bash\npython3 main.py retry [OPTIONS]\n```\n\nAttempts to reprocess files that failed during `prepare`.\n\n### `export` - Export Training Data\n\n```bash\npython3 main.py export --template \u003cFORMAT\u003e --output-file \u003cPATH\u003e\n```\n\n**Required Arguments:**\n- `--template`: Output format (see below)\n- `--output-file`: Path to output file\n\n**Supported Formats:**\n- `csv` - Comma-separated values\n- `alpaca-jsonl` - Alpaca instruction format (JSONL)\n- `chatml-jsonl` - ChatML format (JSONL)\n- `llama3` - Llama 3 chat template\n- `mistral` - Mistral instruction format\n- `gemma` - Gemma chat template\n\n**Example:**\n```bash\npython3 main.py export --template alpaca-jsonl --output-file output.jsonl\n```\n\n## Directory Structure\n\n```\ncicdllm/\n├── data/\n│   └── pipeline.db          # SQLite database with Q\u0026A pairs\n├── logs/                    # Log files (rotated)\n├── repos/                   # Cloned repositories\n│   └── \u003corg\u003e/\n│       └── \u003crepo\u003e/\n├── src/                     # Source code\n│   ├── services/           # Service layer\n│   ├── config.py           # Configuration\n│   ├── data_pipeline.py    # Main pipeline\n│   ├── db_manager.py       # Database operations\n│   ├── llm_client.py       # LLM API client\n│   ├── mlx_client.py       # MLX client (Apple Silicon)\n│   └── ...\n├── tests/                   # Unit tests\n├── main.py                  # Entry point\n├── repos.txt               # Repository URLs\n└── requirements.txt        # Python dependencies\n```\n\n## Database Schema\n\nThe pipeline uses SQLite to store all generated data:\n\n### TrainingSamples\n\nStores Q\u0026A sample metadata.\n\n| Column                 | Type        | Description                    |\n| ---------------------- | ----------- | ------------------------------ |\n| `sample_id`            | `INTEGER`   | Primary key                    |\n| `dataset_source`       | `VARCHAR`   | Source file path               |\n| `creation_date`        | `TIMESTAMP` | When created                   |\n| `model_type_intended`  | `VARCHAR`   | Intended model type            |\n| `sample_quality_score` | `REAL`      | Quality score                  |\n| `is_multiturn`         | `BOOLEAN`   | Multi-turn conversation flag   |\n\n### ConversationTurns\n\nStores individual Q\u0026A turns.\n\n| Column          | Type      | Description                     |\n| --------------- | --------- | ------------------------------- |\n| `turn_id`       | `INTEGER` | Primary key                     |\n| `sample_id`     | `INTEGER` | Foreign key to TrainingSamples  |\n| `turn_index`    | `INTEGER` | Turn order                      |\n| `role`          | `VARCHAR` | 'user' or 'assistant'           |\n| `content`       | `TEXT`    | Question or answer text         |\n| `is_label`      | `BOOLEAN` | Label flag                      |\n| `metadata_json` | `TEXT`    | Additional metadata (JSON)      |\n\n### FileHashes\n\nTracks processed files to avoid reprocessing.\n\n| Column           | Type       | Description                 |\n| ---------------- | ---------- | --------------------------- |\n| `file_path`      | `TEXT`     | Primary key, file path      |\n| `content_hash`   | `TEXT`     | SHA256 hash                 |\n| `last_processed` | `DATETIME` | Last processing time        |\n| `sample_id`      | `INTEGER`  | Foreign key to TrainingSamples |\n\n### FailedFiles\n\nStores failed file information for retry.\n\n| Column      | Type        | Description         |\n| ----------- | ----------- | ------------------- |\n| `failed_id` | `INTEGER`   | Primary key         |\n| `file_path` | `TEXT`      | Failed file path    |\n| `reason`    | `TEXT`      | Failure reason      |\n| `failed_at` | `TIMESTAMP` | Failure timestamp   |\n\n## Troubleshooting\n\n### MLX Issues (Apple Silicon)\n\n**Memory Errors:**\n```\n[METAL] Command buffer execution failed: Insufficient Memory\n```\n\n**Solutions:**\n- Use smaller models (e.g., 7B instead of 30B parameters)\n- Close other memory-intensive applications\n- Recommended models by RAM:\n  - 8GB: 1B-3B parameter models\n  - 16GB: 3B-7B parameter models\n  - 24GB+: 7B-14B parameter models\n  - 32GB+: 14B+ parameter models\n\n**Model Loading Failures:**\n- Verify internet connectivity\n- Check model name is correct\n- Try a different model from MLX Community\n\n### llama.cpp Issues\n\n**Connection Refused:**\n```bash\n# Verify server is running\ncurl http://localhost:11454/v1/models\n\n# Check port matches config\n# src/config.py: LLM_BASE_URL = \"http://localhost:11454\"\n```\n\n**Model Not Found:**\n- Ensure model is loaded in llama.cpp\n- Check `LLM_MODEL_NAME` matches exactly\n- List available models: `curl http://localhost:11454/v1/models`\n\n### General Issues\n\n**Empty Q\u0026A Pairs:**\n- Check LLM is responding correctly\n- Increase `--max-tokens` if answers are truncated\n- Adjust `--temperature` for better variety\n\n**Processing Stuck:**\n- Check logs in `logs/` directory\n- Verify LLM server is responding\n- Use `Ctrl+C` to interrupt (state is saved automatically)\n\n**Database Locked:**\n- Ensure only one pipeline instance is running\n- Close any SQLite browser connections\n\n## Performance Tips\n\n1. **File Size Limits**: Adjust `MAX_FILE_SIZE` in config for your needs\n2. **Concurrent Processing**: Set `MAX_CONCURRENT_FILES` \u003e 1 for parallel processing\n3. **Batch Size**: Adjust `FILE_BATCH_SIZE` for optimal throughput\n4. **LLM Timeouts**: Increase `LLM_REQUEST_TIMEOUT` for slower models\n5. **Battery Management**: On macOS, pipeline pauses when battery \u003c 15%\n\n## File Exclusions\n\nThe following file types are automatically excluded:\n\n- Images: `.png`, `.jpg`, `.jpeg`, `.gif`, `.svg`\n- Archives: `.zip`, `.tar`, `.gz`\n- Binary: `.bin`, `.pack`, `.idx`, `.rev`\n- Documents: `.pdf`, `.pptx`\n- System: `.DS_Store`\n\nConfigure in `src/config.py`:\n```python\nEXCLUDED_FILE_EXTENSIONS = (\n    \".png\", \".jpg\", # ... add more\n)\n```\n\n## Examples\n\n### Process a Specific Repository\n\n```bash\n# Create repos.txt with one repository\necho \"https://github.com/user/awesome-project\" \u003e repos.txt\n\n# Scrape and process\npython3 main.py scrape\npython3 main.py prepare --max-tokens 1000 --temperature 0.8\n\n# Export to Alpaca format\npython3 main.py export --template alpaca-jsonl --output-file alpaca_data.jsonl\n```\n\n### Resume After Interruption\n\nThe pipeline automatically saves state. Simply run the command again:\n\n```bash\n# If interrupted during prepare\npython3 main.py prepare  # Resumes from last position\n```\n\n### Retry Failed Files\n\n```bash\n# After prepare completes\npython3 main.py retry  # Reprocesses all failed files\n```\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## Contributing\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md) for development setup, testing, and contribution guidelines.\n\n## Support\n\nFor issues and questions:\n- Check the [Troubleshooting](#troubleshooting) section\n- Review logs in the `logs/` directory\n- Open an issue on GitHub\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdevo8604%2Fcicd_llm_data_scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdevo8604%2Fcicd_llm_data_scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdevo8604%2Fcicd_llm_data_scraper/lists"}