https://github.com/devo8604/cicd_llm_data_scraper

Automated pipeline for generating high-quality Q&A training data from Git repositories. Processes source code with LLMs to create fine-tuning datasets. Features smart caching, resume support, MLX (Apple Silicon) & llama.cpp backends, multiple export formats (Alpaca, ChatML, etc).
https://github.com/devo8604/cicd_llm_data_scraper

alpaca code-analysis data-pipeline dataset-generation fine-tuning instruction-tuning llamacpp llm machine-learning mlx python question-answering sqlite synthetic-data training-data

Last synced: 3 months ago
JSON representation

Host: GitHub
URL: https://github.com/devo8604/cicd_llm_data_scraper
Owner: devo8604
License: mit
Created: 2025-12-16T04:00:41.000Z (6 months ago)
Default Branch: main
Last Pushed: 2025-12-23T05:11:15.000Z (6 months ago)
Last Synced: 2025-12-29T01:33:30.815Z (6 months ago)
Topics: alpaca, code-analysis, data-pipeline, dataset-generation, fine-tuning, instruction-tuning, llamacpp, llm, machine-learning, mlx, python, question-answering, sqlite, synthetic-data, training-data
Language: Python
Homepage:
Size: 379 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

Awesome Lists containing this project

README

          # LLM Data Pipeline

An automated pipeline for generating high-quality question-and-answer training data from Git repositories. This tool processes source code files and uses Large Language Models to create Q&A pairs suitable for fine-tuning code-focused LLMs.

## Table of Contents

- [Features](#features)

- [Quick Start](#quick-start)

  - [Prerequisites](#prerequisites)

  - [Installation](#installation)

  - [Basic Usage](#basic-usage)

- [Configuration](#configuration)

  - [Option 1: llama.cpp (Recommended)](#option-1-llamacpp-recommended-for-most-users)

  - [Option 2: MLX (Apple Silicon)](#option-2-mlx-apple-silicon-only)

  - [MLX Model Management](#mlx-model-management)

- [Commands](#commands)

  - [scrape - Clone/Update Repositories](#scrape---cloneupdate-repositories)

  - [prepare - Generate Q&A Pairs](#prepare---generate-qa-pairs)

  - [retry - Retry Failed Files](#retry---retry-failed-files)

  - [export - Export Training Data](#export---export-training-data)

- [Directory Structure](#directory-structure)

- [Database Schema](#database-schema)

  - [TrainingSamples](#trainingsamples)

  - [ConversationTurns](#conversationturns)

  - [FileHashes](#filehashes)

  - [FailedFiles](#failedfiles)

- [Troubleshooting](#troubleshooting)

  - [MLX Issues (Apple Silicon)](#mlx-issues-apple-silicon)

  - [llama.cpp Issues](#llamacpp-issues)

  - [General Issues](#general-issues)

- [Performance Tips](#performance-tips)

- [File Exclusions](#file-exclusions)

- [Examples](#examples)

  - [Process a Specific Repository](#process-a-specific-repository)

  - [Resume After Interruption](#resume-after-interruption)

  - [Retry Failed Files](#retry-failed-files-1)

- [License](#license)

- [Contributing](#contributing)

- [Support](#support)

## Features

- 🔄 **Automated Repository Management**: Clone and update Git repositories from a simple text file

- 🤖 **Intelligent Q&A Generation**: Uses LLMs to generate contextual questions and answers from code

- 💾 **Smart Caching**: Tracks file hashes to avoid reprocessing unchanged files

- 🔁 **Resume Support**: Automatically resumes from where it left off if interrupted

- 🍎 **MLX Support**: Native Apple Silicon acceleration (M1/M2/M3 Macs)

- 🦙 **llama.cpp Compatible**: Works with any OpenAI-compatible API endpoint

- 📊 **Multiple Export Formats**: CSV, Alpaca, ChatML, Llama3, Mistral, Gemma formats

- 🔋 **Battery Management**: Automatically pauses on low battery (macOS)

- 🗃️ **SQLite Storage**: All data stored in a portable SQLite database

## Quick Start

### Prerequisites

- Python 3.14 or higher

- Git

### Installation

1. **Clone the repository**:

   ```bash

   git clone 

   cd cicdllm

   ```

2. **Create and activate virtual environment**:

   ```bash

   python3 -m venv venv

   source venv/bin/activate  # On Windows: venv\Scripts\activate

   ```

3. **Install dependencies**:

   ```bash

   pip install -r requirements.txt

   ```

### Basic Usage

1. **Create a `repos.txt` file** with Git repository URLs (one per line):

   ```

   https://github.com/user/repo1

   https://github.com/user/repo2

   ```

2. **Configure your LLM backend** (see [Configuration](#configuration))

3. **Clone repositories**:

   ```bash

   python3 main.py scrape

   ```

4. **Generate Q&A pairs**:

   ```bash

   python3 main.py prepare

   ```

5. **Export training data**:

   ```bash

   python3 main.py export --template alpaca-jsonl --output-file training_data.jsonl

   ```

## Configuration

The pipeline supports two LLM backends:

### Option 1: llama.cpp (Recommended for most users)

Edit `src/config.py`:

```python

# LLM Client Settings

USE_MLX = False  # Use llama.cpp

LLM_BASE_URL = "http://localhost:11454"  # Your llama.cpp server

LLM_MODEL_NAME = "your-model-name"

```

Start your llama.cpp server:

```bash

# Example with llama-server

llama-server -m path/to/model.gguf --port 11454

```

### Option 2: MLX (Apple Silicon only)

For M1/M2/M3 Macs with native acceleration:

```bash

# Install MLX dependencies

pip install mlx mlx-lm

```

Edit `src/config.py`:

```python

USE_MLX = True  # Enable MLX

MLX_MODEL_NAME = "mlx-community/Qwen2.5-Coder-14B-Instruct-4bit"

```

#### MLX Model Management

```bash

# List locally cached models

python3 main.py mlx list

# Download a model

python3 main.py mlx download mlx-community/Qwen2.5-Coder-14B-Instruct-4bit

# Get model information

python3 main.py mlx info mlx-community/Qwen2.5-Coder-14B-Instruct-4bit

# Remove a model

python3 main.py mlx remove mlx-community/Qwen2.5-Coder-14B-Instruct-4bit

```

## Commands

### `scrape` - Clone/Update Repositories

```bash

python3 main.py scrape [OPTIONS]

```

Clones or updates all repositories listed in `repos.txt`.

**Options:**

- `--data-dir`: Directory for data storage (default: `data`)

- `--max-log-files`: Maximum log files to keep (default: 5)

### `prepare` - Generate Q&A Pairs

```bash

python3 main.py prepare [OPTIONS]

```

Processes files and generates question-answer pairs.

**Options:**

- `--max-tokens`: Maximum tokens for LLM responses (default: 500)

- `--temperature`: Sampling temperature 0.0-2.0 (default: 0.7)

- `--max-file-size`: Maximum file size in bytes (default: 5MB)

- `--data-dir`: Directory for data storage

- `--max-log-files`: Maximum log files to keep

**Features:**

- ✅ Skips unchanged files (uses SHA256 hashing)

- ✅ Automatically resumes from interruption

- ✅ Tracks failed files for retry

- ✅ Excludes binary files, images, and large files

- ✅ Progress bars for repositories and files

### `retry` - Retry Failed Files

```bash

python3 main.py retry [OPTIONS]

```

Attempts to reprocess files that failed during `prepare`.

### `export` - Export Training Data

```bash

python3 main.py export --template  --output-file 

```

**Required Arguments:**

- `--template`: Output format (see below)

- `--output-file`: Path to output file

**Supported Formats:**

- `csv` - Comma-separated values

- `alpaca-jsonl` - Alpaca instruction format (JSONL)

- `chatml-jsonl` - ChatML format (JSONL)

- `llama3` - Llama 3 chat template

- `mistral` - Mistral instruction format

- `gemma` - Gemma chat template

**Example:**

```bash

python3 main.py export --template alpaca-jsonl --output-file output.jsonl

```

## Directory Structure

```

cicdllm/

├── data/

│   └── pipeline.db          # SQLite database with Q&A pairs

├── logs/                    # Log files (rotated)

├── repos/                   # Cloned repositories

│   └── /

│       └── /

├── src/                     # Source code

│   ├── services/           # Service layer

│   ├── config.py           # Configuration

│   ├── data_pipeline.py    # Main pipeline

│   ├── db_manager.py       # Database operations

│   ├── llm_client.py       # LLM API client

│   ├── mlx_client.py       # MLX client (Apple Silicon)

│   └── ...

├── tests/                   # Unit tests

├── main.py                  # Entry point

├── repos.txt               # Repository URLs

└── requirements.txt        # Python dependencies

```

## Database Schema

The pipeline uses SQLite to store all generated data:

### TrainingSamples

Stores Q&A sample metadata.

| Column                 | Type        | Description                    |

| ---------------------- | ----------- | ------------------------------ |

| `sample_id`            | `INTEGER`   | Primary key                    |

| `dataset_source`       | `VARCHAR`   | Source file path               |

| `creation_date`        | `TIMESTAMP` | When created                   |

| `model_type_intended`  | `VARCHAR`   | Intended model type            |

| `sample_quality_score` | `REAL`      | Quality score                  |

| `is_multiturn`         | `BOOLEAN`   | Multi-turn conversation flag   |

### ConversationTurns

Stores individual Q&A turns.

| Column          | Type      | Description                     |

| --------------- | --------- | ------------------------------- |

| `turn_id`       | `INTEGER` | Primary key                     |

| `sample_id`     | `INTEGER` | Foreign key to TrainingSamples  |

| `turn_index`    | `INTEGER` | Turn order                      |

| `role`          | `VARCHAR` | 'user' or 'assistant'           |

| `content`       | `TEXT`    | Question or answer text         |

| `is_label`      | `BOOLEAN` | Label flag                      |

| `metadata_json` | `TEXT`    | Additional metadata (JSON)      |

### FileHashes

Tracks processed files to avoid reprocessing.

| Column           | Type       | Description                 |

| ---------------- | ---------- | --------------------------- |

| `file_path`      | `TEXT`     | Primary key, file path      |

| `content_hash`   | `TEXT`     | SHA256 hash                 |

| `last_processed` | `DATETIME` | Last processing time        |

| `sample_id`      | `INTEGER`  | Foreign key to TrainingSamples |

### FailedFiles

Stores failed file information for retry.

| Column      | Type        | Description         |

| ----------- | ----------- | ------------------- |

| `failed_id` | `INTEGER`   | Primary key         |

| `file_path` | `TEXT`      | Failed file path    |

| `reason`    | `TEXT`      | Failure reason      |

| `failed_at` | `TIMESTAMP` | Failure timestamp   |

## Troubleshooting

### MLX Issues (Apple Silicon)

**Memory Errors:**

```

[METAL] Command buffer execution failed: Insufficient Memory

```

**Solutions:**

- Use smaller models (e.g., 7B instead of 30B parameters)

- Close other memory-intensive applications

- Recommended models by RAM:

  - 8GB: 1B-3B parameter models

  - 16GB: 3B-7B parameter models

  - 24GB+: 7B-14B parameter models

  - 32GB+: 14B+ parameter models

**Model Loading Failures:**

- Verify internet connectivity

- Check model name is correct

- Try a different model from MLX Community

### llama.cpp Issues

**Connection Refused:**

```bash

# Verify server is running

curl http://localhost:11454/v1/models

# Check port matches config

# src/config.py: LLM_BASE_URL = "http://localhost:11454"

```

**Model Not Found:**

- Ensure model is loaded in llama.cpp

- Check `LLM_MODEL_NAME` matches exactly

- List available models: `curl http://localhost:11454/v1/models`

### General Issues

**Empty Q&A Pairs:**

- Check LLM is responding correctly

- Increase `--max-tokens` if answers are truncated

- Adjust `--temperature` for better variety

**Processing Stuck:**

- Check logs in `logs/` directory

- Verify LLM server is responding

- Use `Ctrl+C` to interrupt (state is saved automatically)

**Database Locked:**

- Ensure only one pipeline instance is running

- Close any SQLite browser connections

## Performance Tips

1. **File Size Limits**: Adjust `MAX_FILE_SIZE` in config for your needs

2. **Concurrent Processing**: Set `MAX_CONCURRENT_FILES` > 1 for parallel processing

3. **Batch Size**: Adjust `FILE_BATCH_SIZE` for optimal throughput

4. **LLM Timeouts**: Increase `LLM_REQUEST_TIMEOUT` for slower models

5. **Battery Management**: On macOS, pipeline pauses when battery < 15%

## File Exclusions

The following file types are automatically excluded:

- Images: `.png`, `.jpg`, `.jpeg`, `.gif`, `.svg`

- Archives: `.zip`, `.tar`, `.gz`

- Binary: `.bin`, `.pack`, `.idx`, `.rev`

- Documents: `.pdf`, `.pptx`

- System: `.DS_Store`

Configure in `src/config.py`:

```python

EXCLUDED_FILE_EXTENSIONS = (

    ".png", ".jpg", # ... add more

)

```

## Examples

### Process a Specific Repository

```bash

# Create repos.txt with one repository

echo "https://github.com/user/awesome-project" > repos.txt

# Scrape and process

python3 main.py scrape

python3 main.py prepare --max-tokens 1000 --temperature 0.8

# Export to Alpaca format

python3 main.py export --template alpaca-jsonl --output-file alpaca_data.jsonl

```

### Resume After Interruption

The pipeline automatically saves state. Simply run the command again:

```bash

# If interrupted during prepare

python3 main.py prepare  # Resumes from last position

```

### Retry Failed Files

```bash

# After prepare completes

python3 main.py retry  # Reprocesses all failed files

```

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup, testing, and contribution guidelines.

## Support

For issues and questions:

- Check the [Troubleshooting](#troubleshooting) section

- Review logs in the `logs/` directory

- Open an issue on GitHub

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/devo8604/cicd_llm_data_scraper

Awesome Lists containing this project

README