An open API service indexing awesome lists of open source software.

https://github.com/richie-rich90454/training-generator

Training Generator is a cross-platform desktop app built with Electron and Node.js that converts documents (PDF, DOCX, DOC, RTF, TXT, MD, HTML) into structured AI training data. Using local Ollama models, it extracts instructions, Q&A pairs, and conversation data for machine learning, AI fine-tuning, and NLP workflows, while keeping all processing.
https://github.com/richie-rich90454/training-generator

ai ai-data-analysis ai-training-data cpp desktop-app document-conversion electron html-css-javascript jsonl local-ai ml ollama ollama-api training-materials

Last synced: 4 months ago
JSON representation

Training Generator is a cross-platform desktop app built with Electron and Node.js that converts documents (PDF, DOCX, DOC, RTF, TXT, MD, HTML) into structured AI training data. Using local Ollama models, it extracts instructions, Q&A pairs, and conversation data for machine learning, AI fine-tuning, and NLP workflows, while keeping all processing.

Awesome Lists containing this project

README

          

# ๐Ÿค– Training Generator

[![CI](https://github.com/richie-rich90454/training-generator/actions/workflows/ci.yml/badge.svg)](https://github.com/richie-rich90454/training-generator/actions/workflows/ci.yml)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![Release](https://img.shields.io/github/v/release/richie-rich90454/training-generator?color=blue&label=latest%20release)](https://github.com/richie-rich90454/training-generator/releases)
[![Stars](https://img.shields.io/github/stars/richie-rich90454/training-generator?style=social)](https://github.com/richie-rich90454/training-generator/stargazers)
[![Forks](https://img.shields.io/github/forks/richie-rich90454/training-generator?style=social)](https://github.com/richie-rich90454/training-generator/network/members)
[![Contributors](https://img.shields.io/github/contributors/richie-rich90454/training-generator)](https://github.com/richie-rich90454/training-generator/graphs/contributors)
[![Open Issues](https://img.shields.io/github/issues/richie-rich90454/training-generator)](https://github.com/richie-rich90454/training-generator/issues)
[![Open Pull Requests](https://img.shields.io/github/issues-pr/richie-rich90454/training-generator)](https://github.com/richie-rich90454/training-generator/pulls)
[![Last Commit](https://img.shields.io/github/last-commit/richie-rich90454/training-generator)](https://github.com/richie-rich90454/training-generator/commits/main)
[![Discussions](https://img.shields.io/badge/Discussions-join-blue)](https://github.com/richie-rich90454/training-generator/discussions)
[![Downloads](https://img.shields.io/github/downloads/richie-rich90454/training-generator/total?color=blue)](https://github.com/richie-rich90454/training-generator/releases)
[![Build Status](https://img.shields.io/github/actions/workflow/status/richie-rich90454/training-generator/ci.yml?branch=main)](https://github.com/richie-rich90454/training-generator/actions/workflows/ci.yml)

---

### ๐Ÿš€ Built With

[![Electron](https://img.shields.io/badge/Electron-39.2.7-47848F.svg)](https://www.electronjs.org/) [![Node.js](https://img.shields.io/badge/Node.js-18+-339933.svg)](https://nodejs.org/) [![Ollama](https://img.shields.io/badge/Ollama-Local%20AI-9cf)](https://ollama.com/) [![Vite](https://img.shields.io/badge/Vite-7.3.0-646cff.svg)](https://vitejs.dev/) [![Axios](https://img.shields.io/badge/Axios-1.7.9-0055ff.svg)](https://axios-http.com/) [![HTML](https://img.shields.io/badge/HTML5-E34F26?style=flat&logo=html5&logoColor=white)](https://developer.mozilla.org/en-US/docs/Web/HTML) [![CSS](https://img.shields.io/badge/CSS3-1572B6?style=flat&logo=css3&logoColor=white)](https://developer.mozilla.org/en-US/docs/Web/CSS) [![JavaScript](https://img.shields.io/badge/JavaScript-F7DF1E?style=flat&logo=javascript&logoColor=black)](https://developer.mozilla.org/en-US/docs/Web/JavaScript)

**Training Generator** is a cross-platform **desktop application** built with **Electron** and **Node.js** that converts documents (PDF, DOCX, DOC, RTF, TXT, MD, HTML) into **AI training data** using **local Ollama models**. Extract instructions, Q&A pairs, conversation data, and structured output for machine learning, NLP workflows, or AI fine-tuning โ€” all processed offline for privacy and speed.

- Convert PDF, DOCX, TXT, MD, HTML, RTF documents to AI training data
- Generate instruction/Q&A pairs & conversation datasets
- Multi-language support: EN, CN, FR, DE, ES, JP, KR
- Real-time output preview & batch processing
- Local processing for privacy โ€” no data leaves your computer

## โœจ Features

### ๐Ÿ“ **File Support**
- **Multi-format Processing**: PDF, DOCX, DOC, RTF, TXT, MD, and HTML files
- **Smart Text Extraction**: Advanced parsing for complex document structures
- **Large File Handling**: Support for files up to 100MB with efficient chunking

### ๐Ÿง  **AI Processing**
- **Ollama Integration**: Uses local Ollama API for private AI processing
- **Multiple Processing Types**:
- ๐Ÿ“ **Instruction Extraction** (Q&A pairs for fine-tuning)
- ๐Ÿ’ฌ **Conversation Generation** (Dialog-style training data)
- ๐Ÿ”ช **Text Chunking** (Intelligent document segmentation)
- ๐ŸŽจ **Custom Analysis** (User-defined prompt templates)
- **Multi-language Support**: English, Chinese, Spanish, French, German, Japanese, Korean

### ๐Ÿ“Š **Output & Export**
- **Flexible Formats**: JSONL (Alpaca style), ChatML, CSV, Plain Text
- **Batch Processing**: Process multiple files simultaneously
- **Progress Tracking**: Real-time progress bars and detailed logging

### ๐ŸŽจ **User Experience**
- **Modern UI**: Clean, responsive interface with drag & drop support
- **Native Splash Screen**: C++/WinAPI native splash screen on Windows for fast startup
- **Dark/Light Themes**: System-aware theme switching
- **Preset Management**: Save and load processing configurations
- **Real-time Preview**: Live output preview before export

## ๐Ÿš€ Quick Start

### Prerequisites
- **Node.js 18+** and npm (Recommended: Node.js 24+ for best compatibility)
- **Ollama** (for AI processing) - [Download here](https://ollama.com/)

### Dependency Compatibility
All project dependencies are verified to be compatible with Node.js 18+:

| Dependency | Version | Node.js Compatibility | Purpose |
|------------|---------|----------------------|---------|
| **Electron** | ^39.2.7 | 18+ (uses Node.js 20.9.0) | Desktop application framework |
| **Vite** | ^7.3.0 | 18+ | Build tool and dev server |
| **Axios** | ^1.7.9 | 18+ | HTTP client for Ollama API |
| **html-to-text** | ^9.0.5 | 18+ | HTML document parsing |
| **mammoth** | ^1.11.0 | 18+ | DOCX document parsing |
| **officeparser** | ^3.0.0 | 18+ | DOC document parsing |
| **pdf-parse** | ^1.1.4 | 18+ | PDF document parsing |
| **rtf-parser-fixes** | ^1.3.4 | 18+ | RTF document parsing |
| **electron-builder** | ^26.0.12 | 18+ | Application packaging |

**Note**: The `fs` package (`^0.0.1-security`) is a placeholder package and works with all Node.js versions.

### Installation & Running

```bash
# Clone the repository
git clone https://github.com/richie-rich90454/training-generator.git
cd training-generator

# Install dependencies
npm install

# Start Ollama (in a separate terminal)
ollama serve

# Pull a model (example)
ollama pull llama3.2

# Run the application
npm run dev
```

### Quick Demo
```bash
# Test basic functionality
node test-app.js

# Test Ollama connection
node test-ollama.js

# Run complete system test
node test-complete.js
```

## ๐Ÿ“– Detailed Usage

### Development Mode
```bash
npm run dev
```
Starts Vite dev server and Electron app with hot reload. Perfect for development and testing.

### Production Mode
```bash
npm start
```
Runs the built Electron application from the distribution.

### Building for Distribution
```bash
# Build the application
npm run build

# Create platform-specific packages
npm run package # All platforms
npm run package:win # Windows only
npm run package:mac # macOS only
npm run package:linux # Linux only
```

### Automated Release Packaging
When a new GitHub release is created, the following packages are automatically built and attached to the release:

**macOS (Apple Silicon/M-series only):**
- DMG installer (`.dmg`)
- Portable ZIP archive (`.zip`) - unpacked application bundle

**Linux (x64 & arm64):**
- AppImage (`.AppImage`) - portable executable
- Snap package (`.snap`) - universal Linux package
- DEB package (`.deb`) - Debian/Ubuntu installer

**Note:** Windows packages are not automatically built but can be created manually using `npm run package:win`.

The automated packaging workflow only runs on stable releases (skips alpha/beta tags).

## ๐Ÿ—๏ธ Project Structure

```
training-generator/
โ”œโ”€โ”€ src/ # Source code
โ”‚ โ”œโ”€โ”€ main.js # Electron main process
โ”‚ โ”œโ”€โ”€ preload.js # IPC bridge between main and renderer
โ”‚ โ”œโ”€โ”€ bootstrap.js # Application bootstrap logic
โ”‚ โ”œโ”€โ”€ renderer/
โ”‚ โ”‚ โ””โ”€โ”€ main.js # Frontend application logic
โ”‚ โ”œโ”€โ”€ core/
โ”‚ โ”‚ โ””โ”€โ”€ fileParser.js # Multi-format document parser
โ”‚ โ”œโ”€โ”€ styles/
โ”‚ โ”‚ โ””โ”€โ”€ main.css # Application styles
โ”‚ โ”œโ”€โ”€ prompts/ # AI prompt templates (multiple languages)
โ”‚ โ””โ”€โ”€ workers/ # Web workers for background processing
โ”œโ”€โ”€ assets/ # Application assets (icons, fonts)
โ”œโ”€โ”€ native-splash/ # Native C++/WinAPI splash screen (Windows)
โ”œโ”€โ”€ index.html # Main application window
โ”œโ”€โ”€ vite.config.js # Vite build configuration
โ”œโ”€โ”€ package.json # Project dependencies and scripts
โ””โ”€โ”€ README.md # This file
```

## โš™๏ธ Configuration

The application provides extensive configuration options:

### Processing Settings
- **Model Selection**: Choose from available Ollama models
- **Chunk Size**: Adjust text segmentation (500-10000 characters)
- **Temperature**: Control AI creativity (0.0-1.0)
- **Output Format**: JSONL, ChatML, CSV, or Plain Text
- **Language**: Multiple output language options

### Application Preferences
- **Theme**: Auto, Light, or Dark mode
- **Window Behavior**: Remember size/position, start maximized
- **Auto-save**: Automatic preset saving
- **File Size Limits**: Configure maximum file size (10-1000MB)

### System Integration
- **Ollama Auto-detection**: Automatic connection to local Ollama instance
- **Progress Persistence**: Resume interrupted processing sessions
- **Export Location**: Remember last used export directory

## ๐Ÿ”ง Troubleshooting

### Common Issues & Solutions

#### ๐Ÿšซ Ollama Connection Issues
```bash
# Check if Ollama is running
curl http://localhost:11434/api/tags

# Start Ollama if not running
ollama serve

# Verify service status (Windows)
netstat -ano | findstr :11434
```

#### ๐Ÿ“„ File Parsing Problems
- **Scanned PDFs**: Use OCR software first for image-based PDFs
- **Large Files**: Reduce chunk size or split files manually
- **Encoding Issues**: Convert files to UTF-8 text format first

#### โšก Performance Optimization
- **GPU Acceleration**: Ensure Ollama is using GPU (check Ollama logs)
- **Memory Management**: Close other GPU-intensive applications
- **Chunk Size**: Adjust based on model context window (2000-4000 tokens optimal)

#### ๐Ÿ› Application Errors
```bash
# Clear dependencies and rebuild
npm cache clean --force
npm ci

# Check Node.js version
node --version # Should be 18+

# Run in debug mode
npm run dev -- --debug
```

### Debug Mode
For advanced troubleshooting, enable debug logging:
```bash
npm run dev -- --debug
```
Logs are available in:
- **Windows**: Application console output
- **macOS/Linux**: `~/.config/Training Generator/logs/`

## ๐Ÿงช Testing

The project includes comprehensive test suites:

```bash
# Run all tests
npm test

# Individual test scripts
node test_language_prompts.js # Language prompt validation
node test-app.js # Basic application functionality
node test-complete.js # Complete system integration test
node test-ollama.js # Ollama connection and model testing
```

## ๐Ÿ›ฃ๏ธ Roadmap & Future Features

### Planned Enhancements
- **๐Ÿ”Œ Plugin System**: Extensible processing pipelines
- **๐ŸŒ Cloud Integration**: Optional cloud model support (OpenAI, Anthropic, etc.)
- **๐Ÿ“ˆ Advanced Analytics**: Processing statistics and quality metrics
- **๐Ÿ”„ Batch Scheduling**: Automated processing queues
- **๐Ÿ” Content Filtering**: Smart filtering of sensitive information

### In Development
- **๐Ÿงฉ Modular Architecture**: Plugin-based file parser system
- **๐Ÿ“Š Performance Dashboard**: Real-time processing metrics
- **๐Ÿ”— API Server Mode**: REST API for headless operation

### Community Requests
- **๐Ÿ—‚๏ธ Folder Monitoring**: Watch folders for automatic processing
- **๐Ÿ“ฑ Mobile Companion**: Mobile app for remote monitoring
- **๐Ÿ” Enterprise Features**: User management, audit logging, compliance tools

## ๐Ÿค Contributing

We welcome contributions! Here's how you can help:

1. **Fork the repository**
2. **Create a feature branch**: `git checkout -b feature/amazing-feature`
3. **Make your changes**
4. **Run tests**: `npm test`
5. **Submit a pull request**

### Development Setup
```bash
# Install development dependencies
npm install

# Set up pre-commit hooks (if configured)
npm run prepare

# Start development server
npm run dev
```

### Code Style
- Use consistent formatting (Prettier configuration coming soon)
- Add comments for complex logic
- Update documentation for new features
- Include tests for new functionality

## ๐Ÿ“„ License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## ๐Ÿ™ Support

- **Documentation**: [GitHub Wiki](https://github.com/richie-rich90454/training-generator/wiki)
- **Issue Tracker**: [Report Bugs](https://github.com/richie-rich90454/training-generator/issues)
- **Discussions**: [GitHub Discussions](https://github.com/richie-rich90454/training-generator/discussions)
- If this project helps you, please โญ **Star** the repo and share feedback via [Discussions](https://github.com/richie-rich90454/training-generator/discussions)!

## ๐Ÿ“ธ Screenshots

*Screenshot placeholders - add actual screenshots to the `screenshots/` directory*

## In Development
- ๐ŸŸข Modular Architecture
- ๐ŸŸก Performance Dashboard
- ๐Ÿ”ด API Server Mode

---

**๐Ÿ”’ Privacy Note**: This application processes documents locally using your own Ollama instance. No data is sent to external servers unless you configure custom API endpoints.

**โšก Performance Tip**: For best results, use GPU-accelerated Ollama models and ensure sufficient system memory for large documents.

**๐Ÿ› Found a Bug?** Please report it on the [issue tracker](https://github.com/richie-rich90454/training-generator/issues) with detailed steps to reproduce.