https://github.com/richie-rich90454/training-generator
Training Generator is a cross-platform desktop app built with Electron and Node.js that converts documents (PDF, DOCX, DOC, RTF, TXT, MD, HTML) into structured AI training data. Using local Ollama models, it extracts instructions, Q&A pairs, and conversation data for machine learning, AI fine-tuning, and NLP workflows, while keeping all processing.
https://github.com/richie-rich90454/training-generator
ai ai-data-analysis ai-training-data cpp desktop-app document-conversion electron html-css-javascript jsonl local-ai ml ollama ollama-api training-materials
Last synced: 4 months ago
JSON representation
Training Generator is a cross-platform desktop app built with Electron and Node.js that converts documents (PDF, DOCX, DOC, RTF, TXT, MD, HTML) into structured AI training data. Using local Ollama models, it extracts instructions, Q&A pairs, and conversation data for machine learning, AI fine-tuning, and NLP workflows, while keeping all processing.
- Host: GitHub
- URL: https://github.com/richie-rich90454/training-generator
- Owner: richie-rich90454
- License: mit
- Created: 2025-12-30T22:12:45.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2026-01-23T12:35:43.000Z (5 months ago)
- Last Synced: 2026-01-24T04:49:50.471Z (5 months ago)
- Topics: ai, ai-data-analysis, ai-training-data, cpp, desktop-app, document-conversion, electron, html-css-javascript, jsonl, local-ai, ml, ollama, ollama-api, training-materials
- Language: JavaScript
- Homepage:
- Size: 1.87 MB
- Stars: 2
- Watchers: 0
- Forks: 2
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md
Awesome Lists containing this project
README
# ๐ค Training Generator
[](https://github.com/richie-rich90454/training-generator/actions/workflows/ci.yml)
[](https://opensource.org/licenses/MIT)
[](https://github.com/richie-rich90454/training-generator/releases)
[](https://github.com/richie-rich90454/training-generator/stargazers)
[](https://github.com/richie-rich90454/training-generator/network/members)
[](https://github.com/richie-rich90454/training-generator/graphs/contributors)
[](https://github.com/richie-rich90454/training-generator/issues)
[](https://github.com/richie-rich90454/training-generator/pulls)
[](https://github.com/richie-rich90454/training-generator/commits/main)
[](https://github.com/richie-rich90454/training-generator/discussions)
[](https://github.com/richie-rich90454/training-generator/releases)
[](https://github.com/richie-rich90454/training-generator/actions/workflows/ci.yml)
---
### ๐ Built With
[](https://www.electronjs.org/) [](https://nodejs.org/) [](https://ollama.com/) [](https://vitejs.dev/) [](https://axios-http.com/) [](https://developer.mozilla.org/en-US/docs/Web/HTML) [](https://developer.mozilla.org/en-US/docs/Web/CSS) [](https://developer.mozilla.org/en-US/docs/Web/JavaScript)
**Training Generator** is a cross-platform **desktop application** built with **Electron** and **Node.js** that converts documents (PDF, DOCX, DOC, RTF, TXT, MD, HTML) into **AI training data** using **local Ollama models**. Extract instructions, Q&A pairs, conversation data, and structured output for machine learning, NLP workflows, or AI fine-tuning โ all processed offline for privacy and speed.
- Convert PDF, DOCX, TXT, MD, HTML, RTF documents to AI training data
- Generate instruction/Q&A pairs & conversation datasets
- Multi-language support: EN, CN, FR, DE, ES, JP, KR
- Real-time output preview & batch processing
- Local processing for privacy โ no data leaves your computer
## โจ Features
### ๐ **File Support**
- **Multi-format Processing**: PDF, DOCX, DOC, RTF, TXT, MD, and HTML files
- **Smart Text Extraction**: Advanced parsing for complex document structures
- **Large File Handling**: Support for files up to 100MB with efficient chunking
### ๐ง **AI Processing**
- **Ollama Integration**: Uses local Ollama API for private AI processing
- **Multiple Processing Types**:
- ๐ **Instruction Extraction** (Q&A pairs for fine-tuning)
- ๐ฌ **Conversation Generation** (Dialog-style training data)
- ๐ช **Text Chunking** (Intelligent document segmentation)
- ๐จ **Custom Analysis** (User-defined prompt templates)
- **Multi-language Support**: English, Chinese, Spanish, French, German, Japanese, Korean
### ๐ **Output & Export**
- **Flexible Formats**: JSONL (Alpaca style), ChatML, CSV, Plain Text
- **Batch Processing**: Process multiple files simultaneously
- **Progress Tracking**: Real-time progress bars and detailed logging
### ๐จ **User Experience**
- **Modern UI**: Clean, responsive interface with drag & drop support
- **Native Splash Screen**: C++/WinAPI native splash screen on Windows for fast startup
- **Dark/Light Themes**: System-aware theme switching
- **Preset Management**: Save and load processing configurations
- **Real-time Preview**: Live output preview before export
## ๐ Quick Start
### Prerequisites
- **Node.js 18+** and npm (Recommended: Node.js 24+ for best compatibility)
- **Ollama** (for AI processing) - [Download here](https://ollama.com/)
### Dependency Compatibility
All project dependencies are verified to be compatible with Node.js 18+:
| Dependency | Version | Node.js Compatibility | Purpose |
|------------|---------|----------------------|---------|
| **Electron** | ^39.2.7 | 18+ (uses Node.js 20.9.0) | Desktop application framework |
| **Vite** | ^7.3.0 | 18+ | Build tool and dev server |
| **Axios** | ^1.7.9 | 18+ | HTTP client for Ollama API |
| **html-to-text** | ^9.0.5 | 18+ | HTML document parsing |
| **mammoth** | ^1.11.0 | 18+ | DOCX document parsing |
| **officeparser** | ^3.0.0 | 18+ | DOC document parsing |
| **pdf-parse** | ^1.1.4 | 18+ | PDF document parsing |
| **rtf-parser-fixes** | ^1.3.4 | 18+ | RTF document parsing |
| **electron-builder** | ^26.0.12 | 18+ | Application packaging |
**Note**: The `fs` package (`^0.0.1-security`) is a placeholder package and works with all Node.js versions.
### Installation & Running
```bash
# Clone the repository
git clone https://github.com/richie-rich90454/training-generator.git
cd training-generator
# Install dependencies
npm install
# Start Ollama (in a separate terminal)
ollama serve
# Pull a model (example)
ollama pull llama3.2
# Run the application
npm run dev
```
### Quick Demo
```bash
# Test basic functionality
node test-app.js
# Test Ollama connection
node test-ollama.js
# Run complete system test
node test-complete.js
```
## ๐ Detailed Usage
### Development Mode
```bash
npm run dev
```
Starts Vite dev server and Electron app with hot reload. Perfect for development and testing.
### Production Mode
```bash
npm start
```
Runs the built Electron application from the distribution.
### Building for Distribution
```bash
# Build the application
npm run build
# Create platform-specific packages
npm run package # All platforms
npm run package:win # Windows only
npm run package:mac # macOS only
npm run package:linux # Linux only
```
### Automated Release Packaging
When a new GitHub release is created, the following packages are automatically built and attached to the release:
**macOS (Apple Silicon/M-series only):**
- DMG installer (`.dmg`)
- Portable ZIP archive (`.zip`) - unpacked application bundle
**Linux (x64 & arm64):**
- AppImage (`.AppImage`) - portable executable
- Snap package (`.snap`) - universal Linux package
- DEB package (`.deb`) - Debian/Ubuntu installer
**Note:** Windows packages are not automatically built but can be created manually using `npm run package:win`.
The automated packaging workflow only runs on stable releases (skips alpha/beta tags).
## ๐๏ธ Project Structure
```
training-generator/
โโโ src/ # Source code
โ โโโ main.js # Electron main process
โ โโโ preload.js # IPC bridge between main and renderer
โ โโโ bootstrap.js # Application bootstrap logic
โ โโโ renderer/
โ โ โโโ main.js # Frontend application logic
โ โโโ core/
โ โ โโโ fileParser.js # Multi-format document parser
โ โโโ styles/
โ โ โโโ main.css # Application styles
โ โโโ prompts/ # AI prompt templates (multiple languages)
โ โโโ workers/ # Web workers for background processing
โโโ assets/ # Application assets (icons, fonts)
โโโ native-splash/ # Native C++/WinAPI splash screen (Windows)
โโโ index.html # Main application window
โโโ vite.config.js # Vite build configuration
โโโ package.json # Project dependencies and scripts
โโโ README.md # This file
```
## โ๏ธ Configuration
The application provides extensive configuration options:
### Processing Settings
- **Model Selection**: Choose from available Ollama models
- **Chunk Size**: Adjust text segmentation (500-10000 characters)
- **Temperature**: Control AI creativity (0.0-1.0)
- **Output Format**: JSONL, ChatML, CSV, or Plain Text
- **Language**: Multiple output language options
### Application Preferences
- **Theme**: Auto, Light, or Dark mode
- **Window Behavior**: Remember size/position, start maximized
- **Auto-save**: Automatic preset saving
- **File Size Limits**: Configure maximum file size (10-1000MB)
### System Integration
- **Ollama Auto-detection**: Automatic connection to local Ollama instance
- **Progress Persistence**: Resume interrupted processing sessions
- **Export Location**: Remember last used export directory
## ๐ง Troubleshooting
### Common Issues & Solutions
#### ๐ซ Ollama Connection Issues
```bash
# Check if Ollama is running
curl http://localhost:11434/api/tags
# Start Ollama if not running
ollama serve
# Verify service status (Windows)
netstat -ano | findstr :11434
```
#### ๐ File Parsing Problems
- **Scanned PDFs**: Use OCR software first for image-based PDFs
- **Large Files**: Reduce chunk size or split files manually
- **Encoding Issues**: Convert files to UTF-8 text format first
#### โก Performance Optimization
- **GPU Acceleration**: Ensure Ollama is using GPU (check Ollama logs)
- **Memory Management**: Close other GPU-intensive applications
- **Chunk Size**: Adjust based on model context window (2000-4000 tokens optimal)
#### ๐ Application Errors
```bash
# Clear dependencies and rebuild
npm cache clean --force
npm ci
# Check Node.js version
node --version # Should be 18+
# Run in debug mode
npm run dev -- --debug
```
### Debug Mode
For advanced troubleshooting, enable debug logging:
```bash
npm run dev -- --debug
```
Logs are available in:
- **Windows**: Application console output
- **macOS/Linux**: `~/.config/Training Generator/logs/`
## ๐งช Testing
The project includes comprehensive test suites:
```bash
# Run all tests
npm test
# Individual test scripts
node test_language_prompts.js # Language prompt validation
node test-app.js # Basic application functionality
node test-complete.js # Complete system integration test
node test-ollama.js # Ollama connection and model testing
```
## ๐ฃ๏ธ Roadmap & Future Features
### Planned Enhancements
- **๐ Plugin System**: Extensible processing pipelines
- **๐ Cloud Integration**: Optional cloud model support (OpenAI, Anthropic, etc.)
- **๐ Advanced Analytics**: Processing statistics and quality metrics
- **๐ Batch Scheduling**: Automated processing queues
- **๐ Content Filtering**: Smart filtering of sensitive information
### In Development
- **๐งฉ Modular Architecture**: Plugin-based file parser system
- **๐ Performance Dashboard**: Real-time processing metrics
- **๐ API Server Mode**: REST API for headless operation
### Community Requests
- **๐๏ธ Folder Monitoring**: Watch folders for automatic processing
- **๐ฑ Mobile Companion**: Mobile app for remote monitoring
- **๐ Enterprise Features**: User management, audit logging, compliance tools
## ๐ค Contributing
We welcome contributions! Here's how you can help:
1. **Fork the repository**
2. **Create a feature branch**: `git checkout -b feature/amazing-feature`
3. **Make your changes**
4. **Run tests**: `npm test`
5. **Submit a pull request**
### Development Setup
```bash
# Install development dependencies
npm install
# Set up pre-commit hooks (if configured)
npm run prepare
# Start development server
npm run dev
```
### Code Style
- Use consistent formatting (Prettier configuration coming soon)
- Add comments for complex logic
- Update documentation for new features
- Include tests for new functionality
## ๐ License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## ๐ Support
- **Documentation**: [GitHub Wiki](https://github.com/richie-rich90454/training-generator/wiki)
- **Issue Tracker**: [Report Bugs](https://github.com/richie-rich90454/training-generator/issues)
- **Discussions**: [GitHub Discussions](https://github.com/richie-rich90454/training-generator/discussions)
- If this project helps you, please โญ **Star** the repo and share feedback via [Discussions](https://github.com/richie-rich90454/training-generator/discussions)!
## ๐ธ Screenshots
*Screenshot placeholders - add actual screenshots to the `screenshots/` directory*
## In Development
- ๐ข Modular Architecture
- ๐ก Performance Dashboard
- ๐ด API Server Mode
---
**๐ Privacy Note**: This application processes documents locally using your own Ollama instance. No data is sent to external servers unless you configure custom API endpoints.
**โก Performance Tip**: For best results, use GPU-accelerated Ollama models and ensure sufficient system memory for large documents.
**๐ Found a Bug?** Please report it on the [issue tracker](https://github.com/richie-rich90454/training-generator/issues) with detailed steps to reproduce.