https://github.com/tehw0lf/writing-style-analyzer

Analyze and profile writing styles in German and English text using local LLMs. Privacy-first, 100% local processing.
academic-writing german llm local-first nlp privacy python style-analysis transformers writing-analysis


Host: GitHub
URL: https://github.com/tehw0lf/writing-style-analyzer
Owner: tehw0lf
License: mit
Created: 2025-11-03T20:09:47.000Z (8 months ago)
Default Branch: main
Last Pushed: 2026-03-15T20:29:54.000Z (4 months ago)
Last Synced: 2026-03-16T08:11:18.467Z (4 months ago)
Topics: academic-writing, german, llm, local-first, nlp, privacy, python, style-analysis, transformers, writing-analysis
Language: Python
Size: 88.9 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Writing Style Analyzer

A local writing style analyzer that uses Large Language Models (LLMs) to analyze and profile writing styles in German and English text. This tool runs completely locally without external API calls.

## Features

- **Local LLM Integration**: Uses HuggingFace Transformers or llama.cpp with GGUF models
- **Multilingual Support**: Optimized for German and English text analysis
- **Comprehensive Analysis**:
  - Sentence and paragraph structure
  - Lexical diversity metrics
  - Language-specific features (German formality, compound words, etc.)
  - Common phrases and vocabulary patterns
  - Tone and formality detection
- **Dual-Format Output**: Generates both JSON (for analysis) and Markdown (for AI agents)
- **Profile Generation**: Creates detailed profiles for different writing contexts
- **No External Services**: After the initial model download, runs completely offline using local models
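
As an illustration of the lexical diversity metric listed above, the sketch below computes a type-token ratio (unique words divided by total words). This definition, the regex tokenizer, and the function name `lexical_diversity` are assumptions made for the example; the analyzer's own implementation may differ.

```python
import re


def lexical_diversity(text: str) -> float:
    """Type-token ratio: unique lowercase words / total words (assumed definition)."""
    # In Python 3, \w matches Unicode word characters, so umlauts and ß are kept.
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


sample = "Der Hund sieht den Hund und der Hund bellt"
# 9 tokens, 6 distinct words
print(round(lexical_diversity(sample), 2))
```

Values close to 1.0 indicate little repetition; short samples tend to score higher than long ones, so comparisons are only meaningful between texts of similar length.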

## Project Structure

```
writing-style-analyzer/
├── analyze.py                    # Main profile generation tool ⭐
├── german_academic_analyzer.py   # Universal German text analysis library ⭐⭐
├── pyproject.toml                # UV project configuration
├── config.yaml                   # Configuration file
├── texts/                        # Input directory for text samples
├── profiles/                     # Output directory for generated profiles
├── user-profiles/                # V2 validated profiles and documentation
│   ├── profiles/                 # Validated academic profiles (default, excellence)
│   ├── test-prompts/             # Test validation framework
│   ├── validate_test*.py         # Test validation scripts (⚠️ user-specific)
│   └── *.md                      # Comprehensive usage guides
├── SCRIPTS_README.md             # Guide to all analysis scripts ⭐
└── README.md                     # This file
```

**Key Files for Other Users:**
- `analyze.py` - Create your own writing profile ✅
- `german_academic_analyzer.py` - Universal German text analyzer ✅
- `user-profiles/validate_test*.py` - ⚠️ SKIP these (hardcoded to original author)

**See `SCRIPTS_README.md` for detailed explanation of each script!**

## Installation

### Prerequisites

- Python 3.10 or higher
- [UV package manager](https://github.com/astral-sh/uv)

### Install UV (if not already installed)

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

### Setup Project

```bash
# Navigate to project directory
cd writing-style-analyzer

# Create virtual environment and install dependencies
uv venv
uv sync

# Alternatively, install the project through uv's pip-compatible interface:
uv pip install -e .
```

### Model Setup

The analyzer uses HuggingFace models by default. On first run, the model will be downloaded automatically (~3-7GB depending on model choice).

**Recommended Models for German/English:**

1. **Qwen/Qwen2.5-3B-Instruct** (Default, excellent multilingual support)
2. **meta-llama/Llama-3.2-3B-Instruct** (Good multilingual performance)
3. **mistralai/Mistral-7B-Instruct-v0.2** (Larger, better quality, needs more resources)

Configure your preferred model in `config.yaml`:

```yaml
model:
  type: "transformers"
  name: "Qwen/Qwen2.5-3B-Instruct"
  device: "auto"  # auto-detects GPU/CPU
```

## Configuration

Edit `config.yaml` to customize:

- **Model settings**: Model type, name, device, parameters
- **Analysis settings**: Chunk size, languages, detail level
- **File processing**: Extensions, encoding, ignore patterns
- **Output settings**: JSON formatting, example inclusion

See the `config.yaml` file for detailed comments on all options.

## Usage

### Basic Usage

```bash
# Analyze blog posts
uv run analyze.py --input texts/blog --output profiles/blog-profile.json --profile-type blog

# Analyze social media content
uv run analyze.py --input texts/social --output profiles/social-profile.json --profile-type social

# Use custom config
uv run analyze.py --input texts/blog --output profiles/custom.json --config my-config.yaml
```

### Command-Line Options

```
Options:
  --input, -i         Input directory containing text files (required)
  --output, -o        Output path for profile JSON (required)
  --profile-type, -t  Profile type name (default: general)
  --config, -c        Path to config file (default: config.yaml)
  --help, -h          Show help message
```
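
For readers who want to script around the tool, the options above map naturally onto Python's `argparse`. The following is a minimal sketch of such a parser, not the actual source of `analyze.py`; the function name `build_parser` is hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="analyze.py", description="Generate a writing style profile"
    )
    parser.add_argument("--input", "-i", required=True,
                        help="Input directory containing text files")
    parser.add_argument("--output", "-o", required=True,
                        help="Output path for profile JSON")
    parser.add_argument("--profile-type", "-t", default="general",
                        help="Profile type name")
    parser.add_argument("--config", "-c", default="config.yaml",
                        help="Path to config file")
    # argparse adds --help / -h automatically.
    return parser


args = build_parser().parse_args(["-i", "texts/blog", "-o", "profiles/blog.json"])
print(args.profile_type, args.config)
```

When `--profile-type` and `--config` are omitted, the documented defaults (`general` and `config.yaml`) apply.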

### Example Workflow

1. **Collect your text samples**:
```bash
mkdir -p texts/blog
# Copy your writing samples (.txt, .md, .pdf, .docx, .odt)
```

2. **Run analysis**:
```bash
uv run analyze.py --input texts/blog --output profiles/my-blog.json --profile-type tech-blog
```

3. **Review the profile**:
```bash
cat profiles/my-blog.json
```

## Profile Output Format

The analyzer generates **two files** for each profile:

1. **JSON file** (`profile-name.json`): Complete analysis data, metrics, and metadata
2. **Markdown file** (`profile-name.md`): AI-friendly instructions for writing guidance
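
If you post-process profiles in a script, the Markdown path can be derived from the JSON path. This sketch assumes the two files share a base name and differ only in extension, as the naming above suggests; `profile_paths` is a hypothetical helper, not part of the tool.

```python
from pathlib import Path


def profile_paths(output: str) -> tuple[Path, Path]:
    """Return the JSON path and the assumed sibling Markdown path."""
    json_path = Path(output)
    return json_path, json_path.with_suffix(".md")


json_path, md_path = profile_paths("profiles/my-blog.json")
print(json_path, md_path)
```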

### JSON Profile Structure

The JSON profile contains the following structure:

```json
{
  "profile_name": "tech-blog",
  "created_at": "2025-10-26T12:34:56.789",
  "analyzed_files": 15,
  "primary_language": "de",
  "languages_detected": ["de", "en"],
  "metrics": {
    "avg_sentence_length": 18.5,
    "avg_paragraph_length": 3.2,
    "lexical_diversity": 0.73,
    "total_words": 5420,
    "total_sentences": 293
  },
  "style_characteristics": {
    "tone": "friendly-informative, conversational",
    "formality": "casual-professional",
    "typical_elements": [
      "Uses 'du' form (German informal you)",
      "Starts with questions or scenarios",
      "Short paragraphs (2-4 sentences)"
    ],
    "structural_patterns": [
      "Question-led openings",
      "Code examples embedded",
      "Summary conclusions"
    ]
  },
  "vocabulary": {
    "common_phrases": [
      "im grunde",
      "tatsächlich",
      "aber",
      "eigentlich"
    ],
    "characteristics": "Mix of German and English technical terms"
  },
  "german_features": {
    "formality": "informal (du-form)",
    "has_compound_words": true,
    "compound_word_examples": ["softwareentwicklung", "datenbankverbindung"],
    "uses_umlauts": true
  },
  "avoid": [
    "Marketing language",
    "Passive voice",
    "Overly formal structures"
  ]
}
```
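
Because the profile is plain JSON, it can be inspected with the standard library alone. The sketch below parses an excerpt of the sample profile and recomputes the average sentence length from the totals, assuming that `avg_sentence_length` means words per sentence.

```python
import json

# Excerpt of the sample profile shown above.
profile_text = """
{
  "profile_name": "tech-blog",
  "primary_language": "de",
  "metrics": {
    "avg_sentence_length": 18.5,
    "total_words": 5420,
    "total_sentences": 293
  }
}
"""

profile = json.loads(profile_text)
metrics = profile["metrics"]

# Assumption: average sentence length = total words / total sentences.
recomputed = metrics["total_words"] / metrics["total_sentences"]
print(profile["profile_name"], round(recomputed, 1))
```

For a real profile, replace the inline string with `json.loads(Path("profiles/my-blog.json").read_text(encoding="utf-8"))` after importing `Path` from `pathlib`.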

### Markdown Profile Format

The markdown file provides AI-friendly instructions:

```markdown
# Profile Name Writing Style Profile

## Quick Instructions
Write in this style using these characteristics:

### Voice & Structure
- **Passive voice:** 45%
- **Sentence length:** ~20 words average
- **Lexical diversity:** 0.35

### Transition Words
**Contrastive:**
- Use: jedoch, allerdings, dennoch
- **Target:** ~25 uses per document

### Style Signature
- **Tone:** Professional and technical
- **Formality:** Formal

### What to Avoid
- Colloquial language
- Personal opinions without evidence
```
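
To check a draft against a transition-word target like the one in the example profile, you can count the listed words yourself. This is a rough sketch using a plain regex tokenizer; the `CONTRASTIVE` tuple copies the three words from the example, and `count_transitions` is a hypothetical helper rather than part of the analyzer.

```python
import re
from collections import Counter

# Contrastive transitions taken from the example profile above.
CONTRASTIVE = ("jedoch", "allerdings", "dennoch")


def count_transitions(text: str, words=CONTRASTIVE) -> Counter:
    """Count case-insensitive occurrences of the given transition words."""
    tokens = re.findall(r"\w+", text.lower())
    wanted = set(words)
    return Counter(token for token in tokens if token in wanted)


text = (
    "Das Ergebnis ist jedoch unklar. Allerdings zeigt sich, "
    "dass es dennoch gilt. Jedoch nicht immer."
)
counts = count_transitions(text)
print(counts)
```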

## Using Profiles with AI Assistants

Once you've generated a profile, you can use it to guide AI assistants when writing new text.

### Method 1: Upload Profile as Project Knowledge (Recommended)

**Best for:** Regular use, convenience

1. Create a project in your AI platform (Claude Desktop, ChatGPT, etc.)
2. Upload the generated `.md` profile file as project knowledge
3. Reference it in your prompts

**Example:**
```
Write a 500-word paragraph about [TOPIC] using my writing style from the profile.
```

### Method 2: Paste Profile in Each Conversation

**Best for:** One-off use, testing different profiles

1. Open the generated `.md` profile file
2. Copy the entire content
3. Paste it into your AI conversation
4. Follow with your writing request

**Example:**
```
[Paste full profile content]

Based on this writing style profile, write about [TOPIC]...
```

### Method 3: Reference Specific Metrics

**Best for:** Fine-tuning specific aspects

Extract key metrics from your profile and reference them:
```
Write a paragraph with:
- Average sentence length: ~[X] words
- Passive voice ratio: ~[Y]%
- Use transitions from categories: [list]
```
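
The template can also be filled from the `metrics` block of a JSON profile. The sketch below assumes the key names from the sample JSON profile (`avg_sentence_length`, `lexical_diversity`); `build_prompt` is a hypothetical helper and the wording of the prompt is only a suggestion.

```python
def build_prompt(topic: str, metrics: dict) -> str:
    """Build a short writing prompt from selected profile metrics."""
    lines = [
        f"Write a paragraph about {topic} with:",
        f"- Average sentence length: ~{metrics['avg_sentence_length']} words",
        f"- Lexical diversity: ~{metrics['lexical_diversity']}",
    ]
    return "\n".join(lines)


prompt = build_prompt(
    "local LLMs",
    {"avg_sentence_length": 18.5, "lexical_diversity": 0.73},
)
print(prompt)
```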

### Integration Options

**MCP Memory** (if available):
Store profiles in memory for later retrieval

**File Attachment** (if available):
Attach the `.md` file directly to conversations

**See `user-profiles/` directory for example usage guides and validation results.**

## German Text Analysis Features

The analyzer includes specialized support for German language features:

### Formality Detection
- **du-form** (informal): du, dich, dir, dein
- **Sie-form** (formal): Sie, Ihnen, Ihr
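
A simple way to approximate this distinction is to count the pronouns listed above. The sketch below is a heuristic written for illustration, not the analyzer's actual logic: it matches formal pronouns only in their capitalized form and informal ones case-insensitively.

```python
import re

INFORMAL = {"du", "dich", "dir", "dein"}
FORMAL = {"Sie", "Ihnen", "Ihr"}  # capitalized forms only


def detect_formality(text: str) -> str:
    """Classify text as du-form or Sie-form by pronoun counts (heuristic)."""
    tokens = re.findall(r"\w+", text)
    informal = sum(1 for token in tokens if token.lower() in INFORMAL)
    formal = sum(1 for token in tokens if token in FORMAL)
    if informal > formal:
        return "informal (du-form)"
    if formal > informal:
        return "formal (Sie-form)"
    return "undetermined"


print(detect_formality("Kannst du mir dein Buch leihen?"))
print(detect_formality("Können Sie mir Ihr Buch leihen?"))
```

Note that a sentence-initial "Sie" meaning "she" or "they" is counted as formal here, and inflected forms such as "deine" or "Ihre" are ignored, so the result is only reliable on longer texts.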

### Compound Words
Detects long German compound words (e.g., "Softwareentwicklungsumgebung")
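
One rough approximation is to flag words above a length threshold. The sketch below assumes a threshold of 15 characters, which is an arbitrary choice for illustration; it does not split compounds into their parts, and the analyzer may use a different method.

```python
import re


def find_long_words(text: str, min_length: int = 15) -> list[str]:
    """Return distinct words with at least min_length letters, in order of appearance."""
    words = re.findall(r"[A-Za-zÄÖÜäöüß]+", text)
    found: list[str] = []
    for word in words:
        if len(word) >= min_length and word not in found:
            found.append(word)
    return found


print(find_long_words(
    "Die Softwareentwicklungsumgebung nutzt eine Datenbankverbindung."
))
```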

### Umlauts & Special Characters
Full UTF-8 support for ä, ö, ü, ß

### Sentence Structure
Adapts to typically longer German sentences compared to English

## Hardware Requirements

### Minimum
- **CPU**: Modern x86_64 processor
- **RAM**: 8GB (for 3B parameter models)
- **Storage**: 10GB free space

### Recommended
- **GPU**: NVIDIA GPU with 6GB+ VRAM (CUDA support)
- **RAM**: 16GB
- **Storage**: 20GB free space

### Performance Tips

1. **Use GPU acceleration** when available:
```yaml
model:
  device: "cuda"  # or "mps" for Apple Silicon
```

2. **Use smaller models** for faster analysis:
   - 3B models: Fast, good quality
   - 7B models: Slower, better quality

3. **Adjust chunk size** in config for memory constraints:
```yaml
analysis:
  chunk_size: 4000  # Reduce if running out of memory
```
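
To see why a smaller `chunk_size` lowers memory use, consider how a long text might be split before it reaches the model. The sketch below assumes `chunk_size` is measured in characters and that chunks prefer paragraph boundaries; both are assumptions, and `chunk_text` is a hypothetical function, not the analyzer's own.

```python
def chunk_text(text: str, chunk_size: int = 4000) -> list[str]:
    """Split text into chunks of at most chunk_size characters."""
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single paragraph longer than chunk_size is cut at fixed offsets.
        while len(paragraph) > chunk_size:
            chunks.append(paragraph[:chunk_size])
            paragraph = paragraph[chunk_size:]
        current = paragraph
    if current:
        chunks.append(current)
    return chunks


text = "\n\n".join(["a" * 30] * 5)
print([len(chunk) for chunk in chunk_text(text, chunk_size=70)])
```

Each chunk is processed separately, so halving `chunk_size` roughly halves the amount of text the model sees per call at the cost of more calls.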

## Troubleshooting

### Out of Memory Errors

**Symptoms**: Process killed or CUDA out of memory

**Solutions**:
1. Use CPU instead of GPU: `device: "cpu"` in config
2. Use smaller model (3B instead of 7B)
3. Reduce chunk_size in config
4. Close other applications

### Model Download Issues

**Symptoms**: Connection timeouts or download failures

**Solutions**:
1. Check internet connection
2. Use HuggingFace mirror if available
3. Manually download model and configure path
4. Try alternative model

### Language Detection Issues

**Symptoms**: Wrong language detected

**Solutions**:
1. Ensure text files have sufficient content (>100 words)
2. Check UTF-8 encoding is correct
3. Mixed-language texts may show "en" as primary if English dominates

### Slow Performance

**Symptoms**: Analysis takes very long

**Solutions**:
1. Enable GPU acceleration in config
2. Use smaller/faster model
3. Reduce number of input files
4. Increase chunk_size for batch processing

## Advanced Usage

### Using GGUF Models (llama.cpp)

For potentially better performance with quantized models:

1. **Install llama-cpp-python**:
```bash
uv pip install llama-cpp-python
```

2. **Download a GGUF model** (e.g., from HuggingFace)

3. **Configure**:
```yaml
model:
  type: "llama-cpp"
  path: "/path/to/model.gguf"
```

### Batch Processing Multiple Directories

```bash
#!/bin/bash
for dir in texts/*/; do
  profile_name=$(basename "$dir")
  uv run analyze.py --input "$dir" --output "profiles/${profile_name}.json" --profile-type "$profile_name"
done
```

### Custom Analysis Parameters

Create multiple config files for different use cases:

```bash
# Quick analysis (lower quality, faster)
uv run analyze.py --input texts/blog --output profiles/quick.json --config config-fast.yaml

# Detailed analysis (higher quality, slower)
uv run analyze.py --input texts/blog --output profiles/detailed.json --config config-detailed.yaml
```

## Development

### Testing

The project includes a comprehensive automated test suite with 49 tests covering profile validation, analysis functions, and regression testing.

```bash
# Run all tests
uv run pytest tests/

# Run with coverage
uv run pytest tests/ --cov=. --cov-report=term-missing

# Run specific test categories
uv run pytest tests/ -m profile # Profile validation
uv run pytest tests/ -m analysis # Analysis functions
uv run pytest tests/ -m regression # Regression tests
```

**Privacy-First Design:** All tests use synthetic data only. Your personal profiles and texts remain private (gitignored).

See [tests/README.md](tests/README.md) for complete test suite documentation.

### Project Dependencies

Core dependencies:
- `transformers`: HuggingFace model support
- `torch`: PyTorch for model inference
- `pyyaml`: Configuration file parsing
- `langdetect`: Language detection
- `tqdm`: Progress bars
- `pypdf`: PDF text extraction
- `python-docx`: Microsoft Word (.docx) text extraction
- `odfpy`: LibreOffice Writer (.odt) text extraction

Optional:
- `llama-cpp-python`: GGUF model support

Development:
- `pytest`: Testing framework
- `pytest-cov`: Coverage reporting
- `black`: Code formatting
- `ruff`: Linting

### Code Structure

- **TextProcessor**: Text analysis and metric calculation
- **LLMAnalyzer**: LLM integration and style analysis
- **WritingStyleAnalyzer**: Main orchestrator
- **Configuration**: YAML-based configuration management

## Model Recommendations

### For German + English (Bilingual)

| Model | Size | Quality | Speed | Notes |
|-------|------|---------|-------|-------|
| Qwen2.5-3B-Instruct | 3B | ⭐⭐⭐⭐ | ⚡⚡⚡ | Best balance, default |
| Llama-3.2-3B | 3B | ⭐⭐⭐ | ⚡⚡⚡ | Good alternative |
| Mistral-7B-Instruct | 7B | ⭐⭐⭐⭐⭐ | ⚡⚡ | Best quality, slower |

### For German Primary

Qwen2.5 series has excellent German support and is recommended for German-heavy content.

## License

This project is released under the MIT License. See the `LICENSE` file for details.

## Contributing

Contributions welcome! Areas for improvement:
- Additional language support
- Profile comparison tools
- Statistical validation metrics
- Web interface

## Support

For issues and questions:
1. Check this README and `config.yaml` comments
2. Review logs in `analyzer.log`
3. Check HuggingFace model documentation
4. Verify Python and dependency versions

## Example Profiles

The `user-profiles/` directory holds profile documentation and is gitignored for privacy. The layout below shows how to organize your personal writing style profiles and documentation.

### Profile Organization

**Project-level (this directory):**
- Tool documentation (README, QUICKSTART, CLAUDE.md)
- Example texts for testing
- Analyzer source code

**User-level (`user-profiles/` - gitignored):**
- Your analyzed writing style profiles
- Profile usage guides
- Comparison and test documentation

### Creating Your Own Profiles

The `user-profiles/` directory is where you'll store your generated profiles and documentation. This directory is **gitignored** to protect your privacy.

**To create your first profile:**

1. **Collect text samples** (10-20 files, 5000+ words total)
```bash
mkdir -p texts/my-writing
# Copy your .txt, .md, .pdf, .docx files here
```

2. **Run the analyzer:**
```bash
uv run analyze.py --input texts/my-writing --output profiles/my-style.json --profile-type my-style
```

3. **Review the output:**
   - `profiles/my-style.json` - Complete analysis data
   - `profiles/my-style.md` - AI-friendly profile for guidance

4. **Use with AI assistants:**
   - Upload the `.md` file to your AI platform
   - Reference it when asking for text generation

**Profile Organization:**

We recommend creating a `user-profiles/` directory structure:
```
user-profiles/
├── profiles/ # Your generated profiles
│ ├── academic.json
│ ├── academic.md
│ ├── blog.json
│ └── blog.md
└── README.md # Your personal usage notes
```

**Example profiles are available in the repository's issue tracker for reference, but your profiles will be unique to your writing style.**

## Changelog

### v1.0.0 (2025-11-03) - Initial Public Release 🎉
- **First stable release:** Production-ready writing style analyzer
- **MIT License:** Open source and freely usable
- **GitHub Actions CI/CD:** Automated testing, linting, formatting, and releases
  - Uses reusable workflows for consistent builds
  - Automated version extraction and release creation
  - Comprehensive test suite (49 tests, 95% coverage, <1s runtime)
- **Repository publishing:** Public GitHub repository with comprehensive documentation
  - 10 relevant topics/tags for discoverability
  - Automated wheel and source distribution builds
  - Professional release notes and changelog
- **Code quality improvements:**
  - Fixed all linting issues (modern Python type hints)
  - Consistent code formatting with black
  - Clean codebase ready for contributions
- **Privacy-first design:** All personal data gitignored by default
- **Documentation:** Complete setup guides, QUICKSTART, and developer documentation

### v0.5.0 (2025-10-27)
- **Hybrid Pattern Discovery System:** Major upgrade to profile generation
  - Combines authoritative patterns from Duden/academic style guides with LLM-discovered patterns
  - Generates profiles with 3-4x more linguistic patterns than basic analysis
  - New transition categories: conditional, clarifying, concessive
  - Improved passive voice accuracy and argumentation detection
- **Dual-Format Output:** Profiles now generated in both JSON and Markdown
  - JSON for analysis and metrics
  - Markdown for AI assistant integration
- **Comprehensive Documentation:** Profile usage guides and validation framework
  - Profile creation guide
  - AI integration best practices
  - Validation test framework

### v0.4.0 (2025-10-27)
- **Documentation restructuring:** Separated project and user-specific documentation
  - Created `user-profiles/` directory for personal profiles (gitignored)
  - Moved profile-specific documentation to `user-profiles/`
  - Updated .gitignore to protect user privacy
- **Profile management improvements:** Simplified profile organization
  - Clearer naming conventions for generated profiles
  - Profile archiving and versioning support
  - Validation framework for testing profile quality
- **Empirical validation:** Testing framework confirms profile quality and distinctiveness

### v0.3.0 (2025-10-27)
- Added LaTeX (.tex) file support with pylatexenc
- Replaced pypdf with pdfplumber for better PDF text extraction
- Added comprehensive linguistic analysis:
  - Voice analysis (passive vs active)
  - Transition word analysis (5 categories)
  - Sentence complexity metrics
  - Rhetorical device detection
- Improved content filtering (code/formula/reference detection)
- Enhanced phrase extraction with stopword filtering
- Robust JSON parsing with retry logic
- Created pre-analyzed academic profiles with detailed documentation

### v0.2.0 (2025-10-26)
- Improved German language support
- Added profile merging capabilities
- Enhanced error handling

### v0.1.0 (2025-10-26)
- Initial release
- German and English support
- HuggingFace Transformers integration
- Basic profile generation
- JSON output format
