{"id":48356371,"url":"https://github.com/tehw0lf/writing-style-analyzer","last_synced_at":"2026-04-05T11:33:07.823Z","repository":{"id":322322662,"uuid":"1089045783","full_name":"tehw0lf/writing-style-analyzer","owner":"tehw0lf","description":"Analyze and profile writing styles in German and English text using local LLMs. Privacy-first, 100% local processing.","archived":false,"fork":false,"pushed_at":"2026-03-15T20:29:54.000Z","size":91,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-16T08:11:18.467Z","etag":null,"topics":["academic-writing","german","llm","local-first","nlp","privacy","python","style-analysis","transformers","writing-analysis"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tehw0lf.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-11-03T20:09:47.000Z","updated_at":"2026-03-15T20:26:28.000Z","dependencies_parsed_at":"2025-12-30T06:03:29.246Z","dependency_job_id":null,"html_url":"https://github.com/tehw0lf/writing-style-analyzer","commit_stats":null,"previous_names":["tehw0lf/writing-style-analyzer"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/tehw0lf/writing-style-analyzer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tehw0lf%2Fwriting-style-analyzer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/repositories/tehw0lf%2Fwriting-style-analyzer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tehw0lf%2Fwriting-style-analyzer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tehw0lf%2Fwriting-style-analyzer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tehw0lf","download_url":"https://codeload.github.com/tehw0lf/writing-style-analyzer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tehw0lf%2Fwriting-style-analyzer/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31434624,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-05T08:13:15.228Z","status":"ssl_error","status_checked_at":"2026-04-05T08:13:11.839Z","response_time":75,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["academic-writing","german","llm","local-first","nlp","privacy","python","style-analysis","transformers","writing-analysis"],"created_at":"2026-04-05T11:33:07.149Z","updated_at":"2026-04-05T11:33:07.808Z","avatar_url":"https://github.com/tehw0lf.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Writing Style Analyzer\n\nA local writing style analyzer that uses Large Language Models (LLMs) to analyze and profile writing styles in German and English text. 
This tool runs completely locally without external API calls.\n\n## Features\n\n- **Local LLM Integration**: Uses HuggingFace Transformers or llama.cpp with GGUF models\n- **Multilingual Support**: Optimized for German and English text analysis\n- **Comprehensive Analysis**:\n  - Sentence and paragraph structure\n  - Lexical diversity metrics\n  - Language-specific features (German formality, compound words, etc.)\n  - Common phrases and vocabulary patterns\n  - Tone and formality detection\n- **Dual-Format Output**: Generates both JSON (for analysis) and Markdown (for AI agents)\n- **Profile Generation**: Creates detailed profiles for different writing contexts\n- **No External Dependencies**: Runs completely offline using local models\n\n## Project Structure\n\n```\nwriting-style-analyzer/\n├── analyze.py                      # Main profile generation tool ⭐\n├── german_academic_analyzer.py     # Universal German text analysis library ⭐⭐\n├── pyproject.toml                  # UV project configuration\n├── config.yaml                     # Configuration file\n├── texts/                          # Input directory for text samples\n├── profiles/                       # Output directory for generated profiles\n├── user-profiles/                  # V2 validated profiles and documentation\n│   ├── profiles/                   # Validated academic profiles (default, excellence)\n│   ├── test-prompts/               # Test validation framework\n│   ├── validate_test*.py           # Test validation scripts (⚠️ user-specific)\n│   └── *.md                        # Comprehensive usage guides\n├── SCRIPTS_README.md               # Guide to all analysis scripts ⭐\n└── README.md                       # This file\n```\n\n**Key Files for Other Users:**\n- `analyze.py` - Create your own writing profile ✅\n- `german_academic_analyzer.py` - Universal German text analyzer ✅\n- `user-profiles/validate_test*.py` - ⚠️ SKIP these (hardcoded to original author)\n\n**See `SCRIPTS_README.md` 
for detailed explanation of each script!**\n\n## Installation\n\n### Prerequisites\n\n- Python 3.10 or higher\n- [UV package manager](https://github.com/astral-sh/uv)\n\n### Install UV (if not already installed)\n\n```bash\ncurl -LsSf https://astral.sh/uv/install.sh | sh\n```\n\n### Setup Project\n\n```bash\n# Navigate to project directory\ncd writing-style-analyzer\n\n# Create virtual environment and install dependencies\nuv venv\nuv sync\n\n# Or if using pip:\nuv pip install -e .\n```\n\n### Model Setup\n\nThe analyzer uses HuggingFace models by default. On first run, the model will be downloaded automatically (~3-7GB depending on model choice).\n\n**Recommended Models for German/English:**\n\n1. **Qwen/Qwen2.5-3B-Instruct** (Default, excellent multilingual support)\n2. **meta-llama/Llama-3.2-3B-Instruct** (Good multilingual performance)\n3. **mistralai/Mistral-7B-Instruct-v0.2** (Larger, better quality, needs more resources)\n\nConfigure your preferred model in `config.yaml`:\n\n```yaml\nmodel:\n  type: \"transformers\"\n  name: \"Qwen/Qwen2.5-3B-Instruct\"\n  device: \"auto\"  # auto-detects GPU/CPU\n```\n\n## Configuration\n\nEdit `config.yaml` to customize:\n\n- **Model settings**: Model type, name, device, parameters\n- **Analysis settings**: Chunk size, languages, detail level\n- **File processing**: Extensions, encoding, ignore patterns\n- **Output settings**: JSON formatting, example inclusion\n\nSee the `config.yaml` file for detailed comments on all options.\n\n## Usage\n\n### Basic Usage\n\n```bash\n# Analyze blog posts\nuv run analyze.py --input texts/blog --output profiles/blog-profile.json --profile-type blog\n\n# Analyze social media content\nuv run analyze.py --input texts/social --output profiles/social-profile.json --profile-type social\n\n# Use custom config\nuv run analyze.py --input texts/blog --output profiles/custom.json --config my-config.yaml\n```\n\n### Command-Line Options\n\n```\nOptions:\n  --input, -i       Input directory containing 
text files (required)\n  --output, -o      Output path for profile JSON (required)\n  --profile-type, -t Profile type name (default: general)\n  --config, -c      Path to config file (default: config.yaml)\n  --help, -h        Show help message\n```\n\n### Example Workflow\n\n1. **Collect your text samples**:\n   ```bash\n   mkdir -p texts/blog\n   # Copy your writing samples (.txt, .md, .pdf, .docx, .odt)\n   ```\n\n2. **Run analysis**:\n   ```bash\n   uv run analyze.py --input texts/blog --output profiles/my-blog.json --profile-type tech-blog\n   ```\n\n3. **Review the profile**:\n   ```bash\n   cat profiles/my-blog.json\n   ```\n\n## Profile Output Format\n\nThe analyzer generates **two files** for each profile:\n\n1. **JSON file** (`profile-name.json`): Complete analysis data, metrics, and metadata\n2. **Markdown file** (`profile-name.md`): AI-friendly instructions for writing guidance\n\n### JSON Profile Structure\n\nThe JSON profile contains the following structure:\n\n```json\n{\n  \"profile_name\": \"tech-blog\",\n  \"created_at\": \"2025-10-26T12:34:56.789\",\n  \"analyzed_files\": 15,\n  \"primary_language\": \"de\",\n  \"languages_detected\": [\"de\", \"en\"],\n  \"metrics\": {\n    \"avg_sentence_length\": 18.5,\n    \"avg_paragraph_length\": 3.2,\n    \"lexical_diversity\": 0.73,\n    \"total_words\": 5420,\n    \"total_sentences\": 293\n  },\n  \"style_characteristics\": {\n    \"tone\": \"friendly-informative, conversational\",\n    \"formality\": \"casual-professional\",\n    \"typical_elements\": [\n      \"Uses 'du' form (German informal you)\",\n      \"Starts with questions or scenarios\",\n      \"Short paragraphs (2-4 sentences)\"\n    ],\n    \"structural_patterns\": [\n      \"Question-led openings\",\n      \"Code examples embedded\",\n      \"Summary conclusions\"\n    ]\n  },\n  \"vocabulary\": {\n    \"common_phrases\": [\n      \"im grunde\",\n      \"tatsächlich\",\n      \"aber\",\n      \"eigentlich\"\n    ],\n    
\"characteristics\": \"Mix of German and English technical terms\"\n  },\n  \"german_features\": {\n    \"formality\": \"informal (du-form)\",\n    \"has_compound_words\": true,\n    \"compound_word_examples\": [\"softwareentwicklung\", \"datenbankverbindung\"],\n    \"uses_umlauts\": true\n  },\n  \"avoid\": [\n    \"Marketing language\",\n    \"Passive voice\",\n    \"Overly formal structures\"\n  ]\n}\n```\n\n### Markdown Profile Format\n\nThe markdown file provides AI-friendly instructions:\n\n```markdown\n# Profile Name Writing Style Profile\n\n## Quick Instructions\nWrite in this style using these characteristics:\n\n### Voice \u0026 Structure\n- **Passive voice:** 45%\n- **Sentence length:** ~20 words average\n- **Lexical diversity:** 0.35\n\n### Transition Words\n**Contrastive:**\n- Use: jedoch, allerdings, dennoch\n- **Target:** ~25 uses per document\n\n### Style Signature\n- **Tone:** Professional and technical\n- **Formality:** Formal\n\n### What to Avoid\n- Colloquial language\n- Personal opinions without evidence\n```\n\n## Using Profiles with AI Assistants\n\nOnce you've generated a profile, you can use it to guide AI assistants when writing new text.\n\n### Method 1: Upload Profile as Project Knowledge (Recommended)\n\n**Best for:** Regular use, convenience\n\n1. Create a project in your AI platform (Claude Desktop, ChatGPT, etc.)\n2. Upload the generated `.md` profile file as project knowledge\n3. Reference it in your prompts\n\n**Example:**\n```\nWrite a 500-word paragraph about [TOPIC] using my writing style from the profile.\n```\n\n### Method 2: Paste Profile in Each Conversation\n\n**Best for:** One-off use, testing different profiles\n\n1. Open the generated `.md` profile file\n2. Copy the entire content\n3. Paste it into your AI conversation\n4. 
Follow with your writing request\n\n**Example:**\n```\n[Paste full profile content]\n\nBased on this writing style profile, write about [TOPIC]...\n```\n\n### Method 3: Reference Specific Metrics\n\n**Best for:** Fine-tuning specific aspects\n\nExtract key metrics from your profile and reference them:\n```\nWrite a paragraph with:\n- Average sentence length: ~[X] words\n- Passive voice ratio: ~[Y]%\n- Use transitions from categories: [list]\n```\n\n### Integration Options\n\n**MCP Memory** (if available):\nStore profiles in memory for later retrieval\n\n**File Attachment** (if available):\nAttach the `.md` file directly to conversations\n\n**See `user-profiles/` directory for example usage guides and validation results.**\n\n## German Text Analysis Features\n\nThe analyzer includes specialized support for German language features:\n\n### Formality Detection\n- **du-form** (informal): du, dich, dir, dein\n- **Sie-form** (formal): Sie, Ihnen, Ihr\n\n### Compound Words\nDetects long German compound words (e.g., \"Softwareentwicklungsumgebung\")\n\n### Umlauts \u0026 Special Characters\nFull UTF-8 support for ä, ö, ü, ß\n\n### Sentence Structure\nAdapts to typically longer German sentences compared to English\n\n## Hardware Requirements\n\n### Minimum\n- **CPU**: Modern x86_64 processor\n- **RAM**: 8GB (for 3B parameter models)\n- **Storage**: 10GB free space\n\n### Recommended\n- **GPU**: NVIDIA GPU with 6GB+ VRAM (CUDA support)\n- **RAM**: 16GB\n- **Storage**: 20GB free space\n\n### Performance Tips\n\n1. **Use GPU acceleration** when available:\n   ```yaml\n   model:\n     device: \"cuda\"  # or \"mps\" for Apple Silicon\n   ```\n\n2. **Use smaller models** for faster analysis:\n   - 3B models: Fast, good quality\n   - 7B models: Slower, better quality\n\n3. 
**Adjust chunk size** in config for memory constraints:\n   ```yaml\n   analysis:\n     chunk_size: 4000  # Reduce if running out of memory\n   ```\n\n## Troubleshooting\n\n### Out of Memory Errors\n\n**Symptoms**: Process killed or CUDA out of memory\n\n**Solutions**:\n1. Use CPU instead of GPU: `device: \"cpu\"` in config\n2. Use smaller model (3B instead of 7B)\n3. Reduce chunk_size in config\n4. Close other applications\n\n### Model Download Issues\n\n**Symptoms**: Connection timeouts or download failures\n\n**Solutions**:\n1. Check internet connection\n2. Use HuggingFace mirror if available\n3. Manually download model and configure path\n4. Try alternative model\n\n### Language Detection Issues\n\n**Symptoms**: Wrong language detected\n\n**Solutions**:\n1. Ensure text files have sufficient content (\u003e100 words)\n2. Check UTF-8 encoding is correct\n3. Mixed-language texts may show \"en\" as primary if English dominates\n\n### Slow Performance\n\n**Symptoms**: Analysis takes very long\n\n**Solutions**:\n1. Enable GPU acceleration in config\n2. Use smaller/faster model\n3. Reduce number of input files\n4. Increase chunk_size for batch processing\n\n## Advanced Usage\n\n### Using GGUF Models (llama.cpp)\n\nFor potentially better performance with quantized models:\n\n1. **Install llama-cpp-python**:\n   ```bash\n   uv pip install llama-cpp-python\n   ```\n\n2. **Download a GGUF model** (e.g., from HuggingFace)\n\n3. 
**Configure**:\n   ```yaml\n   model:\n     type: \"llama-cpp\"\n     path: \"/path/to/model.gguf\"\n   ```\n\n### Batch Processing Multiple Directories\n\n```bash\n#!/bin/bash\nfor dir in texts/*/; do\n  profile_name=$(basename \"$dir\")\n  uv run analyze.py --input \"$dir\" --output \"profiles/${profile_name}.json\" --profile-type \"$profile_name\"\ndone\n```\n\n### Custom Analysis Parameters\n\nCreate multiple config files for different use cases:\n\n```bash\n# Quick analysis (lower quality, faster)\nuv run analyze.py --input texts/blog --output profiles/quick.json --config config-fast.yaml\n\n# Detailed analysis (higher quality, slower)\nuv run analyze.py --input texts/blog --output profiles/detailed.json --config config-detailed.yaml\n```\n\n## Development\n\n### Testing\n\nThe project includes a comprehensive automated test suite with 49 tests covering profile validation, analysis functions, and regression testing.\n\n```bash\n# Run all tests\nuv run pytest tests/\n\n# Run with coverage\nuv run pytest tests/ --cov=. --cov-report=term-missing\n\n# Run specific test categories\nuv run pytest tests/ -m profile      # Profile validation\nuv run pytest tests/ -m analysis     # Analysis functions\nuv run pytest tests/ -m regression   # Regression tests\n```\n\n**Privacy-First Design:** All tests use synthetic data only. 
Your personal profiles and texts remain private (gitignored).\n\nSee [tests/README.md](tests/README.md) for complete test suite documentation.\n\n### Project Dependencies\n\nCore dependencies:\n- `transformers`: HuggingFace model support\n- `torch`: PyTorch for model inference\n- `pyyaml`: Configuration file parsing\n- `langdetect`: Language detection\n- `tqdm`: Progress bars\n- `pypdf`: PDF text extraction\n- `python-docx`: Microsoft Word (.docx) text extraction\n- `odfpy`: LibreOffice Writer (.odt) text extraction\n\nOptional:\n- `llama-cpp-python`: GGUF model support\n\nDevelopment:\n- `pytest`: Testing framework\n- `pytest-cov`: Coverage reporting\n- `black`: Code formatting\n- `ruff`: Linting\n\n### Code Structure\n\n- **TextProcessor**: Text analysis and metric calculation\n- **LLMAnalyzer**: LLM integration and style analysis\n- **WritingStyleAnalyzer**: Main orchestrator\n- **Configuration**: YAML-based configuration management\n\n## Model Recommendations\n\n### For German + English (Bilingual)\n\n| Model | Size | Quality | Speed | Notes |\n|-------|------|---------|-------|-------|\n| Qwen2.5-3B-Instruct | 3B | ⭐⭐⭐⭐ | ⚡⚡⚡ | Best balance, default |\n| Llama-3.2-3B | 3B | ⭐⭐⭐ | ⚡⚡⚡ | Good alternative |\n| Mistral-7B-Instruct | 7B | ⭐⭐⭐⭐⭐ | ⚡⚡ | Best quality, slower |\n\n### For German Primary\n\nThe Qwen2.5 series has excellent German support and is recommended for German-heavy content.\n\n## License\n\nThis project is released under the MIT License. See the `LICENSE` file for details.\n\n## Contributing\n\nContributions are welcome! Areas for improvement:\n- Additional language support\n- Profile comparison tools\n- Statistical validation metrics\n- Web interface\n\n## Support\n\nFor issues and questions:\n1. Check this README and `config.yaml` comments\n2. Review logs in `analyzer.log`\n3. Check HuggingFace model documentation\n4. 
Verify Python and dependency versions\n\n## Example Profiles\n\nThis repository includes example profile documentation in the `user-profiles/` directory (gitignored for privacy). This shows how to organize your personal writing style profiles and documentation.\n\n### Profile Organization\n\n**Project-level (this directory):**\n- Tool documentation (README, QUICKSTART, CLAUDE.md)\n- Example texts for testing\n- Analyzer source code\n\n**User-level (`user-profiles/` - gitignored):**\n- Your analyzed writing style profiles\n- Profile usage guides\n- Comparison and test documentation\n\n### Creating Your Own Profiles\n\nThe `user-profiles/` directory is where you'll store your generated profiles and documentation. This directory is **gitignored** to protect your privacy.\n\n**To create your first profile:**\n\n1. **Collect text samples** (10-20 files, 5000+ words total)\n   ```bash\n   mkdir -p texts/my-writing\n   # Copy your .txt, .md, .pdf, .docx files here\n   ```\n\n2. **Run the analyzer:**\n   ```bash\n   uv run analyze.py --input texts/my-writing --output profiles/my-style.json --profile-type my-style\n   ```\n\n3. **Review the output:**\n   - `profiles/my-style.json` - Complete analysis data\n   - `profiles/my-style.md` - AI-friendly profile for guidance\n\n4. 
**Use with AI assistants:**\n   - Upload the `.md` file to your AI platform\n   - Reference it when asking for text generation\n\n**Profile Organization:**\n\nWe recommend creating a `user-profiles/` directory structure:\n```\nuser-profiles/\n├── profiles/           # Your generated profiles\n│   ├── academic.json\n│   ├── academic.md\n│   ├── blog.json\n│   └── blog.md\n└── README.md          # Your personal usage notes\n```\n\n**Example profiles are available in the repository's issue tracker for reference, but your profiles will be unique to your writing style.**\n\n## Changelog\n\n### v1.0.0 (2025-11-03) - Initial Public Release 🎉\n- **First stable release:** Production-ready writing style analyzer\n- **MIT License:** Open source and freely usable\n- **GitHub Actions CI/CD:** Automated testing, linting, formatting, and releases\n  - Uses reusable workflows for consistent builds\n  - Automated version extraction and release creation\n  - Comprehensive test suite (49 tests, 95% coverage, \u003c1s runtime)\n- **Repository publishing:** Public GitHub repository with comprehensive documentation\n  - 10 relevant topics/tags for discoverability\n  - Automated wheel and source distribution builds\n  - Professional release notes and changelog\n- **Code quality improvements:**\n  - Fixed all linting issues (modern Python type hints)\n  - Consistent code formatting with black\n  - Clean codebase ready for contributions\n- **Privacy-first design:** All personal data gitignored by default\n- **Documentation:** Complete setup guides, QUICKSTART, and developer documentation\n\n### v0.5.0 (2025-10-27)\n- **Hybrid Pattern Discovery System:** Major upgrade to profile generation\n  - Combines authoritative patterns from Duden/academic style guides with LLM-discovered patterns\n  - Generates profiles with 3-4x more linguistic patterns than basic analysis\n  - New transition categories: conditional, clarifying, concessive\n  - Improved passive voice accuracy and argumentation 
detection\n- **Dual-Format Output:** Profiles now generated in both JSON and Markdown\n  - JSON for analysis and metrics\n  - Markdown for AI assistant integration\n- **Comprehensive Documentation:** Profile usage guides and validation framework\n  - Profile creation guide\n  - AI integration best practices\n  - Validation test framework\n\n### v0.4.0 (2025-10-27)\n- **Documentation restructuring:** Separated project and user-specific documentation\n  - Created `user-profiles/` directory for personal profiles (gitignored)\n  - Moved profile-specific documentation to `user-profiles/`\n  - Updated .gitignore to protect user privacy\n- **Profile management improvements:** Simplified profile organization\n  - Clearer naming conventions for generated profiles\n  - Profile archiving and versioning support\n  - Validation framework for testing profile quality\n- **Empirical validation:** Testing framework confirms profile quality and distinctiveness\n\n### v0.3.0 (2025-10-27)\n- Added LaTeX (.tex) file support with pylatexenc\n- Replaced pypdf with pdfplumber for better PDF text extraction\n- Added comprehensive linguistic analysis:\n  - Voice analysis (passive vs active)\n  - Transition word analysis (5 categories)\n  - Sentence complexity metrics\n  - Rhetorical device detection\n- Improved content filtering (code/formula/reference detection)\n- Enhanced phrase extraction with stopword filtering\n- Robust JSON parsing with retry logic\n- Created pre-analyzed academic profiles with detailed documentation\n\n### v0.2.0 (2025-10-26)\n- Improved German language support\n- Added profile merging capabilities\n- Enhanced error handling\n\n### v0.1.0 (2025-10-26)\n- Initial release\n- German and English support\n- HuggingFace Transformers integration\n- Basic profile generation\n- JSON output 
format\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftehw0lf%2Fwriting-style-analyzer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftehw0lf%2Fwriting-style-analyzer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftehw0lf%2Fwriting-style-analyzer/lists"}