{"id":31578359,"url":"https://github.com/giarcheuli/docparser","last_synced_at":"2025-10-14T07:42:27.169Z","repository":{"id":317448030,"uuid":"1067456655","full_name":"giarcheuli/docparser","owner":"giarcheuli","description":"DocParser v2.0 - Project-Aware Document Analysis Tool with AI Integration","archived":false,"fork":false,"pushed_at":"2025-09-30T22:46:52.000Z","size":48,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-01T00:22:38.786Z","etag":null,"topics":["ai","cli","document-analysis","llama2","markdown","project-management","python","replicate"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/giarcheuli.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":"ROADMAP.md","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-30T22:09:16.000Z","updated_at":"2025-09-30T22:46:56.000Z","dependencies_parsed_at":"2025-10-01T00:22:50.561Z","dependency_job_id":"2c58a93c-f4dd-42ac-bbc3-3633b7a0206a","html_url":"https://github.com/giarcheuli/docparser","commit_stats":null,"previous_names":["giarcheuli/docparser"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/giarcheuli/docparser","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/giarcheuli%2Fdocparser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/giarcheuli%2Fdocparser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/
hosts/GitHub/repositories/giarcheuli%2Fdocparser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/giarcheuli%2Fdocparser/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/giarcheuli","download_url":"https://codeload.github.com/giarcheuli/docparser/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/giarcheuli%2Fdocparser/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278510917,"owners_count":25998997,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-05T02:00:06.059Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","cli","document-analysis","llama2","markdown","project-management","python","replicate"],"created_at":"2025-10-05T19:57:08.571Z","updated_at":"2025-10-05T19:57:10.196Z","avatar_url":"https://github.com/giarcheuli.png","language":"Python","readme":"# DocGuru v2.0 - Project-Aware Document Analysis Tool\n\nA comprehensive command-line document analysis tool for macOS that provides **project-aware analysis** of directories containing various document formats and generates structured markdown reports with AI-powered 
insights.\n\n![Python](https://img.shields.io/badge/python-v3.8+-blue.svg)\n![Platform](https://img.shields.io/badge/platform-macOS-lightgrey.svg)\n![License](https://img.shields.io/badge/license-MIT-green.svg)\n![CLI](https://img.shields.io/badge/interface-CLI-orange.svg)\n\n## ✨ Key Features\n\n### **Project-Aware Analysis**\n- **Automatic Project Detection**: Identifies projects based on level-2 directory structure\n- **Hierarchical Organization**: Understands document relationships within project contexts\n- **Cross-Project Analysis**: Identifies patterns and relationships between projects\n\n### 📊 **Comprehensive Reporting**\n- **4 Report Types**: Comprehensive, Overview, Individual Project, and Cross-Project Analysis\n- **Session-Based Organization**: Reports saved in timestamped session folders\n- **Structured Output**: `Reports/{directory}_{timestamp}/` organization\n\n### 🤖 **AI Integration**\n- **Replicate API**: Integration with Meta Llama-2-7b-chat model\n- **Content Summarization**: AI-powered document insights and analysis\n- **Project Context**: AI analysis considers project hierarchy and relationships\n\n### 📂 **Multi-Format Support**\n- **Documents**: DOC, DOCX, PDF, TXT, HTML, Markdown\n- **Spreadsheets**: XLSX, XLS  \n- **Structured Data**: XML files\n- **Metadata Extraction**: Comprehensive file analysis\n\n## 🚀 Quick Start\n\n### Installation\n\n```bash\n# Clone the repository\ngit clone https://github.com/giarcheuli/docparser.git\ncd docparser\n\n# Install dependencies\npip3 install -r requirements.txt\n\n# Set up AI integration (optional)\nexport REPLICATE_API_TOKEN=\"your-replicate-api-key\"\n\n# Run analysis\npython3 docguru.py /path/to/documents --ai\n```\n\n### Basic Usage\n\n```bash\n# Analyze a directory with project detection\npython3 docguru.py documents/\n\n# With AI-powered insights\npython3 docguru.py documents/ --ai\n\n# Verbose output for debugging\npython3 docguru.py documents/ --verbose\n\n# Just list supported files\npython3 
docguru.py documents/ --list-only\n```\n\n## 📋 Project Structure \u0026 Specifications\n\n### **Project Detection Logic**\nDocGuru v2.0 uses **level-2 directory detection** for project identification:\n\n```\nDocuments/\n├── Project_A/           # ← Level 2: Detected as Project\n│   ├── subfolder1/      # ← Level 3: Part of Project_A\n│   └── subfolder2/      # ← Level 3: Part of Project_A\n├── Project_B/           # ← Level 2: Detected as Project\n│   └── docs/            # ← Level 3: Part of Project_B\n└── standalone_file.pdf  # ← Level 1: Not in a project\n```\n\n### **Report Types Generated**\n\n1. **Comprehensive Report**: Full analysis of all documents organized by project\n2. **Overview Report**: Executive summary across all projects \n3. **Individual Project Reports**: Dedicated analysis for each detected project\n4. **Cross-Project Analysis**: Relationships and patterns between projects\n\n### **Session Organization**\n```\nReports/\n└── {directory_name}_{dd}_{mm}_{yy}_{hh}_{mm}/\n    ├── {dir}_COMPREHENSIVE_AI_{timestamp}.md\n    ├── {dir}_OVERVIEW_AI_{timestamp}.md  \n    ├── {dir}_{project1}_PROJECT_{timestamp}.md\n    ├── {dir}_{project2}_PROJECT_{timestamp}.md\n    └── {dir}_CROSS_PROJECT_ANALYSIS_{timestamp}.md\n```\n\n## 💻 Installation \u0026 Setup\n\n### Prerequisites\n- Python 3.8 or higher\n- macOS 10.14 or later\n\n### Quick Setup\n```bash\n# Clone repository\ngit clone https://github.com/giarcheuli/docparser.git\ncd docparser\n\n# Install dependencies\npip3 install -r requirements.txt\n\n# Test installation\npython3 docguru.py --help\n```\n\n### AI Integration Setup\nTo enable AI-powered analysis with Replicate:\n\n```bash\n# Set up Replicate API token\nexport REPLICATE_API_TOKEN=\"your-replicate-api-key\"\n\n# Verify setup\npython3 docguru.py documents/ --ai --verbose\n```\n\n**Note**: DocGuru v2.0 uses Replicate's Meta Llama-2-7b-chat model for AI analysis.\n\n## 📖 Usage Examples\n\n### Basic Project Analysis\n```bash\n# Analyze directory with 
project detection\npython3 docguru.py ~/Documents/Projects\n\n# Generate all 4 report types\npython3 docguru.py ~/Technical_Documentation --ai\n\n# Verbose output for troubleshooting\npython3 docguru.py ~/Documents --verbose\n```\n\n### Real-World Examples\n```bash\n# Analyze Confluence export with AI insights\npython3 docguru.py ~/Confluence_Export --ai\n\n# Quick project overview without AI\npython3 docguru.py ~/Client_Projects --no-summary\n\n# List files in complex directory structure\npython3 docguru.py ~/Multi_Project_Folder --list-only\n```\n\n### Command Line Options\n```\nusage: docguru.py [-h] [--ai] [--verbose] [--analysis-mode {qualitative,quantitative}] \n                    [--no-summary] [--list-only] directory\n\nDocGuru v2.0 - Project-Aware Document Analysis Tool\n\npositional arguments:\n  directory             Directory to analyze (required)\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --ai                  Enable AI-powered analysis and insights\n  --verbose, -v         Enable verbose logging  \n  --analysis-mode       Analysis approach: qualitative (insights) or quantitative (metrics)\n  --no-summary          Skip showing the summary at the end\n  --list-only           Only list supported files, don't analyze\n\nExamples:\n  docguru.py documents/                     # Project-aware analysis\n  docguru.py documents/ --ai               # With AI insights\n  docguru.py documents/ --verbose          # Verbose logging\n```\n\n### Supported File Types\n\n| Format | Extensions | Features |\n|--------|------------|----------|\n| **Word Documents** | `.doc`, `.docx` | Text extraction, metadata, styles, tables |\n| **PDF Files** | `.pdf` | Text extraction, page count, document properties |\n| **Text Files** | `.txt`, `.md`, `.markdown` | Content analysis, encoding detection |\n| **HTML Files** | `.html`, `.htm` | Text extraction, meta tags, links, images |\n| **Excel Files** | `.xlsx`, `.xls` | Sheet analysis, data 
preview, metadata |\n| **XML Files** | `.xml` | Structure analysis, namespaces, element counts |\n\n### Sample Report Structure\n\nGenerated reports include:\n\n#### **Comprehensive Report**\n- **Project Overview**: All detected projects with file counts and structure\n- **Document Analysis**: Detailed file-by-file analysis organized by project\n- **AI Insights**: Project-level summaries and cross-project patterns\n- **Statistics**: File types, sizes, and distribution metrics\n\n#### **Overview Report**  \n- **Executive Summary**: High-level analysis across all projects\n- **Project Comparison**: Relative project sizes and characteristics\n- **Key Findings**: Important insights and recommendations\n\n#### **Individual Project Reports**\n- **Project-Specific Analysis**: Deep dive into each detected project\n- **Document Breakdown**: All files within the project context\n- **Project Insights**: AI analysis specific to project content\n\n#### **Cross-Project Analysis**\n- **Relationship Analysis**: Common themes and patterns\n- **Comparative Insights**: Differences and similarities between projects\n- **Strategic Recommendations**: High-level organizational insights\n\n## ⚙️ Configuration\n\n### Environment Variables\n\n| Variable | Description | Required |\n|----------|-------------|----------|\n| `REPLICATE_API_TOKEN` | Replicate API token for Meta Llama-2-7b-chat | Optional (for AI features) |\n\n### Logging\n\nThe application automatically generates logs in `docguru.log` with:\n- Project detection and analysis progress\n- AI API interactions and responses  \n- Error details and troubleshooting information\n- File processing status and timing\n\n## 🏗️ Architecture\n\n```\ndocguru/\n├── docguru.py                   # Main CLI application entry point\n├── requirements.txt               # Python dependencies\n├── README.md                      # Project documentation\n├── .gitignore                     # Git ignore patterns\n├── src/                          # Source 
code modules\n│   ├── core/                     # Core analysis logic\n│   │   ├── scanner.py            # Project-aware directory scanning\n│   │   └── analyzer.py           # Main analysis orchestrator\n│   ├── analyzers/                # File-specific analyzers\n│   │   ├── text_analyzer.py      # Text and markdown files\n│   │   ├── pdf_analyzer.py       # PDF documents\n│   │   ├── word_analyzer.py      # Word documents (.doc/.docx)\n│   │   ├── excel_analyzer.py     # Excel spreadsheets\n│   │   ├── html_analyzer.py      # HTML files\n│   │   └── xml_analyzer.py       # XML files\n│   └── utils/                    # Utility modules\n│       ├── ai_analyzer.py        # AI integration (Replicate/Llama)\n│       └── project_report_generator.py # Session-based report generation\n├── tests/                        # Unit tests\n│   └── test_basic.py             # Basic functionality tests\n└── Reports/                      # Generated session reports (auto-created)\n    └── {session_folders}/        # Timestamped analysis sessions\n```\n\n## 🚀 Performance\n\nDocGuru v2.0 is optimized for project-aware analysis with:\n- **Efficient Project Detection**: Fast hierarchy scanning and project identification\n- **Memory Management**: Streaming content processing for large files\n- **Concurrent Processing**: Parallel AI analysis where possible\n- **Error Resilience**: Graceful handling of corrupted or inaccessible files\n\n### Typical Performance\n\n| Directory Size | File Count | Processing Time | Notes |\n|----------------|------------|-----------------|--------|\n| **Small Projects** | \u003c 50 files | 10-30 seconds | Without AI analysis |\n| **Medium Projects** | 50-200 files | 1-3 minutes | With basic AI analysis |  \n| **Large Projects** | 200+ files | 3-10 minutes | Full AI analysis with all report types |\n\n*Performance varies based on file sizes, complexity, and AI analysis depth*\n\n### AI Analysis Performance\n- **Document Analysis**: ~1-2 seconds per 
document\n- **Project Analysis**: ~3-5 seconds per project  \n- **Cross-Project Analysis**: ~5-10 seconds for final report\n- **Total AI Overhead**: +2-5 minutes for comprehensive AI insights\n\n## 🔧 Troubleshooting\n\n### Common Issues\n\n1. **\"No projects detected\" warning**\n   ```bash\n   # Ensure your directory has level-2 subdirectories\n   Documents/\n   ├── Project1/     # ← Level 2: Will be detected\n   │   └── files...\n   └── Project2/     # ← Level 2: Will be detected\n       └── files...\n   ```\n\n2. **AI analysis fails**\n   ```bash\n   # Check API token\n   echo $REPLICATE_API_TOKEN\n   \n   # Test with verbose logging\n   python3 docguru.py documents/ --ai --verbose\n   ```\n\n3. **\"No module named\" errors**\n   ```bash\n   # Reinstall dependencies\n   pip3 install -r requirements.txt\n   ```\n\n4. **Session folder creation fails**\n   ```bash\n   # Check write permissions in project directory\n   ls -la Reports/\n   chmod 755 Reports/\n   ```\n\n### Debug Mode\n\nEnable comprehensive logging:\n```bash\npython3 docguru.py /path/to/docs --verbose\n```\n\nCheck session logs:\n```bash\ntail -f docguru.log\n```\n\n## 🛠️ Development\n\n### Setting up for Development\n\n```bash\n# Clone repository\ngit clone https://github.com/giarcheuli/docparser.git\ncd docparser\n\n# Install dependencies\npip3 install -r requirements.txt\n\n# Run tests\npython3 tests/test_basic.py\n\n# Test with real data (need to create test directory)\nmkdir test_data \u0026\u0026 python3 docguru.py test_data/ --verbose\n```\n\n### Adding New File Types\n\n1. Create new analyzer in `src/analyzers/`\n2. Inherit from appropriate base class\n3. Implement `extract_text()` and `extract_metadata()` methods\n4. Add file extension to `SUPPORTED_EXTENSIONS` in `core/scanner.py`\n5. 
Register analyzer in `core/analyzer.py`\n\n### Project Structure Guidelines\n\n- **Level-2 Detection**: Projects are identified at 2 levels deep from root\n- **Session Management**: All reports go to timestamped session folders\n- **AI Integration**: Use project context for enhanced analysis\n- **Error Handling**: Graceful degradation when AI/analysis fails\n\n## 📝 License\n\nThis project is licensed under the MIT License - see the LICENSE file for details.\n\n## 🤝 Contributing\n\n1. Fork the repository\n2. Create a feature branch (`git checkout -b feature/amazing-feature`)\n3. Make your changes with proper testing\n4. Commit changes (`git commit -m 'Add amazing feature'`)\n5. Push to branch (`git push origin feature/amazing-feature`)\n6. Open a Pull Request\n\n### Contributing Guidelines\n\n- Follow Python PEP 8 style guidelines\n- Add tests for new functionality\n- Update documentation for new features\n- Ensure project-aware analysis remains intact\n- Test with both AI and non-AI modes\n\n## 📞 Support\n\nFor issues, questions, or contributions:\n\n1. **Check Issues**: Search existing GitHub issues\n2. **Create Issue**: Include logs, system info, and reproduction steps\n3. **Debug Mode**: Use `--verbose` flag for detailed logs\n4. 
**Documentation**: Check this README and code comments\n\n## 🗺️ Roadmap\n\n### Planned Features (v2.1)\n- [ ] **Configurable AI Integration**: Multi-provider support (Replicate, OpenAI, Anthropic, Gemini)\n- [ ] **Configurable Project Detection**: User-defined directory level for project identification\n- [ ] **JSON Output Format**: Alternative to markdown reports\n- [ ] **Batch Processing**: Multiple directory analysis\n- [ ] **Custom Templates**: User-defined report formats\n- [ ] **Watch Mode**: Monitor directories for changes\n- [ ] **Docker Support**: Containerized deployment\n- [ ] **Cloud Storage**: Direct integration with cloud services\n\n### AI Enhancements\n- [ ] **Multi-Provider Configuration**: Flexible AI provider selection with fallback support\n- [ ] **Custom Prompts**: User-defined analysis templates\n- [ ] **Workflow Analysis**: Document process understanding\n- [ ] **Sentiment Analysis**: Document tone and sentiment\n\n**See [ROADMAP.md](ROADMAP.md) for detailed feature specifications and implementation plans.**\n\n---\n\n**DocGuru v2.0** - Built with ❤️ for intelligent, project-aware document analysis","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgiarcheuli%2Fdocparser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgiarcheuli%2Fdocparser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgiarcheuli%2Fdocparser/lists"}