https://github.com/michael-borck/document-lens-legacy
Analyzes text documents for readability, academic integrity, and linguistic insights via REST API.
academic-integrity api docker document-analysis edtech microservice natural-language-processing nlp python readability rest-api text-analysis
- Host: GitHub
- URL: https://github.com/michael-borck/document-lens-legacy
- Owner: michael-borck
- License: mit
- Created: 2025-08-31T15:15:03.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2026-04-28T09:07:18.000Z (about 1 month ago)
- Last Synced: 2026-05-06T15:14:16.732Z (26 days ago)
- Topics: academic-integrity, api, docker, document-analysis, edtech, microservice, natural-language-processing, nlp, python, readability, rest-api, text-analysis
- Language: Python
- Size: 18.5 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# DocumentLens
**Text Analysis & Academic Intelligence Microservice**
Transform text content into actionable insights through comprehensive linguistic analysis, writing quality assessment, and academic integrity checking.
## Quick Start
```bash
# Docker deployment (recommended)
docker-compose up -d
# Or raw deployment
./deploy.sh
# API available at: http://localhost:8002
# Documentation: http://localhost:8002/docs
```
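After starting the service, it can be useful to wait for `GET /health` to respond before sending work. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and treating HTTP 200 as "healthy" is an assumption rather than documented behaviour.

```python
import time
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8002"  # address given in the Quick Start


def fetch_status(url, timeout=2.0):
    """Return the HTTP status code for a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return 0


def wait_until_healthy(base_url=BASE_URL, attempts=10, delay=1.0,
                       fetch=fetch_status, sleep=time.sleep):
    """Poll GET /health until it returns 200 (assumed to mean healthy)."""
    for attempt in range(attempts):
        if fetch(f"{base_url}/health") == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_until_healthy()` right after `docker-compose up -d` gives the container a few seconds to start. The `fetch` and `sleep` parameters exist only so the polling logic can be exercised without a running server.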
## API Endpoints
### Core Analysis
- `GET /health` - Service health check
- `POST /text` - Text analysis (readability, quality, word frequency)
- `POST /academic` - Academic analysis (citations, DOI resolution, integrity)
- `POST /files` - File upload + analysis (PDF, DOCX, TXT, MD)
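As an illustration of calling a core endpoint, the sketch below builds a `POST /text` request with the standard library. The JSON body shape (`{"text": ...}`) is an assumption, not taken from the service schema; the interactive documentation at `/docs` is the place to confirm the actual field names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8002"  # address given in the Quick Start


def build_text_request(text, base_url=BASE_URL):
    """Build (but do not send) a POST /text request.

    Assumption: the endpoint accepts a JSON object with a "text" field.
    """
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/text",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30.0):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the service running, `send(build_text_request("Some text to analyse."))` would return the parsed response. The same pattern should carry over to `POST /academic` by changing the path, again subject to the real request schema.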
### Advanced Text Analysis
- `POST /advanced/ngrams` - N-gram extraction with optional filter terms
- `POST /advanced/ner` - Named entity recognition
- `POST /advanced/search/keywords` - Batch keyword search across multiple terms
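To make "n-gram extraction with optional filter terms" concrete, here is a small local sketch of the idea. It is not the service's implementation: the tokenisation rule, the function name, and the decision to ignore sentence boundaries are all simplifying assumptions.

```python
import re
from collections import Counter


def extract_ngrams(text, n=2, filter_terms=None, top_k=10):
    """Count word n-grams, optionally keeping only those containing a filter term."""
    # Naive tokenisation: lowercase runs of letters, digits, and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # Sliding window of length n over the token list.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    wanted = {term.lower() for term in filter_terms} if filter_terms else None
    counts = Counter(
        " ".join(gram)
        for gram in ngrams
        if wanted is None or wanted.intersection(gram)
    )
    return counts.most_common(top_k)
```

For example, `extract_ngrams(text, n=2, filter_terms=["policy"])` keeps only bigrams in which one of the words is `policy`.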
### Document Intelligence
- `POST /files/infer-metadata` - Infer year, company, industry, document type from content
- `POST /text/infer-metadata` - Metadata inference from raw text
- Page-level text extraction (via `include_extracted_text=true` on `/files`)
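A file upload with page-level text extraction might be prepared as sketched below. Several details are assumptions: that the upload uses `multipart/form-data`, that the form field is named `file`, and that `include_extracted_text` is passed as a query parameter. Check `/docs` before relying on any of them.

```python
import uuid
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8002"  # address given in the Quick Start


def build_upload_request(filename, content, base_url=BASE_URL,
                         include_extracted_text=True, field_name="file"):
    """Build (but do not send) a multipart POST /files request.

    Assumptions: the form field name ("file") and the use of a query
    parameter for include_extracted_text are guesses, not documented facts.
    """
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        (f'Content-Disposition: form-data; name="{field_name}"; '
         f'filename="{filename}"\r\n').encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        content,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    query = urllib.parse.urlencode(
        {"include_extracted_text": str(include_extracted_text).lower()}
    )
    return urllib.request.Request(
        url=f"{base_url}/files?{query}",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

The returned request can be sent with `urllib.request.urlopen`. Reading the file is left to the caller, for example `build_upload_request("report.pdf", open("report.pdf", "rb").read())`.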
### Integration
- Root endpoint: `GET /` - Service info and available endpoints
- For presentations: Use [PresentationLens](https://github.com/michael-borck/presentation-lens)
- For recordings: Use [RecordingLens](https://github.com/michael-borck/recording-lens)
## Use Cases
- **Text Analysis**: Readability, writing quality, word frequency for any text content
- **Academic Analysis**: Citation verification, DOI resolution, AI detection, integrity checking
- **Document Intelligence**: Extract and analyze text from PDFs and Word documents
- **Sustainability Research**: Batch keyword analysis for TCFD, GRI, SDGs, SASB frameworks
- **Corporate Report Analysis**: Auto-detect metadata (year, company, industry) from annual reports
- **Multi-Service Workflows**: Integrate with specialized analysis services
### Desktop Application Support
DocumentLens powers the **document-lens-desktop** Electron application for researchers analyzing corporate sustainability reports. Features include:
- Smart metadata inference (company name, year, industry, document type)
- Framework keyword analysis (TCFD, GRI, SDGs, SASB)
- Batch processing with SQLite storage
- Offline operation via bundled Python backend
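The framework keyword analysis mentioned above essentially counts occurrences of framework-specific phrases in a document. A rough local sketch follows; the keyword lists are hypothetical placeholders chosen for illustration, not the lists used by DocumentLens or the desktop application.

```python
import re

# Hypothetical keyword lists, for illustration only.
FRAMEWORK_KEYWORDS = {
    "TCFD": ["climate risk", "scenario analysis"],
    "GRI": ["materiality", "stakeholder engagement"],
    "SDGs": ["clean energy", "gender equality"],
    "SASB": ["industry standards"],
}


def count_framework_keywords(text, frameworks=FRAMEWORK_KEYWORDS):
    """Return {framework: {keyword: count}} using whole-word phrase matching."""
    # Normalise to lowercase words separated by single spaces so that
    # punctuation and line breaks do not split multi-word phrases.
    normalized = " ".join(re.findall(r"[a-z0-9']+", text.lower()))
    results = {}
    for framework, keywords in frameworks.items():
        counts = {}
        for keyword in keywords:
            pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
            counts[keyword] = len(re.findall(pattern, normalized))
        results[framework] = counts
    return results
```

Applied across a folder of reports, the per-document dictionaries can be collected into one mapping keyed by file name, which is one simple way to approximate batch processing.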
## Microservices Ecosystem
DocumentLens is part of a focused microservices architecture:
| Service | Purpose | Repository |
|---------|---------|------------|
| **DocumentLens** | Text analysis & academic intelligence | *This repo* |
| **PresentationLens** | Presentation design & structure analysis | [presentation-lens](https://github.com/michael-borck/presentation-lens) |
| **RecordingLens** | Student recordings (video/audio) analysis | [recording-lens](https://github.com/michael-borck/recording-lens) |
| **CodeLens** | Source code quality & analysis | [code-lens](https://github.com/michael-borck/code-lens) |
| **SubmissionLens** | Student submission router & frontend | [submission-lens](https://github.com/michael-borck/submission-lens) |
### Integration Pattern
```mermaid
graph LR
A[Student Submission] --> B[SubmissionLens Frontend]
B --> C{File Type Router}
C -->|Text/PDF/DOCX| D[DocumentLens]
C -->|PPTX| E[PresentationLens]
C -->|Video/Audio| F[RecordingLens]
C -->|Source Code| G[CodeLens]
E --> D
F --> D
G --> D
D --> H[Combined Feedback]
H --> B
B --> I[Student Dashboard]
```
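The routing step in the diagram can be sketched as a lookup from file extension to service. This is only an illustration of the pattern, not SubmissionLens code: the extensions listed for recordings and source code are assumptions, while the DocumentLens and PresentationLens entries follow the diagram and the formats accepted by `/files`.

```python
from pathlib import Path

ROUTES = {
    "DocumentLens": {".txt", ".md", ".pdf", ".docx"},
    "PresentationLens": {".pptx"},
    "RecordingLens": {".mp4", ".mov", ".mp3", ".wav"},  # assumed extensions
    "CodeLens": {".py", ".js", ".java"},                # assumed extensions
}

# In the diagram, these services pass their extracted text on to DocumentLens.
FORWARDS_TO_DOCUMENTLENS = {"PresentationLens", "RecordingLens", "CodeLens"}


def route(filename):
    """Return the ordered list of services a submission would pass through."""
    suffix = Path(filename).suffix.lower()
    for service, suffixes in ROUTES.items():
        if suffix in suffixes:
            if service in FORWARDS_TO_DOCUMENTLENS:
                return [service, "DocumentLens"]
            return [service]
    raise ValueError(f"No service registered for file type: {suffix or filename!r}")
```

Under these assumptions, `route("slides.pptx")` yields `["PresentationLens", "DocumentLens"]`, mirroring the `E --> D` edge in the diagram.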
## Deployment
### Docker Deployment (Recommended)
```bash
git clone https://github.com/michael-borck/document-lens-legacy.git
cd document-lens-legacy
docker-compose up -d # Single container deployment
```
### Raw/Native Deployment
```bash
git clone https://github.com/michael-borck/document-lens-legacy.git
cd document-lens-legacy
./deploy.sh # Handles venv, dependencies, and production server
```
## Testing
```bash
# Install dev dependencies
uv sync --extra dev
# Run all tests
uv run pytest tests/ -v
# Run specific test file
uv run pytest tests/test_files.py -v
# Run only PDF tests
uv run pytest tests/ -m pdf -v
# Skip slow tests
uv run pytest tests/ -m "not slow" -v
# Run with coverage report (assumes the pytest-cov plugin is among the dev dependencies)
uv run pytest tests/ --cov --cov-report=term-missing
```
### Test Structure
- `tests/conftest.py` - Shared fixtures and test client setup
- `tests/test_health.py` - Health/smoke tests
- `tests/test_text_analysis.py` - Text analysis endpoint tests
- `tests/test_academic_analysis.py` - Academic analysis endpoint tests
- `tests/test_files.py` - PDF file upload tests
### Test Data
Place test files (PDF, DOCX, etc.) in the `test-data/` directory. The test suite automatically discovers and uses these files for parameterized tests.
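File discovery of this kind is commonly implemented with a small helper that feeds `pytest.mark.parametrize`. The sketch below shows one way it could work; the helper name, the recursive search, and the `client` fixture in the comment are assumptions, and the actual logic lives in the repository's test suite.

```python
from pathlib import Path

# Formats listed for POST /files.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".txt", ".md"}


def discover_test_files(data_dir="test-data"):
    """Return supported files under data_dir, sorted for stable test IDs."""
    root = Path(data_dir)
    if not root.is_dir():
        # No test data present: parametrized tests simply get no cases.
        return []
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )


# In a test module, the list would typically drive a parametrized test:
#
#   @pytest.mark.parametrize("path", discover_test_files(), ids=lambda p: p.name)
#   def test_upload(client, path):  # "client" is a hypothetical fixture
#       ...
```

Sorting the paths keeps test IDs stable between runs, which makes failures easier to compare.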
## Documentation
- `DEPLOYMENT.md` - Deployment guide for Docker and raw installations
- `DOCUMENTLENS_SETUP.md` - Setup and usage instructions
- `.env.example` - Configuration template
- `docs/` - Additional architecture and integration documentation
---
*DocumentLens: Pure text intelligence at the heart of content analysis*