{"id":34917590,"url":"https://github.com/monokrome/foiacquire","last_synced_at":"2026-01-30T00:55:40.069Z","repository":{"id":326993475,"uuid":"1107368784","full_name":"monokrome/foiacquire","owner":"monokrome","description":"FOIA Acquisition Tool","archived":false,"fork":false,"pushed_at":"2026-01-26T07:41:58.000Z","size":1475,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-01-26T22:01:18.126Z","etag":null,"topics":["archiving","cli","document-management","foia","freedom-of-information","full-text-search","government-transparency","investigative-journalism","journalism","ocr","open-source-intelligence","osint","public-records","rust","scraper","server"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/monokrome.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":".github/FUNDING.yml","license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":["monokrome"]}},"created_at":"2025-12-01T03:25:23.000Z","updated_at":"2026-01-26T07:42:01.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/monokrome/foiacquire","commit_stats":null,"previous_names":["monokrome/foiacquire"],"tags_count":39,"template":false,"template_full_name":null,"purl":"pkg:github/monokrome/foiacquire","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/monokrome%2Ffoiacquire","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/monokrome%2Ffoiacquire/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/monokrome%2Ffoiacquire/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/monokrome%2Ffoiacquire/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/monokrome","download_url":"https://codeload.github.com/monokrome/foiacquire/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/monokrome%2Ffoiacquire/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28892719,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-29T21:06:44.224Z","status":"ssl_error","status_checked_at":"2026-01-29T21:06:42.160Z","response_time":59,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["archiving","cli","document-management","foia","freedom-of-information","full-text-search","government-transparency","investigative-journalism","journalism","ocr","open-source-intelligence","osint","public-records","rust","scraper","server"],"created_at":"2025-12-26T12:52:30.570Z","updated_at":"2026-01-30T00:55:40.052Z","avatar_url":"https://github.com/monokrome.png","language":"Rust","funding_links":["https://github.com/sponsors/monokrome"],"categories":[],"sub_categories":[],"readme":"# foiacquire\n\nA command-line tool for acquiring, organizing, and searching FOIA documents from government archives and other sources.\n\n## Features\n\n- **Multi-source scraping** - Configurable scrapers for FBI Vault, CIA Reading Room, MuckRock, DocumentCloud, and custom sources\n- **Privacy by default** - Routes traffic through Tor with pluggable transports; supports external SOCKS proxies\n- **Smart rate limiting** - Adaptive delays with exponential backoff to avoid blocks, with optional Redis backend for distributed deployments\n- **Content-addressable storage** - Documents stored by SHA-256 + BLAKE3 hash for deduplication\n- **Multiple OCR backends** - Tesseract (default), OCRS (pure Rust), PaddleOCR (GPU), or DeepSeek (VLM)\n- **Browser automation** - Chromium-based scraping for JavaScript-heavy sites with stealth mode for bot detection bypass\n- **Browser pool** - Load balance across multiple browser instances with round-robin, random, or per-domain strategies\n- **WARC import** - Import documents from Web Archive files with filtering and checkpointing\n- **Full-text search** - Search across document content and metadata\n- **Web UI** - Browse, search, and view documents through a local web interface\n- **LLM annotation** - Generate summaries and tags using Ollama, Groq, OpenAI, or Together.ai\n- **Database flexibility** - SQLite (default) or PostgreSQL for larger deployments\n- **Docker support** - Pre-built images for easy deployment\n\n## Installation\n\nDownload a pre-built binary from [Releases](https://github.com/monokrome/foiacquire/releases), or build from source:\n\n```bash\n# Default build (SQLite + browser automation)\ncargo install --git https://github.com/monokrome/foiacquire\n\n# With PostgreSQL support\ncargo install --git https://github.com/monokrome/foiacquire --features postgres\n\n# With all OCR backends\ncargo install --git https://github.com/monokrome/foiacquire --features ocr-all\n```\n\n### Feature Flags\n\n| Feature | Description |\n|---------|-------------|\n| `browser` | Browser automation via Chromium (default) |\n| `postgres` | PostgreSQL database support |\n| `redis-backend` | Redis for distributed rate limiting |\n| `ocr-ocrs` | OCRS pure-Rust OCR backend |\n| `ocr-paddle` | PaddleOCR CNN-based backend |\n| `ocr-all` | All OCR backends |\n\n## Quick Start\n\n```bash\n# Initialize with a target directory\nfoiacquire init --target ./foia-data\n\n# Or use an existing config\ncp etc/example.json foiacquire.json\nfoiacquire init\n\n# List configured sources\nfoiacquire source list\n\n# Scrape documents (crawl + download)\nfoiacquire scrape fbi_vault --limit 100\n\n# Run OCR on downloaded documents\nfoiacquire analyze --workers 4\n\n# Generate summaries with LLM (requires Ollama)\nfoiacquire annotate --limit 50\n\n# Start web UI\nfoiacquire serve\n# Open http://localhost:3030\n```\n\n## Commands\n\n### Document Acquisition\n\n| Command | Description |\n|---------|-------------|\n| `scrape \u003csource\u003e` | Crawl and download documents from a source |\n| `crawl \u003csource\u003e` | Discover document URLs without downloading |\n| `download [source]` | Download documents from the crawl queue |\n| `import \u003cfiles...\u003e` | Import from WARC archive files |\n| `refresh [source]` | Re-fetch metadata for existing documents |\n\n### Document Processing\n\n| Command | Description |\n|---------|-------------|\n| `analyze [source]` | Extract text and run OCR on documents |\n| `analyze-check` | Verify OCR tools are installed |\n| `analyze-compare \u003cfile\u003e` | Compare OCR backends on a file |\n| `annotate [source]` | Generate summaries/tags with LLM |\n| `detect-dates [source]` | Detect publication dates in documents |\n| `archive [source]` | Extract contents from ZIP/email attachments |\n\n### Browsing \u0026 Search\n\n| Command | Description |\n|---------|-------------|\n| `ls` | List documents with filtering |\n| `info \u003cdoc_id\u003e` | Show document metadata |\n| `read \u003cdoc_id\u003e` | Output document content |\n| `search \u003cquery\u003e` | Full-text search |\n| `serve [bind]` | Start web interface (default: 127.0.0.1:3030) |\n\n### Management\n\n| Command | Description |\n|---------|-------------|\n| `init` | Initialize database and directories |\n| `source list` | List configured sources |\n| `source rename` | Rename a source |\n| `config recover` | Recover config from database |\n| `config history` | Show configuration history |\n| `db copy \u003cfrom\u003e \u003cto\u003e` | Migrate between SQLite/PostgreSQL |\n| `state status` | Show crawl state |\n| `state clear \u003csource\u003e` | Reset crawl state |\n\n## Configuration\n\nCreate a `foiacquire.json` in your data directory or use `--config`:\n\n```json\n{\n  \"target\": \"./foia_documents/\",\n  \"scrapers\": {\n    \"my_source\": {\n      \"discovery\": {\n        \"type\": \"html_crawl\",\n        \"base_url\": \"https://example.gov/foia\",\n        \"start_paths\": [\"/documents\"],\n        \"document_links\": [\"a[href*='/doc/']\"],\n        \"document_patterns\": [\"\\\\.pdf$\"],\n        \"pagination\": {\n          \"next_selectors\": [\"a.next-page\"]\n        }\n      },\n      \"fetch\": {\n        \"use_browser\": false\n      }\n    }\n  }\n}\n```\n\nSee [docs/configuration.md](docs/configuration.md) for full options.\n\n## Environment Variables\n\n| Variable | Description |\n|----------|-------------|\n| `DATABASE_URL` | Database connection (e.g., `postgres://user:pass@host/db`) |\n| `BROWSER_URL` | Remote Chrome DevTools URL (e.g., `ws://localhost:9222`), comma-separated for pool |\n| `SOCKS_PROXY` | External SOCKS5 proxy (e.g., `socks5://localhost:9050`) |\n| `FOIACQUIRE_DIRECT` | Set to `1` to disable Tor routing |\n| `LLM_PROVIDER` | LLM provider: `ollama`, `openai`, `groq`, or `together` |\n| `LLM_MODEL` | Model for annotation |\n| `GROQ_API_KEY` | Groq API key (auto-selects Groq provider) |\n| `RUST_LOG` | Log level (e.g., `debug`, `info`) |\n\n## Docker\n\n```bash\n# Run with local data directory\ndocker run -v ./foia-data:/opt/foiacquire \\\n  -e USER_ID=$(id -u) -e GROUP_ID=$(id -g) \\\n  monokrome/foiacquire:latest scrape fbi_vault\n\n# With PostgreSQL\ndocker run -v ./foia-data:/opt/foiacquire \\\n  -e DATABASE_URL=postgres://user:pass@host/foiacquire \\\n  monokrome/foiacquire:latest scrape fbi_vault\n\n# Start web UI\ndocker run -v ./foia-data:/opt/foiacquire \\\n  -p 3030:3030 \\\n  monokrome/foiacquire:latest serve 0.0.0.0:3030\n\n# With browser automation (stealth mode for bot detection bypass)\ndocker run -d --name chromium --shm-size=2g monokrome/chromium:stealth\ndocker run -v ./foia-data:/opt/foiacquire \\\n  -e BROWSER_URL=ws://chromium:9222 \\\n  --link chromium \\\n  monokrome/foiacquire:latest scrape cia_foia\n```\n\nSee [docs/docker.md](docs/docker.md) for Docker Compose examples, VNC setup, and Synology configuration.\n\n## Documentation\n\n- [Getting Started](docs/getting-started.md) - First-time setup guide\n- [Configuration](docs/configuration.md) - All configuration options\n- [Commands](docs/commands.md) - Detailed command reference\n- [Scrapers](docs/scrapers.md) - Writing custom scraper configs\n- [Docker Deployment](docs/docker.md) - Container deployment guide\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmonokrome%2Ffoiacquire","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmonokrome%2Ffoiacquire","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmonokrome%2Ffoiacquire/lists"}