{"id":29224762,"url":"https://github.com/kareemsasa3/arachne","last_synced_at":"2026-05-04T10:34:27.546Z","repository":{"id":302178631,"uuid":"1011518813","full_name":"kareemsasa3/arachne","owner":"kareemsasa3","description":"A resilient, concurrent web scraper service built in Go, featuring a REST API, Redis-backed job queue, and circuit breaker for fault tolerance.","archived":false,"fork":false,"pushed_at":"2025-07-02T02:52:49.000Z","size":93,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-07-02T03:22:03.223Z","etag":null,"topics":["asynchronous","circuit-breaker","concurrency","crawler","docker","docker-compose","go","golang","job-queue","rate-limiting","redis","rest-api","web-scraper","web-scraping"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kareemsasa3.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-07-01T00:28:08.000Z","updated_at":"2025-07-02T02:51:46.000Z","dependencies_parsed_at":"2025-07-02T03:22:05.256Z","dependency_job_id":null,"html_url":"https://github.com/kareemsasa3/arachne","commit_stats":null,"previous_names":["kareemsasa3/go-practice","kareemsasa3/arachne"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/kareemsasa3/arachne","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kareemsasa3%2Farachne","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kareemsasa3%2Farachne/tags","relea
ses_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kareemsasa3%2Farachne/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kareemsasa3%2Farachne/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kareemsasa3","download_url":"https://codeload.github.com/kareemsasa3/arachne/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kareemsasa3%2Farachne/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263271499,"owners_count":23440396,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["asynchronous","circuit-breaker","concurrency","crawler","docker","docker-compose","go","golang","job-queue","rate-limiting","redis","rest-api","web-scraper","web-scraping"],"created_at":"2025-07-03T06:07:54.293Z","updated_at":"2026-05-04T10:34:27.539Z","avatar_url":"https://github.com/kareemsasa3.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Arachne - Autonomous Web Research Platform\n\nAn autonomous research platform that searches, scrapes, indexes, and synthesizes web content using AI. 
Arachne continuously detects changes, keeps version history, and offers full-text search (FTS5) across collected documents.\n\n## 🏗️ Architecture\n\n### Folder Structure\n```\narachne/\n├── services/\n│   ├── ai/                      # AI microservice (git submodule - nexus)\n│   ├── scraper/                 # Standalone scraping engine (git submodule)\n│   └── web/                     # Next.js Arachne web interface (in repo)\n├── infrastructure/              # Deployment \u0026 infrastructure (nginx, compose, scripts)\n└── README.md\n```\n\n### Services\nThis setup orchestrates the following services:\n\n- **Nginx** - Reverse proxy and TLS termination\n- **AI** - AI microservice (Node.js)\n- **Web** - Arachne Web Interface (Next.js)\n- **Scraper** - Web scraping service (Go)\n- **Redis** - Job storage and coordination\n- **Redis Commander** - Optional Redis management UI\n\n## 🤔 What is Arachne?\n\n- Autonomous research agent that orchestrates search, scrape, and synthesis.\n- Web search → scrape → index → AI synthesis pipeline.\n- Change detection and version history across fetched content.\n- Full-text search powered by SQLite FTS5 for collected documents.\n\n## 🚀 Quick Start\n\n### Prerequisites\n\n- Docker\n- Docker Compose\n\n### Running the platform\n\n1. **Start all services:**\n   ```bash\n   cd infrastructure\n   docker compose up --build\n   ```\n\n2. **Access the applications:**\n   - **Arachne Web Interface**: http://localhost/\n   - **AI API**: http://localhost/api/ai/\n   - **Scraper API**: http://localhost/api/arachne/\n   - **Redis Commander**: http://localhost/redis/\n   - **Health Check**: http://localhost/health\n\n3. 
**Stop all services:**\n   ```bash\n   cd infrastructure\n   docker compose down\n   ```\n\n## 📁 Service Endpoints\n\n### AI\n- **URL**: http://localhost/api/ai/\n- **Internal**: http://ai:3001\n- **Endpoints**:\n  - `GET /health` - Health check\n  - `POST /api/ai/process` - AI processing\n\n### Scraper (Web Scraping)\n- **URL**: http://localhost/api/arachne/\n- **Internal**: http://scraper:8080\n- **Endpoints**:\n  - `POST /api/arachne/scrape` - Submit scraping job\n  - `GET /api/arachne/scrape/status?id=\u003cjob_id\u003e` - Check job status\n  - `GET /health` - Health check\n  - `GET /metrics` - Prometheus metrics\n\n### Scraper API\n- `POST /api/arachne/scrape` - Accepts `{ \"urls\": [\"https://example.com\"] }`\n- `GET /api/arachne/scrape/status?id=\u003cjob_id\u003e`\n- `GET /api/arachne/memory/*`\n- `GET /memory/*` - Direct passthrough to the scraper service\n\n### Redis Commander\n- **URL**: http://localhost/redis/\n- **Purpose**: Web UI for Redis management\n\n## 💾 Persistence Model\n\n- Scraper snapshots are stored in SQLite at `/app/data/snapshots.db`\n- SQLite persistence is backed by the Docker volume `scraper_data`\n- Redis data also persists via a Docker volume\n\n## 🧭 Routing Model\n\n- Nginx is the single entry point and routes requests by path prefix\n- `/api/arachne` → scraper\n- `/memory` → scraper\n- `/api/ai` → AI\n- `/` → web\n\n## 🔧 Configuration\n\n### Environment Variables\n\nArachne uses a centralized environment variable system. 
For detailed configuration, see [Environment Setup Guide](infrastructure/ENVIRONMENT_SETUP.md).\n\n#### Quick Setup\n\n```bash\ncd infrastructure\n./setup-env.sh\n```\n\nThis interactive script will help you configure:\n- Domain name and SSL email\n- Google Gemini API key for AI features\n- Resource limits and performance settings\n- Development vs production configurations\n\n#### Manual Setup\n\n```bash\ncd infrastructure\ncp env.example .env\n# Edit .env with your configuration\nnano .env\n```\n\n#### Key Configuration Variables\n\n| Variable | Description | Required |\n|----------|-------------|----------|\n| `DOMAIN_NAME` | Your domain name | Yes |\n| `SSL_EMAIL` | Email for SSL certificates | Yes |\n| `GEMINI_API_KEY` | Google Gemini API key | For AI features |\n| `VITE_AI_URL` | AI URL | Auto-configured |\n\n### Nginx Configuration\n\nThe nginx configuration is located in:\n- `infrastructure/nginx/nginx.conf` - Main configuration\n- `infrastructure/nginx/conf.d/default.conf` - Server blocks\n\n## 🐳 Individual Service Development\n\nEach service can be developed independently:\n\n### AI\n```bash\ncd services/ai\nnpm install\nnpm run dev\n```\n\n### Web Console\n```bash\ncd services/web\nnpm install\nnpm run dev\n```\n\n### Scraper\n```bash\ncd services/scraper\ndocker-compose up --build\n```\n\n## 🔗 Submodules\n\nThe `ai` and `scraper` services are git submodules under `services/`. The `web` interface lives directly in this repository. If you clone without `--recurse-submodules`, run:\n\n```bash\ngit submodule update --init --recursive\n```\n\nThis fetches the `ai` and `scraper` submodules. 
When switching branches that touch submodules, rerun the command; if a submodule URL changed, run `git submodule sync --recursive` first.\n\n## 📊 Monitoring\n\n### Health Checks\nAll services include health checks that can be monitored:\n```bash\ncd infrastructure\ndocker compose ps\n```\n\n### Logs\nView logs for specific services:\n```bash\ncd infrastructure\ndocker compose logs ai\ndocker compose logs scraper\ndocker compose logs nginx\ndocker compose logs web\ndocker compose logs redis\n```\n\n### Redis Monitoring\nAccess Redis Commander at http://localhost/redis/ to monitor Redis operations.\n\n## 🔒 Security\n\n- All services run as non-root users\n- Rate limiting on API endpoints\n- Security headers configured in nginx\n- CORS properly configured for cross-origin requests\n\n## 🚀 Production Deployment\n\nFor production deployment:\n\n1. **Configure environment variables**:\n   ```bash\n   cd infrastructure\n   ./setup-env.sh\n   ```\n\n2. **Set up SSL certificates**:\n   ```bash\n   docker compose -f prod/docker-compose.prod.yml --profile ssl-setup up certbot\n   ```\n\n3. **Start production services**:\n   ```bash\n   docker compose -f prod/docker-compose.prod.yml up -d\n   ```\n\n4. **Monitor the deployment**:\n   ```bash\n   docker compose -f prod/docker-compose.prod.yml logs -f\n   ```\n\nFor detailed deployment instructions, see [Environment Setup Guide](infrastructure/ENVIRONMENT_SETUP.md).\n\n## System Deploy (Erebus)\n\n`scripts/deploy-system.sh` syncs this repo into the live stack on Erebus and restarts it. It is a manual path; no CI/CD is involved.\n\n**Prerequisites:** `scripts/install-system.sh` must have been run at least once to set up `/opt/arachne`, `/etc/arachne/`, and `arachne.service`.\n\n```bash\n# Preview what would change (no writes):\nsudo ./scripts/deploy-system.sh --dry-run\n\n# Deploy:\nsudo ./scripts/deploy-system.sh\n```\n\nThe script:\n1. Checks that all git submodules (`services/ai`, `services/scraper`) are populated in the source tree. 
If not, it tells you to run `git submodule update --init --recursive` and exits.\n2. `rsync`s the repo into `/opt/arachne`, excluding `.git`, `node_modules`, `.next`, `dist`, `build`, `.env`, and `.vscode`.\n3. Does **not** touch `/etc/arachne/` (runtime config) or `/var/lib/arachne/` (scraper data).\n4. Writes the deployed git SHA and timestamp to `/opt/arachne/.deploy-revision` so you can always answer \"what's live right now?\".\n5. Restarts `arachne.service` (which rebuilds and recreates containers). This is synchronous for the `Type=oneshot` unit: the restart command blocks until the service has fully settled.\n6. Checks `systemctl status` and `docker ps`, then polls the `/health` endpoint with retries.\n\n## 📝 Troubleshooting\n\n### Common Issues\n\n1. **Port conflicts**: Ensure ports 80 and 443 are available\n2. **Build failures**: Check Dockerfile syntax in each service\n3. **Service dependencies**: Ensure Redis starts before Arachne\n4. **Network issues**: Check that all services are on the same network\n\n### Debug Commands\n\n```bash\ncd infrastructure\n\n# Check service status\ndocker compose ps\n\n# View logs\ndocker compose logs -f\n\n# Rebuild specific service\ndocker compose build web\n\n# Access service shell\ndocker compose exec web sh\n```\n\n## 🤝 Contributing\n\nEach service is maintained independently. See individual service READMEs for contribution guidelines.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkareemsasa3%2Farachne","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkareemsasa3%2Farachne","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkareemsasa3%2Farachne/lists"}