https://github.com/kareemsasa3/arachne
A resilient, concurrent web scraper service built in Go, featuring a REST API, Redis-backed job queue, and circuit breaker for fault tolerance.
- Host: GitHub
- URL: https://github.com/kareemsasa3/arachne
- Owner: kareemsasa3
- License: mit
- Created: 2025-07-01T00:28:08.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2025-07-02T02:52:49.000Z (12 months ago)
- Last Synced: 2025-07-02T03:22:03.223Z (12 months ago)
- Topics: asynchronous, circuit-breaker, concurrency, crawler, docker, docker-compose, go, golang, job-queue, rate-limiting, redis, rest-api, web-scraper, web-scraping
- Language: Go
- Homepage:
- Size: 90.8 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Arachne - Autonomous Web Research Platform
An autonomous research platform that searches, scrapes, indexes, and synthesizes web content using AI. Arachne continuously detects changes, keeps version history, and offers full-text search (FTS5) across collected documents.
## 🏗️ Architecture
### Folder Structure
```
arachne/
├── services/
│ ├── ai/ # AI microservice (git submodule - nexus)
│ ├── scraper/ # Standalone scraping engine (git submodule)
│ └── web/ # Next.js Arachne web interface (in repo)
├── infrastructure/ # Deployment & infrastructure (nginx, compose, scripts)
└── README.md
```
### Services
This setup orchestrates the following services:
- **Nginx** - Reverse proxy and TLS termination
- **AI** - AI microservice (Node.js)
- **Web** - Arachne Web Interface (Next.js)
- **Scraper** - Web scraping service (Go)
- **Redis** - Job storage and coordination
- **Redis Commander** - Optional Redis management UI
## 🤔 What is Arachne?
- Autonomous research agent that orchestrates search, scrape, and synthesis.
- Web search → scrape → index → AI synthesis pipeline.
- Change detection and version history across fetched content.
- Full-text search powered by SQLite FTS5 for collected documents.
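Change detection with version history can be sketched in a few lines of Go. This is a minimal illustration of one common approach (comparing content hashes), not necessarily what the scraper does internally. The `version` and `history` types and the `record` method are hypothetical names, and the sketch keeps everything in memory, whereas the scraper persists snapshots in SQLite.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// version is one stored snapshot of a URL's content (hypothetical shape).
type version struct {
	Hash    string
	Content string
}

// history keeps versions per URL in memory for illustration only.
type history map[string][]version

// record appends a new version only when the content hash differs from the
// most recent one, and reports whether a change was detected. The first
// fetch of a URL always counts as a change.
func (h history) record(url, content string) bool {
	sum := sha256.Sum256([]byte(content))
	hash := hex.EncodeToString(sum[:])
	versions := h[url]
	if n := len(versions); n > 0 && versions[n-1].Hash == hash {
		return false
	}
	h[url] = append(versions, version{Hash: hash, Content: content})
	return true
}

func main() {
	h := history{}
	for _, body := range []string{"v1", "v1", "v2"} {
		changed := h.record("https://example.com", body)
		fmt.Printf("content=%q changed=%t\n", body, changed)
	}
	fmt.Println("stored versions:", len(h["https://example.com"]))
}
```

Hashing keeps the comparison cheap regardless of page size; a real implementation would likely normalize the content first (for example, stripping timestamps) to avoid recording spurious changes.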
## 🚀 Quick Start
### Prerequisites
- Docker
- Docker Compose
### Running the platform
1. **Start all services:**
```bash
cd infrastructure
docker compose up --build
```
2. **Access the applications:**
   - **Arachne Web Interface**: http://localhost/
   - **AI API**: http://localhost/api/ai/
   - **Scraper API**: http://localhost/api/arachne/
   - **Redis Commander**: http://localhost/redis/
   - **Health Check**: http://localhost/health
3. **Stop all services:**
```bash
cd infrastructure
docker compose down
```
## 📁 Service Endpoints
### AI
- **URL**: http://localhost/api/ai/
- **Internal**: http://ai:3001
- **Endpoints**:
  - `GET /health` - Health check
  - `POST /api/ai/process` - AI processing
### Scraper (Web Scraping)
- **URL**: http://localhost/api/arachne/
- **Internal**: http://scraper:8080
- **Endpoints**:
  - `POST /api/arachne/scrape` - Submit scraping job
  - `GET /api/arachne/scrape/status?id=` - Check job status
  - `GET /health` - Health check
  - `GET /metrics` - Prometheus metrics
### Scraper API
- `POST /api/arachne/scrape` - Accepts `{ "urls": ["https://example.com"] }`
- `GET /api/arachne/scrape/status?id=`
- `GET /api/arachne/memory/*`
- `GET /memory/*` - Direct passthrough to the scraper service
### Redis Commander
- **URL**: http://localhost/redis/
- **Purpose**: Web UI for Redis management
## 💾 Persistence Model
- Scraper snapshots are stored in SQLite at `/app/data/snapshots.db`
- SQLite persistence is backed by the Docker volume `scraper_data`
- Redis data also persists via a Docker volume
## 🧭 Routing Model
- Nginx is the single entry point and routes requests by path prefix:
  - `/api/arachne` → scraper
  - `/memory` → scraper
  - `/api/ai` → AI
  - `/` → web
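The routing table above can be modeled as a longest-prefix lookup. The Go sketch below is only an illustration of that idea: the `route` type and `upstreamFor` function are hypothetical, and the real behavior is defined by the nginx `location` blocks, which may include rewrites or matching rules not shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// route maps a path prefix to the service that handles it.
type route struct {
	prefix   string
	upstream string
}

// routes mirrors the prefixes listed in the routing model.
var routes = []route{
	{"/api/arachne", "scraper"},
	{"/memory", "scraper"},
	{"/api/ai", "ai"},
	{"/", "web"},
}

// upstreamFor returns the upstream whose prefix is the longest match for
// path, or "" if nothing matches (for example, an empty path).
func upstreamFor(path string) string {
	best := route{}
	for _, r := range routes {
		if strings.HasPrefix(path, r.prefix) && len(r.prefix) > len(best.prefix) {
			best = r
		}
	}
	return best.upstream
}

func main() {
	for _, p := range []string{"/api/arachne/scrape", "/memory/recent", "/api/ai/process", "/about"} {
		fmt.Printf("%s -> %s\n", p, upstreamFor(p))
	}
}
```

Choosing the longest matching prefix means the catch-all `/` only applies when no more specific prefix matches.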
## 🔧 Configuration
### Environment Variables
Arachne uses a centralized environment variable system. For detailed configuration, see [Environment Setup Guide](infrastructure/ENVIRONMENT_SETUP.md).
#### Quick Setup
```bash
cd infrastructure
./setup-env.sh
```
This interactive script will help you configure:
- Domain name and SSL email
- Google Gemini API key for AI features
- Resource limits and performance settings
- Development vs production configurations
#### Manual Setup
```bash
cd infrastructure
cp env.example .env
# Edit .env with your configuration
nano .env
```
#### Key Configuration Variables
| Variable | Description | Required |
|----------|-------------|----------|
| `DOMAIN_NAME` | Your domain name | Yes |
| `SSL_EMAIL` | Email for SSL certificates | Yes |
| `GEMINI_API_KEY` | Google Gemini API key | For AI features |
| `VITE_AI_URL` | Base URL of the AI service | Auto-configured |
### Nginx Configuration
The nginx configuration is located in:
- `infrastructure/nginx/nginx.conf` - Main configuration
- `infrastructure/nginx/conf.d/default.conf` - Server blocks
## 🐳 Individual Service Development
Each service can be developed independently:
### AI
```bash
cd services/ai
npm install
npm run dev
```
### Web
```bash
cd services/web
npm install
npm run dev
```
### Scraper
```bash
cd services/scraper
docker compose up --build
```
## 🔗 Submodules
The `ai` and `scraper` services are git submodules under `services/`. The `web` interface lives directly in this repository. If you clone without `--recurse-submodules`, run:
```bash
git submodule update --init --recursive
```
This fetches the `ai` and `scraper` submodules. When switching branches that touch submodules, rerun the command. If a branch changes a submodule's URL in `.gitmodules`, run `git submodule sync --recursive` first so the local configuration picks up the new URL.
## 📊 Monitoring
### Health Checks
All services define health checks. `docker compose ps` reports each container's health status:
```bash
cd infrastructure
docker compose ps
```
### Logs
View logs for specific services:
```bash
cd infrastructure
docker compose logs ai
docker compose logs scraper
docker compose logs nginx
docker compose logs web
docker compose logs redis
```
### Redis Monitoring
Access Redis Commander at http://localhost/redis/ to monitor Redis operations.
## 🔒 Security
- All services run as non-root users
- Rate limiting on API endpoints
- Security headers configured in nginx
- CORS properly configured for cross-origin requests
## 🚀 Production Deployment
For production deployment:
1. **Configure environment variables**:
```bash
cd infrastructure
./setup-env.sh
```
2. **Set up SSL certificates**:
```bash
docker compose -f prod/docker-compose.prod.yml --profile ssl-setup up certbot
```
3. **Start production services**:
```bash
docker compose -f prod/docker-compose.prod.yml up -d
```
4. **Monitor the deployment**:
```bash
docker compose -f prod/docker-compose.prod.yml logs -f
```
For detailed deployment instructions, see [Environment Setup Guide](infrastructure/ENVIRONMENT_SETUP.md).
## System Deploy (Erebus)
`scripts/deploy-system.sh` syncs this repo into the live stack on Erebus and restarts it. It is a manual path — no CI/CD involved.
**Prerequisites:** `scripts/install-system.sh` must have been run at least once to set up `/opt/arachne`, `/etc/arachne/`, and `arachne.service`.
```bash
# Preview what would change (no writes):
sudo ./scripts/deploy-system.sh --dry-run
# Deploy:
sudo ./scripts/deploy-system.sh
```
The script:
1. Checks that all git submodules (`services/ai`, `services/scraper`) are populated in the source tree. If not, it tells you to run `git submodule update --init --recursive` and exits.
2. `rsync`s the repo into `/opt/arachne`, excluding `.git`, `node_modules`, `.next`, `dist`, `build`, `.env`, and `.vscode`.
3. Does **not** touch `/etc/arachne/` (runtime config) or `/var/lib/arachne/` (scraper data).
4. Writes the deployed git SHA and timestamp to `/opt/arachne/.deploy-revision` so you can always answer "what's live right now?".
5. Restarts `arachne.service` (which rebuilds and recreates containers). This is synchronous for the `Type=oneshot` unit — the restart command blocks until the service has fully settled.
6. Checks `systemctl status`, `docker ps`, and polls the `/health` endpoint with retries.
## 📝 Troubleshooting
### Common Issues
1. **Port conflicts**: Ensure ports 80 and 443 are free on the host
2. **Build failures**: Check Dockerfile syntax in each service
3. **Service dependencies**: Ensure Redis is running before the scraper starts
4. **Network issues**: Check that all services are on the same Docker network
### Debug Commands
```bash
cd infrastructure
# Check service status
docker compose ps
# View logs
docker compose logs -f
# Rebuild specific service
docker compose build web
# Access service shell
docker compose exec web sh
```
## 🤝 Contributing
Each service is maintained independently. See individual service READMEs for contribution guidelines.