{"id":51060431,"url":"https://github.com/theognis1002/nimbus-crawler","last_synced_at":"2026-06-23T01:30:58.045Z","repository":{"id":339280575,"uuid":"1161251804","full_name":"theognis1002/nimbus-crawler","owner":"theognis1002","description":"Highly concurrent web crawler written in Go","archived":false,"fork":false,"pushed_at":"2026-02-19T04:26:41.000Z","size":68,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-02-19T04:30:40.649Z","etag":null,"topics":["crawler","docker","golang","message-queue","postgresql","redis"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/theognis1002.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2026-02-18T22:40:30.000Z","updated_at":"2026-02-19T04:26:45.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/theognis1002/nimbus-crawler","commit_stats":null,"previous_names":["theognis1002/nimbus-crawler"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/theognis1002/nimbus-crawler","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/theognis1002%2Fnimbus-crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/theognis1002%2Fnimbus-crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/theognis1002%2Fnimbus-crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/theognis1002%2Fnimbus-crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/theognis1002","download_url":"https://codeload.github.com/theognis1002/nimbus-crawler/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/theognis1002%2Fnimbus-crawler/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34672250,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-22T02:00:06.391Z","response_time":106,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","docker","golang","message-queue","postgresql","redis"],"created_at":"2026-06-23T01:30:57.425Z","updated_at":"2026-06-23T01:30:58.039Z","avatar_url":"https://github.com/theognis1002.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Nimbus Crawler\n\n[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)\n\nA distributed web crawler built in Go with a message-driven microservices architecture. Nimbus fetches, parses, and stores web pages at scale using a pipeline of loosely coupled workers coordinated through Redis Streams.\n\n## Architecture\n\n```\nseeds.txt\n    |\n    v\n Seeder --\u003e Redis Streams (stream:frontier) --\u003e Crawler Workers --\u003e MinIO (html)\n                                                     |\n                                                     v\n                                              Redis Streams (stream:parse)\n                                                     |\n                                                     v\n                                              Parser Workers --\u003e MinIO (text)\n                                                     |\n                                                     '-\u003e new URLs back to stream:frontier\n                                                          (up to max_depth)\n```\n\n| Component  | Technology | Purpose                                          |\n| ---------- | ---------- | ------------------------------------------------ |\n| PostgreSQL | 18         | URL/domain records, crawl state                  |\n| Redis      | 8          | DNS cache, rate limiting, robots.txt, job queues |\n| MinIO      | pinned     | S3-compatible storage (HTML + text)              |\n\n## Quick Start\n\n```bash\ngit clone https://github.com/theognis1002/nimbus-crawler.git\ncd nimbus-crawler\ncp .env.example .env\nmake dev\n```\n\n- **Logs**: `make logs`\n- **Stop**: `make down`\n- **Seed URLs**: Edit `seeds.txt` (one URL per line), then `make seed`\n\n### Web UIs\n\n- **MinIO Console**: [http://localhost:9001](http://localhost:9001) (`nimbus` / `nimbus_secret`)\n\n## Configuration\n\nConfig loads from `configs/development.yaml` with environment variable overrides (env vars take priority). See [`.env.example`](.env.example) for the full variable list.\n\n| Variable          | Default | Description                    |\n| ----------------- | ------- | ------------------------------ |\n| `MAX_DEPTH`       | 3       | Maximum link-follow depth      |\n| `CRAWLER_WORKERS` | 10      | Goroutines per crawler replica |\n| `PARSER_WORKERS`  | 5       | Goroutines per parser replica  |\n\n## Make Targets\n\n| Command      | Description                       |\n| ------------ | --------------------------------- |\n| `make dev`   | Build and start all services      |\n| `make build` | Build Docker images               |\n| `make test`  | Run Go tests                      |\n| `make seed`  | Run the seeder independently      |\n| `make logs`  | Tail crawler and parser logs      |\n| `make down`  | Stop all services                 |\n| `make clean` | Stop all services and remove data |\n\n## Local Development\n\nStart backing services, then run Go services directly:\n\n```bash\ndocker-compose up -d postgres redis minio\ngo run ./cmd/migrate  # apply database schema migrations\ngo run ./cmd/seeder   # seed initial URLs from seeds.txt into Redis frontier stream\ngo run ./cmd/crawler  # fetch pages, store HTML in MinIO, publish parse jobs\ngo run ./cmd/parser   # extract text/links from HTML, deduplicate, publish new crawl jobs\n```\n\nUpdate `.env` to use `localhost` for `POSTGRES_HOST`, `REDIS_HOST`, and `MINIO_ENDPOINT`.\n\n## Contributing\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md) for development setup and contribution guidelines.\n\n## License\n\nMIT. See [LICENSE](LICENSE) for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftheognis1002%2Fnimbus-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftheognis1002%2Fnimbus-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftheognis1002%2Fnimbus-crawler/lists"}