{"id":47172625,"url":"https://github.com/agxp/finsight","last_synced_at":"2026-03-13T06:05:23.626Z","repository":{"id":344042611,"uuid":"1180198054","full_name":"agxp/finsight","owner":"agxp","description":"Financial research RAG agent over SEC EDGAR filings — multi-stage ingestion pipeline (download → parse → chunk → embed), pgvector semantic search, and a ReAct agent (Claude) that answers questions with citations.","archived":false,"fork":false,"pushed_at":"2026-03-13T00:48:59.000Z","size":71,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-13T02:30:37.005Z","etag":null,"topics":["agents","airflow","anthropic","claude","data-pipeline","duckdb","fastapi","financial-data","llm","minio","openai","parquet","pgvector","postgresql","python","rag","redis","sec-edgar","vector-search"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/agxp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-12T19:57:37.000Z","updated_at":"2026-03-13T00:49:02.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/agxp/finsight","commit_stats":null,"previous_names":["agxp/finsight"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/agxp/finsight","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agxp%2Ffinsight","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agxp%2Ffinsight/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agxp%2Ffinsight/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agxp%2Ffinsight/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/agxp","download_url":"https://codeload.github.com/agxp/finsight/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agxp%2Ffinsight/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30459817,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-13T03:55:51.346Z","status":"ssl_error","status_checked_at":"2026-03-13T03:55:33.055Z","response_time":60,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agents","airflow","anthropic","claude","data-pipeline","duckdb","fastapi","financial-data","llm","minio","openai","parquet","pgvector","postgresql","python","rag","redis","sec-edgar","vector-search"],"created_at":"2026-03-13T06:05:13.351Z","updated_at":"2026-03-13T06:05:23.621Z","avatar_url":"https://github.com/agxp.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# FinSight — Financial Research Data Pipeline + RAG Agent\n\nIngests SEC EDGAR filings (10-K, 10-Q) through a multi-stage pipeline and exposes a ReAct agent that answers financial research questions with citations.\n\n## What this demonstrates\n\n| Skill | Implementation |\n|-------|---------------|\n| **Batch pipelines + idempotency** | `ON CONFLICT (accession_number) DO NOTHING` — re-running backfills is always safe |\n| **Lakehouse / Parquet** | PyArrow + DuckDB, partitioned by `ticker/year/quarter/` on MinIO |\n| **Orchestration** | Airflow DAGs with `catchup=True`, SLA alerts, exponential backoff retries |\n| **Data quality gates** | Per-stage `QualityReport` → `QualityGateError` → Airflow task failure |\n| **Semantic retrieval** | pgvector cosine search over 1536-dim embeddings with metadata filters |\n| **ReAct agent** | Claude claude-sonnet-4-6 tool use loop, stateless, full audit log |\n| **Guardrails** | Deterministic regex (no second LLM) — injection detection + ticker hallucination check |\n| **Streaming** | FastAPI `StreamingResponse` + SSE with `event: tool_call` / `event: done` |\n\n## Architecture\n\n```\nSEC EDGAR ──► FilingDownloader ──► MinIO (raw HTML)\n                                      │\n                                   html_parser + chunker\n                                      │\n                                   Parquet (MinIO, DuckDB-queryable)\n                                      │\n                                   OpenAI embedder\n                                      │\n                                   pgvector (filing_chunks)\n                                      │\n                              FastAPI /v1/query\n                                      │\n                              ReAct agent (Claude)\n                                   ┌──┴──┐\n                              search_filings  get_financial_metrics  compare_periods\n```\n\n## Stack\n\n- **Language:** Python 3.11\n- **API:** FastAPI + uvicorn\n- **LLM (agent):** Anthropic Claude claude-sonnet-4-6 (tool use)\n- **Embeddings:** OpenAI text-embedding-3-small (1536 dim)\n- **Vector DB:** PostgreSQL 16 + pgvector\n- **Object storage:** MinIO (S3-compatible)\n- **Orchestration:** Apache Airflow 2.9\n- **Rate limiting:** Redis 7 sliding window\n- **Testing:** pytest + pytest-asyncio + respx\n\n## Quick start\n\n```bash\ngit clone \u003crepo\u003e \u0026\u0026 cd finsight\ncp .env.example .env        # set ANTHROPIC_API_KEY and OPENAI_API_KEY\ndocker-compose up\n```\n\nThat's it. On first boot `docker-compose` will:\n1. Start Postgres, MinIO, Redis\n2. Run DB migrations and create the MinIO bucket (`finsight-init`)\n3. Seed a dev tenant and print your API key\n4. Start the FastAPI app, Airflow webserver, and scheduler\n\n**Get your API key:**\n```bash\ndocker-compose logs finsight-init | grep \"fs_\"\n```\n\n**Open the UI:** http://localhost:8000 (redirects to `/ui`)\n\n**Ingest filings** — use the Filings tab in the UI, or via curl:\n```bash\ncurl -X POST http://localhost:8000/v1/filings/ingest \\\n  -H \"Authorization: Bearer fs_YOUR_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"ticker\": \"AAPL\", \"date_from\": \"2023-01-01\", \"date_to\": \"2023-12-31\"}'\n```\n\nThe ingest DAG runs automatically. Watch progress in the Filings tab — filings move through `ingested → transformed → embedded`. Once `embedded`, the agent can answer questions about them.\n\n**Query the agent:**\n```bash\ncurl -X POST http://localhost:8000/v1/query \\\n  -H \"Authorization: Bearer fs_YOUR_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"query\": \"What were the main risk factors Apple cited in their most recent 10-K?\", \"stream\": false}'\n```\n\n**Other services:**\n| Service | URL |\n|---------|-----|\n| API + UI | http://localhost:8000 |\n| Airflow | http://localhost:8081 (admin / admin) |\n| MinIO console | http://localhost:9001 (minio / minio123) |\n\n**Re-running is safe** — all init steps are idempotent. If you reset and want a fresh API key:\n```bash\ndocker-compose exec postgres psql -U finsight -d finsight -c \"DELETE FROM tenants WHERE name = 'dev-tenant';\"\ndocker-compose restart finsight-init\ndocker-compose logs finsight-init | grep \"fs_\"\n```\n\n**Run tests:**\n```bash\nmake test-unit\n```\n\n## API\n\n| Method | Path | Description |\n|--------|------|-------------|\n| `POST` | `/v1/query` | Agent query (supports `\"stream\": true` for SSE) |\n| `GET` | `/v1/filings` | List filings with filters |\n| `GET` | `/v1/filings/{id}` | Get single filing |\n| `POST` | `/v1/filings/ingest` | Trigger ingestion for a ticker + date range |\n| `GET` | `/health` | Health check |\n| `GET` | `/ready` | Readiness check (DB + Redis) |\n\n## Pipeline stages\n\n1. **Ingestion** — EDGAR API → raw HTML → MinIO. Quality: HTTP 200, ≥10KB\n2. **Transform** — HTML → sections (Items 1–9) → chunks (≤400 tokens, 50-token overlap) → Parquet\n3. **Embedding** — Parquet → OpenAI batched → pgvector upsert. Quality: count match, dim=1536, no zero vectors\n4. **Agent** — Semantic search → ReAct loop → cited answer\n\n## Key design decisions\n\n**pgvector over Pinecone:** native joins between vector results and relational metadata, one fewer managed service, sufficient at filing corpus scale.\n\n**DuckDB over Spark:** in-process Parquet queries with SQL. No cluster needed at this scale.\n\n**`ON CONFLICT DO NOTHING` as idempotency primitive:** multiple workers can process concurrently; re-running backfills is always safe.\n\n**Deterministic guardrails:** input/output validation via regex and content-matching rather than a second LLM call. Faster, more predictable, easier to audit — important in a financial context.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fagxp%2Ffinsight","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fagxp%2Ffinsight","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fagxp%2Ffinsight/lists"}