https://github.com/agxp/finsight
Financial research RAG agent over SEC EDGAR filings — multi-stage ingestion pipeline (download → parse → chunk → embed), pgvector semantic search, and a ReAct agent (Claude) that answers questions with citations.
https://github.com/agxp/finsight
agents airflow anthropic claude data-pipeline duckdb fastapi financial-data llm minio openai parquet pgvector postgresql python rag redis sec-edgar vector-search
Last synced: 3 months ago
JSON representation
Financial research RAG agent over SEC EDGAR filings — multi-stage ingestion pipeline (download → parse → chunk → embed), pgvector semantic search, and a ReAct agent (Claude) that answers questions with citations.
- Host: GitHub
- URL: https://github.com/agxp/finsight
- Owner: agxp
- Created: 2026-03-12T19:57:37.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-03-13T00:48:59.000Z (3 months ago)
- Last Synced: 2026-03-13T02:30:37.005Z (3 months ago)
- Topics: agents, airflow, anthropic, claude, data-pipeline, duckdb, fastapi, financial-data, llm, minio, openai, parquet, pgvector, postgresql, python, rag, redis, sec-edgar, vector-search
- Language: Python
- Homepage:
- Size: 69.3 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# FinSight — Financial Research Data Pipeline + RAG Agent
Ingests SEC EDGAR filings (10-K, 10-Q) through a multi-stage pipeline and exposes a ReAct agent that answers financial research questions with citations.
## What this demonstrates
| Skill | Implementation |
|-------|---------------|
| **Batch pipelines + idempotency** | `ON CONFLICT (accession_number) DO NOTHING` — re-running backfills is always safe |
| **Lakehouse / Parquet** | PyArrow + DuckDB, partitioned by `ticker/year/quarter/` on MinIO |
| **Orchestration** | Airflow DAGs with `catchup=True`, SLA alerts, exponential backoff retries |
| **Data quality gates** | Per-stage `QualityReport` → `QualityGateError` → Airflow task failure |
| **Semantic retrieval** | pgvector cosine search over 1536-dim embeddings with metadata filters |
| **ReAct agent** | Claude claude-sonnet-4-6 tool use loop, stateless, full audit log |
| **Guardrails** | Deterministic regex (no second LLM) — injection detection + ticker hallucination check |
| **Streaming** | FastAPI `StreamingResponse` + SSE with `event: tool_call` / `event: done` |
## Architecture
```
SEC EDGAR ──► FilingDownloader ──► MinIO (raw HTML)
│
html_parser + chunker
│
Parquet (MinIO, DuckDB-queryable)
│
OpenAI embedder
│
pgvector (filing_chunks)
│
FastAPI /v1/query
│
ReAct agent (Claude)
┌──┴──┐
search_filings get_financial_metrics compare_periods
```
## Stack
- **Language:** Python 3.11
- **API:** FastAPI + uvicorn
- **LLM (agent):** Anthropic Claude claude-sonnet-4-6 (tool use)
- **Embeddings:** OpenAI text-embedding-3-small (1536 dim)
- **Vector DB:** PostgreSQL 16 + pgvector
- **Object storage:** MinIO (S3-compatible)
- **Orchestration:** Apache Airflow 2.9
- **Rate limiting:** Redis 7 sliding window
- **Testing:** pytest + pytest-asyncio + respx
## Quick start
```bash
git clone && cd finsight
cp .env.example .env # set ANTHROPIC_API_KEY and OPENAI_API_KEY
docker-compose up
```
That's it. On first boot `docker-compose` will:
1. Start Postgres, MinIO, Redis
2. Run DB migrations and create the MinIO bucket (`finsight-init`)
3. Seed a dev tenant and print your API key
4. Start the FastAPI app, Airflow webserver, and scheduler
**Get your API key:**
```bash
docker-compose logs finsight-init | grep "fs_"
```
**Open the UI:** http://localhost:8000 (redirects to `/ui`)
**Ingest filings** — use the Filings tab in the UI, or via curl:
```bash
curl -X POST http://localhost:8000/v1/filings/ingest \
-H "Authorization: Bearer fs_YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"ticker": "AAPL", "date_from": "2023-01-01", "date_to": "2023-12-31"}'
```
The ingest DAG runs automatically. Watch progress in the Filings tab — filings move through `ingested → transformed → embedded`. Once `embedded`, the agent can answer questions about them.
**Query the agent:**
```bash
curl -X POST http://localhost:8000/v1/query \
-H "Authorization: Bearer fs_YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"query": "What were the main risk factors Apple cited in their most recent 10-K?", "stream": false}'
```
**Other services:**
| Service | URL |
|---------|-----|
| API + UI | http://localhost:8000 |
| Airflow | http://localhost:8081 (admin / admin) |
| MinIO console | http://localhost:9001 (minio / minio123) |
**Re-running is safe** — all init steps are idempotent. If you reset and want a fresh API key:
```bash
docker-compose exec postgres psql -U finsight -d finsight -c "DELETE FROM tenants WHERE name = 'dev-tenant';"
docker-compose restart finsight-init
docker-compose logs finsight-init | grep "fs_"
```
**Run tests:**
```bash
make test-unit
```
## API
| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/v1/query` | Agent query (supports `"stream": true` for SSE) |
| `GET` | `/v1/filings` | List filings with filters |
| `GET` | `/v1/filings/{id}` | Get single filing |
| `POST` | `/v1/filings/ingest` | Trigger ingestion for a ticker + date range |
| `GET` | `/health` | Health check |
| `GET` | `/ready` | Readiness check (DB + Redis) |
## Pipeline stages
1. **Ingestion** — EDGAR API → raw HTML → MinIO. Quality: HTTP 200, ≥10KB
2. **Transform** — HTML → sections (Items 1–9) → chunks (≤400 tokens, 50-token overlap) → Parquet
3. **Embedding** — Parquet → OpenAI batched → pgvector upsert. Quality: count match, dim=1536, no zero vectors
4. **Agent** — Semantic search → ReAct loop → cited answer
## Key design decisions
**pgvector over Pinecone:** native joins between vector results and relational metadata, one fewer managed service, sufficient at filing corpus scale.
**DuckDB over Spark:** in-process Parquet queries with SQL. No cluster needed at this scale.
**`ON CONFLICT DO NOTHING` as idempotency primitive:** multiple workers can process concurrently; re-running backfills is always safe.
**Deterministic guardrails:** input/output validation via regex and content-matching rather than a second LLM call. Faster, more predictable, easier to audit — important in a financial context.