{"id":50410824,"url":"https://github.com/lukacerr/lovelytics","last_synced_at":"2026-05-31T03:30:28.634Z","repository":{"id":357152492,"uuid":"1235034676","full_name":"lukacerr/lovelytics","owner":"lukacerr","description":"Lovelytics technical task for AI engineer position","archived":false,"fork":false,"pushed_at":"2026-05-11T14:35:36.000Z","size":603,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-11T16:09:31.590Z","etag":null,"topics":["ai-agents","deepagents","langchain","ml","python","scikit-learn"],"latest_commit_sha":null,"homepage":"https://lovelytics.luka.software","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lukacerr.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":"MAINTAINERS.md","copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2026-05-11T00:21:41.000Z","updated_at":"2026-05-11T14:48:31.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/lukacerr/lovelytics","commit_stats":null,"previous_names":["lukacerr/lovelytics"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/lukacerr/lovelytics","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lukacerr%2Flovelytics","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lukacerr%2Flovelytics/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/
lukacerr%2Flovelytics/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lukacerr%2Flovelytics/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lukacerr","download_url":"https://codeload.github.com/lukacerr/lovelytics/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lukacerr%2Flovelytics/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33718446,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-31T02:00:06.040Z","response_time":95,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-agents","deepagents","langchain","ml","python","scikit-learn"],"created_at":"2026-05-31T03:30:26.976Z","updated_at":"2026-05-31T03:30:28.626Z","avatar_url":"https://github.com/lukacerr.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Financial AI Agent — Lovelytics x Luka Cerrutti\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/lc-favicon.ico\" height=\"64\" alt=\"Luka Cerrutti\" /\u003e\n  \u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\n  \u003cimg src=\"assets/lovelytics-logo.webp\" height=\"64\" alt=\"Lovelytics\" /\u003e\n\u003c/p\u003e\n\n\u003e A prototype AI assistant for financial fraud analysts. 
It understands natural-language questions, queries transaction data, applies fraud / purchase ML models, and retrieves cited knowledge from a financial document base — all behind a single streaming chat endpoint.\n\nThis README is the **technical write-up** for the agent. Setup and operational pointers are kept short; the focus is on the architecture, the trade-offs and *why* each piece is there.\n\n---\n\n## 1. TL;DR\n\n- **Main agent**: [DeepAgents](https://docs.langchain.com/oss/python/deepagents) (slimmed down) running `zai-org/glm-5` on Novita's OpenAI-compatible API.\n- **One specialised subagent** spawned via the `task` tool: a `kb_researcher` (ReAct over Pinecone). It runs on the cheaper `deepseek/deepseek-v4-flash`.\n- **One delegated tool**, `analyze_dataframe`, that wraps `create_pandas_dataframe_agent` (also `deepseek/deepseek-v4-flash`) and answers ad-hoc CSV questions in a single call — kept as a flat tool rather than a subagent because there's no need for the planner to round-trip on dataframe work.\n- **Two ML tools** wrapped around scikit-learn `HistGradientBoostingClassifier` (fraud) and `HistGradientBoostingRegressor` (purchase amount), with pydantic-validated inputs.\n- **Knowledge base**: 20 markdown docs split with `MarkdownHeaderTextSplitter` + `RecursiveCharacterTextSplitter`, embedded with Novita `baai/bge-m3` (1024-d), upserted to Pinecone serverless.\n- **Serving**: FastAPI with a single SSE `/chat` endpoint that streams typed events (`token | tool_start | tool_end | subagent_start | subagent_end | citation | final | error`) so the UI can render the agent's reasoning live.\n- **UI**: a small React/TanStack SPA in `web/` (built static for Cloudflare Pages).\n- **Observability**: LangSmith ready to plug in via env vars.\n\n---\n\n## 2. 
Architecture\n\n```mermaid\nflowchart LR\n  UI[\"Web UI\u003cbr/\u003eReact SPA\"] -- \"POST /chat (SSE)\" --\u003e API[\"FastAPI\"]\n  API --\u003e AG{{\"Main Agent\u003cbr/\u003eDeepAgents · GLM-5\"}}\n  AG --\u003e|tool| FM[\"predict_fraud\u003cbr/\u003esklearn HGB Classifier\"]\n  AG --\u003e|tool| PM[\"predict_purchase\u003cbr/\u003esklearn HGB Regressor\"]\n  AG --\u003e|tool| DA[\"analyze_dataframe\u003cbr/\u003epandas DF agent · v4-flash\"]\n  AG --\u003e|task → subagent| KR[\"kb_researcher\u003cbr/\u003eReAct · v4-flash\"]\n  DA --\u003e CSV[(\"fraud_dataset.csv\u003cbr/\u003eproduct_purchase_dataset.csv\")]\n  KR --\u003e|kb_search| PC[(\"Pinecone\u003cbr/\u003elovelytics-kb\")]\n  AG -. traces .-\u003e LS[\"LangSmith\"]\n\n  classDef store fill:#1f2937,color:#fff,stroke:#111;\n  class CSV,PC store;\n```\n\n**Why this shape?** The brief mixes three very different capabilities (structured data analytics, ML inference, document QA). A flat ReAct agent with all tools on the main loop is simpler but loses focus when questions require multi-step research. DeepAgents gives us a planner + the ability to delegate to specialised subagents, each with a tight toolbelt. The main agent never touches Pinecone or the dataframes directly — it *delegates*, which keeps its context lean and its decisions auditable.\n\n---\n\n## 3. 
Agent flow — a complex query\n\nExample: *\"Analyze customer CUST7823's transaction history and assess their fraud risk.\"*\n\n```mermaid\nsequenceDiagram\n  autonumber\n  participant U as User (Web UI)\n  participant API as FastAPI /chat\n  participant M as Main Agent (GLM-5)\n  participant DA as analyze_dataframe (v4-flash)\n  participant KR as kb_researcher (v4-flash)\n  participant ML as predict_fraud tool\n\n  U-\u003e\u003eAPI: prompt\n  API-\u003e\u003eM: stream\n  M-\u003e\u003eM: write_todos (plan)\n  M-\u003e\u003eDA: analyze_dataframe(\"pull CUST7823's transactions + aggregates\")\n  DA--\u003e\u003eM: rows + summary stats\n  M-\u003e\u003eML: predict_fraud(features for top suspicious tx)\n  ML--\u003e\u003eM: prob + top contributing features\n  M-\u003e\u003eKR: task(\"indicators of fraud relevant to these patterns\")\n  KR--\u003e\u003eM: cited snippets\n  M--\u003e\u003eAPI: stream tokens + citations\n  API--\u003e\u003eU: SSE (token / tool_* / subagent_* / citation / final)\n```\n\nThe order isn't hard-coded; the planner decides. The point is that a single user turn produces a chain of tool/subagent events, all surfaced over SSE so the UI can render the reasoning timeline.\n\n---\n\n## 4. Knowledge base — ingestion pipeline\n\n```mermaid\nflowchart LR\n  MD[\"financial_documents/*.md\u003cbr/\u003e(20 files)\"] --\u003e H[\"MarkdownHeaderTextSplitter\u003cbr/\u003e(preserve section path)\"]\n  H --\u003e R[\"RecursiveCharacterTextSplitter\u003cbr/\u003echunk_size=800, overlap=120\"]\n  R --\u003e E[\"Embeddings\u003cbr/\u003eNovita baai/bge-m3 · 1024-d\"]\n  E --\u003e P[(\"Pinecone\u003cbr/\u003eindex: lovelytics-kb\u003cbr/\u003enamespace: financial-docs\")]\n  P -. metadata .-\u003e META[\"source · header_path\u003cbr/\u003echunk_id · content_hash\"]\n```\n\nRe-running ingestion **rebuilds** the namespace (delete + re-upsert) so the index always matches the source files. 
Triggered either via `make ingest` or `POST /kb/ingest` (API-key gated outside development).\n\n---\n\n## 5. Components\n\n### 5.1 Main agent — DeepAgents, slimmed down\n\nWe use `create_deep_agent` with the planner and subagent middleware kept, and the filesystem tools dropped by registering a `HarnessProfile` that lists the FS tool names (`ls`, `read_file`, `write_file`, `edit_file`, `glob`, `grep`, `execute`) in `excluded_tools`. The DeepAgents `FilesystemMiddleware` itself is required scaffolding for the planner — only its *exposed tools* are stripped. The virtual filesystem isn't useful here: none of our tools read or write files at the agent layer.\n\nThe system prompt is short and scoped: *you are a fraud-analyst assistant; delegate KB lookups to `kb_researcher`, call `analyze_dataframe` for ad-hoc CSV questions, and use the ML tools for predictions; always cite KB sources by filename and section*.\n\nLLM parameters are tuned conservatively: `temperature=0.0`, `top_p=0.9`. We want deterministic, factual answers, not creative writing.\n\n### 5.2 `analyze_dataframe` tool\n\nA plain `@tool` (`async def analyze_dataframe(question: str) -\u003e str`) that wraps `langchain_experimental.agents.create_pandas_dataframe_agent` with `agent_type=\"tool-calling\"` over both dataframes loaded once at startup. Required `allow_dangerous_code=True` because the inner agent runs a Python REPL on the dataframes — acceptable in a single-tenant prototype on two static CSVs, called out again in §9.\n\nKept as a tool rather than a subagent: dataframe questions are typically one-shot (run a pandas op, summarise) and don't benefit from the planner round-tripping with a separate agent context. The `DATA_ANALYST_PROMPT` enforces a \"reduce before returning\" rule (`.value_counts()`, `.head(N≤10)`, `.describe()`, etc.) 
so the tool never dumps raw rows back into the main agent's context.\n\nSame `temperature=0.0` as the main agent.\n\n### 5.3 `kb_researcher` subagent\n\nOwns the Pinecone retriever via a private `kb_search(query: str, k: int = 5)` tool. The main agent **does not** have direct access to `kb_search`; it must spawn this subagent. This forces multi-hop research patterns where they matter and avoids the main planner doing brittle one-shot retrievals.\n\nReturns a synthesised answer plus a list of citations (`source`, `header_path`, snippet) that propagate back through the SSE stream as `citation` events.\n\n### 5.4 ML tools\n\n| Tool | Model | Input schema | Output schema |\n|---|---|---|---|\n| `predict_fraud` | `HistGradientBoostingClassifier` inside a `Pipeline` (OneHot for categoricals with `handle_unknown=\"ignore\"`, passthrough numerics) | `FraudFeatures` | `FraudPrediction` = `{probability, label, top_features}` |\n| `predict_purchase` | `HistGradientBoostingRegressor` (same pipeline shape) | `PurchaseFeatures` | `PurchasePrediction` = `{predicted_amount, top_features}` |\n\nAll four schemas live in `app/ml/schemas.py` and are picked up automatically by LangChain — bad calls fail fast with a structured error returned to the model, no separate \"schema introspection\" tool needed. Categorical fields with small, stable cardinalities use `Literal` for instant validation feedback; open-ended ones (`country`, `merchant_category`, `preferred_category`) accept any `str` because pinning ~50 country names would be brittle and `OneHotEncoder(handle_unknown=\"ignore\")` handles novel values gracefully at inference.\n\n`top_features` is the top-5 features from `sklearn.inspection.permutation_importance`, precomputed at training time with `n_repeats=10`, scored on the headline metric of each model (ROC-AUC for fraud, R² for purchase), and shipped in `models/metrics.json`. 
Inference reads it from an `lru_cache` — cheap, deterministic, no SHAP runtime dependency.\n\n### 5.5 FastAPI surface\n\n| Method | Path | Purpose |\n|---|---|---|\n| `GET` | `/health` | `{status, models_loaded, kb_indexed}` |\n| `GET` | `/docs` | Auto-generated OpenAPI UI |\n| `POST` | `/chat` | Streams the agent run as SSE |\n| `POST` | `/kb/ingest` | Rebuilds the Pinecone namespace from `financial_documents/`. Requires `X-API-Key` header matching `settings.API_KEY` whenever `ENV != \"development\"` |\n\n`/chat` request:\n\n```json\n{ \"messages\": [{\"role\": \"user\", \"content\": \"...\"}, ...] }\n```\n\n`/chat` SSE event taxonomy:\n\n| event | data |\n|---|---|\n| `token` | `{ \"delta\": \"...\" }` |\n| `tool_start` | `{ \"name\": \"...\", \"args\": {...} }` |\n| `tool_end` | `{ \"name\": \"...\", \"result\": \"...\" }` |\n| `subagent_start` | `{ \"name\": \"data_analyst\" \\| \"kb_researcher\", \"task\": \"...\" }` |\n| `subagent_end` | `{ \"name\": \"...\", \"result\": \"...\" }` |\n| `citation` | `{ \"source\": \"...\", \"header_path\": \"...\", \"snippet\": \"...\" }` |\n| `final` | `{ \"content\": \"...\" }` |\n| `error` | `{ \"message\": \"...\", \"type\": \"...\" }` |\n\nPydantic models for requests/responses live inline in their route module — no `schemas.py` for so little surface area.\n\nThe FastAPI app object lives at `app/main.py`, keeping all application code inside the `app/` package. `make api` runs `fastapi dev app/main.py`; the Dockerfile passes the same path to `fastapi run`.\n\nCORS is wired in `app/main.py`. In `ENV=development` all origins are allowed; in production the allow-list comes from `CORS_ALLOW_ORIGINS` (comma-separated env var).\n\n### 5.6 Configuration\n\nA single `pydantic-settings` `Settings` singleton in `app/config.py`. 
**Operator-facing knobs** (anything that legitimately changes per environment) are required from the OS env (or a `.env` file in development); everything else — model IDs, index names, paths — lives as in-code defaults on the `Settings` class so swapping them is a code change, not a configuration change.\n\nEnvironment-only variables (the full list — see `.env.example`):\n\n```\nENV=development                 # defaults to \"production\"; toggles X-API-Key enforcement on /kb/ingest\nAPI_KEY=...                     # required when ENV != \"development\"; gates /kb/ingest\nCORS_ALLOW_ORIGINS=https://app.example.com,https://example.com   # comma-separated; ignored in dev (all origins allowed)\nNOVITA_API_KEY=...\nPINECONE_API_KEY=...\nLANGSMITH_TRACING=true\nLANGSMITH_API_KEY=...\nLANGSMITH_PROJECT=lovelytics-task\n```\n\nIn-code defaults (change them in `app/config.py`, not `.env`):\n\n```python\nNOVITA_BASE_URL  = \"https://api.novita.ai/openai/v1\"\nEMBEDDING_MODEL  = \"baai/bge-m3\"          # 1024-d\nPINECONE_INDEX   = \"lovelytics-kb\"\nPINECONE_NAMESPACE = \"financial-docs\"\nKB_DIR           = Path(\"financial_documents\")\nKB_TOP_K         = 10\nDATASETS_DIR     = Path(\"datasets\")\nMODELS_DIR       = Path(\"models\")\n```\n\nWeb (Vite):\n\n```\nVITE_API_BASE_URL=http://localhost:8000\n```\n\n### 5.7 Async by default\n\nEvery code path that performs I/O — model calls, Pinecone queries, embedding requests, the pandas dataframe agent behind `analyze_dataframe`, FastAPI handlers — is written `async`. The convention is consistent end-to-end:\n\n- **Tools** are `async def` even when the underlying work is CPU-bound (e.g. sklearn inference in `predict_fraud` / `predict_purchase`). 
LangGraph's tool node refuses to call sync `StructuredTool` methods once *any* tool in the set is async; declaring everything async keeps the dispatch path uniform and avoids `NotImplementedError: StructuredTool does not support sync invocation` mismatches.\n- **External clients** use the async API where the library exposes both — `PineconeAsyncio` (not `Pinecone`) for index lifecycle and namespace deletes, `asimilarity_search` (not `similarity_search`) for retrieval, `aembed_documents` / `aembed_query` for embeddings, `ainvoke` / `astream_events` on every LangChain runnable.\n- **Pragmatic fallback.** When an async API is broken or fights the runtime more than it's worth, sync is acceptable as a last resort — but the choice is documented in a comment so the next reader knows it was deliberate, not lazy.\n- **Entry points.** Scripts under `scripts/` are sync `def main()` wrappers around `asyncio.run(...)`; FastAPI routes are `async def`. There are no nested event loops anywhere.\n\n---\n\n## 6. Data \u0026 models\n\n### 6.1 Datasets\n\n| File | Rows | Target | Use |\n|---|---|---|---|\n| `datasets/fraud_dataset.csv` | 100 | `fraud` (binary) | Fraud classifier |\n| `datasets/product_purchase_dataset.csv` | 100 | `purchase_amount` (continuous) | Purchase regressor |\n\n### 6.2 Training\n\nBoth models share the same recipe: `ColumnTransformer` (OneHotEncoder for categoricals with `handle_unknown=\"ignore\"`, passthrough for numerics — `HistGradientBoosting` handles missing values and scaling natively) feeds a `HistGradientBoosting{Classifier,Regressor}` inside a single `Pipeline`. Fixed `random_state=42` everywhere.\n\n**Why 5-fold cross-validation instead of a single 80/20 holdout.** With only 100 rows, a 20-row test set produces metrics with very high variance — a few unlucky points can swing ROC-AUC by 0.1. 
5-fold CV reports mean ± std across 5 different splits, costs nothing at this scale (training takes \u003c1s per fold), and is the honest thing to do given the sample size. Fraud uses `StratifiedKFold` to preserve the 50/50 class balance per fold; purchase uses plain `KFold` because the target is continuous. Both fold once with `shuffle=True`.\n\n**Why these metrics.**\n- **Fraud:** `roc_auc` (headline ranking metric), `average_precision` / PR-AUC (more honest than ROC-AUC under class imbalance — kept because real fraud is always imbalanced even though this toy set is 50/50), `f1` and `accuracy` at the 0.5 threshold (interpretable point estimates).\n- **Purchase:** `mae` and `rmse` (in dollars, directly interpretable), `r2` (variance explained).\n\n**Why permutation importance with `n_repeats=10`.** Single-shuffle importances are noisy; 10 repeats gives a stable ranking with negligible runtime at this scale. Importance is scored on the same metric as the headline (`roc_auc` for fraud, `r2` for purchase) and computed against the *raw* pipeline input — so each importance value maps cleanly to one agent-facing feature, with no need to aggregate across one-hot-expanded columns. The result is precomputed at training time and stored in `metrics.json`, so inference reads it from cache instead of re-running `permutation_importance` on every request (per AGENTS.md §9).\n\n**Refit on the full data after CV** for the deployed model — CV is for evaluation only.\n\n| Model | Headline metric | Other reported |\n|---|---|---|\n| Fraud classifier | ROC-AUC | PR-AUC, F1, accuracy |\n| Purchase regressor | R² | MAE, RMSE |\n\nThe actual numbers live in `models/metrics.json` (regenerated by `make train`). With 100 rows they're illustrative, not production-grade — see §9.\n\n### 6.3 Artifacts\n\n`models/fraud.joblib`, `models/purchase.joblib` and `models/metrics.json` are **committed**. They're tiny (tens of KB) and let the API boot without a training step. 
A `pre-commit` hook re-trains them whenever the CSVs or training code change, so they never go stale. Committing artifacts to git is a prototype shortcut, not a production pattern — see §9.\n\n---\n\n## 7. Running locally\n\nEverything is wrapped in the `Makefile`. Common entry points:\n\n```bash\nmake install         # uv sync + bun install (web)\nmake train           # train both ML models, write models/*.joblib + metrics.json\nmake ingest          # rebuild the Pinecone KB namespace\nmake chat            # CLI smoke test: uv run python -m scripts.chat [--stream] \"...\"\nmake api             # uv run fastapi dev (http://localhost:8000)\nmake web             # bun run dev   (http://localhost:3000)\nmake web-build       # static SPA build → web/dist (Cloudflare Pages ready)\nmake test            # pytest\nmake check           # ruff + basedpyright\nmake pre-commit      # install + autoupdate pre-commit hooks\n```\n\nFirst run:\n\n```bash\ncp .env.example .env   # fill in NOVITA_API_KEY, PINECONE_API_KEY\nmake install\nmake train\nmake ingest\nmake api\n```\n\nThe Dockerfile mirrors this: `uv sync` → copy code → run training during the build step → `fastapi run` at port 8080.\n\n---\n\n## 8. Observability\n\nLangSmith is wired through the env vars in `.env.example`. Set `LANGSMITH_TRACING=true` and a project name, and every `/chat` run shows up as a trace including subagent spans and tool calls. No extra code required.\n\n---\n\n## 9. Trade-offs \u0026 limitations\n\n- **Tiny dataset (100 rows).** Reported metrics are illustrative. The 5-fold CV in §6.2 reduces split variance, but real evaluation needs a larger dataset with a held-out test set on top of cross-validation.\n- **The pandas REPL behind `analyze_dataframe` uses `allow_dangerous_code=True`.** Acceptable for a single-tenant prototype on two static CSVs. Not acceptable in production.\n- **Joblib artifacts committed to git.** Convenient for a tech test. 
In production they belong in an artifact registry (MLflow / Unity Catalog Model Registry / S3) with versioning and lineage.\n- **No auth on `/chat`.** Only `/kb/ingest` is gated. A real deployment needs auth, rate limiting and per-user quotas.\n- **Stateless `/chat`.** No DB, no cache, conversation history is client-managed. Matches the brief but won't scale to multi-user persistent sessions.\n- **No guardrails on inputs/outputs.** A real assistant would need PII redaction, prompt-injection defence and output safety.\n- **Pinecone serverless free tier** has cold-start latency on the first query after idle. Fine for a demo.\n- **Models hard-coded to Novita.** Provider abstraction would help; today it's one env-var swap.\n\n---\n\n## 10. What I'd improve with more time\n\n- **Better boosters**: try XGBoost / LightGBM with proper hyperparameter search; HistGradientBoosting is a solid baseline but rarely state-of-the-art on tabular fraud data.\n- **DuckDB-over-CSV tool** instead of the pandas REPL behind `analyze_dataframe`: deterministic SQL, no arbitrary code execution, much smaller blast radius.\n- **Real eval set + LangSmith evaluators** with golden Q/A pairs for each query category (data analysis, prediction, knowledge, complex).\n- **MLflow / Unity Catalog Model Registry** for model artifacts, replacing the committed `.joblib` files.\n- **Databricks-native variant**: Delta tables for the transaction data, Vector Search instead of Pinecone, Model Serving for the ML tools, Mosaic AI Agent Framework as the orchestration layer.\n- **Auth + rate limiting** (e.g. 
Cloudflare Access in front of the API), persistent threads with a small Postgres store.\n- **Guardrails**: PII filter on the SSE pipe, prompt-injection defence on the KB researcher, output safety classifier.\n- **Streaming the planner's `write_todos` output** to the UI as a checklist, so the user sees the agent's plan evolve.\n- **Caching**: embed cache for the KB ingestion path, response cache for repeated KB queries.\n\n---\n\n## 11. Repo layout\n\nItems marked *(planned)* are expected by the architecture but not yet implemented in the current commit.\n\n```\napp/\n  __init__.py\n  config.py              # pydantic-settings singleton + .env loader\n  llm.py                 # main_chat_model() / subagent_chat_model() factories\n  main.py                # FastAPI app + CORS + router includes\n  sse.py                 # astream_events → typed SSE frame mapper\n  api/\n    __init__.py\n    deps.py              # require_api_key dependency\n    health.py            # GET /health\n    chat.py              # POST /chat (SSE)\n    kb.py                # POST /kb/ingest (API-key gated outside dev)\n  agent/\n    __init__.py\n    builder.py           # build_agent() — registers FS-exclusion HarnessProfile\n    prompts.py           # MAIN_PROMPT, KB_RESEARCHER_PROMPT, DATA_ANALYST_PROMPT\n    subagents.py         # kb_researcher_subagent() — SubAgent TypedDict factory\n    tools/\n      __init__.py\n      ml.py              # predict_fraud, predict_purchase (async @tool)\n      kb.py              # kb_search (used ONLY inside kb_researcher subagent)\n      dataframe.py       # analyze_dataframe — pandas DF agent wrapper\n  retrieval/\n    __init__.py\n    splitter.py          # Markdown → header-aware chunks\n    embeddings.py        # Novita bge-m3 via OpenAIEmbeddings (chunk_size=64)\n    vectorstore.py       # async ensure_index + per-call PineconeVectorStore\n    ingest.py            # rebuild_kb() — wipe + re-upsert\n    retrieve.py          # search() — basis for 
kb_search tool\n  ml/\n    __init__.py\n    schemas.py           # FraudFeatures / PurchaseFeatures + prediction models\n    preprocess.py        # build_preprocessor() — ColumnTransformer factory\n    train_fraud.py       # CV + permutation importance, returns (Pipeline, metrics)\n    train_purchase.py\n    inference.py         # cached predict_fraud / predict_purchase\nscripts/\n  __init__.py\n  ingest_kb.py\n  train_all.py\n  chat.py                # CLI smoke test: --stream / --verbose\ntests/                   # flat, pytest\n  test_splitter.py\n  test_ml.py\n  test_agent.py\nmodels/                  # fraud.joblib, purchase.joblib, metrics.json (committed)\ndatasets/                # fraud + purchase CSVs\nfinancial_documents/     # 20 markdown docs\nweb/                     # React/TanStack SPA (Vite, SPA-only)\n  src/\n    main.tsx             # RouterProvider entry\n    styles.css           # dark palette, #c47fd5 accent, .prose-chat, animations\n    routeTree.gen.ts     # generated by @tanstack/router-plugin\n    lib/\n      api.ts             # streamChat() SSE consumer + typed AgentEvent union\n      utils.ts           # cn() helper\n    routes/\n      __root.tsx         # minimal \u003cOutlet/\u003e\n      index.tsx          # chat UI: header, timeline, citations, composer\n  index.html\n  vite.config.ts\n  .env.example           # VITE_API_BASE_URL\nassets/                  # logos\nMakefile\nDockerfile\npyproject.toml\n.pre-commit-config.yaml\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flukacerr%2Flovelytics","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flukacerr%2Flovelytics","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flukacerr%2Flovelytics/lists"}