https://github.com/debu-sinha/coco-reference

CoCo - reference implementation of a natural-language cohort copilot on Databricks. DSPy ReAct agent + Mosaic AI Agent Framework + Lakebase + MLflow Prompt Registry + Unity Catalog + Databricks Apps.
agent databricks dspy healthcare lakebase mlflow mosaic-ai rag reference-implementation unity-catalog
Last synced: 2 months ago
Host: GitHub
URL: https://github.com/debu-sinha/coco-reference
Owner: debu-sinha
License: other
Created: 2026-04-20T02:00:21.000Z (3 months ago)
Default Branch: main
Last Pushed: 2026-04-27T19:42:55.000Z (2 months ago)
Last Synced: 2026-04-27T20:23:18.596Z (2 months ago)
Topics: agent, databricks, dspy, healthcare, lakebase, mlflow, mosaic-ai, rag, reference-implementation, unity-catalog
Language: Python
Homepage: https://github.com/debu-sinha/coco-reference
Size: 509 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
- Security: docs/SECURITY.md
README

# CoCo v2 -- AI Cohort Copilot for Healthcare RWD

**Author:** [debu-sinha](https://github.com/debu-sinha) (debusinha2009@gmail.com)

CoCo is a natural-language cohort-building assistant for real-world healthcare data. Ask for a cohort ("Type 2 diabetes patients on metformin with recent labs") and CoCo identifies clinical codes, generates and validates SQL against a Databricks warehouse, executes the query, and synthesizes a response with sample rows and clinical context.

This repo is a **turnkey reference implementation**. Clone it, deploy the bundle, run the setup job, and you'll have a working Cohort Copilot in your own Databricks workspace in about 30 minutes. Multiple users can deploy to the same workspace without collisions. Each user gets their own namespaced resources.

## Quick start

### Prerequisites

You'll need these before you start:

- A Databricks workspace with **Unity Catalog** and **Model Serving**

- An **existing serverless SQL warehouse** (the setup job doesn't create one). To find the warehouse ID: **SQL -> SQL Warehouses -> click your warehouse -> copy the 16-char hex from the URL** (e.g. `https:///sql/warehouses/abc123def456789` -> warehouse_id is `abc123def456789`). If you don't have one, ask your workspace admin to create a serverless warehouse (Small is enough for cohort queries).

- An **existing Unity Catalog catalog** where you have CREATE SCHEMA permission. The setup job can't create catalogs on Default Storage workspaces. Run the preflight script (below) to find one you can use.

- `databricks-claude-sonnet-4-6` (or equivalent) **FMAPI endpoint** available in your workspace

- **Databricks CLI** installed and a profile configured (`databricks auth login`)

- **Python 3.10+** locally

- **MLflow Managed Prompt Registry** enabled (Preview feature). The agent loads prompt instructions from the registry, and the optimizer writes tuned prompts back. Enable it under **Settings > Preview features > "Managed MLflow Prompt Registry"**.

> **Note on OBO / user authorization:** CoCo does **not** use on-behalf-of user tokens. Every data access is made by the app's service principal. Its permissions come from typed resource bindings (SQL warehouse `CAN_USE`, agent endpoint `CAN_QUERY`, Lakebase `CAN_CONNECT_AND_CREATE`). Earlier versions required the Apps OBO preview feature. The current version does not. The app deploys cleanly on workspaces without that flag enabled.

- Permissions listed in [`docs/PERMISSIONS.md`](docs/PERMISSIONS.md)

Optional (the setup job creates these if you have permission):

- Lakebase instance (for chat session persistence)

- Vector Search endpoint (for clinical knowledge RAG)

### Step 0: Preflight check (MANDATORY - every deployer runs this)

**Who runs it:** every person who will deploy CoCo - you, your team, your platform admins. Run it against **your own** CLI profile so it probes **your** permissions. Admin-only preflights are not enough. The setup job runs as the deployer's identity, so the deployer is the only one whose permissions matter.

**When to run it:** after you have the CLI profile configured (`databricks auth login`) and before any `databricks bundle` command. It takes about 60 seconds.

```bash
python scripts/preflight_check.py \
  -p PROFILE \
  --warehouse-id WH_ID \
  --catalog CATALOG \
  --unique-id YOUR_ID
```

The script actively exercises the permissions (not just lists APIs). It probes:

- authentication + warehouse access + Claude endpoint

- **`CAN_QUERY` on the Claude endpoint** (1-token smoke test)

- **`CREATE SCHEMA` on your catalog** (creates + drops a probe schema)

- **MLflow Managed Prompt Registry preview flag** (registers + deletes a probe prompt)

- **`config/default.yaml` `llm.endpoint` matches what is actually deployed** (catches the 4-5 vs 4-6 mismatch)

- Lakebase, Vector Search, Databricks Apps API reachability

- Existing coco resources that would collide with your unique_id

**What to do based on the result:**

| Summary line | What to do |
|---|---|
| `Results: N passed, 0 failed, 0 warnings` | proceed to Step 1 below |
| `... 0 failed, 1+ warnings` | read each WARN line. Most are handled gracefully by setup, but verify before running a 25-minute job |
| `... 1+ failed` | **stop.** Each FAIL line has a specific fix hint. If it is a permission issue, send `docs/PERMISSIONS.md` to your admin and re-run preflight after the grant |
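
If you wrap the preflight in your own automation, the decision table above can be expressed as a small parser. This is a minimal sketch, not part of the repo: it assumes the summary line keeps the `N passed, F failed, W warnings` shape shown in the table, and `next_action` is a hypothetical helper name.

```python
import re

# Assumed summary format, taken from the table above:
# "Results: N passed, F failed, W warnings"
SUMMARY_RE = re.compile(r"(\d+) passed, (\d+) failed, (\d+) warnings?")


def next_action(summary_line: str) -> str:
    """Map a preflight summary line to 'proceed', 'review', or 'stop'."""
    match = SUMMARY_RE.search(summary_line)
    if match is None:
        raise ValueError(f"unrecognized summary line: {summary_line!r}")
    _passed, failed, warnings = (int(group) for group in match.groups())
    if failed > 0:
        return "stop"      # fix every FAIL line, then re-run preflight
    if warnings > 0:
        return "review"    # read each WARN line before starting setup
    return "proceed"       # safe to continue with the deploy


print(next_action("Results: 12 passed, 0 failed, 0 warnings"))
```

Failures take priority over warnings, so a run with both still maps to "stop".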

If **no catalog** passes the CREATE SCHEMA probe:

- **Ask your workspace admin** to grant: `GRANT USE CATALOG, CREATE SCHEMA ON CATALOG  TO \`\``

- **Or** create a catalog you own: UI -> Catalog -> + -> Create a new catalog (not available on all workspaces - Default Storage workspaces block this for non-admins)

### Deploy in 3 commands

```bash
# 1. Clone
git clone https://github.com/debu-sinha/coco-reference.git
cd coco-reference

# 2. Deploy the bundle
#    YOUR_ID   = your initials or short name (namespaces all resources)
#    WH_ID     = your serverless SQL warehouse id (hex string)
#    CATALOG   = Unity Catalog name you have CREATE SCHEMA access on
#    PROFILE   = your Databricks CLI profile name
databricks bundle deploy -t demo -p PROFILE \
  --var unique_id=YOUR_ID \
  --var warehouse_id=WH_ID \
  --var catalog=CATALOG

# 3. Run the setup job (provisions everything: ~25-40 min)
databricks bundle run setup_workspace -t demo -p PROFILE \
  --var unique_id=YOUR_ID \
  --var warehouse_id=WH_ID \
  --var catalog=CATALOG
```

When the job finishes, it'll print the app URL. Open it in a browser, create a thread, and ask a cohort question.

### What the setup job provisions

All resources are namespaced by `unique_id` so multiple users can deploy to the same workspace:

| Resource | Name pattern | What it does |
|----------|-------------|-------------|
| UC Schema | `cohort_builder_` | Tables, volumes, registered model [(setup Step 2)](notebooks/00_setup_workspace.py) |
| Delta tables | `patients`, `diagnoses`, `prescriptions`, `procedures`, `claims`, `suppliers` | 10k synthetic patients via [`src/coco/data_generator/`](src/coco/data_generator) with clinically realistic correlations |
| Vector Search endpoint | `coco-vs-` | Per-user. [Delta Sync index docs](https://docs.databricks.com/aws/en/generative-ai/vector-search.html) |
| Vector Search index | `coco_knowledge_idx` | Hybrid BM25 + BGE embeddings on `knowledge_chunks` [(setup Step 5)](notebooks/00_setup_workspace.py) |
| Lakebase instance | `coco-lb-` | Managed Postgres for session state [(Lakebase docs)](https://docs.databricks.com/aws/en/generative-ai/databricks-apps/lakebase.html) |
| Lakebase database | `coco` | Schema `coco_sessions` with `threads`, `messages`, `runs`, `feedback` tables [(schema)](src/coco/app/sessions/schema.py) |
| MLflow Prompt Registry | `.cohort_builder_.{cohort_query,clinical_codes,sql_generator,response_synthesizer}` | 3-part UC names with `@production` alias [(loader)](src/coco/agent/prompts/__init__.py) |
| Model Serving endpoint | `coco-agent-` | `dspy.ReAct` agent via [Mosaic AI Agent Framework](https://docs.databricks.com/aws/en/generative-ai/agent-framework/) [(deploy)](src/coco/agent/deploy.py) |
| UC registered model | `.cohort_builder_.coco_agent_` | Versioned agent model [(deploy.py)](src/coco/agent/deploy.py) |
| Databricks App | `coco-` | FastAPI + HTMX chat UI, SP-only auth via `X-Forwarded-Email` [(auth.py)](src/coco/app/auth.py) |
| MLflow experiment | `/Users//coco-agent` | Per-user. Traces, runs, model artifacts. **No fallback** - setup errors loudly if `COCO_MLFLOW_EXPERIMENT` is unset. This is [intentional](src/coco/agent/prompts/__init__.py) so traces can never silently land in a shared experiment. |

### Minimal mode

Pass `--var minimal=true` to `bundle run setup_workspace` and the Vector Search index creation step is skipped. The VS endpoint is still provisioned (cheap, idempotent), and Lakebase still runs. The agent deploys either way - its `retrieve_knowledge` tool just returns empty results when the index isn't populated.

```bash
databricks bundle run setup_workspace -t demo -p PROFILE \
  --var unique_id=YOUR_ID --var warehouse_id=WH_ID --var catalog=CATALOG \
  --var minimal=true
```

Use this for quick learning deploys, workspaces where VS index creation is slow or restricted, or cost-sensitive test runs. Full Lakebase-skip (for workspaces without Lakebase) is planned but not tested yet - see [`CHANGELOG.md`](CHANGELOG.md).

### Multi-user isolation

Two users deploying to the same workspace with different `unique_id` values get completely separate resources:

```bash
# User A (initials: al)
databricks bundle deploy -t demo -p PROFILE --var unique_id=al --var warehouse_id=WH_ID --var catalog=my_catalog

# User B (initials: mj)
databricks bundle deploy -t demo -p PROFILE --var unique_id=mj --var warehouse_id=WH_ID --var catalog=my_catalog
```

User A gets `coco-agent-al`, `coco-al` app, `cohort_builder_al` schema. User B gets `coco-agent-mj`, `coco-mj` app, `cohort_builder_mj` schema. Zero collisions. This works through the entire pipeline including agent deployment, model registration, and app creation.
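
To make the namespacing concrete, here is a minimal sketch of how per-user names could be derived from `unique_id`. It is illustrative only: `resource_names` is a hypothetical helper, not a function in the repo. The serving endpoint, app, and schema patterns come from the examples above. Treating the Vector Search and Lakebase suffix as the same `unique_id` is an assumption.

```python
def resource_names(unique_id: str) -> dict[str, str]:
    """Derive per-user resource names from a unique_id (hypothetical helper)."""
    return {
        "serving_endpoint": f"coco-agent-{unique_id}",
        "app": f"coco-{unique_id}",
        "schema": f"cohort_builder_{unique_id}",
        # Assumption: these two use the same suffix as the names above.
        "vector_search_endpoint": f"coco-vs-{unique_id}",
        "lakebase_instance": f"coco-lb-{unique_id}",
    }


user_a = resource_names("al")
user_b = resource_names("mj")

# Different unique_id values never produce the same resource name.
overlap = set(user_a.values()) & set(user_b.values())
print(user_a["serving_endpoint"], user_b["serving_endpoint"], overlap)
```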

### Teardown

To tear down everything your user created (app, serving endpoint, Lakebase instance, UC schema with tables and volumes, prompts, registered model, per-user MLflow experiment):

```bash
databricks bundle run teardown_workspace -t demo -p PROFILE \
  --var unique_id=YOUR_ID \
  --var catalog=CATALOG
```

Shared resources stay intact by default. The teardown never drops the UC catalog or the shared Vector Search endpoint unless you explicitly pass `--var delete_catalog=YES` or `--var delete_vs_endpoint=YES`. Two users running teardown at the same time touch disjoint namespaced resources and cannot interfere with each other. The script is idempotent, so a partial teardown is safe to re-run.

**Cleaning up multiple deploys at once.** If you deployed with several `unique_id` values (for example, `dev`, then later `ds`), one teardown invocation only handles the namespace matching the current vars. Flip the notebook widget `scan_all_my_deploys` to `YES` to scan the workspace for every CoCo app, serving endpoint, Lakebase instance, and `cohort_builder_*` schema whose creator/owner is the current user. Matches are merged with the widget-driven list and torn down together. Run it from the notebook UI (the CLI `bundle run` command uses the bundle's `base_parameters`, which keep `scan_all_my_deploys=NO`).

### Troubleshooting

| Problem | Cause | Fix |
|---------|-------|-----|
| `Only serverless compute is supported` | Workspace is serverless-only | Already handled (jobs use serverless environments) |
| `Metastore storage root URL doesn't exist` | Default Storage workspace | Pass `--var catalog=` with a catalog you already have access to |
| `Cannot create Lakebase instance` | No Lakebase permission | Ask admin to create the instance, then pass `--var lakebase_instance=`. CU_1 is plenty. |
| `token passthrough feature is not enabled` | Apps OBO not turned on (only earlier versions required it; the current version does not use OBO) | Workspace admin: Settings > Preview features > enable "User authorization for Databricks Apps" |
| `typing_extensions` / `deprecated` import error | Serverless sys.path issue | Already handled in the notebook. If you see it, redeploy the bundle to pick up the fix. |
| `cannot import name 'agents' from 'databricks'` | Serverless namespace collision | Already handled. The post-restart cell extends `databricks.__path__`. |
| `AiGatewayConfig` import error | Old system databricks-sdk | Already handled. The sys.path fix prioritizes pip-installed packages. |
| `Bad model name: please specify all three levels` | Config env vars not set | Already handled. Notebook sets `COCO_*` env vars from widget values before deploy. |
| Agent returns empty results | Config missing warehouse_id | The deploy resolves config at log_model time. Re-deploy the agent. |
| `error downloading Terraform: openpgp: key expired` on `databricks bundle deploy` | Older Databricks CLI ships with an expired GPG-signing key for its embedded Terraform download | Install Terraform locally (`curl -sSL https://releases.hashicorp.com/terraform/1.9.8/terraform_1.9.8_$(uname -s | tr A-Z a-z)_$(uname -m).zip -o /tmp/tf.zip && unzip /tmp/tf.zip -d ~/bin/ && chmod +x ~/bin/terraform`). Then `export DATABRICKS_TF_EXEC_PATH=~/bin/terraform DATABRICKS_TF_VERSION=1.9.8` before running bundle commands. |

For the full permissions checklist, see [`docs/PERMISSIONS.md`](docs/PERMISSIONS.md).

### Known issues

Things that work today but might bite you in specific conditions. Check [`CHANGELOG.md`](CHANGELOG.md) for the list of what's validated end-to-end.

- **Lakebase credential rotation.** Tokens are minted on demand with a ~1h TTL and the pool rotates at 55 minutes. If Databricks changes the TTL, the pool will hit auth errors on the next query. We don't poll or log credential health. See [`src/coco/app/sessions/lakebase.py`](src/coco/app/sessions/lakebase.py).

- **Prompt Registry flag can be disabled after deploy.** The preflight script catches it before setup, but nothing checks it at request time. If an admin flips the flag off post-deploy, `load_prompt` falls back to bundled DEFAULTS silently. See [`src/coco/agent/prompts/__init__.py`](src/coco/agent/prompts/__init__.py).

- **Agent endpoint cold starts.** Scale-to-zero is on by default. First request after idle can take 30-60 seconds while the container warms. The app doesn't queue or show a dedicated "waking up" state - the user sees the standard "agent is thinking" spinner.

- **LLM-as-judge scorers use `asyncio.run`.** `response_relevance_scorer` and `phi_leak_scorer` in [`src/coco/observability/scorers.py`](src/coco/observability/scorers.py) spin an event loop. They worked in our validation environment but treat them as lower-confidence than the code scorers.

- **Vector Search index takes a few minutes to go live** after `setup_workspace` creates it. The agent's `retrieve_knowledge` tool returns empty results until the index finishes syncing. Setup waits but does not gate on index-ready.
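
The credential-rotation issue above is easier to reason about with a sketch of the pattern. This is an illustration of rotate-before-expiry caching, not the code in `src/coco/app/sessions/lakebase.py`: `RotatingToken` and the `mint` callable are hypothetical names, and only the 55-minute rotation and ~1h TTL come from the text.

```python
import time
from typing import Callable, Optional

ROTATE_AFTER_SECONDS = 55 * 60  # rotate at 55 minutes, ahead of the ~1h TTL


class RotatingToken:
    """Caches a short-lived credential and re-mints it before it expires."""

    def __init__(self, mint: Callable[[], str],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._mint = mint        # hypothetical callable returning a fresh token
        self._clock = clock      # injectable so the rotation can be tested
        self._token: Optional[str] = None
        self._minted_at = 0.0

    def get(self) -> str:
        """Return the cached token, minting a new one once it is 55 minutes old."""
        now = self._clock()
        if self._token is None or now - self._minted_at >= ROTATE_AFTER_SECONDS:
            self._token = self._mint()
            self._minted_at = now
        return self._token
```

The weakness described in the bullet is visible here: the rotation interval is a constant, so if the real TTL dropped below 55 minutes the cached token would expire before the next rotation.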

## Architecture

![CoCo request flow](docs/design/diagrams/request-flow.svg)

The agent uses `dspy.ReAct` with native tool calling. Claude decides which tools to call (inspect_schema, identify_clinical_codes, generate_sql, execute_sql, retrieve_knowledge) based on tool definitions derived from Python function docstrings. No keyword-matched planner, no separate planning prompt. The model IS the planner. Every tool call is decorated with `@mlflow.trace` for full observability.
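
To show what "tool definitions derived from Python function docstrings" means in practice, here is a standard-library sketch. It is not DSPy's internal mechanism, and the `execute_sql` signature below is a hypothetical stand-in for the repo's tool. It only demonstrates that a function's name, type hints, and docstring carry enough information to describe a tool to a model.

```python
import inspect
from typing import Callable


def describe_tool(func: Callable) -> dict:
    """Build a minimal tool description from a function's signature and docstring."""
    signature = inspect.signature(func)
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": {
            name: getattr(param.annotation, "__name__", str(param.annotation))
            for name, param in signature.parameters.items()
        },
    }


def execute_sql(sql: str, max_rows: int = 100) -> list:
    """Run a validated read-only SQL statement and return up to max_rows rows."""
    return []  # placeholder body; the real tool lives in the repo


spec = describe_tool(execute_sql)
print(spec)
```

Because the description is the docstring, improving a tool's docstring directly changes what the model sees when it decides which tool to call.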

For the detailed architecture writeup with gotchas and code pointers, see [`docs/design/apps-mosaic-ai-agent-reference.md`](docs/design/apps-mosaic-ai-agent-reference.md).

## Prompt management and optimization

CoCo uses **MLflow Prompt Registry** for all DSPy signature instructions. The setup notebook registers 4 default prompts under 3-part UC names: `.cohort_builder_.{cohort_query,clinical_codes,sql_generator,response_synthesizer}`. The agent loads them at runtime via the `@production` alias, so you can update prompts without redeploying.

**The automatic optimization loop:**

```
User asks a cohort question
  -> Agent answers (prompt loaded from MLflow Prompt Registry at @production)
  -> User clicks thumbs up or thumbs down (stored in Lakebase)
  -> Weekly job (notebooks/03_optimize_dspy.py) queries thumbs-up pairs
  -> mlflow.genai.optimize_prompts with GepaPromptOptimizer runs evolutionary
     search over instruction variations against a Correctness scorer
  -> New prompt version registered, production alias flipped to it
  -> Agent picks up the new prompt on the next request (no redeploy)
```

To enable auto-optimization: collect 2+ thumbs-up interactions (the workshop-demo default) or 10+ for production use, then unpause the `CoCo DSPy Optimization` job in the Workflows UI. It runs weekly on Sunday at 2am UTC. Override `min_examples` via the job widget for production.

**Always gate promotions on evaluation.** GEPA can overfit on small feedback samples and regress on held-out scenarios. The recommended flow is: run the optimize job, run the eval job against the new version, compare metrics against the baseline, and only keep the `@production` alias on the new version if metrics improved. To roll back, call `mlflow.genai.set_prompt_alias(name=..., version=, alias="production")`. The alias flip takes effect on the next request, no redeploy.
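
The "compare metrics against the baseline" step can be written down as an explicit gate. The sketch below is one possible rule, not the repo's eval job: `should_promote` is a hypothetical helper, the metric names are made up, and it assumes higher values are better for every metric.

```python
def should_promote(baseline: dict[str, float], candidate: dict[str, float]) -> bool:
    """Promote only if no metric regresses and at least one improves.

    Assumes higher is better. A metric missing from the candidate run
    counts as a regression.
    """
    if not baseline:
        return False  # nothing to compare against, so do not promote
    for name, base_value in baseline.items():
        if candidate.get(name, float("-inf")) < base_value:
            return False
    return any(candidate[name] > baseline[name] for name in baseline)


# Hypothetical metric names and values, for illustration only.
baseline = {"correctness": 0.80, "sql_validity": 0.95}
improved = {"correctness": 0.85, "sql_validity": 0.95}
regressed = {"correctness": 0.90, "sql_validity": 0.90}

print(should_promote(baseline, improved), should_promote(baseline, regressed))
```

A stricter gate could require a minimum gain or a held-out scenario set. The point is that the alias stays on the new version only when the comparison passes.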

To manually edit a prompt: go to the MLflow Prompt Registry in the Databricks UI, find `.cohort_builder_.cohort_query`, and create a new version with the `production` alias.

## Running tests locally

```bash
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

pytest -m unit                  # fast, no external deps (~65 tests)
pytest -m integration           # mocked Databricks services

ruff check src tests
ruff format src tests
```

## Running the app locally (mocked backend)

```bash
export COCO_CONFIG_PATH=config/default.yaml
export DATABRICKS_HOST=https://example.cloud.databricks.com
export COCO_WAREHOUSE_ID=your_warehouse_id
export COCO_CATALOG_NAME=coco_demo
export COCO_SCHEMA_NAME=cohort_builder
export COCO_AGENT_ENDPOINT_NAME=coco-agent

cd src/coco/app && uvicorn main:app --reload
```

The app should come up at `http://localhost:8000`. SSE streaming won't work without a deployed agent endpoint (`COCO_AGENT_ENDPOINT_URL`).

## Configuration

`config/default.yaml` is the single config file. Environment variables are interpolated via `${VAR}` tokens. Key env vars:

| Env var | What it does | Set by |
|---------|-------------|--------|
| `DATABRICKS_HOST` | Workspace URL | Auto-injected in Databricks Apps |
| `COCO_WAREHOUSE_ID` | SQL warehouse for queries | App resource binding |
| `COCO_AGENT_ENDPOINT_URL` | Agent serving endpoint URL | App resource binding |
| `COCO_AGENT_ENDPOINT_NAME` | Agent serving endpoint name | Setup notebook widget |
| `COCO_CATALOG_NAME` | Unity Catalog name | App env var / notebook widget |
| `COCO_SCHEMA_NAME` | Schema name | App env var / notebook widget |
| `COCO_LAKEBASE_INSTANCE` | Lakebase instance name | Setup notebook widget |
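
For readers unfamiliar with `${VAR}` interpolation, the idea can be sketched in a few lines. This is not the loader in `src/coco/config.py`, which may handle defaults or missing values differently. `interpolate` is a hypothetical helper, the `warehouse_id:` key is only an example, and raising on an unset variable is an assumption made here so that misconfiguration fails loudly.

```python
import os
import re

# Matches ${VAR_NAME} tokens.
_TOKEN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def interpolate(text: str, env=None) -> str:
    """Replace every ${VAR} token in text with its value from env."""
    env = os.environ if env is None else env

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in env:
            # Assumption: an unset variable is an error, not an empty string.
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _TOKEN.sub(replace, text)


print(interpolate("warehouse_id: ${COCO_WAREHOUSE_ID}",
                  {"COCO_WAREHOUSE_ID": "your_warehouse_id"}))
```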

## Key docs

| Doc | What it covers |
|-----|---------------|
| [`docs/PERMISSIONS.md`](docs/PERMISSIONS.md) | Every permission needed for end-to-end deploy |
| [`docs/design/apps-mosaic-ai-agent-reference.md`](docs/design/apps-mosaic-ai-agent-reference.md) | Full architecture with gotchas (Lakebase, token rotation, code_paths, planner) |
| [`docs/examples/`](docs/examples/) | Working Python snippets for calling FMAPI from DSPy |
| [`docs/cost-attribution/`](docs/cost-attribution/) | Cost tracking queries, tagging policy, warehouse setup template |
| [`docs/WORKSHOP_PREREQS.md`](docs/WORKSHOP_PREREQS.md) | Workshop-specific prerequisites checklist |

## Knowledge base: demo vs production

The markdown files in `src/coco/knowledge/` are the **demo knowledge base**. During setup they are chunked, embedded, and indexed in Vector Search. The agent's `retrieve_knowledge` tool queries this index.

**This is a demo shortcut, not production practice.** In production:

| Concern | Demo approach | Production approach |
|---------|--------------|-------------------|
| **Schema knowledge** | Hand-written markdown per table | Auto-generate from UC column comments on a schedule |
| **Clinical rules** | Prose in `clinical_rules.md` | A `clinical_reference` table in UC that the clinical team maintains |
| **Unstructured docs** | Static files in git | DLT pipeline from source systems, Delta table with CDC, auto-syncing VS index |
| **Multi-domain** | One set of files for one schema | Each domain gets its own UC schema + VS index |

**The right long-term pattern:**

1. **Put knowledge in UC column comments**, not separate files:

   ```sql
   COMMENT ON COLUMN diagnoses.icd10_code IS 'ICD-10-CM code. E11.* = Type 2 diabetes.';
   ```

   The agent's `inspect_schema` tool already reads column metadata.

2. **Use a reference table for clinical rules**, not markdown:

   ```sql
   CREATE TABLE clinical_reference (
       condition STRING, icd10_pattern STRING, first_line_drugs ARRAY<STRING>,
       source STRING, last_reviewed DATE, reviewed_by STRING
   );
   ```

3. **Use a managed VS pipeline for unstructured docs:**

   ```
   Source docs (PDFs, wikis) -> DLT pipeline -> Delta table (CDC) -> auto-syncing VS index
   ```

## Extending CoCo

After the initial deploy, Claude Code is the recommended path for day-2 changes. Extension points:

- **Tools**: add a function to `src/coco/agent/responses_agent.py` and pass it to `dspy.ReAct(tools=[...])`

- **Prompts**: edit in MLflow Prompt Registry UI (no redeploy needed) or update defaults in `src/coco/agent/prompts/__init__.py`

- **DSPy signatures**: edit `src/coco/agent/signatures.py` for field changes, run `notebooks/03_optimize_dspy.py` for GEPA instruction tuning against thumbs-up feedback

- **Guardrails**: add schemas to `config.guardrails.allowed_schemas` in `config/default.yaml`

- **UI**: edit templates in `src/coco/app/templates/`, CSS in `src/coco/app/static/styles.css`

- **Cost tracking**: customize queries in `docs/cost-attribution/queries/`
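
For the guardrails extension point, it helps to see what a read-only check with a schema allowlist looks like. The sketch below is a deliberately conservative illustration, not the logic in `src/coco/agent/guardrails.py`. `validate_sql` and the keyword list are hypothetical. It requires schema-qualified table names, so it would also reject CTE references and some valid expressions such as `EXTRACT(... FROM ...)`.

```python
import re

# Keywords that indicate a write or DDL statement (illustrative, not exhaustive).
WRITE_KEYWORDS = {"insert", "update", "delete", "merge", "drop", "alter",
                  "create", "truncate", "grant", "revoke"}
# Captures the table reference that follows FROM or JOIN.
TABLE_REF = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def validate_sql(sql: str, allowed_schemas: set[str]) -> list[str]:
    """Return a list of violations; an empty list means the statement passed."""
    statement = sql.strip().rstrip(";").strip()
    problems = []
    if ";" in statement:
        problems.append("multiple statements are not allowed")
    if not re.match(r"(?:select|with)\b", statement, re.IGNORECASE):
        problems.append("statement must start with SELECT or WITH")
    words = set(re.findall(r"[a-z_]+", statement.lower()))
    for keyword in sorted(words & WRITE_KEYWORDS):
        problems.append(f"write keyword not allowed: {keyword}")
    for ref in TABLE_REF.findall(statement):
        parts = ref.split(".")
        # Accept schema.table or catalog.schema.table; the schema is second to last.
        if len(parts) < 2 or parts[-2] not in allowed_schemas:
            problems.append(f"table not in an allowed schema: {ref}")
    return problems


allowed = {"cohort_builder_al"}
print(validate_sql("SELECT * FROM cohort_builder_al.patients", allowed))
print(validate_sql("SELECT * FROM other.patients", allowed))
```

Adding a schema to `config.guardrails.allowed_schemas` corresponds to growing the `allowed` set here. A production validator would parse the SQL rather than pattern-match it.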

## Repo structure

```
coco-reference/
  app.yaml                     Databricks App manifest
  databricks.yml               DABs bundle definition (jobs, variables, targets)
  config/default.yaml          Runtime config (env var interpolated)
  requirements.txt             App container pip deps
  pyproject.toml               Package metadata + dev deps
  scripts/
    preflight_check.py         Pre-deploy workspace permission checker
  src/coco/
    agent/                     The dspy.ReAct agent + tools + deploy
      prompts/                 MLflow Prompt Registry integration + defaults
      signatures.py            DSPy signature definitions (typed I/O contracts)
      responses_agent.py       Main agent: ReAct loop, tool functions
      deploy.py                Model logging, UC registration, endpoint deploy
      guardrails.py            SQL validation (read-only, schema allowlist)
      tools/                   Tool implementations (SQL, VS, schema, codes)
    app/                       FastAPI + HTMX + SSE chat UI
      sessions/                Lakebase CRUD (threads, messages, feedback)
    config.py                  Config loader
    gateway/                   LLM gateway client
    sql/                       Statement Execution API client
    data_generator/            Synthetic patient data generator
    knowledge/                 Clinical knowledge markdown docs (demo)
  notebooks/
    00_setup_workspace.py      Full provisioning notebook (auto-namespaces per user)
    02_evaluate.py             Scenario-based evaluation against the live agent endpoint
    03_optimize_dspy.py        GEPA prompt optimization from Lakebase thumbs-up feedback
    99_teardown.py             Removes every resource the setup created, per-user namespace
  docs/
    PERMISSIONS.md             Required permissions
    design/                    Architecture reference doc
    examples/                  FMAPI + DSPy snippets
    cost-attribution/          Cost tracking queries + policy
  tests/                       Unit + integration tests
```