https://github.com/tdspora/tdm_rulesgen
FastAPI service for safe rule parsing, compilation, preview execution, and sandbox-backed dataset generation.
fastapi openapi pydantic python rules-engine synthetic-data
- Host: GitHub
- URL: https://github.com/tdspora/tdm_rulesgen
- Owner: tdspora
- License: apache-2.0
- Created: 2026-04-20T22:56:31.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2026-04-21T01:01:11.000Z (2 months ago)
- Last Synced: 2026-04-21T01:28:53.584Z (2 months ago)
- Topics: fastapi, openapi, pydantic, python, rules-engine, synthetic-data
- Language: Python
- Size: 271 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md
- Notice: NOTICE
# rulesgen
`rulesgen` is a secure rule-processing service for synthetic data workflows.
It converts natural-language instructions into a restricted DSL, validates and compiles that DSL into an executable artifact, lets you preview rule behavior locally, and can generate datasets through either a local subprocess executor or an OpenSandbox-backed runtime.
The project is designed for teams that need:
- natural-language rule authoring
- deterministic validation and compilation
- safe local previews
- optional sandboxed execution for full dataset generation
## What it does
`rulesgen` supports a staged rule lifecycle:
1. **Parse** natural language into an untrusted intermediate form (`semantic_frame`) and DSL candidate
2. **Compile** the DSL into a validated executable artifact (`compiled_rule`)
3. **Preview** rule execution against a sample row
4. **Generate** datasets using a local or sandbox-backed execution path
This separation is intentional. Natural-language output is never trusted directly. A rule only becomes executable after validation and compilation succeed.
---
## Quick start
The fastest way to get started is with Docker Compose.
### Prerequisites
- Docker
- `docker compose`
- `curl`
- `jq`
### Start the stack
The default setup runs:
- `rulesgen`
- OpenSandbox
- an LLM-backed translation path through LiteLLM when provider credentials are present
- the built-in stub translation backend when provider credentials are absent
Optional provider credentials for LiteLLM (examples):
- `OPENAI_API_KEY` (OpenAI / OpenAI-compatible)
- `ANTHROPIC_API_KEY`
- `GEMINI_API_KEY`
- `AZURE_API_KEY` (Azure OpenAI; often used with `AZURE_API_VERSION`)
`./scripts/run_stack.sh` falls back to `RULESGEN_LLM_GATEWAY_BACKEND=stub` when none of these credentials is set. Docker Compose still forwards the provider variables into the `rulesgen` container via `${VAR:-}` entries in the compose files.
Start the stack:
```bash
./scripts/run_stack.sh
```
If a provider key is not set, the stack still starts and uses the stub translation backend.
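The fallback can be pictured as a small selection function. The sketch below is illustrative Python, not the contents of `scripts/run_stack.sh`: the name `choose_gateway_backend` is hypothetical, and it assumes `litellm` is the backend chosen whenever at least one of the listed provider keys is non-empty.

```python
import os

# Provider variables listed in this README; the real script may check others.
PROVIDER_KEYS = (
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "GEMINI_API_KEY",
    "AZURE_API_KEY",
)


def choose_gateway_backend(env=None):
    """Return "litellm" if any provider key is non-empty, else "stub".

    Hypothetical helper: it mirrors the documented behavior, not the script.
    """
    env = os.environ if env is None else env
    has_key = any(env.get(name, "").strip() for name in PROVIDER_KEYS)
    return "litellm" if has_key else "stub"


if __name__ == "__main__":
    print(f"RULESGEN_LLM_GATEWAY_BACKEND={choose_gateway_backend()}")
```

Passing a plain dictionary instead of `os.environ` makes the decision easy to check without touching the real environment.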
### Service endpoints
Once the stack is up, the API is available at:
* `http://127.0.0.1:8000`
* `http://127.0.0.1:8000/docs` for OpenAPI documentation, when enabled
### Verify readiness
```bash
curl -s http://127.0.0.1:8000/health/ready
```
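When the stack is started from a script or test harness, it can help to poll the readiness endpoint rather than call it once. The following is a minimal sketch using only the standard library. The helper names are hypothetical, and it assumes the endpoint answers with HTTP 200 once the service is ready, which this README does not state explicitly.

```python
import time
import urllib.error
import urllib.request


def probe_ready(url="http://127.0.0.1:8000/health/ready", timeout=2.0):
    """Return True when the readiness endpoint answers with HTTP 200 (assumed)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response: not ready yet.
        return False


def wait_until_ready(probe=probe_ready, attempts=30, delay=1.0, sleep=time.sleep):
    """Call `probe` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_until_ready()` with its defaults polls the local endpoint for roughly half a minute before giving up. The `probe` and `sleep` parameters exist so the loop can be exercised without a running server.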
### Run the example workflows
In a new terminal:
```bash
export BASE_URL=http://127.0.0.1:8000
```
Then continue with the two workflows below.
---
## Example workflows
## 1. Parse → Compile → Preview
This workflow shows the safest path from natural-language input to executable rule behavior.
### Step 1: Parse a natural-language instruction
The parse endpoint returns:
* inferred intent
* a candidate DSL expression
* diagnostics
* prompt-audit metadata
* explainability and metrics data
At this stage, the output is still **untrusted**.
```bash
curl -s "$BASE_URL/rules/parse" \
  -H "Content-Type: application/json" \
  -d '{
    "table_name": "employees",
    "schema": [
      {"name": "salary", "type": "FLOAT", "nullable": false, "source": "syngen"},
      {"name": "job_level", "type": "INT", "nullable": false, "source": "syngen"},
      {
        "name": "bonus",
        "type": "FLOAT",
        "nullable": true,
        "source": "rule",
        "source_text": "If job_level is 5 or higher, set bonus to 10 percent of salary.",
        "source_type": "natural_language"
      }
    ]
  }'
```
Important fields in the response:
* `dsl_candidate`
* `diagnostics`
* `prompt_audit`
* `metrics`
* `explainability_trace`
For schema-embedded rule requests, the target column is inferred from the schema row `name`. Supported row-level `source_type` values are:
* `natural_language`
* `domain_specific_language`
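The same request can be assembled from Python. This is a hedged sketch using only the standard library: `rule_column` and `build_parse_request` are hypothetical helpers, and the field names are copied from the `curl` example above rather than from a published client API. The request is prepared but not sent.

```python
import json
import urllib.request

# The two row-level source types this README documents.
SUPPORTED_SOURCE_TYPES = {"natural_language", "domain_specific_language"}


def rule_column(name, column_type, source_text,
                source_type="natural_language", nullable=True):
    """Build one schema row whose value is produced by a rule."""
    if source_type not in SUPPORTED_SOURCE_TYPES:
        raise ValueError(f"unsupported source_type: {source_type!r}")
    return {
        "name": name,  # the target column is inferred from this field
        "type": column_type,
        "nullable": nullable,
        "source": "rule",
        "source_text": source_text,
        "source_type": source_type,
    }


def build_parse_request(base_url, table_name, schema):
    """Prepare (but do not send) a POST to /rules/parse."""
    body = json.dumps({"table_name": table_name, "schema": schema}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/rules/parse",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


schema = [
    {"name": "salary", "type": "FLOAT", "nullable": False, "source": "syngen"},
    {"name": "job_level", "type": "INT", "nullable": False, "source": "syngen"},
    rule_column(
        "bonus",
        "FLOAT",
        "If job_level is 5 or higher, set bonus to 10 percent of salary.",
    ),
]
request = build_parse_request("http://127.0.0.1:8000", "employees", schema)
print(request.get_method(), request.full_url)
```

Passing `request` to `urllib.request.urlopen` would send it to a running stack. Rejecting unknown `source_type` values on the client side simply surfaces a typo before the round trip.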
### Step 2: Compile the DSL into a validated rule artifact
Use the `dsl_candidate` from the parse response, or submit a DSL expression directly.
```bash
COMPILE_RESPONSE="$(
  curl -s "$BASE_URL/rules/compile" \
    -H "Content-Type: application/json" \
    --data-binary @- <<'EOF'
{
  "expression": "0.1 * col('salary') if col('job_level') >= 5 else 0",
  "target_column": "bonus"
}
EOF
)"
export ARTIFACT_ID="$(echo "$COMPILE_RESPONSE" | jq -r '.artifact_id')"
echo "ARTIFACT_ID=$ARTIFACT_ID"
```
Save the returned `artifact_id`. You will use it in the preview step.
### Step 3: Preview the rule against a sample row
The preview endpoint executes the compiled rule locally and supports row-phase helpers only.
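```bash
curl -s "$BASE_URL/rules/preview" \
  -H "Content-Type: application/json" \
  --data-binary @- <<'EOF'
```
[The preview request body and the text that followed it are missing from this copy of the README. The surviving fragment resumes inside a Python library example that compiles and previews the same rule; its opening lines are reconstructed from context.]
```python
from rulesgen import compile_rule, preview_rule

compiled = compile_rule(
    '0.1 * col("salary") if col("job_level") >= 5 else 0',
    target_column="bonus",
)
preview = preview_rule(
    compiled,
    row={"salary": 120000, "job_level": 6},
    seed=99,
)
print(preview.value)
```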
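To make the expected preview result concrete, the `bonus` expression can be mirrored in plain Python. This only illustrates the arithmetic and is not how `rulesgen` evaluates rules. The `col` function here is a hypothetical stand-in that looks a column up in the row.

```python
import math


def bonus_rule(row):
    """Plain-Python mirror of the `bonus` DSL expression."""

    def col(name):
        # Hypothetical stand-in for the DSL's col() helper.
        return row[name]

    # Conditional expression: 10 percent of salary at level 5 or above, else 0.
    return 0.1 * col("salary") if col("job_level") >= 5 else 0


senior = bonus_rule({"salary": 120000, "job_level": 6})
junior = bonus_rule({"salary": 120000, "job_level": 4})

# Compare floats with a tolerance rather than exact equality.
assert math.isclose(senior, 12000.0)
assert junior == 0
print(senior, junior)
```

For the sample row used in the preview (`salary` 120000, `job_level` 6), the rule should therefore yield a bonus of about 12000.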
### Parse a natural-language rule
```python
from rulesgen import Settings, SourceType, parse_rule

settings = Settings(
    llm_gateway_backend="litellm",  # or: "http" / "stub"
    llm_model_name="gpt-4",
)

frame = parse_rule(
    "If job_level is 5 or higher, set bonus to 10 percent of salary.",
    source_type=SourceType.NATURAL_LANGUAGE,
    table_name="employees",
    schema_columns=["salary", "job_level", "bonus"],
    target_column="bonus",
    settings=settings,
)
print(frame.dsl_candidate)
```
### Execute multiple compiled rules in-process
```python
from rulesgen import Settings, compile_rule, execute_generation_plan

settings = Settings()

rows = [
    {"order_id": "A", "line_amount": 10},
    {"order_id": "A", "line_amount": 5},
    {"order_id": "B", "line_amount": 7},
]

compiled_rules = [
    compile_rule(
        'col("line_amount") * 2',
        target_column="line_amount_x2",
        settings=settings,
    ),
    compile_rule(
        'group_sum(key=col("order_id"), value=col("line_amount"))',
        target_column="order_total",
        settings=settings,
    ),
]

run = execute_generation_plan(
    rows=rows,
    compiled_rules=compiled_rules,
    seed=17,
    references={},
    max_length=settings.dsl_max_length,
    max_depth=settings.dsl_max_depth,
    max_nodes=settings.dsl_max_nodes,
)
print(run.rows)
print(run.column_sources)
```
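For readers unfamiliar with grouped helpers, the plain-Python sketch below shows what the two rules above would plausibly compute for these rows. It rests on an assumption: that `group_sum` totals `value` across all rows sharing the same `key` and writes that total onto each row of the group. The README does not spell out the helper's semantics, and `group_sum_column` is a hypothetical name.

```python
from collections import defaultdict


def group_sum_column(rows, key, value):
    """Return, for each row, the sum of `value` over all rows sharing its `key`."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    # Second pass: every row receives the total of its own group.
    return [totals[row[key]] for row in rows]


rows = [
    {"order_id": "A", "line_amount": 10},
    {"order_id": "A", "line_amount": 5},
    {"order_id": "B", "line_amount": 7},
]

line_amount_x2 = [row["line_amount"] * 2 for row in rows]
order_total = group_sum_column(rows, key="order_id", value="line_amount")

print(line_amount_x2)  # [20, 10, 14]
print(order_total)     # [15, 15, 7]
```

Unlike `col("line_amount") * 2`, a grouped total cannot be computed from one row in isolation, so it needs the full set of rows before any value can be written.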
### Copy a generated artifact for a completed job
When the library shares the same local repositories as the API service, you can copy
completed job artifacts to another local path:
```python
from rulesgen import Settings, download_job_artifact, download_job_dataset

settings = Settings(
    jobs_repository_dir=".rulesgen-data/jobs",
    artifacts_repository_dir=".rulesgen-data/artifacts",
    ossfs_root_dir=".rulesgen-data/ossfs",
)

dataset_copy = download_job_dataset(
    "job-id",
    destination="downloads/generated_rows.json",
    settings=settings,
)
manifest_copy = download_job_artifact(
    "job-id",
    "artifact-id",
    destination="downloads/sandbox_manifest.json",
    settings=settings,
)
print(dataset_copy)
print(manifest_copy)
```
---
## API reference
### Health
* `GET /health/live`
* `GET /health/ready`
### Rules
* `POST /rules/parse`
* `POST /rules/compile`
* `POST /rules/preview`
* `POST /rules/execute`
### Datasets and jobs
* `POST /datasets/uploads`
* `POST /datasets/generate`
* `POST /jobs`
* `GET /jobs/{job_id}`
* `GET /jobs/{job_id}/dataset`
* `GET /jobs/{job_id}/artifacts/{artifact_id}`
---
## Architecture summary
The HTTP layer is intentionally thin.
* routers depend on services
* services depend on compiler and repository interfaces
* execution is limited to validated AST artifacts with a restricted runtime surface
Natural-language parsing always produces an untrusted `semantic_frame` and DSL candidate first. Only validated DSL can be compiled into an executable artifact.
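The idea of validating an AST before anything runs can be sketched with Python's standard `ast` module. This is a minimal illustration under stated assumptions, not the `rulesgen` compiler: the allowlist, the limits, and the function names are all hypothetical, and only the `col` helper is permitted. The `max_length` and `max_nodes` parameters loosely echo the `dsl_max_length` and `dsl_max_nodes` settings used elsewhere in this README; a depth limit is omitted for brevity.

```python
import ast

# Syntax this sketch accepts; the real allowlist is not shown in this README.
ALLOWED_NODES = (
    ast.Expression, ast.IfExp, ast.BinOp, ast.Compare, ast.Call,
    ast.Name, ast.Load, ast.Constant,
    ast.Add, ast.Sub, ast.Mult, ast.Div,
    ast.Gt, ast.GtE, ast.Lt, ast.LtE, ast.Eq, ast.NotEq,
)
ALLOWED_CALLS = {"col"}


def validate_expression(source, max_length=500, max_nodes=100):
    """Parse `source` and raise ValueError for anything outside the allowlist."""
    if len(source) > max_length:
        raise ValueError("expression too long")
    tree = ast.parse(source, mode="eval")
    nodes = list(ast.walk(tree))
    if len(nodes) > max_nodes:
        raise ValueError("expression has too many nodes")
    for node in nodes:
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"disallowed syntax: {type(node).__name__}")
        if isinstance(node, ast.Call):
            func = node.func
            if not (isinstance(func, ast.Name) and func.id in ALLOWED_CALLS):
                raise ValueError("only allowlisted helper calls are permitted")
        elif isinstance(node, ast.Name) and node.id not in ALLOWED_CALLS:
            raise ValueError(f"unknown name: {node.id}")
    return tree


def evaluate(tree, row):
    """Evaluate a validated tree with no builtins and only col() in scope."""
    code = compile(tree, "<rule>", "eval")
    return eval(code, {"__builtins__": {}}, {"col": row.__getitem__})


tree = validate_expression("0.1 * col('salary') if col('job_level') >= 5 else 0")
print(evaluate(tree, {"salary": 100, "job_level": 5}))

try:
    validate_expression("__import__('os').system('ls')")
except ValueError as error:
    print("rejected:", error)
```

The point of the ordering is that `evaluate` only ever sees a tree that `validate_expression` returned. Anything that fails validation raises before `compile` or `eval` is reached.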
Execution paths are separated by purpose:
* **preview execution** uses the local preview executor and supports row-phase helpers
* **dataset generation** uses either the subprocess executor or the OpenSandbox adapter
* both generation modes preserve the same manifest and artifact contract under the configured OSSFS root
---
## What’s included
* production-oriented ASGI application skeleton under `src/rulesgen/`
* structured logging and request context propagation
* RFC 9457 problem-details responses
* versioned API modules for health, rules, datasets, and jobs
* restricted DSL compilation based on Python AST validation
* filesystem-backed repositories for rules, jobs, prompt audits, and artifacts
* local preview execution
* subprocess dataset execution
* optional OpenSandbox integration for sandboxed dataset generation
* tests, CI, and container build support
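For readers new to RFC 9457, a problem-details body is a JSON object with a small set of standard members (`type`, `title`, `status`, `detail`, `instance`), served as `application/problem+json`. The sketch below builds such a body. The helper name, the example values, and the `errors` extension member are hypothetical and do not describe the exact payloads `rulesgen` returns.

```python
import json


def problem_details(status, title, detail=None, instance=None,
                    type_uri="about:blank", **extensions):
    """Build an RFC 9457 problem-details object as a plain dictionary."""
    body = {"type": type_uri, "title": title, "status": status}
    if detail is not None:
        body["detail"] = detail
    if instance is not None:
        body["instance"] = instance
    # RFC 9457 allows extension members alongside the standard ones.
    body.update(extensions)
    return body


problem = problem_details(
    status=400,
    title="Bad Request",
    detail="The DSL expression could not be validated.",  # illustrative text
    instance="/rules/compile",
    errors=["disallowed syntax: Attribute"],  # hypothetical extension member
)
print(json.dumps(problem, indent=2))
```

With the default `about:blank` type, the `title` is expected to match the HTTP status phrase, which is why the example uses "Bad Request" for status 400.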
---
## Development
Install development dependencies:
```bash
uv sync --extra api --extra dev
```
Other useful commands:
```bash
uv run pytest          # run the test suite
uv run ruff check .    # lint
uv run ruff format .   # format the code
uv run mypy src        # type-check the package
uv run pip-audit       # audit dependencies for known vulnerabilities
```
---
## Documentation
The project documentation site uses MkDocs with the Material theme and sources
its published pages from `docs/`.
For a full contributor environment with the docs toolchain enabled:
```bash
uv sync --extra api --extra dev --extra docs --locked
uv run mkdocs serve
```
To build the site the same way as the GitHub Pages workflow:
```bash
uv run mkdocs build --strict
```
Once GitHub Pages is configured to deploy from GitHub Actions, the published
site lives at `https://tdspora.github.io/tdm_rulesgen/`.
The Pages site is intentionally smaller than the full set of repository docs.
Longer design and contributor references remain in the repository and are
linked from the published site.
---
## Release process
Pushes to `main` run CI, build the wheel and sdist, create the GitHub Release via semantic-release, attach release artifacts to that release, publish the same distributions to PyPI, and build/push a Docker image to Docker Hub.
Before enabling automated releases, configure these repository secrets:
- `DEPLOY_KEY` for the SSH deploy key that semantic-release uses to push version bump commits and tags.
- `PYPI_TOKEN` for a PyPI API token scoped to the `rulesgen` project.
- `DOCKER_HUB_USER` for the Docker Hub namespace that owns the published `rulesgen` image.
- `DOCKER_HUB_TOKEN` for a Docker Hub access token with permission to push images for that namespace.
The PyPI and Docker publishing jobs both check out the GitHub release tag created by semantic-release and verify that `project.version` matches the released semver before publishing.
When a release is published, the workflow pushes `DOCKER_HUB_USER/rulesgen` with `latest`, `vX.Y.Z`, and `X.Y.Z` tags, while PyPI receives distributions built from that same tagged source.
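The tagging and version-check rules can be expressed compactly. This is an illustrative Python sketch, not the release workflow itself: the function names are hypothetical, `example-user` stands in for the value of `DOCKER_HUB_USER`, and release tags are assumed to take the form `vX.Y.Z`.

```python
import re

# Plain MAJOR.MINOR.PATCH only; pre-release suffixes are out of scope here.
SEMVER = re.compile(r"^\d+\.\d+\.\d+$")


def docker_tags(image, version):
    """Return the three image references published for one release."""
    if not SEMVER.match(version):
        raise ValueError(f"not a plain semver version: {version!r}")
    return [f"{image}:latest", f"{image}:v{version}", f"{image}:{version}"]


def tag_matches_project_version(git_tag, project_version):
    """True when a vX.Y.Z git tag corresponds to `project.version`."""
    return git_tag == f"v{project_version}"


print(docker_tags("example-user/rulesgen", "0.1.0"))
print(tag_matches_project_version("v0.1.0", "0.1.0"))
```

A check like `tag_matches_project_version` is what keeps the PyPI distributions and the Docker image tied to the same tagged source: publishing stops if the tag and `project.version` disagree.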
Before the first automated release, create a baseline tag that matches `project.version` in `pyproject.toml`:
```bash
git tag v0.1.0
git push origin v0.1.0
```
If branch protection requires pull requests on `main`, allow the GitHub Actions app to bypass that requirement so semantic-release can push its version bump commit and publish release assets.
The script below applies the recommended repository and branch-protection settings and prints the one remaining manual UI step:
```text
scripts/configure-github-repo-oss.sh
```
---
## Project guidance
* Domain vocabulary for contributors and agents lives in `docs/domain-dictionary.md`
* Cursor rules for this repository live in `.cursor/rules/`
---
## License
This project is licensed under the **Apache License 2.0**. See [`LICENSE`](LICENSE) for the full text and [`NOTICE`](NOTICE) for copyright attribution.
## Contributing
See [`CONTRIBUTING.md`](CONTRIBUTING.md). This project follows the [`CODE_OF_CONDUCT.md`](CODE_OF_CONDUCT.md). Report security issues according to [`SECURITY.md`](SECURITY.md).