https://github.com/boshu2/12-factor-agentops

DevOps + SRE principles for operating LLM applications reliably at scale. Complementary to 12-Factor Agents for building
12-factor agent-orchestration agentops agents ai-agents ai-agents-framework ai-operations argocd context-engineering devops flux gitops infrastructure-as-code kubernetes kyverno llm openshift platform-engineering production-operations sre


# 12-Factor AgentOps

**Operational patterns from the intersection: infrastructure FOR AI + AI FOR infrastructure**


Code License: Apache 2.0 | Content License: CC BY-SA 4.0 | Status: Alpha

---

> [!IMPORTANT]
> **Status: Alpha** - Patterns proven at production scale in federal infrastructure. Now validating generalization across domains.
>
> **Looking for Context Engineering?** See [12-Factor Agents - Factor 3](https://github.com/humanlayer/12-factor-agents/blob/main/content/factor-03-own-your-context-window.md) by [@dexhorthy](https://github.com/dexhorthy)

---

## The Intersection

**I build GPU/HPC platforms that enable AI workloads.**

**I use AI agents to automate infrastructure operations.**

**I operate both at production scale in federal, security-hardened environments.**

This framework documents operational patterns from both sides of the AI equation.

---

## The Problem

Everyone's building AI agents. Nobody's figured out how to operate them reliably.

- **Week 1:** "This is amazing!"
- **Week 4:** Errors piling up
- **Week 8:** Back to manual work

Sound familiar? **It's 2015 microservices chaos all over again.**

We know how to build reliable infrastructure. We know how to build reliable software.

**But operating AI agents in production? We're still figuring that out.**

---

## What This Is

This framework comes from a platform engineer with 10+ years climbing the IT stack: systems, networking, storage, security, platforms, and automation.

**Current work:**
- Building GPU/HPC infrastructure for AI inference/training workloads (20+ production clusters)
- Using AI agents to automate platform operations (GitOps validation, runbooks, policy)
- Operating in mission-critical, multi-tenant, federal environments

**12-Factor AgentOps = Meta-patterns extracted from real production work at this intersection.**

---

## Why This Perspective Matters

Most people have **ONE** of these:
- Infrastructure ops (no AI exposure)
- AI/ML engineering (no infrastructure ops)
- AI agent users (no production operations)

**This framework comes from having ALL THREE:**
1. Building platforms **FOR** AI workloads
2. Using AI **TO BUILD** platforms
3. Operating both at **PRODUCTION SCALE** in **HIGH-STAKES** environments

---

## The Approach

```
Production Operations → Extract Patterns → Document   → Validate  → Refine
          ↓                    ↓               ↓            ↓          ↓
    (What works?)       (Why it works?)   (Share it)   (Test it)  (Improve it)
```

1. **Document patterns** proven at production scale
2. **Extract meta-patterns** that generalize across contexts
3. **Share early**, validate with community
4. **Refine** based on diverse implementations

**Not theory. Production.**

---

## The Invitation

If you're at a similar intersection:
- Operating AI/ML infrastructure at scale
- Using AI agents for DevOps/SRE work
- Building platforms in constrained environments

**Try these patterns. Share what works in your context. Help prove whether operational discipline transfers.**

---

## Framework: The Factors

The 12 factors are being published as they're validated for generalization.

### Coming Soon

| Factor | Focus | Status |
|--------|-------|--------|
| **I: Git as Knowledge OS** | Commits = memory, branches = isolation, logs = audit trail | Documenting |
| **II: Context Engineering** | JIT loading, 40% rule, progressive disclosure | Documenting |
| **III: Small Specialized Agents** | Single responsibility, composable workflows | Documenting |
| **IV: Validation Gates** | Test before deploy, fail fast, rollback easy | Planned |
| **V: Observability** | Metrics, logs, traces for agent operations | Planned |
| **VI: Session Continuity** | Pause/resume, state preservation, recovery | Planned |

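As a rough illustration of what Factor IV's "test before deploy, fail fast, rollback easy" could look like in code, here is a minimal Python sketch. It is not taken from this repository: `Gate`, `run_gates`, and `deploy_with_gates` are hypothetical names, and the deploy, rollback, and post-check callables stand in for whatever your platform actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Gate:
    """A named pre-deploy check (hypothetical type for this sketch)."""
    name: str
    check: Callable[[], bool]  # returns True when the gate passes


def run_gates(gates: List[Gate]) -> Tuple[bool, List[str]]:
    """Run gates in order and stop at the first failure (fail fast)."""
    passed: List[str] = []
    for gate in gates:
        if not gate.check():
            # Later gates are never evaluated once one fails.
            return False, passed
        passed.append(gate.name)
    return True, passed


def deploy_with_gates(
    gates: List[Gate],
    deploy: Callable[[], None],
    rollback: Callable[[], None],
    post_check: Callable[[], bool],
) -> str:
    """Deploy only if every gate passes; roll back if the post-check fails."""
    ok, _ = run_gates(gates)
    if not ok:
        return "blocked"
    deploy()
    if not post_check():
        rollback()
        return "rolled_back"
    return "deployed"
```

In this sketch a failed gate blocks the change before anything is applied, and a failed post-deploy check triggers the rollback callable, so both outcomes leave the system in a known state.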
> [!TIP]
> **Subscribe to releases** to get notified when factors are published

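Factor VI (pause/resume, state preservation, recovery) is not yet published, so the following is only one plausible reading of it, sketched in Python. It assumes session state fits in a small JSON file; `SessionState`, `save_checkpoint`, `load_checkpoint`, and `run_steps` are hypothetical names, and the actual work for each step is left as a comment.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class SessionState:
    """Minimal resumable state (hypothetical shape for this sketch)."""
    task: str
    completed_steps: List[str] = field(default_factory=list)


def save_checkpoint(state: SessionState, path: str) -> None:
    """Write state to a temp file, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(asdict(state), handle)
    # os.replace swaps the file in one step, so readers never see a partial write.
    os.replace(tmp_path, path)


def load_checkpoint(path: str, task: str) -> SessionState:
    """Resume from a checkpoint if one exists, otherwise start fresh."""
    if not os.path.exists(path):
        return SessionState(task=task)
    with open(path, encoding="utf-8") as handle:
        return SessionState(**json.load(handle))


def run_steps(steps: List[str], path: str, task: str) -> SessionState:
    """Run steps in order, skipping any already recorded in the checkpoint."""
    state = load_checkpoint(path, task)
    for step in steps:
        if step in state.completed_steps:
            continue
        # The real work for this step would happen here.
        state.completed_steps.append(step)
        save_checkpoint(state, path)
    return state
```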
---

## Background

**Platform Engineer**
- 10+ years: Systems → Networks → Security → Platforms → Automation → AI
- 20+ production Kubernetes clusters in federal/DoD environments
- GPU/HPC infrastructure for AI inference/training
- AI-assisted infrastructure operations (GitOps, observability, compliance)

**Unfair advantage:** Deep ops + automation + AI fluency + cultural translation

---

## Contributing

Early-stage documentation of production patterns.

**Want to help?**
- ✅ Implement patterns in your context
- ✅ Share results (successes AND failures)
- ✅ Suggest adaptations for your domain
- ✅ Challenge assumptions constructively

See [CLAUDE.md](CLAUDE.md) for AgentOps principles and contribution guidelines.

---

## Attribution & Inspiration

This framework builds on foundational work from:

### [12-Factor Apps](https://12factor.net) (Heroku)
The original methodology for building software-as-a-service apps. Established principles for:
- Configuration management
- Dependency isolation
- Stateless processes
- Environment parity

**Their insight:** Operational discipline makes applications reliable and portable.

### [12-Factor Agents](https://github.com/humanlayer/12-factor-agents) (Dex Horthy, HumanLayer)
Framework for building reliable LLM applications. Pioneered:
- Context engineering principles
- Human-in-the-loop patterns
- Agent reliability practices
- Production-grade AI systems

**Their insight:** AI agents need the same rigor as traditional software.

### This Project's Focus

**12-Factor AgentOps** extends these foundations to **operations**:
- Not just building reliable agents (12-Factor Agents covers this)
- Not just building reliable apps (12-Factor Apps covers this)
- **Operating AI agents and infrastructure at production scale**

We document patterns from the intersection: infrastructure FOR AI + AI FOR infrastructure.

---

## Related Work

**If you're building AI agents, read these first:**
- [12-Factor Agents](https://github.com/humanlayer/12-factor-agents) by [@dexhorthy](https://github.com/dexhorthy) - Building reliable LLM applications
- [Building Effective Agents](https://www.anthropic.com/engineering/building-effective-agents) by Anthropic - Agent design patterns
- [The Outer Loop](https://theouterloop.substack.com) by Dex Horthy - AI agent development insights

**If you're operating infrastructure, you know these:**
- [12-Factor Apps](https://12factor.net) - SaaS application methodology
- [Site Reliability Engineering](https://sre.google/books/) - Google's SRE practices
- [DevOps Handbook](https://itrevolution.com/product/the-devops-handbook-second-edition/) - DevOps principles

**This framework sits at the intersection.**

---

## License

Code: [Apache 2.0 License](LICENSE) (permissive, use freely)

Documentation: [CC BY-SA 4.0 License](LICENSE) (share alike, attribute)

Full license text: [LICENSE](LICENSE)

---

**Let's make AI agents as reliable as the infrastructure they run on.**

*Patterns proven at production scale in federal infrastructure. Validating generalization across domains.*