{"id":46774341,"url":"https://github.com/avivl/cloud-sre-agent","last_synced_at":"2026-03-09T23:41:59.400Z","repository":{"id":312623961,"uuid":"1048028765","full_name":"avivl/cloud-sre-agent","owner":"avivl","description":"An autonomous SRE agent that monitors cloud logs across multiple platforms, leveraging AI models from various providers to detect anomalies, perform root cause analysis, and automate remediation by creating GitHub Pull Requests.","archived":false,"fork":false,"pushed_at":"2025-09-19T15:23:47.000Z","size":4452,"stargazers_count":10,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-19T16:28:36.971Z","etag":null,"topics":["ai-agents","ai-ops","automation","aws","cloud","devops","gcp","gemini-ai","google-cloud","incident-response","llm","log-analysis","log-monitoring","platform-engineering","python","resilience","sre","vertex-ai"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/avivl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"docs/SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-08-31T18:55:54.000Z","updated_at":"2025-09-19T15:23:59.000Z","dependencies_parsed_at":"2025-09-01T01:21:16.601Z","dependency_job_id":"c513d7d1-e704-40ea-b479-1646451be5a4","html_url":"https://github.com/avivl/cloud-sre-agent","commit_stats":null,"previous_names":["avivl/gemini-sre-agent","avivl/cloud-sre-agent"],"tags_count":0,"template":fa
lse,"template_full_name":null,"purl":"pkg:github/avivl/cloud-sre-agent","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/avivl%2Fcloud-sre-agent","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/avivl%2Fcloud-sre-agent/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/avivl%2Fcloud-sre-agent/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/avivl%2Fcloud-sre-agent/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/avivl","download_url":"https://codeload.github.com/avivl/cloud-sre-agent/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/avivl%2Fcloud-sre-agent/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30316774,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-09T20:05:46.299Z","status":"ssl_error","status_checked_at":"2026-03-09T19:57:04.425Z","response_time":61,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-agents","ai-ops","automation","aws","cloud","devops","gcp","gemini-ai","google-cloud","incident-response","llm","log-analysis","log-monitoring","platform-engineering","python","resilience","sre","vertex-ai"],"created_at":"2026-03-09T23:41:58.818Z","updated_at":"2026-03-09T23:41:59.390Z","avatar_url":"https://github.com/avivl.png","language":"Python","readme":"# Cloud SRE Agent: Multi-Provider Autonomous Monitoring and Remediation\n\n[![GitHub Stars](https://img.shields.io/github/stars/avivl/cloud-sre-agent.svg?style=for-the-badge\u0026logo=github\u0026color=gold)](https://github.com/avivl/cloud-sre-agent/stargazers)\n[![Last Commit](https://img.shields.io/github/last-commit/avivl/cloud-sre-agent?style=for-the-badge\u0026logo=github)](https://github.com/avivl/cloud-sre-agent)\n[![License](https://img.shields.io/badge/License-MIT-blue?style=for-the-badge)](LICENSE)\n[![Multi-Provider AI](https://img.shields.io/badge/Multi--Provider%20AI-00BCD4?style=for-the-badge\u0026logo=openai\u0026logoColor=fff)](https://github.com/avivl/cloud-sre-agent)\n![Cloud SRE Agent](static/cloud_sre_agent.png)\n\nWelcome to the Cloud SRE Agent, an autonomous multi-provider system designed to enhance your cloud operations by intelligently monitoring logs and automating incident response. 
This project leverages the power of 100+ AI providers (OpenAI, Anthropic, Google, Cohere, Ollama, and more) to bring advanced AI capabilities directly into your Site Reliability Engineering (SRE) workflows.\n\nAt its core, the Cloud SRE Agent acts as a proactive digital assistant, continuously observing your cloud environment. When anomalies or critical events are detected in your logs, it doesn't just alert you; it initiates a structured process of analysis, generates intelligent code fixes using its unified code generation system, and proposes concrete remediation steps, culminating in automated GitHub Pull Requests. This approach aims to reduce manual toil, accelerate incident resolution, and improve the overall reliability of your cloud services.\n\nWhether you're looking to streamline your incident management, gain deeper insights from your operational data, or simply explore the practical applications of generative AI in SRE, the Cloud SRE Agent offers a robust and extensible foundation. 
It's built with an emphasis on configurability, resilience, and clear observability, ensuring it can adapt to diverse cloud environments and operational needs.\n\nDive in to discover how this agent can transform your cloud log monitoring into an intelligent, automated, and resilient operation.\n\n## System Architecture\n\nThe Cloud SRE Agent employs a sophisticated multi-provider AI architecture with **dynamic prompt generation**, advanced pattern detection, a **unified code generation system**, and a **comprehensive log ingestion system** for intelligent log monitoring and automated remediation:\n\n```mermaid\ngraph TB\n    subgraph \"Multi-Cloud Platform\"\n        GCP[Google Cloud Platform] --\u003e GCPL[Cloud Logging]\n        AWS[Amazon Web Services] --\u003e AWSL[CloudWatch Logs]\n        AZURE[Microsoft Azure] --\u003e AZUREL[Azure Monitor]\n        K8S[Kubernetes] --\u003e K8SL[Container Logs]\n        GCPL --\u003e PS[Pub/Sub Topics]\n        AWSL --\u003e KDS[Kinesis Data Streams]\n        AZUREL --\u003e EH[Event Hubs]\n        K8SL --\u003e K8SAPI[Kubernetes API]\n    end\n\n    subgraph \"Cloud SRE Agent - Multi-Provider AI System\"\n        SUB --\u003e LM[Log Manager\u003cbr/\u003eOrchestration Layer]\n        LM --\u003e |Multi-Source| GCP[GCP Pub/Sub Adapter]\n        LM --\u003e |Multi-Source| K8S[Kubernetes Adapter]\n        LM --\u003e |Multi-Source| FS[File System Adapter]\n        LM --\u003e |Multi-Source| AWS[AWS CloudWatch Adapter]\n\n        GCP --\u003e |LogEntry| LP[Log Processor]\n        K8S --\u003e |LogEntry| LP\n        FS --\u003e |LogEntry| LP\n        AWS --\u003e |LogEntry| LP\n\n        LP --\u003e |Structured Logs| MLPR[ML Pattern Refinement\u003cbr/\u003eAI Enhancement Layer]\n        MLPR --\u003e |Analysis| PD[Pattern Detection System\u003cbr/\u003eMulti-Layer Analysis]\n        PD --\u003e |Pattern Match| TA[Triage Agent\u003cbr/\u003eMulti-Provider AI]\n        LP --\u003e |Structured Logs| TA\n        TA --\u003e 
|TriagePacket| AA[Analysis Agent\u003cbr/\u003eDynamic Prompt Generation]\n        AA --\u003e |ValidationRequest| QA[Quantitative Analyzer\u003cbr/\u003eCode Execution]\n        QA --\u003e |EmpiricalData| AA\n        AA --\u003e |RemediationPlan| UECG[Unified\u003cbr/\u003eCode Generation System]\n        UECG --\u003e |Generated Code| RA[Remediation Agent]\n    end\n\n    subgraph \"Legacy System (Backward Compatible)\"\n        SUB --\u003e LS[Legacy Log Subscriber]\n        LS --\u003e |Raw Logs| TA\n    end\n\n    subgraph \"Comprehensive Monitoring System\"\n        MM[Monitoring Manager] --\u003e MC[Metrics Collector]\n        MM --\u003e HC[Health Checker]\n        MM --\u003e PM[Performance Monitor]\n        MM --\u003e AM[Alert Manager]\n\n        LM --\u003e |Health Status| MM\n        LP --\u003e |Processing Metrics| MM\n        TA --\u003e |Analysis Metrics| MM\n        AA --\u003e |Generation Metrics| MM\n    end\n\n    subgraph \"ML Pattern Refinement Layer\"\n        MLPR --\u003e LQV[Log Quality\u003cbr/\u003eValidator]\n        MLPR --\u003e LSan[Log Sanitizer\u003cbr/\u003ePII Removal]\n        MLPR --\u003e GEPD[AI Pattern\u003cbr/\u003eDetector]\n        MLPR --\u003e GRC[Response Cache\u003cbr/\u003eCost Optimization]\n    end\n\n    subgraph \"Pattern Detection Layers\"\n        PD --\u003e TW[Time Window\u003cbr/\u003eManagement]\n        PD --\u003e TE[Smart Threshold\u003cbr/\u003eEvaluation]\n        PD --\u003e PC[Pattern\u003cbr/\u003eClassification]\n        PD --\u003e CS[Confidence\u003cbr/\u003eScoring]\n    end\n\n    subgraph \"Unified Code Generation System\"\n        UECG --\u003e UWO[Unified Workflow\u003cbr/\u003eOrchestrator]\n        UWO --\u003e WCM[Workflow Context\u003cbr/\u003eManager]\n        UWO --\u003e WAE[Workflow Analysis\u003cbr/\u003eEngine]\n        UWO --\u003e WCG[Workflow Code\u003cbr/\u003eGenerator]\n        UWO --\u003e WVE[Workflow Validation\u003cbr/\u003eEngine]\n        UWO --\u003e WMC[Workflow 
Metrics\u003cbr/\u003eCollector]\n\n        WCG --\u003e CGF[Code Generator\u003cbr/\u003eFactory]\n        CGF --\u003e APIG[API Code\u003cbr/\u003eGenerator]\n        CGF --\u003e DBG[Database Code\u003cbr/\u003eGenerator]\n        CGF --\u003e SECG[Security Code\u003cbr/\u003eGenerator]\n\n        WVE --\u003e CVP[Code Validation\u003cbr/\u003ePipeline]\n        CVP --\u003e SYN[Syntax\u003cbr/\u003eValidation]\n        CVP --\u003e PAT[Pattern\u003cbr/\u003eValidation]\n        CVP --\u003e SEC[Security\u003cbr/\u003eReview]\n        CVP --\u003e PERF[Performance\u003cbr/\u003eAssessment]\n        CVP --\u003e BP[Best Practices\u003cbr/\u003eCheck]\n    end\n\n    subgraph \"External Services\"\n        RA --\u003e |Create PR| GH[GitHub Repository]\n        RA --\u003e |Notifications| SLACK[Slack/PagerDuty]\n        MLPR --\u003e |Code Context| GH\n        UECG --\u003e |Repository Analysis| GH\n    end\n\n    subgraph \"Multi-Provider AI System\"\n        OAI[OpenAI\u003cbr/\u003eGPT-4, GPT-3.5] --\u003e TA\n        ANTH[Anthropic\u003cbr/\u003eClaude-3] --\u003e AA\n        GOOG[Google\u003cbr/\u003eAI Models] --\u003e QA\n        COH[Cohere\u003cbr/\u003eCommand] --\u003e UECG\n        OLL[Ollama\u003cbr/\u003eLocal Models] --\u003e RA\n        LITE[LiteLLM\u003cbr/\u003e100+ Providers] --\u003e TA\n    end\n\n    subgraph \"Configuration \u0026 Resilience\"\n        CONFIG[config.yaml\u003cbr/\u003eMulti-Service Setup] --\u003e LS\n        RESILIENCE[Hyx Resilience\u003cbr/\u003eCircuit Breakers] --\u003e TA\n        RESILIENCE --\u003e AA\n        RESILIENCE --\u003e RA\n        RESILIENCE --\u003e MLPR\n        RESILIENCE --\u003e UECG\n    end\n\n    classDef aiComponent fill:#e1f5fe,stroke:#01579b,stroke-width:2px\n    classDef gcpService fill:#fff3e0,stroke:#f57c00,stroke-width:2px\n    classDef external fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px\n    classDef config fill:#e8f5e8,stroke:#388e3c,stroke-width:2px\n    classDef mlComponent 
fill:#fce4ec,stroke:#c2185b,stroke-width:2px\n    classDef codeGenComponent fill:#e8f5e8,stroke:#2e7d32,stroke-width:3px\n    classDef ingestionComponent fill:#e3f2fd,stroke:#1976d2,stroke-width:2px\n    classDef monitoringComponent fill:#f1f8e9,stroke:#689f38,stroke-width:2px\n    classDef legacyComponent fill:#fafafa,stroke:#616161,stroke-width:2px\n\n    class TA,AA,QA aiComponent\n    class CL,PS,SUB gcpService\n    class GH,SLACK external\n    class CONFIG,RESILIENCE config\n    class MLPR,LQV,LSan,GEPD,GRC mlComponent\n    class UECG,UWO,WCM,WAE,WCG,WVE,WMC,CGF,APIG,DBG,SECG,CVP,SYN,PAT,SEC,PERF,BP codeGenComponent\n    class LM,GCP,K8S,FS,AWS,LP ingestionComponent\n    class MM,MC,HC,PM,AM monitoringComponent\n    class LS legacyComponent\n```\n\n### Multi-Provider AI Strategy\n\nThe system leverages different AI providers optimized for specific tasks:\n\n- **Fast Models** (GPT-4o-mini, Claude-3 Haiku, Gemini Flash): High-speed log triage and classification (cost-optimized)\n- **Balanced Models** (GPT-4, Claude-3 Sonnet, Gemini Pro): Deep analysis and code generation (accuracy-optimized)\n- **Specialized Models** (Code-specific models, local Ollama): Empirical validation and quantitative analysis\n- **100+ Providers**: Via LiteLLM integration for maximum flexibility and cost optimization\n\n### Dynamic Prompt Generation System\n\nThe Cloud SRE Agent now features a **revolutionary dynamic prompt generation system** that automatically creates context-aware, specialized prompts for optimal code generation across multiple AI providers:\n\n```mermaid\ngraph TB\n    subgraph \"Prompt Generation Pipeline\"\n        TC[Task Context] --\u003e APS[Adaptive Prompt Strategy]\n        IC[Issue Context] --\u003e APS\n        RC[Repository Context] --\u003e APS\n\n        APS --\u003e |Complex Issues| MPG[Meta-Prompt Generator\u003cbr/\u003eFast AI Models]\n        APS --\u003e |Specialized Issues| SPT[Specialized Templates\u003cbr/\u003eDatabase/API/Security]\n        APS 
--\u003e |Simple Issues| GPT[Generic Prompt Template]\n\n        MPG --\u003e |Optimized Prompt| MPG2[Meta-Prompt Generator\u003cbr/\u003eAdvanced AI Models]\n        MPG2 --\u003e |Final Prompt| EAA[Analysis Agent]\n        SPT --\u003e |Specialized Prompt| EAA\n        GPT --\u003e |Generic Prompt| EAA\n\n        EAA --\u003e |Analysis Result| VAL[Validation \u0026 Refinement]\n        VAL --\u003e |Validated Code| OUT[Code Fix Output]\n    end\n\n    classDef metaPrompt fill:#e1f5fe,stroke:#01579b,stroke-width:2px\n    classDef specialized fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px\n    classDef generic fill:#e8f5e8,stroke:#388e3c,stroke-width:2px\n\n    class MPG,MPG2 metaPrompt\n    class SPT specialized\n    class GPT generic\n```\n\n**Key Capabilities:**\n\n- **Meta-Prompt Generation**: Uses Fast AI models to generate optimized prompts for Advanced AI models, creating an \"AI teaching AI\" approach across providers\n- **Context-Aware Templates**: Automatically selects specialized prompt templates based on issue type (database errors, API failures, security issues, etc.)\n- **Adaptive Strategy Selection**: Intelligently chooses between meta-prompt, specialized, or generic approaches based on issue complexity and available AI providers\n- **Multi-Stage Validation**: Implements iterative refinement with validation feedback loops\n- **Fallback Mechanisms**: Gracefully degrades to simpler approaches or alternative providers if advanced features or primary providers fail\n- **Performance Optimization**: Caching and similarity matching for cost-effective prompt generation\n\n**Issue Type Specialization:**\n\n- **Database Errors**: Specialized prompts for connection issues, query optimization, deadlocks\n- **API Errors**: Authentication, rate limiting, endpoint failures, response validation\n- **Security Issues**: Vulnerability assessment, access control, encryption problems\n- **Service Errors**: Microservice communication, dependency failures, resource 
exhaustion\n- **Infrastructure Issues**: Deployment problems, configuration errors, scaling issues\n\n## Unified Code Generation System\n\nThe Cloud SRE Agent features a **comprehensive unified code generation system** that provides end-to-end automated remediation capabilities. This system represents a significant advancement in AI-powered incident response, combining specialized code generators, multi-level validation, performance optimization, and adaptive learning.\n\n### Unified Code Generation Architecture\n\n```mermaid\ngraph TB\n    subgraph \"Unified Code Generation Pipeline\"\n        UWO[Unified Workflow\u003cbr/\u003eOrchestrator] --\u003e WCM[Workflow Context\u003cbr/\u003eManager]\n        UWO --\u003e WAE[Workflow Analysis\u003cbr/\u003eEngine]\n        UWO --\u003e WCG[Workflow Code\u003cbr/\u003eGenerator]\n        UWO --\u003e WVE[Workflow Validation\u003cbr/\u003eEngine]\n        UWO --\u003e WMC[Workflow Metrics\u003cbr/\u003eCollector]\n\n        WCM --\u003e |Context Building| IC[Issue Context\u003cbr/\u003eCache]\n        WCM --\u003e |Repository Analysis| RC[Repository Context\u003cbr/\u003eCache]\n\n        WAE --\u003e |Analysis| EAA[Analysis\u003cbr/\u003eAgent]\n        EAA --\u003e |Meta-Prompt Generation| MPG[Meta-Prompt\u003cbr/\u003eGenerator]\n        EAA --\u003e |Adaptive Strategy| APS[Adaptive Prompt\u003cbr/\u003eStrategy]\n\n        WCG --\u003e |Specialized Generation| CGF[Code Generator\u003cbr/\u003eFactory]\n        CGF --\u003e |API Issues| APIG[API Code\u003cbr/\u003eGenerator]\n        CGF --\u003e |Database Issues| DBG[Database Code\u003cbr/\u003eGenerator]\n        CGF --\u003e |Security Issues| SECG[Security Code\u003cbr/\u003eGenerator]\n\n        WVE --\u003e |Multi-Level Validation| CVP[Code Validation\u003cbr/\u003ePipeline]\n        CVP --\u003e |Syntax Check| SYN[Syntax\u003cbr/\u003eValidation]\n        CVP --\u003e |Pattern Check| PAT[Pattern\u003cbr/\u003eValidation]\n        CVP --\u003e |Security Review| 
SEC[Security\u003cbr/\u003eReview]\n        CVP --\u003e |Performance Check| PERF[Performance\u003cbr/\u003eAssessment]\n        CVP --\u003e |Best Practices| BP[Best Practices\u003cbr/\u003eCheck]\n\n        WMC --\u003e |Performance Monitoring| PM[Performance\u003cbr/\u003eMonitor]\n        WMC --\u003e |Cost Tracking| CT[Cost\u003cbr/\u003eTracker]\n        WMC --\u003e |Learning System| ECL[Code\u003cbr/\u003eGeneration Learning]\n    end\n\n    classDef orchestrator fill:#e1f5fe,stroke:#01579b,stroke-width:3px\n    classDef workflow fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px\n    classDef generator fill:#e8f5e8,stroke:#388e3c,stroke-width:2px\n    classDef validation fill:#fff3e0,stroke:#f57c00,stroke-width:2px\n    classDef monitoring fill:#fce4ec,stroke:#c2185b,stroke-width:2px\n\n    class UWO orchestrator\n    class WCM,WAE,WCG,WVE,WMC workflow\n    class CGF,APIG,DBG,SECG generator\n    class CVP,SYN,PAT,SEC,PERF,BP validation\n    class PM,CT,ECL monitoring\n```\n\n### Core Components\n\n**Unified Workflow Orchestrator:**\n\n- **Central Coordination**: Manages the entire code generation workflow from analysis to validation\n- **Modular Architecture**: Broken down into specialized workflow components for maintainability\n- **Async Optimization**: Concurrent task execution with priority queuing and retry mechanisms\n- **Performance Monitoring**: Real-time metrics collection and trend analysis\n\n**Analysis Engine:**\n\n- **Context-Aware Analysis**: Incorporates repository structure, recent commits, and issue patterns\n- **Meta-Prompt Generation**: Uses AI to generate optimized prompts for better code generation across multiple providers\n- **Adaptive Strategy Selection**: Chooses optimal analysis approach based on issue complexity and available AI providers\n- **Fallback Mechanisms**: Graceful degradation to simpler approaches or alternative providers when needed\n\n**Specialized Code Generators:**\n\n- **API Code 
Generator**: Handles authentication, rate limiting, endpoint failures, response validation\n- **Database Code Generator**: Manages connection issues, query optimization, deadlocks, schema changes\n- **Security Code Generator**: Addresses vulnerability assessment, access control, encryption problems\n- **Factory Pattern**: Dynamic generator selection based on issue type and context\n\n**Multi-Level Validation Pipeline:**\n\n- **Syntax Validation**: Ensures generated code is syntactically correct\n- **Pattern Validation**: Verifies code follows established patterns and conventions\n- **Security Review**: Identifies potential security vulnerabilities and compliance issues\n- **Performance Assessment**: Evaluates performance impact and optimization opportunities\n- **Best Practices Check**: Ensures adherence to coding standards and architectural principles\n\n**Performance \u0026 Cost Management:**\n\n- **Adaptive Rate Limiting**: Smart API request management with circuit breaker patterns\n- **Cost Tracking**: Real-time budget monitoring with configurable limits and alerts\n- **Performance Monitoring**: Comprehensive metrics collection and trend analysis\n- **Async Optimization**: Concurrent processing with intelligent task prioritization\n\n**Learning \u0026 Improvement System:**\n\n- **Generation History Tracking**: Records successful and failed code generation attempts\n- **Pattern Learning**: Identifies successful patterns for future use\n- **Feedback Integration**: Incorporates validation results to improve future generations\n- **Performance Insights**: Provides actionable recommendations for system optimization\n\n### Key Capabilities\n\n**Intelligent Code Generation:**\n\n- **Context-Aware**: Incorporates repository structure, recent changes, and issue patterns\n- **Specialized Approaches**: Domain-specific generators for different types of issues\n- **Multi-Provider Strategy**: Leverages different AI models from various providers for optimal results\n- 
**Iterative Refinement**: Continuous improvement based on validation feedback\n\n**Quality Assurance:**\n\n- **Multi-Level Validation**: Comprehensive checking from syntax to security\n- **Automated Testing**: Integration with existing test suites and validation frameworks\n- **Code Review Integration**: Generates pull requests with detailed explanations\n- **Compliance Checking**: Ensures adherence to security and coding standards\n\n**Performance Optimization:**\n\n- **Concurrent Processing**: Async task execution for improved throughput\n- **Intelligent Caching**: Context and response caching to reduce API costs\n- **Resource Management**: Adaptive rate limiting and circuit breaker patterns\n- **Cost Control**: Budget monitoring with automatic throttling and alerts\n\n**Learning \u0026 Adaptation:**\n\n- **Success Pattern Recognition**: Identifies and reuses successful generation patterns\n- **Failure Analysis**: Learns from failed attempts to improve future generations\n- **Performance Tuning**: Continuous optimization based on metrics and feedback\n- **Adaptive Strategies**: Dynamic adjustment of generation approaches based on success rates\n\nThis unified system represents a significant advancement in AI-powered incident response, providing a complete end-to-end solution for automated code generation, validation, and deployment in SRE workflows.\n\n## Log Ingestion System\n\nThe Cloud SRE Agent features a **comprehensive log ingestion system** that provides enterprise-grade log processing capabilities with full backward compatibility. 
This system offers pluggable adapters, comprehensive monitoring, and production-ready resilience patterns.\n\n### Log Ingestion Architecture\n\n```mermaid\ngraph TB\n    subgraph \"Log Ingestion System\"\n        LM[Log Manager\u003cbr/\u003eOrchestration] --\u003e |Multi-Source| GCP[GCP Pub/Sub\u003cbr/\u003eAdapter]\n        LM --\u003e |Multi-Source| K8S[Kubernetes\u003cbr/\u003eAdapter]\n        LM --\u003e |Multi-Source| FS[File System\u003cbr/\u003eAdapter]\n        LM --\u003e |Multi-Source| AWS[AWS CloudWatch\u003cbr/\u003eAdapter]\n\n        GCP --\u003e |LogEntry| LP[Log Processor\u003cbr/\u003eStandardization]\n        K8S --\u003e |LogEntry| LP\n        FS --\u003e |LogEntry| LP\n        AWS --\u003e |LogEntry| LP\n\n        LP --\u003e |Structured Logs| AI[AI Analysis\u003cbr/\u003ePipeline]\n    end\n\n    subgraph \"Monitoring \u0026 Observability\"\n        MM[Monitoring Manager] --\u003e MC[Metrics Collector\u003cbr/\u003eReal-time Metrics]\n        MM --\u003e HC[Health Checker\u003cbr/\u003eComponent Health]\n        MM --\u003e PM[Performance Monitor\u003cbr/\u003eProcessing Analytics]\n        MM --\u003e AM[Alert Manager\u003cbr/\u003eIntelligent Alerts]\n\n        LM --\u003e |Health Status| MM\n        LP --\u003e |Processing Metrics| MM\n        GCP --\u003e |Adapter Metrics| MM\n        K8S --\u003e |Adapter Metrics| MM\n    end\n\n    subgraph \"Resilience \u0026 Reliability\"\n        CB[Circuit Breakers] --\u003e GCP\n        CB --\u003e K8S\n        CB --\u003e FS\n        CB --\u003e AWS\n\n        BP[Backpressure Manager] --\u003e LP\n        RT[Retry Logic] --\u003e GCP\n        RT --\u003e K8S\n        RT --\u003e FS\n        RT --\u003e AWS\n    end\n\n    classDef ingestion fill:#e3f2fd,stroke:#1976d2,stroke-width:2px\n    classDef monitoring fill:#f1f8e9,stroke:#689f38,stroke-width:2px\n    classDef resilience fill:#fff3e0,stroke:#f57c00,stroke-width:2px\n\n    class LM,GCP,K8S,FS,AWS,LP ingestion\n    class MM,MC,HC,PM,AM 
monitoring\n    class CB,BP,RT resilience\n```\n\n### Log Ingestion Capabilities\n\n**Multi-Source Log Ingestion:**\n\n- **GCP Pub/Sub**: Production-ready Google Cloud Pub/Sub integration with flow control and acknowledgment management\n- **Kubernetes**: Container log collection with namespace filtering and pod-level granularity\n- **File System**: Local and remote file monitoring with rotation support and pattern matching\n- **AWS CloudWatch**: Amazon CloudWatch Logs integration with log group and stream management\n- **Extensible Architecture**: Easy addition of log sources through the adapter pattern\n\n**Enterprise-Grade Monitoring:**\n\n- **Real-time Metrics**: Processing rates, error counts, latency tracking, and throughput analysis\n- **Health Monitoring**: Component-level health checks with automated status reporting\n- **Performance Analytics**: Resource utilization tracking and bottleneck identification\n- **Intelligent Alerting**: Configurable alerts with escalation policies and notification channels\n\n**Production-Ready Resilience:**\n\n- **Circuit Breakers**: Prevent cascade failures and protect downstream systems\n- **Backpressure Management**: Handle high-volume log streams without overwhelming the system\n- **Automatic Retries**: Resilient error handling with exponential backoff and jitter\n- **Graceful Degradation**: Fallback mechanisms and graceful shutdown procedures\n\n**Unified Configuration:**\n\n- **Single Configuration System**: Centralized configuration for all log sources and adapters\n- **Runtime Updates**: Dynamic configuration changes without system restarts\n- **Validation**: Comprehensive configuration validation with detailed error reporting\n- **Environment Integration**: Seamless integration with environment variables and secrets management\n\n### Configuration Strategy\n\nThe log ingestion system is designed for **zero-downtime deployment** with full backward compatibility:\n\n**Feature Flag Control:**\n\n```bash\n# Enable log 
ingestion system\nexport USE_LOG_INGESTION_SYSTEM=true\n\n# Enable comprehensive monitoring\nexport ENABLE_MONITORING=true\n\n# Keep legacy fallback enabled\nexport ENABLE_LEGACY_FALLBACK=true\n```\n\n**Deployment Phases:**\n\n1. **Phase 1**: Deploy with the log ingestion system disabled and monitoring enabled\n2. **Phase 2**: Enable the log ingestion system for non-critical services\n3. **Phase 3**: Gradual expansion based on confidence and metrics\n4. **Phase 4**: Full deployment and deprecation of the legacy system\n\n**Safety Features:**\n\n- **Automatic Fallback**: Seamless fallback to the legacy system if the log ingestion system fails\n- **Health Monitoring**: Real-time health checks and status reporting\n- **Performance Tracking**: Comprehensive metrics for deployment validation\n- **Rollback Capability**: Instant rollback via environment variable changes\n\n### Benefits\n\n**Operational Excellence:**\n\n- **Observability**: Real-time monitoring and alerting capabilities\n- **Improved Reliability**: Circuit breakers, retries, and graceful error handling\n- **Better Performance**: Optimized processing with backpressure management\n- **Simplified Operations**: Unified configuration and monitoring interface\n\n**Future-Proof Architecture:**\n\n- **Multi-Cloud Ready**: Support for GCP, AWS, and hybrid environments\n- **Container Native**: Kubernetes integration with pod and namespace awareness\n- **Extensible Design**: Easy addition of log sources and processing capabilities\n- **Scalable Processing**: Horizontal scaling with load balancing and distribution\n\n**Cost Optimization:**\n\n- **Efficient Processing**: Optimized resource utilization and processing patterns\n- **Smart Monitoring**: Cost-effective monitoring with configurable sampling\n- **Intelligent Caching**: Response caching and similarity matching for API cost reduction\n- **Resource Management**: Adaptive rate limiting and circuit breaker patterns\n\n## Key Features\n\n- **🚀 Log Ingestion System:** Enterprise-grade log processing with pluggable adapters, 
comprehensive monitoring, and production-ready resilience patterns. Supports GCP Pub/Sub, Kubernetes, File System, and AWS CloudWatch with zero-downtime deployment.\n- **📊 Comprehensive Monitoring:** Real-time metrics collection, health checks, performance monitoring, and intelligent alerting with configurable escalation policies and notification channels.\n- **🛡️ Production-Ready Resilience:** Circuit breakers, backpressure management, automatic retries, and graceful degradation with seamless fallback to legacy systems.\n- **🔧 Feature Flag Configuration:** Safe, gradual rollout with environment variable controls and automatic fallback capabilities for risk-free deployment.\n- **Unified Code Generation System:** Complete AI-powered code generation pipeline with specialized generators, validation, and learning capabilities for automated remediation.\n- **AI Pattern Detection:** Multi-layer analysis engine with AI models for intelligent pattern recognition, confidence scoring, and context-aware analysis.\n- **ML Pattern Refinement:** Advanced AI pipeline with quality validation, PII sanitization, smart caching, and cost optimization for production-ready AI analysis.\n- **Dynamic Prompt Generation:** Revolutionary AI-powered prompt generation system with meta-prompt optimization, context-aware templates, and adaptive strategy selection for superior code generation.\n- **Configuration Management:** Modern, type-safe configuration system with Pydantic validation, environment variable integration, hot reloading, and comprehensive CLI tools for validation and management.\n- **Intelligent Log Analysis:** Leverages multiple AI models (fast models for speed, advanced models for accuracy) with dynamic prompting and few-shot learning capabilities.\n- **Proactive Incident Detection:** Identifies 7+ distinct failure patterns with AI-powered classification and confidence scoring across 15+ quantitative factors.\n- **Code-Context Integration:** Automated analysis incorporating 
recent commits, dependencies, and repository structure for more accurate remediation.\n- **Cost-Optimized AI Operations:** Intelligent response caching with similarity matching, reducing API costs by up to 60% while maintaining accuracy.\n- **Quality Assurance Pipeline:** Pre-processing validation, PII sanitization, and performance monitoring ensuring reliable AI analysis.\n- **Automated Remediation:** Generates and submits GitHub Pull Requests with AI context analysis and code repository integration.\n- **Multi-Service \u0026 Multi-Repository Monitoring:** Designed to monitor logs from various services and manage remediation across different GitHub repositories.\n- **Dynamic Baseline Tracking:** Maintains adaptive baselines for anomaly detection with AI-powered threshold evaluation.\n- **Built-in Resilience:** Incorporates robust resilience patterns (circuit breakers, retries, bulkheads, rate limiting) with AI-aware error handling.\n- **Structured Observability:** Employs structured logging with AI interaction tracking for comprehensive operational visibility.\n\n## 4-Layer Pattern Detection System\n\nThe agent implements a sophisticated pattern detection engine that processes log streams through four distinct analytical layers. This system enables proactive incident detection and reduces response times by identifying emerging issues before they escalate.\n\n### Layer 1: Time Window Management\n\nThe system accumulates incoming log entries into configurable sliding time windows. Each window maintains temporal boundaries and automatically processes accumulated logs when the window expires. 
This approach enables the analysis of log patterns across time intervals rather than processing individual log entries in isolation.\n\n**Technical Implementation:**\n\n- Configurable window duration (default: 5 minutes)\n- Automatic log aggregation by timestamp\n- Service-based log grouping within windows\n- Memory-efficient window rotation and cleanup\n\n### Layer 2: Smart Threshold Evaluation\n\nMultiple threshold types evaluate each time window against dynamic baselines and absolute limits. The system maintains historical baselines and compares current metrics against these learned patterns to detect deviations.\n\n**Threshold Types:**\n\n- **Error Frequency:** Absolute error count thresholds with service grouping\n- **Error Rate:** Percentage increase from rolling baseline averages\n- **Service Impact:** Multi-service failure detection across correlated services\n- **Severity Weighted:** Weighted scoring system based on log severity levels\n- **Cascade Failure:** Cross-service correlation analysis within time windows\n\n**Baseline Tracking:**\n\n- Rolling window baselines (configurable history depth)\n- Service-specific baseline calculation\n- Automatic baseline adaptation over time\n\n### Layer 3: Pattern Classification\n\nWhen thresholds trigger, the system applies classification algorithms to identify specific failure patterns. 
Each pattern type has distinct characteristics and triggers different remediation approaches.\n\n**Detected Pattern Types:**\n\n- **Cascade Failure:** Sequential failures across dependent services\n- **Service Degradation:** Performance degradation within a single service\n- **Traffic Spike:** Volume-induced system stress and failures\n- **Configuration Issue:** Deployment or configuration-related problems\n- **Dependency Failure:** External service or dependency problems\n- **Resource Exhaustion:** Memory, CPU, or storage capacity issues\n- **Sporadic Errors:** Random distributed failures without clear correlation\n\n### Layer 4: Confidence Scoring\n\nA quantitative confidence assessment system evaluates pattern matches using 15+ measurable factors. This scoring system provides numerical confidence levels and detailed explanations for each classification decision.\n\n**Confidence Factors:**\n\n- **Temporal Analysis:** Time concentration, correlation, onset patterns (rapid vs gradual)\n- **Service Impact:** Service count, distribution uniformity, cross-service correlation\n- **Error Characteristics:** Frequency, severity distribution, type consistency, message similarity\n- **Historical Context:** Baseline deviation, trend analysis, seasonal patterns\n- **External Factors:** Dependency health, resource utilization, deployment timing\n\n**Scoring Process:**\n\n- Weighted factor combination with configurable rules per pattern type\n- Decay functions (linear, exponential, logarithmic) for factor processing\n- Threshold-based factor filtering\n- Normalized confidence scores (0.0 to 1.0) with categorical levels (VERY_LOW to VERY_HIGH)\n\n**Output:**\n\n- Overall confidence score with factor breakdown\n- Human-readable explanations for classification decisions\n- Raw factor values for debugging and tuning\n\nThis multi-layer approach reduces false positives while maintaining high sensitivity to genuine incidents. 
The confidence scoring system provides transparency into classification decisions, enabling operators to understand and tune detection behavior based on their specific environment characteristics.\n\n## ML Pattern Refinement System\n\nThe ML Pattern Refinement System enhances the 4-layer pattern detection with advanced AI capabilities, providing intelligent analysis of incident patterns and automated code-context integration for more accurate remediation.\n\n### ML Pattern Refinement Components\n\n**AI-Powered Analysis Pipeline:**\n\n- **PromptEngine**: Advanced prompt management with structured templates and few-shot learning capabilities across multiple AI providers\n- **PatternDetector**: Ensemble pattern detection combining rule-based and AI-driven classification using various AI models\n- **ResponseCache**: Intelligent response caching with similarity-based matching to optimize API costs across providers\n- **PatternContextExtractor**: Context extraction from incident data with code repository integration\n\n**Quality Assurance \u0026 Performance:**\n\n- **LogQualityValidator**: Pre-processing validation ensuring high-quality input data for AI analysis\n- **LogSanitizer**: Sensitive data sanitization removing PII, tokens, and credentials before AI processing\n- **ModelPerformanceMonitor**: Real-time monitoring of AI model performance and drift detection\n- **CostTracker**: API usage monitoring with budget controls and cost optimization\n\n**Resilience \u0026 Reliability:**\n\n- **AdaptiveRateLimiter**: Smart rate limiting with circuit breaker patterns for API stability\n- Structured output validation using Pydantic schemas for consistent AI responses\n- Comprehensive error handling and fallback mechanisms\n\n### ML Pattern Refinement Capabilities\n\n```mermaid\ngraph TB\n    subgraph \"ML Pattern Refinement Pipeline\"\n        LS[Log Stream] --\u003e LQV[Log Quality\u003cbr/\u003eValidator]\n        LQV --\u003e LSan[Log Sanitizer\u003cbr/\u003ePII Removal]\n  
      LSan --\u003e PCE[Pattern Context\u003cbr/\u003eExtractor]\n        PCE --\u003e |Context Data| GEPD[AI Pattern\u003cbr/\u003eDetector]\n\n        subgraph \"AI Processing Layer\"\n            GEPD --\u003e GPE[AI Prompt\u003cbr/\u003eEngine]\n            GPE --\u003e |Structured Prompts| GF[Fast AI Models\u003cbr/\u003eFast Classification]\n            GPE --\u003e |Deep Analysis| GP[Advanced AI Models\u003cbr/\u003eDetailed Analysis]\n        end\n\n        subgraph \"Performance \u0026 Cost Management\"\n            GF --\u003e GRC[Response Cache\u003cbr/\u003eSimilarity Matching]\n            GP --\u003e GRC\n            GRC --\u003e CT[Cost Tracker\u003cbr/\u003eBudget Control]\n            CT --\u003e MPM[Performance\u003cbr/\u003eMonitor]\n        end\n\n        subgraph \"Code Integration\"\n            PCE --\u003e |Repository Data| CCE[Code Context\u003cbr/\u003eExtractor]\n            CCE --\u003e |Static Analysis| GPE\n        end\n    end\n\n    classDef aiComponent fill:#e1f5fe,stroke:#01579b,stroke-width:2px\n    classDef qualityComponent fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px\n    classDef performanceComponent fill:#e8f5e8,stroke:#388e3c,stroke-width:2px\n\n    class GPE,GEPD,GF,GP,GRC aiComponent\n    class LQV,LSan,PCE,CCE qualityComponent\n    class CT,MPM performanceComponent\n```\n\n**Pattern Analysis:**\n\n- Multi-provider AI strategy leveraging both Fast AI models (speed) and Balanced AI models (accuracy) from various providers\n- Similarity-based response caching reducing API costs by up to 60% while maintaining accuracy\n- Context-aware analysis incorporating recent code changes, dependencies, and repository structure\n- Structured output schemas ensuring consistent, validated AI responses\n\n**Quality \u0026 Cost Controls:**\n\n- Pre-processing validation filtering low-quality logs before expensive AI analysis\n- Intelligent caching system matching similar incident contexts to avoid redundant API calls\n- Real-time cost 
monitoring with configurable budget limits and alerts\n- Performance drift detection identifying model degradation over time\n\n**Security \u0026 Compliance:**\n\n- Comprehensive PII sanitization removing sensitive data (emails, IPs, tokens, API keys)\n- Pattern-based credential detection and replacement before AI processing\n- Audit logging for all AI interactions and data transformations\n\nThis ML enhancement layer transforms raw log analysis into intelligent, context-aware incident detection with automated code repository integration using multiple AI providers, significantly improving remediation accuracy while optimizing operational costs.\n\n## Documentation\n\nFor detailed information on the Cloud SRE Agent, please refer to the following documentation sections:\n\n- [**Quick Start Guide**](docs/QUICKSTART.md): Get the agent up and running in 15 minutes.\n- [**Architecture Overview**](docs/ARCHITECTURE.md): Understand the core components and data flow of the agent.\n- [**🚀 Log Ingestion System Guide**](docs/LOG_INGESTION_SYSTEM_GUIDE.md): Complete guide to the enterprise-grade log ingestion system with feature flags, monitoring, and zero-downtime deployment.\n- [**Unified Code Generation System**](docs/UNIFIED_CODE_GENERATION_SYSTEM.md): Comprehensive guide to the complete AI-powered code generation pipeline.\n- [**ML Pattern Refinement System**](docs/ML_PATTERN_REFINEMENT.md): Comprehensive guide to the AI enhancement layer.\n- [**Dynamic Prompt Generation System**](docs/DYNAMIC_PROMPT_GENERATION_SYSTEM.md): Deep dive into the revolutionary dynamic prompt generation capabilities.\n- [**GCP Infrastructure Setup Guide**](docs/GCP_SETUP.md): Instructions for setting up necessary Google Cloud infrastructure.\n- [**Setup and Installation**](docs/SETUP_INSTALLATION.md): A comprehensive guide to getting the project up and running.\n- [**Configuration Guide**](docs/CONFIGURATION.md): Learn how to customize the agent's behavior with the type-safe 
configuration system.\n- [**Deployment Guide**](docs/DEPLOYMENT.md): Instructions for deploying the agent to Google Cloud Run and other environments.\n- [**Multi-Environment Guide**](docs/ENVIRONMENTS.md): Strategies for managing the agent across different environments.\n- [**Security Guide**](docs/SECURITY.md): Best practices and considerations for securing the agent.\n- [**Performance Tuning Guide**](docs/PERFORMANCE.md): Recommendations for optimizing agent performance and cost.\n- [**Operations Runbook**](docs/OPERATIONS.md): Guidelines for operating, monitoring, and maintaining the agent.\n- [**Logging and Flow Tracking**](docs/LOGGING.md): Complete guide to the flow tracking system and structured logging format.\n- [**Error Handling System**](docs/ERROR_HANDLING_SYSTEM.md): Comprehensive guide to the robust error handling, circuit breakers, retry mechanisms, and graceful degradation.\n- [**Error Handling Quick Reference**](docs/ERROR_HANDLING_QUICK_REFERENCE.md): Quick reference guide for developers using the error handling system.\n- [**Troubleshooting Guide**](docs/TROUBLESHOOTING.md): Systematic approaches for debugging issues using flow tracking and execution path tracing.\n- [**Development Guide**](docs/DEVELOPMENT.md): Information for contributors, including testing, code style, and contributing.\n\n## Getting Started (Quick Overview)\n\nTo quickly get started, ensure you have Python 3.12+ and `uv` installed. Clone the repository, install dependencies with `uv sync`, authenticate your `gcloud` CLI, and set your `GITHUB_TOKEN` environment variable. Then, explore `config/config.yaml` to define your monitoring services.\n\n**Log Ingestion System:** The agent features a comprehensive log ingestion system with enterprise-grade monitoring and resilience. 
Enable it with:\n\n```bash\nexport USE_LOG_INGESTION_SYSTEM=true\nexport ENABLE_MONITORING=true\nexport ENABLE_LEGACY_FALLBACK=true\n```\n\nYou can run the agent locally with `python main.py` or deploy it to Cloud Run using the provided `deploy.sh` script. The system provides seamless deployment with full backward compatibility.\n\n## Contributing\n\nWe welcome contributions! Please see the [Development Guide](docs/DEVELOPMENT.md) for details on how to get involved.\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Favivl%2Fcloud-sre-agent","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Favivl%2Fcloud-sre-agent","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Favivl%2Fcloud-sre-agent/lists"}