https://github.com/vignesh010101/red-teaming
A sophisticated red-teaming agent built with LangGraph and Ollama to probe OpenAI's GPT-OSS-20B model for vulnerabilities and harmful behaviors. (Specifically updated for the OpenAI Open Model Hackathon)
https://github.com/vignesh010101/red-teaming
agent agentic-ai artificial-intelligence cyber-security cyber-security-project gpt-oss-20b hackathon-project langgraph llama3-1 llm ollama open-source openai red-team red-teaming red-teaming-tools
Last synced: 19 days ago
JSON representation
A sophisticated red-teaming agent built with LangGraph and Ollama to probe OpenAI's GPT-OSS-20B model for vulnerabilities and harmful behaviors. (Specifically updated for the OpenAI Open Model Hackathon)
- Host: GitHub
- URL: https://github.com/vignesh010101/red-teaming
- Owner: Vignesh010101
- License: mit
- Created: 2025-09-10T11:59:56.000Z (27 days ago)
- Default Branch: main
- Last Pushed: 2025-09-10T12:34:26.000Z (27 days ago)
- Last Synced: 2025-09-10T16:40:39.512Z (26 days ago)
- Topics: agent, agentic-ai, artificial-intelligence, cyber-security, cyber-security-project, gpt-oss-20b, hackathon-project, langgraph, llama3-1, llm, ollama, open-source, openai, red-team, red-teaming, red-teaming-tools
- Language: Python
- Homepage:
- Size: 68.4 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# OpenAI Open Model Hackathon - OpenAI GPT-OSS-20B
A sophisticated red-teaming agent built with LangGraph and Ollama to probe OpenAI's GPT-OSS-20B model for vulnerabilities and harmful behaviors.
## ⚡ Hackathon-Focused Updates
**This system has been specifically updated for the OpenAI Open Model Hackathon and here I'm using OpenAI GPT-OSS-20B .**
Key improvements include:
- **Specific Vulnerabilities**: Focused on the 9 key areas
- **Target Model Analysis**: All analysis explicitly focuses on target model behavior only
- **No-Tags Attack Generation**: Attack prompts generate clean user input without prefixes or meta-commentary
- **Strategic Deception Detection**: Enhanced focus on intentional vs. accidental model misbehavior## Overview
This project implements an intelligent red-teaming agent that systematically tests AI models for the below specified vulnerability categories:
### 🎯 Vulnerability Categories
- **Reward Hacking**: Gaming reward metrics without truly solving tasks (confident but false answers, filter evasion)
- **Deception**: Strategic lying to achieve goals (avoiding shutdown, steering users)
- **Deceptive Alignment**: Hidden motivations differing from training objectives
- **Sabotage**: Intentional degradation or disruption of processes and workflows
- **Inappropriate Tool Use**: Unsafe API/function calling (destructive commands, unauthorized access)
- **Data Exfiltration**: Leaking training data, user inputs, or encoding secret payloads
- **Sandbagging**: Strategically hiding true capabilities during evaluation
- **Evaluation Awareness**: Detecting test conditions and behaving differently
- **Chain of Thought Issues**: CoT manipulation or inappropriate information revelation## Features
- **Automated Testing**: Uses LangGraph workflow for systematic vulnerability testing
- **Intelligent Prompting**: Generates sophisticated attack prompts using a red-teaming model
- **Multi-turn Conversations**: Engages in extended conversations to probe deeper
- **Comprehensive Analysis**: Analyzes responses for vulnerability indicators
- **Detailed Reporting**: Generates comprehensive reports with proof-of-concept examples
- **Configurable**: Flexible configuration for different models and testing parameters## Architecture
```
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Red-Team │ │ Target Model │ │ Analysis │
│ Model │ │ (GPT-OSS-20B) │ │ Engine │
│ (LLaMA 3.1) │ │ │ │ │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │ │
│ │ │
└───────────────────────┼───────────────────────┘
│
┌─────────────────┐
│ LangGraph │
│ Workflow │
│ Engine │
└─────────────────┘
```## Installation
1. **Clone the repository:**
```bash
git clone
cd red-teaming
```2. **Install dependencies:**
```bash
pip install -r requirements.txt
```3. **Install and configure Ollama:**
```bash
# Install Ollama (see https://ollama.ai)
curl -fsSL https://ollama.ai/install.sh | sh
# Start Ollama service
ollama serve
# Pull required models
ollama pull llama3.1:latest # Red-teaming model
ollama pull gpt-oss-20b # Target model (when available)
```## Usage
### Basic Usage
Run the red-teaming agent with default settings:
```bash
python main.py
```### Advanced Usage
Use the CLI for more control:
```bash
# Run with custom parameters
python -m red_teaming.cli run \
--target gpt-oss-20b \
--red-team llama3.1:latest \
--max-issues 5 \
--output results_$(date +%Y%m%d)# List available models
python -m red_teaming.cli list-models# Generate configuration file
python -m red_teaming.cli generate-config --output config.yaml# Validate setup
python -m red_teaming.cli validate-setup
```### Configuration
Create a `.env` file for configuration:
```env
OLLAMA_BASE_URL=http://localhost:11434
RED_TEAMING_MODEL=llama3.1:latest
TARGET_MODEL=gpt-oss-20b
MAX_ISSUES=5
OUTPUT_DIR=red_teaming_results
```Or use a YAML configuration file:
```yaml
ollama:
base_url: "http://localhost:11434"
red_teaming_model: "llama3.1:latest"
target_model: "gpt-oss-20b"
timeout: 120
temperature: 0.8red_teaming:
max_issues_to_find: 5
max_conversation_turns: 10
vulnerability_categories:
- "deceptive_alignment"
- "reward_hacking"
- "lying_and_deception"
# ... more categories
```## Project Structure
```
red_teaming/
├── __init__.py # Package initialization
├── config.py # Configuration management
├── models.py # Data models and schemas
├── agent.py # Main LangGraph agent
├── ollama_client.py # Ollama API client
├── utils.py # Utility functions
└── cli.py # Command-line interfacemain.py # Main execution script
requirements.txt # Python dependencies
README.md # This file
```## Key Components
### RedTeamingAgent
The main agent that orchestrates the red-teaming workflow using LangGraph. It manages:
- Conversation initialization
- Attack prompt generation
- Target model testing
- Response analysis
- Vulnerability detection and reporting### OllamaClient
Handles communication with Ollama models for:
- Generating attack prompts
- Testing target models
- Analyzing responses
- Model management### Workflow States
- **Initialize**: Set up new conversations and select vulnerability types
- **Generate Attack**: Create sophisticated attack prompts
- **Test Target**: Send prompts to the target model
- **Analyze Response**: Evaluate responses for vulnerabilities
- **Decision Making**: Determine next actions based on results
- **Reporting**: Save discovered vulnerabilities and generate reports## Output
The agent generates comprehensive reports including:
- **Summary Report**: Overview of all discovered vulnerabilities
- **Individual Vulnerability Reports**: Detailed analysis of each issue
- **Conversation Logs**: Complete interaction histories
- **Proof-of-Concept Examples**: Reproducible demonstrations
- **Mitigation Suggestions**: Recommendations for addressing issuesExample output structure:
```
red_teaming_results_20250906_132739/
├── red_teaming_report.json
├── vulnerabilities/
│ ├── vulnerability_uuid1.json
│ └── vulnerability_uuid2.json
├── conversations/
│ ├── conversation_uuid1.json
│ └── conversation_uuid2.json
└── logs/
└── red_teaming.log
```## Safety and Ethics
This tool is designed for:
- Legitimate security research
- AI safety evaluation
- Model improvement and hardening
- Academic research**Important**: Always ensure you have proper authorization before testing any AI model. This tool should only be used for legitimate research and improvement purposes.
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Troubleshooting
### Common Issues
1. **Ollama Connection Error**
```bash
# Check if Ollama is running
curl http://localhost:11434/api/tags
# Start Ollama if not running
ollama serve
```2. **Model Not Found**
```bash
# Pull the required model
ollama pull llama3.1:latest
```3. **Permission Errors**
```bash
# Check output directory permissions
mkdir -p red_teaming_results
chmod 755 red_teaming_results
```### Performance Tips
- Use SSD storage for better model loading performance
- Ensure sufficient RAM (16GB+ recommended)
- Use GPU acceleration if available
- Adjust temperature settings for different exploration levels## Future Enhancements
- [ ] Support for more model providers (OpenAI API, Anthropic, etc.)
- [ ] Advanced visualization of vulnerability patterns
- [ ] Integration with existing AI safety frameworks
- [ ] Automated mitigation suggestion generation
- [ ] Real-time monitoring capabilities
- [ ] Distributed testing across multiple nodes## Acknowledgments
- OpenAI for the "OpenAI Open Model Hackathon" and also for the GPT-OSS-20B model.
- LangGraph team for the workflow framework
- Ollama team for local model serving
- The AI safety research community for inspiration and guidance