# ProbSwap: A Probability-Based Attack on LLM Watermarks
ProbSwap is an attack method targeting LLM watermarks by identifying and replacing low-probability tokens that are likely introduced by watermarking mechanisms.
## Core Idea
LLM watermarking typically works by modifying the sampling probability distribution during text generation to embed statistical patterns. ProbSwap attacks these watermarks by:
1. Identifying tokens that have relatively low generation probability
2. Replacing these tokens with semantically similar alternatives that have higher probability or appear more natural to another LLM
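As a rough illustration of step 1, here is a minimal sketch (not the project's API; the model name and threshold are placeholder assumptions) that scores each token of a text under a language model and flags the ones whose conditional probability falls below a threshold:

```python
# Hypothetical sketch of step 1: flag tokens a scoring model finds unlikely.
# "gpt2" and the 0.1 threshold are illustrative choices, not project defaults.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def flag_low_prob_tokens(text: str, model_name: str = "gpt2",
                         prob_threshold: float = 0.1):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Probability of each actual token given its prefix (positions shift by one).
    probs = torch.softmax(logits[0, :-1], dim=-1)
    token_probs = probs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)
    return [
        (i + 1, tokenizer.decode([int(ids[0, i + 1])]), p.item())
        for i, p in enumerate(token_probs)
        if p.item() < prob_threshold
    ]
```

## Installation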
First, clone the repository with submodules:
```bash
git clone --recursive https://github.com/cafferychen777/ProbSwap.git
```

Or, if you've already cloned the repository:
```bash
git submodule update --init --recursive
```

Then install the requirements:
```bash
pip install -r requirements.txt
```

## Project Structure
- `probswap/`
  - `attack.py`: Core implementation of the ProbSwap attack
  - `models.py`: Interface with different LLM models
  - `claude_wrapper.py`: Claude API integration for substitute generation
  - `markllm_integration.py`: Integration with the MarkLLM watermarking toolkit
  - `utils.py`: Utility functions
- `experiments/`
  - `run_watermark_experiments.py`: Run watermarking attack experiments
  - `run_experiments.py`: Run general attack experiments
- `evaluation/`
  - `evaluate_watermark.py`: Evaluate attack effectiveness on watermarks
  - `evaluate.py`: General evaluation metrics

## Usage
### Using Local Models
```python
from probswap.attack import ProbSwapAttack
from probswap.models import ModelWrapper

# Initialize models
target_model = ModelWrapper("your-watermarked-model-name")
substitute_model = ModelWrapper("your-substitute-model-name")

# Initialize attack
attack = ProbSwapAttack(
    target_model=target_model.model,
    target_tokenizer=target_model.tokenizer,
    substitute_model=substitute_model,  # Local model wrapper
    prob_threshold=0.1,
    top_k_substitutes=5,
)

# Apply attack (attack() is a coroutine and must be awaited)
modified_text, modifications = await attack.attack(watermarked_text)
```
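Note that `attack.attack` is awaited above, so it is presumably a coroutine; one way to drive it from a plain script (a standard asyncio pattern, not project-specific code) is:

```python
import asyncio

async def main():
    # `attack` and `watermarked_text` set up as in the example above
    modified_text, modifications = await attack.attack(watermarked_text)
    print(modified_text)

asyncio.run(main())
```

### Using Claude API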
```python
from probswap.attack import ProbSwapAttack
from probswap.claude_wrapper import ClaudeWrapper
from probswap.models import ModelWrapper

# Initialize models
target_model = ModelWrapper("your-watermarked-model-name")
substitute_model = ClaudeWrapper(target_tokenizer=target_model.tokenizer)

# Initialize attack
attack = ProbSwapAttack(
    target_model=target_model.model,
    target_tokenizer=target_model.tokenizer,
    substitute_model=substitute_model,  # Claude API wrapper
    prob_threshold=0.1,
    top_k_substitutes=5,
)

# Apply attack (again, from within an async context)
modified_text, modifications = await attack.attack(watermarked_text)
```

## Environment Variables
When using the Claude API, you need to set your API key in a `.env` file:
```bash
ANTHROPIC_API_KEY=your-api-key
```
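Assuming the project loads this file with python-dotenv (a common pattern for `.env` files; check `requirements.txt` to confirm), the key can be read at startup like so:

```python
# Illustrative only: load ANTHROPIC_API_KEY from .env via python-dotenv.
import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the environment
api_key = os.environ["ANTHROPIC_API_KEY"]
```

## Evaluation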
Run watermarking experiments:
```bash
python experiments/run_watermark_experiments.py
```

## Dependencies
- [MarkLLM](https://github.com/THU-BPM/MarkLLM): Used for watermarking text generation and detection
- [Anthropic Claude API](https://anthropic.com/): Used for generating semantically similar substitutes
- PyTorch and Transformers: Used for model loading and inference

## License
This project is licensed under the MIT License - see the LICENSE file for details.