https://github.com/haaziq386/keystroke_ai
A biometric authentication system using keystroke dynamics with modified Manhattan distance metric. Implements digraph timing feature extraction and achieves 7.75% EER on free-text keystroke data. Course project for CSN-371 Artificial Intelligence.
- Host: GitHub
- URL: https://github.com/haaziq386/keystroke_ai
- Owner: Haaziq386
- Created: 2025-12-22T10:35:55.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-12-22T10:46:46.000Z (6 months ago)
- Last Synced: 2025-12-23T21:52:08.201Z (6 months ago)
- Topics: artificial-intelligence, authentication, biometric-authentication, data-processing, eer, equal-error-rate, feature-extraction, keystroke-dynamics, machine-learning, python
- Language: Python
- Homepage:
- Size: 6.12 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Keystroke Dynamics Authentication System
**Course Project:** CSN-371 Artificial Intelligence
**Instructor:** Prof. Pradumn K. Pandey
A biometric authentication system that identifies users based on their unique typing patterns using keystroke dynamics. This implementation uses digraph timing features and evaluates performance using the Equal Error Rate (EER) metric, based on research by Iapa & Cretu (2021).
## Overview
This project implements a keystroke dynamics authentication system that:
- Extracts timing features from keystroke data (digraphs)
- Uses modified Manhattan distance metric for improved accuracy
- Evaluates authentication performance using leave-one-out methodology
- Compares standard and modified distance metrics
- Analyzes the effect of feature selection on authentication accuracy
## Key Features
- **Digraph Feature Extraction**: Analyzes timing patterns between consecutive keystrokes
  - DU1: First key down to first key up (dwell time)
  - DU2: Second key down to second key up (dwell time)
  - DUtotal: First key down to second key up (total time)
- **Modified Manhattan Distance**: Implements weighted distance metric with reduced weight for DUtotal features (default: 1/3)
- **Multiple Normalization Techniques**:
  - Decimal scaling for standard Manhattan distance
  - Min-max scaling for modified Manhattan distance
- **Comprehensive Evaluation**:
  - Leave-one-out cross-validation
  - FAR (False Accept Rate) and FRR (False Reject Rate) calculations
  - EER (Equal Error Rate) computation
  - Visualization of error rates vs. thresholds
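The three digraph timings listed above can be derived from the press and release times of two consecutive keys. The following is a minimal sketch, not the repository's code: the `KeyPress` record and the `digraph_timings` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class KeyPress:
    """One completed keystroke (hypothetical record, not from the repository)."""
    key_code: int
    down: int  # press timestamp in ms
    up: int    # release timestamp in ms


def digraph_timings(first: KeyPress, second: KeyPress) -> dict:
    """Return DU1, DU2 and DUtotal for two consecutive keystrokes."""
    return {
        "DU1": first.up - first.down,       # dwell time of the first key
        "DU2": second.up - second.down,     # dwell time of the second key
        "DUtotal": second.up - first.down,  # first key down to second key up
    }


timings = digraph_timings(KeyPress(72, 1000, 1090), KeyPress(73, 1150, 1230))
print(timings)  # {'DU1': 90, 'DU2': 80, 'DUtotal': 230}
```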
## Project Structure
```
keystroke_AI/
├── data/
│   ├── raw/                    # Raw keystroke data files (user0001.txt - user0080.txt)
│   └── processed/              # Processed feature vectors
│       ├── all_features.csv
│       ├── decimal_vectors.csv
│       └── minmax_vectors.csv
├── src/
│   ├── data_processing.py      # Data loading and digraph extraction
│   ├── feature_extraction.py   # Feature vector creation
│   ├── metrics.py              # Distance metric implementations
│   └── authentication.py       # Authentication logic and evaluation
├── main.py                     # Main execution script
└── README.md                   # This file
```
## Getting Started
### Prerequisites
```bash
pip install numpy pandas matplotlib
```
### Installation
1. Clone the repository:
```bash
git clone https://github.com/Haaziq386/keystroke_AI.git
cd keystroke_AI
```
2. Ensure your data is in the correct format in `data/raw/`:
   - Files named `user####.txt` (e.g., `user0001.txt`)
   - Format: `key_code event_type timestamp`
     - `key_code`: ASCII code of the key
     - `event_type`: 0 for press, 1 for release
     - `timestamp`: Milliseconds since epoch
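A line in this format can be read with a few lines of Python. This is a sketch under assumptions: fields are taken to be whitespace-separated integers, and the `KeyEvent` type and `parse_line` function are hypothetical names rather than part of the repository.

```python
from typing import NamedTuple


class KeyEvent(NamedTuple):
    """One raw event (hypothetical type, not from the repository)."""
    key_code: int
    event_type: int  # 0 = press, 1 = release
    timestamp: int   # milliseconds


def parse_line(line: str) -> KeyEvent:
    """Parse one `key_code event_type timestamp` line.

    Only the first three whitespace-separated fields are read, so any
    trailing text on the line is ignored.
    """
    key_code, event_type, timestamp = line.split()[:3]
    return KeyEvent(int(key_code), int(event_type), int(timestamp))


event = parse_line("86 0 435006")
print(event.key_code, event.event_type, event.timestamp)
```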
### Usage
Run the main script to process data and evaluate authentication:
```bash
python main.py
```
This will:
1. Process raw keystroke data files
2. Extract digraph features
3. Create and normalize feature vectors
4. Compare standard vs. modified Manhattan distance metrics
5. Evaluate the effect of different numbers of digraphs
6. Generate visualization plots
## Data Format
### Raw Data Format
Each user file contains keystroke events in the format:
```
key_code event_type timestamp
16 0 434889 # Key 16 pressed at time 434889
86 0 435006 # Key 86 pressed at time 435006
86 1 435146 # Key 86 released at time 435146
16 1 435221 # Key 16 released at time 435221
```
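The sample shows overlapping keys (key 16 is still held while key 86 is pressed and released), so press and release events have to be matched per key rather than in strict sequence. One possible way to do this is sketched below; `pair_events` is a hypothetical helper, and ignoring repeated presses of a key that is already held is an assumption, not documented behavior.

```python
def pair_events(events):
    """Match each press (event_type 0) with the next release (1) of the same key.

    `events` is an iterable of (key_code, event_type, timestamp) tuples in
    chronological order. Returns (key_code, press_time, release_time) tuples
    ordered by press time.
    """
    pending = {}  # key_code -> press timestamp still waiting for a release
    keystrokes = []
    for key_code, event_type, timestamp in events:
        if event_type == 0:
            # Assumption: a repeated press while the key is held is ignored.
            pending.setdefault(key_code, timestamp)
        elif key_code in pending:
            keystrokes.append((key_code, pending.pop(key_code), timestamp))
    return sorted(keystrokes, key=lambda k: k[1])


sample = [(16, 0, 434889), (86, 0, 435006), (86, 1, 435146), (16, 1, 435221)]
for key_code, down, up in pair_events(sample):
    print(key_code, "held for", up - down, "ms")
# 16 held for 332 ms
# 86 held for 140 ms
```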
### Feature Vector Format
Each segment is represented by timing features for the most common digraphs:
```
segment_id, user_id, [digraph]_DU1, [digraph]_DU2, [digraph]_DUtotal, ...
```
## Methodology
### 1. Data Processing
- Load raw keystroke data
- Match press and release events
- Segment into chunks of ~1000 keystrokes
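The segmentation step can be sketched as a simple slicing helper. The name `segment` is hypothetical, and dropping a trailing chunk shorter than `segment_size` is an assumption; the repository may treat the remainder differently.

```python
def segment(keystrokes, segment_size=1000):
    """Split a keystroke list into consecutive chunks of `segment_size`.

    Assumption: a trailing chunk shorter than `segment_size` is dropped.
    """
    full = len(keystrokes) - len(keystrokes) % segment_size
    return [keystrokes[i:i + segment_size] for i in range(0, full, segment_size)]


chunks = segment(list(range(2500)), segment_size=1000)
print(len(chunks), [len(c) for c in chunks])  # 2 [1000, 1000]
```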
### 2. Feature Extraction
- Identify the most common digraphs (default: 12)
- Calculate DU1, DU2, and DUtotal for each digraph
- Create feature vectors using median values per segment
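The selection and aggregation steps above might look like the following sketch. Both function names are hypothetical, digraphs are represented as `(first_key, second_key)` tuples for illustration, and returning `None` for a digraph that never occurs in a segment is an assumption about how gaps are handled.

```python
from collections import Counter
from statistics import median


def most_common_digraphs(digraphs, n=12):
    """Return the `n` most frequent (first_key, second_key) pairs."""
    return [pair for pair, _ in Counter(digraphs).most_common(n)]


def median_feature(values):
    """Median of one timing feature within a segment; None if never observed."""
    return median(values) if values else None


observed = [(72, 69), (84, 72), (72, 69), (69, 32), (72, 69), (84, 72)]
top = most_common_digraphs(observed, n=2)
print(top)                              # [(72, 69), (84, 72)]
print(median_feature([90, 110, 400]))   # 110
```

The median is less sensitive than the mean to occasional long pauses, such as the 400 ms value in the example.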
### 3. Normalization
- **Decimal Scaling**: For standard Manhattan distance
- **Min-Max Scaling**: For modified Manhattan distance
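For reference, the two normalization schemes can be written as below, using their textbook definitions. Whether the repository applies them per feature column or across the whole dataset is not stated here, so the helpers (hypothetical names) simply operate on one list of values.

```python
def decimal_scale(values):
    """Divide by the smallest power of ten that brings every |value| below 1."""
    peak = max(abs(v) for v in values)
    power = 0
    while peak / 10 ** power >= 1:
        power += 1
    return [v / 10 ** power for v in values]


def min_max_scale(values):
    """Map values linearly onto [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(decimal_scale([120, 85, 430]))   # each value divided by 1000
print(min_max_scale([100, 150, 200]))  # [0.0, 0.5, 1.0]
```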
### 4. Authentication
- Leave-one-out cross-validation
- Distance calculation between feature vectors
- Threshold-based classification
- FAR/FRR/EER computation
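A leave-one-out evaluation of this kind can be sketched as follows. This is an illustration under stated assumptions, not the repository's `leave_one_out_evaluation`: each held-out vector is compared with the mean of every user's remaining vectors (the template choice is assumed), and the helper names are hypothetical.

```python
def manhattan(a, b):
    """Standard Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def leave_one_out_scores(vectors):
    """Collect genuine and impostor distances.

    `vectors` is a list of (user_id, feature_vector) pairs. Each vector is
    held out in turn and compared with the mean of every user's remaining
    vectors (an assumed template choice).
    """
    genuine, impostor = [], []
    for i, (probe_user, probe) in enumerate(vectors):
        rest = [(u, v) for j, (u, v) in enumerate(vectors) if j != i]
        for user in sorted({u for u, _ in rest}):
            refs = [v for u, v in rest if u == user]
            template = [sum(col) / len(refs) for col in zip(*refs)]
            score = manhattan(probe, template)
            (genuine if user == probe_user else impostor).append(score)
    return genuine, impostor


data = [("a", [0.1, 0.2]), ("a", [0.2, 0.2]), ("b", [0.8, 0.9]), ("b", [0.9, 0.9])]
genuine, impostor = leave_one_out_scores(data)
print(max(genuine) < min(impostor))  # True: the two toy users are well separated
```

A probe is then accepted when its distance to the claimed user's template falls at or below a chosen threshold.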
## Performance Metrics
- **FAR (False Accept Rate)**: Percentage of impostor attempts incorrectly accepted
- **FRR (False Reject Rate)**: Percentage of genuine attempts incorrectly rejected
- **EER (Equal Error Rate)**: Point where FAR equals FRR (lower is better)
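Given lists of genuine and impostor distances, the three metrics can be computed as in the sketch below. The function names are hypothetical, and approximating the EER by the threshold with the smallest gap between FAR and FRR is one common choice, not necessarily the one used in this project.

```python
def error_rates(genuine, impostor, threshold):
    """FAR and FRR (as fractions) when distances <= threshold are accepted."""
    far = sum(d <= threshold for d in impostor) / len(impostor)
    frr = sum(d > threshold for d in genuine) / len(genuine)
    return far, frr


def equal_error_rate(genuine, impostor, thresholds):
    """Approximate the EER at the threshold where |FAR - FRR| is smallest."""
    def gap(t):
        far, frr = error_rates(genuine, impostor, t)
        return abs(far - frr)

    best = min(thresholds, key=gap)
    far, frr = error_rates(genuine, impostor, best)
    return (far + frr) / 2, best


genuine_d = [0.10, 0.15, 0.20, 0.45]   # toy distances, not project data
impostor_d = [0.32, 0.50, 0.60, 0.70]
eer, threshold = equal_error_rate(genuine_d, impostor_d, [i / 20 for i in range(21)])
print(f"EER ~ {eer:.2%} at threshold {threshold:.2f}")
```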
## Visualizations
The system generates several plots:
- `Standard_Manhattan_Distance.png`: FAR/FRR curves for standard metric
- `Modified_Manhattan_Distance.png`: FAR/FRR curves for modified metric
- `Manhattan_Distance_Comparison.png`: Side-by-side comparison
- `Digraph_Count_Effect.png`: EER vs. number of digraphs
## Configuration
### Adjustable Parameters
In `main.py`:
- `segment_size`: Number of keystrokes per segment (default: 1000)
- `num_digraphs`: Number of most common digraphs to use (default: 12)
In `authentication.py`:
- `du_total_weight`: Weight for DUtotal in modified distance (default: 1/3)
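To illustrate the effect of `du_total_weight`, here is a minimal sketch of a weighted Manhattan distance. It is not the repository's `modified_manhattan_distance`: the name `weighted_manhattan` is hypothetical, and the code assumes features are laid out per digraph as `[DU1, DU2, DUtotal, ...]`, so every third position is a DUtotal feature.

```python
def weighted_manhattan(v1, v2, du_total_weight=1 / 3):
    """Manhattan distance with DUtotal differences down-weighted.

    Assumes the layout [DU1, DU2, DUtotal, DU1, DU2, DUtotal, ...].
    """
    total = 0.0
    for i, (a, b) in enumerate(zip(v1, v2)):
        weight = du_total_weight if i % 3 == 2 else 1.0
        total += weight * abs(a - b)
    return total


a = [0.2, 0.3, 0.9]
b = [0.1, 0.5, 0.6]
print(weighted_manhattan(a, b))                       # about 0.4
print(weighted_manhattan(a, b, du_total_weight=1.0))  # about 0.6, the standard distance
```

Because DUtotal spans both dwell times, it partly repeats information already carried by DU1 and DU2; a weight below 1 reduces that double counting.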
## Module Reference
### `KeystrokeProcessor`
Handles raw data loading and preprocessing:
- `read_raw_file(user_id)`: Load data for a specific user
- `process_all_users()`: Process all users and extract digraphs
- `extract_digraph_features(events)`: Calculate timing features
### `KeystrokeFeatureExtractor`
Creates feature vectors:
- `identify_common_digraphs(all_features)`: Find most frequent digraphs
- `create_feature_vectors(all_features)`: Build feature vectors
### `KeystrokeMetrics`
Implements distance metrics:
- `manhattan_distance(v1, v2)`: Standard Manhattan distance
- `modified_manhattan_distance(v1, v2)`: Weighted Manhattan distance
### `KeystrokeAuthenticator`
Performs authentication and evaluation:
- `leave_one_out_evaluation(feature_vectors)`: Cross-validation
- `calculate_error_rates(evaluation_results, thresholds)`: Compute FAR/FRR/EER
## Experimental Results
The modified Manhattan distance metric with reduced DUtotal weight typically achieves:
- Lower EER compared to standard Manhattan distance
- Better discrimination between genuine and impostor attempts
- Optimal performance with ~12 most common digraphs
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
This project is open source and available under the MIT License.
## Acknowledgments
This implementation is based on research in keystroke dynamics authentication, in particular the work of Iapa & Cretu (2021) on digraph timing features and modified distance metrics for improved accuracy.
## Contact
For questions or feedback, please open an issue on GitHub.