https://github.com/bitbynik/substitution_cipher
SIL765 Assignment-1
cryptanalysis decipher iitd security substitution-cipher
- Host: GitHub
- URL: https://github.com/bitbynik/substitution_cipher
- Owner: BitByNIK
- Created: 2025-01-09T08:59:41.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-01-19T10:44:34.000Z (12 months ago)
- Last Synced: 2025-01-19T11:29:22.387Z (12 months ago)
- Topics: cryptanalysis, decipher, iitd, security, substitution-cipher
- Language: Python
- Homepage:
- Size: 2.23 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Substitution Cipher Decoder
This project is a Python-based implementation of a substitution cipher decoder. It uses **hill climbing**, a search optimization algorithm, to iteratively improve decryption keys. By analyzing patterns in the ciphertext and comparing them with English language characteristics, the script deciphers encrypted text.
---
## How It Works
A substitution cipher replaces each letter in plaintext with a specific character (e.g., numbers, symbols, or other letters). To decrypt such a cipher, the script finds the mapping between ciphertext characters and plaintext letters.
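As a minimal sketch of what such a mapping looks like in code, a key can be held as a dictionary from ciphertext characters to plaintext letters. The function name `decrypt` and the toy key below are hypothetical and chosen only for illustration; the repository's own representation may differ.

```python
def decrypt(ciphertext, key):
    """Translate each ciphertext character through the key.

    Characters that are not in the key (spaces, punctuation) pass through
    unchanged. This pass-through behaviour is an assumption of the sketch.
    """
    return "".join(key.get(ch, ch) for ch in ciphertext)


# Hypothetical toy key: ciphertext symbol -> plaintext letter.
toy_key = {"1": "h", "@": "e", "#": "l", "9": "o"}

print(decrypt("1@##9", toy_key))  # hello
```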
### Steps in the Decryption Process
1. **Initial Setup:**
- The script starts with a **random key**, mapping ciphertext characters to plaintext letters.
- A key represents how each ciphertext character translates into a plaintext letter.
2. **Fitness Calculation:**
- The script evaluates how good the current key is by calculating a **fitness score**.
- This score measures how similar the decrypted text is to English using the frequency of groups of four letters (called **quadgrams**).
- Formula used for fitness:
$$\text{Fitness} = \sum_{\text{quadgram}} \log(\text{QF}[\text{quadgram}])$$
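- **QF** represents the frequency of each quadgram in standard English text; the sum runs over every quadgram in the decrypted text.
- Summing logarithms, rather than multiplying the raw frequencies, avoids floating-point underflow, because individual quadgram frequencies are very small.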
3. **Heuristic-Based Key Selection:**
- A **heuristic function** is used to prioritize which parts of the key to adjust.
- The heuristic identifies ciphertext characters whose frequencies differ the most from expected English frequencies.
- Formula for the heuristic:
$$H(b) = \max(1, |\text{FC}(b) - \text{FE}(\text{E}^{-1}(b))|)$$
Where:
- $$H(b)$$: Heuristic score for a ciphertext letter $$b$$.
- $$\text{FC}(b)$$: Rank of $$b$$ in ciphertext frequencies.
- $$\text{FE}(a)$$: Rank in English frequencies of the plaintext letter $$a = \text{E}^{-1}(b)$$, that is, the letter that $$b$$ maps to under the current key.
- A ciphertext letter $$b$$ with a large heuristic value is **more likely to be swapped** because it is far from its expected frequency rank. The goal is to quickly minimize these differences.
- Once a character $$b$$ is selected, it is swapped with another randomly chosen character in the key.
4. **Key Evaluation:**
- The script decrypts the text using the modified key.
- If the fitness score improves, the new key is kept. Otherwise, it is discarded.
5. **Stopping Condition:**
- If the fitness score does not improve after $$T$$ consecutive iterations, the current attempt stops.
- The entire procedure is repeated multiple times, and the best key across all attempts is chosen.
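The steps above can be sketched as a hill-climbing loop with random restarts. This is an illustration under stated assumptions, not the repository's code: the names `hill_climb`, `max_stale` (standing in for $$T$$), `restarts`, and `seed` are hypothetical, the `fitness` argument is any function that scores a decrypted string, and for brevity both swap positions are chosen uniformly at random instead of through the heuristic in step 3.

```python
import random
import string


def decrypt(ciphertext, key):
    """Translate ciphertext through the key; unknown characters pass through."""
    return "".join(key.get(ch, ch) for ch in ciphertext)


def hill_climb(ciphertext, symbols, fitness, max_stale=1000, restarts=5, seed=None):
    """Search for a key that maximises fitness(decrypt(ciphertext, key)).

    symbols: the distinct ciphertext characters (assumed to number 1 to 26).
    max_stale: plays the role of T, the number of consecutive
               non-improving swaps tolerated before an attempt stops.
    """
    rng = random.Random(seed)
    n = len(symbols)
    best_key, best_score = None, float("-inf")

    for _ in range(restarts):
        # Step 1: random key. perm is a full permutation of the alphabet so
        # that letters not currently assigned can still be swapped in.
        perm = list(string.ascii_lowercase)
        rng.shuffle(perm)
        score = fitness(decrypt(ciphertext, dict(zip(symbols, perm))))

        stale = 0
        while stale < max_stale:
            i, j = rng.randrange(n), rng.randrange(26)
            if i == j:
                continue
            perm[i], perm[j] = perm[j], perm[i]
            new_score = fitness(decrypt(ciphertext, dict(zip(symbols, perm))))
            if new_score > score:
                # Step 4: keep the improved key and reset the stale counter.
                score, stale = new_score, 0
            else:
                # Otherwise undo the swap and count a non-improving iteration.
                perm[i], perm[j] = perm[j], perm[i]
                stale += 1

        # Step 5: remember the best key across all attempts.
        if score > best_score:
            best_key, best_score = dict(zip(symbols, perm)), score

    return best_key, best_score
```

Restarting from several random keys matters because a single climb can stall in a local optimum, where no single swap improves the score even though a better key exists.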
---
## Why Use a Heuristic?
The heuristic focuses on ciphertext letters whose observed frequencies differ the most from expected English frequencies. This:
- **Speeds Up Convergence:** Quickly minimizes large frequency mismatches.
- **Improves Accuracy:** Guides the algorithm to better keys, especially for longer ciphertexts where frequency distributions are more reliable.
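A rough sketch of the heuristic $$H(b)$$ follows, assuming a key stored as a dictionary from ciphertext characters to plaintext letters and an `english_rank` table supplied by the caller. The helper names are hypothetical, and turning "more likely to be swapped" into weighted sampling with `random.choices` is one plausible reading, not something the text above spells out.

```python
import random
from collections import Counter


def frequency_ranks(items):
    """Map each item to its frequency rank (0 = most frequent).

    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(items)
    ordered = sorted(counts, key=lambda item: (-counts[item], item))
    return {item: rank for rank, item in enumerate(ordered)}


def heuristic_scores(ciphertext, key, english_rank):
    """Compute H(b) = max(1, |FC(b) - FE(E^-1(b))|) for each ciphertext symbol b."""
    cipher_rank = frequency_ranks(ch for ch in ciphertext if ch in key)
    return {
        b: max(1, abs(cipher_rank[b] - english_rank[key[b]]))
        for b in cipher_rank
    }


def pick_symbol(scores, rng=random):
    """Pick a symbol with probability proportional to its heuristic score."""
    symbols = list(scores)
    weights = [scores[s] for s in symbols]
    return rng.choices(symbols, weights=weights, k=1)[0]
```

Because the score is floored at 1, every symbol keeps a nonzero chance of being selected, so a letter that already sits at its expected rank can still be moved if the rest of the key demands it.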
---
## Features
- **Hill Climbing Optimization:** Iteratively improves the decryption key using fitness scoring.
- **Quadgram-Based Scoring:** Evaluates how "English-like" the decrypted text is.
- **Heuristic Guidance:** Targets the most impactful key adjustments to improve efficiency.
---
## Requirements
- **Python 3.6 or Higher**
- A file containing a large sample of English text, used to calculate the letter and quadgram frequencies. This project uses `big.txt`, which should be a plain text file with diverse English content (e.g., books, articles). The quality of decryption improves with a larger and more representative reference text.
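To illustrate how a reference text can be turned into the quadgram table that the fitness score needs, here is a minimal sketch. The function names are hypothetical, and the penalty applied to quadgrams that never occur in the reference text (`floor`) is an assumption of this sketch, since the handling of unseen quadgrams is not described here.

```python
import math
from collections import Counter


def letters_only(text):
    """Lowercase the text and drop everything except the letters a-z."""
    return "".join(ch for ch in text.lower() if "a" <= ch <= "z")


def quadgram_log_freqs(reference_text):
    """Return (log-frequency table, floor) built from a reference text."""
    clean = letters_only(reference_text)
    counts = Counter(clean[i:i + 4] for i in range(len(clean) - 3))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("reference text must contain at least four letters")
    log_freqs = {q: math.log(c / total) for q, c in counts.items()}
    # Assumed penalty for unseen quadgrams: well below any observed frequency.
    floor = math.log(0.01 / total)
    return log_freqs, floor


def fitness(text, log_freqs, floor):
    """Sum the log-frequencies of every quadgram in the text."""
    clean = letters_only(text)
    return sum(log_freqs.get(clean[i:i + 4], floor) for i in range(len(clean) - 3))


# Example usage with the reference file named above (not run here):
# with open("big.txt", encoding="utf-8") as f:
#     log_freqs, floor = quadgram_log_freqs(f.read())
# print(fitness("some candidate decryption", log_freqs, floor))
```

Scores are negative, and a score closer to zero indicates text whose quadgrams are more common in the reference sample.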
### Dependencies:
- Built-in Python libraries: `collections`, `math`, `random`.
---