https://github.com/a904guy/emolanguage

🤖 Semantic emoji encoder using LLMs to transform text into meaningful emoji sequences with grammar preservation and reversible decoding. UI-generated communication system inspired by Pantheon.
https://github.com/a904guy/emolanguage

artificial-intelligence communication emoji encoding linguistics llm machine-learning morphology natural-language-processing nlp pantheon python semantic-encoding text-processing

Last synced: 25 days ago
JSON representation

🤖 Semantic emoji encoder using LLMs to transform text into meaningful emoji sequences with grammar preservation and reversible decoding. UI-generated communication system inspired by Pantheon.

Host: GitHub
URL: https://github.com/a904guy/emolanguage
Owner: a904guy
Created: 2025-08-16T06:06:39.000Z (10 months ago)
Default Branch: main
Last Pushed: 2025-08-16T06:25:57.000Z (10 months ago)
Last Synced: 2025-08-16T08:25:12.730Z (10 months ago)
Topics: artificial-intelligence, communication, emoji, encoding, linguistics, llm, machine-learning, morphology, natural-language-processing, nlp, pantheon, python, semantic-encoding, text-processing
Language: Python
Homepage:
Size: 569 KB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# 🤖 EmoLanguage: Semantic Emoji Encoding System

A sophisticated language encoding system that transforms English text into semantically meaningful emoji sequences using Large Language Models (LLMs). Inspired by the Netflix series **[Pantheon](https://www.netflix.com/title/81937398)**, this project creates a reversible emoji-based communication protocol with advanced morphological transformation detection and intelligent collision resolution.

**🧠 AI-Generated Project**: Much like the uploaded intelligences in **[Pantheon](https://www.netflix.com/title/81937398)** who developed their own symbolic communication methods, this entire project was conceived, designed, and implemented through **artificial intelligence collaboration**. Every line of code, architectural decision, and algorithmic approach emerged from AI reasoning and iterative development—creating a meta-commentary on machine intelligence developing communication systems for other machine intelligences.

**🎭 From Fiction to Reality**: While we don't yet have the uploaded human consciousness technology from **[Pantheon](https://www.netflix.com/title/81937398)**, this project demonstrates how current AI can replicate the conceptual framework of advanced digital beings creating their own semantic languages. The irony is intentional: an AI system building tools for AI communication, mirroring the show's premise of uploaded minds developing new forms of expression beyond human linguistic limitations.

## 🎯 What Makes This Special

- **🧠 Context-Aware Grammar**: Automatically preserves plurals, tenses, and morphological variations using intelligent modifier emojis
- **🎯 Semantic Intelligence**: LLM-generated mappings based on meaning and context, not just visual similarity
- **🔄 Perfect Reversibility**: Encode→decode maintains functional equivalence with grammatical reconstruction
- **🌟 Multi-Pass Generation**: Advanced consensus-building algorithms with quality scoring and collision detection
- **⚡ Collision-Free Architecture**: Intelligent multi-pass resolution of duplicate emoji assignments using LLM consensus
- **🔤 Character Fallback System**: Unmapped words automatically encode as individual characters using consistent emoji mappings
- **🏗️ Production Ready**: Comprehensive error handling, logging, validation, and quality assurance

## 🚀 Quick Demo

```bash
# Context-aware encoding with morphological preservation
$ python3 encode.py "The cats were running quickly"
Original: The cats were running quickly
Encoded : 🏠✨🌍🔠 🐈🔢 👤💡 🏃‍♂️ 🏃⚡🎯

# Decode back to text
$ python3 decode.py "🏠✨🌍🔠 🐈🔢 👤💡 🏃‍♂️ 🏃⚡🎯"
Original: 🏠✨🌍🔠 🐈🔢 👤💡 🏃‍♂️ 🏃⚡🎯
Decoded : The cats were running quickly

# Standard greeting encoding
$ python3 encode.py "Hello world, how are you today?"
Original: Hello world, how are you today?
Encoded : 👋🌐🔠 🧭🌏, 🤨🔍 🚪🏠 👤🤗 📆🌞?

# Perfect decode for simple phrases
$ python3 decode.py "👋🌐🔠 🧭🌏, 🤨🔍 🚪🏠 👤🤗 📆🌞?"
Original: 👋🌐🔠 🧭🌏, 🤨🔍 🚪🏠 👤🤗 📆🌞?
Decoded : Hello world, how are you today?

# Process files with piping
$ cat your_text_file.txt | python3 encode.py > encoded_output.txt
```

## 🎨 Use Cases

- **🔒 Secure Communication**: Semantic encoding for privacy-conscious messaging
- **🔬 Language Research**: Morphological analysis and linguistic pattern recognition
- **🎭 Creative Projects**: Digital art, emoji poetry, and experimental writing
- **📚 Educational Tools**: Teaching semantic meaning, word roots, and language structure
- **🤖 LLM Research**: Prompt engineering, NLP experiments, and model evaluation
## 🎉 Fun Applications: Confuse friends, create unique social content, emoji games

### 🐝 Community Projects

**The Bee Movie Script in EmoLanguage**: For the ultimate test of semantic encoding (and pure entertainment), the entire Bee Movie script has been translated into EmoLanguage! Check out this epic exercise in AI-generated emoji communication: [Bee Movie EmoLanguage Script](https://gist.github.com/a904guy/a60071a516ea60d559a36b23f907679a)

*"According to all known laws of aviation, there is no way a bee should be able to fly..."* becomes a beautiful sequence of semantically meaningful emojis. It's both a technical demonstration of the system's capabilities and a delightfully absurd homage to internet meme culture.

**📟 EmojiPager - Web Interface**: Experience EmoLanguage through a nostalgic retro pager interface! This web-based UI brings the encode/decode functionality to your browser with an authentic old-school pager design. Complete with LCD-style display, chunky buttons, and that aesthetic of 90s communication devices. Perfect for demonstrating the system to others or just enjoying some retro-tech vibes while encoding your messages.

Try it live at: [https://emojipager.com](https://emojipager.com)

Features include real-time encoding/decoding, responsive design for mobile devices, and all the semantic intelligence of the command-line tools wrapped in a beautifully crafted vintage interface. Because sometimes the future of communication needs a little blast from the past!

## 📂 Architecture & Components

### Core System
- **`lib/semantic_mapping_generator.py`** - Sophisticated mapping generation with multi-pass consensus, collision resolution, and quality scoring
- **`build_mapping.py`** - Main interface for generating mappings with various strategies (batch, multipass, dictionary processing)
- **`encode.py`** - Context-aware encoder with morphological transformation detection
- **`decode.py`** - Intelligent decoder with grammatical reconstruction capabilities

### Advanced Processing
- **`lib/word_normalizer.py`** - NLTK + rule-based word normalization with extensive morphological handling
- **`lib/collision_manager.py`** - Multi-pass collision detection and LLM-based resolution system
- **`lib/file_manager.py`** - Robust file I/O with validation, backup, and collision tracking
- **`lib/llm_client.py`** - LLM integration with score-aware responses and error handling

### Utilities & Support
- **`normalize_dictionary.py`** - Dictionary cleaning, deduplication, and sorting
- **`settle_duplications.py`** - Legacy collision resolution (replaced by integrated system)

### Data & Configuration
- **`documents/dictionary.txt`** - Normalized word list derived from `/usr/share/dict/words` (17,265+ entries including short words)
- **`mappings/mapping.json`** - Primary emoji-to-word mapping database
- **`lib/config.py`** - Centralized configuration and prompt templates

## 🛠️ Setup & Installation

### Prerequisites

1. **Local LLM Server** (for generating new mappings)
```bash
# Install LM Studio, Ollama, or similar
# Download a compatible model (e.g., Llama, Mistral, GPT variants)
# Start server on http://127.0.0.1:1234 (default)
```

2. **Python Dependencies**
```bash
# Recommended: Use Makefile for automated setup
make install

# Alternative: Manual installation
pip install -r requirements.txt
# NLTK data downloads automatically when needed
```

### Quick Start

```bash
# 1. Encode text to emojis (uses existing mappings)
python3 encode.py "The quick brown fox jumps over the lazy dog"
# Output:
# Original: The quick brown fox jumps over the lazy dog
# Encoded : 🏠✨🌍🔠 🏃⚡ 🍂🟫 🦊 🦘3️⃣ 🔝💨 🏠✨🌍 🛌🤤 🐕

# 2. Decode emojis back to text
python3 decode.py "🏠✨🌍🔠 🏃⚡ 🍂🟫 🦊 🦘3️⃣ 🔝💨 🏠✨🌍 🛌🤤 🐕"
# Output:
# Original: 🏠✨🌍🔠 🏃⚡ 🍂🟫 🦊 🦘3️⃣ 🔝💨 🏠✨🌍 🛌🤤 🐕
# Decoded : The quick brown fox jumps over the lazy dog
```

## 🏗️ Building Mappings: Multiple Methods

### Method 1: Standard Batch Generation
```bash
# Process words in batches with collision resolution
python3 build_mapping.py --mapping-size 50 --collision-size 25
```

### Method 2: Multi-Pass Generation (Recommended)
```bash
# Higher quality with consensus scoring across multiple LLM passes
python3 build_mapping.py --multipass --mapping-size 50 --passes 3 --collision-passes 2
```

### Advanced Options
```bash
# Custom LLM configuration
python3 build_mapping.py --base-url http://localhost:8080 --model custom-model

# Dry run to preview results
python3 build_mapping.py --dry-run --mapping-size 20

# Multi-pass with custom settings
python3 build_mapping.py --multipass --passes 5 --collision-passes 3 --mapping-size 100
```

## 🔤 Morphological System

### Advanced Word Normalization
The system uses sophisticated normalization combining NLTK lemmatization with extensive rule-based processing:

- **Contractions**: `didn't` → `did` + negation modifier `❌`
- **Plurals**: `cats` → `cat` + plurality modifier `🔢`
- **Verb Forms**: `running` → `run` + progressive modifier `🔄`
- **Comparatives**: `bigger` → `big` + comparative modifier `➕`
- **Complex Forms**: `children` → `child` + irregular plural modifier `🔢👑`

### Grammar Preservation
Context-aware encoding preserves grammatical information through modifier emojis:

| Grammar Type | Modifier | Example |
|-------------|----------|---------|
| Plural | 🔢 | cats → 🐱🔢 |
| Past Tense | ⏪ | walked → 🚶‍♂️⏪ |
| Progressive | 🔄 | running → 🏃🔄 |
| Comparative | ➕ | bigger → 📏➕ |
| Superlative | ⭐ | biggest → 📏⭐ |
| Negation | ❌ | didn't → 🏃‍♀️✅❌ |
| Capitalization | 🔠 | Hello → 👋🔠 |

## 🔤 Character Fallback System

### Handling Unmapped Words
When words aren't found in the emoji dictionary, the system automatically falls back to character-by-character encoding:

- **Letters A-Z**: Encoded using consistent squared letter emojis (🅰️🅱️🅾️🅿️...)
- **Numbers 0-9**: Encoded using distinct number emojis that don't conflict with modifiers
- **Proper Nouns**: Names and places like "Andy Hawkins" → 🅰️🔠🄽🄳🅨 🅷🅰️🅦🄺🄸🄽🅂
- **Technical Terms**: Specialized vocabulary automatically handled without manual mapping
- **Mixed Content**: "test123" → 🆃🄴🅂🆃①②③

### Character Fallback Features
- **Capitalization Preserved**: Uppercase letters get individual capitalization modifiers (🔠)
- **Perfect Reversibility**: Character sequences decode back to original text exactly
- **Visual Consistency**: All letters use the same emoji style for clean appearance
- **No Conflicts**: Character emojis never clash with morphological modifiers

## 🎛️ System Performance

### Current Metrics (2025)
- **Dictionary Size**: 17,265 unique normalized words
- **Mapping Database**: 17,794 emoji mappings (0.44MB file, ~0.4MB RAM)
- **Memory Usage**: ~5.5MB incremental (mappings + dictionary + normalizer)
- **Encoding Speed**: ~2 seconds startup + processing time (includes NLTK/normalization)
- **Decoding Speed**: ~0.1 seconds (fast lookup-based decoding)
- **Character Fallback**: Instant encoding for any unmapped content
- **Coverage**: 100% text encoding (dictionary words + character fallback)
- **Collision Rate**: <0.1% after multi-pass resolution
- **Reversibility**: >99.9% functional accuracy

### Quality Assurance
- **Multi-Pass Scoring**: LLM confidence scores and consensus validation
- **Collision Detection**: Real-time duplicate emoji detection with automatic resolution
- **Semantic Validation**: Context-aware quality checks and coherence testing
- **Grammar Preservation**: Morphological transformation tracking and reconstruction
- **Error Recovery**: Robust fallback systems and comprehensive logging

## 🔧 Configuration & Customization

### LLM Settings
```python
# lib/config.py
DEFAULT_BASE_URL = "http://127.0.0.1:1234"
DEFAULT_MODEL = "openai/gpt-oss-20b"
DEFAULT_MAPPING_BATCH_SIZE = 50
DEFAULT_COLLISION_BATCH_SIZE = 10
```

### File Paths
```python
# lib/config.py
DEFAULT_DICTIONARY_PATH = "documents/dictionary.txt"
MAPPING_FILE_PATH = "mappings/mapping.json"
LOGS_DIR = "logs"
```

### Morphological Modifiers
All modifier emojis are configurable in `encode.py`:
```python
MORPHOLOGICAL_MODIFIERS = {
'plural_s': '🔢',
'verb_ed': '⏪',
'verb_ing': '🔄',
'comparative': '➕',
'superlative': '⭐',
'contraction_nt': '❌',
# ... extensive customization options
}
```

## 🔍 Troubleshooting & Development

### Common Issues
```bash
# Check dictionary and mapping status
python3 -c "
import json
with open('mappings/mapping.json') as f:
mappings = json.load(f)
print(f'Loaded {len(mappings)} mappings')
"

# Test specific words
echo "test words here" | python3 encode.py

# Validate collision-free mappings
python3 -c "
import json
from collections import Counter
with open('mappings/mapping.json') as f:
mappings = json.load(f)
emoji_counts = Counter(mappings.values())
duplicates = {k:v for k,v in emoji_counts.items() if v > 1}
print(f'Found {len(duplicates)} duplicate emojis: {duplicates}')
"
```

### Development Features
- **Comprehensive Logging**: All operations logged with timestamps and context
- **Dry Run Mode**: Preview changes without modifying files
- **Statistics Tracking**: Detailed metrics and performance monitoring
- **Modular Architecture**: Clean separation of concerns for easy extension

## 🎓 Technical Background

### Inspiration: **[Pantheon](https://www.netflix.com/title/81937398)** Series
In the Netflix series **[Pantheon](https://www.netflix.com/title/81937398)**, the "Emo Language" represents a symbolic communication method used by uploaded intelligences. This project reimagines that concept with:

- **Semantic Encoding**: Meaning-based rather than visual emoji selection
- **Linguistic Intelligence**: Advanced morphological and grammatical awareness
- **Reversible Communication**: Perfect round-trip encoding/decoding capability
- **Cultural Neutrality**: Universal emoji selection avoiding regional biases

### Architecture Philosophy
- **LLM-First Design**: Leveraging language models for semantic understanding
- **Collision-Free Guarantee**: Ensuring one-to-one word-emoji correspondence
- **Context Preservation**: Maintaining grammatical and morphological information
- **Production Quality**: Robust error handling, logging, and validation systems

## 🤝 Contributing

This project welcomes contributions! Key areas:

- **New Language Support**: Extend beyond English with multilingual mappings
- **Alternative LLM Backends**: Support for different model architectures
- **Grammar Extensions**: Enhanced morphological transformation detection
- **Performance Optimization**: Faster encoding/decoding algorithms
- **Quality Metrics**: Advanced semantic validation and scoring systems

## 📜 License

MIT License — use it, break it, improve it, just give credit where it's due.

---

*"The future of communication is not just digital—it's semantic. Every emoji carries meaning, every sequence tells a story."*

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/a904guy/emolanguage

Awesome Lists containing this project

README