An open API service indexing awesome lists of open source software.

https://github.com/a904guy/emolanguage

πŸ€– Semantic emoji encoder using LLMs to transform text into meaningful emoji sequences with grammar preservation and reversible decoding. UI-generated communication system inspired by Pantheon.
https://github.com/a904guy/emolanguage

artificial-intelligence communication emoji encoding linguistics llm machine-learning morphology natural-language-processing nlp pantheon python semantic-encoding text-processing

Last synced: 25 days ago
JSON representation

πŸ€– Semantic emoji encoder using LLMs to transform text into meaningful emoji sequences with grammar preservation and reversible decoding. UI-generated communication system inspired by Pantheon.

Awesome Lists containing this project

README

          

# πŸ€– EmoLanguage: Semantic Emoji Encoding System

A sophisticated language encoding system that transforms English text into semantically meaningful emoji sequences using Large Language Models (LLMs). Inspired by the Netflix series **[Pantheon](https://www.netflix.com/title/81937398)**, this project creates a reversible emoji-based communication protocol with advanced morphological transformation detection and intelligent collision resolution.

**🧠 AI-Generated Project**: Much like the uploaded intelligences in **[Pantheon](https://www.netflix.com/title/81937398)** who developed their own symbolic communication methods, this entire project was conceived, designed, and implemented through **artificial intelligence collaboration**. Every line of code, architectural decision, and algorithmic approach emerged from AI reasoning and iterative developmentβ€”creating a meta-commentary on machine intelligence developing communication systems for other machine intelligences.

**🎭 From Fiction to Reality**: While we don't yet have the uploaded human consciousness technology from **[Pantheon](https://www.netflix.com/title/81937398)**, this project demonstrates how current AI can replicate the conceptual framework of advanced digital beings creating their own semantic languages. The irony is intentional: an AI system building tools for AI communication, mirroring the show's premise of uploaded minds developing new forms of expression beyond human linguistic limitations.

## 🎯 What Makes This Special

- **🧠 Context-Aware Grammar**: Automatically preserves plurals, tenses, and morphological variations using intelligent modifier emojis
- **🎯 Semantic Intelligence**: LLM-generated mappings based on meaning and context, not just visual similarity
- **πŸ”„ Perfect Reversibility**: Encodeβ†’decode maintains functional equivalence with grammatical reconstruction
- **🌟 Multi-Pass Generation**: Advanced consensus-building algorithms with quality scoring and collision detection
- **⚑ Collision-Free Architecture**: Intelligent multi-pass resolution of duplicate emoji assignments using LLM consensus
- **πŸ”€ Character Fallback System**: Unmapped words automatically encode as individual characters using consistent emoji mappings
- **πŸ—οΈ Production Ready**: Comprehensive error handling, logging, validation, and quality assurance

## πŸš€ Quick Demo

```bash
# Context-aware encoding with morphological preservation
$ python3 encode.py "The cats were running quickly"
Original: The cats were running quickly
Encoded : πŸ βœ¨πŸŒπŸ”  πŸˆπŸ”’ πŸ‘€πŸ’‘ πŸƒβ€β™‚οΈ πŸƒβš‘πŸŽ―

# Decode back to text
$ python3 decode.py "πŸ βœ¨πŸŒπŸ”  πŸˆπŸ”’ πŸ‘€πŸ’‘ πŸƒβ€β™‚οΈ πŸƒβš‘πŸŽ―"
Original: πŸ βœ¨πŸŒπŸ”  πŸˆπŸ”’ πŸ‘€πŸ’‘ πŸƒβ€β™‚οΈ πŸƒβš‘πŸŽ―
Decoded : The cats were running quickly

# Standard greeting encoding
$ python3 encode.py "Hello world, how are you today?"
Original: Hello world, how are you today?
Encoded : πŸ‘‹πŸŒπŸ”  🧭🌏, πŸ€¨πŸ” πŸšͺ🏠 πŸ‘€πŸ€— πŸ“†πŸŒž?

# Perfect decode for simple phrases
$ python3 decode.py "πŸ‘‹πŸŒπŸ”  🧭🌏, πŸ€¨πŸ” πŸšͺ🏠 πŸ‘€πŸ€— πŸ“†πŸŒž?"
Original: πŸ‘‹πŸŒπŸ”  🧭🌏, πŸ€¨πŸ” πŸšͺ🏠 πŸ‘€πŸ€— πŸ“†πŸŒž?
Decoded : Hello world, how are you today?

# Process files with piping
$ cat your_text_file.txt | python3 encode.py > encoded_output.txt
```

## 🎨 Use Cases

- **πŸ”’ Secure Communication**: Semantic encoding for privacy-conscious messaging
- **πŸ”¬ Language Research**: Morphological analysis and linguistic pattern recognition
- **🎭 Creative Projects**: Digital art, emoji poetry, and experimental writing
- **πŸ“š Educational Tools**: Teaching semantic meaning, word roots, and language structure
- **πŸ€– LLM Research**: Prompt engineering, NLP experiments, and model evaluation
## πŸŽ‰ Fun Applications: Confuse friends, create unique social content, emoji games

### 🐝 Community Projects

**The Bee Movie Script in EmoLanguage**: For the ultimate test of semantic encoding (and pure entertainment), the entire Bee Movie script has been translated into EmoLanguage! Check out this epic exercise in AI-generated emoji communication: [Bee Movie EmoLanguage Script](https://gist.github.com/a904guy/a60071a516ea60d559a36b23f907679a)

*"According to all known laws of aviation, there is no way a bee should be able to fly..."* becomes a beautiful sequence of semantically meaningful emojis. It's both a technical demonstration of the system's capabilities and a delightfully absurd homage to internet meme culture.

**πŸ“Ÿ EmojiPager - Web Interface**: Experience EmoLanguage through a nostalgic retro pager interface! This web-based UI brings the encode/decode functionality to your browser with an authentic old-school pager design. Complete with LCD-style display, chunky buttons, and that aesthetic of 90s communication devices. Perfect for demonstrating the system to others or just enjoying some retro-tech vibes while encoding your messages.

Try it live at: [https://emojipager.com](https://emojipager.com)

Features include real-time encoding/decoding, responsive design for mobile devices, and all the semantic intelligence of the command-line tools wrapped in a beautifully crafted vintage interface. Because sometimes the future of communication needs a little blast from the past!

## πŸ“‚ Architecture & Components

### Core System
- **`lib/semantic_mapping_generator.py`** - Sophisticated mapping generation with multi-pass consensus, collision resolution, and quality scoring
- **`build_mapping.py`** - Main interface for generating mappings with various strategies (batch, multipass, dictionary processing)
- **`encode.py`** - Context-aware encoder with morphological transformation detection
- **`decode.py`** - Intelligent decoder with grammatical reconstruction capabilities

### Advanced Processing
- **`lib/word_normalizer.py`** - NLTK + rule-based word normalization with extensive morphological handling
- **`lib/collision_manager.py`** - Multi-pass collision detection and LLM-based resolution system
- **`lib/file_manager.py`** - Robust file I/O with validation, backup, and collision tracking
- **`lib/llm_client.py`** - LLM integration with score-aware responses and error handling

### Utilities & Support
- **`normalize_dictionary.py`** - Dictionary cleaning, deduplication, and sorting
- **`settle_duplications.py`** - Legacy collision resolution (replaced by integrated system)

### Data & Configuration
- **`documents/dictionary.txt`** - Normalized word list derived from `/usr/share/dict/words` (17,265+ entries including short words)
- **`mappings/mapping.json`** - Primary emoji-to-word mapping database
- **`lib/config.py`** - Centralized configuration and prompt templates

## πŸ› οΈ Setup & Installation

### Prerequisites

1. **Local LLM Server** (for generating new mappings)
```bash
# Install LM Studio, Ollama, or similar
# Download a compatible model (e.g., Llama, Mistral, GPT variants)
# Start server on http://127.0.0.1:1234 (default)
```

2. **Python Dependencies**
```bash
# Recommended: Use Makefile for automated setup
make install

# Alternative: Manual installation
pip install -r requirements.txt
# NLTK data downloads automatically when needed
```

### Quick Start

```bash
# 1. Encode text to emojis (uses existing mappings)
python3 encode.py "The quick brown fox jumps over the lazy dog"
# Output:
# Original: The quick brown fox jumps over the lazy dog
# Encoded : πŸ βœ¨πŸŒπŸ”  πŸƒβš‘ πŸ‚πŸŸ« 🦊 🦘3️⃣ πŸ”πŸ’¨ 🏠✨🌍 πŸ›ŒπŸ€€ πŸ•

# 2. Decode emojis back to text
python3 decode.py "πŸ βœ¨πŸŒπŸ”  πŸƒβš‘ πŸ‚πŸŸ« 🦊 🦘3️⃣ πŸ”πŸ’¨ 🏠✨🌍 πŸ›ŒπŸ€€ πŸ•"
# Output:
# Original: πŸ βœ¨πŸŒπŸ”  πŸƒβš‘ πŸ‚πŸŸ« 🦊 🦘3️⃣ πŸ”πŸ’¨ 🏠✨🌍 πŸ›ŒπŸ€€ πŸ•
# Decoded : The quick brown fox jumps over the lazy dog
```

## πŸ—οΈ Building Mappings: Multiple Methods

### Method 1: Standard Batch Generation
```bash
# Process words in batches with collision resolution
python3 build_mapping.py --mapping-size 50 --collision-size 25
```

### Method 2: Multi-Pass Generation (Recommended)
```bash
# Higher quality with consensus scoring across multiple LLM passes
python3 build_mapping.py --multipass --mapping-size 50 --passes 3 --collision-passes 2
```

### Advanced Options
```bash
# Custom LLM configuration
python3 build_mapping.py --base-url http://localhost:8080 --model custom-model

# Dry run to preview results
python3 build_mapping.py --dry-run --mapping-size 20

# Multi-pass with custom settings
python3 build_mapping.py --multipass --passes 5 --collision-passes 3 --mapping-size 100
```

## πŸ”€ Morphological System

### Advanced Word Normalization
The system uses sophisticated normalization combining NLTK lemmatization with extensive rule-based processing:

- **Contractions**: `didn't` β†’ `did` + negation modifier `❌`
- **Plurals**: `cats` β†’ `cat` + plurality modifier `πŸ”’`
- **Verb Forms**: `running` β†’ `run` + progressive modifier `πŸ”„`
- **Comparatives**: `bigger` β†’ `big` + comparative modifier `βž•`
- **Complex Forms**: `children` β†’ `child` + irregular plural modifier `πŸ”’πŸ‘‘`

### Grammar Preservation
Context-aware encoding preserves grammatical information through modifier emojis:

| Grammar Type | Modifier | Example |
|-------------|----------|---------|
| Plural | πŸ”’ | cats β†’ πŸ±πŸ”’ |
| Past Tense | βͺ | walked β†’ πŸšΆβ€β™‚οΈβͺ |
| Progressive | πŸ”„ | running β†’ πŸƒπŸ”„ |
| Comparative | βž• | bigger β†’ πŸ“βž• |
| Superlative | ⭐ | biggest β†’ πŸ“β­ |
| Negation | ❌ | didn't β†’ πŸƒβ€β™€οΈβœ…βŒ |
| Capitalization | πŸ”  | Hello β†’ πŸ‘‹πŸ”  |

## πŸ”€ Character Fallback System

### Handling Unmapped Words
When words aren't found in the emoji dictionary, the system automatically falls back to character-by-character encoding:

- **Letters A-Z**: Encoded using consistent squared letter emojis (πŸ…°οΈπŸ…±οΈπŸ…ΎοΈπŸ…ΏοΈ...)
- **Numbers 0-9**: Encoded using distinct number emojis that don't conflict with modifiers
- **Proper Nouns**: Names and places like "Andy Hawkins" β†’ πŸ…°οΈπŸ” πŸ„½πŸ„³πŸ…¨ πŸ…·πŸ…°οΈπŸ…¦πŸ„ΊπŸ„ΈπŸ„½πŸ…‚
- **Technical Terms**: Specialized vocabulary automatically handled without manual mapping
- **Mixed Content**: "test123" β†’ πŸ†ƒπŸ„΄πŸ…‚πŸ†ƒβ‘ β‘‘β‘’

### Character Fallback Features
- **Capitalization Preserved**: Uppercase letters get individual capitalization modifiers (πŸ” )
- **Perfect Reversibility**: Character sequences decode back to original text exactly
- **Visual Consistency**: All letters use the same emoji style for clean appearance
- **No Conflicts**: Character emojis never clash with morphological modifiers

## πŸŽ›οΈ System Performance

### Current Metrics (2025)
- **Dictionary Size**: 17,265 unique normalized words
- **Mapping Database**: 17,794 emoji mappings (0.44MB file, ~0.4MB RAM)
- **Memory Usage**: ~5.5MB incremental (mappings + dictionary + normalizer)
- **Encoding Speed**: ~2 seconds startup + processing time (includes NLTK/normalization)
- **Decoding Speed**: ~0.1 seconds (fast lookup-based decoding)
- **Character Fallback**: Instant encoding for any unmapped content
- **Coverage**: 100% text encoding (dictionary words + character fallback)
- **Collision Rate**: <0.1% after multi-pass resolution
- **Reversibility**: >99.9% functional accuracy

### Quality Assurance
- **Multi-Pass Scoring**: LLM confidence scores and consensus validation
- **Collision Detection**: Real-time duplicate emoji detection with automatic resolution
- **Semantic Validation**: Context-aware quality checks and coherence testing
- **Grammar Preservation**: Morphological transformation tracking and reconstruction
- **Error Recovery**: Robust fallback systems and comprehensive logging

## πŸ”§ Configuration & Customization

### LLM Settings
```python
# lib/config.py
DEFAULT_BASE_URL = "http://127.0.0.1:1234"
DEFAULT_MODEL = "openai/gpt-oss-20b"
DEFAULT_MAPPING_BATCH_SIZE = 50
DEFAULT_COLLISION_BATCH_SIZE = 10
```

### File Paths
```python
# lib/config.py
DEFAULT_DICTIONARY_PATH = "documents/dictionary.txt"
MAPPING_FILE_PATH = "mappings/mapping.json"
LOGS_DIR = "logs"
```

### Morphological Modifiers
All modifier emojis are configurable in `encode.py`:
```python
MORPHOLOGICAL_MODIFIERS = {
'plural_s': 'πŸ”’',
'verb_ed': 'βͺ',
'verb_ing': 'πŸ”„',
'comparative': 'βž•',
'superlative': '⭐',
'contraction_nt': '❌',
# ... extensive customization options
}
```

## πŸ” Troubleshooting & Development

### Common Issues
```bash
# Check dictionary and mapping status
python3 -c "
import json
with open('mappings/mapping.json') as f:
mappings = json.load(f)
print(f'Loaded {len(mappings)} mappings')
"

# Test specific words
echo "test words here" | python3 encode.py

# Validate collision-free mappings
python3 -c "
import json
from collections import Counter
with open('mappings/mapping.json') as f:
mappings = json.load(f)
emoji_counts = Counter(mappings.values())
duplicates = {k:v for k,v in emoji_counts.items() if v > 1}
print(f'Found {len(duplicates)} duplicate emojis: {duplicates}')
"
```

### Development Features
- **Comprehensive Logging**: All operations logged with timestamps and context
- **Dry Run Mode**: Preview changes without modifying files
- **Statistics Tracking**: Detailed metrics and performance monitoring
- **Modular Architecture**: Clean separation of concerns for easy extension

## πŸŽ“ Technical Background

### Inspiration: **[Pantheon](https://www.netflix.com/title/81937398)** Series
In the Netflix series **[Pantheon](https://www.netflix.com/title/81937398)**, the "Emo Language" represents a symbolic communication method used by uploaded intelligences. This project reimagines that concept with:

- **Semantic Encoding**: Meaning-based rather than visual emoji selection
- **Linguistic Intelligence**: Advanced morphological and grammatical awareness
- **Reversible Communication**: Perfect round-trip encoding/decoding capability
- **Cultural Neutrality**: Universal emoji selection avoiding regional biases

### Architecture Philosophy
- **LLM-First Design**: Leveraging language models for semantic understanding
- **Collision-Free Guarantee**: Ensuring one-to-one word-emoji correspondence
- **Context Preservation**: Maintaining grammatical and morphological information
- **Production Quality**: Robust error handling, logging, and validation systems

## 🀝 Contributing

This project welcomes contributions! Key areas:

- **New Language Support**: Extend beyond English with multilingual mappings
- **Alternative LLM Backends**: Support for different model architectures
- **Grammar Extensions**: Enhanced morphological transformation detection
- **Performance Optimization**: Faster encoding/decoding algorithms
- **Quality Metrics**: Advanced semantic validation and scoring systems

## πŸ“œ License

MIT License β€” use it, break it, improve it, just give credit where it's due.

---

*"The future of communication is not just digitalβ€”it's semantic. Every emoji carries meaning, every sequence tells a story."*