https://github.com/mancrurod/linguaanimae
Exploring emotions and meaning in Bible verses with NLP, transformers, and a custom Streamlit app.
- Host: GitHub
- URL: https://github.com/mancrurod/linguaanimae
- Owner: mancrurod
- Created: 2025-04-21T14:10:40.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-05-26T06:47:22.000Z (about 1 year ago)
- Last Synced: 2025-05-26T08:38:21.857Z (about 1 year ago)
- Topics: bert, corpus-linguistics, digital-humanities, emotion-detection, huggingface-transformers, humanities, multi-label-classification, natural-language-processing, nlp, python, semantic-analysis, streamlit, text-classification, theme-detection, web-scraping
- Language: Jupyter Notebook
- Homepage:
- Size: 21.6 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
Classify, explore, and connect with sacred texts through emotion and theme.
Multilingual NLP pipeline for emotion & theme annotation, with an interactive Streamlit chatbot for personalized Bible verse recommendations.
---
Table of Contents
- [Key Notebooks](#key-notebooks)
- [Project Goals](#project-goals)
- [Core Technologies](#core-technologies)
- [Project Structure](#project-structure)
- [Data Folders Overview](#data-folders-overview)
- [Data Selection, Annotation & Versioning](#data-selection-annotation--versioning)
- [Label Mapping and Cleaning](#label-mapping-and-cleaning)
- [Model Training & Evaluation](#model-training--evaluation)
- [Screenshots](#screenshots)
- [Getting Started](#getting-started)
- [Usage](#usage)
- [Streamlit Interface](#streamlit-interface)
- [Feedback System](#feedback-system)
- [Outputs](#outputs)
- [Project Status (MVP Completed)](#project-status-mvp-completed)
- [Known Limitations](#known-limitations)
- [Contributing & Testing](#contributing--testing)
- [License](#license)
- [Acknowledgements](#acknowledgements)
---
## Key Notebooks
Explore the main stages of the pipeline directly in Jupyter notebooks:
- [01_scraping_exploration.ipynb](notebooks/01_scraping_exploration.ipynb): Data exploration & Bible scraping workflow.
- [02_cleaning.ipynb](notebooks/02_cleaning.ipynb): Data cleaning and normalization.
- [03_label_emotions_and_themes.ipynb](notebooks/03_label_emotions_and_themes.ipynb): Emotion & theme annotation pipeline.
- [05_evaluation.ipynb](notebooks/05_evaluation.ipynb): Model evaluation: metrics, confusion matrix, reporting.
- [viz_models.ipynb](notebooks/viz_models.ipynb): Model outputs and visualizations.
---
## Project Goals
* Extract and normalize full Bible corpora (English + Spanish)
* Annotate every verse with emotion and theme labels
* Translate annotations for multilingual consistency
* Power a semantic chatbot that suggests aligned verses in real time
* Support additional domains like poetry or music lyrics (planned)
---
## Core Technologies
* **Python 3.10+**
* `transformers`, `torch`, `sentence-transformers`
* `pandas`, `scikit-learn`, `regex`
* `beautifulsoup4`, `requests`
* `streamlit`: multilingual app for emotion/theme-based verse recommendation
---
## Project Structure
```
LinguaAnimae/
├── .streamlit/
│   └── secrets.toml
├── app/
│   ├── assets/
│   ├── components/
│   │   ├── render_emotion.py
│   │   ├── render_feedback.py
│   │   └── render_theme.py
│   ├── app.py
│   └── texts.py
├── data/
│   ├── evaluation/
│   │   ├── verses_labeled_gpt/
│   │   ├── verses_parsed/
│   │   ├── verses_to_label/
│   │   ├── eval_examples.csv
│   │   └── eval_results.csv
│   ├── labeled/
│   ├── processed/
│   └── raw/
├── logs/
├── notebooks/
│   ├── 01_scraping_exploration.ipynb
│   ├── 02_cleaning.ipynb
│   ├── 03_label_emotions_and_themes.ipynb
│   ├── 04_translate_labels.ipynb
│   ├── 05_evaluation.ipynb
│   ├── 06_emotion_finetuning_pipeline.ipynb
│   └── viz_models.ipynb
├── src/
│   ├── fine_tuning/
│   │   ├── parse_gpt_output_to_labeled_csv.py
│   │   ├── select_verses_for_labeling.py
│   │   └── prompt_gpt.txt
│   ├── interface/
│   │   └── recommender.py
│   ├── modeling/
│   │   ├── emotion_theme_labeling.py
│   │   ├── labeling_pipeline.py
│   │   └── theme_labeling.py
│   ├── preprocessing/
│   │   ├── cleaning.py
│   │   ├── merge.py
│   │   └── translate_and_apply_labels.py
│   ├── scraping/
│   │   ├── bible_scraper.py
│   │   └── parse_osis_kjv.py
│   └── utils/
│       ├── save_feedback_to_gsheet.py
│       └── translation_maps.py
├── tests/
├── .gitignore
├── requirements.txt
├── requirements_local.txt
├── environment.yml
├── README.md
└── CHANGELOG.md
```
---
## Data Folders Overview
| Folder | Description |
|-------------------------|-----------------------------------------------------------------------------|
| `data/raw/` | Raw, unprocessed texts as scraped from original sources (KJV/RV60 Bibles). |
| `data/processed/` | Cleaned and normalized texts, with basic formatting corrections. |
| `data/labeled/` | Verses annotated with emotion and theme labels. |
| `data/evaluation/` | Evaluation sets, results, and samples for manual review. |
| `logs/` | Logs from annotation, training, and feedback collection. |
| `notebooks/` | Jupyter notebooks documenting each stage of the pipeline. |
---
## Data Selection, Annotation & Versioning
**Sampling, annotation, and batch tracking workflow:**
- Automated random verse selection script for new annotation rounds, guaranteeing no duplication of already labeled verses.
- Supports multiple annotation rounds with batch/version tracking (`emotion_verses_to_label_X.csv`).
- New annotation batches can be labeled via GPT or other models, then easily merged with existing datasets.
- Utility scripts included for remapping, cleaning, and validating emotion labels prior to model training.
- Each annotation batch and its integration is versioned for reproducibility and experiment traceability.
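The no-duplication guarantee described above can be illustrated with a minimal sketch. It assumes each verse has a unique string ID; the function name `select_new_batch`, the ID format, and the seed handling are hypothetical and are not taken from `select_verses_for_labeling.py`.

```python
import random


def select_new_batch(all_verse_ids, labeled_ids, batch_size, seed=0):
    """Pick up to `batch_size` verse IDs that have not been labeled yet."""
    already_labeled = set(labeled_ids)
    # Sort the candidates so the sample is reproducible for a given seed.
    candidates = sorted(v for v in set(all_verse_ids) if v not in already_labeled)
    rng = random.Random(seed)
    return rng.sample(candidates, min(batch_size, len(candidates)))


# Toy IDs (hypothetical format): ten verses, three already labeled.
all_ids = [f"GEN_1_{i}" for i in range(1, 11)]
labeled = ["GEN_1_1", "GEN_1_2", "GEN_1_3"]
batch = select_new_batch(all_ids, labeled, batch_size=4, seed=42)
print(batch)
```

Filtering against the set of labeled IDs before sampling is what rules out duplicates; fixing the seed per round keeps each batch reproducible.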
---
## Label Mapping and Cleaning
- **Robust label mapping:** All scripts and model pipelines use unified dictionaries for emotion and theme mapping (`EMOTION_MAP`, `THEME_MAP`), ensuring compatibility between annotation, translation, and modeling.
- **Label cleaning utilities:** Automated routines for handling strange/ambiguous emotions and mapping them to the canonical set. Out-of-vocabulary or inconsistent labels are filtered out before training.
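As a rough illustration of the mapping and filtering steps, the sketch below normalizes raw labels through a dictionary and drops anything outside it. The dictionary contents here are hypothetical; the project's actual `EMOTION_MAP` entries are not shown in this README and may differ.

```python
# Hypothetical mapping for illustration only; the real EMOTION_MAP may differ.
EMOTION_MAP = {
    "joy": "joy", "happiness": "joy",
    "sadness": "sadness", "grief": "sadness",
    "anger": "anger", "rage": "anger",
    "fear": "fear", "anxiety": "fear",
    "surprise": "surprise",
    "disgust": "disgust",
}


def clean_labels(rows):
    """Map raw labels to canonical ones and drop rows with unknown labels."""
    cleaned = []
    for text, raw_label in rows:
        canonical = EMOTION_MAP.get(raw_label.strip().lower())
        if canonical is not None:  # out-of-vocabulary labels are filtered out
            cleaned.append((text, canonical))
    return cleaned


rows = [("verse a", " Grief "), ("verse b", "bliss"), ("verse c", "RAGE")]
print(clean_labels(rows))
```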
---
## Model Training & Evaluation
The project now supports full training and evaluation workflows for emotion classification models, including:
- Fine-tuning with Hugging Face Transformers on the annotated Bible corpus.
- Optional oversampling for class balancing during training.
- Comprehensive cross-validation pipeline using StratifiedKFold and HuggingFace Trainer, reporting mean and std of macro F1 across folds.
- Export of classification reports and confusion matrices after each experiment for documentation and analysis.
- Early stopping to prevent overfitting in all model workflows.
See `notebooks/05_evaluation.ipynb` and `src/fine_tuning/` for code examples and experiment tracking.
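To make the reported metric concrete, here is a dependency-free sketch of macro F1 and its mean and standard deviation across folds. It is not the project's evaluation code: the per-fold predictions are toy values standing in for the outputs of a real cross-validation run.

```python
from statistics import mean, stdev


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


# Toy (true, predicted) labels per fold; a real run would take these from each fold's model.
folds = [
    (["joy", "fear", "joy", "anger"], ["joy", "fear", "fear", "anger"]),
    (["joy", "fear", "anger", "anger"], ["joy", "fear", "anger", "anger"]),
]
fold_scores = [macro_f1(t, p) for t, p in folds]
print(f"macro F1: {mean(fold_scores):.3f} +/- {stdev(fold_scores):.3f}")
```

Macro averaging weights every class equally, which is why it is a common choice when emotion classes are imbalanced.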
---
## Screenshots
The following screenshots illustrate the main functionalities of the Streamlit app at a glance:
1. Home Screen: Input your message and select language
2. Recommendation Screen, part 1: The app suggests a Bible verse with detected emotion and theme
3. Recommendation Screen, part 2
4. Feedback Confirmation: User feedback is logged for model improvement
---
## Getting Started
You can set up the environment using either `conda` (recommended) or `pip`.
### Option 1: Using Conda (recommended)
```bash
conda env create -f environment_local.yml
conda activate linguaanimae
```
### Option 2: Using pip
1. Clone the repository
```bash
git clone https://github.com/your-username/LinguaAnimae.git
cd LinguaAnimae
```
2. Create a virtual environment
```bash
python -m venv venv
source venv/bin/activate # or .\venv\Scripts\activate on Windows
```
3. Install dependencies
```bash
pip install -r requirements.txt
```
4. Run the Bible scraper to download all books
```bash
python src/scraping/bible_scraper.py
```
---
## Usage
### 1. Scrape the Bible (RV60)
Use the scraping script to extract the full Reina-Valera 1960 Bible and save it as structured CSVs:
```bash
python src/scraping/bible_scraper.py
```
### 2. Label Verses with Emotions + Themes
Use the labeling pipeline to classify English Bible verses (`bible_kjv`) using pretrained Hugging Face models:
```bash
python src/modeling/labeling_pipeline.py --bible bible_kjv
```
Optional flags:
* `--skip-emotion` to skip emotion classification
* `--skip-theme` to skip theme labeling
* `--device -1` to force CPU mode (default is `--device 0` for GPU)
* `--dry-run path/to/file.csv` to test a single file
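For readers who want to see how flags like these fit together, the following is a hypothetical `argparse` declaration. It only mirrors the flags listed above; the help strings, the `required` setting, and the function name `build_parser` are assumptions, and the actual script may parse its arguments differently.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags (not the project's code)."""
    parser = argparse.ArgumentParser(description="Label verses with emotions and themes.")
    parser.add_argument("--bible", required=True, help="Corpus name, e.g. bible_kjv.")
    parser.add_argument("--skip-emotion", action="store_true", help="Skip emotion classification.")
    parser.add_argument("--skip-theme", action="store_true", help="Skip theme labeling.")
    parser.add_argument("--device", type=int, default=0, help="0 for GPU, -1 to force CPU.")
    parser.add_argument("--dry-run", metavar="CSV", default=None, help="Process a single file only.")
    return parser


# Parse an explicit argument list so the example runs without a command line.
args = build_parser().parse_args(["--bible", "bible_kjv", "--device", "-1", "--skip-theme"])
print(args)
```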
### 3. Translate Labels into Spanish
Align the English emotion/theme annotations with their Spanish verse equivalents in `bible_rv60`:
```bash
python src/preprocessing/translate_and_apply_labels.py
```
This creates a labeled Spanish version under:
```
data/labeled/bible_rv60/emotion_theme/
```
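The alignment step can be pictured as a join on a shared verse key. The sketch below assumes both corpora expose language-independent `book`, `chapter`, and `verse` fields; those field names, the book codes, and the theme value are hypothetical, and `translate_and_apply_labels.py` may work differently.

```python
def transfer_labels(english_labeled, spanish_verses):
    """Copy emotion/theme labels from English rows to Spanish rows with the same verse key."""
    labels_by_key = {
        (row["book"], row["chapter"], row["verse"]): (row["emotion"], row["theme"])
        for row in english_labeled
    }
    transferred, unmatched = [], []
    for row in spanish_verses:
        key = (row["book"], row["chapter"], row["verse"])
        if key in labels_by_key:
            emotion, theme = labels_by_key[key]
            transferred.append({**row, "emotion": emotion, "theme": theme})
        else:
            unmatched.append(key)  # misaligned verse: leave unlabeled and report it
    return transferred, unmatched


# Toy rows; the labels are illustrative, not real annotations.
english = [{"book": "GEN", "chapter": 40, "verse": 7, "emotion": "sadness", "theme": "suffering"}]
spanish = [
    {"book": "GEN", "chapter": 40, "verse": 7, "text": "¿Por qué parecen hoy mal vuestros semblantes?"},
    {"book": "GEN", "chapter": 40, "verse": 99, "text": "(no English counterpart)"},
]
labeled_es, missing = transfer_labels(english, spanish)
print(labeled_es, missing)
```

Collecting unmatched keys rather than failing makes verse misalignments between the two translations visible for manual review.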
---
## Streamlit Interface
The interactive Streamlit app allows users to input a free-form emotional message and receive recommended Bible verses matching its **emotion** and **theme**.
### Features
* **Automatic translation** of input (EN/ES)
* **Emotion detection** (6 Ekman categories)
* **Theme classification** (5 canonical themes)
* **Context-aware verse matching** from KJV or RV60
* **Stylized cards** with emotion/theme color, emoji, and verse metadata
* **User feedback collection** via like/dislike buttons (stored in Google Sheets)
### Example
Input:
> *Tengo miedo y necesito consuelo...* ("I am afraid and I need comfort...")
Returns:
> *Génesis 40:7* - *"¿Por qué parecen hoy mal vuestros semblantes?"* ("Why do your faces look so sad today?")
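One simple way to match verses by detected emotion and theme is sketched below. It is an assumption-laden illustration, not the logic in `recommender.py`: the record fields, the theme names, and the emotion-only fallback are all hypothetical choices.

```python
def recommend(verses, emotion, theme, limit=3):
    """Return up to `limit` verses whose emotion matches and whose themes include `theme`."""
    exact = [v for v in verses if v["emotion"] == emotion and theme in v["themes"]]
    if exact:
        return exact[:limit]
    # Fallback (an assumption of this sketch): match on emotion alone.
    return [v for v in verses if v["emotion"] == emotion][:limit]


# Toy corpus with placeholder references and hypothetical theme names.
corpus = [
    {"ref": "Verse A", "emotion": "fear", "themes": ["hope", "trust"]},
    {"ref": "Verse B", "emotion": "fear", "themes": ["justice"]},
    {"ref": "Verse C", "emotion": "joy", "themes": ["hope"]},
]
print([v["ref"] for v in recommend(corpus, "fear", "hope")])
print([v["ref"] for v in recommend(corpus, "fear", "love")])
```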
---
## Feedback System
Users can now rate the relevance of the emotion/theme detection with like/dislike buttons.
Feedback is saved to a **Google Sheet** along with:
* Original input
* Detected emotion and score
* Detected theme and score
* User name (optional)
* Feedback value (`like` / `dislike`)
This enables future model refinement and analytics.
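The fields listed above suggest a record shape along the following lines. This is a sketch only: the class name `FeedbackRow`, the column order, and the rounding are assumptions, and the actual write to Google Sheets in `save_feedback_to_gsheet.py` is intentionally left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FeedbackRow:
    """Hypothetical in-memory shape of one feedback entry."""
    user_input: str
    emotion: str
    emotion_score: float
    theme: str
    theme_score: float
    feedback: str  # "like" or "dislike"
    user_name: Optional[str] = None  # optional, may be left blank

    def __post_init__(self):
        if self.feedback not in ("like", "dislike"):
            raise ValueError(f"unexpected feedback value: {self.feedback!r}")

    def to_sheet_row(self):
        """Flatten to a list that could be appended as one spreadsheet row."""
        return [
            self.user_input,
            self.emotion, round(self.emotion_score, 3),
            self.theme, round(self.theme_score, 3),
            self.user_name or "",
            self.feedback,
        ]


row = FeedbackRow("Tengo miedo y necesito consuelo...", "fear", 0.91234, "hope", 0.5, "like")
print(row.to_sheet_row())
```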
---
## Outputs
Labeled files are saved as:
* `*_emotion.csv`: Emotion column using the 6 Ekman labels
* `*_emotion_theme.csv`: Adds a multilabel theme column drawn from the 5 canonical themes
* Logs are saved to `logs/labeling_logs/` with per-file runtime and pipeline summary
---
## Project Status (MVP Completed)
### MVP Completed (Weeks 1–6)
- [x] Full Bible scraping (KJV + RV60) and corpus organization
- [x] Data cleaning and normalization
- [x] Emotion and theme labeling using pretrained HuggingFace models
- [x] Cross-lingual label transfer and Spanish label alignment
- [x] Robust manual evaluation with accuracy, macro F1, and confusion matrix reporting
- [x] Streamlit interface: emotion + theme detection, stylized results, and interactive recommendations
- [x] Multilingual support: automatic input translation and dynamic corpus selection (EN/ES)
- [x] Recommendation system matching user queries by emotion and theme
- [x] Feedback system: like/dislike buttons with logging to Google Sheets
- [x] Model fine-tuning workflow: train/test split, metrics, early stopping, and artifact saving
- [x] Batch random sampling, annotation pipeline, and batch version tracking
- [x] Cross-validation pipeline (StratifiedKFold + HuggingFace Trainer) for robust evaluation
- [x] Automated report and confusion matrix export for each experiment
### Future Work (Optional/Post-MVP)
- Export features (PDF), voice synthesis, or word cloud summaries
- Support for additional text domains (poetry, music, etc.)
- Add fine-tuned model to pipeline. Use it to relabel Bible verses.
[See CHANGELOG.md](CHANGELOG.md) for complete history.
---
## Known Limitations
While Lingua Animae demonstrates robust results as an MVP, the current version has several known limitations that future work may address:
- **Domain scope:** The annotation and recommendation pipeline is currently limited to biblical texts (KJV and RV60). Application to other genres (e.g., poetry, music lyrics) is planned but not yet implemented or validated.
- **Language support:** Only English and Spanish are fully supported at this time. Adding other languages would require further data preparation and model adaptation.
- **Emotion & theme taxonomy:** The emotion (6-class) and theme (5-class) taxonomies, while grounded in literature, are simplified for tractability and may not capture all nuances present in complex texts.
- **Annotation transfer:** The cross-lingual label transfer assumes strong verse alignment between English and Spanish Bibles; rare misalignments or translation differences may impact label accuracy.
- **Model bias:** Pretrained models used for annotation (e.g., HuggingFace Transformers) may inherit cultural or linguistic biases from their original training data, which could affect the detection of emotions or themes.
- **Evaluation set:** Manual evaluation is limited in scale and focuses on selected books/verses. Broader user validation or external benchmarks are desirable for production-level deployment.
- **Deployment:** The Streamlit app is designed for demonstration and user feedback. For large-scale or production use, backend scalability, security, and multi-user management would require further engineering.
---
## Contributing & Testing
Contributions, suggestions, or bug reports are welcome!
To run unit tests, use:
```bash
pytest tests/
```
For feature requests, open an issue or pull request on GitHub.
---
## License
For academic and research use only. Sources are derived from public domain Bibles (e.g., RV60, KJV) and open ML models from Hugging Face. License will be finalized before v1.0.
---
## Acknowledgements
Developed by [Manuel Cruz Rodríguez](https://github.com/mancrurod) as part of an NLP and Data Science learning journey.