{"id":28140679,"url":"https://github.com/mancrurod/linguaanimae","last_synced_at":"2026-04-19T14:31:33.602Z","repository":{"id":289338720,"uuid":"970124596","full_name":"mancrurod/LinguaAnimae","owner":"mancrurod","description":"Exploring emotions and meaning in Bible verses with NLP, transformers, and a custom Streamlit app.","archived":false,"fork":false,"pushed_at":"2025-05-26T06:47:22.000Z","size":22637,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-26T08:38:21.857Z","etag":null,"topics":["bert","corpus-linguistics","digital-humanities","emotion-detection","huggingface-transformers","humanities","multi-label-classification","natural-language-processing","nlp","python","semantic-analysis","streamlit","text-classification","theme-detection","web-scraping"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mancrurod.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-21T14:10:40.000Z","updated_at":"2025-05-26T06:47:25.000Z","dependencies_parsed_at":"2025-05-26T16:03:08.872Z","dependency_job_id":null,"html_url":"https://github.com/mancrurod/LinguaAnimae","commit_stats":null,"previous_names":["mancrurod/linguaanimae"],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/mancrurod/LinguaAnimae","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mancrurod%2FLinguaAnimae","tags_url":"https://repos.ecosyste.ms/api/v1/
hosts/GitHub/repositories/mancrurod%2FLinguaAnimae/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mancrurod%2FLinguaAnimae/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mancrurod%2FLinguaAnimae/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mancrurod","download_url":"https://codeload.github.com/mancrurod/LinguaAnimae/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mancrurod%2FLinguaAnimae/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32009826,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-18T20:23:30.271Z","status":"online","status_checked_at":"2026-04-19T02:00:07.110Z","response_time":55,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","corpus-linguistics","digital-humanities","emotion-detection","huggingface-transformers","humanities","multi-label-classification","natural-language-processing","nlp","python","semantic-analysis","streamlit","text-classification","theme-detection","web-scraping"],"created_at":"2025-05-14T18:12:05.615Z","updated_at":"2026-04-19T14:31:33.597Z","avatar_url":"https://github.com/mancrurod.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003ca\u003e\n    \u003cimg src=\"docs/banner_readme.png\" 
alt=\"Lingua Animae Banner\" width=\"100%\" /\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cb\u003e🤖 Classify, explore, and connect with sacred texts through emotion and theme. ❤️‍🩹\u003c/b\u003e\u003cbr\u003e\n  Multilingual NLP pipeline for emotion \u0026 theme annotation, with an interactive Streamlit chatbot for personalized Bible verse recommendations.\n\u003c/p\u003e\n\n---\n\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://linguaanimae.streamlit.app/\" style=\"text-decoration: none; font-size: 1.3em;\"\u003e\n        🟢 Try the Live Demo!\n    \u003c/a\u003e\n\u003c/p\u003e\n\n---\n\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://linguaanimae.streamlit.app/\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/Streamlit-Demo-brightgreen?logo=streamlit\" alt=\"Streamlit Demo\"\u003e\n  \u003c/a\u003e\n  \u003cimg src=\"https://img.shields.io/badge/python-3.10%2B-blue.svg\" alt=\"Python Version\"\u003e\n  \u003cimg src=\"https://img.shields.io/badge/license-Academic-informational\" alt=\"License\"\u003e\n  \u003cimg src=\"https://img.shields.io/badge/Open%20Source-Yes-brightgreen.svg\" alt=\"Open Source\"\u003e\n  \u003cimg src=\"https://img.shields.io/badge/HuggingFace-Transformers-yellow?logo=huggingface\" alt=\"Hugging Face\"\u003e\n  \u003cimg src=\"https://img.shields.io/badge/NLP-pipeline-blueviolet\" alt=\"NLP Pipeline\"\u003e\n\u003c/p\u003e\n\n---\n\n\u003cdetails\u003e\n\u003csummary\u003e📚 \u003cb\u003eTable of Contents\u003c/b\u003e \u003csub\u003e(click to expand)\u003c/sub\u003e\u003c/summary\u003e\n\n- [📔 Key Notebooks](#-key-notebooks)\n- [🔍 Project Goals](#-project-goals)\n- [🧠 Core Technologies](#-core-technologies)\n- [📁 Project Structure](#-project-structure)\n- [📦 Data Folders Overview](#-data-folders-overview)\n- [🆕 Data Selection, Annotation \u0026 Versioning](#-data-selection-annotation--versioning)\n- [📝 Label Mapping and Cleaning](#-label-mapping-and-cleaning)\n- [🚦 Model 
Training \u0026 Evaluation](#-model-training--evaluation)\n- [📸 Screenshots](#-screenshots)\n- [🚀 Getting Started](#-getting-started)\n- [🧰 Usage](#-usage)\n- [💬 Streamlit Interface](#-streamlit-interface)\n- [📤 Feedback System](#-feedback-system)\n- [📊 Outputs](#-outputs)\n- [📌 Project Status (MVP Completed)](#-project-status-mvp-completed)\n- [⚠️ Known Limitations](#-known-limitations)\n- [🤝 Contributing \u0026 Testing](#-contributing--testing)\n- [📖 License](#-license)\n- [✨ Acknowledgements](#-acknowledgements)\n\n\u003c/details\u003e\n\n---\n\n## 📔 Key Notebooks\n\nExplore the main stages of the pipeline directly in Jupyter notebooks:\n\n- [01_scraping_exploration.ipynb](notebooks/01_scraping_exploration.ipynb) — Data exploration \u0026 Bible scraping workflow.\n- [02_cleaning.ipynb](notebooks/02_cleaning.ipynb) — Data cleaning and normalization.\n- [03_label_emotions_and_themes.ipynb](notebooks/03_label_emotions_and_themes.ipynb) — Emotion \u0026 theme annotation pipeline.\n- [05_evaluation.ipynb](notebooks/05_evaluation.ipynb) — Model evaluation: metrics, confusion matrix, reporting.\n- [viz_models.ipynb](notebooks/viz_models.ipynb) — Model outputs and visualizations.\n\n---\n\n\n## 🔍 Project Goals\n\n* Extract and normalize full Bible corpora (English + Spanish)\n* Annotate every verse with emotion and theme labels\n* Translate annotations for multilingual consistency\n* Power a semantic chatbot that suggests aligned verses in real time\n* Support additional domains like poetry or music lyrics (planned)\n\n---\n\n## 🧠 Core Technologies\n\n* **Python 3.10+**\n* `transformers`, `torch`, `sentence-transformers`\n* `pandas`, `scikit-learn`, `regex`\n* `beautifulsoup4`, `requests`\n* `streamlit` – multilingual app for emotion/theme-based verse recommendation\n\n---\n\n## 📁 Project Structure\n\n```\nLinguaAnimae/\n├── .streamlit/\n│   └── secrets.toml\n├── app/\n│   ├── assets/\n│   ├── components/\n│   │   ├── render_emotion.py\n│   │   ├── render_feedback.py\n│ 
  │   └── render_theme.py\n│   ├── app.py\n│   └── texts.py\n├── data/\n│   ├── evaluation/\n│   │   ├── verses_labeled_gpt/\n│   │   ├── verses_parsed/\n│   │   ├── verses_to_label/\n│   │   ├── eval_examples.csv\n│   │   └── eval_results.csv\n│   ├── labeled/\n│   ├── processed/\n│   └── raw/\n├── logs/\n├── notebooks/\n│   ├── 01_scraping_exploration.ipynb\n│   ├── 02_cleaning.ipynb\n│   ├── 03_label_emotions_and_themes.ipynb\n│   ├── 04_translate_labels.ipynb\n│   ├── 05_evaluation.ipynb\n│   ├── 06_emotion_finetuning_pipeline.ipynb\n│   └── viz_models.ipynb\n├── src/\n│   ├── fine_tuning/\n│   │   ├── parse_gpt_output_to_labeled_csv.py\n│   │   ├── select_verses_for_labeling.py\n│   │   └── prompt_gpt.txt\n│   ├── interface/\n│   │   └── recommender.py\n│   ├── modeling/\n│   │   ├── emotion_theme_labeling.py\n│   │   ├── labeling_pipeline.py\n│   │   └── theme_labeling.py\n│   ├── preprocessing/\n│   │   ├── cleaning.py\n│   │   ├── merge.py\n│   │   └── translate_and_apply_labels.py\n│   ├── scraping/\n│   │   ├── bible_scraper.py\n│   │   └── parse_osis_kjv.py\n│   └── utils/\n│       ├── save_feedback_to_gsheet.py\n│       └── translation_maps.py\n├── tests/\n├── .gitignore\n├── requirements.txt\n├── requirements_local.txt\n├── environment.yml\n├── README.md\n└── CHANGELOG.md\n```\n\n---\n\n## 📦 Data Folders Overview\n\n| Folder                  | Description                                                                 |\n|-------------------------|-----------------------------------------------------------------------------|\n| `data/raw/`             | Raw, unprocessed texts as scraped from original sources (KJV/RV60 Bibles).  |\n| `data/processed/`       | Cleaned and normalized texts, with basic formatting corrections.             |\n| `data/labeled/`         | Verses annotated with emotion and theme labels.                             |\n| `data/evaluation/`      | Evaluation sets, results, and samples for manual review.                    
|\n| `logs/`                 | Logs from annotation, training, and feedback collection.                    |\n| `notebooks/`            | Jupyter notebooks documenting each stage of the pipeline.                   |\n\n\n---\n\n## 🆕 Data Selection, Annotation \u0026 Versioning\n\n**Sampling, annotation, and batch tracking workflow:**\n\n- Automated random verse selection script for new annotation rounds, guaranteeing no duplication of already labeled verses.\n- Supports multiple annotation rounds with batch/version tracking (`emotion_verses_to_label_X.csv`).\n- New annotation batches can be labeled via GPT or other models, then easily merged with existing datasets.\n- Utility scripts included for remapping, cleaning, and validating emotion labels prior to model training.\n- Each annotation batch and its integration is versioned for reproducibility and experiment traceability.\n\n---\n\n## 📝 Label Mapping and Cleaning\n\n- **Robust label mapping:** All scripts and model pipelines use unified dictionaries for emotion and theme mapping (`EMOTION_MAP`, `THEME_MAP`), ensuring compatibility between annotation, translation, and modeling.\n- **Label cleaning utilities:** Automated routines for handling strange/ambiguous emotions and mapping them to the canonical set. 
Out-of-vocabulary or inconsistent labels are filtered out before training.\n\n---\n\n## 🚦 Model Training \u0026 Evaluation\n\nThe project now supports full training and evaluation workflows for emotion classification models, including:\n\n- Fine-tuning with Hugging Face Transformers on the annotated Bible corpus.\n- Optional oversampling for class balancing during training.\n- Comprehensive cross-validation pipeline using StratifiedKFold and HuggingFace Trainer, reporting mean and std of macro F1 across folds.\n- Export of classification reports and confusion matrices after each experiment for documentation and analysis.\n- Early stopping to prevent overfitting in all model workflows.\n\nSee `notebooks/05_evaluation.ipynb` and `src/fine_tuning/` for code examples and experiment tracking.\n\n---\n\n## 📸 Screenshots\n\nThe following screenshots illustrate the main functionalities of the Streamlit app at a glance:\n\n\u003cp align=\"center\"\u003e\n  \u003cb\u003e1. Home Screen: Input your message and select language\u003c/b\u003e\u003cbr\u003e\n  \u003cimg src=\"docs/screenshot_home.png\" alt=\"App Home\" width=\"600\"/\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cb\u003e2. Recommendation Screen, part 1: The app suggests a Bible verse with detected emotion and theme\u003c/b\u003e\u003cbr\u003e\n  \u003cimg src=\"docs/screenshot_recommendation.png\" alt=\"Recommendation Example\" width=\"600\"/\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cb\u003e3. Recommendation Screen, part 2\u003c/b\u003e\u003cbr\u003e\n  \u003cimg src=\"docs/screenshot_recommendation_2.png\" alt=\"Second Recommendation Example\" width=\"600\"/\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cb\u003e4. 
Feedback Confirmation: User feedback is logged for model improvement\u003c/b\u003e\u003cbr\u003e\n  \u003cimg src=\"docs/screenshot_feedback.png\" alt=\"Feedback Confirmation\" width=\"600\"/\u003e\n\u003c/p\u003e\n\n\n---\n\n## 🚀 Getting Started\n\nYou can set up the environment using either `conda` (recommended) or `pip`.\n\n### Option 1: Using Conda (recommended)\n\n```bash\nconda env create -f environment_local.yml\nconda activate linguaanimae\n```\n\n### Option 2: Using pip\n\n1. Clone the repository\n\n```bash\ngit clone https://github.com/mancrurod/LinguaAnimae.git\ncd LinguaAnimae\n```\n\n2. Create a virtual environment\n\n```bash\npython -m venv venv\nsource venv/bin/activate  # or .\\venv\\Scripts\\activate on Windows\n```\n\n3. Install dependencies\n\n```bash\npip install -r requirements.txt\n```\n\n4. Run the Bible scraper to download all books\n\n```bash\npython src/scraping/bible_scraper.py\n```\n\n---\n\n## 🧰 Usage\n\n### 1. Scrape the Bible (RV60)\n\nUse the scraping script to extract the full Reina-Valera 1960 Bible and save it as structured CSVs:\n\n```bash\npython src/scraping/bible_scraper.py\n```\n\n### 2. Label Verses with Emotions + Themes\n\nUse the labeling pipeline to classify English Bible verses (`bible_kjv`) using pretrained Hugging Face models:\n\n```bash\npython src/modeling/labeling_pipeline.py --bible bible_kjv\n```\n\nOptional flags:\n\n* `--skip-emotion` to skip emotion classification\n* `--skip-theme` to skip theme labeling\n* `--device -1` to force CPU mode (default is `--device 0` for GPU)\n* `--dry-run path/to/file.csv` to test a single file\n\n### 3. 
Translate Labels into Spanish\n\nAlign the English emotion/theme annotations with their Spanish verse equivalents in `bible_rv60`:\n\n```bash\npython src/preprocessing/translate_and_apply_labels.py\n```\n\nThis creates a labeled Spanish version under:\n\n```bash\ndata/labeled/bible_rv60/emotion_theme/\n```\n\n---\n\n## 💬 Streamlit Interface\n\nThe interactive Streamlit app allows users to input a free-form emotional message and receive recommended Bible verses matching its **emotion** and **theme**.\n\n### Features\n\n* 🔄 **Automatic translation** of input (EN/ES)\n* 🧠 **Emotion detection** (6 Ekman categories)\n* 🏷️ **Theme classification** (5 canonical themes)\n* 📖 **Context-aware verse matching** from KJV or RV60\n* 🎨 **Stylized cards** with emotion/theme color, emoji, and verse metadata\n* ✅ **User feedback collection** via like/dislike buttons (stored in Google Sheets)\n\n### Example\n\nInput:\n\n\u003e *Tengo miedo y necesito consuelo...*\n\nReturns:\n\n\u003e *Génesis 40:7*: *\"¿Por qué parecen hoy mal vuestros semblantes?\"*\n\n---\n\n## 📤 Feedback System\n\nUsers can rate the relevance of the emotion/theme detection with a 👍 / 👎 system.\nFeedback is saved to a **Google Sheet** along with:\n\n* Original input\n* Detected emotion and score\n* Detected theme and score\n* User name (optional)\n* Feedback value (`like` / `dislike`)\n\nThis enables future model refinement and analytics.\n\n---\n\n## 📊 Outputs\n\nLabeled files are saved as:\n\n* `*_emotion.csv`: Emotion column using the 6 Ekman categories\n* `*_emotion_theme.csv`: Adds a multi-label theme column drawn from the 5 canonical themes\n* Logs are saved to `logs/labeling_logs/`, with per-file runtime and a pipeline summary\n\n---\n\n## 📌 Project Status (MVP Completed)\n\n### ✅ MVP Completed (Weeks 1–6)\n- [x] Full Bible scraping (KJV + RV60) and corpus organization\n- [x] Data cleaning and normalization\n- [x] Emotion and theme labeling using pretrained HuggingFace models\n- [x] Cross-lingual label transfer and 
Spanish label alignment\n- [x] Robust manual evaluation with accuracy, macro F1, and confusion matrix reporting\n- [x] Streamlit interface: emotion + theme detection, stylized results, and interactive recommendations\n- [x] Multilingual support: automatic input translation and dynamic corpus selection (EN/ES)\n- [x] Recommendation system matching user queries by emotion and theme\n- [x] Feedback system: like/dislike buttons with logging to Google Sheets\n- [x] Model fine-tuning workflow: train/test split, metrics, early stopping, and artifact saving\n- [x] Batch random sampling, annotation pipeline, and batch version tracking\n- [x] Cross-validation pipeline (StratifiedKFold + HuggingFace Trainer) for robust evaluation\n- [x] Automated report and confusion matrix export for each experiment\n\n### 🚀 Future Work (Optional/Post-MVP)\n- Export features (PDF), voice synthesis, or word cloud summaries\n- Support for additional text domains (poetry, music, etc.)\n- Add fine-tuned model to pipeline. Use it to relabel Bible verses.\n\n[See CHANGELOG.md](CHANGELOG.md) for complete history.\n\n---\n\n## ⚠️ Known Limitations\n\nWhile Lingua Animae demonstrates robust results as an MVP, the current version has several known limitations that future work may address:\n\n- **Domain scope:** The annotation and recommendation pipeline is currently limited to biblical texts (KJV and RV60). Application to other genres (e.g., poetry, music lyrics) is planned but not yet implemented or validated.\n- **Language support:** Only English and Spanish are fully supported at this time. 
Adding other languages would require further data preparation and model adaptation.\n- **Emotion \u0026 theme taxonomy:** The emotion (6-class) and theme (5-class) taxonomies, while grounded in literature, are simplified for tractability and may not capture all nuances present in complex texts.\n- **Annotation transfer:** The cross-lingual label transfer assumes strong verse alignment between English and Spanish Bibles; rare misalignments or translation differences may impact label accuracy.\n- **Model bias:** Pretrained models used for annotation (e.g., Hugging Face Transformers) may inherit cultural or linguistic biases from their original training data, which could affect the detection of emotions or themes.\n- **Evaluation set:** Manual evaluation is limited in scale and focuses on selected books/verses. Broader user validation or external benchmarks are desirable for production-level deployment.\n- **Deployment:** The Streamlit app is designed for demonstration and user feedback. For large-scale or production use, backend scalability, security, and multi-user management would require further engineering.\n\n---\n\n## 🤝 Contributing \u0026 Testing\n\nContributions, suggestions, or bug reports are welcome!  \nTo run unit tests, use:\n\n```bash\npytest tests/\n```\n\nFor feature requests, open an issue or pull request on GitHub.\n\n---\n\n## 📖 License\n\nFor academic and research use only. Sources are derived from public domain Bibles (e.g., RV60, KJV) and open ML models from Hugging Face. The license will be finalized before v1.0.\n\n---\n\n## ✨ Acknowledgements\n\nDeveloped by [Manuel Cruz Rodríguez](https://github.com/mancrurod) as part of an NLP and Data Science learning journey.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmancrurod%2Flinguaanimae","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmancrurod%2Flinguaanimae","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmancrurod%2Flinguaanimae/lists"}