{"id":50479138,"url":"https://github.com/parthapray/ecotroph-rag","last_synced_at":"2026-06-01T16:02:17.618Z","repository":{"id":359817642,"uuid":"1247616216","full_name":"ParthaPRay/EcoTroph-RAG","owner":"ParthaPRay","description":"This repo shows the coding of EcoTroph-RAG: A Retrieval-Augmented Ecological Intelligence Framework for Freshwater Fish Diet Analysis","archived":false,"fork":false,"pushed_at":"2026-05-23T15:11:32.000Z","size":2512,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-23T17:08:52.613Z","etag":null,"topics":["bart-large-cnn","bge-m3","bm25","diet","ecological","embedding-models","fish","huggingface","llm","minilm-l6-v2","nomic-ai-nomic-embed-text-v15","rag","summarization","t5-base"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ParthaPRay.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-23T14:55:00.000Z","updated_at":"2026-05-23T15:13:58.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/ParthaPRay/EcoTroph-RAG","commit_stats":null,"previous_names":["parthapray/ecotroph-rag"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/ParthaPRay/EcoTroph-RAG","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ParthaPRay%2FEcoTroph-RAG","tags_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/repositories/ParthaPRay%2FEcoTroph-RAG/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ParthaPRay%2FEcoTroph-RAG/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ParthaPRay%2FEcoTroph-RAG/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ParthaPRay","download_url":"https://codeload.github.com/ParthaPRay/EcoTroph-RAG/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ParthaPRay%2FEcoTroph-RAG/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33782317,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-01T02:00:06.963Z","response_time":115,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bart-large-cnn","bge-m3","bm25","diet","ecological","embedding-models","fish","huggingface","llm","minilm-l6-v2","nomic-ai-nomic-embed-text-v15","rag","summarization","t5-base"],"created_at":"2026-06-01T16:02:15.622Z","updated_at":"2026-06-01T16:02:17.588Z","avatar_url":"https://github.com/ParthaPRay.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# EcoTroph-RAG: Retrieval-Augmented Ecological Intelligence for Freshwater Fish Diet Analysis\n\n## Repository\n\nThis repository contains the implementation of **EcoTroph-RAG**, 
a retrieval-augmented ecological intelligence framework for freshwater fish diet analysis.\n\nRepository URL:\n\n```text\nhttps://github.com/ParthaPRay/EcoTroph-RAG/\n````\n\nThe main executable notebook is:\n\n```text\nEcoTroph_RAG.ipynb\n```\n\nThe dataset file included in this repository is:\n\n```text\ntrophish_dataset.csv\n```\n\nThe dataset was manually downloaded from the original TroPhish GitHub repository and uploaded here for reproducibility.\n\n---\n\n## Overview\n\n**EcoTroph-RAG** is a lightweight retrieval-augmented generation framework designed to transform structured freshwater fish diet records into a semantically searchable ecological knowledge system.\n\nThe framework performs:\n\n1. dataset loading and cleaning,\n2. tabular row-to-ecological-text conversion,\n3. embedding generation,\n4. vector indexing,\n5. semantic retrieval,\n6. keyword baseline retrieval,\n7. abstractive summarization,\n8. model comparison,\n9. statistical testing,\n10. SHAP-based explainability.\n\nThe goal is to support ecological diet search, trophic interaction analysis, freshwater fish feeding pattern retrieval, and evidence-grounded summarization.\n\n---\n\n## Dataset Source\n\nThis work uses the **TroPhish** dataset created by Jacob Ridgway and Jeff Wesner.\n\nOriginal dataset repository:\n\n```text\nhttps://github.com/jswesner/TroPhish\n```\n\nIn the original repository, the dataset is located at:\n\n```text\ndata/trophish_dataset.csv\n```\n\nDataset citation:\n\n```text\nRidgway, Jacob M. 2022. “TROPHISH: BUILDING A GLOBAL DATABASE OF FRESHWATER TROPHIC INTERACTIONS.” Honors Thesis. 
https://red.library.usd.edu/honors-thesis/259.\n```\n\nThe TroPhish dataset contains dietary data extracted from literature reports ranging from the 1890s to the present and covers hundreds of freshwater fish species.\n\n---\n\n## Dataset Used in This Repository\n\nFor this repository, the dataset file is provided as:\n\n```text\ntrophish_dataset.csv\n```\n\nThe file contains **54,751 rows including the header row**.\n\nThat means the data contains approximately:\n\n```text\n54,750 dietary records + 1 header row\n```\n\n---\n\n## Dataset Columns\n\nThe dataset contains the following columns:\n\n```text\nfish_species\nprey_kingdom\nprey_taxon\nprey_class\nprey_origin\nprey_stage\ndiet_value\ndiet_units\ndiet_type\ndiet_percent\nrecord_id\nsource_id\nfish_id\nstart_date\nend_date\nsampling_interval\ndata_sorted_by\nfish_min_length\nfish_average_length\nfish_max_length\nfish_length_units\nfish_length_measure\nhabitat_broad\nhabitat\nlongitude\nlatitude\n```\n\nThese columns describe fish identity, prey identity, prey taxonomy, diet contribution, sampling information, fish length, habitat type, and geographic location.\n\n---\n\n## Framework Architecture\n\n```text\ntrophish_dataset.csv\n        ↓\nData cleaning and normalization\n        ↓\nTabular row-to-ecological-text transformation\n        ↓\nEmbedding generation using Hugging Face models\n        ↓\nVector indexing using Chroma\n        ↓\nEcological query input\n        ↓\nTop-k semantic retrieval\n        ↓\nEvidence-grounded summarization\n        ↓\nEvaluation, statistical testing, and explainability\n```\n\n---\n\n## Row-to-Text Transformation\n\nEach tabular dietary record is converted into a natural-language ecological text unit.\n\nExample:\n\n```text\nFish species Notropis biguttatus consumed prey taxon ephemeroptera from prey kingdom Metazoa and prey class Insecta. The prey origin was aquatic and prey stage was not reported. The diet value was 8.6 percent, measured as volume, with diet percent 8.6. 
The habitat was lotic. The geographic location was longitude -78.00501 and latitude 43.29869.\n```\n\nThis transformation allows sentence-embedding models to process structured ecological records as semantic text.\n\n---\n\n## Retrieval-Augmented Generation Design\n\nEcoTroph-RAG uses a retrieval-augmented generation workflow.\n\nIn this framework:\n\n1. each TroPhish row is converted into ecological text;\n2. the text is embedded into a dense vector representation;\n3. embeddings are stored in a Chroma vector database;\n4. a user ecological query is embedded;\n5. top-k relevant records are retrieved;\n6. retrieved records are summarized using abstractive summarization models.\n\nThe generated responses are therefore grounded in actual TroPhish records.\n\n---\n\n## Retrieval Models Evaluated\n\nThe notebook compares multiple retrieval approaches:\n\n| Method        | Description                                                       |\n| ------------- | ----------------------------------------------------------------- |\n| BM25          | Keyword-based lexical retrieval baseline                          |\n| MiniLM-Chroma | Semantic retrieval using `sentence-transformers/all-MiniLM-L6-v2` |\n| BGE-M3        | Dense retrieval using `BAAI/bge-m3`                               |\n| Nomic-v1.5    | Retrieval using `nomic-ai/nomic-embed-text-v1.5`                  |\n\n---\n\n## Summarization Models Evaluated\n\nThe notebook evaluates two abstractive summarization models:\n\n| Model                     | Use                                  |\n| ------------------------- | ------------------------------------ |\n| `facebook/bart-large-cnn` | BART-based abstractive summarization |\n| `google-t5/t5-base`       | T5-based abstractive summarization   |\n\nBoth models summarize the same retrieved ecological evidence, allowing fair comparison.\n\n---\n\n## Evaluation Queries\n\nA set of dataset-grounded ecological queries is used for evaluation.\n\nExample queries 
include:\n\n```text\nWhich fish species consume crustaceans in lotic habitats?\nWhich fish consume aquatic insect larvae?\nWhich fish species consume Odonata prey?\nWhich fish consume Ephemeroptera in lotic habitats?\nWhich fish consume filamentous algae in lotic habitats?\nWhich records describe Lepomis macrochirus consuming Odonata larvae in creeks?\n```\n\nEach query is validated against the dataset using matching terms to ensure that relevant records exist.\n\n---\n\n## Retrieval Evaluation Metrics\n\nRetrieval models are evaluated using:\n\n| Metric       | Meaning                                                |\n| ------------ | ------------------------------------------------------ |\n| Precision@10 | Fraction of top-10 retrieved records that are relevant |\n| HitRate@10   | Whether at least one relevant record appears in top-10 |\n| MRR          | Mean Reciprocal Rank of the first relevant record      |\n| nDCG@10      | Ranking quality of retrieved evidence                  |\n| Latency      | Query execution time                                   |\n\n---\n\n## Summarization Evaluation Metrics\n\nSummarization models are evaluated using:\n\n| Metric                | Meaning                                    |\n| --------------------- | ------------------------------------------ |\n| ROUGE-1 F1            | Unigram overlap                            |\n| ROUGE-2 F1            | Bigram overlap                             |\n| ROUGE-L F1            | Longest common subsequence overlap         |\n| Compression Ratio     | Summary length relative to evidence length |\n| Summarization Latency | Time required to generate summary          |\n\n---\n\n## Statistical Testing\n\nThe notebook performs enriched statistical testing for retrieval and summarization comparisons.\n\nStatistical analyses include:\n\n* Shapiro normality test\n* D’Agostino normality test\n* paired t-test\n* Wilcoxon signed-rank test\n* bootstrap confidence intervals\n* Cohen’s d\n* Hedges’ 
g\n* rank-biserial correlation\n* paired Cliff’s delta\n* Pearson correlation\n* Spearman correlation\n* win/tie/loss counts\n\nThese tests help assess whether observed performance differences between models are meaningful.\n\n---\n\n## Explainability\n\nSHAP-based surrogate explainability is included.\n\nThe SHAP analysis explains which factors influence:\n\n1. retrieval performance,\n2. summarization quality.\n\nImportant note:\n\n```text\nSHAP is applied to surrogate machine-learning models trained on query-level evaluation outputs. It does not explain the internal transformer parameters directly.\n```\n\n---\n\n## Main Notebook\n\nRun:\n\n```text\nEcoTroph_RAG.ipynb\n```\n\nThe notebook includes:\n\n1. package installation,\n2. dataset loading,\n3. dataset statistics,\n4. ecological text generation,\n5. Chroma indexing,\n6. BM25 retrieval,\n7. semantic retrieval,\n8. embedding model comparison,\n9. BART summarization,\n10. T5 summarization,\n11. ROUGE evaluation,\n12. latency analysis,\n13. statistical testing,\n14. SHAP explainability,\n15. 
export of result tables and figures.\n\n---\n\n## Installation\n\nRecommended environment:\n\n```text\nGoogle Colab\nPython 3.x\nGPU runtime preferred\n```\n\nInstall dependencies:\n\n```bash\npip install pandas numpy chromadb sentence-transformers transformers torch scikit-learn tqdm rank-bm25 rouge-score psutil matplotlib seaborn shap FlagEmbedding\n```\n\n---\n\n## How to Run\n\nClone the repository:\n\n```bash\ngit clone https://github.com/ParthaPRay/EcoTroph-RAG.git\ncd EcoTroph-RAG\n```\n\nOpen the notebook:\n\n```text\nEcoTroph_RAG.ipynb\n```\n\nRun all cells sequentially.\n\nMake sure the dataset file is available in the repository root:\n\n```text\ntrophish_dataset.csv\n```\n\n---\n\n## Expected Outputs\n\nThe notebook generates:\n\n```text\ndataset statistics\nquery validation table\nretrieval evaluation table\nretrieval summary table\nembedding model statistical tests\nsummarization evaluation table\nBART vs T5 comparison table\nsummarizer statistical tests\nSHAP plots\npublication-grade figures\nCSV result files\n```\n\n---\n\n## Suggested Repository Structure\n\n```text\nEcoTroph-RAG/\n│\n├── EcoTroph_RAG.ipynb\n├── trophish_dataset.csv\n├── README.md\n│\n├── results/\n│   ├── dataset_statistics.csv\n│   ├── query_validation_dataset_coverage.csv\n│   ├── retrieval_summary.csv\n│   ├── query_level_retrieval_evaluation.csv\n│   ├── embedding_model_statistical_tests_publication.csv\n│   ├── bart_t5_summarizer_summary_table.csv\n│   └── bart_t5_summarizer_statistical_tests.csv\n│\n└── figures/\n    ├── figure_retrieval_performance.png\n    ├── figure_query_latency.png\n    ├── figure_rouge_scores.png\n    └── figure_shap_summary.png\n```\n\n---\n\n## Research Contribution\n\nEcoTroph-RAG contributes:\n\n1. a row-to-text ecological representation method for freshwater fish diet records;\n2. a retrieval-augmented framework for freshwater trophic intelligence;\n3. comparison of keyword, MiniLM, BGE-M3, and Nomic embedding retrieval;\n4. 
comparison of BART and T5 summarization for ecological evidence;\n5. statistical evaluation of retrieval and summarization performance;\n6. SHAP-based explainability of query-level outcomes.\n\n---\n\n## Possible Paper Title\n\n```text\nEcoTroph-RAG: A Retrieval-Augmented Ecological Intelligence Framework for Freshwater Fish Diet Analysis\n```\n\n---\n\n## Citation\n\nIf you use this repository, please cite the original TroPhish dataset source:\n\n```text\nRidgway, Jacob M. 2022. “TROPHISH: BUILDING A GLOBAL DATABASE OF FRESHWATER TROPHIC INTERACTIONS.” Honors Thesis. https://red.library.usd.edu/honors-thesis/259.\n```\n\nOriginal TroPhish GitHub repository:\n\n```text\nhttps://github.com/jswesner/TroPhish\n```\n\n---\n\n## Acknowledgement\n\nThe TroPhish dataset was developed by Jacob Ridgway and Jeff Wesner. This repository builds upon their freshwater trophic interaction dataset to explore semantic retrieval, retrieval-augmented generation, summarization, benchmarking, and explainable AI for freshwater fish diet analysis.\n\n---\n\n## Disclaimer\n\nThis repository does not claim ownership of the original TroPhish dataset. The dataset was obtained from the publicly available TroPhish repository and is included here only for reproducibility of the EcoTroph-RAG experiments. Users should consult the original TroPhish repository and thesis for dataset provenance, licensing, and full methodological details.\n\n\n## Citation\n\nIf you use this repository, please cite:\n\n```text\nRay, Partha Pratim. EcoTroph-RAG: Retrieval-Augmented Ecological Intelligence for Freshwater Fish Diet Analysis. May 23, 2026. GitHub repository. 
Available at: https://github.com/ParthaPRay/EcoTroph-RAG/\n````\n\n### BibTeX\n\n```bibtex\n@misc{ray2026ecotrophrag,\n  author       = {Partha Pratim Ray},\n  title        = {EcoTroph-RAG: Retrieval-Augmented Ecological Intelligence for Freshwater Fish Diet Analysis},\n  year         = {2026},\n  month        = {may},\n  howpublished = {\\url{https://github.com/ParthaPRay/EcoTroph-RAG/}},\n  note         = {GitHub repository}\n}\n```\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fparthapray%2Fecotroph-rag","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fparthapray%2Fecotroph-rag","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fparthapray%2Fecotroph-rag/lists"}