https://github.com/sivkri/nlp-llm-textmining_genesearch

Text Mining (PubMed Search) with NLP & LLM
https://github.com/sivkri/nlp-llm-textmining_genesearch

ensemble-machine-learning llm minilm named-entity-recognition ner nlp nlpmodels nltk pubmed pubmed-abstracts spacy textmining

Last synced: about 2 months ago
JSON representation

Text Mining (PubMed Search) with NLP & LLM

Host: GitHub
URL: https://github.com/sivkri/nlp-llm-textmining_genesearch
Owner: sivkri
License: gpl-3.0
Created: 2025-03-10T16:29:32.000Z (2 months ago)
Default Branch: main
Last Pushed: 2025-03-10T17:01:01.000Z (2 months ago)
Last Synced: 2025-03-10T17:36:59.264Z (2 months ago)
Topics: ensemble-machine-learning, llm, minilm, named-entity-recognition, ner, nlp, nlpmodels, nltk, pubmed, pubmed-abstracts, spacy, textmining
Language: Jupyter Notebook
Homepage:
Size: 31.3 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

# Text Mining (PubMed Search) with NLP & LLM 🚀

This project uses **NLP, LLM models, and text mining** to search **PubMed** for research articles related to specific **genes**.
It integrates keyword expansion using **spaCy, NLTK, and MiniLM** for enhanced search accuracy.

This project leverages **NLP, LLM models, and text mining** to efficiently search **PubMed** for research articles related to specific genes. Even though Arabidopsis search is shown in the **colab notebook**. It can be extended to other organism as well. It integrates **keyword expansion** using **spaCy, NLTK, and MiniLM** to enhance search accuracy and improve literature retrieval.

## 🌟 Features
✅ Retrieve PubMed articles using **gene symbols & TAIR IDs**
✅ Expand search queries using **NLP models (spaCy, NLTK, MiniLM)**
✅ Remove duplicate PubMed IDs & provide **clear logs**
✅ Fetch **detailed abstracts, DOI links, and matched keywords**

## Usage
**Prerequistes**
- **NCBI API Key**: Required for accessing PubMed efficiently
- **Hugging Face Token**: Needed for NLP model authentication (store in Colab secrets)

## Steps to Run
1️⃣ Enter **TAIR IDs or Gene Symbols** when prompted (*supports expansion to all organisms*)

2️⃣ Enter the **number of articles** to retrieve per search

3️⃣ The script will process **original and expanded queries** to retrieve PubMed articles

## 📊 Execution Flow
1. **User Input:** TAIR IDs / Gene Symbols
2. **Data Processing:** Extract TAIR IDs & Gene Symbols
3. **ID Conversion:** Convert TAIR IDs to Gene Symbols (if applicable)
4. **Keyword Expansion:** Use **NLP models** to generate related terms
5. **PubMed Querying:** Search using both **original & expanded keywords**
6. **Data Extraction:** Fetch article titles, abstracts, DOIs, and publication years
7. **Duplicate Removal:** Ensure **unique PubMed IDs** in results
8. **Final Output:** Display cleaned & structured search results

## 🛠️ Installation & Execution
1️⃣ **Clone the Repository**
```bash
git clone https://github.com/sivkri/NLP-LLM-TextMining_GeneSearch.git
cd NLP-LLM-TextMining_GeneSearch
```

2️⃣ **Run in Google Colab**

Simply upload and execute the provided Colab Notebook
```NLP-LLM-TextMining_GeneSearch.ipynb ```

##💡**Example Output**
```bash
Enter comma-separated TAIR IDs or gene symbols: AT4G26080, mapkkk18
Enter the number of articles to retrieve per search (e.g., 5, 10, 15): 5
✅ Results for AT4G26080: 0 (original) | 0 (expanded)
✅ Results for mapkkk18: 5 (original) | 5 (expanded)
✅ Results for ABI1: 5 (original) | 5 (expanded)
⚠️ 10 duplicate PubMed ID(s) removed.
```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/sivkri/nlp-llm-textmining_genesearch

Awesome Lists containing this project

README