https://github.com/gauthamnairvm/trex-app

Text Refinement EXplorer - An EDA tool for text based data.

# T.REX - Text Refinement and EXploration

T.REX is a local tool for analyzing, deduplicating, clustering, and querying large-scale text datasets. It combines pretrained embeddings with LLM-powered pipelines and produces metadata-aware plots.

> ❗ Requires a machine with a **dedicated NVIDIA GPU (CUDA 11.8+)**.
> ❌ Will NOT run in Docker, WSL, or headless environments due to GUI popups.

---

## 🛠 Features

- ✅ CSV popup loader with column detection
- ✅ Embedding generation (`sentence-transformers`)
- ✅ Interactive EDA on metadata + selected text column
- ✅ Clustering and optional LLM labeling
- ✅ Near-duplicate analysis
- ✅ Text-to-SQL pipeline
- ✅ Full CLI-based UX for pipeline chaining
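
To make the near-duplicate feature more concrete, here is a minimal sketch of one common approach: flagging pairs of texts whose embedding vectors have a cosine similarity above a threshold. It is not taken from `app/dedup.py`; the function names, the `0.9` threshold, and the toy vectors are assumptions chosen for illustration.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def find_near_duplicates(embeddings, threshold=0.9):
    """Return index pairs (i, j), i < j, whose similarity is >= threshold."""
    pairs = []
    for i, j in combinations(range(len(embeddings)), 2):
        if cosine_similarity(embeddings[i], embeddings[j]) >= threshold:
            pairs.append((i, j))
    return pairs


# Toy 3-dimensional "embeddings" (hypothetical; real ones have hundreds of dimensions).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],
    [0.0, 1.0, 0.0],
]
print(find_near_duplicates(toy_embeddings))
```

The pairwise loop is quadratic in the number of rows, so it only suits small samples; a tool working on large datasets would more plausibly use batched matrix operations or an approximate nearest-neighbour index.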

---

## ⚙️ Installation

### 1. Prerequisites

- Python **3.10**
- NVIDIA GPU with **CUDA 11.8+ drivers installed**
- **Display environment** (no WSL or remote/headless)

### 2. Setup Steps

```bash
# Clone the repo
git clone https://github.com/gauthamnairvm/trex-app.git
cd trex-app

# Create a virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate

# Install dependencies
pip install --upgrade pip
pip install -r requirements.txt

# Additional dependencies: CUDA 11.8 builds of PyTorch
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118
```
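
After installing, it can help to confirm that PyTorch actually sees the GPU before launching the app. The script below is a small sketch that is not part of the repository; `check_environment` is a hypothetical helper name. It relies only on `torch.cuda.is_available()` and `torch.version.cuda`, and it reports a missing PyTorch install instead of failing.

```python
import importlib.util
import sys


def check_environment():
    """Collect basic facts about the Python and PyTorch/CUDA setup."""
    report = {
        "python_version": f"{sys.version_info.major}.{sys.version_info.minor}",
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
        "cuda_build": None,
    }
    if report["torch_installed"]:
        import torch  # imported only when the package is present

        report["cuda_available"] = torch.cuda.is_available()
        report["cuda_build"] = torch.version.cuda  # None on CPU-only builds
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `cuda_available` is `False` on a machine with an NVIDIA GPU, the usual suspects are an outdated driver or a CPU-only PyTorch wheel installed ahead of the `cu118` one.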

---

## 🔐 Environment Setup

Create a `.env` file at the root (you can copy from `.env.template`):

```
GROQ_API_KEY=your_groq_api_key_here
```
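
The README does not say how `main.py` reads this file (libraries such as `python-dotenv` are a common choice). As a dependency-free sketch of the idea, the snippet below parses simple `KEY=VALUE` lines and rejects the template placeholder. `load_env_file` and `get_groq_api_key` are hypothetical helpers, not functions from the project.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines; skip blanks and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_groq_api_key(path=".env"):
    """Prefer a real environment variable, then fall back to the .env file."""
    key = os.environ.get("GROQ_API_KEY") or load_env_file(path).get("GROQ_API_KEY")
    if not key or key == "your_groq_api_key_here":
        raise RuntimeError("GROQ_API_KEY is missing or still set to the placeholder")
    return key
```

Failing early on the placeholder value gives a clearer error than an authentication failure later inside an LLM-backed pipeline.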

---

## πŸš€ Run T.REX

```bash
python main.py
```

You'll see the CSV loader and the T.REX CLI. Try:

```bash
T.REX > trex_eda(metadata=['col1', 'col2'])
T.REX > trex_cluster()
T.REX > trex_dedup(stopwords=False)
T.REX > trex_text2sql(pii_mask=True)
```
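
The commands above look like Python calls with keyword arguments. How T.REX interprets them internally is not documented here; the sketch below shows one safe way such input could be parsed, using the standard `ast` module so that nothing typed at the prompt is executed. `parse_command` is a hypothetical name, and rejecting positional arguments is an assumption of this sketch.

```python
import ast


def parse_command(text):
    """Split 'name(key=literal, ...)' into (name, kwargs) without executing it."""
    node = ast.parse(text.strip(), mode="eval").body
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError(f"not a simple command call: {text!r}")
    if node.args:
        raise ValueError("only keyword arguments are accepted in this sketch")
    # literal_eval accepts only literals (lists, strings, numbers, booleans, ...).
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return node.func.id, kwargs


print(parse_command("trex_eda(metadata=['col1', 'col2'])"))
print(parse_command("trex_dedup(stopwords=False)"))
```

The returned name could then be looked up in a dictionary of registered pipeline functions, which keeps the prompt limited to known commands.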

---

## 📂 Project Structure

```
trex-app/
├── app/
│   ├── clustering.py
│   ├── dedup.py
│   ├── embedding.py
│   ├── file_loader.py
│   ├── pipeline.py
│   └── text2sql_pipeline.py
├── data/              # CSVs and embeddings
├── results/           # Output plots and clustering results
├── main.py            # Entry point
├── .env.template      # Environment variable example
├── requirements.txt
└── README.md
```

---

## 📋 License

This project is licensed under the **MIT License**.
You're free to use, modify, and distribute it. Please give credit if you build on it.

---

## 🤝 Contributions

T.REX is open for issues, suggestions, and pull requests.
To contribute:

1. Fork the repo
2. Create a feature branch
3. Submit a PR with a clear description

---

## ⚠️ Limitations

T.REX is under active development. The current version has the following limitations:

> Limited Pipelines: Only four pipelines are supported at present: EDA, Deduplication, Clustering, and Text2SQL.

> File Format Restriction: Currently supports only .csv files. Other formats (e.g., Excel, JSON) are not yet supported.

> Single Text Column Design: Each session supports only one designated text column; the remaining columns can optionally be treated as metadata. To use a different text or metadata column, reload the file.

> Startup Instability: Occasionally, the GUI file loader popup may fail on the first try. Restarting the session usually resolves the issue.

> Fixed LLM Configuration: Uses a single Groq-hosted model (`llama3-70b-8192`). Prompts use hardcoded settings (`temperature`, `max_tokens`, `stop`) with no dynamic tuning. Pipelines with LLM integration require an API key in the `.env` file.
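
Because only `.csv` input is supported, data in other formats has to be converted before loading. The following is a minimal sketch of one such workaround, assuming the source is a JSON array of flat objects. `json_records_to_csv` is a hypothetical helper and is not part of T.REX.

```python
import csv
import json


def json_records_to_csv(json_path, csv_path):
    """Convert a JSON array of flat objects into a CSV file; return the header."""
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list) or not all(isinstance(r, dict) for r in records):
        raise ValueError("expected a JSON array of objects")
    # Use the union of keys, in first-seen order, as the CSV header.
    fieldnames = list(dict.fromkeys(key for record in records for key in record))
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(records)
    return fieldnames
```

Records that lack a key get an empty cell in that column, so the resulting file has a consistent header for the column-detection step in the loader.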

---

Built and maintained by `Variath Madhupal Gautham Nair (MSCS Rutgers University-New Brunswick)`