https://github.com/gauthamnairvm/trex-app
Text Refinement EXplorer - An EDA tool for text based data.
data-analysis data-visualization groq-api large-language-models llama3 natural-language-processing text2sql
Last synced: 9 months ago
- Host: GitHub
- URL: https://github.com/gauthamnairvm/trex-app
- Owner: gauthamnairvm
- License: mit
- Created: 2025-05-05T15:36:35.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2025-05-06T08:59:56.000Z (9 months ago)
- Last Synced: 2025-05-06T09:47:01.668Z (9 months ago)
- Topics: data-analysis, data-visualization, groq-api, large-language-models, llama3, natural-language-processing, text2sql
- Language: Python
- Homepage:
- Size: 22.5 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# T.REX: Text Refinement and EXploration
T.REX is a local tool for analyzing, deduplicating, clustering, and querying large-scale text datasets using pretrained embeddings and LLM-powered pipelines, with metadata-aware plots and integrations.
> Requires a machine with a **dedicated NVIDIA GPU (CUDA 11.8+)**.
> Will NOT run in Docker, WSL, or headless environments because it relies on GUI popups.
---
## Features
- CSV popup loader with column detection
- Embedding generation (`sentence-transformers`)
- Interactive EDA on metadata + selected text column
- Clustering and optional LLM labeling
- Near-duplicate analysis
- Text-to-SQL pipeline
- Full CLI-based UX for pipeline chaining
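To make the near-duplicate feature more concrete, here is a minimal sketch of one common approach: comparing embedding vectors by cosine similarity and flagging pairs above a threshold. This is an illustration only, not T.REX's actual implementation. The function names, the `0.95` threshold, and the toy vectors are all assumptions, and real embeddings from `sentence-transformers` would have hundreds of dimensions.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def find_near_duplicates(embeddings, threshold=0.95):
    """Return index pairs (i, j), i < j, whose similarity meets the threshold."""
    pairs = []
    for i, j in combinations(range(len(embeddings)), 2):
        if cosine_similarity(embeddings[i], embeddings[j]) >= threshold:
            pairs.append((i, j))
    return pairs


# Toy 3-dimensional vectors standing in for real sentence embeddings (hypothetical data).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],
    [0.0, 1.0, 0.0],
]
near_duplicate_pairs = find_near_duplicates(toy_embeddings, threshold=0.95)
print(near_duplicate_pairs)
```

The pairwise loop is quadratic in the number of rows, so for large datasets an approximate nearest-neighbor index would likely be a better fit.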
---
## Installation
### 1. Prerequisites
- Python **3.10**
- NVIDIA GPU with **CUDA 11.8+ drivers installed**
- **Display environment** (no WSL or remote/headless)
### 2. Setup Steps
```bash
# Clone the repo
git clone https://github.com/gauthamnairvm/trex-app.git
cd trex-app
# Create a virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
# Additional dependencies: CUDA 11.8 builds of PyTorch
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118
```
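After installing, it may help to confirm that PyTorch can see the GPU before launching the app. The sketch below is not part of the repository; `cuda_status` is a hypothetical helper that relies only on the standard `torch.cuda.is_available()` and `torch.cuda.get_device_name()` calls.

```python
def cuda_status():
    """Return (available, detail) describing whether PyTorch can see a CUDA GPU."""
    try:
        import torch  # imported lazily so the check still runs if the install failed
    except ImportError:
        return False, "PyTorch is not installed in this environment"
    if not torch.cuda.is_available():
        return False, "PyTorch is installed but no CUDA device is visible"
    return True, torch.cuda.get_device_name(0)


available, detail = cuda_status()
print(f"CUDA available: {available} ({detail})")
```

If this reports `False`, check the NVIDIA driver version and that the `cu118` wheels were installed rather than the CPU-only builds.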
---
## Environment Setup
Create a `.env` file at the root (you can copy from `.env.template`):
```
GROQ_API_KEY=your_groq_api_key_here
```
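As a rough illustration of how a key in this format can be read and validated, the sketch below parses a `.env` file with the standard library only. How T.REX itself loads the file is not shown in this README, so `load_env_file` and `require_groq_key` are hypothetical helpers, and the demo writes a throwaway file rather than touching your real `.env`.

```python
import tempfile
from pathlib import Path


def load_env_file(path):
    """Parse KEY=VALUE lines into a dict, skipping blank lines and comments."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_groq_key(env):
    """Return the Groq key, or raise if it is missing or still the template placeholder."""
    key = env.get("GROQ_API_KEY", "")
    if not key or key == "your_groq_api_key_here":
        raise RuntimeError("GROQ_API_KEY is missing or still set to the placeholder")
    return key


# Demo against a temporary file (the key value here is made up).
with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text("# demo file\nGROQ_API_KEY=demo-key-123\n", encoding="utf-8")
    demo_env = load_env_file(env_path)

print(require_groq_key(demo_env))
```

Failing early on the placeholder value avoids a confusing authentication error later, when an LLM-backed pipeline first calls the API.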
---
## Run T.REX
```bash
python main.py
```
You'll see the CSV loader and the T.REX CLI. Try:
```bash
T.REX > trex_eda(metadata=['col1', 'col2'])
T.REX > trex_cluster()
T.REX > trex_dedup(stopwords=False)
T.REX > trex_text2sql(pii_mask=True)
```
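The commands above look like Python function calls with keyword arguments. One way such input could be interpreted without executing arbitrary code is to parse it with the standard `ast` module and accept only literal values. This is a sketch under that assumption, not a description of how T.REX's CLI actually works; `parse_command` is a hypothetical name.

```python
import ast


def parse_command(text):
    """Parse 'name(kw=literal, ...)' into (name, kwargs) without executing it."""
    tree = ast.parse(text.strip(), mode="eval")
    call = tree.body
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        raise ValueError(f"not a simple command call: {text!r}")
    if call.args:
        raise ValueError("only keyword arguments are supported in this sketch")
    kwargs = {}
    for kw in call.keywords:
        if kw.arg is None:
            raise ValueError("**kwargs unpacking is not supported")
        # literal_eval accepts only literals (strings, numbers, lists, booleans, ...)
        kwargs[kw.arg] = ast.literal_eval(kw.value)
    return call.func.id, kwargs


for line in [
    "trex_eda(metadata=['col1', 'col2'])",
    "trex_cluster()",
    "trex_dedup(stopwords=False)",
    "trex_text2sql(pii_mask=True)",
]:
    print(parse_command(line))
```

Restricting arguments to literals means a typo or a pasted expression produces a `ValueError` instead of running code.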
---
## Project Structure
```
trex-app/
├── app/
│   ├── clustering.py
│   ├── dedup.py
│   ├── embedding.py
│   ├── file_loader.py
│   ├── pipeline.py
│   └── text2sql_pipeline.py
├── data/                  # CSVs and embeddings
├── results/               # Output plots and clustering results
├── main.py                # Entry point
├── .env.template          # Environment variable example
├── requirements.txt
└── README.md
```
---
## License
This project is licensed under the **MIT License**.
You're free to use, modify, and distribute it. Please give credit if you build on it.
---
## Contributions
T.REX is open for issues, suggestions, and pull requests.
To contribute:
1. Fork the repo
2. Create a feature branch
3. Submit a PR with a clear description
---
## Limitations
T.REX is under active development. The current version has the following limitations:
> Limited Pipelines: Only four pipelines are currently supported: EDA, Deduplication, Clustering, and Text2SQL.
> File Format Restriction: Only `.csv` files are supported. Other formats (e.g., Excel, JSON) are not yet supported.
> Single Text Column Design: Each session supports one designated text column; the remaining columns can optionally be treated as metadata. To use a different text or metadata column, reload the file.
> Startup Instability: The GUI file loader popup occasionally fails on the first try. Restarting the session usually resolves the issue.
> Fixed LLM Configuration: Uses a single Groq-hosted model (`llama3-70b-8192`). Prompts use hardcoded settings (`temperature`, `max_tokens`, `stop`) with no dynamic tuning. Pipelines with LLM integration require an API key in the `.env` file.
---
Built and maintained by `Variath Madhupal Gautham Nair (MSCS Rutgers University-New Brunswick)`