https://github.com/3bdalrhmans3d/dataqualityproject
An interactive web application for data quality analysis, machine learning, and conversational AI, built with Streamlit.
https://github.com/3bdalrhmans3d/dataqualityproject
data-analysis data-visualization ml numpy ollama pandas python seaborn streamlit
Last synced: 10 months ago
JSON representation
An interactive web application for data quality analysis, machine learning, and conversational AI, built with Streamlit.
- Host: GitHub
- URL: https://github.com/3bdalrhmans3d/dataqualityproject
- Owner: 3bdalrhmanS3d
- Created: 2024-12-15T19:12:16.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-12-25T19:50:41.000Z (about 1 year ago)
- Last Synced: 2025-02-14T08:37:15.002Z (12 months ago)
- Topics: data-analysis, data-visualization, ml, numpy, ollama, pandas, python, seaborn, streamlit
- Language: Python
- Homepage:
- Size: 157 MB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Data Quality App
This is a Python-based web application built using **Streamlit** for performing common data quality tasks such as handling missing values, duplicates, and outliers in datasets. The app also integrates with **Ollama** for a chatbot interface to interact with the dataset and answer questions using a **Retrieval-Augmented Generation (RAG)** model.
For those who wish to try the app, you can access it [here](https://dataqualityproject.streamlit.app/).
## Demo Video
Watch the [demo](https://drive.google.com/file/d/1XfPMpp-l4iARA2FLsFLykLxrSUOTCEgC/view?usp=sharing)
## Features
### 1. Data Quality Analysis
- **Dataset Upload:** Upload CSV or Excel files.
- **Dataset Info:** View detailed dataset information including memory usage and data types.
- **Describe Dataset:** Get descriptive statistics of the dataset.
- **Handle Missing Values:** Fill or drop missing values with multiple options.
- **Handle Duplicates:** Identify and remove duplicate rows.
- **Outlier Detection:** Identify and handle outliers using various techniques.
- **Data Type Conversion:** Convert data types, normalize, and transform columns.
### 2. Data Visualization
- **Interactive Plots:** Bar plots, pie charts, histograms, box plots, scatter plots, line charts, area charts, and pair plots.
- **Correlation Matrices:** View correlation between features with heatmaps.
- **Distribution Analysis:** Analyze data distributions using density and box plots.
- **Custom Color Palettes:** Choose from various color palettes for visualizations.
### 3. Machine Learning
- **Model Comparison:** Compare multiple models (Random Forest, SVM, Logistic Regression).
- **Feature Importance:** Analyze feature importance using RandomForestClassifier.
- **Cross-Validation:** Perform cross-validation to evaluate model performance.
- **Model Performance Metrics:** View accuracy, F1 score, precision, and recall.
- **Interactive Prediction Interface:** Make predictions on new data.
### 4. RAG-powered Chat
- **Dataset Querying:** Query the dataset using natural language.
- **Context-Aware Responses:** Get context-aware responses from the dataset.
- **Code Snippet Generation:** Generate code snippets for data analysis.
- **Interactive Chat Interface:** Chat with the dataset using Ollama's RAG model.
## Prerequisites
Before running the project, make sure you have Python 3.12 installed on your system and Ollama (for RAG features).
## Installation
1. **Clone the repository (optional)**
```bash
git clone https://github.com/3bdalrhmanS3d/DataQualityProject.git
cd DataQualityProject
```
2. **Create a virtual environment**
```bash
python -m venv venv
```
3. **Activate the virtual environment**
On Windows:
```bash
venv\Scripts\activate
```
On macOS/Linux:
```bash
source venv/bin/activate
```
4. **Install the required dependencies**
```bash
pip install -r requirements.txt
```
Alternatively, install the required libraries manually:
```bash
pip install streamlit pandas ollama scikit-learn matplotlib seaborn missingno imbalanced-learn
```
5. **Verify the installed libraries**
```bash
pip list
```
6. **Run the Streamlit app**
```bash
streamlit run RAG.py
```
The app will open in your default web browser.
## Project Structure
```txt
DataQualityProject/
├── RAG.py # Main application
├── HandlingSection.py # Data handling components
├── PredictionManager.py # ML model management
├── requirements.txt # Dependencies
└── README.md # Documentation
```
## Usage
- Upload your dataset (CSV or Excel) via the sidebar.
- Select the task you want to perform from the navigation menu in the sidebar:
- **Dataset Info**: View detailed information about your dataset (columns, types, non-null counts).
- **Describe Dataset**: View the descriptive statistics of the dataset.
- **Handle Missing Values**: Choose to fill or drop missing values from columns.
- **Handle Duplicates**: Identify and remove duplicate rows.
- **Handle Outliers**: Remove outliers using the IQR method.
- **Chat using RAG**: Interact with your dataset via a chatbot powered by Ollama.
## Download Modified Dataset
After performing any changes, you can download the modified dataset by clicking the download button on the sidebar.
## Requirements
- **Python 3.12**
- **Streamlit**: For creating the web interface.
- **Pandas**: For data manipulation and analysis.
- **Ollama**: For chatbot integration using the RAG model.
---
## Data Processing Features
- **Missing Values:** Multiple imputation methods and visualizations.
- **Outliers:** IQR-based detection and handling with visual analysis.
- **Transformations:** Scaling, encoding, and normalization.
- **Feature Engineering:** Automated and manual feature engineering options.
## Machine Learning Capabilities
- **Models:**
- Random Forest
- Support Vector Machines
- Logistic Regression
- **Metrics:**
- Accuracy
- F1 Score
- Precision
- Recall
- **Visualization:**
- Confusion Matrix
- ROC Curves
- Feature Importance
## requirements.txt
```txt
streamlit
pandas
numpy
scikit-learn
matplotlib
seaborn
ollama
missingno
imbalanced-learn