https://github.com/uznetdev/smoking-prediction

This project focuses on analyzing the "Smoking" dataset and building a predictive model for smoking status based on various health metrics. The goal is to identify factors influencing smoking behavior and develop a reliable model for prediction.

Host: GitHub
URL: https://github.com/uznetdev/smoking-prediction
Owner: UznetDev
License: mit
Created: 2024-10-27T14:23:56.000Z (9 months ago)
Default Branch: main
Last Pushed: 2024-11-10T06:54:08.000Z (8 months ago)
Last Synced: 2025-04-05T19:41:53.239Z (3 months ago)
Topics: ai, classification, data, data-science, kaggle-competition, machine-learning, ml, roc-auc, sklearn, smoking
Language: Jupyter Notebook
Homepage:
Size: 28.4 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# NT-M5-H1 Smoking Prediction Competition

This repository contains a solution to the **NT-M5-H1 Smoking Prediction Competition**, which focuses on predicting smoking status from health-related data. The project uses Python and `scikit-learn` to train and compare classifiers on the provided datasets.

## Project Overview

The goal of this project is to predict whether an individual is a smoker based on several health indicators. The dataset includes various attributes related to age, height, weight, cholesterol levels, blood pressure, and other health metrics.

### Key Features
- **Data Analysis and Visualization**: Initial data exploration and visualizations to understand feature distributions and relationships.
- **Feature Engineering**: Creation of new, relevant features to enhance model performance.
- **Modeling**: Implementation of multiple classifiers, including logistic regression, decision trees, ensemble methods, and more.
- **Hyperparameter Tuning**: Finding optimal model parameters using techniques like cross-validation and grid search.
- **Evaluation**: Assessing model performance using metrics such as ROC-AUC and accuracy.

## Repository Structure

```
NT-M5-H1-Smoking-Cmpetition/
├── notebooks/ # Jupyter notebooks for data analysis and modeling
├── data/ # Datasets used by the model
├── requirements.txt # List of required libraries and dependencies
├── smoking_model.joblib # Final model for prediction
└── README.md # Project documentation
```

## Installation

To run this project locally, follow these steps:

1. Clone this repository:
```bash
git clone https://github.com/UznetDev/NT-M5-H1-Smoking-Cmpetition.git
```
2. Navigate to the project directory:
```bash
cd NT-M5-H1-Smoking-Cmpetition
```
3. Install the required dependencies:
```bash
pip install -r requirements.txt
```

### Usage

To load and use the model for prediction, follow these steps:

```python
from joblib import load

# Load the pre-trained model from the repository root
model = load('smoking_model.joblib')

# Define input data (X) for prediction.
# scikit-learn expects a 2D structure: one row per individual,
# with the same feature columns, in the same order, as used in training.
X_new = []  # Replace with actual input values, e.g. [[...]]

# Predict smoking status
prediction = model.predict(X_new)

# Display the result for the first individual
if prediction[0] == 1:
    print("This individual is likely a smoker.")
else:
    print("This individual is likely a non-smoker.")
```

### Model Details

The `smoking_model.joblib` file contains a pre-trained model created with `scikit-learn`. Call its `predict` method to determine smoking status from input health measurements.

**Note:** The example input (`X_new`) should be replaced with real data according to your project requirements.
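The expected input shape can be illustrated with a small sketch. The README does not document the feature columns the saved model expects, so this example trains a stand-in `LogisticRegression` on random data with three hypothetical features. The data, the feature count, and the use of `predict_proba` are all assumptions, and the saved model may not expose `predict_proba`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in for the saved model (assumption): trained on random data
# with 3 hypothetical features, NOT the real competition features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] > 0).astype(int)
stand_in_model = LogisticRegression().fit(X_train, y_train)

# One individual is one row; the input must be 2D: (n_samples, n_features).
X_new = np.array([[0.8, -0.2, 1.1]])

label = stand_in_model.predict(X_new)[0]
prob_smoker = stand_in_model.predict_proba(X_new)[0, 1]
print(f"predicted label: {label}, P(class 1) = {prob_smoker:.3f}")
```

With the real model, `X_new` would need the same columns, in the same order, as the training data.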

## Dataset

The dataset consists of several features related to individuals' health, including:
- **Basic Attributes**: Age, height, weight, waist circumference, eyesight, and hearing levels.
- **Blood Pressure**: Systolic and diastolic blood pressure.
- **Blood Metrics**: Cholesterol, triglyceride, HDL, LDL, and hemoglobin levels.
- **Additional Health Indicators**: Serum creatinine, AST, ALT, Gtp, and urine protein.
- **Target**: A binary column indicating whether an individual is a smoker (1 for smoker, 0 for non-smoker).

### Feature Engineering

The following new features have been engineered to enhance the model's predictive power:
- **BMI** (Body Mass Index)
- **Waist-to-Height Ratio**
- **Cholesterol Ratio** (Total Cholesterol/HDL)
- **Liver Enzyme Ratio** (ALT/AST)
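The four ratios above could be computed with `pandas` roughly as follows. This is a minimal sketch: the column names (`height_cm`, `weight_kg`, and so on) and the sample values are hypothetical and may differ from the actual dataset.

```python
import pandas as pd

# Hypothetical column names and values; adjust to match the actual dataset.
df = pd.DataFrame({
    "height_cm": [170.0, 160.0],
    "weight_kg": [70.0, 55.0],
    "waist_cm": [85.0, 70.0],
    "cholesterol": [200.0, 180.0],
    "hdl": [50.0, 60.0],
    "ast": [25.0, 20.0],
    "alt": [30.0, 18.0],
})

# BMI = weight (kg) / height (m)^2
height_m = df["height_cm"] / 100.0
df["bmi"] = df["weight_kg"] / height_m ** 2

# Waist-to-height ratio (both in the same unit)
df["waist_height_ratio"] = df["waist_cm"] / df["height_cm"]

# Total cholesterol / HDL
df["cholesterol_ratio"] = df["cholesterol"] / df["hdl"]

# ALT / AST
df["liver_enzyme_ratio"] = df["alt"] / df["ast"]

print(df[["bmi", "waist_height_ratio", "cholesterol_ratio", "liver_enzyme_ratio"]])
```

If real data contains zeros in `hdl` or `ast`, the ratios become infinite, so those rows would need handling before training.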

## Models Used

The project explores multiple machine learning algorithms for classification:
- **Linear Models**: Logistic Regression, Ridge Classifier, SGD Classifier
- **Naive Bayes Classifiers**: GaussianNB, MultinomialNB, BernoulliNB
- **Support Vector Machines**: SVC, LinearSVC, NuSVC
- **Decision Trees and Ensembles**: Decision Tree, RandomForest, Gradient Boosting, AdaBoost, Extra Trees
- **Discriminant Analysis**: Linear Discriminant Analysis, Quadratic Discriminant Analysis
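One possible way to compare several of these classifiers on equal terms is cross-validated ROC-AUC. The sketch below uses synthetic data from `make_classification` in place of the competition data and only three of the listed models. The settings (`cv=5`, `n_estimators=100`, the scaling step) are illustrative assumptions, not the notebooks' actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the competition data (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "gaussian_nb": GaussianNB(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean ROC-AUC over 5 folds for each candidate model.
scores = {}
for name, estimator in candidates.items():
    cv_scores = cross_val_score(estimator, X, y, cv=5, scoring="roc_auc")
    scores[name] = cv_scores.mean()
    print(f"{name}: mean ROC-AUC = {scores[name]:.3f}")
```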

### Hyperparameter Tuning

To find the best parameters for the models, this project uses Optuna together with cross-validation, selecting the parameter values that maximize the ROC-AUC score.

## Evaluation

The models are evaluated based on the following metrics:
- **ROC-AUC Score**: Primary metric to assess model performance in terms of distinguishing between smokers and non-smokers.
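To make the metric concrete, the small example below computes ROC-AUC with `roc_auc_score` and again from its pairwise definition: the fraction of (smoker, non-smoker) pairs in which the smoker receives the higher score. The labels and scores are made up for illustration.

```python
from sklearn.metrics import roc_auc_score

# Illustrative labels (1 = smoker) and predicted scores, not real results.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)

# Pairwise definition: share of (positive, negative) pairs ranked correctly,
# counting ties as half.
pos = [s for s, t in zip(y_score, y_true) if t == 1]
neg = [s for s, t in zip(y_score, y_true) if t == 0]
pairs = [(p, n) for p in pos for n in neg]
manual_auc = sum(
    1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs
) / len(pairs)

print(auc, manual_auc)  # both 0.75: 3 of the 4 pairs are ranked correctly
```

ROC-AUC is computed from scores or probabilities rather than hard 0/1 labels, so it needs the model's `predict_proba` or `decision_function` output.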

## Usage

1. **Exploratory Data Analysis (EDA)**: Start with EDA notebooks in the `notebooks/` folder to understand the dataset and visualize patterns.
2. **Model Training**: Run the scripts or notebooks to train the models and evaluate their performance.
3. **Hyperparameter Tuning**: Use the scripts for hyperparameter tuning to improve model accuracy.
4. **Save model**: Save the trained model to `model.joblib`.
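The final step can be sketched with `joblib`. In this example the model is a stand-in `LogisticRegression` fitted on synthetic data, and the file is written to a temporary directory so that nothing in the repository is overwritten. Both choices are assumptions for illustration.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in model trained on synthetic data (assumption).
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Save to a temporary location, then load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.joblib")
    dump(model, model_path)
    restored = load(model_path)

# The restored model should reproduce the original predictions.
same_predictions = (restored.predict(X) == model.predict(X)).all()
print("round-trip predictions match:", same_predictions)
```

A saved model should be loaded with the same `scikit-learn` version it was trained with, which is one reason to install from `requirements.txt`.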

## Contributing

Contributions are welcome! If you'd like to improve this project, please fork the repository and make a pull request.

1. Fork the repository.
2. Create a new branch for your feature or bug fix:
```bash
git checkout -b feature-name
```
3. Commit your changes:
```bash
git commit -m "Add a new feature"
```
4. Push to your branch:
```bash
git push origin feature-name
```
5. Open a pull request.

## License

This project is licensed under the [MIT](LICENCE) License.

## Contact

If you have any questions or suggestions, please contact:
- Email: [email protected]
- GitHub Issues: [Issues section](https://github.com/UznetDev/NT-M5-H1-Smoking-Cmpetition/issues)
- GitHub Profile: [UznetDev](https://github.com/UznetDev/)
- Telegram: [UZNet_Dev](https://t.me/UZNet_Dev)
- LinkedIn: [Abdurakhmon Niyozaliev](https://www.linkedin.com/in/abdurakhmon-niyozaliyev-%F0%9F%87%B5%F0%9F%87%B8-66545222a/)

---

Thank you for your interest in this project. We hope it helps in your journey to understand and predict smoking habits using data science!
