https://github.com/imnotamr/datasets-used

A comprehensive collection of datasets for machine learning and data science projects, covering topics from advertising and sales to health and sports analytics

ai classification data-analysis data-science data-visualization deep-learning jupyter-notebook machine-learning models python regression-models

# 📊 Datasets for Machine Learning Projects
This directory contains datasets used for machine learning, data analysis, and visualization tasks. The datasets span several domains and offer a range of options for exploratory data analysis, regression, classification, and more.

## "I haven't uploaded all of the datasets I've used yet :)" ##

# 📁 Dataset Descriptions
1) Advertising.csv
• Description: Contains data on advertising expenditures across different channels (TV, Radio, Newspaper) and their impact on sales.
• Usage: Great for regression analysis and predictive modeling to analyze the impact of advertising on sales.

2) SVM_Data.csv
• Description: A dataset designed for Support Vector Machine (SVM) classification tasks.
• Usage: Suitable for experimenting with SVM algorithms, tuning hyperparameters, and evaluating model accuracy.

3) Salary.csv
• Description: Contains information on individuals' salary levels based on factors like education, experience, and position.
• Usage: Ideal for regression modeling, predicting salary levels based on features, and understanding salary trends.

4) adult.csv
• Description: UCI Adult dataset, used to predict income levels based on demographic features.
• Usage: Commonly used for classification tasks, especially binary classification of income levels (e.g., above or below 50K).

5) car_data.csv
• Description: Information on various car attributes and evaluations.
• Usage: Useful for classification tasks, such as evaluating car acceptability based on features like buying price, maintenance cost, and safety.

6) insurance.csv
• Description: Insurance data, including demographic and medical history factors, along with insurance charges.
• Usage: Great for regression analysis, predicting insurance costs, and understanding factors influencing charges.

7) kc_house_data.csv
• Description: Contains house prices and associated features from the King County housing dataset.
• Usage: Suitable for regression modeling, predicting house prices, and feature importance analysis.

8) titanic.csv
• Description: Famous Titanic dataset containing details about passengers and their survival status.
• Usage: Ideal for binary classification, survival analysis, and feature engineering experiments.

9) IBM Dataset.csv
• Description: IBM HR Analytics dataset, used to predict employee attrition and understand HR factors.
• Usage: Useful for classification tasks, especially for predicting employee turnover.

10) Iris.csv
• Description: Classic Iris dataset containing measurements of different iris flower species.
• Usage: Commonly used for classification tasks and exploring clustering algorithms.

11) SampleSuperstore.csv
• Description: Data from a sample superstore, including sales, profits, and product categories.
• Usage: Useful for analyzing sales performance, regional trends, and market segmentation.

12) data.csv
• Description: General dataset for testing and exploratory analysis.
• Usage: Used for various data analysis tasks.

13) diabetes.csv
• Description: Diabetes dataset containing health-related features used to predict diabetes onset.
• Usage: Suitable for binary classification and medical analysis tasks.

14) fifa.csv
• Description: FIFA dataset with player statistics.
• Usage: Ideal for data exploration, player comparison, and sports analytics.

15) pendigits-test.csv & pendigits-train.csv
• Description: Handwritten digit recognition dataset with training and testing splits.
• Usage: Commonly used for classification tasks and testing digit recognition algorithms.

16) regression_project_data.csv
• Description: Dataset for regression analysis.
• Usage: Used for regression modeling and feature analysis.

17) tips.csv
• Description: Dataset containing restaurant tips, including features like total bill, tip amount, and day.
• Usage: Great for exploring regression, correlation, and understanding factors affecting tips.

18) titanic.csv (duplicate)
• Description: Duplicate of the Titanic dataset, used for different experimental setups.
• Usage: Same as the first titanic.csv entry: binary classification and survival analysis.
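
Since not every dataset has been uploaded yet, it can be handy to check which of the files described above are actually present in a local copy. The following is a minimal sketch using only the Python standard library. The helper name `find_missing` and the `datasets` folder name are assumptions made for illustration and are not part of the repository.

```python
from pathlib import Path

# File names copied from the dataset descriptions above.
EXPECTED_FILES = [
    "Advertising.csv", "SVM_Data.csv", "Salary.csv", "adult.csv",
    "car_data.csv", "insurance.csv", "kc_house_data.csv", "titanic.csv",
    "IBM Dataset.csv", "Iris.csv", "SampleSuperstore.csv", "data.csv",
    "diabetes.csv", "fifa.csv", "pendigits-test.csv", "pendigits-train.csv",
    "regression_project_data.csv", "tips.csv",
]


def find_missing(directory, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `directory`."""
    present = {p.name for p in Path(directory).iterdir() if p.is_file()}
    return [name for name in expected if name not in present]


if __name__ == "__main__":
    # "datasets" is a hypothetical folder name; point it at your local copy.
    folder = Path("datasets")
    if folder.is_dir():
        print("Missing files:", find_missing(folder))
    else:
        print(f"Folder not found: {folder}")
```

The comparison is by exact file name, so a file stored under a different name or in a subfolder would be reported as missing.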

# 📚 How to Use
## Load the Data
Each dataset is in CSV format and can be loaded with pandas in Python:

```python
import pandas as pd

# Replace the placeholder path with the location of the CSV file to load
data = pd.read_csv('path/to/dataset.csv')
```

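
To work with several datasets at once, the same `pd.read_csv` call can be applied to every CSV file in a folder. This is a minimal sketch, not part of the repository: the function name `load_all_csvs` and the `datasets` folder name are assumptions, and it relies on pandas' default parsing options, which may not suit every file.

```python
from pathlib import Path

import pandas as pd


def load_all_csvs(directory):
    """Read every .csv file in `directory` into a dict keyed by file name."""
    frames = {}
    for csv_path in sorted(Path(directory).glob("*.csv")):
        frames[csv_path.name] = pd.read_csv(csv_path)
    return frames


if __name__ == "__main__":
    # "datasets" is a hypothetical folder name; adjust it to your local copy.
    for name, frame in load_all_csvs("datasets").items():
        print(f"{name}: {frame.shape[0]} rows x {frame.shape[1]} columns")
```

Keying the result by file name keeps the frames easy to look up, for example `frames["tips.csv"]`.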
## Exploratory Data Analysis
Run an initial analysis to understand the features, data types, and any preprocessing the data needs:

```python
data.info()             # column names, dtypes, and non-null counts
print(data.describe())  # summary statistics for the numeric columns
```

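
A per-column overview of missing and distinct values is often a useful next step when deciding on preprocessing. The sketch below is one possible way to build it with pandas. The helper name `summarize_columns` is an assumption, and the example frame is made up purely to show the shape of the output.

```python
import pandas as pd


def summarize_columns(data):
    """Return one row per column: dtype, missing count, and unique count."""
    return pd.DataFrame({
        "dtype": data.dtypes.astype(str),
        "missing": data.isna().sum(),
        "unique": data.nunique(),  # counts distinct non-missing values
    })


if __name__ == "__main__":
    # Tiny made-up frame, only to illustrate the output layout.
    example = pd.DataFrame({
        "age": [25, 32, None, 41],
        "group": ["a", "b", "a", None],
    })
    print(summarize_columns(example))
```

Columns with many missing values or a single unique value are usually the first candidates for imputation or removal.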
## Machine Learning
These datasets are suitable for a variety of machine learning tasks such as regression, classification, clustering, and feature engineering.
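
As an illustration of a regression workflow, the sketch below splits a frame into training and test sets, fits a linear model with scikit-learn, and reports the test R². It is a baseline under stated assumptions rather than a recommended pipeline: the function name `fit_baseline_regression` is hypothetical, and the demo uses synthetic data with assumed column names, since the actual headers of the CSV files are not documented here.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def fit_baseline_regression(data, feature_columns, target_column, seed=0):
    """Fit a linear regression on a train split and return (model, test R^2)."""
    X = data[feature_columns]
    y = data[target_column]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    return model, r2_score(y_test, model.predict(X_test))


if __name__ == "__main__":
    # Synthetic stand-in data; the column names are assumptions, not the
    # actual headers of Advertising.csv.
    demo = pd.DataFrame({"TV": range(40), "Radio": [i % 7 for i in range(40)]})
    demo["Sales"] = 2.0 * demo["TV"] + 0.5 * demo["Radio"] + 1.0
    model, r2 = fit_baseline_regression(demo, ["TV", "Radio"], "Sales")
    print(f"Test R^2: {r2:.3f}")
```

With a real file, pass the loaded DataFrame along with its actual feature and target column names.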

# 💡 Suggested Applications
• Regression Models: Advertising.csv, insurance.csv, kc_house_data.csv
• Classification Models: adult.csv, car_data.csv, titanic.csv, Iris.csv, diabetes.csv
• Exploratory Data Analysis: SampleSuperstore.csv, fifa.csv
• Handwritten Digit Recognition (multiclass classification): pendigits-train.csv and pendigits-test.csv
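
The pendigits files already come as separate training and testing splits, so a model can be fit on one file and scored on the other without a random split. The sketch below shows this with a k-nearest-neighbors classifier from scikit-learn. It assumes the label is stored in the last column of both files, which should be checked against the actual CSVs. The function name `evaluate_presplit` is hypothetical, and k-NN is only one reasonable choice of classifier.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier


def evaluate_presplit(train, test, n_neighbors=3):
    """Fit k-NN on `train` and return accuracy on `test`.

    Assumes the label is the last column of both frames.
    """
    X_train, y_train = train.iloc[:, :-1], train.iloc[:, -1]
    X_test, y_test = test.iloc[:, :-1], test.iloc[:, -1]
    model = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X_train, y_train)
    return model.score(X_test, y_test)


# Example usage (paths and header handling depend on your local files):
# train = pd.read_csv("pendigits-train.csv")
# test = pd.read_csv("pendigits-test.csv")
# print(evaluate_presplit(train, test))
```

If the files have no header row, pass `header=None` to `pd.read_csv` so the first record is not consumed as column names.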
# 📄 License
These datasets are publicly available for educational and research purposes. Please ensure proper citation when using these datasets for publications or sharing results.