https://github.com/lovnishverma/datasets

This repository contains various datasets for data analysis, machine learning, and educational purposes

# My Datasets Repository

This repository contains various datasets for data analysis, machine learning, and educational purposes. Below is a brief description of each dataset available in this repository.

### Want to download a CSV file for local use? Follow the steps below: 👇


  1. Open the CSV file you want in the repository.

  2. In the toolbar at the top right, just above the file contents, click the "Raw" button.

  3. A page opens showing the plain comma-separated data with no styling.

  4. Copy the page URL.

  5. Create a folder on your desktop.

  6. Open that folder in your favourite code editor and create a Python file inside it. Name it as you please.

  7. Copy the code from the section below into that file and paste the URL from step 4 in place of the placeholder.

  8. Run the Python file.

  9. The CSV file will be downloaded after a short while, depending on its size.

  10. Now you are ready to use it locally!

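```python
import requests
import pandas as pd

# Paste the raw file URL copied in step 4
url = '{(copied url here)}'
res = requests.get(url, allow_redirects=True)
res.raise_for_status()  # stop here if the download failed

# Save the response body to a local CSV file
with open('download_file_name.csv', 'wb') as file:
    file.write(res.content)

# Load the saved file into a DataFrame
download_file_name = pd.read_csv('download_file_name.csv')
```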
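
Steps 2 to 4 can also be done in code. The sketch below rewrites a GitHub file-page URL into the matching raw URL, assuming the usual `/<owner>/<repo>/blob/<branch>/<path>` layout. The helper name `to_raw_url` and the branch name `main` in the example are illustrative assumptions, not something this repository defines.

```python
from urllib.parse import urlsplit


def to_raw_url(page_url: str) -> str:
    """Convert a GitHub file-page URL into its raw.githubusercontent.com form."""
    parts = urlsplit(page_url)
    # Expected path layout: /<owner>/<repo>/blob/<branch>/<path to file>
    segments = parts.path.strip("/").split("/")
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub file page URL: {page_url}")
    owner, repo, _, *rest = segments  # drop the literal 'blob' segment
    return "https://raw.githubusercontent.com/" + "/".join([owner, repo, *rest])


# Hypothetical example: the branch name 'main' is assumed
print(to_raw_url("https://github.com/lovnishverma/datasets/blob/main/iris.csv"))
```

The returned string can be used as `url` in the download script above. If the repository uses a different default branch, the file-page URL copied from the browser will already contain it.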

## Available Datasets

### 1. BMI_Data.csv
- Contains Body Mass Index (BMI) data.
- Useful for health and fitness analysis.
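
As a quick reference, BMI is weight in kilograms divided by the square of height in metres. The sketch below shows that formula with the standard WHO adult categories. It does not read `BMI_Data.csv`; the file's column names and units are not described here, so check them before applying the formula to the data.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body Mass Index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


def who_category(value: float) -> str:
    """Map a BMI value to the standard WHO adult category."""
    if value < 18.5:
        return "Underweight"
    if value < 25:
        return "Normal weight"
    if value < 30:
        return "Overweight"
    return "Obese"


# Made-up example values, not taken from the dataset
value = bmi(70, 1.75)
print(round(value, 1), who_category(value))
```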

### 2. departments.csv
- Contains department-related information.
- Useful for organizational data processing.

### 3. employees.csv
- Contains employee details.
- Can be used for HR analytics and workforce management.

### 4. iris.csv
- Classic Iris dataset for machine learning.
- Contains different species of iris flowers with their measurements.
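
A typical first look at this kind of data is the mean of one measurement per species. The sketch below does that with the standard library on a few inline rows. The column names (`sepal_length`, `petal_length`, `species`) and the sample values are assumptions for illustration, so check the header of `iris.csv` before reusing them.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical rows and column names standing in for the real file
sample = io.StringIO(
    "sepal_length,petal_length,species\n"
    "5.1,1.4,setosa\n"
    "4.9,1.4,setosa\n"
    "7.0,4.7,versicolor\n"
    "6.4,4.5,versicolor\n"
)


def mean_by_species(handle, column):
    """Average one numeric column for each value of the 'species' column."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        groups[row["species"]].append(float(row[column]))
    return {species: mean(values) for species, values in groups.items()}


print(mean_by_species(sample, "petal_length"))
```

To run it on the downloaded file, pass `open('iris.csv', newline='')` instead of `sample`. With pandas, the same summary is usually a `groupby` on the species column followed by `mean`.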

### 5. item_similarity_df.csv
- Contains item similarity data.
- Useful for recommendation system development.

### 6. movies.csv
- Dataset containing information about movies.
- Useful for movie recommendation models.

### 7. music_genre.csv
- Contains music genre classification data.
- Can be used for genre prediction models.

### 8. nielit.patt
- Not a dataset; it is a pattern file for a custom AVR marker.

### 9. pandas.csv
- Sample dataset for practicing pandas library operations.
- Useful for learning data manipulation.

### 10. pandas_tutorial1.csv
- Another dataset for pandas tutorials.
- Contains structured data for training purposes.

### 11. ratings.csv
- Contains user ratings for various items.
- Useful for collaborative filtering and recommendation systems.
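
To illustrate how ratings feed item-based collaborative filtering, the sketch below computes cosine similarity between items from (user, item, rating) rows. It uses plain Python and made-up rows; the actual columns of `ratings.csv` may differ, and this is not necessarily how `item_similarity_df.csv` was produced.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical (user, item, rating) rows; the real ratings.csv columns may differ
ratings = [
    ("u1", "A", 5.0), ("u1", "B", 4.0),
    ("u2", "A", 4.0), ("u2", "B", 5.0), ("u2", "C", 1.0),
    ("u3", "B", 2.0), ("u3", "C", 5.0),
]


def item_similarity(rows):
    """Cosine similarity between item rating vectors (missing ratings count as 0)."""
    by_item = defaultdict(dict)
    for user, item, rating in rows:
        by_item[item][user] = rating
    # Length of each item's rating vector
    norms = {item: sqrt(sum(r * r for r in users.values()))
             for item, users in by_item.items()}
    sim = {item: {item: 1.0} for item in by_item}
    for a, b in combinations(sorted(by_item), 2):
        # Only users who rated both items contribute to the dot product
        shared = by_item[a].keys() & by_item[b].keys()
        dot = sum(by_item[a][u] * by_item[b][u] for u in shared)
        denom = norms[a] * norms[b]
        sim[a][b] = sim[b][a] = dot / denom if denom else 0.0
    return sim


sim = item_similarity(ratings)
print(round(sim["A"]["B"], 3), round(sim["A"]["C"], 3))
```

Items rated similarly by the same users score close to 1, so a simple recommender can suggest the items most similar to those a user already rated highly.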

### 12. sample.csv
- A sample dataset.
- Can be used for testing and learning purposes.

### 13. test.csv
- A test dataset.
- Used for validation and experimentation.

[Explore More Datasets on my Kaggle](https://www.kaggle.com/datasets/princelv84/csv-datasets)

## Usage
These datasets can be used for:
- Machine learning projects
- Data analysis and visualization
- Educational and tutorial purposes

## How to Contribute
If you have additional datasets to contribute, feel free to upload them and update this README with the necessary descriptions.

## License
These datasets are provided for educational and research purposes. Please check individual datasets for any specific license information.

---
For any questions or suggestions, feel free to raise an issue or contact Lovnish Verma.

# 📊 Machine Learning Dataset Sources

A list of public datasets for machine learning, AI, data science, and analytics projects.

---

## 🔹 General-Purpose ML Repositories

- [UCI Machine Learning Repository](https://archive.ics.uci.edu/) – Classic datasets used in academic ML research.
- [Kaggle Datasets](https://www.kaggle.com/datasets) – User-contributed datasets with competitions and notebooks.
- [Google Dataset Search](https://datasetsearch.research.google.com/) – Dataset-specific search engine.
- [AWS Open Data Registry](https://registry.opendata.aws/) – Public datasets hosted on AWS.
- [Microsoft Azure Open Datasets](https://azure.microsoft.com/en-us/services/open-datasets/) – Curated datasets for training on Azure.
- [OpenML](https://www.openml.org/) – Collaborative platform for sharing datasets and experiments.
- [Papers with Code – Datasets](https://paperswithcode.com/datasets) – ML benchmarks tied to research papers.
- [Hugging Face Datasets](https://huggingface.co/datasets) – NLP, vision, and multimodal datasets.
- [Zenodo](https://zenodo.org/) – Scientific datasets with citation support.
- [Figshare](https://figshare.com/) – Open-access research datasets.
- [Data World](https://data.world/) – Community platform for data sharing.
- [Awesome Public Datasets (GitHub)](https://github.com/awesomedata/awesome-public-datasets) – Curated list across domains.
- [FiveThirtyEight Data](https://data.fivethirtyeight.com/) – Datasets used in data journalism.
- [Quandl](https://www.quandl.com/) – Financial and economic data.

---

## 🔹 Government & Open Data Portals

- [India AI – Dataset Repository](https://indiaai.gov.in/datasets) – Indian AI project datasets.
- [Data.gov.in](https://data.gov.in/) – Indian government open data.
- [Data.gov (USA)](https://data.gov/) – US federal open datasets.
- [EU Open Data Portal](https://data.europa.eu/en) – Data from European institutions.
- [UK Data Service](https://ukdataservice.ac.uk/) – Economic and social research datasets (UK).
- [Canada Open Government](https://open.canada.ca/en/open-data) – Datasets from Canada.
- [Australia Data Portal](https://data.gov.au/) – Australian government datasets.

---

## 🔹 Domain-Specific Datasets

### 🖼️ Computer Vision

- [ImageNet](http://www.image-net.org/) – Large-scale image classification dataset.
- [COCO Dataset](https://cocodataset.org/) – Object detection, segmentation, and captioning.
- [Open Images Dataset](https://storage.googleapis.com/openimages/web/index.html) – Annotated image data.
- [Stanford Dogs Dataset](https://www.kaggle.com/jessicali9530/stanford-dogs-dataset) – Fine-grained image classification.

### 🌐 Web & NLP

- [Common Crawl](https://commoncrawl.org/) – Large-scale web crawl data.
- [Wikipedia Dumps](https://dumps.wikimedia.org/) – Raw Wikipedia text.
- [Project Gutenberg](https://www.gutenberg.org/) – Public domain books for NLP.
- [TREC Question Classification](https://cogcomp.seas.upenn.edu/Data/QA/QC/) – NLP benchmark dataset.

### 🧬 Bio, Medical & Health

- [PhysioNet](https://physionet.org/) – Physiological and clinical data.
- [MIMIC-III](https://mimic.physionet.org/) – ICU medical data (de-identified).
- [NIH Biomedical Data](https://datascience.nih.gov/data) – NIH open data portal.
- [Cancer Imaging Archive](https://www.cancerimagingarchive.net/) – Medical imaging data for cancer research.

### 🗣️ Speech & Audio

- [OpenSLR](https://www.openslr.org/) – Speech recognition datasets.
- [LibriSpeech ASR](https://www.openslr.org/12/) – Audiobook dataset for speech recognition.

### 🗺️ Maps & Geospatial

- [OpenStreetMap (Geofabrik)](https://download.geofabrik.de/) – Extracts of OSM data.
- [Google Open Buildings](https://sites.research.google/open-buildings/) – Global building footprints.

---

## ✅ Quick Access Table

| Name | Domain | Link |
|------|--------|------|
| UCI ML Repo | General | [Link](https://archive.ics.uci.edu/) |
| Kaggle | General | [Link](https://www.kaggle.com/datasets) |
| IndiaAI | Govt (India) | [Link](https://indiaai.gov.in/datasets) |
| Data.gov.in | Govt (India) | [Link](https://data.gov.in/) |
| Data.gov | Govt (USA) | [Link](https://data.gov/) |
| Data World | General | [Link](https://data.world/) |
| Hugging Face | NLP/ML | [Link](https://huggingface.co/datasets) |
| Papers with Code | Benchmarks | [Link](https://paperswithcode.com/datasets) |
| Zenodo | Research | [Link](https://zenodo.org/) |

---

## 📌 Tip

For code integration and automatic downloads, you can often use Python libraries such as:

```python
from datasets import load_dataset

dataset = load_dataset("imdb") # Hugging Face example
```

You can also automate downloads from Kaggle via API:

```bash
kaggle datasets download -d username/dataset-name
```

---

Feel free to contribute more sources via pull request!