https://github.com/lovnishverma/datasets

This repository contains various datasets for data analysis, machine learning, and educational purposes
csv dataset kaggle-dataset

Last synced: 5 months ago

Host: GitHub
URL: https://github.com/lovnishverma/datasets
Owner: lovnishverma
License: bsd-2-clause
Created: 2023-06-17T07:50:22.000Z (about 3 years ago)
Default Branch: main
Last Pushed: 2026-01-21T09:52:13.000Z (5 months ago)
Last Synced: 2026-01-21T21:31:03.068Z (5 months ago)
Topics: csv, dataset, kaggle-dataset
Language: Jupyter Notebook
Homepage: https://www.kaggle.com/datasets/princelv84/csv-datasets
Size: 40.6 MB
Stars: 13
Watchers: 1
Forks: 3
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


# My Datasets Repository

This repository contains various datasets for data analysis, machine learning, and educational purposes. Below is a brief description of each dataset available in this repository.

### Want to download any CSV file for local use? Follow the steps below: 👇

1. Go to a CSV file in a repository of your choice.
2. In the bar at the top right, just above the file contents, click the "Raw" button.
3. A page will open showing the comma-separated data with no styling.
4. Copy the page URL.
5. Make a folder on your desktop.
6. Open that folder in your favourite code editor and create a Python file inside it. Name it as you please.
7. Copy the code from the section below into the file, replacing the placeholder with the URL you copied.
8. Run the Python file.
9. The CSV file will be downloaded after a short while, depending on its size.
10. Now you are ready to use it locally!




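```python
import requests
import pandas as pd

# Paste the raw file URL you copied in the steps above
url = '{(copied url here)}'

# Download the file, following any redirects
res = requests.get(url, allow_redirects=True)
res.raise_for_status()  # stop early if the download failed

# Save the raw bytes to a local CSV file
with open('download_file_name.csv', 'wb') as file:
    file.write(res.content)

# Load the saved file into a DataFrame
download_file_name = pd.read_csv('download_file_name.csv')
```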
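
If you only have the address of the file's normal GitHub page, the "Raw" URL can usually be derived from it instead of being copied by hand. The sketch below is a minimal illustration, not part of the original instructions: `to_raw_url` is a hypothetical helper name, and the example assumes `iris.csv` sits at the root of the `main` branch of this repository.

```python
from urllib.parse import urlparse


def to_raw_url(blob_url: str) -> str:
    """Convert a GitHub file-page ("blob") URL to its raw-content URL."""
    parts = urlparse(blob_url)
    segments = parts.path.strip("/").split("/")
    # Expected shape: /<owner>/<repo>/blob/<branch>/<path to file>
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub file URL: {blob_url}")
    owner, repo, _blob, *ref_and_path = segments
    # The raw URL keeps owner, repo, branch and path, and drops "blob"
    return "https://raw.githubusercontent.com/" + "/".join([owner, repo, *ref_and_path])


# Assumed example location; adjust the branch and path to the file you want.
raw_url = to_raw_url("https://github.com/lovnishverma/datasets/blob/main/iris.csv")
print(raw_url)
```

The resulting string can be used as the `url` value in the download script.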

## Available Datasets

### 1. BMI_Data.csv

   - Contains Body Mass Index (BMI) data.

   - Useful for health and fitness analysis.

### 2. departments.csv

   - Contains department-related information.

   - Useful for organizational data processing.

### 3. employees.csv

   - Contains employee details.

   - Can be used for HR analytics and workforce management.

### 4. iris.csv

   - Classic Iris dataset for machine learning.

   - Contains different species of iris flowers with their measurements.

### 5. item_similarity_df.csv

   - Contains item similarity data.

   - Useful for recommendation system development.

### 6. movies.csv

   - Dataset containing information about movies.

   - Useful for movie recommendation models.

### 7. music_genre.csv

   - Contains music genre classification data.

   - Can be used for genre prediction models.

### 8. nielit.patt

   - Not a dataset; it is used for an AVR custom marker.

### 9. pandas.csv

   - Sample dataset for practicing pandas library operations.

   - Useful for learning data manipulation.

### 10. pandas_tutorial1.csv

   - Another dataset for pandas tutorials.

   - Contains structured data for training purposes.

### 11. ratings.csv

   - Contains user ratings for various items.

   - Useful for collaborative filtering and recommendation systems.

### 12. sample.csv

   - A sample dataset.

   - Can be used for testing and learning purposes.

### 13. test.csv

   - A test dataset.

   - Used for validation and experimentation.

[Explore More Datasets on my Kaggle](https://www.kaggle.com/datasets/princelv84/csv-datasets)

## Usage

These datasets can be used for:

- Machine learning projects

- Data analysis and visualization

- Educational and tutorial purposes
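
Before starting any of these, it can help to check which columns a downloaded file has and how many rows it contains. The following is a minimal sketch using only the Python standard library. `preview_csv` is a hypothetical helper, and the sample columns are invented for illustration; the real files may use different ones.

```python
import csv
import io


def preview_csv(text_stream, max_rows=3):
    """Return (column names, first rows, total row count) for a CSV stream."""
    reader = csv.DictReader(text_stream)
    first_rows = []
    total = 0
    for row in reader:
        if total < max_rows:
            first_rows.append(row)
        total += 1
    return reader.fieldnames, first_rows, total


# Hypothetical in-memory sample standing in for a downloaded file.
sample = io.StringIO("name,height_cm,weight_kg\nA,170,65\nB,160,72\n")
columns, rows, count = preview_csv(sample)
print(columns, count)
```

For a file on disk, pass an open file object instead, for example `open('download_file_name.csv', newline='')`.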

## How to Contribute

If you have additional datasets to contribute, feel free to upload them and update this README with the necessary descriptions.

## License

These datasets are provided for educational and research purposes. Please check individual datasets for any specific license information.

---

For any questions or suggestions, feel free to raise an issue or contact Lovnish Verma.

# 📊 Machine Learning Dataset Sources

A list of public datasets for machine learning, AI, data science, and analytics projects.

---

## 🔹 General-Purpose ML Repositories

- [UCI Machine Learning Repository](https://archive.ics.uci.edu/) – Classic datasets used in academic ML research.

- [Kaggle Datasets](https://www.kaggle.com/datasets) – User-contributed datasets with competitions and notebooks.

- [Google Dataset Search](https://datasetsearch.research.google.com/) – Dataset-specific search engine.

- [AWS Open Data Registry](https://registry.opendata.aws/) – Public datasets hosted on AWS.

- [Microsoft Azure Open Datasets](https://azure.microsoft.com/en-us/services/open-datasets/) – Curated datasets for training on Azure.

- [OpenML](https://www.openml.org/) – Collaborative platform for sharing datasets and experiments.

- [Papers with Code – Datasets](https://paperswithcode.com/datasets) – ML benchmarks tied to research papers.

- [Hugging Face Datasets](https://huggingface.co/datasets) – NLP, vision, and multimodal datasets.

- [Zenodo](https://zenodo.org/) – Scientific datasets with citation support.

- [Figshare](https://figshare.com/) – Open-access research datasets.

- [Data World](https://data.world/) – Community platform for data sharing.

- [Awesome Public Datasets (GitHub)](https://github.com/awesomedata/awesome-public-datasets) – Curated list across domains.

- [FiveThirtyEight Data](https://data.fivethirtyeight.com/) – Datasets used in data journalism.

- [Quandl](https://www.quandl.com/) – Financial and economic data.

---

## 🔹 Government & Open Data Portals

- [India AI – Dataset Repository](https://indiaai.gov.in/datasets) – Indian AI project datasets.

- [Data.gov.in](https://data.gov.in/) – Indian government open data.

- [Data.gov (USA)](https://data.gov/) – US federal open datasets.

- [EU Open Data Portal](https://data.europa.eu/en) – Data from European institutions.

- [UK Data Service](https://ukdataservice.ac.uk/) – Economic and social research datasets (UK).

- [Canada Open Government](https://open.canada.ca/en/open-data) – Datasets from Canada.

- [Australia Data Portal](https://data.gov.au/) – Australian government datasets.

---

## 🔹 Domain-Specific Datasets

### 🖼️ Computer Vision

- [ImageNet](http://www.image-net.org/) – Large-scale image classification dataset.

- [COCO Dataset](https://cocodataset.org/) – Object detection, segmentation, and captioning.

- [Open Images Dataset](https://storage.googleapis.com/openimages/web/index.html) – Annotated image data.

- [Stanford Dogs Dataset](https://www.kaggle.com/jessicali9530/stanford-dogs-dataset) – Fine-grained image classification.

### 🌐 Web & NLP

- [Common Crawl](https://commoncrawl.org/) – Large-scale web crawl data.

- [Wikipedia Dumps](https://dumps.wikimedia.org/) – Raw Wikipedia text.

- [Project Gutenberg](https://www.gutenberg.org/) – Public domain books for NLP.

- [TREC Question Classification](https://cogcomp.seas.upenn.edu/Data/QA/QC/) – NLP benchmark dataset.

### 🧬 Bio, Medical & Health

- [PhysioNet](https://physionet.org/) – Physiological and clinical data.

- [MIMIC-III](https://mimic.physionet.org/) – ICU medical data (de-identified).

- [NIH Biomedical Data](https://datascience.nih.gov/data) – NIH open data portal.

- [Cancer Imaging Archive](https://www.cancerimagingarchive.net/) – Medical imaging data for cancer research.

### 🗣️ Speech & Audio

- [OpenSLR](https://www.openslr.org/) – Speech recognition datasets.

- [LibriSpeech ASR](https://www.openslr.org/12/) – Audiobook dataset for speech recognition.

### 🗺️ Maps & Geospatial

- [OpenStreetMap (Geofabrik)](https://download.geofabrik.de/) – Extracts of OSM data.

- [Google Open Buildings](https://sites.research.google/open-buildings/) – Global building footprints.

---

## ✅ Quick Access Table

| Name | Domain | Link |
|------|--------|------|
| UCI ML Repo | General | [Link](https://archive.ics.uci.edu/) |
| Kaggle | General | [Link](https://www.kaggle.com/datasets) |
| IndiaAI | Govt (India) | [Link](https://indiaai.gov.in/datasets) |
| Data.gov.in | Govt (India) | [Link](https://data.gov.in/) |
| Data.gov | Govt (USA) | [Link](https://data.gov/) |
| Data World | General | [Link](https://data.world/) |
| Hugging Face | NLP/ML | [Link](https://huggingface.co/datasets) |
| Papers with Code | Benchmarks | [Link](https://paperswithcode.com/datasets) |
| Zenodo | Research | [Link](https://zenodo.org/) |

---

## 📌 Tip

For code integration and automatic downloads, you can often use Python libraries such as:

```python
from datasets import load_dataset

dataset = load_dataset("imdb")  # Hugging Face example
```

You can also automate downloads from Kaggle via API:

```bash
kaggle datasets download -d username/dataset-name
```
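
The same Kaggle command can be driven from a Python script. This is a rough sketch, assuming the `kaggle` command-line tool is installed and an API token is already configured. The function names are hypothetical, and `-p` (destination folder) and `--unzip` are optional CLI flags added here for convenience.

```python
import subprocess


def kaggle_download_command(dataset, dest="."):
    """Build the argument list for `kaggle datasets download`."""
    owner, sep, name = dataset.partition("/")
    if not (owner and sep and name):
        raise ValueError("dataset must look like 'username/dataset-name'")
    return ["kaggle", "datasets", "download", "-d", dataset, "-p", dest, "--unzip"]


def download(dataset, dest="."):
    """Run the Kaggle CLI; raises CalledProcessError if the command fails."""
    subprocess.run(kaggle_download_command(dataset, dest), check=True)


# Only builds and prints the command; call download(...) to actually run it.
cmd = kaggle_download_command("princelv84/csv-datasets", dest="data")
print(" ".join(cmd))
```

Building the argument list separately from running it makes the command easy to inspect before anything is downloaded.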

---

Feel free to contribute more sources via pull request!
