https://github.com/lovnishverma/datasets
This repository contains various datasets for data analysis, machine learning, and educational purposes
- Host: GitHub
- URL: https://github.com/lovnishverma/datasets
- Owner: lovnishverma
- License: bsd-2-clause
- Created: 2023-06-17T07:50:22.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2026-01-21T09:52:13.000Z (2 months ago)
- Last Synced: 2026-01-21T21:31:03.068Z (2 months ago)
- Topics: csv, dataset, kaggle-dataset
- Language: Jupyter Notebook
- Homepage: https://www.kaggle.com/datasets/princelv84/csv-datasets
- Size: 40.6 MB
- Stars: 13
- Watchers: 1
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# My Datasets Repository
This repository contains various datasets for data analysis, machine learning, and educational purposes. Below is a brief description of each dataset available in this repository.
### Want to download a CSV file for local use? Follow the steps below:
- Open a CSV file in the repository of your choice.
- In the toolbar at the top right, just above the file contents, click the "Raw" button.
- A page appears showing the comma-separated data with no styling.
- Copy the page URL.
- Make a folder on your desktop.
- Open that folder in your favourite code editor and create a Python file inside it. Name it as you please.
- Copy the code from the section below into the file, replacing the placeholder with the URL you copied.
- Run the Python file.
- The CSV file will download shortly, depending on its size.
- Now you are ready to use it locally!
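```python
import requests
import pandas as pd

# Paste the raw file URL you copied in the steps above
url = '{(copied url here)}'
res = requests.get(url, allow_redirects=True)
res.raise_for_status()  # stop early if the download failed
# Save the response body to a local CSV file
with open('download_file_name.csv', 'wb') as file:
    file.write(res.content)
# Load the saved file into a DataFrame
download_file_name = pd.read_csv('download_file_name.csv')
```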
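If you would rather not click through to the "Raw" page, the raw URL can usually be derived from the file's normal page URL. The sketch below assumes the common `github.com/<owner>/<repo>/blob/<branch>/<path>` layout and a branch name without slashes. The helper name `blob_to_raw_url` is made up for this example, and the sample URL assumes `iris.csv` sits at the root of the `main` branch.

```python
from urllib.parse import urlparse


def blob_to_raw_url(blob_url: str) -> str:
    """Convert a GitHub file-page URL into its raw-content URL."""
    parsed = urlparse(blob_url)
    parts = parsed.path.strip("/").split("/")
    # Expected shape: owner/repo/blob/branch/path/to/file
    if parsed.netloc != "github.com" or len(parts) < 5 or parts[2] != "blob":
        raise ValueError(f"not a GitHub file page URL: {blob_url}")
    owner, repo, _, branch, *file_path = parts
    return "https://raw.githubusercontent.com/" + "/".join(
        [owner, repo, branch, *file_path]
    )


# Assumed location of the file; adjust the path if the repository layout differs.
raw_url = blob_to_raw_url(
    "https://github.com/lovnishverma/datasets/blob/main/iris.csv"
)
print(raw_url)
```

The resulting string can be used as the `url` value in the download script above.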
## Available Datasets
### 1. BMI_Data.csv
- Contains Body Mass Index (BMI) data.
- Useful for health and fitness analysis.
### 2. departments.csv
- Contains department-related information.
- Useful for organizational data processing.
### 3. employees.csv
- Contains employee details.
- Can be used for HR analytics and workforce management.
### 4. iris.csv
- Classic Iris dataset for machine learning.
- Contains different species of iris flowers with their measurements.
### 5. item_similarity_df.csv
- Contains item similarity data.
- Useful for recommendation system development.
### 6. movies.csv
- Dataset containing information about movies.
- Useful for movie recommendation models.
### 7. music_genre.csv
- Contains music genre classification data.
- Can be used for genre prediction models.
### 8. nielit.patt
- Not a dataset; it is a pattern file for a custom AVR marker.
### 9. pandas.csv
- Sample dataset for practicing pandas library operations.
- Useful for learning data manipulation.
### 10. pandas_tutorial1.csv
- Another dataset for pandas tutorials.
- Contains structured data for training purposes.
### 11. ratings.csv
- Contains user ratings for various items.
- Useful for collaborative filtering and recommendation systems.
### 12. sample.csv
- A sample dataset.
- Can be used for testing and learning purposes.
### 13. test.csv
- A test dataset.
- Used for validation and experimentation.
[Explore More Datasets on my Kaggle](https://www.kaggle.com/datasets/princelv84/csv-datasets)
## Usage
These datasets can be used for:
- Machine learning projects
- Data analysis and visualization
- Educational and tutorial purposes
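As a small illustration of the data analysis use case, the sketch below computes a per-group average using only the standard library. The sample rows are made up for the example, and the column names `species` and `petal_length` are assumptions that may not match the headers in `iris.csv`.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Made-up sample rows standing in for a real file; the column names are
# assumptions and may not match the headers in iris.csv.
SAMPLE_CSV = """species,petal_length
setosa,1.4
setosa,1.6
versicolor,4.5
versicolor,4.7
"""


def mean_by_group(csv_file, group_col, value_col):
    """Return {group: mean of value_col} for rows read from an open CSV file."""
    values = defaultdict(list)
    for row in csv.DictReader(csv_file):
        values[row[group_col]].append(float(row[value_col]))
    return {group: mean(vals) for group, vals in values.items()}


result = mean_by_group(io.StringIO(SAMPLE_CSV), "species", "petal_length")
print(result)
```

To run it on a downloaded file, pass `open('download_file_name.csv', newline='')` instead of the in-memory sample and use that file's actual column names.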
## How to Contribute
If you have additional datasets to contribute, feel free to upload them and update this README with the necessary descriptions.
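Before opening a contribution, it can help to check that every data file is actually mentioned in the README. The sketch below is one possible check, not part of this repository. The function names and the `new_sales.csv` file are hypothetical, and it uses a plain substring match, so a name that is contained in a longer file name would count as documented.

```python
from pathlib import Path


def undocumented_files(readme_text, filenames):
    """Return the file names that never appear in the README text."""
    return [name for name in sorted(filenames) if name not in readme_text]


def check_repo(root="."):
    """List CSV files under `root` that README.md does not mention."""
    root_path = Path(root)
    readme_text = (root_path / "README.md").read_text(encoding="utf-8")
    names = [path.name for path in root_path.glob("*.csv")]
    return undocumented_files(readme_text, names)


# Hypothetical example: new_sales.csv was added but not yet described.
sample_readme = "### 4. iris.csv\n- Classic Iris dataset for machine learning.\n"
missing = undocumented_files(sample_readme, ["iris.csv", "new_sales.csv"])
print(missing)
```

Running `check_repo()` from the repository root would report the files that still need a description.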
## License
These datasets are provided for educational and research purposes. Please check individual datasets for any specific license information.
---
For any questions or suggestions, feel free to raise an issue or contact Lovnish Verma.
# Machine Learning Dataset Sources
A list of public datasets for machine learning, AI, data science, and analytics projects.
---
## General-Purpose ML Repositories
- [UCI Machine Learning Repository](https://archive.ics.uci.edu/) - Classic datasets used in academic ML research.
- [Kaggle Datasets](https://www.kaggle.com/datasets) - User-contributed datasets with competitions and notebooks.
- [Google Dataset Search](https://datasetsearch.research.google.com/) - Dataset-specific search engine.
- [AWS Open Data Registry](https://registry.opendata.aws/) - Public datasets hosted on AWS.
- [Microsoft Azure Open Datasets](https://azure.microsoft.com/en-us/services/open-datasets/) - Curated datasets for training on Azure.
- [OpenML](https://www.openml.org/) - Collaborative platform for sharing datasets and experiments.
- [Papers with Code - Datasets](https://paperswithcode.com/datasets) - ML benchmarks tied to research papers.
- [Hugging Face Datasets](https://huggingface.co/datasets) - NLP, vision, and multimodal datasets.
- [Zenodo](https://zenodo.org/) - Scientific datasets with citation support.
- [Figshare](https://figshare.com/) - Open-access research datasets.
- [Data World](https://data.world/) - Community platform for data sharing.
- [Awesome Public Datasets (GitHub)](https://github.com/awesomedata/awesome-public-datasets) - Curated list across domains.
- [FiveThirtyEight Data](https://data.fivethirtyeight.com/) - Datasets used in data journalism.
- [Quandl](https://www.quandl.com/) - Financial and economic data.
---
## Government & Open Data Portals
- [India AI - Dataset Repository](https://indiaai.gov.in/datasets) - Indian AI project datasets.
- [Data.gov.in](https://data.gov.in/) - Indian government open data.
- [Data.gov (USA)](https://data.gov/) - US federal open datasets.
- [EU Open Data Portal](https://data.europa.eu/en) - Data from European institutions.
- [UK Data Service](https://ukdataservice.ac.uk/) - Economic and social research datasets (UK).
- [Canada Open Government](https://open.canada.ca/en/open-data) - Datasets from Canada.
- [Australia Data Portal](https://data.gov.au/) - Australian government datasets.
---
## Domain-Specific Datasets
### Computer Vision
- [ImageNet](http://www.image-net.org/) - Large-scale image classification dataset.
- [COCO Dataset](https://cocodataset.org/) - Object detection, segmentation, and captioning.
- [Open Images Dataset](https://storage.googleapis.com/openimages/web/index.html) - Annotated image data.
- [Stanford Dogs Dataset](https://www.kaggle.com/jessicali9530/stanford-dogs-dataset) - Fine-grained image classification.
### Web & NLP
- [Common Crawl](https://commoncrawl.org/) - Large-scale web crawl data.
- [Wikipedia Dumps](https://dumps.wikimedia.org/) - Raw Wikipedia text.
- [Project Gutenberg](https://www.gutenberg.org/) - Public domain books for NLP.
- [TREC Question Classification](https://cogcomp.seas.upenn.edu/Data/QA/QC/) - NLP benchmark dataset.
### Bio, Medical & Health
- [PhysioNet](https://physionet.org/) - Physiological and clinical data.
- [MIMIC-III](https://mimic.physionet.org/) - ICU medical data (de-identified).
- [NIH Biomedical Data](https://datascience.nih.gov/data) - NIH open data portal.
- [Cancer Imaging Archive](https://www.cancerimagingarchive.net/) - Medical imaging data for cancer research.
### Speech & Audio
- [OpenSLR](https://www.openslr.org/) - Speech recognition datasets.
- [LibriSpeech ASR](https://www.openslr.org/12/) - Audiobook dataset for speech recognition.
### Maps & Geospatial
- [OpenStreetMap (Geofabrik)](https://download.geofabrik.de/) - Extracts of OSM data.
- [Google Open Buildings](https://sites.research.google/open-buildings/) - Global building footprints.
---
## Quick Access Table
| Name | Domain | Link |
|------|--------|------|
| UCI ML Repo | General | [Link](https://archive.ics.uci.edu/) |
| Kaggle | General | [Link](https://www.kaggle.com/datasets) |
| IndiaAI | Govt (India) | [Link](https://indiaai.gov.in/datasets) |
| Data.gov.in | Govt (India) | [Link](https://data.gov.in/) |
| Data.gov | Govt (USA) | [Link](https://data.gov/) |
| Data World | General | [Link](https://data.world/) |
| Hugging Face | NLP/ML | [Link](https://huggingface.co/datasets) |
| Papers with Code | Benchmarks | [Link](https://paperswithcode.com/datasets) |
| Zenodo | Research | [Link](https://zenodo.org/) |
---
## Tip
For code integration and automatic downloads, you can often use Python libraries such as:
```python
from datasets import load_dataset
dataset = load_dataset("imdb") # Hugging Face example
```
You can also automate downloads from Kaggle via API:
```bash
kaggle datasets download -d username/dataset-name
```
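The same Kaggle command can be driven from Python. The sketch below only builds and runs the command shown above, adding the CLI's `-p` (destination folder) and `--unzip` options. The function names are made up for this example, the `data` folder is an assumption, and the dataset slug is taken from this repository's Kaggle page. It assumes the `kaggle` command-line tool is installed and API credentials are configured.

```python
import shutil
import subprocess


def kaggle_download_command(dataset, dest="data", unzip=True):
    """Build the argv list for `kaggle datasets download`."""
    cmd = ["kaggle", "datasets", "download", "-d", dataset, "-p", dest]
    if unzip:
        cmd.append("--unzip")
    return cmd


def download(dataset, dest="data"):
    """Run the download; needs the kaggle CLI and configured credentials."""
    if shutil.which("kaggle") is None:
        raise RuntimeError("kaggle CLI not found on PATH")
    subprocess.run(kaggle_download_command(dataset, dest), check=True)


# Show the command without running it.
cmd = kaggle_download_command("princelv84/csv-datasets")
print(" ".join(cmd))
```

Calling `download("princelv84/csv-datasets")` would then fetch and unpack the files into the `data` folder.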
---
Feel free to contribute more sources via pull request!