An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with dataset

A curated list of projects in awesome lists tagged with dataset.

https://github.com/joke2k/faker

Faker is a Python package that generates fake data for you.

dataset fake fake-data faker faker-generator python test-data test-data-generator testing

Last synced: 29 Dec 2025

https://github.com/conardli/easy-dataset

A powerful tool for creating fine-tuning datasets for LLMs

dataset javascript llm

Last synced: 11 May 2025

https://github.com/googlecreativelab/quickdraw-dataset

Documentation on how to access and use the Quick, Draw! Dataset.

dataset quickdraw-dataset

Last synced: 08 Apr 2025

https://github.com/mdn/browser-compat-data

This repository contains compatibility data for Web technologies as displayed on MDN

compat compatibility data dataset json

Last synced: 06 Jan 2026

https://github.com/splware/esproc

esProc SPL is a JVM-based programming language designed for structured data computation, serving as both a data analysis tool and an embedded computing engine.

cluster-computing database dataset esproc java sql

Last synced: 29 Apr 2025

https://github.com/SPLWare/esProc

esProc SPL is a scripting language for data processing with a rich, well-designed function library and powerful syntax. It can be executed within a Java program through a JDBC interface or run independently.

cluster-computing database dataset esproc java sql

Last synced: 04 Apr 2025

https://github.com/tensorflow/datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...

data dataset datasets jax machine-learning numpy tensorflow

Last synced: 12 May 2025

https://github.com/whoiskatrin/sql-translator

SQL Translator is a tool for converting natural language queries into SQL code using artificial intelligence. This project is 100% free and open source.

data-analysis data-engineering dataquery datascience dataset openai postgresql query sql

Last synced: 14 May 2025

https://github.com/cluebenchmark/clue

Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard

albert benchmark bert chinese chineseglue corpus dataset glue language-model nlu pretrained-models pytorch roberta tensorflow transformers

Last synced: 14 May 2025

https://github.com/wainshine/chinese-names-corpus

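Chinese personal names corpus and name generator. Covers Chinese full names, surnames, given names, forms of address, Japanese names, transliterated names, and English names. Useful for Chinese word segmentation and person-name entity recognition.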

corpus dataset dict names ner

Last synced: 26 Mar 2025

https://github.com/wainshine/Chinese-Names-Corpus

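Chinese personal names corpus and name generator. Covers Chinese full names, surnames, given names, forms of address, Japanese names, transliterated names, and English names. Useful for Chinese word segmentation and person-name entity recognition.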

corpus dataset dict names ner

Last synced: 25 Mar 2025

https://github.com/rom1504/img2dataset

Easily turn large sets of image URLs into an image dataset. Can download, resize and package 100M URLs in 20h on one machine.

big-data dataset deep-learning download-images image image-dataset multimodal

Last synced: 13 May 2025

https://github.com/CLUEbenchmark/CLUE

Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard

albert benchmark bert chinese chineseglue corpus dataset glue language-model nlu pretrained-models pytorch roberta tensorflow transformers

Last synced: 28 Mar 2025

https://github.com/hyunwoongko/transformer

Transformer: PyTorch Implementation of "Attention Is All You Need"

attention dataset pytorch transformer

Last synced: 14 May 2025

https://github.com/pytorch/text

Models, data loaders and abstractions for language processing, powered by PyTorch

data-loader dataset deep-learning models nlp pytorch

Last synced: 13 May 2025

https://github.com/Charmve/Surface-Defect-Detection

📈 Currently the largest collection of industrial defect detection datasets and papers. Continually summarizes important open source datasets and critical papers in the field of surface defect research.

charmve dataset deep-learning defects image-segmentation paper pcb-surface-defect surface surface-defect-detection surface-defects surface-detection

Last synced: 05 May 2025

https://github.com/charmve/surface-defect-detection

📈 Currently the largest collection of industrial defect detection datasets and papers. Continually summarizes important open source datasets and critical papers in the field of surface defect research.

charmve dataset deep-learning defects image-segmentation paper pcb-surface-defect surface surface-defect-detection surface-defects surface-detection

Last synced: 14 May 2025

https://github.com/pydata/pandas-datareader

Extract data from a wide range of Internet sources into a pandas DataFrame.

data data-analysis dataset econdb economic-data fama-french finance financial-data fred html pandas pydata python stock-data

Last synced: 14 May 2025

https://github.com/ieee8023/covid-chestxray-dataset

We are building an open database of COVID-19 cases with chest X-ray or CT images.

computed-tomography computer-vision covid-19 dataset deep-learning xray

Last synced: 14 May 2025

https://github.com/ConardLi/easy-dataset

A powerful tool for creating fine-tuning datasets for LLMs

dataset javascript llm

Last synced: 21 Mar 2025

https://github.com/whylabs/whylogs

An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collection, ensuring safety & robustness. 📈

ai-pipelines analytics approximate-statistics calculate-statistics constraints data-constraints data-pipeline data-quality data-science dataops dataset logging machine-learning ml-pipelines mlops model-performance python statistical-properties

Last synced: 13 May 2025

https://github.com/Zjh-819/LLMDataHub

A quick guide (especially) for trending instruction finetuning datasets

chatbot chatgpt dataset llm

Last synced: 02 Apr 2025

https://github.com/ashvardanian/stringzilla

Up to 10x faster strings for C, C++, Python, Rust, Swift & Go, leveraging NEON, AVX2, AVX-512, SVE, & SWAR to accelerate search, hashing, sort, edit distances, and memory ops 🦖

beautifulsoup common-crawl csv dataset html information-retrieval json laion ndjson parser pattern-recognition simd sorting-algorithms string string-manipulation string-matching string-parsing string-search substring

Last synced: 11 May 2025

https://github.com/unsplash/datasets

🎁 6,500,000+ Unsplash images made available for research and machine learning

data dataset images keywords machine-learning photos research search-engine semantics unsplash

Last synced: 15 May 2025

https://github.com/meodai/color-names

Large list of handpicked color names 🌈

color colors colour colours dataset dictionary naming palette rgb-color

Last synced: 05 Jan 2026

https://github.com/GeorgeSeif/Semantic-Segmentation-Suite

Semantic Segmentation Suite in TensorFlow. Implement, train, and test new Semantic Segmentation models easily!

computer-vision dataset deep-learning densenet encoder-decoder epoch iou python refinenet segmentation semantic-segmentation semantic-segmentation-models tensorflow upsampling

Last synced: 28 Mar 2025

https://github.com/georgeseif/semantic-segmentation-suite

Semantic Segmentation Suite in TensorFlow. Implement, train, and test new Semantic Segmentation models easily!

computer-vision dataset deep-learning densenet encoder-decoder epoch iou python refinenet segmentation semantic-segmentation semantic-segmentation-models tensorflow upsampling

Last synced: 27 Sep 2025

https://github.com/ashvardanian/StringZilla

Up to 10x faster strings for C, C++, Python, Rust, Swift & Go, leveraging NEON, AVX2, AVX-512, SVE, & SWAR to accelerate search, hashing, sort, edit distances, and memory ops 🦖

beautifulsoup common-crawl csv dataset html information-retrieval json laion ndjson parser pattern-recognition simd sorting-algorithms string string-manipulation string-matching string-parsing string-search substring

Last synced: 23 Mar 2025

https://github.com/detectrecog/ccpd

[ECCV 2018] CCPD: a diverse and well-annotated dataset for license plate detection and recognition

ccpd dataset detection large-scale plate-detection recognition

Last synced: 15 May 2025

https://github.com/lukes/iso-3166-countries-with-regional-codes

ISO 3166-1 country lists merged with their UN Geoscheme regional codes in ready-to-use JSON, XML, CSV data sets

countries csv data dataset iso iso3166 iso3166-1 iso3166-2 json region-codes xml

Last synced: 14 May 2025

https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes

ISO 3166-1 country lists merged with their UN Geoscheme regional codes in ready-to-use JSON, XML, CSV data sets

countries csv data dataset iso iso3166 iso3166-1 iso3166-2 json region-codes xml

Last synced: 03 Apr 2025

https://github.com/google-research-datasets/objectron

Objectron is a dataset of short, object-centric video clips. The videos also contain AR session metadata including camera poses, sparse point-clouds and planes. In each video, the camera moves around and above the object and captures it from different views. Each object is annotated with a 3D bounding box. The 3D bounding box describes the object’s position, orientation, and dimensions. The dataset contains about 15K annotated video clips and 4M annotated images in the following categories: bikes, books, bottles, cameras, cereal boxes, chairs, cups, laptops, and shoes.

3d 3d-reconstruction 3d-vision ai augmented-reality computer-vision dataset deep-learning machine-learning neural-network python pytorch tensorflow

Last synced: 05 Oct 2025

https://github.com/google-research-datasets/Objectron

Objectron is a dataset of short, object-centric video clips. The videos also contain AR session metadata including camera poses, sparse point-clouds and planes. In each video, the camera moves around and above the object and captures it from different views. Each object is annotated with a 3D bounding box. The 3D bounding box describes the object’s position, orientation, and dimensions. The dataset contains about 15K annotated video clips and 4M annotated images in the following categories: bikes, books, bottles, cameras, cereal boxes, chairs, cups, laptops, and shoes.

3d 3d-reconstruction 3d-vision ai augmented-reality computer-vision dataset deep-learning machine-learning neural-network python pytorch tensorflow

Last synced: 20 Mar 2025

https://github.com/candlewill/dialog_corpus

Corpora for training Chinese and English dialogue (chatbot) systems

chatbot corpus dataset dialog system

Last synced: 15 May 2025

https://github.com/Ph055a/OSINT_Collection

Maintained collection of OSINT related resources. (All Free & Actionable)

court-search data-science dataset infosec investigation journalism osint research search

Last synced: 02 Apr 2025

https://github.com/ph055a/osint_collection

Maintained collection of OSINT related resources. (All Free & Actionable)

court-search data-science dataset infosec investigation journalism osint research search

Last synced: 25 Mar 2025

https://github.com/nvkelso/natural-earth-vector

A global, public domain map dataset available at three scales and featuring tightly integrated vector and raster data.

dataset gis map naturalearthdata

Last synced: 26 Mar 2025

https://github.com/beir-cellar/beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.

benchmark bert colbert dataset deep-learning dpr elasticsearch information-retrieval llm nlp passage-retrieval pytorch question-generation rag retrieval retrieval-models sbert sentence-transformers zero-shot-retrieval

Last synced: 14 May 2025

https://github.com/alibaba/clusterdata

Cluster data collected from production clusters in Alibaba for cluster management research

dataset

Last synced: 14 May 2025

https://github.com/salesforce/WikiSQL

A large annotated semantic parsing corpus for developing natural language interfaces.

database dataset machine-learning natural-language natural-language-interface natural-language-processing

Last synced: 09 May 2025

https://github.com/salesforce/wikisql

A large annotated semantic parsing corpus for developing natural language interfaces.

database dataset machine-learning natural-language natural-language-interface natural-language-processing

Last synced: 02 Sep 2025

https://github.com/visual-layer/fastdup

fastdup is a powerful, free tool designed to rapidly generate valuable insights from image and video datasets. It helps enhance the quality of both images and labels, while significantly reducing data operation costs, all with unmatched scalability.

data-augmentation data-curation dataset deep-learning image image-analysis image-classfication image-classification image-duplicate-detection image-processing image-similarity machine-learning novelty-detection object-detection outlier-detection python visual-search visualization visualization-tools

Last synced: 14 May 2025

https://github.com/njvisionpower/safety-helmet-wearing-dataset

Safety helmet wearing detection dataset, with a pretrained model

dataset detection gluoncv hardhat helmet

Last synced: 16 May 2025

https://github.com/cluebenchmark/cluener2020

CLUENER2020: Chinese Fine-Grained Named Entity Recognition

albert bert chinese chinese-ner chinesener dataset fine-grained-ner named-entity-recognition ner roberta seq2seq sequence-labeling sequence-to-sequence

Last synced: 08 Apr 2025

https://github.com/karolpiczak/ESC-50

ESC-50: Dataset for Environmental Sound Classification

audio dataset environmental-sounds

Last synced: 14 Apr 2025

https://github.com/corollari/linusrants

Dataset of Linus Torvalds' rants classified by negativity using sentiment analysis

dataset linus linus-rants linus-torvalds sentiment-analysis

Last synced: 16 May 2025

https://github.com/mosaicml/streaming

A Data Streaming Library for Efficient Neural Network Training

dataset deep-learning machine-learning neural-network pytorch streaming

Last synced: 13 May 2025

https://github.com/jayleicn/animegan

A simple PyTorch Implementation of Generative Adversarial Networks, focusing on anime face drawing.

dataset generative-adversarial-network pytorch

Last synced: 16 May 2025

https://github.com/jayleicn/animeGAN

A simple PyTorch Implementation of Generative Adversarial Networks, focusing on anime face drawing.

dataset generative-adversarial-network pytorch

Last synced: 02 May 2025

https://github.com/datitran/raccoon_dataset

The dataset is used to train my own raccoon detector and I blogged about it on Medium

dataset tensorflow-experiments

Last synced: 08 Apr 2025

https://github.com/wainshine/Company-Names-Corpus

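Company names corpus and organization names corpus. Covers company short names, abbreviations, brand terms, and enterprise names. Useful for Chinese word segmentation and organization-name entity recognition.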

company corpus dataset dict ner

Last synced: 30 Mar 2025

https://github.com/wainshine/company-names-corpus

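Company names corpus and organization names corpus. Covers company short names, abbreviations, brand terms, and enterprise names. Useful for Chinese word segmentation and organization-name entity recognition.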

company corpus dataset dict ner

Last synced: 03 Mar 2025

https://github.com/pomber/covid19

JSON time-series of coronavirus cases (confirmed, deaths and recovered) per country - updated daily

2019-ncov api coronavirus covid-19 data dataset json time-series

Last synced: 15 May 2025

https://github.com/hikariming/chat-dataset-baseline

A manually refined Chinese dialogue dataset and a piece of ChatGLM fine-tuning code

alpaca chatglm dataset

Last synced: 16 May 2025

https://github.com/datasets/covid-19

Novel Coronavirus 2019 time series data on cases

coronavirus coronavirus-disease covid covid-19 covid19-data data-package datapackage dataset

Last synced: 14 May 2025

https://github.com/unrealcv/synthetic-computer-vision

A list of synthetic datasets and tools for computer vision

computer-vision dataset synthetic-images virtual-worlds

Last synced: 16 May 2025

https://github.com/ultralytics/json2yolo

Convert JSON annotations into YOLO format.

coco darknet dataset json label labelbox ultralytics yolo yolov3 yolov5

Last synced: 14 May 2025

https://github.com/rucaibox/recsysdatasets

This is a repository of public data sources for Recommender Systems (RS).

atomic-files dataset recbole recommendation-datasets recommendations recommender-system

Last synced: 17 Sep 2025

https://github.com/ericguo5513/humanml3d

HumanML3D: A large and diverse 3d human motion-language dataset.

dataset deep-learning motion-generation text-annotation

Last synced: 16 May 2025

https://github.com/piskvorky/gensim-data

Data repository for pretrained NLP models and NLP corpora.

corpora dataset gensim glove-model lda-model lsi-model pretrained-models word2vec-model

Last synced: 04 Apr 2025

https://github.com/SJTU-ViSYS/M2DGR

M2DGR: a Multi-modal and Multi-scenario Dataset for Ground Robots (RA-L2021 & ICRA2022)

dataset robotics slam

Last synced: 30 Mar 2025

https://github.com/EricGuo5513/HumanML3D

HumanML3D: A large and diverse 3d human motion-language dataset.

dataset deep-learning motion-generation text-annotation

Last synced: 22 Mar 2025

https://github.com/PRBonn/lidar-bonnetal

Semantic and Instance Segmentation of LiDAR point clouds for autonomous driving

dataset deep-learning lidar ptcl segmentation semantic

Last synced: 20 Mar 2025