awesome-datascience
:memo: An awesome Data Science repository to learn and apply to real-world problems.
https://github.com/academic/awesome-datascience
-
Agents
-
Frameworks
- ADK-Rust - Production-ready AI agent development kit for Rust with model-agnostic design (Gemini, OpenAI, Anthropic), multiple agent types (LLM, Graph, Workflow), MCP support, and built-in telemetry.
-
Research & Knowledge Retrieval
- Chunk Tuner - Open-source Python library and MCP server to benchmark document chunking strategies for RAG, score retrieval quality, and recommend configurations for a corpus.
- II-Commons - Daily-updated skill and CLI for deterministic retrieval across arXiv, PubMed/PMC, and supported US policy corpora.
- Suppr - AI literature search, document translation, and deep-research workspace for researchers.
-
Tools
- DeepAlpha - AI crypto trading framework using LightGBM + XGBoost ensemble with 72 ML features. 70.9% walk-forward validated accuracy on out-of-sample data. Supports Bybit and Binance. MIT licensed, available on [PyPI](https://pypi.org/project/deepalpha-bot/).
- Frostbyte MCP - MCP server providing 13 data tools for AI agents: real-time crypto prices, IP geolocation, DNS lookups, web scraping to markdown, code execution, and screenshots. One API key for 40+ services.
- CAJAL - Local AI agent for generating publication-ready scientific papers with real arXiv citations, IMRaD structure, and tribunal scoring. Runs 100% offline via Ollama with 4B-9B models. MIT licensed. [HuggingFace](https://huggingface.co/Agnuxo/CAJAL-9B-P2PCLAW)
-
Workflow
-
-
Fun
-
Comics
-
Datasets
- Academic Torrents
- ADS-B Exchange - Specific datasets for aircraft and Automatic Dependent Surveillance-Broadcast (ADS-B) sources.
- Public Big Data Sets
- data.gov - The home of the U.S. Government's open data
- United States Census Bureau
- usgovxml.com
- datahub.io
- datacite.org
- The official portal for European data
- NASDAQ:DATA - Nasdaq Data Link, a premier source for financial, economic, and alternative datasets.
- figshare.com
- GeoLite Legacy Downloadable Databases
- Quora's Big Datasets Answer
- Kaggle Datasets
- A Deep Catalog of Human Genetic Variation
- World Bank Data
- Open Data Philly
- grouplens.org
- research-quality data sets
- National Centers for Environmental Information
- ClimateData.us
- r/datasets
- MapLight - Provides a variety of data sets free of charge for uses that are freely available to the general public.
- GHDx - Institute for Health Metrics and Evaluation - a catalog of health and demographic datasets from around the world, including IHME results
- St. Louis Federal Reserve Economic Data - FRED
- New Zealand Institute of Economic Research – Data1850
- UNICEF Data
- undata
- NASA SocioEconomic Data and Applications Center - SEDAC
- The GDELT Project
- StackExchange Data Explorer - an open source tool for running arbitrary queries against public data from the Stack Exchange network.
- SocialGrep - a collection of open Reddit datasets.
- San Francisco Government Open Data
- IBM Asset Dataset
- Open data Index
- Public Git Archive
- Microsoft Research Open Data
- Open Government Data Platform India
- Google Dataset Search (beta)
- IBB Open Portal
- The Humanitarian Data Exchange
- GHTorrent
- enigma.com - Navigate the world of public data - Quickly search and analyze billions of public records published by governments, companies and organizations.
- Hugging Face Datasets
- Open Data Sources
- NAYN.CO Turkish News with categories
- Covid-19
- Covid-19 Google
- 5000 Images of Clothes
- 250k+ Job Postings - An expanding dataset of historical job postings from Luxembourg from 2020 to today. Free with 250k+ job postings hosted on AWS Data Exchange.
- FinancialData.Net - Financial datasets (stock market data, financial statements, sustainability data, and more).
- notesjor corpus-collection - Free corpora (over 6 billion tokens), mostly German (both historical and contemporary).
- CLARIN-Repository - CLARIN is a European repository for scientific datasets.
- AI Displacement Tracker - Structured dataset tracking 92 AI-attributed workforce reduction events affecting 453,748 workers across 12 countries and 11 sectors. JSON and CSV formats. CC-BY-4.0 licensed.
- GBIF - Global Biodiversity Information Facility: 2.4B+ species occurrence records. Free, open API for ecological modeling and ML research.
- FirstData - The world's most comprehensive authoritative data source knowledge base. 210+ curated sources from governments, international organizations, and research institutions. MCP integration for AI agents. MIT licensed.
- latamdata-py - Python package for one-line access to 38 open research datasets from Latin America (health, neuroscience, mental health, economics). Install with `pip install latamdata-py`.
- Japan Neighborhoods - English dataset of Tokyo crime statistics across 5,078 neighborhoods × 7 years (36,222 records, 2018-2024), sourced from Tokyo Metropolitan Police open data. Includes interactive crime map, safety grading, and cost-of-living index. CC BY licensed.
- ZipCheckup - Free ZIP-level environmental safety data for 42,000+ US ZIP codes: water quality, air quality, PFAS contamination, radon, lead, flood risk, and 11 more verticals. Public REST API, npm/PyPI packages, CC BY 4.0.
- Helium - Real-time news corpus with structured bias features across 15+ dimensions (3.2M+ articles, 5,000+ sources), live financial market data (stocks, ETFs, crypto) with AI-generated analysis, ML options pricing with probability metrics and full Greeks, historical options chain data for quantitative research; available via MCP server or REST API.
- Packrift Packaging Optimization Benchmark Corpus - Public packaging product dataset generated from 1,000 exact-spec SKU records, with downloadable CSV and JSON files for ecommerce fulfillment and warehouse analysis.
- The Quiet-Broke Index - A 30-metro composite ranking of how much of a $400K household income gets consumed by housing, taxes, childcare, healthcare, and transport. Open methodology, free, no email gate.
- Crime Brasil - Open-data platform for Brazilian crime statistics. Neighborhood-level in Rio Grande do Sul (2.99M incidents across 79,024 neighborhoods, 2022–2025), municipality-level for MG and RJ, plus national PRF highway and DATASUS interpersonal-violence data. Free REST API, CSV/Parquet, daily updates, CC BY 4.0.
- Sweden, Statistics
- FAOSTAT - UN FAO statistics on food production, trade, land use, and emissions for 245+ countries. Free API and bulk download.
-
Infographics
- <img src="https://i.imgur.com/0OoLaa5.png" width="150" /> - Differences of a data scientist vs. data engineer
- <img src="https://cloud.githubusercontent.com/assets/182906/19517857/604f88d8-960c-11e6-97d6-16c9738cb824.png" width="150" />
- <img src="https://i.imgur.com/W2t2Roz.png" width="150" />
- <img src="https://i.imgur.com/rb9ruaa.png" width="150" />
- <img src="https://i.imgur.com/XBgKF2l.png" width="150" />
- <img src="https://i.imgur.com/l9ZGtal.jpg" width="150" />
- <img src="https://i.imgur.com/TWkB4X6.png" width="150" />
- <img src="https://i.imgur.com/gtTlW5I.png" width="150" />
- <img src="https://scikit-learn.org/stable/_static/ml_map.png" width="150" />
- <img src="https://i.imgur.com/3JSyUq1.png" width="150" />
- <img src="https://i.imgur.com/DQqFwwy.png" width="150" />
- <img src="https://www.springboard.com/blog/wp-content/uploads/2016/03/20160324_springboard_vennDiagram.png" width="150" height="150" /> - by Springboard
- <img src="https://data-literacy.geckoboard.com/assets/img/data-fallacies-to-avoid-preview.jpg" width="150" alt="Data Fallacies To Avoid" /> - Data Fallacies To Avoid: shows non-data scientist/non-statistician colleagues [how to avoid mistakes with data](https://data-literacy.geckoboard.com/poster/). From Geckoboard's [Data Literacy Lessons](https://data-literacy.geckoboard.com/).
- <img src="https://scikit-learn.org/1.5/_downloads/b82bf6cd7438a351f19fac60fbc0d927/ml_map.svg" width="150" /> - [Choosing the right estimator](https://scikit-learn.org/1.5/machine_learning_map.html#choosing-the-right-estimator)
-
-
Literature and Media
-
Bloggers
- Wes McKinney - Wes McKinney Archives.
- Matthew Russell - Mining The Social Web.
- Greg Reda - Greg Reda Personal Blog
- Julia Evans - Recurse Center alumna
- Hakan Kardas - Personal Web Page
- Sean J. Taylor - Personal Web Page
- Drew Conway - Personal Web Page
- Hilary Mason - Personal Web Page
- Noah Iliinsky - Personal Blog
- Matt Harrison - Personal Blog
- Vamshi Ambati - AllThings Data Science
- Prash Chan - Tech Blog on Master Data Management And Every Buzz Surrounding It
- Clare Corthell - The Open Source Data Science Masters
- Paul Miller
- Data Science London - A non-profit organization dedicated to the free, open dissemination of data science.
- Datawrangling
- Quora Data Science - Data Science Questions and Answers from experts
- Siah
- Machine Learning Mastery
- Daniel Forsyth - Personal Blog
- Data Science Weekly - Weekly News Blog
- Revolution Analytics - Data Science Blog
- R Bloggers - R Bloggers
- The Practical Quant
- Yet Another Data Blog
- Spenczar - building to reporting.
- KD Nuggets
- Meta Brown - Personal Blog
- Data Scientist
- WhatSTheBigData
- Tevfik Kosar - Magnus Notitia
- New Data Scientist
- Harvard Data Science - Thoughts on Statistical Computing and Visualization
- Data Science 101 - Learning To Be A Data Scientist
- Kaggle Past Solutions
- datascientistjourney
- Learning Lover
- Dataists
- Data-Mania
- Data-Magnum
- P-value - Musings on data science, machine learning, and stats.
- Digital transformation
- Data Mania Blog
- [The File Drawer](https://chris-said.io/) - Chris Said's science blog
- Emilio Ferrara's web page
- DataNews
- Reddit TextMining
- Periscopic
- Hilary Parker
- Data Science Lab
- Meaning of
- Adventures in Data Land
- DATA MINERS BLOG
- Dataclysm
- FlowingData - Visualization and Statistics
- Calculated Risk
- O'Reilly Learning Blog
- Dominodatalab
- i am trask - A Machine Learning Craftsmanship Blog
- Vademecum of Practical Data Science - Handbook and recipes for data-driven solutions of real-world problems
- Dataconomy - A blog on the newly emerging data economy
- Springboard - A blog with resources for data science learners
- Analytics Vidhya - A full-fledged website about data science and analytics study material.
- Occam's Razor - Focused on Web Analytics.
- Data School - Data science tutorials for beginners!
- Colah's Blog - Blog for understanding Neural Networks!
- Sebastian's Blog - Blog for NLP and transfer learning!
- Chris Albon's Website - Data Science and AI notes
- Andrew Carr - Data Science with Esoteric programming languages
- floydhub - Blog for Evolutionary Algorithms
- Jingles - Review and extract key concepts from academic papers
- nbshare - Data Science notebooks
- Deep and Shallow - All things Deep and Shallow in Data Science
- Loic Tetrel - Data science blog
- Chip Huyen's Blog - ML Engineering, MLOps, and the use of ML in startups
- Maria Khalusova - Data science blog
- Aditi Rastogi - ML, DL, Data Science blog
- Santiago Basulto - Data Science with Python
- Akhil Soni - ML, DL and Data Science