An open API service indexing awesome lists of open source software.

Data Science

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge from structured and unstructured data. Data scientists analyze and prepare data, and their findings inform high-level decisions in many organizations.

https://github.com/kraina-ai/srai

Spatial Representations for Artificial Intelligence - a Python library toolkit for geospatial machine learning focused on creating embeddings for downstream tasks

artificial-intelligence data-science geo geospatial machine-learning python spatial spatial-analysis srai

Last synced: 15 May 2025

https://github.com/rasgointelligence/RasgoQL

Write Python locally, execute SQL in your data warehouse

data-analysis data-science pandas python sql

Last synced: 20 Jul 2025

https://github.com/rasgointelligence/rasgoql

Write Python locally, execute SQL in your data warehouse

data-analysis data-science pandas python sql

Last synced: 14 Jun 2025

https://github.com/vopani/datatableton

100 exercises to learn Python Datatable

data-science datatable pydatatable python tutorial-exercises

Last synced: 12 May 2025

https://github.com/durgeshsamariya/data-science-roadmap

Roadmap to learn Data Science and related areas.

data-science data-science-resources learn-data-science roadmap

Last synced: 17 Oct 2025

https://github.com/svenkreiss/pysparkling

A pure Python implementation of Apache Spark's RDD and DStream interfaces.

apache-spark data-processing data-science python

Last synced: 07 Apr 2025

https://github.com/Bears-R-Us/arkouda

Arkouda (αρκούδα): Interactive Data Analytics at Supercomputing Scale :bear:

chapel data data-analysis data-science distributed-computing eda hpc python

Last synced: 08 Jul 2025

https://github.com/empower-ai/dsensei

AI-powered key driver analysis tool that pinpoints the root cause behind metric fluctuations in one minute.

analytics business-analytics business-intelligence data data-analytics data-insights data-science

Last synced: 01 Aug 2025

https://github.com/slowkow/harmonypy

🎼 Integrate multiple high-dimensional datasets with fuzzy k-means and locally linear adjustments.

bioinformatics data-integration data-science single-cell-analysis

Last synced: 24 Apr 2026

https://github.com/dwhitena/gophernet

A simple from-scratch neural net written in Go

artificial-intelligence data-science go golang machine-learning neural-network

Last synced: 13 Sep 2025

https://github.com/wizardforcel/data-science-notebook

:book: Every great thought and action has a humble beginning

data-analysis data-science machine-learning notebook numpy pandas sklearn tensorflow

Last synced: 10 Apr 2025

https://github.com/bears-r-us/arkouda

Arkouda (αρκούδα): Interactive Data Analytics at Supercomputing Scale :bear:

chapel data data-analysis data-science distributed-computing eda hpc python

Last synced: 06 Apr 2025

https://github.com/scrapinghub/webstruct

NER toolkit for HTML data

crfsuite data-science ner

Last synced: 26 Jan 2026

https://github.com/carloocchiena/the_statistics_handbook

The Statistics Handbook open source repository

data-science latex mathematics statistics

Last synced: 09 Apr 2025

https://github.com/red-data-tools/unicode_plot.rb

Plot your data by Unicode characters

data-science data-visualization ruby

Last synced: 14 May 2025

https://github.com/khanhnamle1994/statistical-learning

Lecture Slides and R Sessions for Trevor Hastie and Rob Tibshirani's "Statistical Learning" Stanford course

data-mining data-science r regression statistical-learning

Last synced: 10 Apr 2025

https://github.com/tiesdekok/learnpythonforresearch

This repository provides everything you need to get started with Python for (social science) research.

accounting bokeh data-science exercises finance getting-started jupyter jupyter-notebook pandas python research seaborn tutorial tutorial-notebooks web-scraping

Last synced: 10 Apr 2025

https://github.com/uclatommy/tweetfeels

Real-time sentiment analysis in Python using Twitter's streaming API

data-mining data-science python-3-6 sentiment-analysis twitter

Last synced: 06 Apr 2025

https://github.com/tirthajyoti/uci-ml-api

Simple API for UCI Machine Learning Dataset Repository (search, download, analyze)

api classification clustering data-science learning machine-learning python regression statistics uci-machine-learning

Last synced: 09 Apr 2025

https://github.com/dsfsi/covid19za

Coronavirus COVID-19 (2019-nCoV) Data Repository and Dashboard for South Africa

coronavirus covid-19 covid-data covid19 covid19-data dashboard data-science dataset doh doi dsfsi-datasets health nicd south-africa

Last synced: 04 Apr 2025

https://github.com/analysiscenter/cardio

CardIO is a library for data science research of heart signals

data-science deep-learning deep-neural-networks healthcare machine-learning python

Last synced: 21 Jan 2026

https://github.com/cartodb/cartoframes

CARTO Python package for data scientists

carto data-science jupyter-notebook maps python spatial-data-analysis

Last synced: 15 May 2025

https://github.com/Oxen-AI/oxen-archive

Deprecated: We moved this to Oxen-AI/Oxen core

artificial-intelligence data-science database machine-learning version-control

Last synced: 29 Aug 2025

https://github.com/ropensci/elastic

R client for the Elasticsearch HTTP API

data-science database database-wrapper elasticsearch etl http json r r-package rstats

Last synced: 16 May 2025

https://github.com/dgerlanc/programming-with-data

🐍 Learn Python and Pandas from the ground up

dangerlanc data-science pandas pandas-tutorial python workshop

Last synced: 06 Apr 2025

https://github.com/justmarkham/trump-lies

Tutorial: Web scraping in Python with Beautiful Soup

beautiful-soup data-science dataset pandas python requests tutorial web-scraping

Last synced: 26 Jul 2025

https://github.com/xlang-ai/DS-1000

[ICML 2023] Data and code release for the paper "DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation".

benchmark code-generation data-science large-language-models semantic-parsing

Last synced: 22 Jul 2025

https://github.com/voxel51/voxelgpt

AI assistant that can query visual datasets, search the FiftyOne docs, and answer general computer vision questions

artificial-intelligence chatgpt computer-vision data-science deep-learning fiftyone langchain llm machine-learning openai python

Last synced: 26 Jun 2025

https://github.com/touppercase78/formula1-datasets

Datasets & Analyses for Formula 1 World Championship

analysis data-science datasets formula1 jupyter-notebook motorsports python racing

Last synced: 09 Apr 2025

https://github.com/jldbc/coffee-quality-database

Building the Coffee Quality Institute Database

agriculture coffee data data-science dataset

Last synced: 09 Apr 2025

https://github.com/visualize-ml/linear-algebra-made-easy---learn-with-python-and-visualization

"Math Is Not Hard" series: "Linear Algebra Is Not Hard", volumes 1 and 2, complete in 66 topics; criticism and corrections welcome

data-science data-visualization linear-algebra machine-learning python visualization

Last synced: 26 Jun 2025

https://github.com/xlang-ai/ds-1000

[ICML 2023] Data and code release for the paper "DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation".

benchmark code-generation data-science large-language-models semantic-parsing

Last synced: 12 Apr 2025

https://github.com/neurodata/hyppo

Python package for multivariate hypothesis testing

data-science hacktoberfest hypothesis-testing independence ksample-testing python

Last synced: 14 Dec 2025

https://github.com/recodehive/stackoverflow-analysis

Stack Overflow is a professional community for developers. This repo analyzes 3 years of the Stack Overflow Developer Survey, visualizes the results, and predicts future data scientist salaries.

canva collaborate data-analysis data-science data-visualization ghdesktop github github-pages machine-learning stack-overflow student-vscode survey-analysis vscode

Last synced: 15 May 2025

https://github.com/shreyashankar/datasets-for-good

List of datasets to apply stats/machine learning/technology to the world of social good.

data-science dataset education environment government health machine-learning social-good

Last synced: 05 Mar 2026

https://github.com/bgruening/docker-galaxy

:whale::bar_chart::books: Docker Images tracking the stable Galaxy releases.

data-science docker-image galaxy galaxyproject science

Last synced: 16 May 2025

https://github.com/fusedio/udfs

Public Fused UDFs. Build workflows at any scale with the Fused Python SDK and Workbench webapp, and integrate them into your stack with the Fused Hosted API.

data-science earth-observation geo geopython geospatial geospatial-analysis gis python raster spatial timeseries-analysis udf vector

Last synced: 16 May 2025

https://github.com/apache/texera

Collaborative Machine-Learning-Centric Data Analytics Using Workflows

artificial-intelligence cloud-native data data-analytics data-science machine-learning texera workflow

Last synced: 02 May 2026

https://github.com/mcekovic/tennis-crystal-ball

Ultimate Tennis Statistics and Tennis Crystal Ball - Tennis Big Data Analysis and Prediction

big-data bigdata data-analysis data-science database elo elo-rating forecast goat machine-learning prediction sports statistics tennis tennis-score

Last synced: 11 Mar 2026

https://github.com/data-dot-all/dataall

A modern data marketplace that makes collaboration among diverse users (like business, analysts and engineers) easier, increasing efficiency and agility in data projects on AWS.

aws aws-glue aws-lake-formation aws-s3 data data-science etl-framework lakeformation lakehouse redshift

Last synced: 29 Jul 2025

https://github.com/project-codeflare/codeflare

Simplifying the definition, execution, scaling, and deployment of pipelines on the cloud.

automl data-science hyperparameter-optimization machine-learning pipelines ray sklearn workflows

Last synced: 11 Oct 2025

https://github.com/koalaverse/homlr

Supplementary material for Hands-On Machine Learning with R, an applied book covering the fundamentals of machine learning with R.

data-science machine-learning r supervised-learning unsupervised-learning

Last synced: 28 Jul 2025

https://github.com/dialnd/imbalanced-algorithms

Python-based implementations of algorithms for learning on imbalanced data.

data-science imbalanced-data machine-learning notre-dame python

Last synced: 11 Apr 2025

https://github.com/mukeshmithrakumar/book_list

Python, Machine Learning, Deep Learning and Data Science Books

algorithms books data-science deep-learning free machine-learning python

Last synced: 23 Jul 2025

https://github.com/mukeshmithrakumar/Book_List

Python, Machine Learning, Deep Learning and Data Science Books

algorithms books data-science deep-learning free machine-learning python

Last synced: 05 May 2025

https://github.com/hugohadfield/kalmangrad

Automated, smooth, N'th order derivatives of non-uniformly sampled time series data

data-science derivatives kalman-filter signal-processing smoothing

Last synced: 25 Oct 2025

https://github.com/Netflix/metaflow-service

:rocket: Metadata tracking and UI service for Metaflow!

ai data-science machine-learning metaflow ml ml-infrastructure ml-platform productivity ui

Last synced: 09 Jun 2026

https://github.com/rumbledb/rumble

⛈️ RumbleDB 1.23.0 "Mountain Ash" 🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more

avro azure csv data-science dataframes hdfs json jsoniq machine-learning nested parquet query query-engine s3 scale schemaless spark svm text yaml

Last synced: 03 Aug 2025

https://github.com/RumbleDB/rumble

Quick start: pip install jsoniq ⛈️ RumbleDB 2.0.0 "Lemon Ironwood" 🌳 for Apache Spark | Run queries on your large-scale, messy datasets (JSON, text, CSV, Parquet, Delta...) | Data Lakehouse with Updates, Scripting, Declarative Machine Learning and more

azure csv data-science dataframes delta-lake hdfs json jsoniq lakehouse machine-learning nested parquet query query-engine s3 scale schemaless spark svm text

Last synced: 20 Nov 2025

https://github.com/vertica/VerticaPy

VerticaPy is a Python library that exposes scikit-like functionality for conducting data science projects on data stored in Vertica, taking advantage of Vertica’s speed and built-in analytics and machine learning capabilities.

big-data data-science data-visualization machine-learning preparation python python-library vertica

Last synced: 06 May 2025

https://github.com/nickslevine/zebras

Data analysis library for JavaScript built with Ramda

data-analysis data-science functional-programming javascript pandas ramda

Last synced: 24 Aug 2025

https://github.com/Minyus/pipelinex

PipelineX: Python package to build ML pipelines for experimentation with Kedro, MLflow, and more

data-engineering data-science deep-learning experimentation machine-learning pipeline

Last synced: 24 Mar 2025

https://github.com/analysiscenter/radio

RadIO is a library for data science research of computed tomography imaging

computed-tomography data-science deep-learning machine-learning medical-imaging neural-networks tensorflow

Last synced: 21 Jan 2026

https://github.com/Toloka/crowd-kit

Control the quality of your labeled data with the Python tools you already know.

aggregations annotation crowd crowdsourcing data-mining data-science labeling python quality-control toloka truth-inference

Last synced: 26 Mar 2025

https://github.com/PecanProject/pecan

The Predictive Ecosystem Analyzer (PEcAn) is an integrated ecological bioinformatics toolbox.

bayesian cyberinfrastructure data-assimilation data-science ecosystem-model ecosystem-science forecasting meta-analysis national-science-foundation pecan plants r

Last synced: 07 May 2025