An open API service indexing awesome lists of open source software.

Data Science

Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge from structured and unstructured data. Data scientists perform data analysis and preparation, and their findings inform high-level decisions in many organizations.

https://github.com/Azure/DataScienceVM

Tools and Docs on the Azure Data Science Virtual Machine (http://aka.ms/dsvm)

ai azure big-data data-analysis data-science deep-learning dsvm machine-learning ml python r sqlserver

Last synced: 20 Jul 2025

https://github.com/curiousily/machine-learning-from-scratch

Succinct Machine Learning algorithm implementations from scratch in Python, solving real-world problems (Notebooks and Book). Examples of Logistic Regression, Linear Regression, Decision Trees, K-means clustering, Sentiment Analysis, Recommender Systems, Neural Networks and Reinforcement Learning.

artificial-intelligence book classification data-science machine-learning machine-learning-algorithms neural-networks notebook recommender-systems regression reinforcement-learning sentiment-analysis

Last synced: 07 May 2025

https://github.com/azure/datasciencevm

Tools and Docs on the Azure Data Science Virtual Machine (http://aka.ms/dsvm)

ai azure big-data data-analysis data-science deep-learning dsvm machine-learning ml python r sqlserver

Last synced: 07 Apr 2025

https://github.com/sicara/sicarator

Instant Setup & Best Quality for Data Projects!

data-science generator machine-learning python

Last synced: 04 Apr 2025

https://github.com/t04glovern/selfie2anime

Anime2Selfie Backend Services - Lambda, Queue, API Gateway and traffic processing

aws aws-lambda data-science selfie2anime serverless

Last synced: 21 Aug 2025

https://github.com/tk-learning-center/machine-learning-degree

✨ ML/AI, Medicine, Genomics, Science Research

data-science deep-learning machine-learning python science

Last synced: 01 May 2025

https://github.com/jvalue/jayvee

Jayvee is a domain-specific language and runtime for automated processing of data pipelines

data-engineering data-pipeline data-science domain-specific-language etl-pipeline typescript

Last synced: 23 Oct 2025

https://github.com/solegalli/machine-learning-imbalanced-data

Code repository for the online course Machine Learning with Imbalanced Data

data-science imbalanced-classification imbalanced-data imbalanced-learning machine-learning python

Last synced: 16 May 2025

https://github.com/anthdm/ml-email-clustering

Email clustering with machine learning

clustering data-science machine-learning scikit-learn

Last synced: 14 May 2025

https://github.com/apachecn/ds-ai-tech-notes

:book: [Translation] Data Science and Artificial Intelligence Technical Notes

ai data-science matplotlib notes numpy python sklearn

Last synced: 24 Jul 2025

https://github.com/capeprivacy/cape-dataframes

Privacy transformations on Spark and Pandas dataframes backed by a simple policy language.

collaboration data-science hacktoberfest machine-learning pandas policy privacy python spark

Last synced: 11 May 2026

https://github.com/Oxen-AI/Oxen

Oxen.ai's core Rust library, server, and CLI

artificial-intelligence data-science database machine-learning version-control

Last synced: 06 Aug 2025

https://github.com/arabacibahadir/sup-res

A great companion for finding key support and resistance levels on financial and cryptocurrency charts.

algotrade analysis binance binance-api bitcoin cryptocurrency data-science finance pandas pinescript python stock telegram telegram-bot tradingview

Last synced: 19 Mar 2025

https://github.com/zetane/zetaforge

Open source AI platform for rapid development of advanced AI and AGI pipelines.

agi ai claude data-science developer-tools gpt kubernetes llm machine-learning ml ml-pipelines mlops python workflow workflow-orchestration zetaforge

Last synced: 16 May 2025

https://github.com/kdr-aus/ogma

Scripting language focused on processing tabular data.

data-science language rust scripting-language table-data

Last synced: 27 Mar 2025

https://hachmannlab.github.io/chemml/

ChemML is a machine learning and informatics program suite for the chemical and materials sciences.

data-science deep-learning drug-discovery machine-learning materials-informatics quantum-mechanics

Last synced: 20 Nov 2025

https://github.com/fedora-infra/fedmsg

Federated Messaging with ZeroMQ

data-science fedora-project message-bus python zeromq

Last synced: 26 Feb 2025

https://github.com/hachmannlab/chemml

ChemML is a machine learning and informatics program suite for the chemical and materials sciences.

data-science deep-learning drug-discovery machine-learning materials-informatics quantum-mechanics

Last synced: 21 Oct 2025

https://github.com/pydatablog/python-for-data-science

A blog for data analytics using data science technologies

data-science finance python

Last synced: 20 Aug 2025

https://github.com/dlab-berkeley/Python-Fundamentals-Legacy

D-Lab's 12 hour introduction to Python. Learn how to create variables and functions, use control flow structures, use libraries, import data, and more, using Python and Jupyter Notebooks.

data-science introduction-to-python jupyter python

Last synced: 26 Apr 2025

https://github.com/probcomp/metaprob

An embedded language for probabilistic programming and meta-programming.

clojure data-science machine-learning probabilistic-programming

Last synced: 08 May 2025

https://github.com/tirthajyoti/ds-with-pysimplegui

Data science and machine learning GUI programs/desktop apps built with the PySimpleGUI package

analytics application artificial-intelligence data-science desktop-app gui machine-learning python windows

Last synced: 21 Aug 2025

https://github.com/lamastex/scalable-data-science

Scalable Data Science: course sets in big data using Apache Spark over Databricks, and their mathematical, statistical, and computational foundations using SageMath.

apache-spark data-science databricks scala

Last synced: 16 May 2025

https://github.com/phillipdupuis/dtale-desktop

Build a data visualization dashboard with simple snippets of python code

data-analysis data-science data-visualization fastapi pandas python react typescript visualization

Last synced: 13 Sep 2025

https://github.com/google/starthinker

Reference framework for building data workflows provided by Google. Accelerates authentication, logging, scheduling, and deployment of solutions using GCP. To borrow a tagline: "The framework for professionals with deadlines."

airflow app-engine automation bigquery cloud-functions cm360 colab-notebook data-science django dv360 google-ads google-analytics logger python scheduler ui workflows

Last synced: 04 Oct 2025

https://github.com/unnati-xyz/scalable-data-science-platform

Content for architecting a data science platform for products using Luigi, Spark & Flask.

data-engineer data-pipeline data-science luigi machine-learning rest-api spark

Last synced: 19 Jul 2025

https://github.com/Automunge/AutoMunge

Tabular feature encoding pipelines for machine learning with options for string parsing, missing data infill, and stochastic perturbations.

data-science machine-learning

Last synced: 15 Mar 2025

https://github.com/matyushkin/lessons

📖 In Russian: a list of matyushkin's Russian-language publications and Jupyter notebooks for various educational resources.

data-science jupyter jupyter-notebook neural-network python python-plotly russian russian-language tensorflow

Last synced: 04 Apr 2026

https://github.com/davendw49/k2

Code and datasets for paper "K2: A Foundation Language Model for Geoscience Knowledge Understanding and Utilization" in WSDM-2024

ai4science data-science geoai geoscience kg large-language-models llm

Last synced: 01 Apr 2025

https://github.com/robb/rbbjson

Flexible JSON traversal for rapid prototyping.

data-science json jsonpath prototyping swift

Last synced: 17 Mar 2025

https://github.com/jeroenjanssens/python-polars-the-definitive-guide

Scripts and datasets for the O'Reilly book Python Polars: The Definitive Guide

data-science oreilly oreilly-books polars polars-dataframe python

Last synced: 05 Apr 2025

https://github.com/priorlabs/tabpfn-client

⚡ Easy API access to the tabular foundation model TabPFN ⚡

data-science foundation-models machine-learning tabpfn tabular-data

Last synced: 16 May 2025

https://github.com/apoorvalal/ding_causalinference_python

Python implementation of Peng Ding's "First Course in Causal Inference"

causal-inference data-science

Last synced: 12 Apr 2025

https://github.com/hamelsmu/docker_tutorial

Code and helper scripts for article on Medium "How Docker Can Help You Become A More Effective Data Scientist"

data-science docker docker-tutorial medium medium-article

Last synced: 16 Jun 2025

https://github.com/alexandervnikitin/tsgm

Generation and evaluation of synthetic time series datasets (also augmentations, visualizations, and a collection of popular datasets). NeurIPS'24

augmentations data-augmentation data-science datasets deep-learning generative-model keras machine-learning python synthetic-data synthetic-time-series tensorflow2 time-series vae

Last synced: 06 Apr 2025

https://github.com/celebi-pkg/flight-analysis

Python package that scrapes flight data from Google Flights and analyzes prices. Can determine the optimal flight by date, place, and price.

data-science google pandas planes prediction price-tracker python

Last synced: 30 Oct 2025

https://github.com/pyscaffold/pyscaffoldext-dsproject

💫 PyScaffold extension for data-science projects

data-science pyscaffold pyscaffold-extension python

Last synced: 16 May 2025

https://github.com/risenw/datasist

A Python library for easy data analysis, visualization, exploration and modeling

data-analysis data-science data-visualization feature-engineering machine-learning python-3

Last synced: 24 Oct 2025

https://github.com/oxinabox/DataDeps.jl

reproducible data setup for reproducible science

data data-science open-science

Last synced: 13 Nov 2025

https://github.com/theislab/kbet

An R package to test for batch effects in high-dimensional single-cell RNA sequencing data.

batch-effects data-science quantification scrnaseq

Last synced: 11 Oct 2025

https://github.com/jgoerner/beyond-jupyter

🐍💻📊 All material from the PyCon.DE 2018 Talk "Beyond Jupyter Notebooks - Building your own data science platform with Python & Docker" (incl. Slides, Video, Udemy MOOC & other References)

airflow apache apistar data-science docker docker-compose jupyter jupyter-notebook minio postgres superset

Last synced: 16 Mar 2025

https://github.com/noahho/caafe

Semi-automatic feature engineering process using Language Models and your dataset descriptions. Based on the paper "LLMs for Semi-Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering" by Hollmann, Müller, and Hutter (2023).

automl data-science deep-learning feature-engineering machine-learning tabpfn

Last synced: 04 Apr 2025

https://github.com/dataficationsdk/verso

Extensible interactive notebook platform for .NET. Every built-in feature, from the C# kernel to the dashboard layout, is an extension built on the same public interfaces available to third-party authors. Runs in VS Code and the browser.

csharp data-science fsharp jupyter notebooks polyglot

Last synced: 02 May 2026

https://github.com/tiannaparris/data-analysis-portfolio

This is a repository that I have created to showcase skills, share projects and track my progress in Data Analytics / Data Science related topics.

data-analysis data-science data-visualization excel matplotlib pandas portfolio powerbi python r scipy seaborn sql tableau

Last synced: 30 Oct 2025

https://github.com/alteryx/woodwork

Woodwork is a Python library that provides robust methods for managing and communicating data typing information.

data-science dataframe dataframes evalml featuretools inference machine-learning nlp-primitives python semantic-tags typing woodwork

Last synced: 15 May 2025

https://github.com/thebabylonai/babylog

A lightweight logger for machine learning teams to log images and predictions in production.

computer-vision cvops data-science logger logging-library machine-learning ml mlops python python3

Last synced: 07 May 2025

https://github.com/ryanswanstrom/awesome-datascience-colleges

A list of colleges and universities offering degrees in data science.

colleges data-science datascience-colleges universities

Last synced: 27 Feb 2026

https://github.com/h2oai/wave-apps

Sample AI Apps built with H2O Wave.

data-science h2oai hacktoberfest low-code machine-learning python3

Last synced: 05 Sep 2025

https://github.com/oxinabox/datadeps.jl

reproducible data setup for reproducible science

data data-science open-science

Last synced: 26 Jan 2026

https://github.com/heidelbergcement/hcrystalball

A library that unifies the API for most commonly used libraries and modeling techniques for time-series forecasting in the Python ecosystem.

cross-validation data-science fbprophet model-selection pmdarima sarimax sklearn sklearn-api sklearn-compatible sklearn-library sktime statsmodels tbats time-series time-series-forecasting transformer wrapper

Last synced: 05 Apr 2025

https://github.com/morganjwilliams/pyrolite

A set of tools for getting the most from your geochemical data.

chemistry data-science geochemical-data geochemistry geoscience pyrolite ternary-diagrams

Last synced: 23 Feb 2026

https://github.com/abdenlab/oxbow

Oxbow makes genomic data ready for high-performance analytics.

apache-arrow bioinformatics data-science dataframe fair-data genomics multiomics ngs pandas polars python r rust-lang

Last synced: 07 Mar 2026

https://github.com/yandexdataschool/roc_comparison

The fast version of DeLong's method for computing the covariance of unadjusted AUC.

data-science statistics

Last synced: 10 Apr 2025

https://github.com/emilhvitfeldt/r-text-data

List of textual data sources to be used for text mining in R

data-science nlp rstats text-analysis text-analytics-in-r text-mining tidytext

Last synced: 18 Jan 2026

https://github.com/whitews/flowkit

A Python toolkit for flow cytometry analysis supporting GatingML and FlowJo workspaces

cytometry data-science fcs fcs-files flow-cytometry flow-cytometry-analysis flowjo gatingml immunology python

Last synced: 02 Apr 2026

https://github.com/EmilHvitfeldt/R-text-data

List of textual data sources to be used for text mining in R

data-science nlp rstats text-analysis text-analytics-in-r text-mining tidytext

Last synced: 13 Jul 2025

https://github.com/gzuidhof/zarr.js

JavaScript implementation of Zarr

array data-science gehlenborglab javascript typescript zarr

Last synced: 05 Apr 2025

https://github.com/rivasiker/gghoriplot

A user-friendly, highly customizable R package for building horizon plots in ggplot2

data-science data-visualization ggplot2 horizon-plots r r-package

Last synced: 24 Sep 2025

https://github.com/apache/incubator-liminal

Apache Liminal's goal is to operationalise the machine learning process, allowing data scientists to quickly transition from a successful experiment to an automated pipeline of model training, validation, deployment and inference in production. Liminal provides a Domain Specific Language to build ML workflows on top of Apache Airflow.

ai airflow big-data data-science machine-learning ml workflows

Last synced: 14 Jan 2026

https://github.com/martineastwood/penaltyblog

⚽ High-performance football analytics: build data pipelines, scrape data, model matches, rank teams, and bet smarter | Powered by www.pena.lt/y 🚀

betting betting-models betting-odds betting-strategies cython data-science elo-rating football football-data match-predictions opta pi-rating poisson-model predictive-modeling python ranked-probability-score soccer sports-analytics sports-betting statsbomb

Last synced: 28 Feb 2026

https://github.com/bcg-x-official/artkit

Automated prompt-based testing and evaluation of Gen AI applications

asyncio data-science gen-ai genai python red-teaming test-automation

Last synced: 16 May 2025

https://github.com/mybridge/learn-python

Python Top 45 Articles of 2017

algorithm data-science machine-learning python python3

Last synced: 13 Apr 2025