An open API service indexing awesome lists of open source software.

Data Science

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge from structured and unstructured data. Data scientists perform data analysis and preparation, and their findings inform high-level decisions in many organizations.

https://github.com/tuangauss/DataScienceProjects

The code repository for projects and tutorials in R and Python that cover a variety of topics in data visualization, statistics, sports analytics, and general applications of probability theory.

data-science data-visualization statistics

Last synced: 29 Mar 2025

https://github.com/JuliaStats/GLM.jl

Generalized linear models in Julia

data-science glm julia regression statistical-models statistics

Last synced: 01 May 2025

https://github.com/xiaodaigh/disk.frame

Fast Disk-Based Parallelized Data Manipulation Framework for Larger-than-RAM Data

data data-science large-dataset manipulation-data medium-data r

Last synced: 14 Mar 2025

https://github.com/DiskFrame/disk.frame

Fast Disk-Based Parallelized Data Manipulation Framework for Larger-than-RAM Data

data data-science large-dataset manipulation-data medium-data r

Last synced: 14 Mar 2025

https://github.com/jacksonwuxs/dapy

Easy-to-use data analysis / manipulation framework for humans

analysis data-analysis data-science efficiency pypi python statistical-reports

Last synced: 05 Apr 2025

https://github.com/alegonz/baikal

A graph-based functional API for building complex scikit-learn pipelines.

data-science graph-based machine-learning python scikit-learn

Last synced: 08 May 2025

https://github.com/inseefrlab/onyxia

🔬 Data science environment for k8s

bluehats data-science datalab helm insee kubernetes onyxia

Last synced: 15 May 2025

https://github.com/JacksonWuxs/DaPy

Easy-to-use data analysis / manipulation framework for humans

analysis data-analysis data-science efficiency pypi python statistical-reports

Last synced: 28 Mar 2025

https://github.com/siznax/wptools

Wikipedia tools (for Humans): easily extract data from Wikipedia, Wikidata, and other MediaWikis

api-client commons data-science glam linked-open-data mediawiki mediawiki-api open-data python restbase wikidata wikimedia-commons wikipedia wikipedia-api

Last synced: 15 May 2025

https://github.com/kkulma/climate-change-data

:earth_africa: A curated list of APIs, open data and ML/AI projects on climate change

climate climate-analysis climate-change climate-data data data-science datascience hacktoberfest python r resources rstats

Last synced: 04 Apr 2025

https://github.com/youssefhosni/efficient-python-for-data-scientists-book

Official Repo for the Efficient Python for Data Scientists Book. You can buy the book from here:

data-science numpy pandas python

Last synced: 15 May 2025

https://github.com/pgalko/bambooai

A Python library powered by Language Models (LLMs) for conversational data discovery and analysis.

ai ai-agents anthropic data-analysis data-science docker gemini groq llm mistral ollama openai-api pandas pinecone python vector-database vllm

Last synced: 15 May 2025

https://github.com/dmbee/seglearn

Python module for machine learning time series:

data-science machine-learning python time-series

Last synced: 14 Mar 2025

https://dmbee.github.io/seglearn/

Python module for machine learning time series:

data-science machine-learning python time-series

Last synced: 01 Apr 2025

https://github.com/capitalone/datacompy

Pandas, Polars, Spark, and Snowpark DataFrame comparison for humans and more!

compare dask data data-science dataframes fugue numpy pandas polars pyspark python snowflake snowpark spark

Last synced: 14 May 2025

https://github.com/rpy2/rpy2

Interface to use R from Python

cffi data-science interoperability python r statistics

Last synced: 14 May 2025

https://github.com/GRAAL-Research/poutyne

A simplified framework and utilities for PyTorch

data-science deep-learning keras machine-learning neural-network python pytorch

Last synced: 27 Mar 2025

https://github.com/LearnDataSci/articles

A repository for the source code, notebooks, data, files, and other assets used in the data science and machine learning articles on LearnDataSci

data-analysis data-science data-visualization machine-learning machine-learning-algorithms machinelearning python

Last synced: 13 Apr 2025

https://github.com/rushter/heamy

A set of useful tools for competitive data science.

data-science machine-learning stacking

Last synced: 16 May 2025

https://github.com/firmai/pandapy

PandaPy has the speed of NumPy and the usability of Pandas, 10x to 50x faster (by @firmai)

algorithmic-trading arrays data-science data-structures finance machine-learning numpy pandas structured-data

Last synced: 06 May 2025

https://github.com/youssefHosni/Efficient-Python-for-Data-Scientists-Book

Writing clean and optimized Python code

data-science numpy pandas python

Last synced: 16 Mar 2025

https://github.com/Lackoftactics/facebook_data_analyzer

Analyze your copy of your Facebook data with Ruby. Download the zip file from Facebook and get info about friend rankings by message, vocabulary, contacts, friends-added statistics, and more.

conversation data-science data-visualization english-language facebook facebook-data facebook-data-analyzer ruby ruby-gem scraping script statistics

Last synced: 20 Nov 2024

https://github.com/youssefhosni/efficient-python-for-data-scientists

Writing clean and optimized Python code

data-science numpy pandas python

Last synced: 25 Jan 2025

https://github.com/WecoAI/aideml

AIDE: the state-of-the-art machine learning engineer agent, generating machine learning solution code from natural language descriptions.

ai data-science llm machine-learning

Last synced: 02 May 2025

https://github.com/bradleyboehmke/data-science-learning-resources

A collection of data science and machine learning resources that I've found helpful (I only post what I've read!)

data-science machine-learning

Last synced: 07 Apr 2025

https://hdi-project.github.io/ATM/

Auto Tune Models - A multi-tenant, multi-data system for automated machine learning (model selection and tuning).

automl data-science distributed-computing hyperparameter-optimization machine-learning

Last synced: 12 May 2025

https://github.com/justmarkham/pycon-2019-tutorial

Data Science Best Practices with pandas

data-science pandas python tutorial vizualisation

Last synced: 05 Apr 2025

https://github.com/HDI-Project/ATM

Auto Tune Models - A multi-tenant, multi-data system for automated machine learning (model selection and tuning).

automl data-science distributed-computing hyperparameter-optimization machine-learning

Last synced: 25 Nov 2024

https://github.com/giorgi/duckdb.net

Bindings and ADO.NET Provider for DuckDB

ado-net data-science duckdb duckdb-database hacktoberfest

Last synced: 14 May 2025

https://github.com/RunLLM/aqueduct

Aqueduct is no longer being maintained. Aqueduct allows you to run LLM and ML workloads on any cloud infrastructure.

ai data data-science kubernetes llm llms machine-learning ml ml-infrastructure ml-monitoring mlops orchestration python python3

Last synced: 18 Apr 2025

https://github.com/openhackathons-org/gpubootcamp

This repository consists of GPU bootcamp material for HPC and AI

ai4hpc cuda data-science deep-learning deepstream gpu hpc machine-learning mpi openacc openmp rapidsai

Last synced: 27 Mar 2025

https://github.com/HoloClean/holoclean

A Machine Learning System for Data Enrichment.

data-enrichment data-science inference-engine machine-learning pytorch

Last synced: 02 May 2025

https://github.com/juliaacademy/datascience

Data Science in Julia course for JuliaAcademy.com, taught by Huda Nassar

data-science julia juliaacademy learnjulia

Last synced: 12 Apr 2025

https://github.com/frictionlessdata/datapackage

Data Package is a standard consisting of a set of simple yet extensible specifications to describe datasets, data files and tabular data. It is a data definition language (DDL) and data API that facilitates findability, accessibility, interoperability, and reusability (FAIR) of data.

csv data-science json metadata schema validation

Last synced: 03 Apr 2025

https://github.com/JuliaAcademy/DataScience

Data Science in Julia course for JuliaAcademy.com, taught by Huda Nassar

data-science julia juliaacademy learnjulia

Last synced: 15 Mar 2025

https://github.com/ericlagergren/decimal

A high-performance, arbitrary-precision, floating-point decimal library.

arbitrary-precision big-decimal data-science decimal dogs-of-instagram financial general-decimal-arithmetic money multi-precision

Last synced: 20 Nov 2024

https://github.com/Giorgi/DuckDB.NET

Bindings and ADO.NET Provider for DuckDB

ado-net data-science duckdb duckdb-database hacktoberfest

Last synced: 24 Mar 2025

https://github.com/microsoft/Reactors

🌱 Join a community of developers at Microsoft Reactor and connect with people, skills, and technology to build your career or personal learning. We offer free livestreams, on-demand content, and hybrid/in-person events daily around the world. Access our projects and code here.

ai azure cloud data data-science devops dotnet events iot live-streaming low-code meetup mixed-reality ml no-code nodejs personal-de python web

Last synced: 05 May 2025

https://github.com/alteryx/compose

A machine learning tool for automated prediction engineering. It allows you to easily structure prediction problems and generate labels for supervised learning.

ai automl data-labeling data-science labeling labeling-tool machine-learning prediction-engineering prediction-problem training-data

Last synced: 14 May 2025

https://github.com/vi3k6i5/guidedlda

Semi-supervised guided topic model with custom guidedLDA

data-science guided-topic-modeling guidedlda machine-learning seededlda topic-modeling

Last synced: 12 Apr 2025

https://github.com/vi3k6i5/GuidedLDA

Semi-supervised guided topic model with custom guidedLDA

data-science guided-topic-modeling guidedlda machine-learning seededlda topic-modeling

Last synced: 03 May 2025

https://github.com/jmschrei/apricot

apricot implements submodular optimization for the purpose of selecting subsets of massive data sets to train machine learning models quickly. See the documentation page: https://apricot-select.readthedocs.io/en/latest/index.html

data-science machine-learning python submodular-optimization submodularity

Last synced: 04 Apr 2025

https://github.com/alteryx/open_source_demos

A collection of demos showcasing automated feature engineering and machine learning in diverse use cases

compose data-science evalml feature-engineering featuretools machine-learning python tutorial

Last synced: 09 Apr 2025

https://github.com/akanz1/klib

Easy to use Python library of customized functions for cleaning and analyzing data.

data-analysis data-cleaning data-preprocessing data-science data-visualization feature-selection klib python

Last synced: 08 May 2025

https://github.com/plotly/dash.jl

Dash for Julia - A Julia interface to the Dash ecosystem for creating analytic web applications in Julia. No JavaScript required.

bioinformatics charting dash dashboard data-science data-visualization finance gui-framework julia modeling no-javascript no-vba plotly plotly-dash productivity react technical-computing web-app

Last synced: 15 May 2025

https://github.com/SwanHubX/SwanLab

⚡️ SwanLab: your ML experiment notebook, with logging and visualization for the entire AI training process.

data-science deep-learning fastapi jax machine-learning mlops model-versioning python pytorch tensorboard tensorflow tracking transformers visualization

Last synced: 05 Mar 2025

https://github.com/ottogroup/palladium

Framework for setting up predictive analytics services

data-science machine-learning scikit-learn

Last synced: 12 Apr 2025

https://github.com/s-shemmee/sql-101

Get started with SQL database programming. This beginner's guide provides step-by-step tutorials, practical examples, exercises, and resources to master SQL. Let's unlock the power of data with SQL!

data-analysis data-science sql sql-challenges sql-commands sql-database sql-injection sql-server

Last synced: 05 Apr 2025

https://github.com/serengil/chefboost

A Lightweight Decision Tree Framework supporting regular algorithms: ID3, C4.5, CART, CHAID and Regression Trees; some advanced techniques: Gradient Boosting, Random Forest and Adaboost w/categorical features support for Python

adaboost c45-trees cart categorical-features data-mining data-science decision-trees gbdt gbm gbrt gradient-boosting gradient-boosting-machine gradient-boosting-machines id3 kaggle machine-learning python random-forest regression-tree

Last synced: 14 May 2025

https://github.com/mfarragher/obsidiantools

Obsidian tools - a Python package for analysing an Obsidian.md vault

data-science knowledge-management network-analysis note-taking obsidian-community obsidian-md python

Last synced: 16 May 2025

https://github.com/breck7/scroll

Scroll is a language for scientists of all ages. Scroll includes a command line app that builds static blogs, websites, CSVs, text files, and more.

blog cms csv data-science knowledge-base knowledge-graph markdown markup markup-language note-taking scroll static-site-generator tree-notation

Last synced: 15 Apr 2025

https://github.com/jbn/zigzag

Python library for identifying the peaks and valleys of a time series.

data-science statistics technical-analysis

Last synced: 16 May 2025

https://github.com/pykale/pykale

Knowledge-Aware machine LEarning (KALE): accessible machine learning from multiple sources for interdisciplinary research, part of the 🔥 PyTorch ecosystem. ⭐ Star to support our work!

computer-vision data-science deep-learning domain-adaptation graph-analysis knowledge-aware-learning machine-learning medical-image-analysis meta-learning multimodal multimodal-learning python pytorch transfer-learning

Last synced: 15 May 2025

https://github.com/ploomber/sklearn-evaluation

Machine learning model evaluation made easy: plots, tables, HTML reports, experiment tracking and Jupyter notebook analysis.

data-science deep-learning jupyter-notebook machine-learning pytorch scikit-learn sklearn tensorflow

Last synced: 13 Apr 2025

https://github.com/rudeboybert/fivethirtyeight

R package of data and code behind the stories and interactives at FiveThirtyEight

cran data-science datajournalism fivethirtyeight r rpackage statistics

Last synced: 16 May 2025

https://github.com/FilippoBovo/production-data-science

Production Data Science: a workflow for collaborative data science aimed at production

collaborative data-science production workflow

Last synced: 02 May 2025

https://github.com/aeturrell/skimpy

skimpy is a lightweight tool that provides summary statistics about variables in data frames within the console.

data-science eda exploratory-data-analysis pandas statistics summary-statistics

Last synced: 07 May 2025

https://github.com/filippobovo/production-data-science

Production Data Science: a workflow for collaborative data science aimed at production

collaborative data-science production workflow

Last synced: 05 Apr 2025

https://github.com/dcai-course/dcai-lab

Lab assignments for Introduction to Data-Centric AI, MIT IAP 2024 👩🏽‍💻

course data-centric-ai data-science deep-learning homework lab machine-learning

Last synced: 26 Mar 2025

https://github.com/pgalko/BambooAI

A lightweight library that leverages Language Models (LLMs) to enable natural language interactions, allowing you to source and converse with data.

ai ai-agents data-analysis data-science gemini groq llm mistral ollama openai-api pandas pinecone python vector-database

Last synced: 23 Mar 2025