awesome-data-analysis

🚀 500+ curated resources for Data Analysis & Data Science: Python, SQL, Statistics, ML, AI, Visualization, Cheatsheets, Roadmaps, Interview Prep. For beginners and experts.
https://github.com/pavelgrigoryevds/awesome-data-analysis

  • 📊 Data Visualization

    • Resources

    • Tools

      • Python for Geo - Contextily: add background basemaps to your plots in GeoPandas.
      • Altair - A declarative statistical visualization library for Python.
      • Plotnine - A grammar of graphics for Python.
      • Pygal - A Python SVG charting library.
      • Matplotlib - A comprehensive library for creating static, animated, and interactive visualizations in Python.
      • Seaborn - A statistical data visualization library based on Matplotlib.
      • Plotly - A library for creating interactive plots and dashboards.
      • Bokeh - A library for creating interactive visualizations for modern web browsers.
      • HoloViews - A tool for building complex visualizations easily.
      • Geopandas - An extension of Pandas for geospatial data.
      • Folium - A library for visualizing data on interactive maps.
      • Bqplot - A plotting library for IPython/Jupyter notebooks.
      • PyPalettes - A large collection (2,500+) of color maps for Python.
      • Deck.gl - A WebGL-powered framework for visual exploratory data analysis of large datasets.
      • OSMnx - A package to easily download, model, analyze, and visualize street networks from OpenStreetMap.
      • Apache ECharts - A powerful, interactive charting and visualization library for browser-based applications.
      • VisPy - A high-performance interactive 2D/3D data visualization library leveraging the power of OpenGL.
      • Glumpy - A Python library for scientific visualization that is fast, scalable and beautiful, based on OpenGL.
      • Pandas-bokeh - Bokeh plotting backend for Pandas.
  • 📈 Dashboards & BI

    • Tools

      • Gradio - Tool for creating and sharing machine learning applications.
      • Dash - Framework for creating interactive web applications.
      • Streamlit - Simplified framework for building data applications.
      • Panel - Framework for creating interactive web applications.
      • OpenSearch Dashboards - A powerful data visualization and dashboarding tool for OpenSearch data, forked from Kibana.
      • GridStack.js - A library for building draggable, resizable responsive dashboard layouts.
      • Tremor - A React library to build dashboards fast with pre-built components for charts, KPIs, and more.
      • Appsmith - An open-source platform to build and deploy internal tools, admin panels, and CRUD apps quickly.
      • Grafanalib - A Python library for generating Grafana dashboards configuration as code.
      • H2O Wave - A Python framework for rapidly building and deploying realtime web apps and dashboards for AI and analytics.
      • Shiny for Python - Python version of the popular R Shiny framework.
      • Voilà - Turn Jupyter notebooks into standalone web applications.
      • Reflex - Full-stack Python framework for building web apps.
    • Software

      • Redash - Tool for visualizing and sharing data insights.
      • Grafana - Dashboarding and monitoring tool.
      • ChartBlocks - Online chart creation platform.
      • Infogram - Tool for creating infographics and visual content.
      • Google Data Studio - Free tool for creating interactive dashboards and reports.
      • Microsoft Power BI - Business analytics tool for visualizing data.
      • QlikView - Tool for data visualization and business intelligence.
      • Preset - A platform for modern business intelligence, providing a hosted version of Apache Superset.
      • Metabase - The simplest way to get analytics and business intelligence for everyone in your company.
      • Rath - Next-generation automated data exploratory analysis and visualization platform.
      • Kibana - The official visualization and dashboarding tool for the Elastic Stack (Elasticsearch, Logstash, Beats).
    • Resources

  • 🤖 Machine Learning & AI

    • Tools

      • PEFT - Library for efficiently adapting large pretrained models.
      • Ultralytics - YOLOv8 and other computer vision models.
      • Scikit-learn - Machine learning library for classical algorithms and model building.
      • XGBoost - Optimized distributed gradient boosting library for tree-based models.
      • LightGBM - Fast, distributed, high-performance gradient boosting framework.
      • CatBoost - High-performance gradient boosting on decision trees with categorical features support.
      • H2O-3 - Open-source distributed machine learning platform.
      • cuML - GPU-accelerated machine learning algorithms from RAPIDS.
      • dlib - Modern C++ toolkit containing machine learning algorithms and tools.
      • SHAP - Game theoretic approach to explain the output of any machine learning model.
      • InterpretML - Fit interpretable models and explain blackbox machine learning.
      • Optuna - Hyperparameter optimization framework.
      • TensorFlow - End-to-end open source platform for machine learning and deep learning.
      • PyTorch - Deep learning framework with strong support for research and production.
      • PyTorch Lightning - PyTorch wrapper for high-performance AI research.
      • PyTorch Ignite - High-level library to help with training and evaluating neural networks.
      • Keras - High-level neural networks API, running on top of TensorFlow.
      • Fast.ai - Deep learning library simplifying training fast and accurate neural nets.
      • HuggingFace Transformers - Model-definition framework for state-of-the-art machine learning models.
      • HuggingFace Diffusers - Library for state-of-the-art pretrained diffusion models.
      • YOLOv5 - Real-time object detection system.
      • ONNX - Open standard for machine learning interoperability.
      • PyTorch Geometric - Geometric deep learning extension library for PyTorch.
      • Pyro - Deep universal probabilistic programming with Python and PyTorch.
      • Skorch - Scikit-learn compatible neural network library.
      • Sonnet - DeepMind's library for building complex neural networks.
      • JAX - Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more.
      • TensorFlow Models - Official TensorFlow repository with models and examples.
    • Resources

  • 🚀 MLOps

    • Resources

    • Tools

      • ColossalAI - High-performance distributed training framework.
      • DVC - Version control system for machine learning projects.
      • Evidently - Tool for analyzing and monitoring data and model drift.
      • Deepchecks - Validation for ML models and data.
      • Sematic - Tool to build, debug, and execute ML pipelines with native Python.
      • netdata - Real-time performance monitoring.
      • meilisearch - Fast, open-source search engine.
      • vLLM - High-throughput and memory-efficient inference library for LLMs.
      • haystack - LLM framework for building search and question answering systems.
      • Kubeflow - Machine learning toolkit for Kubernetes.
      • Seldon Core - Open source platform for deploying and monitoring machine learning models in production.
      • Feast - A feature store for machine learning that manages and serves ML features to models.
      • BentoML - Framework for building, shipping, and scaling ML applications.
      • MLflow - Open-source platform for the complete machine learning lifecycle.
      • Wandb - Tool for experiment tracking, dataset versioning, and model management.
      • Comet ML - ML platform for tracking, comparing and optimizing experiments.
      • Netflix Metaflow - A human-friendly Python library for helping scientists and engineers build and manage real-life data science projects.
      • mindsdb - Platform for integrating AI into databases and applications.
      • KServe - Standardized serverless inference platform for deploying and serving machine learning models on Kubernetes.
      • SQLFlow - Brings machine learning capabilities to SQL, enabling model training and prediction using SQL syntax.
      • Jina AI Serve - Framework for building and deploying AI services that communicate via gRPC, HTTP and WebSockets.
      • LiteLLM - Unified interface to call all LLM APIs (OpenAI, Anthropic, Cohere, etc.) with consistent output formatting.
  • ☁️ Cloud Platforms & Infrastructure

    • Tools

      • Higress - Cloud-native API gateway based on Istio.
      • Docker - Open platform for developing, shipping, and running applications in containers.
      • Docker Compose - A tool for defining and running multi-container Docker applications.
      • Kubernetes - Production-grade container orchestration system.
      • Kompose - Conversion tool from Docker Compose to Kubernetes.
      • Terraform - Infrastructure as Code tool.
      • OpenTofu - Open source fork of Terraform.
      • Pulumi - Modern IaC platform using familiar programming languages.
      • CDK8s - Define Kubernetes apps using familiar languages.
      • Jenkins - Open source automation server.
      • Argo CD - Declarative GitOps continuous delivery.
      • Argo Workflows - Container-native workflow engine.
      • Tekton - Kubernetes-native CI/CD framework.
      • Spinnaker - Multi-cloud continuous delivery.
      • Dagger - Portable devkit for CI/CD pipelines.
      • Traefik - Modern HTTP reverse proxy and load balancer.
      • Kong - Cloud-native API Gateway.
      • Apache APISIX - Dynamic API gateway.
      • Envoy Gateway - Manages Envoy Proxy as gateway.
      • Meshery - Service mesh management.
      • Helm - Package manager for Kubernetes.
      • Kustomize - Configuration customization for Kubernetes.
      • Kubernetes Dashboard - Web-based UI for Kubernetes.
      • Skaffold - Continuous development for Kubernetes.
      • Tilt - Local development for Kubernetes.
      • Flagger - Progressive delivery operator.
      • KubeVela - Application delivery platform.
      • KubeSphere - Kubernetes multi-cloud management.
      • Crossplane - Cloud native control plane.
      • Artifact Hub - Kubernetes packages and Helm charts.
      • Devtron - Kubernetes dashboard.
      • Harness - End-to-end developer platform.
    • Resources

  • ⚡ Productivity

    • Useful VS Code Extensions

    • Resources

      • Trello - A visual project management tool.
      • Positron - A next-generation data science IDE.
      • Nanobrowser - An open-source AI web automation tool with multi-agent system that runs directly in your browser.
      • Best of Jupyter - Ranked list of notable Jupyter Notebook, Hub, and Lab projects.
      • Deepnote - AI native data science notebook platform compatible with Jupyter, featuring real-time collaboration, environment management, and integrations.
      • AFFiNE - All-in-one workspace for notes, docs, and data visualization.
      • Marimo - Reactive Python notebook for reproducible and interactive data science.
      • ChatGPT Data Science Prompts - A collection of useful prompts for data scientists using ChatGPT.
      • Cookiecutter Data Science - A standardized project structure for data science projects.
      • Learn Regex - Comprehensive guide to learning regular expressions with examples and exercises.
      • Awesome Regex - Curated collection of regex tools, libraries, and learning resources.
      • The Markdown Guide - Comprehensive guide to learning Markdown.
      • Readme-AI - A tool to automatically generate README.md files for your projects.
      • Markdown Here - Extension for writing emails in Markdown and rendering them before sending.
      • MarkText - Simple and elegant markdown editor for documentation.
      • QuarkDown - Lightweight markdown processor for fast document rendering.
      • screenshot-to-code - AI tool that converts screenshots into code for various frontend stacks.
      • Codebeautify - All-in-one online code formatter and beautifier for Python, SQL, JSON, and more.
      • Notion - An all-in-one workspace for note-taking and task management.
      • Habitica - A habit-building and productivity app that treats your life like a role-playing game.
      • Bujo - Tools to help transform the way you work and live.
      • Parabola - An AI-powered workflow builder for organizing data.
      • Asana - A project management platform for tracking work and projects.
      • Puter - An open-source, browser-based computing environment and cloud OS.
    • Useful Linux Tools

      • tldr-pages - Simplified and community-driven man pages with practical examples.
      • Bat - Cat clone with syntax highlighting.
      • Exa - Modern replacement for ls.
      • Ripgrep - Faster grep alternative.
      • Zoxide - Smarter cd command.
      • Peek - Simple animated GIF screen recorder with an easy to use interface.
      • CopyQ - Clipboard manager with advanced features.
      • Translate Shell - Command-line translator using Google Translate, Bing Translator, Yandex.Translate, etc.
      • Espanso - Cross-platform Text Expander written in Rust.
      • Flameshot - Powerful yet simple to use screenshot software.
      • DrawIO Desktop - An open-source diagramming software for making flowcharts, process diagrams, and more.
      • Inkscape - A powerful, free, and open-source vector graphics editor for creating and editing visualizations.
      • Rclone - A command-line program to manage files on cloud storage.
      • Rsync - A fast and versatile file copying tool that can synchronize files and directories between two locations over a network or locally.
      • Timeshift - System restore tool for Linux that creates filesystem snapshots using rsync+hardlinks or BTRFS snapshots.
      • Backintime - A comfortable and well-configurable graphical frontend for incremental backups.
      • Fzf - A command-line fuzzy finder.
      • Osquery - SQL powered operating system instrumentation, monitoring, and analytics.
      • GNU Parallel - A tool to run jobs in parallel.
      • HTop - An interactive process viewer.
      • Ncdu - A disk usage analyzer with an ncurses interface.
      • Thefuck - A command line tool to correct your previous console command.
      • Miller - A tool for querying, processing, and formatting data in various file formats (CSV, JSON, etc.), like awk/sed/cut for data.
      • jq - Command-line JSON processor for parsing and manipulating JSON data.
      • yq - Portable command-line YAML processor (like jq for YAML and XML).
      • q - Run SQL directly on CSV or TSV files from the command line.
      • VisiData - Interactive multitool for tabular data exploration in the terminal.
      • csvkit - Suite of command-line tools for working with CSV data.
      • httpie - Modern command-line HTTP client for API testing and debugging.
      • glances - Cross-platform system monitoring tool for resource usage analysis.
      • hyperfine - Command-line benchmarking tool for performance testing.
      • termgraph - Draw basic graphs in the terminal for quick data visualization.
      • fd - Simple, fast and user-friendly alternative to 'find'.
      • dust - A more intuitive version of du, written in Rust.
      • bottom - Cross-platform graphical process/system monitor.
  • 📋 Cheatsheets

  • 📦 Additional Python Libraries

    • Code Quality & Development

      • Mypy - Optional static typing for Python.
      • Pydeps - Python module dependency graphs.
      • PyForest - Automated Python imports for data science.
      • Black - Uncompromising Python code formatter.
      • Pre-commit - Framework for managing pre-commit hooks.
      • Pylint - Python code static analysis.
      • Rich - Rich text and beautiful formatting in the terminal.
      • Icecream - Debugging without using print.
      • Pandas-log - Logs pandas operations for data transformation tracking.
      • PandasVet - Code style validator for Pandas.
    • Miscellaneous

      • Pytest - Framework for writing small tests.
      • Pampy - Pattern matching for Python dictionaries.
      • UV - An extremely fast Python package installer and resolver.
      • Funcy - Fancy functional tools for Python.
      • Pillow - Image processing library.
      • Ftfy - Fixes broken Unicode strings.
      • Glom - Transforms nested data structures.
      • GitPython - A Python library used to interact with Git repositories.
      • TQDM - Progress bars for loops and operations.
      • Loguru - Python logging made simple.
      • Click - Beautiful command line interfaces.
      • Poetry - Python dependency management and packaging.
      • Hydra - Elegant configuration management.
      • JmesPath - Queries JSON data (SQL-like for JSON).
      • Diagrams - Diagrams as code for cloud architecture.
      • Pygorithm - A Python module for learning all major algorithms.
    • Documentation & File Processing

      • PyPDF2 - Reads and writes PDF files.
      • Python-docx - Reads and writes Word documents.
      • Python-markdownify - Convert HTML to Markdown.
      • Sphinx - Documentation generator.
      • Pdoc - API documentation for Python projects.
      • Mkdocs - Project documentation with Markdown.
      • OpenPyXL - Read/write Excel files.
      • Tablib - Exports data to XLSX, JSON, CSV.
      • CleverCSV - Smart CSV reader for messy data.
      • Xlwings - Integration of Python with Excel.
      • WeasyPrint - Convert HTML to PDF.
      • Xmltodict - Converts XML to Python dictionaries.
      • MarkItDown - Python tool for converting files and office documents to Markdown.
      • Jupyter-book - Build publication-quality books from Jupyter notebooks.
      • PyMuPDF - Advanced PDF manipulation library.
      • Camelot - PDF table extraction library.
    • Web & APIs

      • FastAPI - Modern web framework for building APIs.
      • Flask - Lightweight Python web framework for building applications and APIs.
      • Typer - Library for building CLI applications.
      • Requests-cache - Persistent caching for requests library.
      • HTTPX - Next-generation HTTP client for Python.
  • 🕸️ Web Scraping & Crawling

    • Resources

    • Tools

      • Selenium - A tool for automating web applications for testing purposes.
      • Dirsearch - A web path scanner.
      • Scrapy - An open-source and collaborative web crawling framework for Python.
      • Requests - A simple, yet elegant, HTTP library for Python.
      • BeautifulSoup - A library for parsing HTML and XML documents.
      • Browser Use - A library for browser automation and web scraping.
      • Gerapy - Distributed Crawler Management Framework based on Scrapy, Scrapyd, Django, and Vue.js.
      • AutoScraper - A smart, automatic, fast, and lightweight web scraper for Python.
      • Feedparser - A library to parse feeds in Python.
      • Trafilatura - A Python & command-line tool to gather text and metadata on the web.
      • You-Get - A tiny command-line utility to download media contents (videos, audios, images) from the web.
      • MechanicalSoup - A Python library for automating interaction with websites.
      • ScrapeGraph AI - A Python scraper based on AI.
      • Snscrape - A social networking service scraper in Python.
      • Ferret - A web scraping system that lets you declaratively describe what data to extract using a simple query language.
      • Grab - A Python framework for building web scraping apps, providing a high-level API for asynchronous requests.
      • Playwright - Python version of the Playwright browser automation library.
      • PyQuery - A jQuery-like library for parsing HTML documents in Python.
      • Helium - High-level Selenium wrapper for easier web automation.
      • Scrapling - A framework for building web scrapers and crawlers.
      • Crawl4AI - Advanced web crawling framework designed for AI and data extraction tasks.
  • 🗺️ Roadmaps

  • 🏆 Awesome Data Science Repositories

  • 🐍 Python

    • Resources

    • Useful Python Tools for Data Analysis

      • fitter - Figures out the distribution your data comes from.
      • Arrow - Enhanced work with dates and times.
      • Cerberus - Data validation through schemas.
      • Pandera - Data validation through declarative schemas.
      • Petl - ETL tool for data cleaning and transformation.
      • D-Tale - Interactive GUI for data analysis in a browser.
      • Pandarallel - Parallel operations for pandas DataFrames.
      • Dask - Parallel computing for arrays and DataFrames.
      • Modin - Speeds up Pandas by distributing computations.
      • Pillow - Image processing library.
      • Geopy - Geocoding addresses and calculating distances.
      • Scattertext - Beautiful visualizations of language differences among document types.
      • IGraph - A library for creating and manipulating graphs and networks, with bindings for multiple languages.
      • Pandas DQ - Data type correction and automatic DataFrame cleaning.
      • PyOD - Outlier and anomaly detection.
      • Pandas Flavor - Add custom methods to Pandas.
      • Vaex - High-performance Python library for lazy Out-of-Core DataFrames.
      • Polars - Multithreaded, vectorized query engine for DataFrames.
      • Fugue - Unified interface for Pandas, Spark, and Dask.
      • TheFuzz - Fuzzy string matching (Levenshtein distance).
      • DateUtil - Extensions for standard Python datetime features.
      • Pendulum - Alternative to datetime with timezone support.
      • DataCleaner - Python tool for automatically cleaning and preparing datasets.
      • Pandas DataReader - Reads data from various online sources into pandas DataFrames.
      • Sklearn Pandas - Bridge between Pandas and Scikit-learn.
      • CuPy - A NumPy-compatible array library accelerated by NVIDIA CUDA for high-performance computing.
      • Numba - A JIT compiler that translates a subset of Python and NumPy code into fast machine code.
      • Pandas Stubs - Type stubs for pandas, improves IDE autocompletion.
      • AutoViz - Automatic data visualization in 1 line of code.
      • Sweetviz - Automatic EDA with dataset comparison.
      • Lux - Automatic DataFrame visualization in Jupyter.
      • YData Profiling - Data quality profiling & exploratory data analysis.
      • Missingno - Visualize missing data patterns.
      • Vizro - Low-code toolkit for building data visualization apps.
      • Yellowbrick - Visual diagnostic tools for machine learning.
      • Great Tables - Create awesome display tables using Python.
      • DataMapPlot - Create beautiful plots of data maps.
      • Datashader - Quickly and accurately render even the largest data.
      • PandasAI - Conversational data analysis using LLMs and RAG.
      • Mito - Jupyter extensions for faster code writing.
      • Pandasgui - GUI for viewing and filtering DataFrames.
      • PyGWalker - Interactive UIs for visual analysis of DataFrames.
      • QGrid - Interactive grid for DataFrames in Jupyter.
      • Pivottablejs - Interactive PivotTable.js tables in Jupyter.
      • Alibi Detect - Outlier, adversarial and drift detection.
      • Pydantic - Data validation using Python type annotations.
      • Dora - Automate EDA: preprocessing, feature engineering, visualization.
      • Great Expectations - Data validation and testing.
      • FeatureTools - Automated feature engineering.
      • Feature Engine - Feature engineering with Scikit-Learn compatibility.
      • Prince - Multivariate exploratory data analysis (PCA, CA, MCA).
      • Feature Selector - Tool for dimensionality reduction of machine learning datasets.
      • Category Encoders - Extensive collection of categorical variable encoders.
      • Imbalanced Learn - Handling imbalanced datasets.
      • cuDF - A GPU DataFrame library for loading, joining, and aggregating data.
      • Faker - Generates fake data for testing.
      • Mimesis - Generates realistic test data.
      • PySAL - Spatial analysis functions.
      • Factor Analyzer - A Python package for factor analysis, including exploratory and confirmatory methods.
      • Joblib - A lightweight pipelining library for Python, particularly useful for saving and loading large NumPy arrays.
      • ImageIO - A library that provides an easy interface to read and write a wide range of image data.
      • Texthero - Text preprocessing, representation and visualization.
      • Geopandas - Geographic data operations with pandas.
      • NetworkX - Network analysis and graph theory.
    • Data Manipulation with Pandas and Numpy

  • 🗃️ SQL & Databases

    • Resources

    • Tools

      • SQLAlchemy - SQL toolkit and ORM for Python.
      • Psycopg2 - PostgreSQL database adapter.
      • MySQL Connector/Python - MySQL driver for Python.
      • PonyORM - ORM for Python with dynamic query generation.
      • PyODBC - Python library for ODBC database access.
      • PyMongo - Official MongoDB driver for Python.
      • SQLiteviz - A tool for exploring SQLite databases and visualizing the results of your queries.
      • SQLite - A C-language library that implements a small, fast, self-contained, high-reliability, full-featured SQL database engine.
      • Vanna.AI - An AI-powered tool for generating SQL queries from natural language questions.
      • Records - SQL queries to databases via Python syntax.
      • DB Browser for SQLite - A high quality, visual, open source tool to create, design, and edit database files compatible with SQLite.
      • DBeaver - A free universal database tool and SQL client for developers, SQL programmers, and administrators.
      • Beekeeper Studio - A modern, easy-to-use SQL client and database manager with a clean, cross-platform interface.
      • SQLFluff - A modular SQL linter and auto-formatter designed to enforce consistent style and catch errors in SQL code.
      • PyMySQL - A pure-Python MySQL client library for interacting with MySQL databases from Python applications.
      • SQLChat - A chat-based SQL client that allows you to query databases using natural language conversations.
      • Dataset - JSON-like interface for working with SQL databases.
      • SQLGlot - A no-dependency SQL parser, transpiler, and optimizer for Python.
      • TDengine - An open-source big data platform designed for time-series data, IoT, and industrial monitoring.
      • TimescaleDB - An open-source time-series SQL database optimized for fast ingest and complex queries.
      • DuckDB - In-process analytical database for fast SQL queries.
  • 📈 Dashboards

  • 📖 Natural Language Processing (NLP)

    • Resources

    • Tools

      • TextBlob - A simple library for processing textual data.
      • TextRank - A library for TextRank algorithm implementation.
      • Natural Language Toolkit (NLTK) - A leading platform for building Python programs to work with human language data.
      • SpaCy - An open-source software library for advanced NLP in Python.
      • BERT - A transformer-based model for NLP tasks.
      • Flair - A simple framework for state-of-the-art NLP.
      • OpenHands - A library and framework for building applications with large language models.
      • Stanford CoreNLP - A Java suite of core NLP tools providing fundamental linguistic analysis capabilities.
      • John Snow Labs Spark-NLP - A state-of-the-art Natural Language Processing library built on Apache Spark.
      • TextAttack - A Python framework for adversarial attacks, data augmentation, and model training in NLP.
      • Gensim - Topic modeling and natural language processing library for Python.
      • Stanza - Python NLP library for many human languages, from the Stanford NLP Group.
      • SentenceTransformers - Framework for state-of-the-art sentence and text embeddings.
      • LangExtract - Google's library for structured information extraction from text using language models.
      • Rasa - Open-source framework for building contextual AI assistants and chatbots.
  • 🔢 Mathematics

  • 🔢 Mathematics, Statistics & Probability

  • 🎲 Statistics & Probability

    • Resources

    • Tools

      • SciPy - Fundamental library for scientific computing and statistics.
      • Statsmodels - Statistical modeling, testing, and data exploration.
      • PyMC - A probabilistic programming library for Python that allows for flexible Bayesian modeling.
      • Pingouin - Statistical package with improved usability over SciPy.
      • scikit-posthocs - Post-hoc tests for statistical analysis of data.
      • Lifelines - Survival analysis and event history analysis in Python.
      • scikit-survival - Survival analysis built on scikit-learn for time-to-event prediction.
      • Bootstrap - Bootstrap confidence interval estimation methods.
      • PyStan - Python interface to Stan for Bayesian statistical modeling.
      • ArviZ - Exploratory analysis of Bayesian models with visual diagnostics.
      • PyGAM - A Python library for generalized additive models with built-in smoothing and regularization.
      • NumPyro - A probabilistic programming library built on JAX for high-performance Bayesian modeling.
      • Causal Impact - A Python implementation of the R package for causal inference using Bayesian structural time-series models.
      • DoWhy - A Python library for causal inference that supports explicit modeling and testing of causal assumptions.
      • Patsy - A Python library for describing statistical models and building design matrices.
      • Pomegranate - Fast and flexible probabilistic modeling library for Python with GPU support.
      • Pgmpy - Python library for probabilistic and causal inference using graphical models.
  • 🧪 A/B Testing

  • ⏳ Time Series Analysis

    • Resources

    • Tools

      • PlotJuggler - A tool to visualize and analyze time series data logs in real-time.
      • TSFresh - Automatically extracting features from time series data.
      • pmdarima - Python library for ARIMA modeling and time series analysis.
      • Kats - Toolkit for analyzing time series data from Facebook Research.
      • Facebook Prophet - A procedure for forecasting time series data based on an additive model.
      • Uber Orbit - A Python package for Bayesian time series forecasting and inference.
      • sktime - A unified Python framework for machine learning with time series, compatible with scikit-learn.
      • GluonTS - A Python toolkit for probabilistic time series modeling, built on MXNet.
      • Time-Series-Library - A library for deep learning-based time series analysis and forecasting.
      • TimesFM - A pretrained time series foundation model from Google Research for zero-shot forecasting.
      • PyTorch Forecasting - A PyTorch-based library for time series forecasting with neural networks.
      • Time-series-prediction - A collection of time series prediction methods and implementations.
  • ⚙️ Data Engineering

    • Resources

    • Tools

      • Apache Hive - A data warehouse software for reading, writing, and managing large datasets in distributed storage using SQL.
      • Apache Hadoop - A framework that allows for the distributed processing of large data sets across clusters of computers.
      • dbt-core - A framework for transforming data in your warehouse using SQL and Jinja.
      • Apache Spark - A unified engine for large-scale data processing and analytics.
      • Apache Kafka - A distributed event streaming platform for building real-time data pipelines.
      • Dagster - A data orchestrator for machine learning, analytics, and ETL.
      • Apache Airflow - A platform to programmatically author, schedule, and monitor workflows.
      • Luigi - A Python module for building complex and batch-oriented data pipelines.
      • Apache Iceberg - A high-performance table format for huge analytic datasets.
      • Apache Cassandra - A highly scalable distributed NoSQL database designed for handling large amounts of data across many commodity servers.
      • Apache Flink - A framework for stateful computations over unbounded and bounded data streams (real-time stream processing).
      • Apache Beam - A unified model for defining both batch and streaming data-parallel processing pipelines.
      • Apache Pulsar - A cloud-native, distributed messaging and streaming platform.
      • Delta Lake - A storage layer that brings ACID transactions to Apache Spark and big data workloads.
      • Apache Hudi - An open data lakehouse platform, built on a high-performance open table format.
      • Trino - A distributed SQL query engine designed for fast analytic queries against large datasets.
      • DataHub - A metadata platform for the modern data stack.
      • OpenLineage - An open framework for collection and analysis of data lineage.
      • Kedro - A framework for creating reproducible, maintainable and modular data science code.
      • Apache Calcite - A dynamic data management framework that allows for SQL parsing, optimization, and federation.
      • Prefect - Workflow orchestration for building resilient data pipelines.
      • Apache Arrow - Universal columnar format and multi-language toolbox for fast data interchange.
      • Kestra - An open-source, event-driven orchestrator that simplifies data workflow management.
  • 🧠 AI Applications & Platforms

    • Tools

      • n8n - Workflow automation platform for connecting APIs and services.
      • crewAI - Framework for orchestrating role-playing AI agents.
      • autogen - Framework for building multi-agent conversational systems.
      • AutoGPT - Autonomous AI agent that can complete complex tasks.
      • LangGraph - Framework for building stateful, multi-actor applications with LLMs, with cycles and control flow.
      • LangChain - Framework for developing applications powered by language models.
      • LlamaIndex - Data framework for LLM-based applications with RAG capabilities.
      • openai-python - Official Python library for OpenAI API.
      • openai-agents-python - Official OpenAI framework for building AI agents.
      • ragflow - Open-source RAG (Retrieval-Augmented Generation) workflow platform.
      • firecrawl - Web crawling and data extraction service for AI applications.
      • Fabric - Framework for augmenting humans using AI.
      • gpt-engineer - AI-powered code generation tool.
      • gpt-pilot - AI pair programmer that writes entire applications.
      • tabby - Self-hosted AI coding assistant.
      • Ollama - Tool for running large language models locally.
      • OpenLLM - Open platform for operating large language models in production.
      • LocalAI - Self-hosted, local-first AI model deployment platform.
      • dify - Visual LLM application development platform.
      • LLaMA-Factory - Easy-to-use LLM fine-tuning framework.
      • open-webui - Web interface for interacting with various LLMs.
      • ComfyUI - Visual node-based interface for Stable Diffusion.
      • lobe-chat - Modern AI conversation interface.
      • LibreChat - Open-source ChatGPT alternative.
      • quivr - Personal second brain and AI assistant.
      • upscayl - AI-powered image upscaling tool.
      • facefusion - AI face swapping and enhancement tool.
      • DocsGPT - Documentation-based question answering system.
      • Whisper - Robust speech recognition model for transcription and translation.
    • Resources

      • Awesome LLM Apps - Collection of awesome LLM apps with AI Agents and RAG using OpenAI, Anthropic, Gemini and opensource models.
      • Awesome Generative AI - A curated list of modern Generative Artificial Intelligence projects and services.
      • Generative AI for Beginners - Course on generative AI for beginners from Microsoft.
      • Awesome AI Agents - A curated list of AI autonomous agents, environments, and frameworks.
      • AI Collection - The Generative AI Landscape - A Collection of Awesome Generative AI Applications.
      • Awesome AI Apps - A collection of projects showcasing RAG, agents, workflows, and other AI use cases.
      • System Prompts and Models - System Prompts, Internal Tools & AI Models from various AI applications and coding tools.
      • Awesome LangChain - Awesome list of tools and projects with the awesome LangChain framework.
      • Awesome AI Tools - A curated list of Artificial Intelligence Top Tools.
      • Awesome LLM Security - A curation of awesome tools, documents and projects about LLM Security.
      • Claude Cookbooks - Official Anthropic examples and recipes for working with Claude AI.
  • 📚 Skill Development & Career

  • More Awesome Lists

  • 📜 License

  • 🌐 Additional Resources and Tools

    • Miscellaneous

      • UC Berkeley - Data 8 - Course materials for the Foundations of Data Science course.
      • PaddleOCR - Production-ready OCR toolkit with multilingual and document AI support.
      • A collective list of free APIs - A comprehensive list of free APIs for various purposes.
      • arXiv.org - A free distribution service and open-access archive for scholarly articles.
      • Elicit - An AI research assistant that helps automate parts of literature review.
      • 500+ AI/ML/DL/NLP Projects - A massive collection of AI and machine learning projects with code for learning and portfolios.
      • Full Stack Fastapi Template - Full-stack template with FastAPI, React, and PostgreSQL.
      • Kittl - Platform for creating and editing charts and data visualizations.
      • Zasper - High-performance IDE for Jupyter Notebooks.
      • Sketch - A design toolkit built around designers' workflows.
      • Growth.Design - A collection of product case studies and behavioral psychology insights for data-driven decision-making.
  • 🀝 Contributing