awesome-data-analysis

🚀📊 400+ curated resources for data analysis and data science: Python, SQL, ML, Visualization, Dashboards, Cheatsheets, Roadmaps, and Interview Prep. Perfect for beginners and pros!
https://github.com/pavelgrigoryevds/awesome-data-analysis

  • 🗺️ Roadmaps

  • 🐍 Python

    • Resources

    • Useful Python Tools for Data Analysis

      • Great Expectations - Data validation and testing.
      • Fitter - Figures out the distribution your data comes from.
      • Sklearn Pandas - Bridge between Pandas and Scikit-learn.
      • CuPy - A NumPy-compatible array library accelerated by NVIDIA CUDA for high-performance computing.
      • Numba - A JIT compiler that translates a subset of Python and NumPy code into fast machine code.
      • Pandas Stubs - Type stubs for pandas, improves IDE autocompletion.
      • Pydantic - Data validation using Python type annotations.
      • Category Encoders - Extensive collection of categorical variable encoders.
      • Imbalanced Learn - Handling imbalanced datasets.
      • PySAL - Spatial analysis functions.
      • ImageIO - A library that provides an easy interface to read and write a wide range of image data.
      • Texthero - Text preprocessing, representation and visualization.
      • Geopandas - Geographic data operations with pandas.
      • NetworkX - Network analysis and graph theory.
      • Pandas DQ - Data type correction and automatic DataFrame cleaning.
      • Vaex - High-performance Python library for lazy Out-of-Core DataFrames.
      • DataCleaner - Python tool for automatically cleaning and preparing datasets.
      • TheFuzz - Fuzzy string matching (Levenshtein distance).
      • PandasAI - Conversational data analysis using LLMs and RAG.
      • DateUtil - Extensions for standard Python datetime features.
      • Fugue - Unified interface for Pandas, Spark, and Dask.
      • Pandas DataReader - Reads data from various online sources into pandas DataFrames.
      • Polars - Multithreaded, vectorized query engine for DataFrames.
      • Arrow - Enhanced work with dates and times.
      • Pendulum - Alternative to datetime with timezone support.
      • AutoViz - Automatic data visualization in 1 line of code.
      • Datashader - Quickly and accurately render even the largest data.
      • Vizro - Low-code toolkit for building data visualization apps.
      • Great Tables - Create awesome display tables using Python.
      • DataMapPlot - Create beautiful plots of data maps.
      • Sweetviz - Automatic EDA with dataset comparison.
      • Lux - Automatic DataFrame visualization in Jupyter.
      • Yellowbrick - Visual diagnostic tools for machine learning.
      • PyOD - Outlier and anomaly detection.
      • YData Profiling - Data quality profiling & exploratory data analysis.
      • Missingno - Visualize missing data patterns.
      • Alibi Detect - Outlier, adversarial and drift detection.
      • Dora - Automate EDA: preprocessing, feature engineering, visualization.
      • FeatureTools - Automated feature engineering.
      • Feature Selector - Tool for dimensionality reduction of machine learning datasets.
      • TSFresh - A Python library for automatically extracting features from time series data.
      • Prince - Multivariate exploratory data analysis (PCA, CA, MCA).
      • Factor Analyzer - A Python package for factor analysis, including exploratory and confirmatory methods.
      • Pytest - Framework for writing small tests.
      • Feature Engine - Feature engineering with Scikit-Learn compatibility.
      • Cerberus - Data validation through schemas.
      • Pandera - Data validation through declarative schemas.
      • PandasVet - Code style validator for Pandas (similar to ESLint).
      • Prefect - Workflow orchestration for building resilient data pipelines.
      • Airflow - Platform for automating data workflows.
      • Apache Arrow - Universal columnar format and multi-language toolbox for fast data interchange.
      • Petl - ETL tool for data cleaning and transformation.
      • D-Tale - Interactive GUI for data analysis in a browser.
      • DuckDB - In-memory analytical database for fast SQL queries.
      • Pandasgui - GUI for viewing and filtering DataFrames.
      • QGrid - Interactive grid for DataFrames in Jupyter.
      • PyGWalker - Interactive UIs for visual analysis of DataFrames.
      • Pivottablejs - Interactive PivotTable.js tables in Jupyter.
      • Faker - Generates fake data for testing.
      • Mimesis - Generates realistic test data.
      • Rich - Rich text and beautiful formatting in the terminal.
      • Pandas-log - Logs pandas operations for data transformation tracking.
      • Icecream - Debugging without using print.
      • Pydeps - Python module dependency graphs.
      • PyForest - Automated Python imports for data science.
      • Pandarallel - Parallel operations for pandas DataFrames.
      • Dask - Parallel computing for arrays and DataFrames.
      • Modin - Speeds up Pandas by distributing computations.
      • Sphinx - The Sphinx documentation generator.
      • Pdoc - API documentation for Python projects.
      • Mkdocs - Project documentation with Markdown.
      • OpenPyXL - Read/write Excel files with support for advanced features.
      • Tablib - Exports data to XLSX, JSON, CSV via a single API.
      • PyPDF2 - Reads and writes PDF files.
      • Python-docx - Reads and writes Word documents.
      • CleverCSV - Smart CSV reader for messy data.
      • Xlwings - Integration of Python with Excel.
      • Xmltodict - Converts XML to Python dictionaries.
      • Python-markdownify - Convert HTML to Markdown.
      • MarkItDown - Python tool for converting files and office documents to Markdown.
      • Pillow - Image processing library.
      • Ftfy - Fixes broken Unicode strings.
      • JmesPath - Queries JSON data (SQL-like for JSON).
      • Glom - Transforms nested data structures.
      • Pampy - Pattern matching for Python dictionaries.
      • Geopy - Geocoding addresses and calculating distances.
      • Diagrams - Diagrams as code for cloud system architecture prototyping.
      • Scattertext - Beautiful visualizations of language differences among document types.
      • Pygorithm - A Python module for learning all major algorithms.
      • IGraph - A library for creating and manipulating graphs and networks, with bindings for multiple languages.
      • Joblib - A lightweight pipelining library for Python, particularly useful for saving and loading large NumPy arrays.
      • Dataset - JSON-like interface for working with SQL databases.
    • Data Manipulation with Pandas and Numpy

    • Data Manipulation with Pandas

  • 🗃️ SQL & Databases

  • 📊 Data Visualization

    • Resources

    • Tools

      • Altair - A declarative statistical visualization library for Python.
      • Glumpy - A Python library for scientific visualization that is fast, scalable and beautiful, based on OpenGL.
      • Pandas-bokeh - Bokeh plotting backend for Pandas.
      • Deck.gl - A WebGL-powered framework for visual exploratory data analysis of large datasets.
      • Python for Geo - Contextily: add background basemaps to your plots in GeoPandas.
      • OSMnx - A package to easily download, model, analyze, and visualize street networks from OpenStreetMap.
      • Apache ECharts - A powerful, interactive charting and visualization library for browser-based applications.
      • VisPy - A high-performance interactive 2D/3D data visualization library leveraging the power of OpenGL.
      • Matplotlib - A comprehensive library for creating static, animated, and interactive visualizations in Python.
      • Seaborn - A statistical data visualization library based on Matplotlib.
      • Plotly - A library for creating interactive plots and dashboards.
      • Bokeh - A library for creating interactive visualizations for modern web browsers.
      • HoloViews - A tool for building complex visualizations easily.
      • Geopandas - An extension of Pandas for geospatial data.
      • Folium - A library for visualizing data on interactive maps.
      • Plotnine - A grammar of graphics for Python.
      • Bqplot - A plotting library for IPython/Jupyter notebooks.
      • PyPalettes - A large (+2500) collection of color maps for Python.
  • 📈 Dashboards & BI

    • Resources

    • Tools

      • OpenSearch Dashboards - A powerful data visualization and dashboarding tool for OpenSearch data, forked from Kibana.
      • GridStack.js - A library for building draggable, resizable responsive dashboard layouts.
      • Tremor - A React library to build dashboards fast with pre-built components for charts, KPIs, and more.
      • Appsmith - An open-source platform to build and deploy internal tools, admin panels, and CRUD apps quickly.
      • Grafanalib - A Python library for generating Grafana dashboards configuration as code.
      • H2O Wave - A Python framework for rapidly building and deploying realtime web apps and dashboards for AI and analytics.
      • Shiny for Python - Python version of the popular R Shiny framework.
      • Voilà - Turn Jupyter notebooks into standalone web applications.
      • Reflex - Full-stack Python framework for building web apps.
      • Dash - Framework for creating interactive web applications.
      • Streamlit - Simplified framework for building data applications.
      • Panel - Framework for creating interactive web applications.
      • Gradio - Tool for creating and sharing machine learning applications.
    • Software

      • Preset - A platform for modern business intelligence, providing a hosted version of Apache Superset.
      • Kibana - The official visualization and dashboarding tool for the Elastic Stack (Elasticsearch, Logstash, Beats).
      • Rath - Next-generation automated data exploratory analysis and visualization platform.
      • Microsoft Power BI - Business analytics tool for visualizing data.
      • QlikView - Tool for data visualization and business intelligence.
      • Redash - Tool for visualizing and sharing data insights.
  • 🕸️ Web Scraping & Crawling

    • Tools

      • Ferret - A web scraping system that lets you declaratively describe what data to extract using a simple query language.
      • Grab - A Python framework for building web scraping apps, providing a high-level API for asynchronous requests.
      • Playwright - Python version of the Playwright browser automation library.
      • PyQuery - A jQuery-like library for parsing HTML documents in Python.
      • Helium - High-level Selenium wrapper for easier web automation.
      • BeautifulSoup - A library for parsing HTML and XML documents.
      • Selenium - A tool for automating web applications for testing purposes.
      • Gerapy - Distributed Crawler Management Framework based on Scrapy, Scrapyd, Django, and Vue.js.
      • TextAttack - A Python framework for adversarial attacks, data augmentation, and model training in NLP.
      • AutoScraper - A smart, automatic, fast, and lightweight web scraper for Python.
      • Feedparser - A library to parse feeds in Python.
      • Trafilatura - A Python & command-line tool to gather text and metadata on the web.
      • You-Get - A tiny command-line utility to download media contents (videos, audios, images) from the web.
      • Dirsearch - A web path scanner.
      • MechanicalSoup - A Python library for automating interaction with websites.
      • ScrapeGraph AI - A Python scraper based on AI.
      • Snscrape - A social networking service scraper in Python.
    • Resources

  • 🔒 Mathematics

    • Tools

      • Awesome Math - A curated list of mathematics resources, books, and online courses.
      • 3Blue1Brown - Visual explanations of mathematical concepts through animated videos.
      • MML Book - Comprehensive resource for mathematics in machine learning.
  • 🏆 Awesome Data Science Repositories

  • 📈 Dashboards

  • 📖 Natural Language Processing (NLP)

    • Resources

    • Tools

      • Natural Language Toolkit (NLTK) - A leading platform for building Python programs to work with human language data.
      • TextBlob - A simple library for processing textual data.
      • SpaCy - An open-source software library for advanced NLP in Python.
      • TextRank - A library for TextRank algorithm implementation.
      • Flair - A simple framework for state-of-the-art NLP.
      • BERT - A transformer-based model for NLP tasks.
      • Transformers - A library for state-of-the-art NLP models.
  • 🔒 Mathematics, Statistics & Probability