An open API service indexing awesome lists of open source software.

awesome-data-analysis

🚀📊 400+ curated resources for data analysis and data science: Python, SQL, ML, Visualization, Dashboards, Cheatsheets, Roadmaps, and Interview Prep. Perfect for beginners and pros!
https://github.com/pavelgrigoryevds/awesome-data-analysis


  • 🕸️ Web Scraping & Crawling

    • Resources

    • Tools

      • BeautifulSoup - A library for parsing HTML and XML documents.
      • Selenium - A tool for automating web applications for testing purposes.
      • Gerapy - Distributed Crawler Management Framework based on Scrapy, Scrapyd, Django, and Vue.js.
      • TextAttack - A Python framework for adversarial attacks, data augmentation, and model training in NLP.
      • AutoScraper - A smart, automatic, fast, and lightweight web scraper for Python.
      • Feedparser - A library to parse feeds in Python.
      • Trafilatura - A Python & command-line tool to gather text and metadata on the web.
      • You-Get - A tiny command-line utility to download media contents (videos, audios, images) from the web.
      • Dirsearch - A web path scanner.
      • MechanicalSoup - A Python library for automating interaction with websites.
      • ScrapeGraph AI - A Python scraper based on AI.
      • Snscrape - A social networking service scraper in Python.
  • 📈 Dashboards

  • 🗺️ Roadmaps

  • 🐍 Python

    • Resources

    • Data Manipulation with Pandas

    • Useful Python Tools for Data Analysis

      • Pandas-dq - Data type correction and automatic DataFrame cleaning.
      • Vaex - High-performance Python library for lazy Out-of-Core DataFrames.
      • DataCleaner - Python tool for automatically cleaning and preparing datasets.
      • TheFuzz - Fuzzy string matching (Levenshtein distance).
      • PandasAI - Conversational data analysis using LLMs and RAG.
      • DateUtil - Extensions for standard Python datetime features.
      • Fugue - Unified interface for Pandas, Spark, and Dask.
      • Pandas-DataReader - Reads data from various online sources into pandas DataFrames.
      • sklearn-pandas - Bridge between Pandas and Scikit-learn.
      • Polars - Multithreaded, vectorized query engine for DataFrames (Rust-powered).
      • fitter - Figures out the distribution your data comes from.
      • Arrow - Improved handling of dates and times.
      • Pendulum - Alternative to datetime with timezone support.
      • AutoViz - Automatic data visualization in 1 line of code.
      • Datashader - Quickly and accurately render even the largest data.
      • Vizro - Low-code toolkit for building high-quality data visualization apps.
      • Great Tables - Create awesome display tables using Python.
      • DataMapPlot - Create beautiful plots of data maps.
      • Sweetviz - Automatic EDA with dataset comparison.
      • Lux - Automatic DataFrame visualization in Jupyter with a click.
      • Yellowbrick - A suite of visual diagnostic tools for machine learning, extending the Scikit-Learn API.
      • PyOD - Python library for outlier and anomaly detection.
      • YData Profiling - 1 line of code data quality profiling & exploratory data analysis.
      • Missingno - Visualize missing data patterns in matrix format.
      • Alibi-detect - Algorithms for outlier, adversarial and drift detection.
      • Dora - Automate EDA: preprocessing, feature engineering, visualization.
      • FeatureTools - Open-source automated feature engineering.
      • Feature Selector - Tool for dimensionality reduction of machine learning datasets.
      • TSFresh - A Python library for automatically extracting features from time series data.
      • Prince - A Python library for multivariate exploratory data analysis, including PCA, CA, MCA, and more.
      • Factor Analyzer - A Python package for factor analysis, including exploratory and confirmatory methods.
      • Pytest - Framework for writing small tests.
      • Feature Engine - A feature engineering library with Scikit-Learn compatibility.
      • Cerberus - Data validation through schemas.
      • Pandera - Data validation through declarative schemas.
      • PandasVet - Code style validator for Pandas (similar to ESLint).
      • Prefect - Workflow orchestration for building resilient data pipelines.
      • Airflow - Platform for automating data workflows.
      • Apache Arrow - Universal columnar format and multi-language toolbox for fast data interchange.
      • Petl - ETL tool for data cleaning and transformation.
      • DuckDB - In-memory analytical database for fast SQL queries.
      • D-Tale - Interactive GUI for data analysis in a browser.
      • Pandasgui - GUI for viewing and filtering DataFrames.
      • QGrid - Interactive grid for sorting, filtering, and editing DataFrames in Jupyter.
      • PyGWalker - Interactive UIs for visual analysis of pandas DataFrames.
      • Pivottablejs - Interactive PivotTable.js tables in Jupyter.
      • Faker - Generates fake data for testing.
      • Mimesis - Generates realistic test data.
      • Rich - Rich text and beautiful formatting in the terminal.
      • Pandas-log - Logs pandas operations for data transformation tracking.
      • Icecream - Debugging without using print.
      • Pydeps - Python module dependency graphs.
      • PyForest - Automated Python imports for data science.
      • Pandarallel - Parallel operations for pandas DataFrames.
      • Dask - Parallel computing for arrays and DataFrames.
      • Modin - Speeds up Pandas by distributing computations.
      • Sphinx - The Sphinx documentation generator.
      • Pdoc - API documentation for Python projects.
      • Mkdocs - Project documentation with Markdown.
      • OpenPyXL - Read/write Excel files with support for advanced features.
      • Tablib - Exports data to XLSX, JSON, CSV via a single API.
      • PyPDF2 - Reads and writes PDF files.
      • Python-docx - Reads and writes Word documents.
      • CleverCSV - Smart CSV reader for messy data.
      • Xlwings - Integration of Python with Excel.
      • Xmltodict - Converts XML to Python dictionaries.
      • Python-markdownify - Convert HTML to Markdown.
      • MarkItDown - Python tool for converting files and office documents to Markdown.
      • Pillow - Image processing library.
      • Ftfy - Fixes broken Unicode strings.
      • Dataset - JSON-like interface for working with SQL databases.
      • JmesPath - Queries JSON data (SQL-like for JSON).
      • Glom - Transforms nested data structures.
      • Pampy - Pattern matching for Python dictionaries.
      • Geopy - Geocoding addresses and calculating distances.
      • Diagrams - Diagrams as code for cloud system architecture prototyping.
      • Scattertext - Beautiful visualizations of language differences among document types.
      • Pygorithm - A Python module for learning all major algorithms.
      • IGraph - A library for creating and manipulating graphs and networks, with bindings for multiple languages.
      • Joblib - A lightweight pipelining library for Python, particularly useful for saving and loading large NumPy arrays.
  • 🏆 Awesome Data Science Repositories

  • 🗃️ SQL & Databases

  • 📊 Data Visualization

    • Resources

    • Tools

      • Matplotlib - A comprehensive library for creating static, animated, and interactive visualizations in Python.
      • Seaborn - A statistical data visualization library based on Matplotlib.
      • Plotly - A library for creating interactive plots and dashboards.
      • Bokeh - A library for creating interactive visualizations for modern web browsers.
      • HoloViews - A tool for building complex visualizations easily.
      • Geopandas - An extension of Pandas for geospatial data.
      • Folium - A library for visualizing data on interactive maps.
      • Altair - A declarative statistical visualization library for Python.
      • Plotnine - A grammar of graphics for Python.
      • Bqplot - A plotting library for IPython/Jupyter notebooks.
      • PyPalettes - A large (2,500+) collection of color maps for Python.
  • 📖 Natural Language Processing (NLP)

    • Resources

    • Tools

      • Natural Language Toolkit (NLTK) - A leading platform for building Python programs to work with human language data.
      • TextBlob - A simple library for processing textual data.
      • SpaCy - An open-source software library for advanced NLP in Python.
      • TextRank - A library for TextRank algorithm implementation.
      • Flair - A simple framework for state-of-the-art NLP.
      • BERT - A transformer-based model for NLP tasks.
      • Transformers - A library for state-of-the-art NLP models.
  • 🔢 Mathematics, Statistics & Probability