An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with data-mining

A curated list of projects in awesome lists tagged with data-mining.

https://github.com/bulutyazilim/awesome-datascience

An awesome Data Science repository for learning data science and applying it to real-world problems.

analytics awesome-list data-mining data-science data-scientists data-visualization deep-learning hacktoberfest machine-learning science

Last synced: 17 Jun 2025

https://github.com/jaidedai/easyocr

Ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.

cnn crnn data-mining deep-learning easyocr image-processing information-retrieval lstm machine-learning ocr optical-character-recognition python pytorch scene-text scene-text-recognition

Last synced: 17 Nov 2025

https://github.com/eriklindernoren/ml-from-scratch

Machine Learning From Scratch. Bare bones NumPy implementations of machine learning models and algorithms with a focus on accessibility. Aims to cover everything from linear regression to deep learning.

data-mining data-science deep-learning deep-reinforcement-learning genetic-algorithm machine-learning machine-learning-from-scratch

Last synced: 11 May 2025

https://github.com/microsoft/lightgbm

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

data-mining decision-trees distributed gbdt gbm gbrt gradient-boosting kaggle lightgbm machine-learning microsoft parallel python r

Last synced: 09 Sep 2025

https://github.com/rasbt/python-machine-learning-book

The "Python Machine Learning (1st edition)" book code repository and info resource

data-mining data-science logistic-regression machine-learning machine-learning-algorithms neural-network python scikit-learn

Last synced: 14 May 2025

https://github.com/tangyudi/ai-learn

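An artificial intelligence learning roadmap that collects nearly 200 hands-on case studies and projects, with free accompanying course materials, taking complete beginners through to job-ready practice. Covers popular areas including Python, mathematics, machine learning, data analysis, data mining, deep learning, computer vision, natural language processing, and algorithms, with tools such as PyTorch, TensorFlow, Caffe, Keras, NumPy, pandas, Matplotlib, and seaborn.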

algorithm artificial-intelligence caffe cv data-analysis data-mining data-science deep-learning keras machine-learning mathematics matplotlib nlp numpy pandas python pytorch seaborn tensorflow tensorflow2

Last synced: 14 May 2025

https://github.com/catboost/catboost

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

big-data catboost categorical-features coreml cuda data-mining data-science decision-trees gbdt gbm gpu gpu-computing gradient-boosting kaggle machine-learning python r tutorial

Last synced: 12 May 2025

https://github.com/rasbt/mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.

association-rules data-mining data-science machine-learning python supervised-learning unsupervised-learning

Last synced: 13 May 2025

https://github.com/microsoft/rd-agent

Research and development (R&D) is crucial for the enhancement of industrial productivity, especially in the AI era, where the core aspects of R&D are mainly focused on data and models. We are committed to automating these high-value generic R&D processes through our open source R&D automation tool RD-Agent, which lets AI drive data-driven AI.

agent ai automation data-mining data-science development llm research

Last synced: 12 May 2025

https://github.com/deanmalmgren/textract

Extract text from any document. No muss, no fuss.

data-mining natural-language-processing python text-mining

Last synced: 12 May 2025

https://github.com/alibaba/alink

Alink is a machine learning algorithm platform based on Flink, developed by the PAI team of the Alibaba computing platform.

apriori classification clustering data-mining feature-engineering flink flink-machine-learning flink-ml fm graph-algorithms graph-embedding kafka machine-learning recommender recommender-system regression statistics word2vec xgboost

Last synced: 14 May 2025

https://github.com/automeris-io/webplotdigitizer

Computer vision assisted tool to extract numerical data from plot images.

charts computer-vision data-mining html javascript reverse-engineering visualization webplotdigitizer

Last synced: 18 Dec 2025

https://github.com/dblalock/bolt

10x faster matrix and vector operations

compression data-mining database machine-learning

Last synced: 15 May 2025

https://github.com/wzbsocialsciencecenter/pdftabextract

A set of tools for extracting tables from PDF files, intended to help with data mining on (OCR-processed) scanned documents.

data-mining image-processing ocr pdf python tables

Last synced: 14 May 2025

https://github.com/invoice-x/invoice2data

Extract structured data from PDF invoices

data-mining python

Last synced: 14 May 2025

https://github.com/paddlepaddle/research

Novel deep learning research work with PaddlePaddle

computer-vision data-mining deep-learning knowledge-graph nlp spatial-temporal

Last synced: 15 May 2025

https://github.com/404notf0und/AI-for-Security-Learning

Security scenarios, AI-based security algorithms, and industry practices in security data analysis

data-analysis data-mining machine-learning security

Last synced: 27 Apr 2025

https://github.com/yimeng-zhang/feature-engineering-and-feature-selection

A Guide for Feature Engineering and Feature Selection, with implementations and examples in Python.

data-mining feature-engineering feature-extraction feature-selection machine-learning python

Last synced: 16 May 2025

https://github.com/demidovakatya/vvedenie-mashinnoe-obuchenie

A collection of machine learning resources

collections data-mining data-science deep-learning machine-learning mooc neural-networks nlp russian university

Last synced: 26 Jan 2026

https://github.com/ebay/tsv-utils

eBay's TSV Utilities: Command line tools for large, tabular data files. Filtering, statistics, sampling, joins and more.

cli command-line csv d data-mining data-science delimited-files dlang reservoir-sampling sampling shuffle statistics tabular-data tsv uniq

Last synced: 27 Jan 2026

https://github.com/circl/ail-framework

AIL framework - Analysis Information Leak framework. Project moved to https://github.com/ail-project

ail-framework analysis data-mining information-leak information-security leak privacy security security-incidents

Last synced: 14 May 2025

https://github.com/patmartin/dex

Dex: The Data Explorer. A data visualization tool written in Java/Groovy/JavaFX, capable of powerful ETL and of publishing web visualizations.

d3 d3js data-analysis data-mining data-science data-visualization datavis datavisualization dataviz groovy java javafx visualization

Last synced: 16 May 2025

https://github.com/alan-turing-institute/clevercsv

CleverCSV is a Python package for handling messy CSV files. It provides a drop-in replacement for the builtin CSV module with improved dialect detection, and comes with a handy command line application for working with CSV files.

csv csv-converter csv-export csv-files csv-format csv-import csv-parser csv-parsing csv-reader csv-reading data-analysis data-mining data-science datascience machine-learning python python-library python3

Last synced: 13 May 2025

https://github.com/lightaime/deep_gcns_torch

PyTorch repo for DeepGCNs (ICCV'2019 Oral, TPAMI'2021), DeeperGCN (arXiv'2020), and GNN1000 (ICML'2021): https://www.deepgcns.org

3d-point-clouds bioinformatics cheminformatics computer-vision data-mining deep-gcns deep-learning geometric-deep-learning graph-convolutional-networks graph-neural-networks pytorch science-research social-network

Last synced: 16 May 2025

https://github.com/k0lb3/unitypy

UnityPy is a Python module that makes it possible to extract/unpack and edit Unity assets

assetstudio data-mining python python3 unity unity-asset unity-asset-extractor unitypack

Last synced: 13 May 2025

https://github.com/WenjieDu/PyPOTS

A Python toolkit/library for reality-centric machine/deep learning and data mining on partially-observed time series, including SOTA neural network models for scientific analysis tasks (imputation, classification, clustering, forecasting, anomaly detection, and cleaning) on incomplete, irregularly sampled industrial multivariate time series with NaN missing values

classification clustering data-mining data-science deep-learning forecasting healthcare imputation incomplete industrial interpolation machine-learning missing-values missingness neural-network partially-observed-time-series pytorch science-research time-series time-series-analysis

Last synced: 01 Apr 2025

https://github.com/GoogleCloudPlatform/DataflowJavaSDK

Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.

big-data data-analysis data-mining data-processing data-science google-cloud-dataflow

Last synced: 01 May 2025

https://github.com/jerlendds/osintbuddy

Node graphs, OSINT data mining, and plugins. Connect unstructured and public data for transformative insights. The rewrite can be found at osintbuddy/osintbuddy.

data-mining data-visualization information-gathering node-graph ontology osint osint-python plugin-system plugins python3 reconnaissance typescript

Last synced: 18 Jul 2025

https://github.com/ipython-books/cookbook-2nd-code

Code of the IPython Cookbook, Second Edition, by Cyrille Rossant, Packt Publishing 2018 [read-only repository]

computing data-analysis data-mining data-science data-visualization ipython jupyter jupyter-notebook machine-learning numerical-computation python visualization

Last synced: 12 Apr 2025

https://github.com/ashishpatel26/amazing-feature-engineering

Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to improve the performance of machine learning algorithms. Feature engineering can be considered as applied machine learning itself.

data-analysis data-mining data-science data-scientists data-visualization deep-learning feature-engineering feature-extraction feature-scaling feature-selection features machine-learning scikit-learn

Last synced: 16 May 2025

https://github.com/ail-project/ail-framework

AIL framework - Analysis Information Leak framework

ail-framework data-mining information-extraction information-security leak

Last synced: 15 May 2025

https://github.com/chris-greening/instascrape

Powerful and flexible Instagram scraping library for Python, providing easy-to-use and expressive tools for accessing data programmatically

beginner-friendly data-mining data-science instagram instagram-data instagram-scraper lightweight python python-scraper python3 webscraping

Last synced: 07 Apr 2025

https://github.com/chaoss/grimoirelab

GrimoireLab: platform for software development analytics and insights

chaoss data-mining data-visualization grimoirelab insights metrics software-analytics

Last synced: 21 Jan 2026

https://github.com/holgerbrandl/krangl

krangl is a {K}otlin DSL for data w{rangl}ing

data-mining datascience java kotlin sql

Last synced: 11 Apr 2025

https://chaoss.github.io/grimoirelab/

GrimoireLab: platform for software development analytics and insights

chaoss data-mining data-visualization grimoirelab insights metrics software-analytics

Last synced: 03 Apr 2025

https://github.com/jchao01/TradingView-data-scraper

Extract price and indicator data from TradingView charts to create ML datasets

algorithmic-trading data-mining json tradingview webscraping

Last synced: 26 Mar 2025

https://github.com/serengil/chefboost

A lightweight decision tree framework for Python with categorical feature support. It covers the regular algorithms (ID3, C4.5, CART, CHAID, and regression trees) and some advanced techniques (gradient boosting, random forest, and AdaBoost).

adaboost c45-trees cart categorical-features data-mining data-science decision-trees gbdt gbm gbrt gradient-boosting gradient-boosting-machine gradient-boosting-machines id3 kaggle machine-learning python random-forest regression-tree

Last synced: 14 May 2025

https://github.com/CogComp/cogcomp-nlp

CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.

big-data cogcomp data-mining dependency-parsing lemmatization lemmatizer named-entity-recognition natural-language-processing natural-language-understanding ner nlp parts-of-speech-tagging pos pos-tagging relation-extraction similarity tokenizer transliteration

Last synced: 27 Mar 2025

https://github.com/desbordante/desbordante-core

Desbordante is a high-performance data profiler capable of discovering many different patterns in data using various algorithms. It also allows running data cleaning scenarios using these algorithms. Desbordante has a console version and an easy-to-use web application.

anomaly-detection correlations data-analytics data-cleaning data-cleansing data-engineering data-exploration data-mining data-mining-algorithms data-preprocessing data-profiling data-science data-wrangling exploratory-data-analysis feature-engineering feature-extraction feature-selection knowledge-discovery spreadsheets tabular-data

Last synced: 22 Nov 2025

https://github.com/chuanconggao/PrefixSpan-py

The shortest yet efficient Python implementation of the sequential pattern mining algorithm PrefixSpan, closed sequential pattern mining algorithm BIDE, and generator sequential pattern mining algorithm FEAT.

bide data-mining feat pattern-mining prefixspan

Last synced: 26 Mar 2025

https://fraud-detection-handbook.github.io/fraud-detection-handbook/

Reproducible Machine Learning for Credit Card Fraud Detection - Practical Handbook

credit-card credit-card-fraud data-mining data-science fraud-detection machine-learning open-data

Last synced: 19 Nov 2025

https://github.com/matrix-profile-foundation/matrixprofile

A Python 3 library making time series data mining tasks, utilizing matrix profile algorithms, accessible to everyone.

algorithms anomaly-detection clustering data-mining data-science hacktoberfest matrixprofile motif-discovery python python2 python3 segmentation time-series time-series-analysis

Last synced: 16 May 2025

https://github.com/ScriptSmith/reaper

Social media scraping / data collection tool for the Facebook, Twitter, Reddit, YouTube, Pinterest, and Tumblr APIs

api data-collection data-mining data-scraping facebook gui pinterest reddit scraping socialmedia tumblr twitter youtube

Last synced: 04 Apr 2025
