awesome-generative-ai-data-scientist

A curated list of 100+ resources for building and deploying generative AI, focused on helping you become a Generative AI Data Scientist with LLMs.
https://github.com/business-science/awesome-generative-ai-data-scientist


Free Training
- NVIDIA
  - Generative AI Data Scientist Workshops - science.io/ai-register)
Data Science And AI Agents
- Microsoft Data Formulator - formulator)
- Jupyter Agent - agents/jupyter-agent)
- Jupyter AI - ai.readthedocs.io/en/latest/) | [GitHub](https://github.com/jupyterlab/jupyter-ai)
- PandasAI - ai.com/) | [GitHub](https://github.com/sinaptik-ai/pandas-ai)
- WrenAI - Open-source GenBI AI Agent. Text2SQL made Easy! | [Documentation](https://docs.getwren.ai/oss/overview/introduction) | [GitHub](https://github.com/Canner/WrenAI)
- Google GenAI Toolbox for Databases - Open-source server that makes it easier to build Gen AI tools for interacting with databases. | [Blog](https://cloud.google.com/blog/products/ai-machine-learning/announcing-gen-ai-toolbox-for-databases-get-started-today) | [Documentation](https://googleapis.github.io/genai-toolbox/getting-started/introduction/) | [GitHub](https://github.com/googleapis/genai-toolbox)
- Vanna AI - ai/vanna)
Web Parsing (HTML) and Web Crawling
- Scrapling - Fast, and Adaptive Web Scraping for Python. | [GitHub](https://github.com/D4Vinci/Scrapling)
- Firecrawl - ready markdown or structured data. Scrape, crawl, and extract with a single API. | [Documentation](https://docs.firecrawl.dev/) | [GitHub](https://github.com/mendableai/firecrawl)
- GPT Crawler - gpt) | [GitHub](https://github.com/BuilderIO/gpt-crawler)
- Gitingest
- Crawl4AI - Open-source, blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. | [Documentation](https://crawl4ai.com/mkdocs/) | [GitHub](https://github.com/unclecode/crawl4ai)
- ScrapeGraphAI - ai)
LLM Memory
- Memobase - Based Memory for GenAI Apps. | [Documentation](https://docs.memobase.io/introduction) | [GitHub](https://github.com/memodb-io/memobase)
- Memary
- Mem0 - improving memory layer for LLM applications, enabling personalized AI experiences that save costs and delight users. | [Documentation](https://docs.mem0.ai/) | [GitHub](https://github.com/mem0ai/mem0)
AI Frameworks (Build Your Own)
- Pocket Flow - line minimalist LLM framework for Agents, Task Decomposition, RAG, etc. | [Documentation](https://the-pocket.github.io/PocketFlow/) | [GitHub](https://github.com/The-Pocket/PocketFlow)
- Google GenAI - genai/) | [GitHub](https://github.com/googleapis/python-genai)
- LlamaIndex Workflows - workflows-beta-a-new-way-to-create-complex-ai-applications-with-llamaindex)
- LlamaIndex - augmented generative AI applications with LLMs. | [Documentation](https://docs.llamaindex.ai/) | [GitHub](https://github.com/run-llama/llama_index)
- CrewAI
- AutoGen
- Pydantic AI - grade applications with Generative AI less painful. | [GitHub](https://github.com/pydantic/pydantic-ai)
- FlatAI - ai)
- Llama Stack - stack.readthedocs.io/en/latest/index.html) | [GitHub](https://github.com/meta-llama/llama-stack)
- Haystack - Open-source AI orchestration framework for building customizable, production-ready LLM applications. | [Documentation](https://docs.haystack.deepset.ai/docs) | [GitHub](https://github.com/deepset-ai/haystack)
- Agency Swarm - Open-source agent orchestration framework built on top of the latest OpenAI Assistants API. | [Documentation](https://vrsen.github.io/agency-swarm/) | [GitHub](https://github.com/VRSEN/agency-swarm)
- AutoAgent - automated and highly self-developing framework that enables users to create and deploy LLM agents through natural language alone. | [GitHub](https://github.com/HKUDS/AutoAgent)
- Legion - agnostic framework designed to simplify the creation of sophisticated multi-agent systems. | [Documentation](https://legion.llmp.io/docs) | [GitHub](https://github.com/LLMP-io/Legion)
Huggingface Ecosystem
- Huggingface - Open-source platform for machine learning (ML) and artificial intelligence (AI) tools and models. | [Documentation](https://huggingface.co/docs)
- Sentence Transformers - to Python module for accessing, using, and training state-of-the-art text and image embedding models. | [Documentation](https://sbert.net/)
Prompt Improvement
- Microsoft PromptWizard - Aware Prompt Optimization Framework. | [GitHub](https://github.com/microsoft/PromptWizard)
- Promptify
- AutoPrompt - based Prompt Calibration. | [GitHub](https://github.com/Eladlev/AutoPrompt)
Table of Contents
- Nir Diamant GenAI Agents Hub
- AI Engineering Hub - world AI agent applications, LLM and RAG tutorials, with examples to implement. | [GitHub](https://github.com/patchy631/ai-engineering-hub/tree/main)
- AI Hedge Fund - powered hedge fund. | [GitHub](https://github.com/virattt/ai-hedge-fund)
- AI Financial Agent - financial-agent)
- Awesome LLM Apps - By-Step Tutorials. | [GitHub](https://github.com/Shubhamsaboo/awesome-llm-apps)
- Structured Report Generation (LangGraph) - End-to-end process of report planning, web research, and writing. Produces reports of varying and easily configurable formats. | [Video](https://www.youtube.com/watch?v=E04rFNtwFcA) | [Blog](https://blog.langchain.dev/structured-report-generation-blueprint/) | [Code](https://github.com/langchain-ai/langchain-nvidia/blob/main/cookbook/structured_report_generation.ipynb)
- Uber QueryGPT - TW/blog/query-gpt/)
- StockChat - Open-source alternative to Perplexity Finance. | [GitHub](https://github.com/clchinkc/stockchat)
LLMOps
- LangWatch - click. Drag and drop interface for LLMOps platform. | [Documentation](https://docs.langwatch.ai/) | [GitHub](https://github.com/langwatch/langwatch)
- MLflow
- LLMOps - python-package)
- Helicone - Open-source LLM observability platform for developers to monitor, debug, and improve production-ready applications. | [Documentation](https://docs.helicone.ai/) | [GitHub](https://github.com/Helicone/helicone)
- Agenta - Open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM Observability all in one place. | [Documentation](https://docs.agenta.ai/)
Testing and Monitoring (Observability)
- MLflow Tracing and Evaluation - evaluation/index.html) | [GitHub](https://github.com/mlflow/mlflow)
- Opik - Open-source platform for evaluating, testing, and monitoring LLM applications. | [GitHub](https://github.com/comet-ml/opik)
- LangSmith - grade LLM applications. It allows you to closely monitor and evaluate your application, so you can quickly and confidently ship. | [Documentation](https://docs.smith.langchain.com/) | [GitHub](https://github.com/langchain-ai/langsmith-sdk)
- Langfuse
Other
- AI Agent Service Toolkit - service-toolkit.streamlit.app/) | [GitHub](https://github.com/JoshuaC215/agent-service-toolkit)
- AI Suite
- AdalFlow - optimize LLM applications, from Chatbot, RAG, to Agent by SylphAI. | [GitHub](https://github.com/SylphAI-Inc/AdalFlow)
- dspy
- LiteLLM
- Microsoft Tiny Troupe - powered multiagent persona simulation for imagination enhancement and business insights. | [GitHub](https://github.com/microsoft/TinyTroupe)
- Distributed Llama - llama)
Agents and Tools (Build Your Own)
- Google Agent Development Kit (ADK) - Open-source, code-first Python toolkit for building, evaluating, and deploying sophisticated AI agents with flexibility and control. | [Documentation](https://google.github.io/adk-docs/) | [GitHub](https://github.com/google/adk-python)
- AutoGen AgentChat - guide/agentchat-user-guide/quickstart.html)
- smolagents
- LangChain Agents
- LangChain Tools
- Agentarium - Open-source framework for creating and managing simulations populated with AI-powered agents. It provides an intuitive platform for designing complex, interactive environments where agents can act, learn, and evolve. | [GitHub](https://github.com/Thytu/Agentarium)
RAG in R
- Microsoft Azure AI Services
- Google Vertex AI
- AWS Bedrock
- Ragnar - Augmented Generation (RAG) workflows. | [Website](https://tidyverse.github.io/ragnar/)
Building AI
- GitHub
Microsoft Azure
- Azure Generative AI Examples - examples/tree/main/sdk/python/generative-ai)
- Microsoft Generative AI for Beginners - ai-for-beginners)
- Microsoft Intro to Generative AI Course - us/training/paths/introduction-generative-ai/)
LLM Providers
- Meta Llama Models - Open-source AI model you can fine-tune, distill, and deploy anywhere. | [Meta](https://llama.meta.com/)
- Ollama
- Anthropic Claude - sdk-python)
- Google Gemini - gemini/generative-ai-python)
- Grok - python)
- OpenAI - python)
- Hugging Face Models
- OpenAI Agents - agent workflows. | [GitHub](https://github.com/openai/openai-agents-python)
Vector Databases (RAG)
- FAISS
- NVIDIA NIM - host GPU-accelerated inferencing microservices for pretrained and customized AI models across clouds, data centers, and workstations.
- ChromaDB - core/chroma)
- Pinecone - io/pinecone-python-client)
- Milvus - Open-source vector database built to power embedding similarity search and AI applications. | [GitHub](https://github.com/milvus-io/milvus)
- Qdrant - Performance Vector Search at Scale. | [Website](https://qdrant.tech/)
- SQLite Vec - vec)
LLM Models
- Ollama
- Anthropic Claude
AI LLM Frameworks
- LangChain
- LlamaIndex - augmented generative AI applications with LLMs.
- LlamaIndex Workflows - complex AI application we see our users building.
- LangGraph - actor applications with LLMs, used to create agent and multi-agent workflows.
LangChain Ecosystem
- LangGraph - actor applications with LLMs, used to create agent and multi-agent workflows. | [Documentation](https://langchain-ai.github.io/langgraph/) | [Tutorials](https://github.com/langchain-ai/langgraph/tree/main/docs/docs/tutorials)
- LangChain - ai/langchain) | [Cookbook](https://github.com/langchain-ai/langchain/tree/master/cookbook)
Cookbooks and Examples:
- LangChain Cookbook - End-to-end examples.
Amazon Web Services (AWS)
- GitHub
Cloud Examples:
- Amazon Bedrock Workshop
- Google Vertex AI Examples
- NVIDIA NIM Anywhere - sized labs and up to production environments.
- NVIDIA NIM Deploy
Google Cloud Platform (GCP)
- GitHub
NVIDIA
- NVIDIA NIM Anywhere - sized labs and up to production environments. | [GitHub](https://github.com/NVIDIA/nim-anywhere)
- NVIDIA NIM Deploy - deploy)
- Python AI/ML Tips - science/free-ai-tips)
- unwind ai
8-Week AI Bootcamp by Business Science
- Find out more about how to build AI with Python, and attend our free AI training session here.
LLM Models and Providers
- Hugging Face Models
Pretraining
- tinygrad
- micrograd
- PyTorch - Open-source machine learning library based on the Torch library, used for applications such as computer vision and natural language processing. | [Website](https://pytorch.org/)
- TensorFlow - Open-source machine learning library developed by Google. | [Website](https://www.tensorflow.org/)
- JAX - performance computing and automatic differentiation. | [GitHub](https://github.com/jax-ml/jax)
Fine-tuning
- Transformers - Hugging Face Transformers is a popular library for Natural Language Processing (NLP) tasks, including fine-tuning large language models.
- Unsloth - 3.5 & Gemma 2-5x faster with 80% less memory! | [GitHub](https://github.com/unslothai/unsloth)
- LitGPT - performance LLMs with recipes to pretrain, finetune, and deploy at scale. | [GitHub](https://github.com/Lightning-AI/litgpt)
- AutoTrain - tuning of LLMs and other machine learning tasks. | [GitHub](https://github.com/huggingface/autotrain-advanced)
Document Parsing
- Embedchain - started/quickstart) | [GitHub](https://github.com/mem0ai/mem0/tree/main/embedchain)
- Docling by IBM
- Markitdown by Microsoft
- DocETL - powered data processing and ETL. | [Documentation](https://ucbepic.github.io/docetl/) | [GitHub](https://github.com/ucbepic/docetl)
- LangChain Document Loaders
- Unstructured.io - tuning. | [Documentation](https://docs.unstructured.io/welcome) | [GitHub](https://github.com/Unstructured-IO/unstructured) | [Paper](https://www.iarpa.gov/images/PropsersDayPDFs/BENGAL/Unstructured.io%20Federal%20Capabilities%20Statement%20for%20IARPA.pdf)
Miscellaneous
- Pyspur - Based Editor for LLM Workflows
- Browser-Use
- AWS Bedrock - performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon
AI Frameworks (Drag and Drop)
- AutoGen Studio - code interface to rapidly prototype AI agents, enhance them with tools, compose them into teams, and interact with them to accomplish tasks. Built on AutoGen AgentChat. | [Documentation](https://microsoft.github.io/autogen/stable/user-guide/autogenstudio-user-guide/index.html)
- LangGraph Studio - ai/langgraph-studio)
- Pyspur - Based Editor for LLM Workflows. | [Documentation](https://docs.pyspur.dev/introduction) | [GitHub](https://github.com/PySpur-Dev/PySpur)
- n8n - code workflow automation platform with native AI capabilities. Combine visual building with custom code, self-host or cloud, 400+ integrations. | [Documentation](https://docs.n8n.io/) | [GitHub](https://github.com/n8n-io/n8n)
- Langflow - code tool that makes building powerful AI agents and workflows that can use any API, model, or database easier. | [Documentation](https://docs.langflow.org/) | [GitHub](https://github.com/langflow-ai/langflow)
Open Source LLM Models
- DeepSeek-R1
- Qwen
- Llama - llama/llama)
- DeepSeek-R1 - ai/DeepSeek-R1)
Code Sandbox (Security)
- AutoGen Docker Code Executor
- E2B - Open-source runtime for executing AI-generated code in secure cloud sandboxes. Made for agentic & AI use cases. | [Documentation](https://e2b.dev/docs) | [GitHub](https://github.com/e2b-dev/e2b)
Browser Control Agents
- Browser-Use - use.com/) | [GitHub](https://github.com/browser-use/browser-use)
- WebUI - use` functionalities. This UI is designed to be user-friendly and enables easy interaction with the browser agent. | [GitHub](https://github.com/browser-use/web-ui)
- WebRover - powered web agent that combines autonomous browsing with advanced research capabilities. | [GitHub](https://github.com/hrithikkoduri/WebRover)
Curated Python AI, Data Science, and ML Compilations
- Best of ML Python - tooling/best-of-ml-python)
- Awesome Python Data Science - python-data-science)
- LLM Engineer Toolkit - NLP/llm-engineer-toolkit)
- Awesome Production Machine Learning - production-machine-learning)
- Awesome AI Agents - dev/awesome-ai-agents)
LangGraph Extensions
- LangGraph Prebuilt Agents - ai.github.io/langgraph/prebuilt/)
- LangMem - term memory. | [GitHub](https://github.com/langchain-ai/langmem)
- LangGraph Supervisor - agent systems using LangGraph. | [GitHub](https://github.com/langchain-ai/langgraph-supervisor)
- Open Deep Research - Open-source assistant that automates research and produces customizable reports on any topic. | [GitHub](https://github.com/langchain-ai/open_deep_research)
- LangGraph Reflection - style architecture to check and improve an initial agent's output. | [GitHub](https://github.com/langchain-ai/langgraph-reflection)
- LangGraph Big Tool - ai/langgraph-bigtool)
- LangGraph CodeAct - ai/langgraph-codeact)
- LangGraph Swarm - style multi-agent systems using LangGraph. Agents dynamically hand off control to one another based on their specializations. | [GitHub](https://github.com/langchain-ai/langgraph-swarm-py)
- LangChain MCP Adapters - ai/langchain-mcp-adapters)
- AI Data Science Team - powered data science team of agents to help you perform common data science tasks 10X faster. | [GitHub](https://github.com/business-science/ai-data-science-team)
Paid Courses
- Enroll Here
Huggingface Platform
- Tokenizers
Agents and Tools (Prebuilt)
- Phidata - Open-source platform to build, ship and monitor agentic systems. | [Documentation](https://docs.phidata.com/) | [GitHub](https://github.com/phidatahq/phidata)
- Composio
- Agno (Formerly Phidata) - Open-source platform to build, ship and monitor agentic systems. | [Documentation](https://docs.agno.com/) | [GitHub](https://github.com/agno-agi/agno)
Coding Agents
- Qwen-Agent - Agent/tree/main/docs) | [Examples](https://github.com/QwenLM/Qwen-Agent/tree/main/examples) | [GitHub](https://github.com/QwenLM/Qwen-Agent)
Deep Research Agents
- HuggingFace OpenDeepResearch - deep-research) | [Example](https://github.com/huggingface/smolagents/blob/gaia-submission-r1/examples/open_deep_research/visual_vs_text_browser.ipynb) | [GitHub](https://github.com/huggingface/smolagents/tree/gaia-submission-r1/examples/open_deep_research)
- OpenDeepResearcher
Other Popular Interfaces to LLM Models in R
- tidychatmodels - rapp.de/)
- tidyllm - compatible APIs. | [Website](https://edubruell.github.io/tidyllm/)
- gemini.R
- ollama-r - r/)
- rollama
- chatgpt
- groqR - fast LPU (Language Processing Unit) technology directly to your R workflow. | [Website](https://gabrielkaiserqfin.github.io/groqR)
- gptstudio
- llmR
Curated AI, ML, Data Science Lists
- LLM tools for R - book/r-pkgs.html)
Ellmer-Verse
- ellmer
- hellmer
- chores - to-automate tasks quickly. | [Documentation](https://simonpcouch.github.io/chores/)
- ggpal
- gander - performance and low-friction chat experience for data scientists in RStudio and Positron, sort of like completions with Copilot, but it knows how to talk to the objects in your R environment. | [Documentation](https://simonpcouch.github.io/gander/)
mlverse
- mall - wise over a specified column. | [Website](https://mlverse.github.io/mall/)
- lang - the-fly. | [Website](https://mlverse.github.io/lang/)
- chattr

Programming Languages

Python 55 R 16 Jupyter Notebook 15 Go 4 TypeScript 4 C++ 3 Rust 2 C 1
