An open API service indexing awesome lists of open source software.

https://github.com/defrecord/kbol


https://github.com/defrecord/kbol

Last synced: about 1 year ago
JSON representation

Awesome Lists containing this project

README

          

#+TITLE: kbol - Knowledge Base from Technical Books
#+AUTHOR: Jason Walsh
#+EMAIL: j@wal.sh
#+DATE: 2024-11-15

* Overview
A focused RAG (Retrieval Augmented Generation) system for programming books that lets you query and explore your technical library using natural language.

#+begin_src mermaid :file doc/images/rag-flow.png
flowchart LR
PDFs[Technical Books] --> Chunker[Text Chunker]
Chunker --> Embedder[Ollama Embedder]
Embedder --> VectorDB[(Vector DB)]
Query[User Query] --> Semantic[Semantic Search]
VectorDB --> Semantic
Semantic --> LLM[Ollama LLM]
LLM --> Response[Response]
#+end_src

* System Architecture
** Component Overview
#+begin_src mermaid :file doc/images/structure.png
graph TD
CLI[Command Line Interface] --> Core[Core Engine]
Core --> Indexer[Document Indexer]
Core --> Search[Search Engine]
Core --> Embedder[Embedding Service]
Core --> LLM[LLM Service]

Indexer --> Chunker[Text Chunker]
Indexer --> DocTracker[Document Tracker]

Search --> DB[(Vector Database)]

subgraph Processing Pipeline
Chunker --> Embedder
Embedder --> DB
end

subgraph Query Pipeline
Search --> LLM
DB --> Search
end
#+end_src

* Usage
** Processing Books
:PROPERTIES:
:header-args:python: :results output :exports both
:END:

#+begin_src python
from kbol.indexer import BookIndexer
from pathlib import Path

indexer = BookIndexer()
books_dir = Path("data/books")
results = await indexer.process_books(books_dir)
print(f"Processed {len(results)} chunks")
#+end_src

** Example Queries
:PROPERTIES:
:header-args:bash: :tangle scripts/example_queries.sh :mkdirp yes :comments org
:END:

Below are powerful example queries that leverage the diverse technical and theoretical knowledge in the book collection:

*** Technical Architecture Queries
#+begin_src bash
# Technical Architecture Patterns
poetry run kbol query "Compare event-driven, flow-based, and microservice architectural patterns"
poetry run kbol query "What are the key metrics for evaluating microservices architecture success?"
poetry run kbol query "How does Domain-Driven Design evolve when moving from monoliths to microservices?"
#+end_src

*** Functional Programming Deep Dives
#+begin_src bash
# Functional Programming Concepts
poetry run kbol query "Explain monads using examples from Clojure and Scala"
poetry run kbol query "Compare approaches to immutability and state management in Python vs Clojure"
poetry run kbol query "Show patterns for combining functional and reactive programming"
#+end_src

*** Cross-Disciplinary Insights
#+begin_src bash
# Cross-Domain Knowledge Synthesis
poetry run kbol query "How do behavioral science insights inform better API design?"
poetry run kbol query "What principles from critical thinking apply to software architecture?"
poetry run kbol query "How do concepts from systems thinking apply across microservices and macroeconomics?"
#+end_src

*** Practical Development Examples
#+begin_src bash
# Practical Implementation Patterns
poetry run kbol query "Show me Clojure examples of map and reduce with real-world use cases"
poetry run kbol query "What are the best practices for testing event-driven microservices?"
poetry run kbol query "Compare Python and Clojure approaches to handling concurrency"
#+end_src

*** API and Integration Patterns
#+begin_src bash
# API Design and Evolution
poetry run kbol query "What patterns emerge for managing API evolution in microservices?"
poetry run kbol query "How do micro-frontends impact API design and management?"
poetry run kbol query "Compare REST, GraphQL, and event-driven API patterns"
#+end_src

These queries demonstrate the system's ability to synthesize knowledge across:
- Software architecture and systems design
- Functional and reactive programming paradigms
- Cross-disciplinary insights from philosophy and economics
- Practical development patterns and best practices
- Modern API and integration approaches

When executed, =C-c C-v t= will tangle these examples to =scripts/example_queries.sh=, creating a ready-to-use script of example queries.
** Example Knowledge Base Queries
:PROPERTIES:
:header-args:bash: :results output :exports both
:END:

*** ML/AI Foundations
#+begin_src bash :tangle data/answers/foundations_mlp.sh :mkdirp t
# Check ML Engineering fundamentals
time poetry run kbol query "What does the book 'Machine Learning Engineering with Python' cover about MLOps and production deployment?" | tee data/answers/foundations-mlp.md | head
#+end_src

#+RESULTS:
#+begin_example

Found relevant content in:
• MLOps_Engineering_at_Scale
• machinelearningengineeringwithpython_secondedition
• tinymlcookbook

Generating response...
╭─────────────────────────────────── Answer ───────────────────────────────────╮
│ The book "Machine Learning Engineering with Python" by Andrew P. McMahon │
│ covers various aspects of ML Ops (MLOps) as well as how to deploy machine │
#+end_example

#+begin_src bash :tangle data/answers/foundations_fe.sh :mkdirp t
# Verify feature engineering coverage
poetry run kbol query "What practical examples are shown in 'Python Feature Engineering Cookbook'?" | tee data/answers/foundations-fe.md | head
#+end_src

#+RESULTS:
#+begin_example

Found relevant content in:
• 15mathconceptseverydatascientistshouldknow
• Deep_Learning_with_Python
• Practices_of_the_Python_Pro
• machinelearningengineeringwithpython_secondedition

Generating response...
╭─────────────────────────────────── Answer ───────────────────────────────────╮
│ The "Python Feature Engineering Cookbook" provides a range of real-world, │
#+end_example

*** LLM and NLP Content
#+begin_src bash
# Compare LLM books
poetry run kbol query "How do 'Building LLM Powered Applications' and the 'LLM Engineer's Handbook' differ in their coverage of LLM implementation?"
#+end_src

#+begin_src bash
# Explore NLP progression
poetry run kbol query "How does 'Mastering NLP from Foundations to LLMs' structure the learning path from basic NLP to advanced LLMs?"
#+end_src

*** Specialized ML Techniques
#+begin_src bash
# XGBoost applications
poetry run kbol query "What are the key concepts covered in 'XGBoost for Regression Predictive Modeling' regarding time series analysis?"
#+end_src

#+begin_src bash
# Genetic algorithms
poetry run kbol query "What practical Python examples are provided in 'Hands-On Genetic Algorithms with Python'?"
#+end_src

*** Deep Learning and PyTorch
#+begin_src bash
# PyTorch mastery
poetry run kbol query "What advanced PyTorch concepts are covered in 'Mastering PyTorch' versus basic implementations?"
#+end_src

#+begin_src bash
# Compare implementations
poetry run kbol query "How does 'Machine Learning with PyTorch and Scikit-Learn' approach model development differently from 'Mastering PyTorch'?"
#+end_src

*** RAG and Vector Systems
#+begin_src bash
# RAG implementation
poetry run kbol query "How does 'RAG-Driven Generative AI' approach vector databases and embedding strategies?"
#+end_src

#+begin_src bash
# Architecture considerations
poetry run kbol query "What are the main architectural patterns discussed in 'RAG-Driven Generative AI' for building production RAG systems?"
#+end_src

*** Causal Inference
#+begin_src bash
# Compare approaches
poetry run kbol query "Compare the approaches to causal inference between 'Causal Inference and Discovery in Python' and 'Causal Inference in R'"
#+end_src

#+begin_src bash
# Implementation details
poetry run kbol query "What practical examples are provided in 'Causal Inference and Discovery in Python' for causal discovery?"
#+end_src

*** AI Security
#+begin_src bash
# Security applications
poetry run kbol query "What cybersecurity use cases are covered in 'Artificial Intelligence for Cybersecurity'?"
#+end_src

#+begin_src bash
# Implementation patterns
poetry run kbol query "What are the main security patterns and frameworks discussed in 'Artificial Intelligence for Cybersecurity'?"
#+end_src

*** Mathematical Foundations
#+begin_src bash
# Math concepts
poetry run kbol query "What statistical concepts from '15 Math Concepts Every Data Scientist Should Know' are applied in 'Bayesian Analysis with Python'?"
#+end_src

#+begin_src bash
# Bayesian applications
poetry run kbol query "How does 'Bayesian Analysis with Python' implement MCMC sampling in practice?"
#+end_src

*** Time Series Analysis
#+begin_src bash
# Forecasting techniques
poetry run kbol query "How does 'Modern Time Series Forecasting with Python' handle different forecasting techniques?"
#+end_src

#+begin_src bash
# Implementation patterns
poetry run kbol query "What are the main architectural patterns for time series prediction discussed in 'Modern Time Series Forecasting with Python'?"
#+end_src

*** Reinforcement Learning
#+begin_src bash
# Implementation examples
poetry run kbol query "What practical implementations are covered in 'Deep Reinforcement Learning Hands-On'?"
#+end_src

#+begin_src bash
# Advanced concepts
poetry run kbol query "How does 'Deep Reinforcement Learning Hands-On' approach advanced topics like multi-agent systems?"
#+end_src
* Implementation Details
** Vector Database Schema
#+begin_src sql :tangle src/kbol/db/schema.sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS book_chunks (
id SERIAL PRIMARY KEY,
book_title TEXT NOT NULL,
content TEXT NOT NULL,
embedding vector(384),
page_number INTEGER,
created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX IF NOT EXISTS book_chunks_embedding_idx ON book_chunks
USING ivfflat (embedding vector_cosine_ops);
#+end_src

** Processing Pipeline
#+begin_src mermaid :file doc/images/pipeline.png
sequenceDiagram
participant PDF as PDF Books
participant Chunker as Text Chunker
participant Embedder as Ollama Embedder
participant DB as Vector DB

PDF->>Chunker: Raw Text
Chunker->>Chunker: Split into Chunks
loop Each Chunk
Chunker->>Embedder: Text Chunk
Embedder->>Embedder: Generate Embedding
Embedder->>DB: Store Chunk + Embedding
end
#+end_src

* Quick Start
1. Setup your environment:
#+begin_src bash
make setup
#+end_src

2. Run the complete demo with a sample book:
#+begin_src bash
make demo
#+end_src

3. Try some example queries:
#+begin_src bash
# Query about specific topics
poetry run kbol query "Explain monads from the functional programming books"

# Find code examples
poetry run kbol query "Show me Clojure examples of map and reduce"

# Compare concepts
poetry run kbol query "Compare Python and Clojure approaches to immutability"
#+end_src

* Development Commands
| Command | Description |
|------------------+--------------------------------------------|
| make setup | Initial setup of development environment |
| make demo | Run complete demo pipeline |
| make load-books | Link books from your collection |
| make process-books| Process books into chunks with embeddings |
| make stats | Show statistics about processed books |
| make clean | Clean generated files and directories |

* Vector Database Schema
The system uses a PostgreSQL database with vector similarity search capabilities:

#+begin_src sql
CREATE TABLE book_chunks (
id SERIAL PRIMARY KEY,
book_title TEXT NOT NULL,
content TEXT NOT NULL,
embedding vector(384),
page_number INTEGER,
created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
#+end_src

* License
MIT

* Author
Jason Walsh ([[https://wal.sh][https://wal.sh]])