https://github.com/deliciousboy/llm-chatbot-backend



# LLM Chatbot

[![Powered by Kedro](https://img.shields.io/badge/powered_by-kedro-ffc900?logo=kedro)](https://kedro.org)

## Overview

A Retrieval-Augmented Generation (RAG) system that scrapes website data, embeds the text, and answers questions with an LLM.

## How to install dependencies

Declare any dependencies in `requirements.txt` and `pyproject.toml` for `pip` installation.

### Clone the repository
```bash
git clone https://github.com/DeliciousBoy/llm-chatbot-backend.git
cd llm-chatbot-backend
```

### Installing `uv`
This project uses `uv` to manage virtual environments and dependencies across Python versions. To install `uv`, run:

```bash
curl -Ls https://astral.sh/uv/install.sh | sh
```
Alternatively, follow the instructions in the official GitHub repository: https://github.com/astral-sh/uv

Once `uv` is installed, set up the environment using the steps below.

### Install with `uv` (Recommended)

This project requires Python 3.11.11.
```bash
uv venv
source .venv/bin/activate # Or .venv/Scripts/activate for Windows
uv pip install -r requirements.txt
uv pip install -e ".[dev,docs]" # Quoted, with no space between extras
```
If you prefer not to use uv, you can fall back to pip (see below).

### Install with `pip` (Not recommended)
This is not recommended as it may lead to dependency conflicts, especially if you are using different Python versions.
```bash
python -m venv .venv
source .venv/bin/activate # Or .venv/Scripts/activate for Windows
pip install -r requirements.txt
pip install -e ".[dev,docs]" # Quoted so shells such as zsh do not expand the brackets
```

## How to run Kedro pipeline
This project uses [Kedro](https://kedro.org) to organize data workflows into modular pipelines.

### Available pipelines

| Pipeline Name | Description |
|--------------------|--------------------------------------|
| `data_processing` | Cleans and embeds text data into vectors |
| `web_scraping` | Asynchronously scrapes web content and stores it as raw data |

Each pipeline is defined in `src/llm_chatbot_backend/pipelines/` and can be run individually or as a group. You can also run specific nodes within a pipeline.

```bash
kedro run # Run all pipelines
kedro run --pipeline=web_scraping # Run web scraping pipeline
kedro run --pipeline=data_processing # Run data processing pipeline
```
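
As a rough illustration of what a `pipeline.py` under `src/llm_chatbot_backend/pipelines/` might contain, the sketch below wires one plain Python function into a Kedro pipeline. The function `clean_pages`, the dataset names `raw_pages` and `cleaned_pages`, and the node name `clean_pages_node` are hypothetical and not taken from this repository; only `Pipeline`, `node`, and the `create_pipeline` convention come from Kedro itself.

```python
def clean_pages(raw_pages: list[dict]) -> list[dict]:
    """Hypothetical node: strip surrounding whitespace and drop empty pages."""
    cleaned = []
    for page in raw_pages:
        text = page.get("text", "").strip()
        if text:
            cleaned.append({**page, "text": text})
    return cleaned


def create_pipeline(**kwargs):
    """Build a one-node pipeline; Kedro discovers this function by name."""
    # Imported here so the node function above can be used without Kedro.
    from kedro.pipeline import Pipeline, node

    return Pipeline(
        [
            node(
                func=clean_pages,
                inputs="raw_pages",        # hypothetical catalog entry
                outputs="cleaned_pages",   # hypothetical catalog entry
                name="clean_pages_node",   # hypothetical node name
            ),
        ]
    )
```

Giving each node an explicit `name` is what makes it possible to run it on its own. In recent Kedro versions this is done with something like `kedro run --nodes=clean_pages_node`; check `kedro run --help` for the exact flag in the installed version.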

## Visualize Kedro pipeline
You can visualize the pipeline using Kedro's built-in visualization tool. This will generate a graph of the pipeline nodes and their dependencies.

```bash
kedro viz run --autoreload
```
## Running Scheduled Jobs

This project includes a scheduler using `APScheduler` to automate periodic tasks such as scraping data, generating embeddings, or updating indexes.

To start the scheduler, run:

```bash
python scheduler.py
```
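
For orientation, here is a minimal sketch of how such a scheduler could be put together. It is not the contents of this repository's `scheduler.py`: it assumes the APScheduler 3.x API (`BlockingScheduler`, `add_job`), it launches the pipelines through the `kedro run` CLI shown earlier, and the 24-hour interval and the helper names are hypothetical.

```python
import subprocess


def build_kedro_command(pipeline_name: str) -> list[str]:
    """Return the CLI invocation for a single Kedro pipeline."""
    return ["kedro", "run", f"--pipeline={pipeline_name}"]


def run_pipeline(pipeline_name: str) -> int:
    """Run a pipeline in a child process and return its exit code."""
    completed = subprocess.run(build_kedro_command(pipeline_name), check=False)
    return completed.returncode


def refresh_index() -> None:
    """Scrape first, then embed; skip embedding if scraping failed."""
    if run_pipeline("web_scraping") == 0:
        run_pipeline("data_processing")


def main() -> None:
    # Imported here so the helpers above stay importable without APScheduler.
    from apscheduler.schedulers.blocking import BlockingScheduler

    scheduler = BlockingScheduler()
    # Hypothetical schedule: refresh once every 24 hours.
    scheduler.add_job(refresh_index, "interval", hours=24)
    scheduler.start()  # Blocks until interrupted (Ctrl+C).
```

Calling `main()` under the usual `if __name__ == "__main__":` guard starts the scheduler. Running each pipeline in a child process keeps a failed run from taking the scheduler down with it.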

## How to test your Kedro project
This project uses `pytest` to run test cases. You can run your tests with:

```bash
pytest
```
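
Because Kedro nodes are plain Python functions, they can be tested directly without running a pipeline. The sketch below shows the general shape of such a test; `clean_text` is a hypothetical stand-in defined inline, not a function from this repository's `nodes.py`.

```python
import re


def clean_text(raw: str) -> str:
    """Hypothetical stand-in for a data_processing node: collapse whitespace."""
    return re.sub(r"\s+", " ", raw).strip()


def test_clean_text_collapses_whitespace():
    assert clean_text("  hello \n\n world\t") == "hello world"


def test_clean_text_keeps_non_ascii():
    assert clean_text("café  au lait") == "café au lait"
```

In a real test module the function under test would be imported from the pipeline package rather than defined alongside the tests.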

## How to run chat interface
This project includes a Streamlit app for interacting with the chatbot. You can run the app with:

```bash
streamlit run main.py
```
To run the app locally, make sure the virtual environment is activated and the dependencies are installed.

## Project Structure
This project follows the [Kedro](https://kedro.org) project layout with additional components for web scraping, vector embeddings, and an LLM chatbot interface via Streamlit.
```
πŸ“llm-chatbot-backend/
β”œβ”€β”€ πŸ“conf/ # Kedro configuration files
β”‚ └── πŸ“base/
β”‚ β””β”€β”€πŸ“„catalog.yml # Dataset definitions (inputs/outputs for pipelines)
β”‚ β””β”€β”€πŸ“„parameters.yml # Project-level parameters for nodes/pipelines
β”œβ”€β”€ πŸ“data/ # raw/cleaned/embedded/chromadb
β”œβ”€β”€ πŸ“src/ # Source code (Kedro pipelines, modules)
β”‚ └── πŸ“llm_chatbot_backend/
β”‚ └── πŸ“datasets/ # Custom Kedro dataset classes
β”‚ | └── πŸ“„utf8_json.py # Custom JSON
β”‚ └── πŸ“pipelines/ # All Kedro pipelines
β”‚ └── πŸ“data_processing/
β”‚ | β””β”€β”€πŸ“„nodes.py # Data cleaning / embedding logic
β”‚ | β””β”€β”€πŸ“„pipeline.py # Defines the data_processing pipeline
β”‚ └── πŸ“web_scraping/
β”‚ β””β”€β”€πŸ“„nodes.py # Async scraping logic
β”‚ β””β”€β”€πŸ“„pipeline.py # Defines the web_scraping pipeline
β”œβ”€β”€ πŸ“tests/ # Pytest test cases
β”‚ └── πŸ“pipelines/
β”‚ └── πŸ“data_processing/
β”‚ | β””β”€β”€πŸ“„test_pipeline.py
β”‚ └── πŸ“web_scraping/
| β””β”€β”€πŸ“„test_pipeline.py
β”œβ”€β”€πŸ“„main.py # Streamlit chat interface\
β”œβ”€β”€πŸ“„scheduler.py # Automate Web Scraping Task
β”œβ”€β”€πŸ“„pyproject.toml # Project config & dependencies
β”œβ”€β”€πŸ“„requirements.txt # Pip requirements
β”œβ”€β”€πŸ“„uv.lock # uv dependency lockfile
β””β”€β”€πŸ“„.env # Environment variables
```
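
The tree lists a custom dataset in `datasets/utf8_json.py`. Its implementation is not shown here, so the following is only a guess at what a UTF-8 aware JSON dataset could look like. The class name `UTF8JSONDataset` is hypothetical, and the sketch deliberately uses only the standard library; an actual Kedro dataset would subclass `kedro.io.AbstractDataset` and be referenced from `catalog.yml`.

```python
import json
from pathlib import Path
from typing import Any


class UTF8JSONDataset:
    """Hypothetical sketch of a JSON dataset that reads and writes UTF-8."""

    def __init__(self, filepath: str):
        self._filepath = Path(filepath)

    def load(self) -> Any:
        with self._filepath.open("r", encoding="utf-8") as f:
            return json.load(f)

    def save(self, data: Any) -> None:
        self._filepath.parent.mkdir(parents=True, exist_ok=True)
        with self._filepath.open("w", encoding="utf-8") as f:
            # ensure_ascii=False keeps non-ASCII text readable in the file.
            json.dump(data, f, ensure_ascii=False, indent=2)

    def _describe(self) -> dict[str, Any]:
        return {"filepath": str(self._filepath)}
```

Writing with `ensure_ascii=False` and an explicit `encoding="utf-8"` avoids `\uXXXX` escapes and platform-dependent default encodings, which matters when scraped pages contain non-English text.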