# Nichirin: A Retrieval-Augmented Generation framework combined with a web crawler
[![image](http://img.shields.io/pypi/v/nichirin.svg)](https://pypi.python.org/pypi/nichirin/)

## Overview
**Nichirin** is a layer on top of Apache Solr that simplifies data indexing and pairs it with a web crawler for retrieval-augmented generation (RAG).

1. **What is Nichirin?**
- **Nichirin** wraps **Apache Solr** behind a small set of commands for installing Solr, creating cores, indexing data, and querying it.
- It hides the details of Solr indexing, so users can focus on supplying their data.

2. **Key Features:**
- **Multi-level Crawling**: Crawls the web to multiple levels using depth-first search; the extracted text is indexed and retrieved through Apache Solr.
- **Efficient Indexing**: Uses Apache Spark to process URLs in parallel, which improves the scalability and efficiency of both web crawling and text indexing.
- **Python package**: Available on PyPI for easy installation and integration.
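
To make the multi-level, depth-first crawling idea concrete, here is a minimal sketch of a depth-limited depth-first traversal. It is not Nichirin's implementation: the function `crawl_depth_first`, the `get_links` callback, and the in-memory `LINKS` graph are hypothetical stand-ins for fetching a page and extracting its links.

```python
from typing import Callable, Iterable


def crawl_depth_first(
    seeds: Iterable[str],
    get_links: Callable[[str], Iterable[str]],
    max_depth: int,
) -> list[str]:
    """Return URLs in the order a depth-limited depth-first crawl visits them."""
    visited: set[str] = set()
    order: list[str] = []
    # Stack of (url, depth); reversed so the first seed is popped first.
    stack = [(url, 0) for url in reversed(list(seeds))]
    while stack:
        url, depth = stack.pop()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        if depth < max_depth:
            # Push children in reverse so they are popped in listed order.
            for link in reversed(list(get_links(url))):
                if link not in visited:
                    stack.append((link, depth + 1))
    return order


# Hypothetical link graph standing in for real fetched pages.
LINKS = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": [],
}

if __name__ == "__main__":
    print(crawl_depth_first(["a"], lambda u: LINKS.get(u, []), max_depth=2))
```

In a real crawler, `get_links` would download the page and parse its anchors, and each visited page's text would be handed to the indexer.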

## Setup

```bash
# Option 1: install directly from GitHub (read-only); recommended if you will use the package as is
pip install git+https://github.com/sgowdaks/nichirin

# Option 2: install in editable mode; recommended if you'd like to modify the code
git clone https://github.com/sgowdaks/nichirin
cd nichirin
pip install -e .

# Option 3: install from PyPI
pip install nichirin
```

## Commands
* `install-solr` to install Solr
* `create-core --core ` to create a Solr core
* `partition-data --path ` to partition the data
* `pipeline --path ` to generate embeddings for the partitioned data
* `index-solr --data-path --core ` to index the data
* `query-solr --input_sen --core_name ` to query the data in Solr
* `seed-urls --core --urls ` to add the seed URLs
* `start-crawler` to start the web crawler
* `start-serve` to start the web server
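
For orientation, the indexing and query commands ultimately talk to Solr, which exposes an HTTP API for both operations. The sketch below builds (but does not send) requests against Solr's standard `update` and `select` handlers using only the Python standard library. It is an illustration of the underlying Solr API, not of how Nichirin calls it; the base URL assumes Solr's default port 8983, and the core name `demo` and the document fields are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: Solr runs locally on its default port.
SOLR_BASE = "http://localhost:8983/solr"


def build_select_url(core: str, query: str, rows: int = 10) -> str:
    """Build a URL for Solr's select (query) handler on the given core."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{SOLR_BASE}/{core}/select?{params}"


def build_update_request(core: str, docs: list[dict]) -> Request:
    """Build a POST request that sends a JSON array of documents to Solr."""
    body = json.dumps(docs).encode("utf-8")
    return Request(
        f"{SOLR_BASE}/{core}/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical core name and document fields, for illustration only.
    docs = [{"id": "1", "text": "Solr is a search platform."}]
    request = build_update_request("demo", docs)
    print(request.get_method(), request.full_url)
    print(build_select_url("demo", "text:solr", rows=5))
```

Passing the request to `urllib.request.urlopen` would send it to a running Solr instance; the sketch stops short of that so it can be run without one.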

## Quickstart
1. Begin by executing the `install-solr` command to install the Solr application.
2. Next, create the cores using the `create-core` command.
3. After setting up Solr and creating the cores, add seed URLs by running the `seed-urls` command.
4. Once the seed URLs are added, initiate the crawling process with the `start-crawler` command. Be patient, as this step may take some time.
5. Finally, to view the results, launch the Flask web app using the `start-serve` command.

This starts a service on http://127.0.0.1:5000 by default.
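
If it is unclear whether the web app came up, a quick check is to test whether anything is listening on that address. The helper below is a small sketch using only the standard library; `port_is_open` is a hypothetical name, and it only confirms that a TCP connection succeeds, not that the app behind it is Nichirin.

```python
import socket


def port_is_open(host: str = "127.0.0.1", port: int = 5000, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    print("web app reachable:", port_is_open())
```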

## Contributing and Feedback
We welcome contributions! To enhance Nichirin, submit a pull request.
To report a problem, give feedback, or ask a question, open an issue on our GitHub repository.