https://github.com/sgowdaks/nichirin
RAG and Webcrawler in a single package
llm rag retrieval-augmented-generation scraping webcrawler
- Host: GitHub
- URL: https://github.com/sgowdaks/nichirin
- Owner: sgowdaks
- License: mit
- Created: 2023-11-16T20:07:33.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-04-19T21:23:32.000Z (about 2 years ago)
- Last Synced: 2025-05-22T12:32:34.563Z (about 1 year ago)
- Topics: llm, rag, retrieval-augmented-generation, scraping, webcrawler
- Language: Python
- Homepage: https://pypi.org/project/nichirin/
- Size: 1.46 MB
- Stars: 2
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.txt
# Nichirin: A Retrieval-Augmented Generation framework combined with a web crawler
[nichirin on PyPI](https://pypi.python.org/pypi/nichirin/)
## Overview
**Nichirin** is a Retrieval-Augmented Generation framework with a built-in web crawler, built as a layer on top of Apache Solr.
1. **What is Nichirin?**
   - **Nichirin** sits on top of **Apache Solr** and handles data indexing for you.
   - It abstracts away the complexities of Solr indexing, so users can focus on supplying their data instead of the low-level details.
2. **Key Features:**
   - **Multi-level crawling**: Crawls the web across multiple link levels using depth-first search; the extracted text is indexed in and retrieved from Apache Solr.
   - **Efficient indexing**: Uses Apache Spark to process URLs in parallel, which improves the scalability and efficiency of both web crawling and text indexing.
   - **Python package**: Available as a Python package on PyPI for easy installation and integration.
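
To make the multi-level, depth-first crawling idea concrete, here is a minimal sketch of a depth-limited depth-first traversal over links. It is not Nichirin's actual crawler: the function name `crawl_depth_first`, the `get_links` callback, and the in-memory `LINKS` table (which stands in for real HTTP fetches) are all assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, List


def crawl_depth_first(
    seed: str,
    get_links: Callable[[str], Iterable[str]],
    max_depth: int = 2,
) -> List[str]:
    """Return URLs in the order a depth-limited depth-first traversal visits them."""
    visited: List[str] = []
    seen = {seed}          # URLs already scheduled, to avoid revisiting
    stack = [(seed, 0)]    # (url, depth) pairs; a stack gives depth-first order

    while stack:
        url, depth = stack.pop()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        # Push in reverse so the first link on a page is explored first.
        for link in reversed(list(get_links(url))):
            if link not in seen:
                seen.add(link)
                stack.append((link, depth + 1))
    return visited


# Hypothetical in-memory link table standing in for fetched pages.
LINKS: Dict[str, List[str]] = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/a/1"],
    "https://example.com/b": ["https://example.com/"],
}

order = crawl_depth_first("https://example.com/", lambda u: LINKS.get(u, []))
print(order)
```

In a real crawler, `get_links` would fetch the page and extract its outgoing links, and each visited page's text would be handed to the indexing step.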
## Setup
```bash
# Option 1: install as read only; recommended to use as is
pip install git+https://github.com/sgowdaks/nichirin
# Option 2: install for editable mode; recommended if you'd like to modify code
git clone https://github.com/sgowdaks/nichirin
cd nichirin
pip install -e .
# Option 3: install from PyPI
pip install nichirin
```
## Commands
* `install-solr` to install Solr
* `create-core --core ` to create a Solr core
* `partition-data --path ` to partition the data
* `pipeline --path ` to generate embeddings for the partitioned data
* `index-solr --data-path --core ` to index the data
* `query-solr --input_sen --core_name ` to query the data from Solr
* `seed-urls --core --urls ` to add the seed URLs
* `start-crawler` to start the web crawler
* `start-serve` to start the web server
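
For readers who want to inspect an indexed core directly, the sketch below builds a URL for Solr's standard `select` request handler. It does not reproduce what `query-solr` does internally; the helper name `build_select_url`, the core name `my_core`, and the query text are hypothetical, and the base URL assumes Solr is running locally on its default port 8983.

```python
from urllib.parse import urlencode


def build_select_url(
    core: str,
    text: str,
    rows: int = 5,
    base_url: str = "http://localhost:8983/solr",  # assumed local Solr instance
) -> str:
    """Build a URL for Solr's /select handler; nothing is sent over the network."""
    params = {"q": text, "rows": rows, "wt": "json"}
    return f"{base_url}/{core}/select?{urlencode(params)}"


# "my_core" is a placeholder; use the core created with `create-core`.
url = build_select_url("my_core", "what is solr")
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, returns the matching documents as JSON, provided the core exists and Solr is running.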
## Quickstart
1. Begin by executing the `install-solr` command to install the Solr application.
2. Next, create the cores using the `create-core` command.
3. After setting up Solr and creating the cores, add seed URLs by running the `seed-urls` command.
4. Once the seed URLs are added, initiate the crawling process with the `start-crawler` command. Be patient, as this step may take some time.
5. Finally, to view the results, launch the Flask web app using the `start-serve` command.
This starts a service on http://127.0.0.1:5000 by default.
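
Before walking through the steps above, it can help to confirm that the commands are actually available. The following is a small preflight sketch, assuming the package installs its commands as executables on your `PATH`; the helper name `missing_commands` is hypothetical and not part of Nichirin.

```python
import shutil
from typing import List

# Commands used in the quickstart, in the order they are run.
QUICKSTART_COMMANDS = [
    "install-solr",
    "create-core",
    "seed-urls",
    "start-crawler",
    "start-serve",
]


def missing_commands(commands: List[str]) -> List[str]:
    """Return the commands that cannot be found on PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


missing = missing_commands(QUICKSTART_COMMANDS)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All quickstart commands are available.")
```

If any command is reported missing, check that `pip install` ran in the same environment you are using.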

## Contributing and Feedback
We welcome contributions! If you’d like to enhance Nichirin, feel free to submit a pull request.
To report issues, give feedback, or ask questions, open an issue on our GitHub repository.