https://github.com/psenger/haystack-needle
haystack.deepset.ai, MongoDb, Python, Ollama
- Host: GitHub
- URL: https://github.com/psenger/haystack-needle
- Owner: psenger
- License: apache-2.0
- Created: 2024-07-20T06:00:59.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-07-20T06:24:20.000Z (over 1 year ago)
- Last Synced: 2024-07-20T07:34:46.126Z (over 1 year ago)
- Language: Python
- Size: 18.6 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
---
tags:
- mongodb
- python
- Haystack
- ollama
- how-to
---
# haystack-needle - MongoDB Web Crawl with Haystack and Ollama
> haystack.deepset.ai, MongoDb, Python, Ollama
This project demonstrates how to set up a web crawl data processing pipeline using MongoDB, Haystack, and Ollama. The primary purpose is to fetch documents from a MongoDB replica set, process them, and retrieve information based on specific queries using the Haystack framework.
## Requirements
- Python 3.8+
- MongoDB
- Haystack
- Ollama
## Project Structure
- `main.py`: The main script that sets up and runs the data processing pipeline.
## Setup Instructions
1. **Clone the Repository**
```bash
git clone https://github.com/psenger/haystack-needle
cd haystack-needle
```
2. **Install Dependencies**
```bash
pip install -r requirements.txt
```
3. **Configure MongoDB Connection**
Make sure your MongoDB replica set is running and accessible. The URI in the script is set to:
```python
uri = "mongodb://mongo-1:27017,mongo-2:27117,mongo-3:27217/web-crawl"
```
Adjust this URI to match the hosts, ports, and database name of your MongoDB setup.
4. **Run the Script**
Execute the `main.py` script to fetch documents from MongoDB, process them with Haystack, and run queries.
```bash
python main.py
```
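Step 3 shows a bare URI, while the script later passes a variable named `uri_with_options` to `MongoClient`. How the script builds that value is not shown in this README. One possible sketch, using only the standard library, appends connection options as a query string. The helper name `add_uri_options` and the replica set name `rs0` are assumptions for illustration.

```python
from urllib.parse import urlencode


def add_uri_options(uri: str, **options: str) -> str:
    """Append connection options to a MongoDB URI as a query string."""
    if not options:
        return uri
    # Reuse an existing query string if the URI already has one.
    separator = "&" if "?" in uri else "?"
    return f"{uri}{separator}{urlencode(options)}"


uri = "mongodb://mongo-1:27017,mongo-2:27117,mongo-3:27217/web-crawl"
# "rs0" is a placeholder; use the name your replica set was initiated with.
uri_with_options = add_uri_options(
    uri, replicaSet="rs0", readPreference="primaryPreferred"
)
print(uri_with_options)
```

The helper uses plain string concatenation rather than `urllib.parse.urlsplit`, because a multi-host MongoDB URI is not a standard URL and may not parse cleanly.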
## Components
### MongoDB
The script connects to a MongoDB replica set without authentication. Ensure the replica set is reachable from the machine running the script and accepts unauthenticated connections, or add credentials to the URI.
### Haystack
Haystack is used for setting up the document store and the pipeline. Key components include:
- **InMemoryDocumentStore**: Stores documents in memory for retrieval.
- **InMemoryBM25Retriever**: Retrieves relevant documents using the BM25 ranking algorithm.
- **PromptBuilder**: Builds the prompt for the query.
- **OllamaGenerator**: Generates responses using the Ollama model.
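To give a feel for what BM25 retrieval does, here is a simplified, self-contained sketch of BM25 scoring over whitespace-tokenized text. It is not Haystack's implementation; the function name, the sample documents, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    avg_len = sum(len(tokens) for tokens in tokenized) / len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            n = doc_freq[term]
            # Smoothed IDF variant that stays non-negative.
            idf = math.log((len(docs) - n + 0.5) / (n + 0.5) + 1)
            tf = term_freq[term]
            # Length normalization: longer-than-average documents are penalized.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "MongoDB stores crawled pages",
    "Ollama runs language models locally",
    "Haystack builds retrieval pipelines",
]
scores = bm25_scores("ollama models", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

Documents that share no terms with the query score zero, so only the second sample document is a candidate here.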
### Ollama
Ollama is used to generate responses based on the context and query provided. The model is specified as `llama3` and runs on a local server.
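For readers who want to see what a call to that local server looks like outside of Haystack, the sketch below builds a request for Ollama's `/api/generate` endpoint with the standard library. The helper names and the example prompt are assumptions for illustration; inside the pipeline, `OllamaGenerator` handles this exchange for you.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt: str, model: str = "llama3") -> urllib.request.Request:
    """Build a non-streaming generation request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3") -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_generate_request(prompt, model)) as response:
        return json.loads(response.read())["response"]


# Building the request does not touch the network, so it is safe to inspect.
request = build_generate_request("What does a web crawler do?")
print(request.get_method(), request.full_url)

# Calling generate() requires Ollama to be running with the llama3 model pulled:
# print(generate("What does a web crawler do?"))
```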
## Script Overview
1. **MongoDB Connection**
Connects to MongoDB and fetches documents:
```python
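from pymongo import MongoClient

# uri_with_options: the connection URI from the setup step plus any connection options
client = MongoClient(uri_with_options)
db = client['web-crawl']
collection = db['pages']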
```
2. **Document Processing**
Processes and converts MongoDB documents into Haystack `Document` objects:
```python
documents = [
    Document(
        # Haystack expects a string id; MongoDB's _id is typically an ObjectId
        id=str(doc['_id']),
        content=doc.get('content', ''),
        meta={
            'title': doc.get('title', ''),
            'url': doc.get('url', ''),
            'ldJsonScripts': [safe_json_loads(script) for script in doc.get('ldJsonScripts', [])],
            'imageUrls': doc.get('imageUrls', []),
            'pageHrefs': doc.get('pageHrefs', []),
            'linkTags': doc.get('linkTags', []),
            'metaTags': doc.get('metaTags', [])
        }
    ) for doc in collection.find({})
]
```
3. **Pipeline Configuration**
Sets up the Haystack pipeline with retriever, prompt builder, and Ollama generator:
```python
pipe = Pipeline()
pipe.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
pipe.add_component("prompt_builder", PromptBuilder(template=template))
pipe.add_component("llm", OllamaGenerator(
    model="llama3",
    url="http://localhost:11434/api/generate",
    # Any further arguments are cut off in the source listing.
))
```
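The document-processing step calls a helper named `safe_json_loads` that is not shown in this README. A minimal sketch of what such a helper might look like, assuming it should return a fallback value instead of raising on malformed JSON-LD strings (the `default` parameter and its value of `None` are assumptions):

```python
import json
from typing import Any


def safe_json_loads(raw: Any, default: Any = None) -> Any:
    """Parse a JSON string, returning `default` if the input is not valid JSON."""
    try:
        return json.loads(raw)
    except (TypeError, ValueError):
        # json.JSONDecodeError is a subclass of ValueError; TypeError covers
        # non-string inputs such as None.
        return default


scripts = ['{"@type": "Article", "headline": "Hello"}', "{not valid json"]
parsed = [safe_json_loads(script) for script in scripts]
print(parsed)
```

Returning a fallback keeps one malformed `ldJsonScripts` entry from aborting the conversion of an entire crawl. The trade-off is that parse failures become silent unless you log them.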
## License
This project is licensed under the Apache License, Version 2.0. See the [LICENSE](LICENSE) file for more details.
## Contributing
Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.
## Contact
For inquiries or support, please reach out on [LinkedIn](https://www.linkedin.com/in/philipsenger/).