https://github.com/itsyaasir/pdf-intellect

PDF Intellect is a smart AI-powered tool designed to extract and analyze information from PDF document
https://github.com/itsyaasir/pdf-intellect

ai analyzer conda llama2 llamacpp machine-learning pdf project python smart

Last synced: 7 months ago
JSON representation

PDF Intellect is a smart AI-powered tool designed to extract and analyze information from PDF document

Host: GitHub
URL: https://github.com/itsyaasir/pdf-intellect
Owner: itsyaasir
License: mit
Created: 2023-07-27T05:55:12.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2023-09-21T17:15:26.000Z (almost 2 years ago)
Last Synced: 2024-10-18T22:00:21.139Z (9 months ago)
Topics: ai, analyzer, conda, llama2, llamacpp, machine-learning, pdf, project, python, smart
Language: Python
Homepage:
Size: 5.58 MB
Stars: 7
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: Readme.md
- License: LICENSE

Awesome Lists containing this project

README

---

# PDF Intellect

PDF Embedding Indexer is a CLI tool designed to process text content from PDF files, generate meaningful embeddings from the text using Sentence Transformers, and store those embeddings in a PostgreSQL database. This allows for quick and efficient similarity searching, providing a useful tool for managing and navigating through a large number of PDF files.

## Features

- Currently supports PDF files only, and is only limited to a single PDF file per command.
- Extracts text content from PDF files and generates embeddings using Sentence Transformers.
- Stores the embeddings in a PostgreSQL database for quick and efficient similarity searching.
- Allows for sentence-level indexing, offering granular search results.
- Stores additional metadata for each document, including file hash, timestamp, and title.
- Prevents duplicate PDFs from being indexed.

## Requirements

- Python 3.8 or higher
- PostgreSQL
- [sentence-transformers](https://github.com/UKPLab/sentence-transformers)
- psycopg2-binary
- sqlalchemy
- pgvector
- python-magic
- pdfminer.six
- nltk

## Setup

1. Ensure that you have Python 3.8 or higher installed.

2. Install PostgreSQL and setup a database for this project.

3. Clone this repository:

```bash
git clone https://github.com/itsyaasir/pdf-intellect.git
```

4. Change into the project directory:

```bash
cd pdf-intellect
```

5. You can create and activate a Conda or a virtual environment:

- For Conda environment:

Run the provided setup script to create a Conda environment and install the necessary packages:

```bash
bash conda_setup.sh
```

- For virtual environment:

Create a virtual environment:

```bash
python -m venv venv
```

Activate the virtual environment:
- On Unix or MacOS, run:

```bash
source venv/bin/activate
```

- On Windows, run:

```bash
venv\Scripts\activate
```

Install the required packages:

```bash
pip install -r requirements.txt
```

6. Run the provided setup script to setup the database:

You will need to prov

```bash
bash db_setup.sh
```

7. Modify the environment variables in `config.py` if necessary.

8. Depending on your model, you might need to adjust the prompt template to match the model's input format.
You can check the default template in `app/llama.py`.

## Usage

To index a PDF:

```bash
python main.py index
```

To search for similar content given a query:

```bash
python main.py search ""
```

To use the PDF with LLM:

```bash
python main.py query <"query">
```

These commands will print their results to the console.

## Contributing

Please feel free to fork this repository and contribute. When submitting your changes, please ensure that your code is well-commented and that you have tested your changes.

## License

This project is licensed under the terms of the MIT license. See [LICENSE](LICENSE) for more details.

---

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/itsyaasir/pdf-intellect

Awesome Lists containing this project

README