https://github.com/kowshik24/pineconepdfextractor

PineconePDFExtractor is a Python library for extracting text from PDF files for pinecone.

Host: GitHub
URL: https://github.com/kowshik24/pineconepdfextractor
Owner: kowshik24
License: other
Created: 2024-01-20T18:25:12.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-01-22T13:50:05.000Z (over 1 year ago)
Last Synced: 2025-03-04T09:17:01.646Z (4 months ago)
Language: Python
Size: 13.7 KB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# PineconePDFExtractor

PineconePDFExtractor is a Python library that extracts text from PDF files for use with Pinecone.

## Installation

Use the package manager [pip](https://pip.pypa.io/en/stable/) to install PineconePDFExtractor.

```bash
pip install PineconePDFExtractor
```

## Google Colab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1dTGaIHgq5KtV06Sno5-4EGUwQiiKXfbE?usp=sharing)

## Latest version

Check the latest version on PyPI: https://pypi.org/project/PineconePDFExtractor/

## Usage

```python
from pdf.PineconePDFExtractor import PdfProcessor

# Create a PdfProcessor instance with a batch size of 200
extractor = PdfProcessor(200)

# Process a list of PDF files
result = extractor.process_files(['file1.pdf', 'file2.pdf'])

# The result is a dictionary with the batch size and a list of documents.
# Each document is a dictionary with:
#   id       - file name without extension
#   metadata - number of pages
#   source   - file path
#   text     - extracted text

# Example result:
# {
#   'batch_size': 200,
#   'documents': [
#     {
#       'id': 'file1',
#       'metadata': {
#         'pages': 1
#       },
#       'source': 'file1.pdf',
#       'text': 'This is the extracted text from file1.pdf'
#     },
#     {
#       'id': 'file2',
#       'metadata': {
#         'pages': 2
#       },
#       'source': 'file2.pdf',
#       'text': 'This is the extracted text from file2.pdf'
#     }
#   ]
# }
```
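
To illustrate how the result dictionary might be consumed, here is a minimal sketch that walks the `documents` list and groups it into batches. It uses a hard-coded copy of the example result above, so it runs without the library or any PDF files. The helpers `iter_batches` and `summarize` are hypothetical names introduced for this example and are not part of PineconePDFExtractor. Treating `batch_size` as the number of documents per group is also an assumption, since the README does not say how the library itself uses that value.

```python
from typing import Dict, Iterator, List

# Sample data copied from the example result above (not produced by the library here).
result = {
    "batch_size": 200,
    "documents": [
        {
            "id": "file1",
            "metadata": {"pages": 1},
            "source": "file1.pdf",
            "text": "This is the extracted text from file1.pdf",
        },
        {
            "id": "file2",
            "metadata": {"pages": 2},
            "source": "file2.pdf",
            "text": "This is the extracted text from file2.pdf",
        },
    ],
}


def iter_batches(documents: List[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Hypothetical helper: yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(documents), batch_size):
        yield documents[start:start + batch_size]


def summarize(document: Dict) -> str:
    """Hypothetical helper: build a one-line description of a document."""
    pages = document["metadata"]["pages"]
    return f"{document['id']}: {pages} page(s), source={document['source']}"


# Assumption: batch_size is interpreted as "documents per group".
batches = list(iter_batches(result["documents"], result["batch_size"]))
summaries = [summarize(doc) for batch in batches for doc in batch]

for line in summaries:
    print(line)
```

Each batch is a plain list of document dictionaries, so it can be passed to whatever indexing or upload step comes next without further conversion.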
