https://github.com/kowshik24/pineconepdfextractor
PineconePDFExtractor is a Python library for extracting text from PDF files for pinecone.
- Host: GitHub
- URL: https://github.com/kowshik24/pineconepdfextractor
- Owner: kowshik24
- License: other
- Created: 2024-01-20T18:25:12.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2024-01-22T13:50:05.000Z (12 months ago)
- Last Synced: 2024-04-23T23:54:34.694Z (9 months ago)
- Language: Python
- Size: 13.7 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# PineconePDFExtractor
PineconePDFExtractor is a Python library for extracting text from PDF files for pinecone.
## Installation
Use the package manager [pip](https://pip.pypa.io/en/stable/) to install PineconePDFExtractor.
```bash
pip install PineconePDFExtractor
```
## Google Colab
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1dTGaIHgq5KtV06Sno5-4EGUwQiiKXfbE?usp=sharing)
## Check the latest version here:
https://pypi.org/project/PineconePDFExtractor/
## Usage
```python
from pdf.PineconePDFExtractor import PdfProcessor

# Create a PdfProcessor instance with a batch size of 200
extractor = PdfProcessor(200)

# Process a list of PDF files
result = extractor.process_files(['file1.pdf', 'file2.pdf'])

# The result is a dictionary with the batch size and a list of documents.
# Each document is a dictionary with the id (file name without extension),
# metadata (number of pages), source (file path), and text (extracted text).

# Example result
# {
#     'batch_size': 200,
#     'documents': [
#         {
#             'id': 'file1',
#             'metadata': {
#                 'pages': 1
#             },
#             'source': 'file1.pdf',
#             'text': 'This is the extracted text from file1.pdf'
#         },
#         {
#             'id': 'file2',
#             'metadata': {
#                 'pages': 2
#             },
#             'source': 'file2.pdf',
#             'text': 'This is the extracted text from file2.pdf'
#         }
#     ]
# }
```
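The README does not say how the returned `batch_size` is meant to be used downstream. One plausible reading is that the documents are intended to be sent onward in groups of at most that size. The following is a minimal sketch of that grouping, assuming the result has exactly the shape shown in the example above. `iter_batches` is a hypothetical helper, not part of PineconePDFExtractor, and `sample_result` is hard-coded so the snippet runs without the library or any PDF files.

```python
from typing import Any, Dict, Iterator, List


def iter_batches(result: Dict[str, Any]) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive slices of result['documents'].

    Hypothetical helper: assumes the result shape from the README example,
    and assumes 'batch_size' is the maximum number of documents per group.
    """
    size = result["batch_size"]
    if size <= 0:
        raise ValueError("batch_size must be positive")
    documents = result["documents"]
    for start in range(0, len(documents), size):
        yield documents[start:start + size]


# Hard-coded sample mirroring the example result in the README.
sample_result = {
    "batch_size": 200,
    "documents": [
        {
            "id": "file1",
            "metadata": {"pages": 1},
            "source": "file1.pdf",
            "text": "This is the extracted text from file1.pdf",
        },
        {
            "id": "file2",
            "metadata": {"pages": 2},
            "source": "file2.pdf",
            "text": "This is the extracted text from file2.pdf",
        },
    ],
}

# Print the ids contained in each batch.
for batch in iter_batches(sample_result):
    print([doc["id"] for doc in batch])
```

With the sample above, both documents fit in a single batch. Lowering `batch_size` to 1 would yield two batches of one document each.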