https://github.com/cllspy/pypte

Host: GitHub
URL: https://github.com/cllspy/pypte
Owner: CllsPy
Created: 2024-11-20T15:49:53.000Z (about 1 month ago)
Default Branch: main
Last Pushed: 2024-11-20T16:54:19.000Z (about 1 month ago)
Last Synced: 2024-11-20T17:40:49.804Z (about 1 month ago)
Topics: fastapi, pdf-extractor, pdf-to-text, postman, python
Language: Python
Homepage:
Size: 0 Bytes
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# **PYTHON PDF TEXT EXTRACTOR**

## **Overview**
The **PDF Text Extractor API** accepts an uploaded PDF file and returns the text extracted from it. It is built with FastAPI and uses the PyMuPDF library for text extraction.

### **Base URL**
```
http://127.0.0.1:8000
```

## **Endpoints**

### **1. Root Endpoint**
**URL:** `/`
**Method:** `GET`

#### **Description:**
Returns a welcome message indicating that the API is running.

#### **Example Response:**
```json
{
  "message": "Welcome to the PDF Text Extractor API!"
}
```

### **2. Extract Text from PDF**
**URL:** `/extract-text/`
**Method:** `POST`

#### **Description:**
Accepts a PDF file as input and returns the extracted text.

#### **Request Parameters:**
- **file** (required): A PDF file to extract text from.
  - Type: `File`
  - Format: PDF file (`.pdf`)

#### **Example Request (Swagger UI):**
1. Navigate to `http://127.0.0.1:8000/docs`.
2. Open the `/extract-text/` endpoint.
3. Click **Try it out**.
4. Upload a PDF file using the file input.
5. Click **Execute** to process the file.

#### **Example Request (`curl`):**
```bash
curl -X 'POST' \
'http://127.0.0.1:8000/extract-text/' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'file=@example.pdf'
```

#### **Response:**
- **200 OK**: Text extracted successfully.
- **400 Bad Request**: Invalid file type (e.g., not a PDF).
- **500 Internal Server Error**: Issue with PDF processing.

#### **Example Response:**
```json
{
  "filename": "example.pdf",
  "text": "This is the extracted text from the PDF."
}
```

## **How to Run the API**

### **1. Prerequisites**
Ensure you have Python 3.7+ installed. Install the required libraries:
```bash
pip install fastapi uvicorn pymupdf python-multipart
```

### **2. Running the Application**
Save the API code in a file called `main.py`, then run the following command:
```bash
uvicorn main:app --reload
```

### **3. Access the API**
- Swagger UI: [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs)
- ReDoc UI: [http://127.0.0.1:8000/redoc](http://127.0.0.1:8000/redoc)

## **Testing the API**

### **1. Using Swagger UI**
1. Go to [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs).
2. Select the `/extract-text/` endpoint.
3. Click **Try it out**.
4. Upload a PDF file.
5. Click **Execute** to see the extracted text in the response.

### **2. Using `curl`**
Test the `/extract-text/` endpoint using the command line:
```bash
curl -X 'POST' \
'http://127.0.0.1:8000/extract-text/' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'file=@path_to_your_pdf.pdf'
```
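
For scripted tests, the `curl` command above can be mirrored in Python. The following is a minimal sketch that uses only the standard library and is not part of the project. The helper names `build_multipart_body` and `extract_text` are hypothetical, and the default `base_url` assumes the server is running locally as described in this README.

```python
import json
import urllib.request
import uuid
from pathlib import Path


def build_multipart_body(field_name, filename, data, content_type="application/pdf"):
    """Encode one file as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    Assumes the filename contains no double quotes.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def extract_text(pdf_path, base_url="http://127.0.0.1:8000"):
    """POST a local PDF to /extract-text/ and return the decoded JSON response."""
    path = Path(pdf_path)
    body, header = build_multipart_body("file", path.name, path.read_bytes())
    request = urllib.request.Request(
        f"{base_url}/extract-text/",
        data=body,
        headers={"Content-Type": header, "Accept": "application/json"},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError for 4xx and 5xx responses.
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, `extract_text("path_to_your_pdf.pdf")["text"]` would return the extracted text.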

### **3. Using Postman**
1. Set the request type to `POST` and the URL to `http://127.0.0.1:8000/extract-text/`.
2. Under the **Body** tab, select `form-data`.
3. Add a key named `file` with type set to `File`.
4. Upload the PDF file and click **Send**.

## **Error Handling**

| **Error Code** | **Description** |
|----------------|--------------------------------------------------|
| `400` | Invalid file type. Only PDF files are supported. |
| `500` | An error occurred while processing the PDF file. |
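
On the client side, the codes in this table can be turned into readable messages. The sketch below is illustrative only: `describe_failure` is a hypothetical helper, and it assumes error bodies follow FastAPI's default `{"detail": "..."}` shape, which this README does not confirm.

```python
import json

# Descriptions copied from the error table above.
ERROR_DESCRIPTIONS = {
    400: "Invalid file type. Only PDF files are supported.",
    500: "An error occurred while processing the PDF file.",
}


def describe_failure(status_code: int, body: bytes = b"") -> str:
    """Build a one-line message for a failed /extract-text/ request."""
    reason = ERROR_DESCRIPTIONS.get(status_code, "Unexpected error.")
    try:
        # Assumption: the server sends a JSON object with a "detail" field.
        detail = json.loads(body).get("detail")
    except (ValueError, AttributeError):
        # Body was empty, not JSON, or not a JSON object.
        detail = None
    return f"{status_code}: {detail or reason}"
```

When the body carries no usable detail, the helper falls back to the description from the table.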

## **Example Use Cases**
- Extract text from invoices, academic papers, or legal documents.
- Process multiple PDF files for text mining or analysis.
- Integrate into document management systems for automated text extraction.
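
The endpoint documented here takes one file per request, so processing multiple PDFs means one call per file. A minimal sketch of such a loop follows. `extract_folder` is a hypothetical helper, and the `extract` argument stands for any function that posts a single file to `/extract-text/` and returns the parsed JSON response.

```python
from pathlib import Path
from typing import Callable, Dict


def extract_folder(folder, extract: Callable[[Path], dict]) -> Dict[str, str]:
    """Run `extract` on every .pdf file in `folder`, in name order.

    Returns a mapping of filename to extracted text, using the
    "filename" and "text" fields of each response.
    """
    results = {}
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        response = extract(pdf_path)
        results[response["filename"]] = response["text"]
    return results
```

Passing the request function as a parameter keeps the loop independent of the HTTP client and easy to test with a stub.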

---