https://github.com/cllspy/pypte
The PDF Text Extractor API allows users to upload PDF files and receive the extracted text from those files. This API is built using FastAPI and leverages the PyMuPDF library for efficient text extraction.
- Host: GitHub
- URL: https://github.com/cllspy/pypte
- Owner: CllsPy
- Created: 2024-11-20T15:49:53.000Z
- Default Branch: main
- Last Pushed: 2024-11-20T16:54:19.000Z
- Last Synced: 2024-11-20T17:40:49.804Z
- Topics: fastapi, pdf-extractor, pdf-to-text, postman, python
- Language: Python
- Size: 0 Bytes
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
# **PYTHON PDF TEXT EXTRACTOR**
## **Overview**
The **PDF Text Extractor API** allows users to upload PDF files and receive the extracted text from those files. This API is built using FastAPI and leverages the PyMuPDF library for efficient text extraction.
### **Base URL**
```
http://127.0.0.1:8000
```
## **Endpoints**
### **1. Root Endpoint**
**URL:** `/`
**Method:** `GET`
#### **Description:**
Returns a welcome message indicating that the API is running.
#### **Example Response:**
```json
{
"message": "Welcome to the PDF Text Extractor API!"
}
```
### **2. Extract Text from PDF**
**URL:** `/extract-text/`
**Method:** `POST`
#### **Description:**
Accepts a PDF file as input and returns the extracted text.
#### **Request Parameters:**
- **file** (required): A PDF file to extract text from.
  - Type: `File`
  - Format: PDF file (`.pdf`)
#### **Example Request (Swagger UI):**
1. Navigate to `http://127.0.0.1:8000/docs`.
2. Open the `/extract-text/` endpoint.
3. Click **Try it out**.
4. Upload a PDF file using the file input.
5. Click **Execute** to process the file.
#### **Example Request (`curl`):**
```bash
curl -X 'POST' \
'http://127.0.0.1:8000/extract-text/' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'file=@example.pdf'
```
#### **Response:**
- **200 OK**: Text extracted successfully.
- **400 Bad Request**: Invalid file type (e.g., not a PDF).
- **500 Internal Server Error**: Issue with PDF processing.
#### **Example Response:**
```json
{
"filename": "example.pdf",
"text": "This is the extracted text from the PDF."
}
```
## **How to Run the API**
### **1. Prerequisites**
Ensure you have Python 3.7+ installed. Install the required libraries:
```bash
pip install fastapi uvicorn pymupdf python-multipart
```
### **2. Running the Application**
Save the API code in a file called `main.py`, then run the following command:
```bash
uvicorn main:app --reload
```
### **3. Access the API**
- Swagger UI: [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs)
- Redoc UI: [http://127.0.0.1:8000/redoc](http://127.0.0.1:8000/redoc)
## **Testing the API**
### **1. Using Swagger UI**
1. Go to [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs).
2. Select the `/extract-text/` endpoint.
3. Click **Try it out**.
4. Upload a PDF file.
5. Click **Execute** to see the extracted text in the response.
### **2. Using `curl`**
Test the `/extract-text/` endpoint using the command line:
```bash
curl -X 'POST' \
'http://127.0.0.1:8000/extract-text/' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'file=@path_to_your_pdf.pdf'
```
### **3. Using Postman**
1. Set the request type to `POST` and the URL to `http://127.0.0.1:8000/extract-text/`.
2. Under the **Body** tab, select `form-data`.
3. Add a key named `file` with type set to `File`.
4. Upload the PDF file and click **Send**.
## **Error Handling**
| **Error Code** | **Description** |
|----------------|--------------------------------------------------|
| `400` | Invalid file type. Only PDF files are supported. |
| `500`          | An error occurred while processing the PDF file. |
## **Example Use Cases**
- Extract text from invoices, academic papers, or legal documents.
- Process multiple PDF files for text mining or analysis.
- Integrate into document management systems for automated text extraction.

---
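For the multi-file use case listed above, a client can loop over a folder and post each PDF to `/extract-text/`. The sketch below uses only the Python standard library and assumes the API is running at the base URL from this README and returns the documented `filename`/`text` JSON. The helper names (`build_multipart`, `extract_text`, `extract_folder`) are hypothetical and not part of the project.

```python
import json
import uuid
from pathlib import Path
from typing import Callable, Dict, Tuple
from urllib import request

# Base URL and endpoint as documented in this README.
API_URL = "http://127.0.0.1:8000/extract-text/"


def build_multipart(filename: str, data: bytes) -> Tuple[bytes, str]:
    """Encode one PDF as a multipart/form-data body with a single `file` field.

    Returns the body and the matching Content-Type header value.
    Simplification: the filename is assumed to contain no double quotes.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def extract_text(pdf_path: Path, url: str = API_URL) -> str:
    """Upload one PDF and return the `text` field of the JSON response."""
    body, content_type = build_multipart(pdf_path.name, pdf_path.read_bytes())
    req = request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    # urlopen raises urllib.error.HTTPError for 400 and 500 responses.
    with request.urlopen(req) as resp:
        return json.load(resp)["text"]


def extract_folder(
    folder: Path, extract: Callable[[Path], str] = extract_text
) -> Dict[str, str]:
    """Map each *.pdf file name in `folder` (sorted) to its extracted text."""
    return {path.name: extract(path) for path in sorted(folder.glob("*.pdf"))}
```

With the API running locally, `extract_folder(Path("invoices"))` would return a dictionary keyed by file name, where `invoices` is a placeholder folder. The `extract` parameter exists so the upload step can be swapped out, for example with a stub during testing.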