https://github.com/mittalsoni00/filereader

Java PdfReader API is a Spring Boot-based application that extracts text from PDF files using the Apache PDFBox library. It provides a REST API to upload PDFs and retrieve their extracted text. This project simplifies text extraction for various applications like document processing and data analysis.

github-config java maven pdfbox postman spring spring-boot spring-web


# Java PdfReader API using LLM interaction
A Spring Boot application that allows users to upload a **CASA bank statement PDF** and receive extracted details such as **Name**, **Email**, **Opening Balance**, and **Closing Balance** using the **OpenAI GPT-4o** model.

🔗 **Live Hosted API**: [https://pdfreader-fped.onrender.com/api/parse-pdf](https://pdfreader-fped.onrender.com/api/parse-pdf)

---
## 📌 Switch to Master Branch
**Note:** Please switch to the `master` branch to access all the documentation and source files.

## 📂 Project Structure

### 🔹 Key Files

| File | Description |
|------|-------------|
| `PdfController.java` | Main controller that handles file upload and calls service to process PDF |
| `PdfService.java` | Extracts text from the PDF and uses OpenAI API for intelligent data parsing |
| `OpenAiService.java` | Integrates with OpenAI GPT-4o via REST call using Spring's `RestTemplate` |
| `application.properties` | Stores server config and API key (linked via environment variable for security) |
| `Dockerfile` | Dockerized for public deployment on Render |
| `ChatController.java` | [Debug Endpoint] Allows sending prompt manually to OpenAI via JSON |
| `ChatPromptDTO.java` | DTO for accepting request body in JSON format |
| `PdfParserUtil.java` | Optional utility class to aid in raw PDF parsing |

The API can be accessed via Postman or curl. (I used Postman; instructions are below 👇)

### ❌ Files You Can Exclude
The following files are not crucial to the core functionality:
- `ApiReaderNewApplicationTests.java`
- `Main.java`

## 🧪 PdfReader.java (Testing Purpose)
The `PdfReader.java` file is a **testing utility** that demonstrates how text extraction is performed from a PDF. It directly uses the **Apache PDFBox API** to extract text from any PDF file.

---

## 🛠️ Tech Stack

- Java 17
- Spring Boot
- Apache PDFBox
- OpenAI GPT-4o (via REST API)
- Maven
- Docker (for deployment)
- Render (hosting platform)
- Postman / curl for testing

---
## 💡 Features

- 🔍 Intelligent extraction using the **OpenAI GPT-4o** API
- 📄 Accepts a **PDF file** as multipart input
- 🧠 Analyzes the content with an **LLM prompt** to extract:
  - Name
  - Email
  - Opening Balance
  - Closing Balance
- 📬 JSON-formatted output
- 🧪 Separate test/debug endpoint to interact with OpenAI
- 🌐 **Deployed publicly** using Docker and Render

---

## 🚀 Getting Started (Local Setup)

### 1️⃣ Clone the Repository

```bash
git clone https://github.com/mittalsoni00/FileReader.git
cd FileReader
git checkout master
```

### 2️⃣ Add Your OpenAI API Key

Use an environment variable for security:
```bash
export OPENAI_API_KEY=your_secret_key_here
```

`application.properties` reads the key through a placeholder. Replacing the placeholder with a literal key works for local testing but is not recommended for production:
```properties
OPENAI_API_KEY=${OPENAI_API_KEY}
```
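For readers wiring the key into their own code, the following is a minimal sketch of a fail-fast lookup. The `ApiKeyResolver` class and its `resolve` method are hypothetical names used for illustration and are not taken from this repository; only the variable name `OPENAI_API_KEY` comes from the steps above.

```java
import java.util.Map;

/** Hypothetical helper (not part of this repo): resolves the OpenAI key from an environment map. */
final class ApiKeyResolver {
    static final String KEY_NAME = "OPENAI_API_KEY";

    /** Returns the trimmed key, or throws if the variable is missing or blank. */
    static String resolve(Map<String, String> env) {
        String key = env.get(KEY_NAME);
        if (key == null || key.isBlank()) {
            throw new IllegalStateException(KEY_NAME + " is not set; export it before starting the app");
        }
        return key.trim();
    }

    public static void main(String[] args) {
        String key = resolve(System.getenv());
        // Print only the length so the secret never reaches the logs.
        System.out.println("Key loaded, length " + key.length());
    }
}
```

Taking the environment as a `Map` keeps the lookup easy to exercise without touching the real process environment.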

### 3️⃣ Build and Run

```bash
mvn clean install
java -jar target/Api_Reader_New-0.0.1-SNAPSHOT.jar
```

---

## 🧪 Debug Endpoint (Optional)

To test the prompt-only flow (without a PDF), send:

```
POST /api/chat
Body (JSON):
{
"prompt": "Tell me a joke about Java developers"
}
```

This will return a direct OpenAI response.
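
The same call can be sketched from Java with the JDK's built-in `java.net.http` client (Java 11+). `ChatDebugClient`, its methods, and the base URL `http://localhost:8080` are assumptions for illustration; 8080 is Spring Boot's default port and may differ if `application.properties` overrides it.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hypothetical client for the debug endpoint; not part of this repo. */
final class ChatDebugClient {
    /** Minimal escaping (backslash, double quote, newline) so the prompt is a valid JSON string. */
    static String toJsonBody(String prompt) {
        String escaped = prompt.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
        return "{\"prompt\": \"" + escaped + "\"}";
    }

    /** Builds the POST /api/chat request without sending it. */
    static HttpRequest buildRequest(String baseUrl, String prompt) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/chat"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(toJsonBody(prompt)))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Assumed local address; adjust to wherever the app is running.
        HttpRequest request = buildRequest("http://localhost:8080", "Tell me a joke about Java developers");
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```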

---

## 📬 API Documentation

### ✅ Endpoint for PDF Parsing

```
POST /api/parse-pdf
```

### Request Type
`multipart/form-data`

### Form Key:
| Key | Value |
|-----|-------|
| `file` | [Select your PDF file] |
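
If Postman is not at hand, the upload can be approximated from Java using only the JDK. This is a sketch under stated assumptions: `PdfUploadClient` and its methods are hypothetical, the local URL assumes Spring Boot's default port 8080, and the multipart body is assembled by hand because `java.net.http` has no built-in multipart publisher. The form key `file` and the path `/api/parse-pdf` come from the documentation above.

```java
import java.io.ByteArrayOutputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

/** Hypothetical upload client; not part of this repo. */
final class PdfUploadClient {
    /** Builds a single-part multipart/form-data body with the form key "file". */
    static byte[] multipartBody(String boundary, String fileName, byte[] pdfBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Assumes fileName contains no double quotes or line breaks.
        String head = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"file\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/pdf\r\n\r\n";
        out.writeBytes(head.getBytes(StandardCharsets.UTF_8));
        out.writeBytes(pdfBytes);
        // The closing delimiter ends with two extra dashes.
        out.writeBytes(("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }

    /** The boundary in the header must match the one used inside the body. */
    static HttpRequest buildRequest(String url, String boundary, byte[] body) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
    }

    public static void main(String[] args) throws Exception {
        Path pdf = Path.of(args[0]); // path to a statement PDF, passed on the command line
        String boundary = "pdfreader-" + UUID.randomUUID();
        byte[] body = multipartBody(boundary, pdf.getFileName().toString(), Files.readAllBytes(pdf));
        HttpRequest request = buildRequest("http://localhost:8080/api/parse-pdf", boundary, body);
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```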

### 🔁 Response (Success)

```json
{
"response": "Here are the extracted details from the bank statement:\n\n- Name: John Doe\n- Email: johndoe@example.com\n- Opening Balance: $5,000\n- Closing Balance: $6,500"
}
```
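
Because the `response` field carries free-form text rather than structured JSON, a caller may want to split it into fields. The sketch below assumes the model keeps the `- Label: value` bullet layout shown in the sample, which an LLM does not guarantee. `StatementFields` is a hypothetical helper, and its input is the already-decoded value of `response`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for the sample response layout; not part of this repo. */
final class StatementFields {
    // Matches bullet lines such as "- Name: John Doe" (one field per line).
    private static final Pattern FIELD =
            Pattern.compile("(?m)^-[ \\t]*([^:\\n]+):[ \\t]*(.+)$");

    /** Returns label/value pairs in the order they appear; lines that do not match are skipped. */
    static Map<String, String> parse(String responseText) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(responseText);
        while (m.find()) {
            fields.put(m.group(1).trim(), m.group(2).trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        String text = "Here are the extracted details from the bank statement:\n\n"
                + "- Name: John Doe\n- Email: johndoe@example.com\n"
                + "- Opening Balance: $5,000\n- Closing Balance: $6,500";
        parse(text).forEach((label, value) -> System.out.println(label + " = " + value));
    }
}
```

If stable field names matter, asking the model for JSON output in the prompt is likely more robust than pattern matching on prose.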

---

## 📬 Testing via Postman

### 📌 Steps

1. Open Postman → **New Request**
2. Choose **POST** → Enter URL:
   ```
   https://pdfreader-fped.onrender.com/api/parse-pdf
   ```
3. Go to the **Body** tab → Select `form-data`
4. Add a key named `file` → Upload the PDF file
5. Click **Send**

### ✅ Response
You'll receive a JSON response containing the Name, Email, and balances extracted using OpenAI.

🟢 Postman sets `Content-Type: multipart/form-data` (including the boundary) automatically when `form-data` is selected, so you do not need to add the header manually.

---

## 🐳 Deployment Notes (on Render)

- Dockerized Spring Boot app using:
```dockerfile
FROM openjdk:17
WORKDIR /app
COPY target/Api_Reader_New-0.0.1-SNAPSHOT.jar app.jar
ENTRYPOINT ["java", "-jar", "app.jar"]
```
- Pushed to GitHub repo: [https://github.com/mittalsoni00/FileReader](https://github.com/mittalsoni00/FileReader)
- Environment variable `OPENAI_API_KEY` added via Render dashboard
- Health Check path: `/api/parse-pdf`
